iSDM.species module

A module for all species layers functionality.

class iSDM.species.GBIFSpecies(**kwargs)

Bases: iSDM.species.Species

A class for encapsulating the GBIF (Global Biodiversity Information Facility) species layer functionality. Uses the pygbif python API for querying the GBIF backbone and acquiring observations data on species.

Variables:	data_full (pandas.DataFrame or geopandas.GeoDataFrame) – Data frame containing the full data for the species occurrences.

find_species_occurrences(name_species=None, **kwargs)

Finds species occurrence data and loads it into a pandas.DataFrame. The data comes from GBIF backbone API requests, based on the name of the species (name_species). If name_species is not provided, the method tries to query the GBIF backbone using the species object ID (if already set) as the taxonomic key.

The GBIF API provides services for searching occurrence records that have been indexed by GBIF. Search results are paginated, so retrieving all of them requires a separate request for each page. At the time of writing, each page is limited to a maximum of 300 records. The method loops until there are no more “next” pages (endOfRecords is reached) and combines all species occurrence (meta-)data into a single data structure. pygbif.occurrences.search(...) returns a list of JSON structures, which are loaded into a pandas.DataFrame for easier manipulation.

Parameters:	name_species (string) – The taxonomical name of the species to use for querying the GBIF backbone.
Returns:	Data frame containing all species occurrences (meta-)data.
Return type:	pandas.DataFrame
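The pagination loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: `fetch_page` and `make_fake_source` are hypothetical helpers standing in for a request to the GBIF occurrence search, and the response layout (a `results` list plus an `endOfRecords` flag) is assumed from the description.

```python
def fetch_all_occurrences(fetch_page, page_size=300):
    """Collect records from a paginated source until endOfRecords is reached.

    fetch_page(offset, limit) is a hypothetical callable returning a dict
    with a 'results' list and an 'endOfRecords' flag.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        records.extend(page["results"])
        if page["endOfRecords"]:
            break
        # Ask for the next page on the following iteration.
        offset += page_size
    return records


def make_fake_source(total):
    """Build an in-memory stand-in for the remote API (illustration only)."""
    def fetch_page(offset, limit):
        chunk = [{"key": i} for i in range(offset, min(offset + limit, total))]
        return {"results": chunk, "endOfRecords": offset + limit >= total}
    return fetch_page


all_records = fetch_all_occurrences(make_fake_source(650))
print(len(all_records))
```

With 650 fake records and a page size of 300, the loop issues three requests before the end-of-records flag stops it.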

geometrize(dropna=True, longitude_col_name='decimallongitude', latitude_col_name='decimallatitude', crs=None)

Converts the species data from a pandas.DataFrame to geopandas.GeoDataFrame format. A GeoDataFrame inherits from the basic DataFrame and provides additional functionality on top of pandas. The biggest difference in data layout is the addition of a ‘geometry’ column, which contains Shapely geometries. The decimallatitude and decimallongitude columns are converted into Shapely Point geometries, one Point for each latitude/longitude record.

Parameters:

dropna (bool) – Whether to drop records with NaN values in the decimallatitude or decimallongitude columns in the conversion process.
longitude_col_name (string) – The name of the column carrying the decimal longitude values. Default is ‘decimallongitude’.
latitude_col_name (string) – The name of the column carrying the decimal latitude values. Default is ‘decimallatitude’.
crs (string or dictionary) – The Coordinate Reference System of the data. Default is “EPSG:4326”.

Returns:

None
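As a rough, library-free illustration of the conversion step, the sketch below turns coordinate columns into (longitude, latitude) pairs. It is not the geopandas-based implementation: plain tuples stand in for Shapely Points, `to_points` is a hypothetical helper, and returning None for incomplete records when dropna is False is an assumption made only for this example.

```python
import math


def to_points(records, longitude_col_name="decimallongitude",
              latitude_col_name="decimallatitude", dropna=True):
    """Turn coordinate columns into (longitude, latitude) pairs.

    With dropna=True, records with a missing or NaN coordinate are skipped;
    otherwise they yield None so the output stays aligned with the input
    (an assumption for this sketch).
    """
    points = []
    for record in records:
        lon = record.get(longitude_col_name)
        lat = record.get(latitude_col_name)
        if lon is None or lat is None or math.isnan(lon) or math.isnan(lat):
            if not dropna:
                points.append(None)
            continue
        points.append((lon, lat))
    return points


records = [
    {"decimallongitude": 4.3, "decimallatitude": 52.1},
    {"decimallongitude": float("nan"), "decimallatitude": 40.7},
    {"decimallongitude": -74.0, "decimallatitude": 40.7},
]
points = to_points(records)
```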

load_csv(file_path)

Load data from a CSV file into a pandas.DataFrame. The records are expected to contain (meta-)data on individual species occurrences. Examples of expected columns: decimallatitude, decimallongitude, specieskey etc. If the file contains data on one particular species (all values in the specieskey column are equal), the ID of the GBIFSpecies object is updated to the specieskey value. The data for the GBIFSpecies object is also updated to contain the CSV file contents, so be careful not to overwrite existing data. All column names are converted to lower-case, for consistency.

Parameters:	file_path (string) – The full path to the file (including the directory and filename in one string).
Returns:	Data frame loaded with data from the CSV file.
Return type:	pandas.DataFrame
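The column handling described above (lower-casing names, and adopting the specieskey as ID when it is unique) might look roughly like this standard-library sketch. `read_occurrences` is a hypothetical helper, the key 12345 is a made-up value, and the real method works on a pandas.DataFrame rather than a list of dicts.

```python
import csv
import io


def read_occurrences(csv_text):
    """Parse CSV text into rows with lower-cased column names.

    Returns (rows, species_id); species_id is set only when every row
    shares the same 'specieskey' value, otherwise it is None.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{name.lower(): value for name, value in row.items()} for row in reader]
    keys = {row["specieskey"] for row in rows if "specieskey" in row}
    species_id = int(keys.pop()) if len(keys) == 1 else None
    return rows, species_id


# Made-up sample data with mixed-case headers.
sample = (
    "decimalLatitude,decimalLongitude,speciesKey\n"
    "52.1,4.3,12345\n"
    "51.9,5.7,12345\n"
)
rows, species_id = read_occurrences(sample)
```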

overlay(species_range_map)

Overlays the point records with a species range map. The map can be an instance of IUCNSpecies, or directly a GeoSeries data structure containing Shapely geometries. This overlaying effectively crops the point records to the area within the range map, i.e., drops those points that fall outside the union of the range polygon(s). If the data is not already in a geopandas format, the geometrize() method is called first. The geometries are first “prepared” (Prepared Geometries) for faster operations, such as checking whether a polygon contains a point. Careful: the species data is updated to contain only the occurrences that fall within the range map. All other records are lost.

Parameters:	species_range_map (geopandas.GeoSeries or IUCNSpecies) – The species range-map geometry to crop point-record occurrences to.
Returns:	None
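To illustrate the cropping idea without any geometry library, the sketch below filters points with a ray-casting point-in-polygon test. This is a swapped-in technique: the method itself relies on Shapely prepared geometries, and `point_in_polygon` and `crop_to_range` are hypothetical names used only here.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def crop_to_range(points, range_polygons):
    """Keep only the points that fall inside at least one range polygon."""
    return [p for p in points
            if any(point_in_polygon(p, poly) for poly in range_polygons)]


range_map = [[(0, 0), (10, 0), (10, 10), (0, 10)]]
occurrences = [(5, 5), (12, 3), (1, 9), (-2, 4)]
kept = crop_to_range(occurrences, range_map)
```

Points exactly on a polygon boundary are not treated consistently by this simple test, which is one reason a geometry library is preferable in practice.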

polygonize(buffer_distance=1, buffer_resolution=16, simplify_tolerance=0.1, preserve_topology=False, with_envelope=False)

Helper method: expands each Shapely Point of the geopandas.GeoDataFrame species data into its “polygon of influence” (buffer). If the data is not already in a geopandas format, the geometrize() method is called first. Further merges the polygons that overlap into a cascaded union (multipolygon). The polygon is further simplified, also (optionally) by using an envelope around the buffer. An envelope is the smallest rectangular polygon (with sides parallel to the coordinate axes) that contains the buffer geometry. The original species data is un-altered.

Parameters:

buffer_distance (int) – Unitless distance from the Point geometry, specifying the amount of “influence”. Default is 1.
buffer_resolution (int) – The resolution of the buffer around each Point. It is used for approximation of a unit radius circle. For example, 16-gon approximation, 3 - triangle approximation etc. The higher the resolution, the closer the approximation of the buffer to a circle shape around the point. Default is 16.
simplify_tolerance (float) – All points in the simplified geometry will be within the tolerance distance of the original geometry. Default is 0.1.
preserve_topology (bool) – If set to False, the much quicker Douglas-Peucker algorithm is used in the simplification process. Note that invalid geometric objects may result from simplification that does not preserve topology. Default is False.
with_envelope (bool) – Whether to use an envelope in the simplification of the geometry. Default is False.

Returns:	Data frame containing all polygons of the simplified geometries.
Return type:	geopandas.GeoDataFrame

rasterize(raster_file=None, pixel_size=None, all_touched=False, no_data_value=0, default_value=1, crs=None, cropped=False, *args, **kwargs)

Rasterize (burn) the species point-record occurrences into pixels (cells), i.e., a 2-dimensional image array of type numpy ndarray. Uses the Rasterio library for this purpose. If the dataframe does not contain a geometry column (i.e., it is not a GeoDataFrame), geometrize() is called to convert the decimallatitude/decimallongitude columns into a geometry column containing Point geometrical shapes. All the point-records are burned into a single “band” of the image (point to grid). Rasterio datasets can generally have one or more bands, or layers. Following the GDAL convention, these are indexed starting with 1.

Parameters:

raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
pixel_size (int) – The size of the pixel in degrees, i.e., the resolution to use for rasterizing.
all_touched (bool) – If True, all pixels touched by geometries will be burned in. If False, only pixels whose center is within the polygon, or that are selected by Bresenham’s line algorithm, will be burned in. Default is False.
no_data_value (int) – Used as value of the pixels which are not burned in. Default is 0.
default_value (int) – Used as value of the pixels which are burned in. Default is 1.
crs (dict) – The Coordinate Reference System to use. Default is “EPSG:4326”.
cropped (bool) – If True, the resulting pixel array (image) is cropped to the region borders which contain the burned pixels (i.e., an envelope within the range). Otherwise, a “global world map” is used, i.e., the boundaries are set to (-180, -90, 180, 90) for the resulting array. Default is False.

Returns:	Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type:	rasterio._io.RasterReader
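The point-to-grid burning step can be illustrated with a small pure-Python sketch. It is not the Rasterio-based implementation: `burn_points` is a hypothetical helper, the grid is a plain list of rows, and the global bounds (-180, -90, 180, 90) correspond to the uncropped case described above.

```python
def burn_points(points, pixel_size, bounds=(-180.0, -90.0, 180.0, 90.0),
                no_data_value=0, default_value=1):
    """Burn (lon, lat) points into a 2-D grid stored as a list of rows.

    Row 0 is the northern edge and column 0 the western edge.
    """
    west, south, east, north = bounds
    n_cols = int(round((east - west) / pixel_size))
    n_rows = int(round((north - south) / pixel_size))
    grid = [[no_data_value] * n_cols for _ in range(n_rows)]
    for lon, lat in points:
        col = int((lon - west) // pixel_size)
        row = int((north - lat) // pixel_size)
        # Points on or beyond the east/south edge fall outside the grid
        # and are skipped in this sketch.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = default_value
    return grid


# Two made-up occurrences burned into a 10-degree global grid (18 x 36 cells).
grid = burn_points([(4.3, 52.1), (-74.0, 40.7)], pixel_size=10)
```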

class iSDM.species.IUCNSpecies(**kwargs)

Bases: iSDM.species.Species

A class for encapsulating the IUCN Red List of threatened species layer functionality. The data for this layer is expected to be shapefiles (ESRI native format) and contains the known expert ranges of species. Ranges are depicted as polygons. One shapefile can contain distribution maps of an entire species group, i.e., all geometries, or alternatively, contain individual species ranges. The shapefiles can be downloaded from the website, as currently there is no IUCN API to directly query the IUCN backend database for particular taxonomical species. The data is always loaded in geopandas.GeoDataFrame format, suitable for geometries and operations on them.

Variables:

shape_file (string) – Location of the shapefile from which the data is loaded.
raster_file (string) – Location of the raster file to which the corresponding rasterized data is stored.
raster_affine (rasterio.transform.Affine) – Affine transformation used in the species raster map.
raster_reader (rasterio._io.RasterReader) – file reader for the corresponding rasterized data.

drop_extinct_species(presence_column_name='presence', discard_bad=False)

According to the current IUCN Coded Domain Values for Presence:

Code	Presence
1	Extant
2	Probably Extant (discontinued)
3	Possibly Extant
4	Possibly Extinct
5	Extinct (post 1500)
6	Presence Uncertain

Species can have both areas (polygons) in which they are extinct (5) and areas in which they are not. Such species are kept; only species that are extinct in all of their areas are filtered out.

Parameters:

presence_column_name (string) – The column name which contains the presence code values. Default is ‘presence’.
discard_bad (bool) – Whether to keep or discard species with “unknown only” areas (code==0). By default they are kept (discard_bad=False). There are currently (July 2016) four such problematic species: Acipenser baerii, Ambassis urotaenia, Microphysogobio tungtingensis, Rhodeus sericeus.

Returns:

None
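The filtering rule above (drop a species only when all of its areas are extinct) can be sketched with plain dictionaries. The species column name `binomial`, the helper name `drop_fully_extinct` and the sample species labels are assumptions for illustration; the real method operates on the loaded geopandas.GeoDataFrame.

```python
from collections import defaultdict

EXTINCT = 5  # "Extinct (post 1500)" in the table above
UNKNOWN = 0  # the "unknown only" code mentioned for discard_bad


def drop_fully_extinct(records, presence_column_name="presence",
                       species_column_name="binomial", discard_bad=False):
    """Remove species whose areas are all extinct (and, optionally, all unknown)."""
    codes_by_species = defaultdict(set)
    for record in records:
        codes_by_species[record[species_column_name]].add(record[presence_column_name])
    unwanted = [{EXTINCT}]
    if discard_bad:
        unwanted.append({UNKNOWN})
    # A species is dropped only if its full set of codes matches an unwanted set.
    dropped = {name for name, codes in codes_by_species.items() if codes in unwanted}
    return [r for r in records if r[species_column_name] not in dropped]


records = [
    {"binomial": "Species a", "presence": 1},
    {"binomial": "Species a", "presence": 5},
    {"binomial": "Species b", "presence": 5},
    {"binomial": "Species b", "presence": 5},
    {"binomial": "Species c", "presence": 0},
]
kept = drop_fully_extinct(records)
```

Here “Species a” survives because it still has an extant area, while “Species b” is removed; “Species c” is removed only when discard_bad=True.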

find_species_occurrences(name_species=None, **kwargs)

Filters the (previously loaded) geopandas.GeoDataFrame data to contain only records for a particular species (binomial). Careful, other records will be lost from the IUCNSpecies object upon calling this method.

Parameters:	name_species (string) – The binomial name of the species to use for filtering the records.
Returns:	A data frame containing only the records of the requested species.
Return type:	geopandas.GeoDataFrame

load_shapefile(file_path)

Loads the data from the provided file_path shapefile into a geopandas.GeoDataFrame. A GeoDataFrame is a tabular data structure that contains a column called geometry, which contains a geopandas.GeoSeries of Shapely geometries. All other meta-data column names are converted to lower-case, for consistency.

Parameters:	file_path (string) – The full path to the shapefile file (including the directory and filename in one string).
Returns:	None

random_pseudo_absence_points(buffer_distance=2, buffer_resolution=16, simplify_tolerance=1, preserve_topology=True, fast=False, count=100)

Draws random pseudo-absence points from within a buffer around the geometry. First, it simplifies the geometry with a buffer around the original geometry. Then it calculates the difference between the buffered geometry and the original geometry, to determine a geometry from which to sample random points. Finally, it generates random points one by one and tests whether they fall in that difference geometry, until count points have been generated. If the “buffered” geometry is invalid (which could happen), it gradually tries to simplify it by applying a bigger value for the simplify_tolerance parameter, until the geometry becomes valid. The reason is that an operation like difference/intersection is problematic to apply on an invalid geometry. The value is increased by a maximum of 100.

A more efficient approach would be to just generate a count number of points from the first step, i.e., from the buffer. Some points will fall within the original shape, and they can be discarded, so the number of pseudo-absence points will not actually be equal to count. If precision is not an issue, we could provide a count number that is larger but calculated according to the original_area/buffered_convex_hull ratio.

Update: Maybe not even necessary, given that shapely’s prep(..) operation gives a speed-up by a factor of 100 to 1000!
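The one-by-one sampling procedure described for this method amounts to rejection sampling, which can be sketched as below. The shapes are toy assumptions: a disc of radius 1 plays the original range and a disc of radius 3 plays its buffer at distance 2, expressed as simple predicate functions rather than Shapely geometries. `sample_pseudo_absences` is a hypothetical helper.

```python
import random


def sample_pseudo_absences(in_original, in_buffered, bounds, count, seed=None):
    """Rejection-sample `count` points inside the buffer but outside the original shape."""
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bounds
    points = []
    while len(points) < count:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        # Accept only points in the difference (buffer minus original).
        if in_buffered(x, y) and not in_original(x, y):
            points.append((x, y))
    return points


def in_original(x, y):
    """Toy range: a disc of radius 1 centred on the origin."""
    return x * x + y * y <= 1.0


def in_buffered(x, y):
    """Toy buffer at distance 2: a disc of radius 3 centred on the origin."""
    return x * x + y * y <= 9.0


absences = sample_pseudo_absences(in_original, in_buffered,
                                  bounds=(-3.0, -3.0, 3.0, 3.0),
                                  count=100, seed=42)
```

Sampling from the bounding box of the buffer keeps the acceptance rate reasonable; the loop always returns exactly count points, at the cost of an unpredictable number of draws.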

rasterize(raster_file=None, pixel_size=None, all_touched=False, no_data_value=0, default_value=1, crs=None, cropped=False, *args, **kwargs)

Rasterize (burn) the species rangemaps (geometrical shapes) into pixels (cells), i.e., a 2-dimensional image array of type numpy ndarray. Uses the Rasterio library for this purpose. All the shapes from the IUCNSpecies object data are burned in a single “band” of the image. Rasterio datasets can generally have one or more bands, or layers. Following the GDAL convention, these are indexed starting with 1.

Parameters:

raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
pixel_size (int) – The size of the pixel in degrees, i.e., the resolution to use for rasterizing.
all_touched (bool) – If True, all pixels touched by geometries will be burned in. If False, only pixels whose center is within the polygon, or that are selected by Bresenham’s line algorithm, will be burned in. Default is False.
no_data_value (int) – Used as value of the pixels which are not burned in. Default is 0.
default_value (int) – Used as value of the pixels which are burned in. Default is 1.
crs (dict) – The Coordinate Reference System to use. Default is “EPSG:4326”.
cropped (bool) – If True, the resulting pixel array (image) is cropped to the region borders which contain the burned pixels (i.e., an envelope within the range). Otherwise, a “global world map” is used, i.e., the boundaries are set to (-180, -90, 180, 90) for the resulting array. Default is False.

Returns:	Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type:	rasterio._io.RasterReader

save_shapefile(full_name=None, driver='ESRI Shapefile', overwrite=False)

Saves the current geopandas.GeoDataFrame data in a shapefile. The data is expected to have a geometry as a column, besides other (meta-)data. If the full location and name of the file is not provided, then the overwrite should be set to True, to overwrite the existing shapefile from which the data was previously loaded.

Parameters:

full_name (string) – The full path to the target shapefile (including the directory and filename in one string).
driver (string) – The driver to use for storing the geopandas.GeoDataFrame data into a file. Default is “ESRI Shapefile”.
overwrite (bool) – Whether to overwrite the shapefile from which the data was previously loaded, if a new full_name is not supplied.

Returns:

None

class iSDM.species.ObservationsType

Bases: enum.Enum

Possible observation types for the global species data.

class iSDM.species.Source

Bases: enum.Enum

Possible sources of global species data.

class iSDM.species.Species(**kwargs)

Bases: object

A generic Species class used for subclassing different global-scale species data sources.

Variables:

ID (integer) – a unique ID for a particular species. For example, for GBIF sources, it is the `gbifid` metadata field.
name_species (string) – initial value: ‘Unknown’.

get_data()

Returns the (pre)loaded species data in a (geo)pandas DataFrame.

Returns:	`self.data_full`
Return type:	geopandas.GeoDataFrame or pandas.DataFrame

load_data(file_path=None, method='pickle')

Loads the data from the serialized species file into a pandas DataFrame. If the file_path parameter is not supplied, it will try to deduce the file name from the name of the species by default.

Parameters:

file_path (string) – The full path to the file (including the directory and filename in one string) to which the data was serialized.
method (string) – The type of serialization that was used to serialize the data in the data frame. Default is “pickle”. Another possibility is “msgpack”, which has been shown to be about 10% more efficient in terms of time and memory for the type of data we are dealing with.

Returns:	Data loaded into a (geo)pandas DataFrame.
Return type:	geopandas.GeoDataFrame or pandas.DataFrame

load_raster_data(raster_file=None)

Loads the raster data from a previously-saved raster file. Provides information about the loaded data, and returns a rasterio file reader.

Parameters:	raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
Returns:	Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type:	rasterio._io.RasterReader

pixel_to_world_coordinates(raster_data=None, no_data_value=0, filter_no_data_value=True, band_number=1)

Map the pixel coordinates to world coordinates. The affine transformation matrix is used for this purpose. The convention is to reference the pixel corner. To reference the pixel center instead, we translate each pixel by 50%. The “no value” pixels (cells) can be filtered out.

A dataset’s pixel coordinate system has its origin at the “upper left” (imagine it displayed on your screen). Column index increases to the right, and row index increases downward. The mapping of these coordinates to “world” coordinates in the dataset’s reference system is done with an affine transformation matrix.

Parameters:

raster_data (numpy.ndarray) – The raster data (2-dimensional array) to translate to world coordinates. If not provided, the method tries to load existing rasterized data from the IUCNSpecies object.
no_data_value (int) – The pixel values depicting non-burned cells. Default is 0.
filter_no_data_value (bool) – Whether to filter out the no-data pixel values. Default is True. If set to False, all pixels in a 2-dimensional array will be converted to world coordinates. Typically this option is used to get a “base” map of the coordinates of all pixels in an image (map).
band_number (int) – The index of the band from which to load raster data. Default is 1.

Returns:	A tuple of numpy ndarrays. The first array contains the latitude values for each non-zero cell, the second array contains the longitude values for each non-zero cell.
Return type:	tuple(np.ndarray, np.ndarray)
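The pixel-to-world mapping with the half-pixel shift to the cell center can be written out explicitly. The sketch below uses a plain six-coefficient affine transform and nested lists instead of the rasterio Affine object and numpy arrays; `pixel_centers_to_world` is a hypothetical helper, and the 10-degree global grid is an assumed example.

```python
def pixel_centers_to_world(grid, transform, no_data_value=0,
                           filter_no_data_value=True):
    """Return (latitudes, longitudes) of pixel centres.

    transform is (a, b, c, d, e, f) such that
        x = a * col + b * row + c
        y = d * col + e * row + f
    maps pixel corners to world coordinates.
    """
    a, b, c, d, e, f = transform
    lats, lons = [], []
    for row, values in enumerate(grid):
        for col, value in enumerate(values):
            if filter_no_data_value and value == no_data_value:
                continue
            # Shift by half a pixel to reference the centre, not the corner.
            lons.append(a * (col + 0.5) + b * (row + 0.5) + c)
            lats.append(d * (col + 0.5) + e * (row + 0.5) + f)
    return lats, lons


# 10-degree global grid with its origin at the upper-left corner (-180, 90).
transform = (10.0, 0.0, -180.0, 0.0, -10.0, 90.0)
grid = [[0] * 36 for _ in range(18)]
grid[3][18] = 1  # a single burned cell
lats, lons = pixel_centers_to_world(grid, transform)
```

The negative e coefficient encodes that the row index increases downward while latitude decreases.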

plot_species_occurrence(figsize=(16, 12), projection='merc', facecolor='crimson')

Visually plots the species data on a Basemap. Basemap supports projections (with coastlines and political boundaries) using matplotlib. The species data must be in a geopandas DataFrame format. If it is not, the geometrize() method is called, to convert the dataframe into a geopandas.GeoDataFrame format (with a geometry column). The following geometrical (Shapely) shapes are supported: Polygon, MultiPolygon, and Point (plotted with a buffer around it).

Parameters:

figsize (tuple) – Tuple containing the (width, height) of the plot, in inches. Default is (16, 12).
projection (string) – The projection to use for plotting. Supported projection values from Basemap. Default is `merc` (Mercator).
facecolor (string) – Fill color for the geometries. Default is `crimson` (red).

Returns:	A map with geometries plotted, zoomed to the total boundaries of the geometry Series (column) of the DataFrame.

save_data(full_name=None, dir_name=None, file_name=None, method='pickle')

Serializes the loaded species dataset (pandas or geopandas DataFrame) into a binary pickle (or msgpack) file.

Parameters:

full_name (string) – The full path of the file (including the directory and filename in one string), where the data will be saved.
dir_name (string) – The directory where the file will be stored. If `dir_name` is not specified, the current working directory is taken by default.
file_name (string) – The name of the file where the data will be saved. If `file_name` is not specified, the default one, `name_species` + `.pkl` (or `.msg`), is used.
method (string) – The type of serialization to use for the data frame. Default is “pickle”. Another possibility is “msgpack”, which has been shown to be about 10% more efficient in terms of time and memory for the type of data we are dealing with.

Raises:	AttributeError: if the data has not been loaded in the object before. See `load_data()` and `find_species_occurrences()`
Returns:	None
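The file-name defaults and the pickle round trip can be sketched with the standard library alone. The helpers `default_file_path`, `save_records` and `load_records` are hypothetical and only mirror the defaults described above; the real method serializes a (geo)pandas DataFrame, not a list of dicts.

```python
import os
import pickle
import tempfile


def default_file_path(name_species, dir_name=None, file_name=None, method="pickle"):
    """Build the target path, falling back to the documented defaults."""
    extension = ".pkl" if method == "pickle" else ".msg"
    dir_name = dir_name if dir_name is not None else os.getcwd()
    file_name = file_name if file_name is not None else name_species + extension
    return os.path.join(dir_name, file_name)


def save_records(records, full_name):
    """Serialize the records into a binary pickle file."""
    with open(full_name, "wb") as handle:
        pickle.dump(records, handle)


def load_records(full_name):
    """Load records back from a pickle file written by save_records."""
    with open(full_name, "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = default_file_path("Unknown", dir_name=tmp_dir)
    save_records([{"decimallatitude": 52.1}], path)
    restored = load_records(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.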

set_data(data_frame)

Set the species data to the contents of data_frame. The data passed must be in a pandas or geopandas DataFrame. Careful, it overwrites the existing data!

Parameters:	data_frame (pandas.DataFrame) – The new data.
Returns:	None