iSDM.species module
A module for all species layers functionality.
-
class iSDM.species.GBIFSpecies(**kwargs)
Bases: iSDM.species.Species
A class encapsulating the GBIF (Global Biodiversity Information Facility) species layer functionality. It uses the pygbif Python API to query the GBIF backbone and acquire observation data on species.
Variables: data_full (pandas.DataFrame or geopandas.GeoDataFrame) – Data frame containing the full data for the species occurrences.
find_species_occurrences(name_species=None, **kwargs)
Finds and loads species occurrence data into a pandas DataFrame. The data comes from GBIF backbone API requests, based on the name of the species (name_species). If the name_species parameter is not provided, the method attempts to query the GBIF backbone using the species object's ID (if already set) as a taxonomic key.
The GBIF API provides services for searching occurrence records that have been indexed by GBIF. The search results are paginated, so retrieving all of them requires issuing an individual request for each page. At the time of writing, each page is limited to a maximum of 300 records. The method loops until there are no more “next” pages (endOfRecords is reached) and combines all species occurrence (meta-)data into a single data structure. pygbif.occurrences.search(...) returns a list of JSON structures, which are loaded into a pandas.DataFrame for easier manipulation.
Parameters: name_species (string) – The taxonomic name of the species to use for querying the GBIF backbone.
Returns: Data frame containing all species occurrences (meta-)data.
Return type: pandas.DataFrame
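The pagination loop described above can be sketched roughly as follows. This is not the library's code: fetch_page is a hypothetical stand-in for a single paginated request (it fabricates placeholder records locally instead of calling GBIF), and the only details taken from the text are the 300-record page limit and the endOfRecords flag.

```python
PAGE_LIMIT = 300  # per-page maximum mentioned in the text above


def fetch_page(offset, limit, total=650):
    """Hypothetical stand-in for one paginated occurrence-search request."""
    end = min(offset + limit, total)
    results = [{"key": i} for i in range(offset, end)]  # placeholder records
    return {"results": results, "endOfRecords": end >= total}


def fetch_all_occurrences(fetch=fetch_page, limit=PAGE_LIMIT):
    """Request page after page until the service reports endOfRecords."""
    records = []
    offset = 0
    while True:
        page = fetch(offset=offset, limit=limit)
        records.extend(page["results"])
        # Stop on the last page; also stop on an empty page to avoid looping forever.
        if page["endOfRecords"] or not page["results"]:
            break
        offset += limit
    return records


all_records = fetch_all_occurrences()
print(len(all_records))
```

With 650 simulated records, the loop issues three requests (300, 300, and 50 records) before stopping.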
-
geometrize(dropna=True, longitude_col_name='decimallongitude', latitude_col_name='decimallatitude', crs=None)
Converts the species data from a pandas.DataFrame to geopandas.GeoDataFrame format. A GeoDataFrame inherits from the basic DataFrame and provides additional functionality on top of pandas. The biggest difference in terms of data layout is the addition of a ‘geometry’ column, which contains Shapely geometries. The decimallatitude and decimallongitude columns are converted into Shapely Point geometries, one Point for each latitude/longitude record.
Parameters:
- dropna (bool) – Whether to drop records with NaN values in the decimallatitude or decimallongitude columns during the conversion.
- longitude_col_name (string) – The name of the column carrying the decimal longitude values. Default is ‘decimallongitude’.
- latitude_col_name (string) – The name of the column carrying the decimal latitude values. Default is ‘decimallatitude’.
- crs (string or dictionary) – The Coordinate Reference System of the data. Default is “EPSG:4326”.
Returns: None
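As a rough illustration of the conversion, the sketch below uses plain dictionaries and (x, y) tuples in place of a DataFrame and Shapely Points; geometrize_records and _is_missing are hypothetical helper names, not part of iSDM. Note the coordinate order: x is the longitude and y is the latitude.

```python
import math


def _is_missing(value):
    """True for None or NaN, the cases that dropna is meant to remove."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def geometrize_records(records, lon_col="decimallongitude",
                       lat_col="decimallatitude", dropna=True):
    """Return copies of the records with an added 'geometry' entry (x=lon, y=lat)."""
    out = []
    for rec in records:
        lon, lat = rec.get(lon_col), rec.get(lat_col)
        if _is_missing(lon) or _is_missing(lat):
            if dropna:
                continue  # drop records without usable coordinates
            geometry = None
        else:
            geometry = (float(lon), float(lat))
        out.append({**rec, "geometry": geometry})
    return out


records = [
    {"decimallongitude": 8.1, "decimallatitude": 46.5},
    {"decimallongitude": float("nan"), "decimallatitude": 47.0},
    {"decimallongitude": 8.3, "decimallatitude": None},
]
geo = geometrize_records(records)
print(geo)
```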
-
load_csv(file_path)
Loads data from a CSV file into a pandas.DataFrame. The records are expected to contain (meta-)data on individual species occurrences. Examples of expected columns: decimallatitude, decimallongitude, specieskey, etc. If the file contains data on one particular species (all values in the specieskey column are equal), the ID of the GBIFSpecies object is updated to the specieskey value. The data of the GBIFSpecies object is also replaced with the CSV file contents, so be careful not to overwrite existing data. All column names are converted to lower case, for consistency.
Parameters: file_path (string) – The full path to the file (including the directory and filename in one string).
Returns: Data frame loaded with data from the CSV file.
Return type: pandas.DataFrame
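The two conventions described above (lower-cased column names, and adopting the specieskey as ID when it is unique) might look like this in a standard-library sketch. load_occurrences is a hypothetical helper that parses CSV text rather than a file path, and the sample values are made up for illustration.

```python
import csv
import io


def load_occurrences(csv_text):
    """Parse CSV text, lower-case the column names, and detect a single species key."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{name.lower(): value for name, value in row.items()} for row in reader]
    keys = {row["specieskey"] for row in rows if row.get("specieskey")}
    # Only adopt the key as an ID when every record refers to the same species.
    species_id = int(keys.pop()) if len(keys) == 1 else None
    return rows, species_id


# Made-up sample content, for illustration only.
sample = "decimalLatitude,decimalLongitude,speciesKey\n46.5,8.1,2345\n47.0,8.3,2345\n"
rows, species_id = load_occurrences(sample)
print(species_id, sorted(rows[0]))
```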
-
overlay(species_range_map)
Overlays the point records with a species range map. The map can be an instance of IUCNSpecies, or directly a GeoSeries data structure containing Shapely geometries. The overlay effectively crops the point records to the area within the range map, i.e., it drops the points that fall outside the union of the range polygon(s). If the data is not already in geopandas format, the geometrize() method is called first. The geometries are first “prepared” (Prepared Geometries) to speed up operations such as checking whether a polygon contains a point. Careful: the species data is updated to contain only the occurrences that fall within the range map; all other records are lost.
Parameters: species_range_map (geopandas.GeoSeries or IUCNSpecies) – The species range-map geometry to crop point-record occurrences to.
Returns: None
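To make the cropping step concrete, here is a minimal pure-Python sketch. It swaps in a simple ray-casting containment test for Shapely's prepared geometries, so it illustrates the idea rather than the method's actual implementation; point_in_polygon and crop_to_range are hypothetical names, and polygons are plain lists of (x, y) vertices without holes.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting containment test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def crop_to_range(points, polygons):
    """Keep only the points that fall inside at least one range polygon."""
    return [p for p in points
            if any(point_in_polygon(p[0], p[1], poly) for poly in polygons)]


range_polygons = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],   # a square
    [(20, 20), (30, 20), (25, 30)],         # a triangle
]
occurrences = [(5, 5), (15, 5), (25, 22), (-1, -1)]
kept = crop_to_range(occurrences, range_polygons)
print(kept)  # [(5, 5), (25, 22)]
```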
-
polygonize(buffer_distance=1, buffer_resolution=16, simplify_tolerance=0.1, preserve_topology=False, with_envelope=False)
Helper method: expands each Shapely Point of the geopandas.GeoDataFrame species data into its “polygon of influence” (buffer). If the data is not already in geopandas format, the geometrize() method is called first. Overlapping polygons are then merged into a cascaded union (multipolygon). The resulting polygon is further simplified, optionally by using an envelope around the buffer. An envelope is the smallest rectangular polygon (with sides parallel to the coordinate axes) that contains the buffer geometry. The original species data is left unaltered.
Parameters:
- buffer_distance (int) – Unitless distance from the Point geometry, specifying the amount of “influence”. Default is 1.
- buffer_resolution (int) – The resolution of the buffer around each Point. It is used for approximating a unit-radius circle: for example, 16 gives a 16-gon approximation and 3 a triangle approximation. The higher the resolution, the closer the buffer is to a circle around the point. Default is 16.
- simplify_tolerance (float) – All points in the simplified geometry will be within the tolerance distance of the original geometry. Default is 0.1.
- preserve_topology (bool) – If set to False, the much quicker Douglas-Peucker algorithm is used in the simplification process. Note that invalid geometric objects may result from simplification that does not preserve topology. Default is False.
- with_envelope (bool) – Whether to use an envelope in the simplification of the geometry. Default is False.
Returns: Data frame containing all polygons of the simplified geometries.
Return type: geopandas.GeoDataFrame
-
rasterize(raster_file=None, pixel_size=None, all_touched=False, no_data_value=0, default_value=1, crs=None, cropped=False, *args, **kwargs)
Rasterizes (burns) the species point-record occurrences into pixels (cells), i.e., a 2-dimensional image array of type numpy ndarray (point to grid). Uses the Rasterio library for this purpose. If the data frame does not contain a geometry column (i.e., it is not a GeoDataFrame), geometrize() is called to convert the decimallatitude/decimallongitude columns into a geometry column containing Point geometries. All the point records are burned into a single “band” of the image. Rasterio datasets can generally have one or more bands, or layers. Following the GDAL convention, these are indexed starting with 1.
Parameters:
- raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
- pixel_size (int) – The size of the pixel in degrees, i.e., the resolution to use for rasterizing.
- all_touched (bool) – If true, all pixels touched by geometries will be burned in. If false, only pixels whose center is within the polygon, or that are selected by Bresenham’s line algorithm, will be burned in.
- no_data_value (int) – Value of the pixels that are not burned in. Default is 0.
- default_value (int) – Value of the pixels that are burned in. Default is 1.
- crs (dict) – The Coordinate Reference System to use. Default is “EPSG:4326”.
- cropped (bool) – If true, the resulting pixel array (image) is cropped to the region borders that contain the burned pixels (i.e., an envelope within the range). Otherwise, a “global world map” is used, i.e., the boundaries are set to (-180, -90, 180, 90) for the resulting array.
Returns: Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type: rasterio._io.RasterReader
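The point-to-grid idea can be illustrated without Rasterio. The sketch below is a simplified stand-in, not the method's implementation: burn_points is a hypothetical function that burns (longitude, latitude) pairs into a nested list using the global bounds (-180, -90, 180, 90), with row 0 at the northern edge, as is conventional for raster images.

```python
def burn_points(points, pixel_size, bounds=(-180.0, -90.0, 180.0, 90.0),
                no_data_value=0, default_value=1):
    """Burn (longitude, latitude) points into a grid; row 0 is the northern edge."""
    west, south, east, north = bounds
    n_cols = int(round((east - west) / pixel_size))
    n_rows = int(round((north - south) / pixel_size))
    grid = [[no_data_value] * n_cols for _ in range(n_rows)]
    for lon, lat in points:
        if not (west <= lon <= east and south <= lat <= north):
            continue  # ignore points outside the raster extent
        col = int((lon - west) // pixel_size)
        row = int((north - lat) // pixel_size)
        # Points exactly on the east or south boundary would index one cell too far.
        col = min(col, n_cols - 1)
        row = min(row, n_rows - 1)
        grid[row][col] = default_value
    return grid


grid = burn_points([(-179.5, 89.5), (0.2, 0.2), (179.9, -89.9)], pixel_size=1)
print(len(grid), len(grid[0]), sum(map(sum, grid)))  # 180 360 3
```

With a pixel size of one degree, the global extent yields a 180 x 360 grid, and each of the three sample points burns exactly one cell.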
-
-
class iSDM.species.IUCNSpecies(**kwargs)
Bases: iSDM.species.Species
A class encapsulating the IUCN Red List of Threatened Species layer functionality. The data for this layer is expected to be in shapefiles (ESRI native format) containing the known expert ranges of species. Ranges are depicted as polygons. One shapefile can contain the distribution maps of an entire species group, i.e., all geometries, or alternatively, individual species ranges. The shapefiles can be downloaded from the website, as there is currently no IUCN API to directly query the IUCN backend database for particular taxonomic species. The data is always loaded in geopandas.GeoDataFrame format, suitable for geometries and operations on them.
Variables:
- shape_file (string) – Location of the shapefile from which the data is loaded.
- raster_file (string) – Location of the raster file in which the corresponding rasterized data is stored.
- raster_affine (rasterio.transform.Affine) – Affine transformation used in the species raster map.
- raster_reader (rasterio._io.RasterReader) – File reader for the corresponding rasterized data.
-
drop_extinct_species(presence_column_name='presence', discard_bad=False)
According to the current IUCN Coded Domain Values for Presence:

Code  Presence
1     Extant
2     Probably Extant (discontinued)
3     Possibly Extant
4     Possibly Extinct
5     Extinct (post 1500)
6     Presence Uncertain

Species can have both areas (polygons) in which they are extinct (5) and areas in which they are not. Such species are kept; only species for which all areas are extinct are filtered out.
Parameters:
- presence_column_name (string) – The name of the column that contains the presence code values. Default is ‘presence’.
- discard_bad (bool) – Whether to discard species with “unknown only” areas (code==0). By default they are kept (discard_bad=False). There are currently (July 2016) four such problematic species: Acipenser baerii, Ambassis urotaenia, Microphysogobio tungtingensis, Rhodeus sericeus.
Returns: None
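The filtering rule above (drop a species only when all of its areas are extinct) can be sketched on plain dictionaries. The function name drop_extinct and the 'binomial' column name are assumptions made for this illustration; the real method operates on the loaded GeoDataFrame.

```python
from collections import defaultdict

EXTINCT = 5   # "Extinct (post 1500)" in the table above
UNKNOWN = 0   # the "unknown only" code mentioned for discard_bad


def drop_extinct(records, presence_column_name="presence", discard_bad=False):
    """Drop species whose areas are all extinct (and, optionally, all unknown)."""
    codes = defaultdict(set)
    for rec in records:
        codes[rec["binomial"]].add(rec[presence_column_name])

    def keep(name):
        species_codes = codes[name]
        if species_codes == {EXTINCT}:
            return False
        if discard_bad and species_codes == {UNKNOWN}:
            return False
        return True

    return [rec for rec in records if keep(rec["binomial"])]


records = [
    {"binomial": "A b", "presence": 5},
    {"binomial": "A b", "presence": 1},   # partly extinct: kept
    {"binomial": "C d", "presence": 5},
    {"binomial": "C d", "presence": 5},   # extinct everywhere: dropped
    {"binomial": "E f", "presence": 0},   # unknown only: kept unless discard_bad
]
kept = drop_extinct(records)
kept_strict = drop_extinct(records, discard_bad=True)
print(len(kept), len(kept_strict))  # 3 2
```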
-
find_species_occurrences(name_species=None, **kwargs)
Filters the (previously loaded) geopandas.GeoDataFrame data to contain only records for a particular species (binomial). Careful: all other records will be lost from the IUCNSpecies object upon calling this method.
Parameters: name_species (string) – The binomial name of the species whose records are kept.
Returns: A data frame containing only the records of the selected species.
Return type: geopandas.GeoDataFrame
-
load_shapefile(file_path)
Loads the data from the provided file_path shapefile into a geopandas.GeoDataFrame. A GeoDataFrame is a tabular data structure that contains a column called geometry, which holds a geopandas.GeoSeries of Shapely geometries. All other meta-data column names are converted to lower case, for consistency.
Parameters: file_path (string) – The full path to the shapefile (including the directory and filename in one string).
Returns: None
-
random_pseudo_absence_points(buffer_distance=2, buffer_resolution=16, simplify_tolerance=1, preserve_topology=True, fast=False, count=100)
Draws random pseudo-absence points from within a buffer around the geometry. First, it simplifies the geometry and builds a buffer around the original geometry. Then it calculates the difference between the buffered and the original geometry, to determine a geometry from which to sample random points. Finally, it generates random points one by one and tests whether they fall in that difference geometry, until count points have been generated.
If the “buffered” geometry is invalid (which could happen), the method gradually tries to simplify it by applying a bigger value for the simplify_tolerance parameter, until the geometry becomes valid. The reason is that an operation like difference/intersection is problematic to apply on an invalid geometry. The value is increased by a maximum of 100.
A more efficient approach would be to just generate count points from the first step, i.e., from the buffer. Some points will fall within the original shape and can be discarded, so the number of pseudo-absence points will not actually be equal to count. If precision is not an issue, we could provide a count number that is larger, calculated according to the original_area/buffered_convex_hull ratio.
Update: this may not even be necessary, given that shapely’s prep(..) operation gives a speed-up by a factor of 100 to 1000!
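The one-by-one rejection sampling described above can be sketched with the standard library alone. To stay self-contained, this illustration makes strong simplifying assumptions: the original geometry is an axis-aligned rectangle, and the "buffer" is that rectangle grown by buffer_distance on every side (a true buffer would have rounded corners). pseudo_absence_points and in_rect are hypothetical names, and the seed parameter is added only to make the example repeatable.

```python
import random


def in_rect(x, y, rect):
    """rect is (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax


def pseudo_absence_points(original, buffer_distance=2, count=100, seed=0):
    """Sample points that lie in the buffered rectangle but not in the original one."""
    if buffer_distance <= 0:
        raise ValueError("buffer_distance must be positive, or no point can be accepted")
    xmin, ymin, xmax, ymax = original
    buffered = (xmin - buffer_distance, ymin - buffer_distance,
                xmax + buffer_distance, ymax + buffer_distance)
    rng = random.Random(seed)
    points = []
    while len(points) < count:
        x = rng.uniform(buffered[0], buffered[2])
        y = rng.uniform(buffered[1], buffered[3])
        # Reject candidates that fall inside the original range.
        if not in_rect(x, y, original):
            points.append((x, y))
    return points, buffered


original_range = (0.0, 0.0, 10.0, 10.0)
points, buffered = pseudo_absence_points(original_range, buffer_distance=2, count=100)
print(len(points))  # 100
```

The loop keeps drawing until exactly count points are accepted, which is the behaviour the first paragraph describes; the "more efficient approach" would instead draw a fixed number of candidates and accept whatever survives.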
-
rasterize(raster_file=None, pixel_size=None, all_touched=False, no_data_value=0, default_value=1, crs=None, cropped=False, *args, **kwargs)
Rasterizes (burns) the species range maps (geometrical shapes) into pixels (cells), i.e., a 2-dimensional image array of type numpy ndarray. Uses the Rasterio library for this purpose. All the shapes from the IUCNSpecies object data are burned into a single “band” of the image. Rasterio datasets can generally have one or more bands, or layers. Following the GDAL convention, these are indexed starting with 1.
Parameters:
- raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
- pixel_size (int) – The size of the pixel in degrees, i.e., the resolution to use for rasterizing.
- all_touched (bool) – If true, all pixels touched by geometries will be burned in. If false, only pixels whose center is within the polygon, or that are selected by Bresenham’s line algorithm, will be burned in.
- no_data_value (int) – Value of the pixels that are not burned in. Default is 0.
- default_value (int) – Value of the pixels that are burned in. Default is 1.
- crs (dict) – The Coordinate Reference System to use. Default is “EPSG:4326”.
- cropped (bool) – If true, the resulting pixel array (image) is cropped to the region borders that contain the burned pixels (i.e., an envelope within the range). Otherwise, a “global world map” is used, i.e., the boundaries are set to (-180, -90, 180, 90) for the resulting array.
Returns: Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type: rasterio._io.RasterReader
-
save_shapefile(full_name=None, driver='ESRI Shapefile', overwrite=False)
Saves the current geopandas.GeoDataFrame data in a shapefile. The data is expected to have a geometry column, besides other (meta-)data. If the full location and name of the file are not provided, overwrite should be set to True, so that the existing shapefile from which the data was previously loaded is overwritten.
Parameters:
- full_name (string) – The full path to the target shapefile (including the directory and filename in one string).
- driver (string) – The driver to use for storing the geopandas.GeoDataFrame data into a file. Default is “ESRI Shapefile”.
- overwrite (bool) – Whether to overwrite the shapefile from which the data was previously loaded, if a new full_name is not supplied.
Returns: None
-
class iSDM.species.ObservationsType
Bases: enum.Enum
Possible observation types for the global species data.
-
class iSDM.species.Source
Bases: enum.Enum
Possible sources of global species data.
-
class iSDM.species.Species(**kwargs)
Bases: object
A generic Species class used for subclassing different global-scale species data sources.
Variables:
- ID (integer) – A unique ID for a particular species. For example, for GBIF sources, it is the gbifid metadata field.
- name_species (string) – Initial value: ‘Unknown’.
-
get_data()
Returns the (pre)loaded species data in a (geo)pandas DataFrame.
Returns: self.data_full
Return type: geopandas.GeoDataFrame or pandas.DataFrame
-
load_data(file_path=None, method='pickle')
Loads the data from the serialized species file into a pandas DataFrame. If the file_path parameter is not supplied, the method tries to deduce the file name from the name of the species.
Parameters:
- file_path (string) – The full path to the file (including the directory and filename in one string) to which the data was serialized.
- method (string) – The type of serialization that was used to serialize the data in the data frame. Default is pickle. Another possibility is “msgpack”, which has been shown to be 10% more efficient in terms of time and memory for the type of data we are dealing with.
Returns: Data loaded into a (geo)pandas DataFrame.
Return type: geopandas.GeoDataFrame
-
load_raster_data(raster_file=None)
Loads the raster data from a previously saved raster file. Provides information about the loaded data, and returns a rasterio file reader.
Parameters: raster_file (string) – The full path to the target GeoTIFF raster file (including the directory and filename in one string).
Returns: Rasterio RasterReader file object which can be used to read individual bands from the raster file.
Return type: rasterio._io.RasterReader
-
pixel_to_world_coordinates(raster_data=None, no_data_value=0, filter_no_data_value=True, band_number=1)
Maps the pixel coordinates to world coordinates. The affine transformation matrix is used for this purpose. The convention is to reference the pixel corner. To reference the pixel center instead, each pixel is translated by 50%. The “no value” pixels (cells) can be filtered out.
A dataset’s pixel coordinate system has its origin at the “upper left” (imagine it displayed on your screen). The column index increases to the right, and the row index increases downward. The mapping of these coordinates to “world” coordinates in the dataset’s reference system is done with an affine transformation matrix.
Parameters:
- raster_data (string) – The raster data (2-dimensional array) to translate to world coordinates. If not provided, the method tries to load existing rasterized data from the IUCNSpecies object.
- no_data_value (int) – The pixel value depicting non-burned cells. Default is 0.
- filter_no_data_value (bool) – Whether to filter out the no-data pixel values. Default is True. If set to False, all pixels in the 2-dimensional array will be converted to world coordinates. Typically this option is used to get a “base” map of the coordinates of all pixels in an image (map).
- band_number (int) – The index of the band from which to load raster data.
Returns: A tuple of numpy ndarrays. The first array contains the latitude values for each non-zero cell; the second array contains the longitude values for each non-zero cell.
Return type: tuple(np.ndarray, np.ndarray)
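The mapping can be written out by hand to show the half-pixel shift. The sketch below is an illustration under stated assumptions, not the method's code: pixel_to_world is a hypothetical function, the grid is a nested list rather than a numpy array, and the transform is a plain 6-tuple (a, b, c, d, e, f) applied as x = a*col + b*row + c and y = d*col + e*row + f, the same coefficient order used by rasterio's Affine.

```python
def pixel_to_world(grid, transform, no_data_value=0, filter_no_data_value=True):
    """Return (latitudes, longitudes) of pixel centers for the selected cells."""
    a, b, c, d, e, f = transform
    lats, lons = [], []
    for row, values in enumerate(grid):
        for col, value in enumerate(values):
            if filter_no_data_value and value == no_data_value:
                continue
            # Shift by half a pixel to reference the center instead of the corner.
            center_col, center_row = col + 0.5, row + 0.5
            lons.append(a * center_col + b * center_row + c)
            lats.append(d * center_col + e * center_row + f)
    return lats, lons


# One-degree pixels, origin at the upper-left corner (-180, 90), rows increasing downward.
transform = (1.0, 0.0, -180.0, 0.0, -1.0, 90.0)
grid = [[0, 1],
        [1, 0]]
lats, lons = pixel_to_world(grid, transform)
print(lats, lons)  # [89.5, 88.5] [-178.5, -179.5]
```

The negative e coefficient encodes the fact that the row index grows downward while latitude grows upward.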
-
plot_species_occurrence(figsize=(16, 12), projection='merc', facecolor='crimson')
Plots the species data on a Basemap. Basemap supports projections (with coastlines and political boundaries) using matplotlib. The species data must be in geopandas.GeoDataFrame format. If it is not, the geometrize() method is called to convert the data frame into geopandas.GeoDataFrame format (with a geometry column). The following geometrical (Shapely) shapes are supported: Polygon, MultiPolygon, and Point (plotted with a buffer around it).
Parameters:
- figsize (tuple) – Tuple containing the (width, height) of the plot, in inches. Default is (16, 12).
- projection (string) – The projection to use for plotting; any projection value supported by Basemap. Default is merc (Mercator).
- facecolor (string) – Fill color for the geometries. Default is crimson (red).
Returns: A map with geometries plotted, zoomed to the total boundaries of the geometry Series (column) of the DataFrame.
-
save_data(full_name=None, dir_name=None, file_name=None, method='pickle')
Serializes the loaded species dataset (pandas or geopandas DataFrame) into a binary pickle (or msgpack) file.
Parameters:
- full_name (string) – The full path of the file (including the directory and filename in one string), where the data will be saved.
- dir_name (string) – The directory where the file will be stored. If file_name is not specified, the default name, name_species + .pkl (or .msg), is used.
- file_name (string) – The name of the file where the data will be saved. If dir_name is not specified, the current working directory is used.
- method (string) – The type of serialization to use for the data frame. Default is pickle. Another possibility is msgpack, which has been shown to be 10% more efficient in terms of time and memory for the type of data we are dealing with.
Raises: AttributeError – if the data has not been loaded into the object beforehand. See load_data() and find_species_occurrences().
Returns: None
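The path-resolution and pickle round trip described above can be sketched as follows. save_species_data and load_species_data are hypothetical stand-ins written for this illustration: they cover only the pickle method, and they serialize a plain list of dictionaries instead of a DataFrame.

```python
import os
import pickle
import tempfile


def save_species_data(data, name_species, full_name=None, dir_name=None, file_name=None):
    """Pickle the data, falling back to <cwd>/<name_species>.pkl when no path is given."""
    if data is None:
        raise AttributeError("No data has been loaded; nothing to save.")
    if full_name is None:
        dir_name = dir_name or os.getcwd()
        file_name = file_name or name_species + ".pkl"
        full_name = os.path.join(dir_name, file_name)
    with open(full_name, "wb") as handle:
        pickle.dump(data, handle)
    return full_name


def load_species_data(full_name):
    """Read back data previously written by save_species_data."""
    with open(full_name, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    records = [{"specieskey": 1, "decimallatitude": 1.5}]  # placeholder data
    path = save_species_data(records, "Acipenser baerii", dir_name=tmp_dir)
    restored = load_species_data(path)
    saved_name = os.path.basename(path)

print(saved_name)  # Acipenser baerii.pkl
```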
-
set_data(data_frame)
Sets the species data to the contents of data_frame. The data passed must be a pandas or geopandas DataFrame. Careful: it overwrites the existing data!
Parameters: data_frame (pandas.DataFrame) – The new data.
Returns: None