As information technology has developed, the amount of data in the world has grown rapidly. In addition, there has been a strong movement towards open data, which is the idea that some data should be freely available for everyone to use and republish. Together, the abundance of computing infrastructure and data makes it easier to build and run species distribution models. This article covers:
- The types of data you need to run a species distribution model,
- Where to get this data from, and
- The things to be aware of and some standard good practices when dealing with data.
To run a species distribution model, you need two types of data: species data, which are the coordinates of the locations where the species of interest occurs, and environmental data, which describe the environmental conditions at those locations.
Data about where species have been observed are collected and stored by many different providers. You can either go out in the field and collect this data yourself, or use data previously collected by others. Data collections can come from museum records, large research surveys, or citizen science initiatives such as annual bird counts. There is an increasing number of online resources where such data is collated, and where you can visualise or download an occurrence dataset for a particular species. In Australia, for example, there is the Atlas of Living Australia (ALA), which has 194 dataset collections with occurrence records for more than 100,000 species. At a global level, the Global Biodiversity Information Facility (GBIF) is a valuable resource with free and open access to occurrence records of more than 1.5 million species. If your project focuses on a particular taxon or group of species, there are a variety of other resources, such as: the Australian Data Discovery Portal of the Terrestrial Ecosystem Research Network (TERN); Track, which has data about Australian fish distributions; NatureServe Explorer, with information about >70,000 plants, animals, and ecosystems of the United States and Canada; and the Catalogue of Life.
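As an illustration of downloading occurrence records programmatically, the sketch below builds a query for the public GBIF occurrence search API and extracts coordinates from a decoded response. It is a minimal example rather than a description of how the BCCVL imports data: the function names are hypothetical, only one page of results is requested, and the field names (`decimalLongitude`, `decimalLatitude`, `year`) are those returned by GBIF's version 1 API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"


def build_gbif_query(scientific_name, limit=100, offset=0):
    """Return a GBIF occurrence-search URL for records that have coordinates."""
    params = {
        "scientificName": scientific_name,
        "hasCoordinate": "true",
        "limit": limit,
        "offset": offset,
    }
    return GBIF_SEARCH_URL + "?" + urlencode(params)


def extract_points(payload):
    """Pull (longitude, latitude, year) tuples out of a decoded search response."""
    points = []
    for record in payload.get("results", []):
        lon = record.get("decimalLongitude")
        lat = record.get("decimalLatitude")
        if lon is None or lat is None:
            continue  # skip records without usable coordinates
        points.append((lon, lat, record.get("year")))
    return points


def fetch_occurrences(scientific_name, limit=100):
    """Download one page of records (hypothetical helper; needs network access)."""
    with urlopen(build_gbif_query(scientific_name, limit=limit)) as response:
        return extract_points(json.load(response))
```

Calling `fetch_occurrences("Phascolarctos cinereus", limit=50)` would request up to 50 georeferenced records. A full download would need to page through results with `offset`, and the records should still be checked before use.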
In the BCCVL, you can import occurrence data directly from ALA and GBIF, and also species trait data from AEKOS. There is also a direct link from ALA to BCCVL, so rather than using all available occurrence records in ALA's collection, you can first clean and select the data you want, and push this cleaned dataset immediately to BCCVL for analysis.
Please be aware that the outcome of a model is only as good as the data you put into it, or as the saying goes, "garbage in, garbage out". We therefore recommend checking your species data before you build your model. A few things to check in your data:
- Check for outliers. These can be the result of typos in coordinates and might bias your model.
- Check for duplicate records. Some algorithms handle duplicates well and others may not, so check whether removal is needed.
- Check in which years the data was recorded and remove data from years that are not relevant to your model. For example, if you want to model a species distribution using climate data from particular years, you might only want to use the occurrence records from matching years.
- Be aware of alternative names for the species, as there might be different datasets for different names according to the local usage of a species’ name. It is recommended to use the Latin (scientific) name of the species, as this is recognised globally.
- Be aware that occurrence data can sometimes be biased towards the accessibility of sampling locations. Species are more likely to be observed near places where people go, and thus occurrence data can be lacking for remote areas. This can lead to a non-representative sample of the environmental conditions, although this is not necessarily the case.
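Some of the checks above can be automated. The following sketch filters records by year, drops exact duplicates, and flags coordinates that lie far from the bulk of the data so they can be reviewed by hand. It assumes records are `(longitude, latitude, year)` tuples; the function name and the `max_deviation` cut-off are hypothetical choices for illustration, and whether duplicates should be dropped at all depends on the algorithm you plan to use.

```python
from statistics import median


def _centre_and_spread(values):
    """Return the median and the median absolute deviation (MAD) of values."""
    centre = median(values)
    return centre, median(abs(v - centre) for v in values)


def clean_occurrences(records, year_range, max_deviation=10.0):
    """Filter (longitude, latitude, year) tuples; return (kept, flagged).

    year_range: (first_year, last_year), inclusive.
    max_deviation: hypothetical cut-off, the number of MADs a coordinate
    may lie from the median before the record is flagged as an outlier.
    """
    first_year, last_year = year_range

    # 1. Drop exact duplicates and records outside the years of interest.
    seen = set()
    candidates = []
    for record in records:
        year = record[2]
        if record in seen or year is None or not first_year <= year <= last_year:
            continue
        seen.add(record)
        candidates.append(record)
    if not candidates:
        return [], []

    # 2. Flag coordinates far from the rest for manual review (e.g. typos).
    lon_centre, lon_mad = _centre_and_spread([r[0] for r in candidates])
    lat_centre, lat_mad = _centre_and_spread([r[1] for r in candidates])

    def is_far(value, centre, mad):
        return mad > 0 and abs(value - centre) / mad > max_deviation

    kept, flagged = [], []
    for record in candidates:
        if is_far(record[0], lon_centre, lon_mad) or is_far(record[1], lat_centre, lat_mad):
            flagged.append(record)
        else:
            kept.append(record)
    return kept, flagged
```

Flagged records are returned rather than deleted, because a distant point can also be a genuine observation at the edge of a species' range.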
Some algorithms also need data about where a species does not occur: absence data. This can either be true absence data, if data is collected at locations where a species definitely does not occur, or pseudo-absence data, if we infer that a species cannot occur in particular locations because of unsuitable conditions. Because this is quite a complex topic, we have devoted a separate article to the different options for absence data.
Environmental data describes the conditions of the locations where a species is present or absent. The most common types of environmental variables used in species distribution modelling fall into four classes of physical conditions, called the primary environmental regimes: moisture, thermal, radiation, and mineral nutrients. The moisture regime is mostly described by measures of rainfall and evaporation, and the thermal regime by temperature measures. The radiation regime refers to solar radiation or sunlight, which is usually measured as photosynthetically active radiation (PAR). This is the spectral range of solar radiation that photosynthesising organisms, such as plants and algae, are able to use in the process of photosynthesis. This spectral region corresponds more or less to the range of light that is visible to the human eye. The mineral nutrients regime is described by factors such as soil type.
Other factors such as altitude can also affect the distribution of a species, but usually these only have an indirect effect on the species, as they are affecting environmental conditions within the primary regimes. For example, altitude has an effect on temperature, and therefore indirectly affects species distributions. For species distribution models it is better to use environmental variables that have a direct effect on survival, rather than indirect factors.
For species living in the ocean rather than on land, oceanic variables such as sea surface temperature and salinity can be used in species distribution models.
As with species data, there are many online resources that provide environmental data. For example, WorldClim provides a large collection of global climate layers of past, current and future climate. There is also a global soil database, as well as smaller-scale national and regional databases. When you are designing a species distribution model, it is important to first think about which environmental variables are likely to influence that species, and then search for relevant environmental datasets. In the BCCVL, we provide access to >4000 current and future climate layers and >300 layers of non-climate environmental variables. You can browse these datasets on our public Data Portal without logging in.
It is useful to understand how environmental data is generated. The data available in the BCCVL and other data collections is usually not the raw data that was collected. Raw data are measurements such as daily rainfall or temperature. In Australia, there are 10,000 stations around the country that record, each morning at 9 am, the amount of rainfall over the preceding 24 hours. There are also 1500 stations that measure temperature continuously and report, each morning at 9 am, the maximum and minimum temperature over the preceding 24 hours.
The raw data is not very useful for a species distribution model, as daily measurements are highly variable and species respond to environmental conditions over longer time scales rather than on a daily basis. Therefore, the raw data are processed to generate summary statistics: variables such as the mean annual temperature, or the minimum or maximum values for the warmest, coldest, wettest or driest month or season. Such minimum and maximum values are particularly useful in species distribution modelling, as the probability of a species occurring in a particular place is often influenced by a threshold of an environmental factor. For example, if a species cannot tolerate temperatures above or below a certain threshold, variables that represent minimum or maximum values describe well the environmental conditions under which the species is able to survive.
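To make the idea of summary statistics concrete, the sketch below reduces twelve monthly values per variable to a handful of annual predictors. It is a simplified illustration, not the procedure used to build any particular climate dataset: the function and key names are hypothetical, the inputs are assumed to be monthly aggregates already derived from daily station data, and the annual mean is an unweighted mean of the monthly means.

```python
from statistics import mean


def summarise_climate(monthly_tmax, monthly_tmin, monthly_rain):
    """Summarise 12 monthly values per variable into annual predictors.

    monthly_tmax, monthly_tmin: mean daily max / min temperature per month (°C)
    monthly_rain: total rainfall per month (mm)
    """
    if not (len(monthly_tmax) == len(monthly_tmin) == len(monthly_rain) == 12):
        raise ValueError("expected 12 monthly values for each variable")

    # Monthly mean temperature approximated as the midpoint of max and min.
    monthly_tmean = [(hi + lo) / 2 for hi, lo in zip(monthly_tmax, monthly_tmin)]
    return {
        "mean_annual_temperature": mean(monthly_tmean),
        "max_temperature_warmest_month": max(monthly_tmax),
        "min_temperature_coldest_month": min(monthly_tmin),
        "annual_rainfall": sum(monthly_rain),
        "rainfall_wettest_month": max(monthly_rain),
        "rainfall_driest_month": min(monthly_rain),
    }
```

The extreme-month variables are the ones that relate most directly to tolerance thresholds, for example comparing `min_temperature_coldest_month` with the lowest temperature a species can survive.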
Like species occurrence data, environmental data is only collected in particular locations where measuring stations are situated. To use this data in a model, it needs to be converted to create what is known as a ‘raster surface’ in which each cell has a value for a particular environmental factor, including the cells for which no measurements of environmental factors exist. Because we don’t know the exact value for each cell, we use a technique called spatial interpolation that predicts values for unknown cells from a limited number of sample data points around that cell. This interpolation is based on the assumption that cells that are close together tend to have similar characteristics.
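One of the simplest interpolation techniques built on this assumption is inverse distance weighting (IDW), in which nearby stations count more than distant ones. The sketch below is only an illustration of the principle: published climate surfaces are generally produced with more sophisticated methods, the function names are hypothetical, and coordinates are treated as planar (an assumption that is reasonable only for projected coordinates or small areas).

```python
from math import hypot


def idw_estimate(x, y, stations, power=2.0):
    """Estimate a value at (x, y) from a list of (x, y, value) station tuples."""
    if not stations:
        raise ValueError("at least one station is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the point coincides with a station
        weight = 1.0 / distance ** power  # closer stations get larger weights
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def interpolate_grid(stations, x_min, y_min, cell_size, n_cols, n_rows):
    """Fill an n_rows x n_cols raster by estimating a value at each cell centre."""
    grid = []
    for row in range(n_rows):
        y = y_min + (row + 0.5) * cell_size
        grid.append([
            idw_estimate(x_min + (col + 0.5) * cell_size, y, stations)
            for col in range(n_cols)
        ])
    return grid
```

With two stations at `(0, 0)` and `(2, 0)` measuring 10 and 20, the estimate halfway between them is 15, and estimates move towards a station's own value as the cell gets closer to it.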
The resulting surface can be displayed either as a two-dimensional contour graph, or as a three-dimensional graph in which the x and y axes represent longitude and latitude and the z-axis represents the value of the environmental factor.
Another important aspect of both species and environmental data is scale. Spatial scale has two components: grain, which is the resolution of your data, and extent, which is the total study area. In the figure below, grain or resolution is the size of one individual grid cell, and refers to the sample resolution of a single observation. In other words, at what scale were the occurrence data for a particular species, or the measurements of a particular environmental factor, collected? For example, habitat type can be defined in grid cells of 1 km² or 10 km², which we refer to as fine versus coarse resolution. Extent refers to the total geographic area of a study.
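The relationship between grain and extent can be expressed in a few lines. In the sketch below, `cell_size` is the side length of a square grid cell and the extent is given by its bounding coordinates, all in the same (assumed projected) units. The function names are hypothetical.

```python
from math import ceil


def grid_shape(x_min, y_min, x_max, y_max, cell_size):
    """Return (n_rows, n_cols) needed to cover an extent at a given grain."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    n_cols = ceil((x_max - x_min) / cell_size)
    n_rows = ceil((y_max - y_min) / cell_size)
    return n_rows, n_cols


def cell_index(x, y, x_min, y_min, cell_size):
    """Return the (row, col) of the grid cell that contains the point (x, y)."""
    return int((y - y_min) // cell_size), int((x - x_min) // cell_size)
```

For the same extent, a finer grain means many more cells: an area of 100 by 50 units needs 5000 cells at a cell size of 1, but only 50 at a cell size of 10, and two observations that fall in different fine cells may share a single coarse cell.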
It is important to think about resolution when you select environmental datasets for your species distribution model. Ideally, you would choose a dataset with a resolution that is relevant to the species. For example, the appropriate resolution of a temperature dataset for modelling the distribution of a plant species, which always remains in the same place, differs from that for a species with a daily home range of 20 or 30 km, such as a large bird. Environmental variables from different regimes, such as temperature and soil, will likely come at different resolutions: climate data is often available at resolutions of 1 or 5 km², whereas soil datasets have a much finer resolution, such as <100 m².
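When layers with different resolutions are combined, the finer layer is often coarsened to match the coarser one. The sketch below shows one simple way to do this, by averaging non-overlapping blocks of cells. It assumes a continuous variable stored as a list of rows whose dimensions are exact multiples of the aggregation factor; the function name is hypothetical, and for categorical layers such as soil type a majority rule would be more appropriate than a mean.

```python
def aggregate_mean(grid, factor):
    """Coarsen a raster by averaging non-overlapping factor x factor blocks.

    grid: list of equal-length rows of numbers (fine resolution)
    factor: number of fine cells per coarse cell along each axis
    """
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")

    coarse = []
    for top in range(0, n_rows, factor):
        coarse_row = []
        for left in range(0, n_cols, factor):
            # Collect every fine cell that falls inside this coarse cell.
            block = [
                grid[r][c]
                for r in range(top, top + factor)
                for c in range(left, left + factor)
            ]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse
```

Aggregating a 4 by 4 grid with `factor=2` yields a 2 by 2 grid in which each value is the mean of four fine cells. Going the other way, from coarse to fine, adds cells but no new information.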
We encourage you to go and explore all the different datasets that are available and use them in your models, but just be aware that:
- Data collected by others needs to be properly referenced.
- When using data provided by an open online source, it is the modeller's own responsibility to check whether the data is accurate. Whilst many data providers have their own process for checking data quality, there is always the possibility that a dataset includes a few inaccurate records.
References:
- Franklin J (2010) Mapping species distributions: spatial inference and prediction. Cambridge University Press.
- Hijmans RJ, Cameron SE, Parra JL, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25(15), 1965-1978.
- Xu T, Hutchinson MF (2013) New developments and applications in the ANUCLIM spatial climatic and bioclimatic modelling package. Environmental Modelling & Software, 40, 267-279.
You can also view this information in Module 3 of our Online Open Course in SDM: https://app.bccvl.org.au/training
Lead partners: Griffith University, James Cook University
Thanks to: University of New South Wales, Macquarie University, University of Canberra
Funded by: National eResearch Tools and Resources Project (NeCTAR)
The BCCVL is supported by the National eResearch Tools and Resources Project (NeCTAR), an initiative of the Commonwealth being conducted as part of the Super Science Initiative and financed from the Education Investment Fund, Department of Education. The University of Melbourne is the lead agent for the delivery of the NeCTAR project and Griffith University is the sub-contractor.