Exploring a Global Wildlife GIS database
Using Python to characterize the International Union for Conservation of Nature (IUCN)’s geospatial data base.
The International Union for Conservation of Nature (IUCN) launched several projects to protect wildlife. One of these efforts led to a high-quality global geospatial database containing the habitats of more than 100,000 species. In this article, I explore its subset, focusing on terrestrial mammals.
The IUCN’s Red List of Threatened Species database contains more than 150,000 species, with geospatial information about habitats attributed to 80%+ of them. This database’s mere size proposes several challenges, which I may address in a later article. For now, I focus on a smaller subset — the global database consisting of terrestrial mammals with 12,436 records, each corresponding to one habitat patch per species. This mammal-habitat database is based on around four hundred different sources and contains 5,626 species identified by their binomial names, registered between 2008 and 2022. Furthermore, the database includes detailed taxonomic information, such as the order and family of the species. Additionally, a primary strength of the database is that it has detailed geospatial information on habitats in the form of polygon files, which I will explore in more detail later.
First, I will introduce and explore the non-geometric features of this dataset and then conduct a few analytical steps specific to the geospatial distribution of the different species. With this analysis, I hope to popularize this data source and encourage future work with potential applications for wildlife protection policies.
You may find all the IUCN data sources here, from which I downloaded the Terrestrial Mammals polygon data (search date: 2023–10–02 at 15:30:02)
All images in this article were created by the author.
1. Statistical exploration
1.1. Parse the dataset
First, let’s parse the database using GeoPandas and see what it contains:
import geopandas as gpd # version: 0.9.0
gdf_iucn = gpd.read_file('MAMMALS_TERRESTRIAL_ONLY')
print('Number of records: ', len(gdf_iucn))
print('Number of attributes: ', len(gdf_iucn.keys()))
gdf_iucn.head(3)
The output of this cell:
The geospatial data file seems to have 29 attributes, which we may also learn more about in the metadata documentation. Let’s explore them here!
The number of species, each attached by a unique ID, we will deal with is the following:
print(len(set(gdf_iucn.id_no)))
print(len(set(gdf_iucn.sci_name)))
Which cell rertuns the value 5626 twice, ensuring that indeed, each species have one unique id_no and one unique sci_name.
1.2 Taxonomical categories
We have several columns describing taxonomical categories: kingdom, phylum, class, order_, family, and genus. Let’s see which categories are the most frequent. Here, I only keep species’ names and taxonomical categories as unique pairs, so the frequency is expressed as the number of unique species per-category level.
from collections import Counter
for a in ['kingdom', 'phylum', 'class', 'order_', 'family', 'genus']:
print(a, Counter(gdf_iucn[[a, 'sci_name']].drop_duplicates()[a].to_list()).most_common(3), len(set(gdf_iucn[a])))
The output of this cell:
These statistics read as follows: ANIMALIA, CHORDATA, and MAMMALIA all cover every species in the dataset, meaning that they are all animals, they all have a spinal cord, and they are all mammals.
The first distinction comes at the level of biological order, where the top 3 — out of 26 different orders — are “RODENTIA” (Rodents) with 2,275 entries, “CHIROPTERA” (Bats) with 1,317 entries, and “PRIMATES” (Primates) with 521 entries.
Then comes 136 families in the data set, where topping the list are “MURIDAE” (Mouse and Rat Family) with 763 entries, “CRICETIDAE” (Hamster and Vole Family) with 659 entries, and “VESPERTILIONIDAE” (Vesper Bats) with 461 entries, these are smaller subdivisions of the order level, grouping similar species.
Finally comes the genus, where we have 1171 distinct categories, and on the top of the list are “Crocidura” (Shrews) with 196 entries, “Myotis” (Mouse-eared bats) with 120 entries, and “Rhinolophus” (Horseshoe bats) with 92 entries. Batman is getting pretty real, isn’t it?
After the top-list breakdown and not seeing too many familiar species from our everyday lives, we can also do a reverse search for some of the better-known species:
gdf_iucn[gdf_iucn.genus=='Canis'].head(5)
This quick search query shows the results for the genus of domestic dogs, which is “Canis.” In scientific classification, domestic dogs are part of the genus Canis, which includes various species and subspecies of canids. The scientific name for the domestic dog is Canis lupus familiaris. Other members of the Canis genus include the gray wolf (Canis lupus), the coyote (Canis latrans), and the golden jackal (Canis aureus), among others.
Additionally, the columns subspecies and subpop belong here as well; however, in this subset, they are practically empty fields.* Similarly, the column tax_comm only contains a few comments on the taxonomical data, so I disregard it.
1.3 How are habitat types characterized?
Let’s turn to their habitat types once we know about the different species and their categorizations. The columns presence, origin, seasonal, and their summary in *legend *contain expert-based quantifications of these other properties, scoring the first three on a scale from 1 to 6 (from extant to extinct / presence uncertain) and giving a textual description to the legend parameter. An example on their website shows how this works:
- presence = 1 (Extant); origin = 2 (Reintroduced); seasonal = 2 (Breeding season) maps to ‘Extant & Reintroduced (breeding)’
- presence = 3 (Possibly Extant); origin = 1 (Native); seasonal = 1 (Resident) maps to ‘Possibly Extant (resident)’
Read more about the scoring here and here.
gdf_iucn[['presence', 'origin', 'seasonal', 'legend']].head(5)
Let’s see how these habitat attributes are distributed:
import matplotlib.pyplot as plt
from collections import Counter
def get_distribution(x):
return list(zip(*[(k, v) for k, v in Counter(x.to_list()).most_common()]))
f, ax = plt.subplots(1,3,figsize = (15,4))
for idx, feat in enumerate(['presence', 'origin', 'seasonal']):
values, frequencies = get_distribution(gdf_iucn[feat])
ax[idx].bar(values, frequencies)
ax[idx].set_yscale('log')
ax[idx].set_title(feat)
The output of this cell:
f, ax = plt.subplots(1,1,figsize = (15,5))
values, frequencies = get_distribution(gdf_iucn['legend'])
ax.bar(values, frequencies)
ax.set_title('Frequency of habitat-characteristics', fontsize = 20)
ax.set_yscale('log')
ax.set_xticks(range(len(values)))
ax.set_xticklabels(values, rotation = 60, ha = 'right')
plt.show()
The output of this cell:
This graph shows that the most frequent tag is extant. However, the categories extinct and possibly extinct are in the top 10 as well, which is already a worrying sign.
Additional columns we have describing particular features of the habitats are island and marine; however, due to the nature of this subsample, these are non-applicable here.
1.4. Meta-information
When diving into the data table, one may also find information on the origin and type of the data itself, captured by compiler, yrcompiled, citation and source. The following quick toplists show which entities were the busiest compiling this database, which publications were the most frequent sources, and who these records are citing the most:
Counter(gdf_iucn.compiler).most_common(5)
Counter(gdf_iucn.source).most_common(5)
Counter(gdf_iucn.citation).most_common(5)
Additionally, the year of compilation column can hint at how updated the dataset is, showing values recorded between 2008 and 2022. The massive peak in 2008 probably corresponds to the launch of the database, and then, probably a few years of limited funding and/or less need for updates after the launch.
min(gdf_iucn.yrcompiled), max(gdf_iucn.yrcompiled)
f, ax = plt.subplots(1,1,figsize = (15,5))
ax.set_title('The number of records per year', fontsize = 20)
values, frequencies = get_distribution(gdf_iucn['yrcompiled'])
ax.bar(values, frequencies)
1.5. Endangerement information
Probably the most interesting categorical variable in this data set contains the information on how critical a species’ situation is. To characterize this, the IUCN Red List introduced nine categories about the conservatory status of different species, which are recorded in the category column. Out of these nine categories, eight are present in this dataset:
- Critically Endangered (CR),
- Data Deficient (DD),
- Endangered (EN),
- Extinct in The Wild (EW),
- Extinct (EX),
- Least Concern (LC),
- Not Threatened (NT),
- Regionally Extinct (RE),
- Vulnerable (VU)
From this categorization, Critically Endangered (CR), Endangered (EN) and Vulnerable (VU) species are considered.
Let’s remap into a human-readable format and count the categories, making sure we only count each species once:
category_d = { 'EX' : 'Extinct',
'EW' : 'Extinct in The Wild',
'RE' : 'Regionally Extinct',
'CR' : 'Critically Endangered',
'EN' : 'Endangered',
'VU' : 'Vulnerable',
'DD' : 'Data Deficient',
'LC' : 'Least Concern',
'NT' : 'Not Threatened'
}
gdf_iucn['category'] = gdf_iucn['category'].map(category_d)
Counter(gdf_iucn[['sci_name', 'category']].drop_duplicates().category).most_common()
The result of this cell:
These statistics show that 3205 (56%) of the species are of least concern; they are safe for now. However, 22% are endangered (Vulnerable, Endangered, or Critically Endangered). Additionally, we lack data for 14%, and there are about 15 species already extinct. Let’s commemorate them here:
sorted(set(gdf_iucn[gdf_iucn.category.isin(['Extinct', 'Extinct in The Wild'])].sci_name.to_list()))
The already lost species (thanks to ChatGPT for the English translations):
- Dusicyon australis (Falkland Islands Wolf) - Dusicyon avus (Darwin’s Fox) - Juscelinomys candango (Cerrado Mouse) - Leporillus apicalis (Lesser Stick-nest Rat) - Melomys rubicola (Bramble Cay Melomys) - Nesoryzomys darwini (Darwin’s Rice Rat) - Nyctophilus howensis (Lord Howe Long-eared Bat) - Oryx dammah (Scimitar-horned Oryx) - Palaeopropithecus ingens (Giant Aye-aye) - Pennatomys nivalis (Mountain Pygmy Possum) - Pipistrellus murrayi (Eastern Pipistrelle) - Pteropus subniger (Black Flying Fox) - Pteropus tokudae (Mariana Fruit Bat) - Sus bucculentus (Javan Warty Pig) - Xenothrix mcgregori (McGregor’s Primate)
Let’s also visualize the frequency distribution:
def get_color(x):
if x in ['Critically Endangered', 'Endangered', 'Vulnerable']:
return 'red'
elif x in ['Extinct in The Wild', 'Regionally Extinct', 'Extinct']:
return 'k'
else:
return 'green'
f, ax = plt.subplots(1,2,figsize = (12,6))
ax[0].set_title(70 * ' ' + 'The number of species category', fontsize = 20, pad = 30)
values, frequencies = get_distribution(gdf_iucn[['sci_name', 'category']].drop_duplicates().category)
colors = [get_color(v) for v in values]
for idx in range(2):
ax[idx].bar(values, frequencies, color = colors)
ax[idx].set_xticks(range(len(values)))
ax[idx].set_xticklabels(values, rotation = 60, ha = 'right')
ax[1].set_yscale('log')
plt.tight_layout()
2. Geospatial exploration
2.1. Habitat-size distribution
The first two geometric information we may lay our eyes upon are SHAPE_Leng and SHAPE_Area, corresponding to the total length of the habitat’s borders and the entire area of it, respectively. It’s convenient to prepare these figures, as computing length and area right can be tricky based on map projections. More on this latter topic here.
Now, let’s see the most significant and smallest areas — which species can be found everywhere and which need close observation and expeditions to be traced down?
# lets sum up the area of each patch a species may have
gdf_iucn.groupby(by = 'sci_name').sum().sort_values(by = 'SHAPE_Area').head(10)
The output of this cell:
Now get the top and bottom 10 species based on habitat area:
gdf_iucn.groupby(by = 'sci_name').sum().sort_values(by = 'SHAPE_Area').head(10).index.to_list()
The output of this cell, also with the English translation, looks as follows:
1. Melomys rubicola (Bramble Cay Melomys) 2. Eudiscoderma thongareeae (Thongaree’s Discoderma) 3. Murina balaensis (Bala Long-tailed Mouse) 4. Nyctophilus nebulosus (Eastern Tube-nosed Bat) 5. Cavia intermedia (Santa Catarina’s Guinea Pig) 6. Fukomys livingstoni (Livingstone’s Mole-rat) 7. Rhinolophus kahuzi (Kahuzi Horseshoe Bat) 8. Microtus breweri (Brewer’s Vole) 9. Myotis nimbaensis (Nimba Myotis) 10. Hypsugo lophurus (Steppe Serotine)
And now the other end — the largest habitats:
gdf_iucn.groupby(by = 'sci_name').sum().sort_values(by = 'SHAPE_Area', ascending = False).head(10).index.to_list()
1. Mus musculus (House Mouse) 2. Vulpes vulpes (Red Fox) 3. Canis lupus (Gray Wolf) 4. Mustela erminea (Ermine) 5. Mustela nivalis (Least Weasel) 6. Ursus arctos (Brown Bear) 7. Gulo gulo (Wolverine) 8. Alces alces (Moose) 9. Rangifer tarandus (Reindeer) 10. Lepus timidus (Mountain Hare)
2.2. Visualize the habitat polygons globally
There is only one column left that I haven’t mentioned, which is the geometry. Yet, it might contain the richest information of them all. First, just draw a simple GeoPandas map showing the location of each habitat polygon. In this visualization, I mark every habitat with a shaded, colored area with thin edges. Colors are assigned randomly based on the tab20 colormap.
Additionally, I transformed the geospatial data into the Mollweide projection to make my map look nicer. More on global map projections here.
# transform the coordinate reference system
gdf_iucn_t = gdf_iucn.copy()
gdf_iucn_t = gdf_iucn_t.to_crs('+proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs')
f, ax = plt.subplots(1,1,figsize=(15,10))
gdf_iucn_t.sample(200).plot(ax=ax, edgecolor = 'k', linewidth = 0.5, alpha = 0.5, cmap = 'tab20')
The output of this code block:
Now, the full data set! Also, to further improve my visualization, I add a base map with the library contextily. I added two different versions to let our personal taste decide!
import contextily as ctx
f, ax = plt.subplots(1,1,figsize=(15,10))
gdf_iucn_t.plot(ax=ax, edgecolor = 'k', linewidth = 0.5, alpha = 0.15, cmap = 'tab20')
ctx.add_basemap(ax, alpha = 0.8, crs = gdf_iucn_t.crs, url = ctx.providers.Esri.WorldPhysical)
ax.axis('off')
ax.set_ylim([-9.5*10**6, 9.5*10**6])
The output of this code block:
f, ax = plt.subplots(1,1,figsize=(15,10))
gdf_iucn_t.plot(ax=ax, edgecolor = 'k', linewidth = 0.5, alpha = 0.15, cmap = 'tab20')
ctx.add_basemap(ax, alpha = 0.8, crs = gdf_iucn_t.crs, url = ctx.providers.Esri.WorldGrayCanvas)
ax.axis('off')
ax.set_ylim([-9.5*10**6, 9.5*10**6])
#plt.savefig('worldmap_habitats_WorldGrayCanvas.png', dpi = 600, bbox_inches = 'tight')
2.3. Visualize the habitat polygons locally
Plotting each habitat on the same map, thousands of them, indeed leads to some exciting figures; however, it’s not straightforward to derive insights from them. On the road to deriving such insights, let’s zoom in and visualize the habitats of a few selected species.
For instance, when we search for giraffes (Giraffa camelopardalis) we will find ten different habitat patches:
gdf_iucn[gdf_iucn.genus.str.contains('Giraffa')].head(5)
The output of this code block:
Now, let’s use the code block below to produce habitat plots for giraffes, gorillas, lions, and african elephants!
f, ax = plt.subplots(1,1,figsize=(15,10))
gdf_iucn[gdf_iucn.sci_name=='Giraffa camelopardalis'].plot(ax=ax, edgecolor = 'k', linewidth = 0.5, alpha = 0.9, color = '#DAA520')
ax.set_xlim([-5, 55])
ax.set_ylim([-38, 15])
ax.set_title('The habitat patches of Giraffa camelopardalis (Giraffes)', fontsize = 18, y = 1.03)
ctx.add_basemap(ax, alpha = 0.8, crs = gdf_iucn.crs, url = ctx.providers.Esri.WorldPhysical)
plt.savefig('1_giraffe.png', dpi = 200)
2.4. Mapping into countries
After carefully studying a few selected habitats, now let’s go for a country-level aggregation. Specifically, I will map each habitat polygon into the admin boundaries of countries, then compute the total number of species, the total number of endangered species, and the ratio of these two.
I used the Natural Earth database’ Admin 0 — Countries’ file from here to obtain country-level admin boundaries.
world = gpd.read_file('ne_10m_admin_0_countries')
print(len(set(world.ADMIN)))
world.plot()
The result of this code block:
Let’s group endangerement categories together:
def is_endangered(x):
if x in ['Critically Endangered', 'Endangered', 'Vulnerable']:
return True
else:
return False
gdf_iucn['endangered_species'] = gdf_iucn.category.apply(is_endangered)
print(Counter(gdf_iucn['endangered_species']))
Based on this, the number of habitat patches that belong to endangered species is 2680, while the rest is 9756.
Now build country-level dictionaries measuring the number of species and endangered species in each country:
number_of_all_species = gpd.overlay(world, gdf_iucn).groupby(by = 'ADMIN').count().to_dict()['geometry']
number_of_end_species = gpd.overlay(world, gdf_iucn[gdf_iucn.endangered_species==True]).groupby(by = 'ADMIN').count().to_dict()['geometry']
world['number_of_all_species'] = world.ADMIN.map(number_of_all_species)
world['number_of_end_species'] = world.ADMIN.map(number_of_end_species)
world['number_of_all_species'] = world['number_of_all_species'].fillna(0)
world['number_of_end_species'] = world['number_of_end_species'].fillna(0)
world['ratio_of_end_species'] = world['number_of_end_species'] / world['number_of_all_species']
Finally, use these updated to visualize the global distributions on the level of countries:from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.colors import LogNorm
f, ax = plt.subplots(1,1,figsize=(15,7))
ax.set_title('Total number of species', fontsize = 20, pad = 30)
world.plot(ax=ax, color = 'grey', alpha = 0.5, linewidth = 0.5, edgecolor = 'grey')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=-0.01)
world[world.number_of_all_species>0].plot(column = 'number_of_all_species',ax=ax,legend_kwds={'label': "Total number of species"}, edgecolor ='k', linewidth = 1.5, cax=cax, cmap = 'Greens', legend=True, norm=LogNorm(vmin=1, vmax=world.number_of_all_species.max()))
f, ax = plt.subplots(1,1,figsize=(15,7))
ax.set_title('Number of endangered species', fontsize = 20, pad = 30)
world.plot(ax=ax, color = 'grey', alpha = 0.5, linewidth = 0.5, edgecolor = 'grey')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=-0.01)
world[world.number_of_end_species>0].plot(column = 'number_of_end_species',ax=ax, legend_kwds={'label': "Number of endangered of species"}, edgecolor ='k', linewidth = 1.5, cax=cax, cmap = 'RdYlGn_r', legend=True, norm=LogNorm(vmin=1, vmax=world.number_of_end_species.max()))
plt.savefig('2_map.png', dpi = 200)
f, ax = plt.subplots(1,1,figsize=(15,7))
ax.set_title('Ratio of endangered species', fontsize = 20, pad = 30)
world.plot(ax=ax, color = 'grey', alpha = 0.5, linewidth = 0.5, edgecolor = 'grey')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=-0.01)
world.plot(column = 'ratio_of_end_species',ax=ax, legend_kwds={'label': "Fraction of endangered of species"}, edgecolor ='k', linewidth = 1.5, cax=cax, cmap = 'RdYlGn_r', legend=True, vmin = 0, vmax = 0.22)
3. Concluding remarks
In this article, I introduced the IUCN’s geospatial data set consisting of thousands of species’ recorded habitats. After a short statistical overview, I showed how to access, visualize, and manipulate these geospatial data sets to inspire future work.
While at the end, I built my own simple index quantifying the level of wildlife endangerment in different countries, the IUCN provides ready-made maps to this end, building on much more sophisticated methodologies. Make sure to check it out here!
Finally, the IUCN accepts donations for future funding here.
Reference: IUCN. 2022. The IUCN Red List of Threatened Species. Version 2022–2. https://www.iucnredlist.org. Accessed on 02–10–2023.