Weeknotes: 14th April 2025
Last Week
LIFE vs GDAL
I talked last week about the struggles I was having with my naive approach to running some higher resolution map data through the LIFE pipeline, and how whilst I'd allowed for scaling in time, I hit problems in terms of scaling up in space. Thankfully everything ran slowly but smoothly from then on. The main lesson is that if you're using GDAL, you need to set some cache limits on it every time you use it, as data always gets bigger over time, and if you don't do it at the outset you will be bitten by it later when you do scale up. It doesn't need to be precise; I think a few heuristics will do, like: this code goes through the map data once, so I don't need to cache any of it as I'll never read it a second time, or this code is using gdal.Warp or another opaque function where I can't chunk the data, so use 75% of free memory. I think I need to make Yirgacheffe, my geospatial library that wraps GDAL, take a stance on this also, as Yirgacheffe tries to abstract away problems of memory management in terms of loading data in chunks etc., but GDAL's caching, which keeps all the data in memory until you run out of it, makes a mockery of my efforts.
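Those two heuristics could be written down as something like the sketch below. This is not code from Yirgacheffe or LIFE: the function names, the 16 MiB floor for single-pass code and the 64 MiB default are all made up for illustration, and the caller is assumed to supply the free memory figure. The only real GDAL call is `gdal.SetCacheMax`, which sets the block cache size in bytes.

```python
MIB = 1024 * 1024


def gdal_cache_budget(free_bytes: int, single_pass: bool = False,
                      opaque_op: bool = False) -> int:
    """Pick a GDAL block cache size in bytes from two rough heuristics."""
    if free_bytes < 0:
        raise ValueError("free_bytes must not be negative")
    if opaque_op:
        # gdal.Warp and similar calls can't be chunked by the caller, so
        # hand GDAL most of what is free (the 75% figure from the text).
        return (free_bytes * 3) // 4
    if single_pass:
        # The data is read once and never revisited, so caching buys nothing.
        # Assumption: keep a small floor rather than going all the way to zero.
        return min(free_bytes, 16 * MIB)
    # Assumption: a modest default for anything not covered above.
    return min(free_bytes, 64 * MIB)


def apply_gdal_cache_budget(budget_bytes: int) -> bool:
    """Apply the budget if the GDAL Python bindings are installed."""
    try:
        from osgeo import gdal
    except ImportError:
        return False
    gdal.SetCacheMax(budget_bytes)
    return True
```

The point is less the specific numbers than that every entry point makes an explicit choice, for example `apply_gdal_cache_budget(gdal_cache_budget(free, opaque_op=True))` just before a warp.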
LIFE vs coastlines
Having generated a new set of maps integrating some different local data, we then wanted to take a look at species impacted by the change, and see if we thought the new values made sense. As part of this I was asked to classify species impacted as being endemic to the region we updated (i.e., they live only within the update area), non-endemic (they are partially in the area but also outside of it), or external (have no occurrence range in the updated area). This is where once again I hit upon the challenge that some of the range maps in the IUCN Red List follow coastlines closely.
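As a minimal sketch of that three-way classification, assume both the species range and the updated region have already been rasterized onto the same grid and are represented as sets of `(row, column)` cells. That representation and the function name `classify_species` are assumptions for illustration, not how the LIFE pipeline stores its data.

```python
def classify_species(range_cells: set, region_cells: set) -> str:
    """Classify a species range against a region, both given as grid cells."""
    if not range_cells:
        raise ValueError("species has no range cells")
    inside = range_cells & region_cells
    if not inside:
        return "external"      # no part of the range is in the region
    if inside == range_cells:
        return "endemic"       # every range cell is in the region
    return "non-endemic"       # some cells inside, some outside


# Toy 3x3 region with hypothetical ranges.
region = {(r, c) for r in range(3) for c in range(3)}
print(classify_species({(0, 0), (1, 1)}, region))   # endemic
print(classify_species({(1, 1), (1, 3)}, region))   # non-endemic
print(classify_species({(5, 5)}, region))           # external
```

This strict test is unforgiving: a single range cell outside the region is enough to turn an endemic species into a non-endemic one.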
First we should consider what is a species range map. You can see an example on most pages for species on the IUCN Red List, like the map here on this page for the Ariel Toucan:

The range is the rough area a species is thought to occur in. There doesn't seem to be one standard definition, but we can start with this one from An evaluation of the robustness of global amphibian range maps by Ficetola et al, 2014:
Geographical range maps encompass the broad areas where a species is thought to be found, and assume the species’ presence inside the range and absence outside.
So let's say I want to ask if the Ariel Toucan is endemic to Brazil. At the high level, this would appear to be true:

But when I do the computation on this, I get told no, the Ariel Toucan isn't endemic to Brazil. This is because, as far as I can tell, the definition of the coastline isn't consistent between sources:

There are both bits where we have the Ariel Toucan out over the ocean, and bits of land not covered by the range. Perhaps this is what was intended by the expert assessor that drew the range map, but I think the simpler explanation is that I have two definitions of what the coastline of Brazil is. This wouldn't be a surprise, as the coastline changes over time as land shifts with tidal erosion etc., or even with the tides themselves moving the coastline twice a day. The range map will have been made at one point in time, and the other map I'm using at another point in time several years apart and by other people, so I'd be surprised if they were identical.
When dealing with species presence maps there are two styles of errors to be concerned about, referred to as commission errors and omission errors. Commission errors are where you include land the species wouldn't be found in (i.e., the area is too big), and omission errors are where you exclude land the species would be found in (i.e., the area is too small). In general (in my naïve understanding as a computer scientist anyway) range maps err on the side of minimising omission errors (i.e., including more land than necessary to ensure we don't miss any bits where the species might be), and we then use Area of Habitat maps to take the range and narrow it down to only include land we know meets certain requirements based on known species preferences (like elevation and habitat types), and thus try to minimise commission errors.
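To make the two error types concrete, here is a small sketch that treats a map and the (in practice unknowable) true occupancy as sets of grid cells. The function name and the toy cells are hypothetical; real assessments have no ground truth this tidy.

```python
def range_errors(mapped_cells: set, occupied_cells: set):
    """Return (commission, omission) cell sets for a map against the truth."""
    commission = mapped_cells - occupied_cells   # mapped, but species absent
    omission = occupied_cells - mapped_cells     # species present, but unmapped
    return commission, omission


# Toy example: the species really occupies a 2x2 block of cells.
occupied = {(0, 0), (0, 1), (1, 0), (1, 1)}

# A generous range map covers all of it plus a margin: no omission errors.
generous_range = {(r, c) for r in range(3) for c in range(3)}
commission, omission = range_errors(generous_range, occupied)
print(len(commission), len(omission))   # 5 0

# A tightly drawn map that misses one occupied cell: an omission error.
tight_map = {(0, 0), (0, 1), (1, 0)}
commission, omission = range_errors(tight_map, occupied)
print(len(commission), len(omission))   # 0 1
```

In these terms, a range map trades commission errors for fewer omission errors, and the AoH step then tries to claw back the commission errors.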
And thus I find this adherence of range maps to coastlines a little confusing, as for a range map to minimise omission errors I'd expect it to be a smooth line just outside the coast (for a terrestrial species, and inside the coast if it was a marine species of bird), so as to safely include all the coastal edge no matter which particular map I'm using with its coastline definition.
You could rightly say that I'm asking the wrong question by using the map I did for Brazil there: I should have used one where it includes the sea border, which is actually the technique I've switched to for this sort of question. But I do think following the coastline gives a false level of precision to the data that is (again as a naïve computer scientist) disingenuous, particularly given the idea of minimising omission. Indeed, when it goes to the AoH map, we can see from the screenshot above that the Ariel Toucan lives in Forest, Savannah, and artificial terrestrial habitats, so the sea would be removed from the range when we do the next standard step of trying to better define where the species is found.
The other significant challenge is that this detailed following of coastlines is a lot of data to both store and process. In my implementation of the STAR pipeline, which like LIFE is based on AoH maps, I process 2164 bird species, and the average size of the range GeoJSON file is 2.1 MB. But if we take a look at the file listing, we can see that for some species we have a few much larger sizes, going up to 183 MB. This graph plots all the GeoJSON file sizes for the birds:

The specific numbers don't really matter, but what is important is that we have a few files that take up most of the storage space for this evaluation. The largest of these is the Sooty Shearwater, whose range follows a large amount of the world's coastlines. The next few largest files are similarly down to coastal edges being in the range.
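One way to put a number on "a few files take up most of the space" is to ask what fraction of the total the largest N files account for. The sketch below assumes the ranges sit as `*.geojson` files in a single directory, which is a hypothetical layout, and the sizes in the example are invented rather than taken from the STAR run.

```python
from pathlib import Path


def geojson_sizes(directory: str) -> list:
    """Sizes in bytes of every *.geojson file directly inside a directory."""
    return [p.stat().st_size for p in Path(directory).glob("*.geojson")]


def share_of_largest(sizes: list, top_n: int) -> float:
    """Fraction of the total size taken up by the top_n largest entries."""
    if top_n < 0:
        raise ValueError("top_n must not be negative")
    total = sum(sizes)
    if total == 0:
        return 0.0
    largest = sorted(sizes, reverse=True)[:top_n]
    return sum(largest) / total


# Invented sizes: one huge file among several small ones.
example_sizes = [90, 5, 3, 2]
print(share_of_largest(example_sizes, 1))
```

Running `share_of_largest(geojson_sizes("ranges/"), 10)` over a real directory would give the share held by the ten largest ranges, where `ranges/` is again just a placeholder path.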
Not only are these files big, but they are also slow to rasterize when calculating AoH. I have a compute server that lets me calculate 200 species in parallel, and calculating all the AoHs for STAR takes about two hours. But at least the last 40 minutes of that is just processing the AoH for the Sooty Shearwater, as its range is so detailed around the coastline that just rasterizing it is incredibly slow. This is ironic (borrowing from the dictionary of Alanis Morissette) because the Sooty Shearwater has a tiny terrestrial AoH, which is actually what we care about for STAR - it mostly lives over the oceans, so we spend all this time to throw most of the data away, and I'd argue that what little we do get I'd treat with strong suspicion due to alignment issues between the range layer and whatever habitat and elevation layers we use for calculating AoH.
Trying to capture a shifting world and shifting species is hard, and so there isn't going to be a perfect answer, but it'd be interesting to know from ecologists if there is something I'm missing here with these coastal definitions that means it is important they follow some presumed coastline in detail, or if there's a chance we can simplify these ranges to the betterment of both working with them more easily and getting more accurate results once processed.
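For a sense of what simplifying could mean mechanically, here is a sketch of the Douglas-Peucker algorithm, a common line simplification technique that drops vertices lying within a tolerance of the straight line between their neighbours. This is a swapped-in textbook method, not something LIFE or STAR does today, and it works on an open line of `(x, y)` points; a closed polygon ring would need splitting into open pieces first, and simplified polygons can self-intersect, which is why geometry libraries offer topology-preserving variants.

```python
import math


def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_line(points, tolerance):
    """Douglas-Peucker: keep only vertices further than tolerance from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_distance = 0, -1.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > worst_distance:
            worst_index, worst_distance = i, d
    if worst_distance <= tolerance:
        return [first, last]
    # Keep the furthest vertex and recurse on either side of it.
    left = simplify_line(points[: worst_index + 1], tolerance)
    right = simplify_line(points[worst_index:], tolerance)
    return left[:-1] + right


wiggly = [(0, 0), (1, 0.05), (2, -0.05), (3, 0), (4, 2), (5, 0)]
print(simplify_line(wiggly, 0.1))   # [(0, 0), (3, 0), (4, 2), (5, 0)]
```

Simplifying on its own only reduces vertex counts; it would not guarantee the range still covers every coastal cell, so it would probably need pairing with a small outward buffer to keep omission errors down. The recursion is also naive and would need an iterative form for rings with very many vertices.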
Yirgacheffe and MLX
I said many weeks ago that I'd like to add GPU support to Yirgacheffe, via MLX and CuPy support (which are Metal and CUDA Python libraries respectively), and I finally made a start on turning my early proof-of-concept into a reality. Initially I'm going to just get MLX working (given I work on a Mac most of the time), and then I'll add CuPy support for NVIDIA GPUs once that is working.
At the moment it's a lot of plumbing work as I need to add an abstraction layer where before I've been calling NumPy directly, and then map to MLX calls directly where possible, or write bridging wrappers where necessary. The one good thing is I have a lot of test coverage for Yirgacheffe, most of which is currently lit up red to tell me I'm not done, but it does give me a good way to track my progress :)
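The general shape of that kind of abstraction layer might look like the sketch below. It is not Yirgacheffe's actual code: the `Backend` class, its three operations and the pure-Python fallback are invented for illustration. It relies only on `mlx.core` and `numpy` both exposing `array`, `add` and `sum`, and on their results having an `.item()` method for pulling out a Python scalar.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Backend:
    """The small set of array operations the rest of the library calls."""
    name: str
    array: Callable[[list], Any]
    add: Callable[[Any, Any], Any]
    total: Callable[[Any], Any]


def _module_backend(module_name: str) -> Backend:
    # Direct mapping: both mlx.core and numpy provide array/add/sum.
    mod = importlib.import_module(module_name)
    return Backend(
        name=module_name,
        array=mod.array,
        add=mod.add,
        # Bridging wrapper: convert the library's scalar to a Python number.
        total=lambda a: mod.sum(a).item(),
    )


def _pure_python_backend() -> Backend:
    # Fallback so the sketch runs with neither library installed.
    return Backend(
        name="python",
        array=list,
        add=lambda a, b: [x + y for x, y in zip(a, b)],
        total=lambda a: float(sum(a)),
    )


def select_backend(preferred=("mlx.core", "numpy")) -> Backend:
    """Return the first importable backend, in order of preference."""
    for module_name in preferred:
        try:
            return _module_backend(module_name)
        except ImportError:
            continue
    return _pure_python_backend()


backend = select_backend()
a = backend.array([1.0, 2.0, 3.0])
b = backend.array([10.0, 20.0, 30.0])
print(backend.name, backend.total(backend.add(a, b)))
```

Calling code only ever touches `backend.add` and friends, so adding CuPy later would mean one more entry in `preferred` plus wrappers for wherever its API differs.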
Soil
I went to see an exhibition on soil at Somerset House (now finished alas). It was interesting to see an attempt to engage the general public on many of the things we've talked about, such as how fungi live symbiotically with plants, trading nutrients for carbon, and geographic variance in soil composition, shown via tiny crude clay bowls from around the world and soil turned into a "paint" then applied to paper. There was also some linkage between soil and the power to control that resource, which I think was interesting as a layer not really present in the work I do in, say, land usage change: the maps I produce are generally without borders, ignoring that very real aspect of land usage change being tied to the whims of the regional governments.
Next week
This week and next I'm working remotely from up on The Wirral, which is why I'll not be in the dab or the lab for meetings. My goals for this coming week are:
- Get my MLX Yirgacheffe to pass all tests (fingers crossed)
- Add anchors to sections for my blog software, so I can share links to specific topics within my weeknotes
- Do some more analysis on these modified LIFE layers
- Do some more playing with GBIF data and get Frank occurrence data for testing with his foundation model
Tags: life, yirgacheffe, maps