Weeknotes: 10th November 2025
Last week
Habitat maps
A short while ago I implemented an attempt to fuse the GAEZ and HYDE farming land use data with the Jung habitat map that we use for the LIFE biodiversity metric. The challenge here is that the GAEZ and HYDE datasets are at 10km resolution, whilst the Jung map is at 100m resolution and LIFE is calculated at 1.8km resolution. Whilst I implemented a method to achieve this, I had some concerns about the impact of the random sampling method we use to redistribute the changes of habitat from GAEZ/HYDE onto the Jung map. Macroscopically everything is fine when we sum up the total pixels, but at a per-pixel level I was concerned that we might be impacting rare species with small ranges: i.e., the ones close to extinction that we really care about.
To try to quantify this, I spent this week evaluating this mapping method: running it ten times, processing all 30k species from the IUCN Red List that LIFE considers, and working out not just the change in the number of species that go extinct (if any), but also trying to assess the quality of their Area of Habitat (AOH) maps using the Dahal et al. validation method.
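To make the concern concrete, here's a rough sketch of the kind of random redistribution I mean. This is my simplified illustration rather than the actual pipeline code, and the class codes, block size, and fractions are all made up:

```python
import numpy as np

# Hypothetical sketch: within one coarse GAEZ/HYDE cell, convert the required
# amount of land use change into a count of fine Jung pixels, then pick that
# many candidate pixels at random to reassign.
rng = np.random.default_rng()

jung_block = np.full((100, 100), 200, dtype=np.uint16)   # placeholder habitat codes
CROPLAND_CODE = 401                                      # placeholder class code

cropland_fraction_change = 0.05                          # from the coarse datasets
pixels_to_convert = int(round(cropland_fraction_change * jung_block.size))

# Candidate pixels are those not already cropland; sample without replacement.
candidates = np.flatnonzero(jung_block != CROPLAND_CODE)
chosen = rng.choice(candidates, size=pixels_to_convert, replace=False)
jung_block.ravel()[chosen] = CROPLAND_CODE
```

The totals always balance out across the block, but which individual pixels change is down to chance, which is exactly what might matter for a species with a tiny range.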
Generating the maps
The first step is generating and processing the maps ready for use by the LIFE pipeline. Whilst I often claim that the LIFE pipeline takes about a day to run from fresh, it's rare that I do the entire thing end to end, because the habitat maps rarely update. Processing the habitat layers takes about half a day or so for LIFE, and so most times I update LIFE it takes just half a day, as I'm re-using the already-processed habitats.
My motivation to speed up those earlier stages has been low, because I do them so rarely, but because I needed to run many habitat iterations here, and I didn't want it to take the better part of two weeks, I decided to spend a day looking at how I might increase performance: if I spent a day on this and halved the time, I'd still be ahead of where I would be otherwise. To save on suspense, I did indeed manage to speed it up significantly, and I did so via two changes to how the code works.
Firstly, I took a look at how the parallelism was (or rather wasn't) working. One job of the habitat processing code is to take the Jung habitat map at 100m and make a downsampled map per habitat class with fractional occupancy at 1.8km for LIFE (the same code is used in my STAR pipeline, where it downsamples to 1km). My code attempted to parallelise the work by doing each habitat class in a different parallel thread, but in practice this isn't very useful, as the memory requirements of each thread are very high. Instead I reworked things to do one habitat class after another, and then use parallelism within the processing of each class. Some of this code just hands over to GDAL, which has limited parallelism options, but it was still faster. I also switched away from LZW compressing the result as it's written to GeoTIFF, as GDAL's LZW compression is known to be a serialisation point, and doing that let GDAL's warp run much faster.
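As a rough illustration of the restructure (a sketch under my own assumptions, not the actual pipeline code; the class codes, filenames, and resolution are placeholders), processing one habitat class at a time while letting GDAL parallelise internally looks something like this:

```python
from osgeo import gdal

gdal.UseExceptions()

HABITAT_CLASSES = [100, 200, 300]   # placeholder Jung class codes

for habitat_code in HABITAT_CLASSES:
    # Downsample one per-class binary mask at a time, letting GDAL use all
    # cores for the warp, and skip LZW compression on the output so the write
    # doesn't become a serialisation point.
    gdal.Warp(
        f"class_{habitat_code}_coarse.tif",   # fractional occupancy at the coarse resolution
        f"class_{habitat_code}_mask.tif",     # assumed pre-made 0/1 mask at 100m
        xRes=0.016667, yRes=0.016667,         # roughly 1.8km in degrees (assumption)
        resampleAlg="average",                # averaging a 0/1 mask gives fractional occupancy
        outputType=gdal.GDT_Float32,
        multithread=True,
        warpOptions=["NUM_THREADS=ALL_CPUS"],
        creationOptions=["COMPRESS=NONE", "BIGTIFF=YES"],
    )
```

The point is that only one large raster is in flight at a time, so memory pressure stays manageable while the cores still get used.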
Secondly, I was struggling to get GDAL to use as much memory as I had available, despite setting the obvious variables at my disposal. One suggestion I saw that turned out to help with both memory utilisation and code tidiness was to switch from GDAL's in-memory datasets using its "memory" driver to GDAL's Virtual File System interface, using an in-memory file with the regular "geotiff" driver. I have to confess I'm not sure why that works better, but it did, and in doing so I was able to remove a hack I had in place that used a private API on Yirgacheffe, as I could just pass the virtual file path GDAL used to Yirgacheffe and have it work normally.
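For anyone curious, a minimal sketch of the virtual-file approach (the path and sizes here are made up) looks like this:

```python
import numpy as np
from osgeo import gdal

gdal.UseExceptions()

# Write an in-memory GeoTIFF via GDAL's virtual file system rather than the
# "MEM" driver, so the dataset has a normal path that other code can open
# without any special casing.
path = "/vsimem/scratch.tif"
driver = gdal.GetDriverByName("GTiff")
dataset = driver.Create(path, 512, 512, 1, gdal.GDT_Float32)
dataset.GetRasterBand(1).WriteArray(np.zeros((512, 512), dtype=np.float32))
dataset.FlushCache()

# Elsewhere, the same "file" can be reopened by path like any other GeoTIFF:
reopened = gdal.Open(path)
print(reopened.RasterXSize, reopened.RasterYSize)

# Clean up the in-memory file when done.
del dataset, reopened
gdal.Unlink(path)
```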
In the end, the combination of those two changes got me much better throughput, and I could generate the habitat maps within a single day on my Mac Studio working flat out.
All told, the machine was busy for about three days generating both the habitat maps and the AOH maps, but that's still impressively fast for such a small machine, and rivals the performance of the AMD EPYC servers we have access to, thanks to the fact I can leverage the GPUs on the Mac. However, using those GPUs did cause a little trouble this week...
Aggregation issues in Numpy vs MLX
In all this I stumbled upon an issue when generating the AOH maps: I was getting negative areas for some species when I ran the validation code, something that definitely shouldn't happen. Errors like this always cause me a moment of panic, as a lot of people trust that my code is doing the right thing, and it'd be embarrassing to have to explain to everyone that I'd made some fundamental error.
Thankfully this error was unique to this run, where I was using my Mac Studio, and not something that had been impacting runs on Linux, where all of the real work has been done to date. That's because on my Mac Studio I use Yirgacheffe's MLX backend to access the GPU power of that machine, which brings it on par with the hundreds of CPU cores of the Linux machine, which uses numpy as its calculation backend.
It turns out that MLX and numpy apply different automatic type promotion to the same data when summing it, which worked out fine with the numpy backend on Linux, but not with the MLX backend. My AOH code was technically wrong: it typecast one layer in the AOH calculation to a 16-bit signed integer rather than the 32-bit floating-point type I'd intended, but numpy's behaviour when summing meant the right thing happened anyway (it summed the int16 layer in float), which is why I'd not spotted this before. Moving to MLX, I was getting integer overflows when trying to sum the areas within an int16 type, as it did what I'd actually told it to do.
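A toy illustration of the difference, assuming the behaviour described above (this isn't the AOH code itself):

```python
import numpy as np

# A layer accidentally cast to int16, as in the bug described above.
pixels = np.full((4096, 4096), 3, dtype=np.int16)

# numpy widens the accumulator when summing integers, and promotes to float
# as soon as a float layer is involved, so the total comes out correct even
# though the int16 cast was wrong:
print(pixels.sum())                  # 50331648, accumulated in a wider integer

area = np.full((4096, 4096), 0.25, dtype=np.float32)
print((pixels * area).sum().dtype)   # float32: mixing with a float layer promotes the sum

# MLX, by the behaviour described in this post, does what the dtype says and
# sums within the narrow integer type, so the same calculation overflows:
#   import mlx.core as mx
#   mx.array(pixels).sum()           # wraps around rather than widening (per this run)
```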
So this was a good outcome overall (old results aren't impacted, and I can fix the AOH code to have the correct type hinting in it), but it's still concerning to me that Yirgacheffe does different things depending on which compute backend you use when aggregating geospatial layers. I suspect I can force MLX to be more like numpy here, because I can query numpy for what types it'll use for a given operation without doing the operation, but that's a task for another day.
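As a sketch of that idea (not something Yirgacheffe does today), numpy can be asked what dtype it would produce without actually running the operation, which would give a reference answer the MLX backend could be made to follow:

```python
import numpy as np

# Ask numpy for the promoted dtype of an operation without running it:
print(np.result_type(np.int16, np.float32))     # float32
print(np.result_type(np.int16, np.int16))       # int16

# For reductions like sum, the accumulator numpy picks can also be inspected
# cheaply on a tiny array of the same dtype:
print(np.zeros(1, dtype=np.int16).sum().dtype)  # int64 on most platforms
```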
GBIF API Failures
The Dahal et al. validation method comes in two halves: a model-checking part, where it assesses the AOH maps based on statistical properties to spot outliers, and an occurrence-checking part, where you take occurrence data from GBIF (i.e., "confirmed" sightings of an animal) and test your AOHs against those. Whilst the first is relatively cheap to run, the second requires that you download a reasonable amount of data for your particular set of species.
Whilst I implemented the occurrence part only relatively recently, it did successfully run for me on the STAR species set (around 5K species). My week ended, however, with me struggling to download any data from GBIF for LIFE (which is about 30K species). Initially GBIF's API failed straight away, telling me I had too many terms in my query, so I tweaked it to request data by taxa, which got the individual queries down to 1 to 2 times the size of the STAR query I'd successfully run before. Even those eventually failed, but only after 8 hours, so progress was somewhat slow. We do in theory have a mirror of GBIF somewhere in the group, but I wanted to use the GBIF APIs so I can get a DOI with the data to aid the reproducibility of results when we publish papers.
In the end I got it working by significantly restructuring my query to do fewer but more complex queries against GBIF, rather than thousands of simple per-species queries, and over the weekend I was finally able to get the data I'd started trying to fetch late on Thursday. I find the GBIF query language isn't quite what I'd expect it to look like, so I need to check the results are meaningful before I run the validation based on them.
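For the curious, the shape of the "fewer, bigger queries" approach looks roughly like the sketch below, using GBIF's occurrence download API directly (which is also what issues the citable DOI). The credentials, taxon keys, and predicate here are placeholders, not the actual query I ran:

```python
import requests

GBIF_USER = "my-gbif-user"           # placeholder credentials
GBIF_PASSWORD = "my-gbif-password"
GBIF_EMAIL = "me@example.com"

# Hypothetical species keys; the real run batches many thousands into one request.
taxon_keys = ["2476674", "2482552"]

request_body = {
    "creator": GBIF_USER,
    "notificationAddresses": [GBIF_EMAIL],
    "format": "SIMPLE_CSV",
    "predicate": {
        "type": "and",
        "predicates": [
            # One "in" predicate covering many species, rather than one query per species.
            {"type": "in", "key": "TAXON_KEY", "values": taxon_keys},
            {"type": "equals", "key": "HAS_COORDINATE", "value": "true"},
        ],
    },
}

response = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=request_body,
    auth=(GBIF_USER, GBIF_PASSWORD),
)
response.raise_for_status()
print("Download key:", response.text)  # poll this key until the download (and its DOI) is ready
```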
Status at the end of the week
So I failed to get the full analysis of the new habitat method done before the week was out, because I really want to run the occurrence check before I pass full judgement. That said, I am motivated to press on with that as:
- The earlier data implies that the random sampling is problematic
- I have plans to do machine-learning-derived habitat maps using Tessera in the near future, and this is the exact same analysis I'll need to do for the output of that process.
The big migration
This website, the Yirgacheffe website, my many other websites, my GoToSocial instance, and my Matrix homeserver were all migrated this week from the Raspberry Pi they'd been on since March to a more performant VPS hosted by Hetzner. I'd picked the Pi on a whim, as I thought it'd be nice and low-energy and plenty to run the kind of things I do, but in practice it was just too slow, particularly for the dynamic image processing my websites do. I'd been putting this off, but it was getting in the way of new features I've been wanting to add to my websites for a while, so I finally put in the effort and got it done.
Whilst I was at it I also migrated from Nginx to Caddy, mostly because it meant I didn't need to worry about migrating my certbot configuration, as Caddy deals with Let's Encrypt directly. Not that I had any problems with certbot per se; it's just yet another thing to set up, and it was the only thing I needed snap for, so I'd rather just do without it.
For those curious, I picked Hetzner as I already have stuff there using their S3-alike storage, I like their management interface, and they run all their systems on sustainable energy.
This week
- Look at whether the GBIF data is what I expected
- Complete the validation run for these habitat maps
- Look into what’s new with GDAL 3.12 that was just released
- Patrick and I had a good chat about the Numpy/MLX type issues and what Yirgacheffe might do one day in the distant future with types, and I'd like to try to scribble that down and do some related reading before it all leaves my head again
Tags: habitat maps, life, yirgacheffe