Weeknotes: 24th February 2025
Last Week
LIFE
As I started last week, I finished the changes that let you run the LIFE pipeline using the bulk data downloads from the IUCN Red List website, which give you a bunch of CSV files of species data and some shapefiles of range data. Previously I'd been using a PostGIS database with all the data in, but not everyone has one of those to hand, and I want to make it as easy as possible for others to run the pipeline to reproduce our work.
After being very impressed last week at the behaviour of DuckDB, particularly in the way it'd let me do spatial queries, I did realise that the performance of ST_UNION_AGG, its version of PostGIS's ST_UNION, was significantly slower than just pulling the different ranges for a species and unifying them with Shapely in Python. DuckDB is still a hugely useful tool, and normally I'm happy to let PostGIS do such unions, but with DuckDB I just need to be a bit more cautious.
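For concreteness, the workaround looks roughly like this, a sketch with made-up table and column names rather than the actual pipeline code:

```python
import duckdb
from shapely import wkb
from shapely.ops import unary_union

def species_range(connection: duckdb.DuckDBPyConnection, species_id: int):
    # Fetch each range polygon for the species as WKB and union them with
    # Shapely on the Python side, which turned out to be faster for me than
    # asking DuckDB's ST_Union_Agg to do it in the database.
    rows = connection.execute(
        "SELECT ST_AsWKB(geometry) FROM ranges WHERE id_no = ?",
        [species_id],
    ).fetchall()
    return unary_union([wkb.loads(row[0]) for row in rows])
```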
I also did as I intended: in implementing the new species ingest code, I've tweaked it to generate a report saying, for all the species in the download, which ones were filtered out and why. It's a silly small thing, but it's so handy I'm kicking myself for not having thought to implement it earlier, instead only doing so once I'd seen how it'd made Chess's life easier for her STAR pipeline.
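The pattern itself is nothing fancy; a hypothetical sketch of the idea, with the field names and filter rules invented for illustration:

```python
import csv

def filter_species(species_list):
    # Rather than silently dropping species, each filter records a reason,
    # so the rejects can be audited later.
    kept, rejected = [], []
    for species in species_list:
        if species["category"] == "EX":
            rejected.append((species["id_no"], "extinct"))
        elif not species["ranges"]:
            rejected.append((species["id_no"], "no range data"))
        else:
            kept.append(species)
    return kept, rejected

def write_report(rejected, path):
    # Written out alongside the real results, so you can see at a glance
    # which species never made it into the pipeline and why.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id_no", "reason"])
        writer.writerows(rejected)
```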
Sys admin
I managed to find a window in which to do some machine maintenance. Our compute machines are fairly busy, and if people have paper deadlines it can be a bit fraught trying to get last-minute results, so I try not to disrupt people too much, but at some point one does need to do updates etc. This time my goals were getting all our compute servers up to date and installing GDAL 3.10 on all of them. Normally I just let the machines run the GDAL that comes with the Linux distribution, as it's rare there's a feature or bug fix in GDAL that would warrant the extra hassle (for me) and disruption (to others) of maintaining it myself. But with 3.10 they seem to have finally put to rest the issues that plagued us with the Python bindings never installing properly via pip. This was such a problem for us that I'd ended up writing a wrapper that behind the scenes ran Python in a container with GDAL and the Python bindings installed, so as to not have our ecologists fighting pip all the time. With luck I can retire that now.
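The wrapper itself isn't very interesting, but the post-upgrade sanity check on each machine is at least pleasingly simple now:

```python
# If pip built the bindings against the system GDAL correctly, this
# imports cleanly and reports the new version.
from osgeo import gdal

gdal.UseExceptions()     # opt in to exceptions over silent error codes
print(gdal.__version__)  # expecting 3.10.x on the updated machines
```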
I'm also going through the annual wave of trying to educate users that although our machines have a lot of memory, it's not infinite, particularly given they're shared machines. The Linux Out-Of-Memory (OOM) behaviour is problematic: the system doesn't know, can't know, who is using "too much" memory, as that's subjective; it just knows there isn't any more. But it does need to free up memory in order to continue to operate, so it'll terminate the next process that asks for more memory to be allocated. Usually that process is the "greedy" one, and the person hogging the memory is the one impacted, but it could easily be someone else's job, or, about twice a year, a system service that means I need to manually power-cycle the server 🤦
Estimating the memory usage of a script you've written to process some data is a bit of a learned art, as the footprint can depend on the language, the data types, the algorithm being used and so forth (and at the extreme is just a variation on the halting problem). New users tend to just go for it and hope for the best, which invariably leads to the OOM handler being invoked, and then me having to go explain all the above and ask them not to roll the dice on which process gets terminated by the OOM handler.
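To make it concrete, this is the sort of back-of-envelope sum I end up walking people through; the raster dimensions here are illustrative, not from any actual job:

```python
# A hypothetical global raster at 3 arc-second (~100m) resolution, the
# kind of thing that catches new users out.
width, height = 360 * 1200, 180 * 1200  # pixels for the whole globe
bytes_per_pixel = 8                     # float64, numpy's default
total = width * height * bytes_per_pixel
print(f"{total / 1024**3:.0f} GiB")     # ~695 GiB just to hold one band
```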
There are some things we could do here, like using cgroups to limit folk, but that doesn't allow for statistical multiplexing, which is important on machines like these. What I really want is a way to suspend processes and swap them wholesale out to disk under high load, but that doesn't exist under Linux, at least not readily. We could also push for something more structured in terms of job scheduling, but I think this happens infrequently enough that there's a general benefit to having the user pool self-schedule between themselves rather than something more heavyweight. And to be brutally honest, dealing with this isn't something I really want taking up my time, as I do system administration out of necessity rather than joy/interest.
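One lightweight half-measure, short of cgroups, is users capping their own processes so a runaway job fails cleanly in their own lap; a minimal sketch, with the 64 GiB figure purely an example:

```python
import resource

# Self-imposed cap on this process's address space. An allocation beyond
# the cap raises MemoryError in this process, rather than leaving the
# kernel's OOM killer to pick a victim on the shared machine.
limit_bytes = 64 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```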
Plants
A collaborator for a new project that is kicking off shared with us a bulk load of plant species distribution model output, and I've been finding ways to summarise that data to understand what it contains: generating various species richness maps, which tell you for each pixel how many different species are present there, and endemism maps, which tell you how important a pixel is based on how rare the species in that pixel are.
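Both summaries are conceptually simple; a toy numpy sketch with random stand-in data, using the common "weighted endemism" formulation for the latter:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for per-species presence rasters: True where a species occurs.
presence = rng.random((5, 4, 4)) > 0.6  # (n_species, height, width)

# Species richness: how many species are present in each pixel.
richness = presence.sum(axis=0)

# Weighted endemism: each species contributes 1/range_size to the pixels
# it occupies, so range-restricted (rarer) species count for more.
range_sizes = presence.sum(axis=(1, 2))  # pixels occupied per species
weights = np.where(range_sizes > 0, 1.0 / range_sizes, 0.0)
endemism = (presence * weights[:, None, None]).sum(axis=0)
```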
I'd like to have shared some pictures of the data here, but because it's from a collaborator I can't really do that, which is sad. It's a kind of odd quirk of the domain that all the software I write is open, but most of the "open" data I work with actually isn't "open open", and so I have to be careful with what I share here, which is a bit frustrating.
I also need to get my head around some of the SDM process, as it seems for plants this is a more common method for reasoning about where they are versus the Area-of-Habitat model I'm more familiar with for animals.
Scrappy Fiddles
The plant work finally made me merge the work-in-progress 1.0 branch for Yirgacheffe. The major part of that work I wrote about a few weeks ago, whereby Yirgacheffe can now automatically work out how to UNION/INTERSECT rasters of different sizes, which removes a lot of boilerplate code when working with rasters. I had planned to tidy up some other bits of the API for 1.0, given it feels like an important milestone, but that work was just so useful I was already writing new code based on it, so I just merged the code to main, and other improvements will need to wait for 1.1 or some other magic number.
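To give a flavour of what that buys you — treat this as an illustrative sketch rather than a faithful copy of the 1.0 API (see the Yirgacheffe repo for the real thing):

```python
from yirgacheffe.layers import RasterLayer

# Two rasters covering different extents can be combined directly, with
# the union/intersection of their areas worked out automatically, rather
# than the caller having to align windows by hand first.
habitat = RasterLayer.layer_from_file("habitat.tif")
elevation = RasterLayer.layer_from_file("elevation.tif")

result = RasterLayer.empty_raster_layer_like(habitat, filename="result.tif")
calc = habitat * elevation
calc.save(result)
```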
There's always a danger of holding out for perfect with things you're building, but if they already make things better than they were before, it's generally best to get them out there. The fact that the simpler API was sufficiently nicer that I'd stopped coding against the old one was a good signal that the new one was enough of an improvement to warrant release!
SPUN
I had a chat with some folk from SPUN about species threat modelling, and whether there's any overlap between what we're doing here, writing tools to support the LIFE and STAR metrics that model threats to animals, and what they're trying to do modelling threats to fungi, but I suspect I was not the person they really needed to speak to. I'd initiated the conversation after Toby Kiers gave a talk in the department, as I wondered if there was anything in our code that they might be able to leverage in their work, but when speaking to the SPUN folk I think they were more interested in the LIFE metric process, and so I need to connect them up with the ecologists from the project.
Next week
Go and WASM toolchains
One of the part II students I'm supervising is working with Go to build some WASM components for visualising GeoTIFFs, but has spent a bunch of time fighting the Go toolchains, so I want to just do some testing around that so I can unblock them.
LIFE
We had a meeting this week about future LIFE metric work, and so I need to write up bits of that, as we did sensibly set a deadline for some progress updates in a month!
STAR
I still owe Simon Tarr a way to run the AoH pipeline via Docker that I said I'd do a few weeks ago, and I need to follow up with Chess on our differences in species filtering - though hopefully an easier thing to track with my new reporting output.
Tags: life, duckdb, yirgacheffe, linux