Weeknotes: 14th July 2025
This week
Yirgacheffe
The short paper on the design and use of Yirgacheffe was submitted to PROPL on time, but not without a little stress at the end, which is the downside of paper deadlines: something always turns up that makes them a rush, even if you felt you had things mostly in hand the week before.
Context: for those who haven't seen it before, one of the main features of Yirgacheffe is that you can specify numerical operations directly on geospatial datasets: you can add/multiply/filter these large rasters or polygons directly, and it'll do all the bookkeeping about aligning pixels, rasterizing polygons, etc. At the end you either save the result to another raster layer, or you perform some aggregation like summing all the pixels or finding the min/max.
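That deferred-evaluation style can be sketched in plain Python (a toy illustration only, not Yirgacheffe's actual implementation; all class and method names here are invented): operators build up an expression tree, and nothing is computed until you aggregate.

```python
import operator

class Layer:
    """Toy stand-in for a raster layer: a flat list of pixel values."""
    def __init__(self, pixels):
        self.pixels = list(pixels)

    def _combine(self, other, op):
        # Return a deferred operation rather than computing immediately.
        return LazyOp(self, other, op)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def read(self):
        return self.pixels

class LazyOp(Layer):
    """Defers the arithmetic until the result is saved or aggregated."""
    def __init__(self, lhs, rhs, op):
        self.lhs, self.rhs, self.op = lhs, rhs, op

    def read(self):
        # Only here, at save/aggregation time, does work actually happen.
        return [self.op(a, b) for a, b in zip(self.lhs.read(), self.rhs.read())]

    def sum(self):
        return sum(self.read())

a = Layer([1, 2, 3])
b = Layer([10, 20, 30])
result = (a + b) * a   # nothing computed yet, just an expression tree
print(result.sum())    # -> 11*1 + 22*2 + 33*3 = 154
```

The real library additionally handles pixel alignment and windowed reads so the full rasters never need to be in memory at once.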
One of the less-used features of Yirgacheffe, at least by me, is that when doing that save or aggregation, it can attempt to do so in parallel using multiple CPU cores. Normally the pipelines I work on don't use this feature, as they tend towards data flows that work better if I run the same script many times in parallel, rather than one script that does everything within it. Partly this is down to Python being generally poor at parallelism, but mostly it's down to the data flows: when processing thousands of area of habitat (AoH) calculations at a time, it's just easier to run the AoH script once per species, and use an external tool like GNU Parallel or Littlejohn to orchestrate that.
But there are times when you just want one script to do some calculation on a big raster as fast as possible, and for that I added the option to use multiple cores for the calculations. Internally you can imagine Yirgacheffe breaks down each calculation into, say, rows of pixels and does them one at a time to avoid having to load too much data into memory, so it's a small logical leap to say we'll do several of those rows at a time in parallel, as they're independent of each other. Yirgacheffe doesn't try to do anything very clever here, but when I benchmarked the feature it performed much worse than I'd expected, actually being several times slower than just using a single thread in some instances, one being over 6 times slower!
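The chunk-at-a-time idea can be sketched as below (a hypothetical illustration, not Yirgacheffe's code; the chunk size and function names are invented, and given Python's GIL the real thing would want worker processes rather than threads for CPU-bound work):

```python
from concurrent.futures import ThreadPoolExecutor

# Invented constant: how many rows of pixels go into one work unit.
CHUNK_ROWS = 2

def process_chunk(rows):
    # Stand-in for the per-chunk raster calculation (here: double and sum).
    return sum(value * 2 for row in rows for value in row)

def parallel_total(raster, workers=4):
    # Split the raster into independent chunks of rows...
    chunks = [raster[i:i + CHUNK_ROWS] for i in range(0, len(raster), CHUNK_ROWS)]
    # ...process them concurrently, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

raster = [[1, 2], [3, 4], [5, 6]]
print(parallel_total(raster))  # -> 42
```

The chunks are independent, so the only coordination needed is combining the partial results at the end, which is what makes this parallelisation look so cheap on paper.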
My test case was processing 277 different species AoHs. I did specifically go for a mix of ranges, but species ranges do tend to skew small, so most runs don't process much data. Whilst I said above you could imagine Yirgacheffe processes a row of pixel data at a time, it actually does larger chunks than that: partly to get better disk behaviour, and partly because polygon rasterization works very poorly at that scale, as it still has to process the entire polygon each time you want to rasterize a small chunk of it, and for species with ranges defined by detailed coastlines that can be a lot of data.
So I realised that for many small species Yirgacheffe was processing a single chunk of data, and if I set the parallelize flag it was still trying to do that work on a worker thread, which in Python is quite expensive to set up. So I added some checks for whether parallelism was actually needed: if the calculation was just one chunk of data, it now reverts to the single-threaded code path.
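The check amounts to something like this (a hypothetical sketch with invented names, not the actual Yirgacheffe code):

```python
from concurrent.futures import ThreadPoolExecutor

def compute(chunks, use_parallel):
    """Pick between the parallel and serial code paths for a calculation.

    With a single chunk, spinning up workers costs more than it saves, so
    we take the single-threaded path even if parallelism was requested.
    """
    if not use_parallel or len(chunks) <= 1:
        # Serial path: no worker setup overhead.
        return [sum(chunk) for chunk in chunks]
    # Parallel path: only worth it when there are several chunks.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sum, chunks))

print(compute([[1, 2, 3]], use_parallel=True))      # -> [6], via the serial path
print(compute([[1, 2], [3, 4]], use_parallel=True)) # -> [3, 7], via the pool
```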
This still isn't great: quite a few instances remain slower than single-threaded, but it did bring the mean runtime down to less than a third of the original, with the best case taking around 12% of the original run time.
The overhead of processing one chunk like this did make me wonder about how I was defining the chunk size, and whether I should revisit the current default work-unit size. I played a little with reducing it to encourage more parallelism, but that only seemed to make things worse as the rasterization overheads kicked in, and given the paper deadline I didn't really have the time to explore that space, nor to work out how to automatically infer a reasonable value, so I had to park that. I also tried another, larger dataset, processing all 1,600-odd mammals from the STAR metric, and this also gave me mixed results performance-wise, and I didn't have time to dig into that: I assume the species' range distribution was different from my normal test sample set.
Ultimately, on average the parallel save feature in Yirgacheffe does better than not having it, but it's pretty poor given how many CPU cores it can use, and so overall I'm left quite unhappy with the feature. I feel that, even allowing for Python-related problems, something better could be done, but there was no time to look before the deadline passed 😞
It's not like this was even a critical part of the paper's narrative, and it isn't a feature I use that much, but the process made me realise there's something going wrong that I don't understand, and I don't have time to figure it out, and that is deeply frustrating.
LIFE
I started generating a new LIFE run using the latest RedList update from 2025. All the LIFE paper work was done with RedList data from when the project started in 2023, and there's now a 2025 update out, so we want to publish updated layers. I did a visual inspection of the new maps; there are some differences, particularly around amphibians, but they generally look good. I've passed them over to Alison, who as a zoologist is actually capable of interpreting the results properly.
Whilst doing this I'm also doing a little modernisation of the code, and changing the default results you get when you use the script that comes with the repo so that it just runs things we're still interested in, rather than everything that was in the original LIFE paper.
Claudius
Shreya, the Outreachy intern working on Claudius, has been working for the last few weeks on getting a feature to record animations out to an animated-GIF file, and that's now merged. I'd include an example here, but my self-written website publishing tool doesn't have a way to let me include it, so I'll try to fix that for next week 🤦 We also made some progress towards getting Claudius into opam, as I got the OCaml-GIF library that it depends on (and which we maintain) into opam.
The next challenge will be getting Claudius itself in, as the obvious paths don't quite work due to Claudius using a submodule to add a resource dependency. Specifically, GitHub releases don't include submodules in the generated tarball, which means Claudius won't build from a GitHub release, unfortunately, and that's how I did the release for the GIF library.
3D-Printing maps
UROP student Finley started, and impressed me by very quickly getting up and running generating models for 3D printing from digital elevation maps:

Finley is going to try to write up some weeknotes, so I'll link to those here as and when, rather than spoil his work, but I'm super excited about what we might get done this summer. I was working out of DoES Liverpool for part of last week, where I spotted this lovely CNC-routed landscape, and I must resist trying to derail this project into even more time-consuming construction methods :)

I did find out the computer lab has some Prusa 3D-printers, so hopefully Finley and I can get trained on those.
Next week
- Make sure we have everything we need for the next LIFE manuscript ready for zenodo.
- Get some of Finley's results 3D-printed and try to get him set up to print on his own.
- Try to schedule a meeting on AoH validation with interested peeps. This was discussed around the IUCN workshop a few weeks back, and I need to arrange it before people vanish for summer holidays (myself included).
- Look into TESSERA if there's any free time.
Tags: yirgacheffe, life, propl, opam, 3D-printing, claudius