Weeknotes: 1st December 2025

Last Week

STAR Workshop

I spent two days in a mini STAR workshop with Chess Ridley and Louise Mair from Newcastle Uni and Simon Tarr and Thomas Starnes from the IUCN.

A photo of five people sitting together in a small meeting room, smiling at the camera, with a large whiteboard covered in diagrams and notes visible on the wall behind them and a desk with papers in the foreground.

We spent two days locked (not really) in a room in the DAB talking about how we'd implement a unified STAR pipeline that deals not just with terrestrial species, as in the original paper (and as my current implementation does), but also brings in other work for both fresh-water and marine species. There are STAR methods for these already documented and implemented by others, but the question we were discussing was how we could build one pipeline that would process all three realms given the differences in input resolutions (e.g., for terrestrial we have land cover at 100m resolution, but for marine the equivalent dataset is 5km), and given that some datasets spatially overlap (e.g., the fresh-water analysis uses more accurate wetlands datasets that overlap with the terrestrial land cover maps).

Thankfully we're not starting from scratch here: back in May 2024 (during a hiatus from weeknotes, so no link) I attended a STAR workshop up in Newcastle where, as part of a larger group, we agreed we'd want to get to this place. As a result my AOH code already supports the idea that land cover isn't just the 2D space it is assumed to be for terrestrial species (caves being mostly ignored by the biodiversity pipelines I'm working on): for marine species there are many layers of habitat, so we need to allow for overlapping land classes. After that meeting I reworked both my STAR implementation and the LIFE pipeline to be ready for such futures.
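To make the overlap point concrete, here's a toy sketch (my own illustration, not the real AOH data model or its class codes) of the difference between a terrestrial pixel, which holds a single land-cover class, and a marine pixel, where habitat classes can co-occur across depth layers:

```python
# Toy illustration only: the class codes and data structures here are
# made up, not those of the real AOH pipeline.

def is_suitable(pixel, suitable_classes):
    """Does this pixel contain suitable habitat for a species?"""
    if isinstance(pixel, set):
        # Marine-style pixel: several habitat classes can overlap,
        # so any intersection counts as suitable.
        return bool(pixel & suitable_classes)
    # Terrestrial-style pixel: exactly one land-cover class.
    return pixel in suitable_classes

terrestrial_pixel = 14                # a single land-cover code
marine_pixel = {"kelp", "seagrass"}   # overlapping habitat layers

print(is_suitable(terrestrial_pixel, {14, 15}))   # True
print(is_suitable(marine_pixel, {"seagrass"}))    # True
```

The point of the sketch is just that once a pixel can carry more than one class, every downstream operation has to cope with sets rather than single values.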

This time, however, the key was that we spent time sweating the details: drawing up tables of all the possible inputs and outputs we needed, where we're getting them from, which things we can't readily reproduce due to differences in the IUCN datasets from various endpoints, at what stage they'd get processed and aligned, and so on. It's interestingly interdisciplinary work: often there are reasons to do things a certain way due to limitations of data or compute, at other times you really need to do things a way that doesn't suit those limitations because it'd impact the actual science being worked on, and sometimes it's not clear which wins. To that end I very much enjoyed the process of working with Chess and Louise, who are experts in ecology (Chess also has a lot of pipeline-building expertise), while I still get to contribute as a computer-scientist type.

Over the two days I think we made some good progress, and we plan to sync up again virtually in a few weeks to attempt to thrash out some of the remaining details. Before then I need to write down what I thought we agreed, so I don't rejoin the discussion and waste time recapping what we did before. I also want to see if I can assess some of the inputs we talked about where there were multiple options: I want to get into the habit of using the little automated assessment and validation we have for these pipelines (all my pipelines, not just STAR) to let us do quantitative "what if" testing, rather than a more qualitative manual look at the results to see if they seem correct. I feel this is going to be a key push going into 2026, as we've seen this last month or so while I've worked on evaluating integrating the GAEZ/HYDE data into the Jung habitat map, and in the coming months I plan to look at how things like the Tessera foundation model can be applied to generate better habitat maps.

Yirgacheffe

I mentioned last week that I was dealing with an issue Tom Ball had found with one of the maps we'd been generating, which I tracked down to how I was trying to resolve alignment between slightly unaligned layers. Handling this is one of the key things I use Yirgacheffe for, but the approach I originally started with dealt with issues like this a bit at a time, evolving as I hit each new case. Rather than pile another patch on top, over the last week I've effectively rebuilt how Yirgacheffe deals with geospatial alignment of rasters and polygons.

Ultimately, if you have multiple rasters at the same pixel scale but on grids with slightly different origins, the answer is to apply a nearest-neighbour approach when selecting pixels to compute against. This also extends to working out how to do a union or intersection operation on multiple rasters: always prioritise pixel accuracy when you have pixel data, whereas before I tended to work in a more "geospatially pure" way and derive the pixel mappings at the end. All this sounds somewhat navel-gazing, but it very much matters when trying to throw together datasets from many different sources, as in practice these rarely share a common origin, particularly if they've been re-projected somewhere along the way.
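As a rough sketch of the nearest-neighbour idea (my own illustration, not Yirgacheffe's actual API): when two grids share a pixel scale but their origins differ by a sub-pixel amount, you round the offset to the nearest whole pixel rather than truncating, so the small misalignment is absorbed instead of shifting one layer by an entire pixel:

```python
def pixel_offset(origin: float, ref_origin: float, pixel_size: float) -> int:
    """Whole-pixel offset of one grid's origin from a reference grid,
    rounded to the nearest pixel. Hypothetical helper for illustration."""
    return round((origin - ref_origin) / pixel_size)

# Two 100m rasters whose origins disagree by 3cm, e.g. after re-projection:
print(pixel_offset(-49.97, -50.0, 100.0))  # 0: sub-pixel drift absorbed
# A genuine one-pixel difference still shows up as one:
print(pixel_offset(50.0, -50.0, 100.0))    # 1
```

Truncating instead of rounding here would map the first case to either 0 or -1 depending on the sign of the drift, which is exactly the sort of off-by-one-pixel misalignment that shows up as artefacts in combined maps.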

It's also the kind of thing I am bad at keeping in my head, which is by and large why Yirgacheffe exists: I made it not because I'm good at this and wanted everyone to benefit from my awesomeness, but rather because I'm bad at this, so I want all the logic in one place where I can write it once, test the hell out of it to ensure it's as robust as I can make it, and then get on with building the more interesting things on top of it. When I do things like this I end up with lots of bits of paper everywhere as I try to explain to myself how the math needs to work:

A photo showing various aligned boxes scribbled in pencil on two sheets of paper.

As a corollary to that, over the last week the number of unit tests in Yirgacheffe went from 835 to 926, as I both ensure the original problem is captured in the tests and add more cases to build confidence while I rework what is a key part of what Yirgacheffe is. Actually, that number effectively doubles, as I run them all for both the CPU and GPU backends.
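The doubling comes from running every case under each backend, something along the lines of this illustrative pattern (not Yirgacheffe's actual test code; the cases and backend names are stand-ins):

```python
BACKENDS = ["cpu", "gpu"]

def run_case(case, backend):
    # Stand-in for executing one unit test on a given backend; in a real
    # suite the backend would pick which array library does the work.
    expected, op, args = case
    return op(*args) == expected

CASES = [
    (0, round, (0.0003,)),  # sub-pixel offset rounds to zero pixels
    (1, round, (0.9997,)),  # near-whole offset rounds up to one pixel
]

results = [run_case(case, backend) for backend in BACKENDS for case in CASES]
assert all(results)
print(len(results))  # 4 checks: 2 cases x 2 backends
```

Parameterising the whole suite this way catches any case where the two backends quietly disagree on the same input.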

Anyway, the end result is that I now have things done, but before I release it I want to do a full pipeline run on one of the biodiversity metrics I've built upon it, to ensure I've not introduced any regressions with this update. A lot of people write papers based on pipelines run on my code, and so I feel a burden to ensure I'm not letting them down, despite the 926 tests that say otherwise. Similarly, I've started to look into which of the data sources I regularly use have permissive licenses, so I can add some integration-style testing to Yirgacheffe that processes real data, rather than just the unit tests I have so far, which are all based on synthetic data.

This week

  • Wrap up this Yirgacheffe work and release 1.11
  • Catch up with Shane and Ali
  • Prep for a meeting the following week with various people about AOH validation

Tags: yirgacheffe, star