Weeknotes: Work LIFE balance
22 Apr 2026
I've had quite a frustrating time trying to land a set of changes to the LIFE pipeline. I don't want this post to be a rant, but I also believe in documenting when things don't work as well as when they do, so I'll try to keep this concise :)
What was I trying to do?
Last year we did some work looking at how to improve the Jung et al habitat map we use in the LIFE extinction risk metric. The Jung et al map notably under-represents farmland, and in his paper using LIFE to look at the impact of food production on extinction rates, Tom Ball created a hybrid map that attempted to address this under-representation by integrating data from the UN's Global Agro-Ecological Zones (GAEZ) and the HYDE anthropogenic land use estimates.
Then in autumn last year we as a group looked at how we might do an improved version of this, and I did some of the implementation work. Whilst overall I liked the approach, one thing I wanted to improve was how we redistributed the updated habitats. Both the GAEZ and HYDE datasets are much coarser than the 100m per pixel Jung map. This means we know that, say, in a 10x10km area we need to increase farmland by 50%, but how do we know which 100x100m pixels to update to achieve that? In the original FOOD paper, Tom did the obvious thing: after selecting suitable candidate cells, he did a random redistribution. This is defensible, I think, but not ideal, because it means the distribution depends on the random seed used, and if you're worried about the extinction of rare species with small ranges, you might erase them based on the roll of a die.
We debated this in the team, and I generated some data to show that whilst the impact was small, it wasn't zero. That same data did, however, show that the results we generated with these hybrid maps were overall much better than Jung alone (based on the Dahal et al validation process for Area of Habitat maps), and so we were motivated to find a way forward. In the end Andrew Balmford suggested that rather than change individual pixels until we had enough, we'd make a proportional fractional change to all eligible pixels: that is to say, we never wholly replace any pixel. If we have ten eligible pixels to swap to farming and need 5 pixels of farming, each of the ten pixels becomes 50% what it was before and 50% farming. The result is then stable. Given the LIFE pipeline already works with fractional maps, this seemed the way forward; I implemented an initial run through of this, and we were happy with the results.
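To make the arithmetic concrete, here's a toy sketch of the proportional approach (the function name and data layout are mine, not the pipeline's). Every eligible pixel gives up the same share, so there's no random seed involved and the output is fully deterministic:

```python
def fractional_redistribute(eligible, needed):
    """Toy sketch of proportional fractional redistribution, not the
    actual LIFE pipeline code.

    `eligible` maps pixel id -> fraction of that pixel holding habitat
    we're allowed to convert; `needed` is the total pixel-area of
    farmland to add. Every eligible pixel gives up the same share.
    """
    total = sum(eligible.values())
    if needed > total:
        raise ValueError("not enough eligible habitat to convert")
    share = needed / total
    # Each pixel becomes part original habitat, part farmland.
    return {
        pid: (frac * (1 - share), frac * share)
        for pid, frac in eligible.items()
    }
```

With ten fully-eligible pixels and five pixels' worth of farmland needed, each pixel comes back as half original habitat, half farmland, matching the worked example above.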
However, one downside is that making these proportional fractional maps is a lot more expensive than just random sampling. The LIFE pipeline now does a lot more work at the full 100m per pixel resolution of the input data before moving down to the 5 arc-second resolution we use for AOHs and beyond. That said, in theory we should rarely need to update these maps once we're happy with them: instead we could just publish the layers on Zenodo and call it done. If it takes me the better part of a week to run, that's okay, as it's a one-off cost.
What else was I trying to do?
Like most data-science pipelines LIFE doesn't have a single Python script that does all the work: it has a couple of dozen small scripts and also calls a bunch of other tools like GDAL to do bits and pieces. I find decomposing the pipeline in this way works best for my ecologist colleagues as it means when they want to tweak the method to test a new theory, they can change just the one bit they're interested in without wading through all the others.
Initially, when we started LIFE, I was working with Patrick Ferris on a prototype pipeline execution tool called Shark. Whilst that was fun and I really like what we did, I didn't want to saddle my ecology colleagues with a half-written data-pipeline tool, and so in the end I replaced it with a shell script. Shell scripts are bad for this sort of thing, but I chose one on pragmatic grounds: the data-science trained ecologists I work with know how to modify a shell script, and I didn't want to burden them with, say, the sort of complex Makefile that I as a computer scientist would turn to.
This was fine, but it didn't scale. I first hit this in my implementation of the IUCN's STAR metric. The STAR and LIFE pipelines share a lot of similarities under the hood, and so my implementations of both have a lot of similar patterns. I was helping someone at the IUCN set up my pipeline on their systems, and it was clear that the shell script just wasn't cutting it in terms of avoiding unnecessary re-execution of stages that had already been run.
So, whilst it's not my favourite tool, I ported STAR to run on Snakemake. This is a Python-centric build system that handles proper dependency analysis. I didn't really enjoy the experience, as I find Snakemake a bit more invasive than I want my build tools to be: for instance, it injects itself into your Python scripts and expects them to get their arguments from a global snakemake object - but that then means you can't run your code any other way! I found my way around this (thanks to Cade Mirch's snakemake-argparse-bridge), but it sets the overall tone for how Snakemake works.
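The snakemake-argparse-bridge package handles this properly; the shape of the workaround can be sketched by hand like this (the argument names are hypothetical). When Snakemake runs a script via its `script:` directive it injects a global `snakemake` object carrying the rule's inputs and outputs, so the trick is to fall back to argparse when that object is absent:

```python
import argparse


def get_paths(argv=None):
    """Return (input, output) paths whether run by Snakemake or standalone."""
    # Snakemake's `script:` directive injects a global `snakemake` object
    # with the rule's inputs and outputs attached.
    if "snakemake" in globals():
        smk = globals()["snakemake"]
        return str(smk.input[0]), str(smk.output[0])
    # Otherwise fall back to ordinary command-line arguments, so the same
    # script can still be run (and tested) on its own.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return args.input, args.output
```

Run outside Snakemake, the script behaves like any other command-line tool.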
To be fair to Snakemake, I think all build systems ultimately encourage you to do things to suit them, and I'm just used to the set of compromises that make old-school tools like GNU Make happy. For instance, in both the STAR and LIFE pipelines, when I read the IUCN Redlist for species data, I generate an individual GeoJSON file per species, rather than what I see more commonly done, which is to write all the species data into a single GeoPackage file. Why do I generate thousands of individual species files rather than one single, easy-to-move-around file? It's because I know that in the annual Redlist updates only a small number of species will have been reassessed, and tools like GNU Make or Shark or even Snakemake will know to regenerate downstream data just for the few GeoJSON files that have updated. If all I had was a GeoPackage file with one entry in it updated, most build tools would need to recompute everything.
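As a sketch of why this pays off, a Snakemake wildcard rule keyed on per-species files (the file and script names here are made up) only re-runs for the species whose GeoJSON actually changed after a Redlist update:

```python
rule species_aoh:
    input:
        species="species/{species_id}.geojson",
        habitat="habitat/current.tif"
    output:
        "aohs/{species_id}.tif"
    shell:
        "python calculate_aoh.py --species {input.species} "
        "--habitat {input.habitat} --output {output}"
```

Because each species has its own input file with its own timestamp, the dependency graph is fine-grained for free; a single monolithic GeoPackage would make every AOH appear stale at once.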
Which is a long way of saying I'm set in my ways, so I assumed perhaps it's not Snakemake, it's me.
So what's gone wrong?
Well, I remember a wise man once telling me about the dangers of innovating on too many fronts, but it's an easy trap to fall into: I tried both to solidify this big update to the LIFE pipeline and to migrate it to Snakemake. Because LIFE is now so much slower to run, due to the preparation of the habitat maps being slower, it takes a lot longer to re-run once I realise I've made a mistake. At the same time, I keep having to tweak the code to make it Snakemake-compatible. I mentioned the arguments thing above, but there's more to it than that. For instance, Snakemake doesn't cope well when you have a stage that generates a lot of files, but not in a way that is obviously linked to the inputs (e.g., maybe you're filtering data). For that I now need to generate a marker or sentinel file that Snakemake can use to track completion of such jobs. Because I was using a more rigid shell script in the past, I got away without these, as I coded the logic explicitly; but now I'm using a "proper" general-purpose tool, I have a bunch of gaps to fill. The LIFE pipeline is computationally more complex in its later stages than STAR is, so I didn't need to do so much hand-holding for my STAR port to Snakemake.
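Snakemake's built-in `touch()` output flag is the standard way to express this sentinel pattern; a sketch with made-up rule and file names:

```python
rule filter_species:
    input:
        "redlist/all_species.csv"
    output:
        # The script writes an unpredictable set of per-species files, so
        # we give Snakemake a single sentinel file to track instead.
        touch("species/.filtering.done")
    shell:
        "python filter_species.py --input {input} --outdir species/"
```

Downstream rules then depend on the sentinel rather than on the files themselves, which is exactly the kind of indirection a hand-written shell script never needed.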
Another challenge is how Snakemake orchestrates the pipeline, which is much cleverer than my shell script, but means I need to think about what it's going to try to do before it does it. What I mean by this is that Snakemake looks at the entire pipeline description, builds up a graph of what needs doing, and feels free to run those stages in any order so long as the dependencies between stages are honoured. Sometimes it's just amusing how differently it thinks from me: it'll do the entire pipeline for birds before doing the entire pipeline for reptiles, etc., whereas I'd think of doing input processing for all taxa together, then habitat maps for all taxa, and so on. That isn't a problem per se, but it shows you that it can think differently. The challenge comes when the resource usage of stages means you can't run them at the same time: this re-arrangement (and non-predictability) of the execution order means I've had more machine reboots in these last two weeks than I've had since I started working on the project, because I failed to spot places where Snakemake might re-order things and cause two jobs that happen to need all the RAM at the same time to run simultaneously.
Thankfully there is a kludge you can put into Snakemake to somewhat control this, by indicating that a stage needs all CPU cores, but it's not ideal. Then again, very few tools get this right, so it's hard to blame Snakemake; but my naive shell script, with its linear flow, made it much simpler to reason about and prevent such issues.
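The kludge in question is declaring that a stage consumes all available cores, so Snakemake can't schedule anything else alongside it; you can also declare memory via the `resources` directive. A sketch with hypothetical names and numbers:

```python
rule build_habitat_maps:
    input:
        "habitat/base_map.tif"
    output:
        "habitat/hybrid_map.tif"
    # Claim every core so nothing runs concurrently with this memory-hungry
    # stage; workflow.cores is whatever --cores was given on the command line.
    threads: workflow.cores
    resources:
        mem_mb=200_000  # made-up figure; lets --resources cap RAM use too
    shell:
        "python build_hybrid_map.py --input {input} --output {output}"
```

It works, but it's expressing "don't run these together" indirectly through CPU accounting, which is why it still feels like a workaround rather than a fix.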
Finally, I've had real issues with Yirgacheffe and MLX that also got in the way and caused failures when I already had enough. It never rains but it pours as they say.
The work has an unfortunate cadence: because the pipeline is now much slower to run, it's also slower to fall over when it turns out I've misunderstood how to wire up Snakemake, or I have a typo somewhere. This is all on me, obviously, but it's also how I work: I try to arm myself with type checkers and other tools to catch errors early, but with this pipeline things just do take time. So I try to fill that time with other work, but I never make much progress as I keep having to switch back, and so now I'm doing a bad job at two things. Had I not had a fun side-quest then I suspect I'd have gone spare.
The final frustration is that no one else cares. The ecologists don't care about this sort of thing, as they don't run the pipeline; they have me to do that. My computer scientist colleagues aren't interested, as this just isn't building new things, unlike when I'm, say, developing Yirgacheffe. I don't blame anyone for this - it really is quite boring stuff :) As I used to say to my team members when they'd complain about boring tasks back when I was a manager: there's a reason it's paid work, you'd not want to do the dull bits otherwise.
Has anything gone right?
It's not all doom and gloom. Moving to Snakemake meant I got rid of some custom tooling (the venerable littlejohn is finally put out to pasture), so the pipeline depends less on things not used outside the immediate group, which helps with overall uptake and reproduction of results. I was pleasantly surprised to discover that although Snakemake is Python-centric, it will run R scripts just as smoothly as Python ones, similarly injecting itself to map arguments and so on. I'm not a huge R user, but I see the fact it can use both popular data-science scripting languages equally as a huge win, as in my experience most teams have a mix of Python and R expertise unevenly distributed within them.
So what now?
Thankfully, I'm slowly getting there with this particular port, at which point I think I'll need to take a break from this project for a bit to recharge. But what can I learn from this in terms of the broader development of methods?
1: Reconsider parallelism strategies
Internal versus external parallelism - that is, whether you have one script that does parallelism internally, versus running one script many times using an external tool like Snakemake or littlejohn - is something I don't have a hard rule for generally. But I think now I'd favour doing more internal parallelism, and push that into Yirgacheffe's domain. This aligns, I think, with work Anil Madhavapeddy has been doing moving to file formats like Zarr. If I was starting from scratch I'd use a single GeoPackage for all species data, and let Yirgacheffe run AOH calculations across all layers within that GeoPackage to build a single Zarr or multi-band GeoTIFF with all the AOHs in. This keeps the complexity in a single domain where I can reason about it, rather than mixing approaches and then having Snakemake accidentally align stages that use different ones, which causes my computer to explode. This aligns well with an old ticket I made for myself a while ago.
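As a generic sketch of what "internal parallelism" means here (the function names are mine, and real code would be doing raster work rather than arithmetic): one script owns the whole fan-out, so it can size its worker pool against the machine, instead of hoping an external orchestrator never co-schedules two hungry stages.

```python
from concurrent.futures import ThreadPoolExecutor


def calculate_aoh(species_id):
    """Hypothetical stand-in for the per-species AOH calculation."""
    return species_id, species_id * 2  # placeholder result


def build_all_aohs(species_ids, workers=4):
    # The script controls its own parallelism, so resource usage is decided
    # in one place rather than emerging from an external scheduler's
    # reordering of independent jobs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(calculate_aoh, species_ids))
```

With the fan-out internal, capping memory use is a matter of picking `workers`, not of second-guessing a dependency graph.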
Even within a single stage, Yirgacheffe has very naive tricks around memory optimisation, using fixed-size chunks that are either inefficient or too greedy depending on the scenario. Generally it tries to be very conservative, because it doesn't know which model you're running: it has to assume the worst case, that you're running many similar scripts at once, and so it should be frugal because there are many copies of it doing the same thing. By moving to more internal parallelism I can be more proactive in managing memory.
2: Fail fast, fail often
A pipeline stage needs to fail fast, and the tooling should support that. I need ways to find out if a slow stage will fail not after it's done expensive computation, but before. Yirgacheffe's lazy evaluation helps a lot here, as more of the outer Python code gets executed before any significant computation is done, but I wonder how much further I can push that? Can I move looping into Yirgacheffe, for instance? At the moment if you loop in Python to create a bunch of rasters, each loop iteration must do its computation before the next can start and potentially fail. What's going to be the idiomatic way to defend against that?
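The fail-fast benefit of lazy evaluation can be shown with a toy expression class (this illustrates the principle only, and is not Yirgacheffe's actual API): structural errors surface while the expression is being built, before any per-pixel work runs.

```python
class LazyRaster:
    """Toy lazy raster expression; not Yirgacheffe's real API."""

    def __init__(self, shape, compute):
        self.shape = shape
        self._compute = compute  # deferred, potentially expensive work

    def __mul__(self, other):
        # Structural checks run at expression-build time, so a mismatch
        # fails in seconds rather than after hours of computation.
        if self.shape != other.shape:
            raise ValueError(f"shape mismatch: {self.shape} vs {other.shape}")
        return LazyRaster(
            self.shape,
            lambda: [a * b for a, b in zip(self._compute(), other._compute())],
        )

    def save(self):
        # Only at save time does the deferred computation actually run.
        return self._compute()
```

Building `a * b` from two mismatched layers raises immediately; nothing expensive happens until `save()` is called.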
3: Remote working strategies
Don't get trapped doing all these things at once when no one else is around to share the load :) This is a social thing, but as I've now moved to even more remote working than usual (I've always worked from home, just now my home is further away from the rest of my team), I need to manage this better. I've only been further away for a month or so, and it was bad timing, but it's something I need to be more cognisant of now! I can't just grab a coffee with a team mate and get things like this off my chest, and no one wants a Zoom call where I rant about this, so I need to structure my work to always have space to make progress on a second task in a dedicated fashion when I need a break.