Weeknotes: 28th April 2025

Last Week

LIFE data spelunking

As mentioned in last week's notes, I needed to do some data-mining to compare/contrast two LIFE runs I'd done recently. All of which I did, though I did end up doing some of the data wrangling in Excel, so not ideal. I do miss the days when I knew gnuplot inside out for generating nice graphs.


One interesting observation from this was that floating point is, as ever, tricksy. The regular LIFE maps are based on input data at 100m per pixel at the equator, and for the second LIFE run I'd had to merge in some localised higher-resolution data at 30m per pixel at the equator. Because compute time is more available than my own time, rather than try to patch the high-resolution data in at 30m and leave everything else at 100m, I just upscaled all the data to 30m so I could run the existing code unmodified (the idea being that if we do more of this then I'll write the code to be more efficient at patching together mixed-resolution datasets).

In the results we were surprised to see that the localised changes to our maps led to global variance in the results; this surprised us as we didn't think that many species intersected with the updated area, and so I needed to don my data-hard-hat-with-lamp and go find out what was going on.

Anyway, it's somewhat predictable, but the upscaling of the data meant that even in parts of the map where the new data had no impact, I got slightly different results from the original LIFE run. Even in areas where the data previously went straight from 100m to 1.8km resolution, going 100m -> 30m -> 1.8km was enough to introduce rounding errors that then accumulated over 40k species to become noticeable.

Obviously I could have spent more time writing the code here, and if this work goes anywhere that's what I'll need to do, but it was yet another lesson (were one needed) that floating point numbers on computers are not to be trusted to perfectly reproduce the same results if you add stages that should otherwise cancel out.
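To make the failure mode concrete, here's a toy sketch, not the LIFE code itself: the 100/30 scale factors stand in for the resampling ratios involved. Scaling a pixel value up and back down should be a no-op on paper, but neither factor is exactly representable in float64, so the round trip doesn't cancel for many values, and the per-pixel drift shows up in totals.

```python
# Toy demonstration (not the LIFE pipeline): scaling pixel values by
# 100/30 and then by 30/100 should cancel exactly in real arithmetic,
# but neither factor is exactly representable as a float64.

up = 100.0 / 30.0    # upscale factor; 10/3 is not representable
down = 30.0 / 100.0  # downscale factor; 0.3 is not representable either

values = [i / 7.0 for i in range(1, 10_000)]
mismatches = sum(1 for v in values if (v * up) * down != v)
print(f"{mismatches} of {len(values)} values fail to round-trip")

# Accumulated over a whole raster, the per-pixel drift typically
# shows up as a small but non-zero difference in the totals too:
direct = sum(values)
via_30m = sum((v * up) * down for v in values)
print(direct - via_30m)
```

Some values do survive the round trip, which is exactly why the effect is so easy to miss in small tests and only becomes obvious once it's summed over tens of thousands of species rasters.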

Towards Yirgacheffe in OCaml

Last week I reported on the lacklustre results I got in attempting to speed up my Area-of-Habitat pipeline, which underpins two of the biodiversity metrics I work on, by adding GPU support to Yirgacheffe, my high-level geospatial Python library. To summarise, I was only getting about a 10% performance improvement by using the GPU over the CPU - and that's on a unified memory architecture using Apple's M series chips; I suspect it'd have been pointless on Nvidia GPUs with their dedicated video RAM, given the per-species raster sizes involved.

This then caused me to look over where time was being spent, and what I could do about it. If I'm to start working at 30m-per-pixel-at-the-equator on a regular basis, I need to find an order of magnitude of performance gain somewhere if I want to keep my workflow somewhat agile.

This reflection reminded me once again that Python is a silly language to be doing this sort of thing in. I totally understand why Python has ended up being the de-facto data-science language, but it is poorly suited to this sort of work, even when you push a lot of your logic out of Python and into C++ numeric code using libraries like numpy and GDAL. I've spent a lot of my life writing concurrent and parallel code, and so looking at the data pipelines I've built in Python, I can see lots of places where, if Python supported proper in-process threading well, I could unlock a bunch of performance by sharing more data between tasks. I try to do some of this in Yirgacheffe already: it has a parallel_save operator, which attempts to spread work out over many CPU cores and uses shared memory behind the scenes to tie it all together, but doing that while keeping it behind a "usable by non-computer scientists" API can only unlock so much.
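For illustration, the general shape of that trick looks something like the following sketch - this is not Yirgacheffe's actual implementation, and the dimensions and helper names are made up for the example. Each worker process writes its own disjoint band of rows directly into one shared-memory buffer, so results never need to be pickled back to the parent.

```python
# Sketch of the shared-memory fan-out pattern (not Yirgacheffe's real
# parallel_save): workers fill disjoint row bands of one buffer, so no
# per-result serialisation and no locking is needed.
from multiprocessing import Process, shared_memory

import numpy as np

HEIGHT, WIDTH = 1000, 1000  # stand-ins for a raster's dimensions

def worker(shm_name, start_row, end_row):
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray((HEIGHT, WIDTH), dtype=np.float64, buffer=shm.buf)
    # Stand-in computation: each worker writes only its own rows.
    out[start_row:end_row, :] = np.arange(WIDTH, dtype=np.float64) * 2.0
    shm.close()

def parallel_compute(n_workers=4):
    shm = shared_memory.SharedMemory(create=True, size=HEIGHT * WIDTH * 8)
    try:
        step = HEIGHT // n_workers
        procs = [Process(target=worker, args=(shm.name, i * step, (i + 1) * step))
                 for i in range(n_workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        out = np.ndarray((HEIGHT, WIDTH), dtype=np.float64, buffer=shm.buf)
        return out.copy()  # copy out before the shared block is freed
    finally:
        shm.close()
        shm.unlink()

if __name__ == "__main__":
    result = parallel_compute()
    print(result.shape)
```

The awkward part is everything around the happy path - buffer lifetimes, worker crashes, chunking rasters that don't divide evenly - which is exactly the machinery you'd rather the language's own threading model handled for you.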

And every other month it seems we hear that there's a new version of Python by some small group that is going to revolutionise Python parallelism, but none of them seem to have got to the stage where I'd be willing to bet a three-year research project's worth of code on building on top of them.


And so, my thoughts go to using other languages for my pipelines, ones that are well suited to concurrency and parallelism. For a long while I wrote a few highly-concurrent things in Go, which is not the most readable (with my ecologist hat on) or exciting of languages (with my CS hat on), but was very well suited to highly-concurrent, highly-busy networked services. And there are numerical libraries out there for Go. But if I start writing biodiversity pipelines in Go then I'm definitely on my own, as neither my ecology colleagues nor my computer science colleagues would want to chip in.

I did briefly play with Julia, exploring speeding up part of the Tropical Moist Forest Evaluation Methodology pipeline by moving it to Julia, as that does have support for parallelism. But I found that you had to be really careful with how you implemented your algorithms, as if you did any memory allocations inside your parallel code then the threads all blocked on each other in the memory system. I gather that was being actively fixed, and may indeed be better now, but it didn't feel like a good home for my attempts to have a robust pipeline I could give to ecologists without requiring that they understand computer-architecture concerns. And again, no one on either side of me is currently Julia-savvy or interested in becoming so.

Part of the reason I'm trying to lean into the community aspect here is that I've tried this before and lost momentum. A couple of years ago I did try building Yirgacheffe in Swift: I found an existing Swift TIFF library, added BigTIFF support, made it easy to get at GeoTIFF extension data, and then wrote some of the layering abstractions that are key to how Yirgacheffe works, and got a simple AoH calculation working. But the problem was I was pushing against a bunch of problems with the resource-safety structure of the language, and I had no one to talk to on either team about how best to resolve those, and so I got distracted by other things and it stagnated.


All of which is a long way of saying that I spent a couple of days last week starting to make Patrick's GeoTIFF library for OCaml usable for the kind of pipelines I work on, to see if I can use OCaml to realise the Yirgacheffe dream of doing "the right thing" for ecologists doing geospatial data-science.

This is spurred on in part by some results from a final-year undergrad project Patrick and I are supervising, which show that for certain micro-benchmarks, a combination of OCaml, the OWL numerical library, and the aforementioned GeoTIFF code is getting usable initial performance results that are up there with Python and numpy. This being a final-year project, there's still a bunch of work to be done to make it properly usable in production, but it's done what a final-year project should do, which is show there is something here worth pursuing.

I've started adding LZW compression support and filling in more of the BigTIFF support, as given the size of the data we work with, most of it will be compressed and often over 4GB; on top of that I've already started adding more data types, so I can get the full coverage needed for the LIFE and STAR biodiversity pipelines. My strategy here is just to get something working as quickly as I can, and then tidy up and optimise both performance and usability once we have it running some real use-case code.
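As a concrete example of the file-format detail the BigTIFF work has to handle (a sketch in Python for illustration, not Patrick's OCaml API): a classic TIFF header carries magic number 42 and a 4-byte offset to the first IFD, while BigTIFF uses 43 and 8-byte offsets, which is what lifts the 4GB ceiling.

```python
# Sniffing the classic-TIFF vs BigTIFF header difference.
import struct

def sniff_tiff(header: bytes):
    order = {b"II": "<", b"MM": ">"}[header[:2]]  # byte-order mark
    (magic,) = struct.unpack(order + "H", header[2:4])
    if magic == 42:
        # Classic TIFF: 4-byte offset to the first IFD.
        (ifd_offset,) = struct.unpack(order + "I", header[4:8])
        return ("classic", ifd_offset)
    if magic == 43:
        # BigTIFF: offset size (always 8), a reserved 0, then an
        # 8-byte offset to the first IFD.
        offset_size, reserved = struct.unpack(order + "HH", header[4:8])
        (ifd_offset,) = struct.unpack(order + "Q", header[8:16])
        return ("bigtiff", ifd_offset)
    raise ValueError("not a TIFF")

# Hand-built little-endian headers for illustration:
classic = b"II" + struct.pack("<HI", 42, 8)
big = b"II" + struct.pack("<HHHQ", 43, 8, 0, 16)
print(sniff_tiff(classic))  # ('classic', 8)
print(sniff_tiff(big))      # ('bigtiff', 16)
```

Everything downstream of this fork - tag counts, tag value sizes, offsets inside IFDs - widens in the same way, which is why "filling in BigTIFF support" touches more of a TIFF reader than you might first expect.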

Obviously there are probably more appropriate or more amenable languages I could be using for this purpose, but I work in a group with OCaml hackers on the CS side, so hopefully I won't be pushing on this alone; and as with my Swift attempt, I do think both Swift and OCaml can be presented simply enough to make implementing geospatial ecology methodologies at least readable to vernacular developers, if not writable.

This week

After being up north for a couple of weeks I have some catching up with Cambridge folk to do:

  • Catch up with Ali on where we are data-wise for the follow up paper to LIFE she's working on
  • Catch up with a contact at the IUCN on best practice for overly detailed coastal ranges
  • Catch up with the newly-expanded planetary computing group via a Shark demo

In addition to that and continuing to work on the OCaml GeoTIFF stuff, I need to remember I'm chairing a discussion session at Nordic-RSE in a few weeks and I should prep for that, and perhaps read the Gamestorming book that's been sat on my desk for the past month.

Tags: yirgacheffe, ocaml, life