Weeknotes: 7th April 2025

Last Week

Talking to new team members

We welcomed Mark, David, and Jon to the group recently, and I finally found time to sit down and have some great discussions with them all about what we've been working on over the last year, and how we might all collaborate around that. In what was an otherwise frustrating week on the technical work front, it was great to have some inspiring discussions about what we might work on over the coming year.

Life

A wise robot once said:

Life! Don't talk to me about life.

And I've had a certain amount of sympathy for that view of late, it has to be said, as I've been fighting with the LIFE pipeline while trying to generate some new high-resolution maps for parts of the globe. This week has been plagued by what I felt was a reasonable decision the week before: rather than re-engineer the pipeline to work with data at mixed resolutions, which would require some careful coding, I decided not to invest a lot of time on that right now, given I don't know how far this particular avenue of investigation will be pursued. In last week's weeknotes I said:

My intent had been to try and code something to make it more efficient to work with a mix of high- and low-resolution maps, rather than doing a bunch of work at the higher-resolution unnecessarily, but in practice until I know this pipeline will be one people actually use, it's not a good use of my time right now, and I have a few weeks to generate the results, so I'm just doing it the naive way for now, and I'll see just how slow it is.

My theory was that rather than recode the pipeline with special tricks to integrate the region-specific high-resolution data into my lower-resolution global maps, I'd just process everything at the higher resolution, and accept it taking a week to run rather than a day.

I should make it clear that when I say "low resolution", that's still 100m-per-pixel-at-the-equator, which is roughly 400k x 200k pixels at two bytes per pixel: a global map is about 150GB of data, so we're still doing a lot of number crunching in the "low resolution" version. High resolution here is 30m-per-pixel-at-the-equator, which works out to roughly 10x as much data.
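For the curious, those numbers roughly check out. Here's a back-of-the-envelope version in Python, assuming a full 360 x 180 degree extent and a single two-byte band:

    EQUATOR_M = 40_075_000                       # circumference of the Earth in metres

    for label, pixel_size_m in (("100m", 100), ("30m", 30)):
        width = EQUATOR_M // pixel_size_m        # pixels around the equator
        height = width // 2                      # half as many rows (180 vs 360 degrees)
        size_gib = width * height * 2 / 1024**3  # two bytes per pixel
        print(f"{label}: {width:,} x {height:,} px, ~{size_gib:,.0f} GiB")

Which comes out at roughly 150GiB for the 100m map and about 1.7TiB for the 30m one.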

The problem was that running the entire pipeline at the higher resolution revealed that in certain parts of the pipeline we'd been getting by on GDAL's default behaviour, but when processing 10x as much data, GDAL's default assumptions around caching and so on go out the window, as did some assumptions I'd made about how the pipeline would scale. 150GB is a lot of data, but our server has 1.5TB of RAM, so we have some flexibility for running stages in parallel and so forth. But when the size of the map equals the amount of memory we have, we have zero tolerance for sub-optimal memory usage, particularly as in practice this is a shared machine, so not all of that RAM is available to us.

As I say, we don't normally need to load the entire global map at once, we can stream process it effectively, but there are points where we need to make in-memory partial maps, which we'd got away with in the past, but can't once I move everything to the new higher resolution (this is why I'd originally wanted to do partial rendering!). To make things worse, GDAL has a "helpful" feature whereby it caches map data in memory in case you need it again. The problem is that it sizes that cache based on how much RAM the machine has, and so unless you explicitly tell it not to cache, it just won't remove old data from memory after you read it, and it does this without regard for other things that might be using that memory, like other users, or parallel copies of the same program. So even though I'm loading the map in chunks to avoid using too much memory, GDAL's default behaviour makes a mockery of that, and I was guilty of the thing I get grumpy at our users for doing: I triggered Linux's out-of-memory killer quite a few times, disrupting the work of others.
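The knob in question is GDAL's block cache, which you can cap from the Python bindings. A minimal sketch, where the 512MiB figure is just an illustrative number rather than what the pipeline actually uses:

    from osgeo import gdal

    # By default GDAL sizes its raster block cache as a fraction of physical
    # RAM (around 5% in recent releases), which on a 1.5TB shared machine is
    # far more than any one process should quietly hold on to.
    gdal.SetCacheMax(512 * 1024 * 1024)   # cap the block cache at 512MiB
    print(gdal.GetCacheMax())             # confirm the limit now in force (in bytes)

The same limit can also be set via the GDAL_CACHEMAX configuration option in the environment if you'd rather not touch the code.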

In the end I've had to annotate every part of my code to tell GDAL not to use a big cache, except on the few calls to gdal.Warp where I can't control parallelism, and so I need to let it have more memory (though I still need to constrain it somewhat to defend the other users of these systems). I also had to switch some sections of the pipeline from task-based parallelism (where I run several copies of the same script) to one task using multiprocessing in Python, as the maps were too big for me to work the way I used to.
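To give a flavour of the shape of that change, here's a hedged sketch along those lines; the tile names, pixel sizes, and memory numbers are placeholders rather than values from the real pipeline:

    from multiprocessing import Pool

    from osgeo import gdal

    gdal.SetCacheMax(512 * 1024 * 1024)    # keep the block cache small everywhere else

    def warp_tile(tile_path):
        # warpMemoryLimit maps to gdalwarp's -wm option, and a value this size
        # is read as megabytes, so the warper gets a bounded working buffer
        # rather than sizing itself off the machine's total RAM.
        out_path = tile_path.replace(".tif", "_30m.tif")
        gdal.Warp(
            out_path,
            tile_path,
            options=gdal.WarpOptions(
                xRes=0.00027,            # placeholder: ~30m at the equator, in degrees
                yRes=0.00027,
                warpMemoryLimit=4096,    # ~4GB working buffer for the warp itself
                multithread=True,
            ),
        )
        return out_path

    if __name__ == "__main__":
        tiles = ["tile_a.tif", "tile_b.tif"]   # placeholder inputs
        # One parent process fans the work out to a bounded pool, rather than
        # several copies of the same script each assuming the RAM is theirs.
        with Pool(processes=4) as pool:
            pool.map(warp_tile, tiles)

The point being that a single process now owns the decision about how much parallelism and memory the job gets, instead of that being an emergent property of however many copies of the script happen to be running.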

This then begat another problem that I'd not hit before. At some point, while trying to fix the memory over-consumption issues, I started getting the following crash almost immediately after executing my script:

terminated by signal SIGBUS (Misaligned address error)

It took me an embarrassing amount of time to figure this one out, because it relates to a feature that I added a long time ago to Yirgacheffe, my Python geospatial library, and how it tries to hide Python's limitations around parallelism. It turns out that when my processes were killed by the Linux OOM killer, they leaked the shared memory regions Yirgacheffe uses to magic away Python's problems, and so eventually we couldn't allocate any more of them, and I had to go in and remove them manually. I need to dig a bit more into what is happening here and write that up, as it was a new failure mode to me, but for now I just needed to get the pipeline running again as I have a deadline to hit and I'm running out of time.
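For anyone hitting the same thing, here's a sketch of the sort of manual clean-up involved, assuming the leaked segments are POSIX shared memory under /dev/shm (which is where Python's multiprocessing.shared_memory puts them on Linux); the "psm_" prefix is CPython's default naming, and whether Yirgacheffe's segments follow it is my assumption, so check what's actually there before deleting anything:

    import os

    SHM_DIR = "/dev/shm"   # Linux exposes POSIX shared memory segments as files here

    for name in os.listdir(SHM_DIR):
        # CPython's multiprocessing.shared_memory names segments "psm_..." by default
        if name.startswith("psm_"):
            path = os.path.join(SHM_DIR, name)
            print(f"removing leaked segment {name} ({os.path.getsize(path)} bytes)")
            os.unlink(path)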

The long and short of this is that I predicted it'd take 7 days to run the slower LIFE pipeline without me needing to change the code, and now we're at 10 days, it's still not finished, and I've still had to change the code. 🤦

I fought the build tool chain, and the build tool chain won

As part of this fight with Python, and related to my notes last week about trying to move away from Python in the future, I thought I'd spend a day trying to get some geospatial code working in OCaml. This was also a mistake, as I lost a day of productivity to a series of fights with several parts of the OCaml developer experience.

Firstly, when I tried to update my OCaml install via Opam, the OCaml package manager, it failed with some internal errors. I tried using a different OCaml version (Opam lets you have multiple installed at once), and got the same thing. I re-installed the Opam tool, and got the same thing. Eventually, thanks to much help from my colleagues Patrick and David, we figured out that a breaking change had been made to the OCaml repository in December, and the associated data from my OCaml install was from before then. Deleting one file fixed everything. 🤦

With that out of the way, I then went to install OWL, a scientific computing library for OCaml that should be a bit like what numpy is to Python. Unfortunately, I hit an issue that is yet to be resolved. OWL depends on OpenBLAS, a math library used by many things (including both GDAL and numpy), which on my system comes via homebrew. But that version of OpenBLAS is built using the Clang C compiler that is in homebrew, rather than the native Apple-provided Clang, and OpenBLAS uses a feature that is in the homebrew version and not the Apple version; the OCaml compiler uses Apple's native Clang when building things, and so my attempt to install OWL failed.

With the help of David I managed to figure out how to get OCaml to use the homebrew Clang, which then let me build the OpenBLAS package needed by OWL, but I then failed on the next package, which I worried was because I was now using a Clang that wasn't the one used to build most of the other OCaml packages I've installed. So I started a fresh "switch" in Opam with no packages in it, tried to build that using the homebrew Clang, and that failed in the base package.

All of which is a bunch of scrappy notes to say I guess I need to try this again either without the homebrew version of OpenBLAS or on a Linux machine.

Reading

I read The ecology of plant extinctions by Richard T. Corlett, which is a survey of research into plant extinctions, looking a little at why plants go extinct, but mostly concentrating on how we know what we know and trying to quantify what we don't know. The interesting tidbit I took from it is that because plants are long-lived, it's entirely possible that species alive today are already effectively extinct, because the conditions for them to propagate no longer exist. I guess this is the case for other species too, but it's just more obvious with plants because the timescales can be so much longer. That said, given the lag in the carbon cycle, it's entirely possible this point has already been reached for a lot of species of all kinds, which is a somewhat depressing thought to end on.

Next Week

  • Finish these LIFE rasters
  • Resurrect a Shark demo for the new team members to see
  • Look forward to discussions with group visitor Oisin Mac Aodha at the end of the week around map modelling.

Tags: ocaml, life, yirgacheffe