Weeknotes: 2nd June 2025

Last week

Adding image processing support to Yirgacheffe

For some upcoming follow-on work to LIFE, I want to be able to model some things that look a lot to me like they could be implemented as conventional image filters. To this end then I spent some time adding support for this style of operation to Yirgacheffe, my declarative geospatial library for Python. In general, it's much easier for me to implement fundamental operations like this in Yirgacheffe where I can unit test them well and get confidence in the implementation before I then use them in the more cumbersome scientific pipeline.

Yirgacheffe these days has two computational backends that I now need to support when I add new features: a CPU-targeted backend that uses numpy, and a Metal GPU based backend that uses MLX (one day a CUDA backend will be added via CUPY, but I've not needed that yet, and so it hasn't :). I had a dig around, and whilst numpy doesn't really support the 2D convolution matrix operations needed to run this style of image processing, PyTorch does (see conv2d), so that gave me a path to adding support for the CPU-targeted backend. MLX supports the same API as PyTorch for this operation, though weirdly it orders its multi-dimensional matrixes in a different order, which made the code a little messier for me. CUPY also looks to have a similar API so this isn't going to block a CUDA backend either.

Hooking this into Yirgacheffe was a little more nuanced than my previous expansion of supported operators in Yirgacheffe. Under the hood, when you say add two raster layers together, Yirgacheffe does two things:

  • It doesn't execute the expression when you define it, just as you save it. Instead it builds up a full s-expression of the operation at definition time.
  • When it does eventually execute the expression, it carries out the operation in chunks, to avoid using too much memory (we regular work with raster layers that are many times bigger than available RAM), and to optionally allow parallelism.

So in the case I'm adding two rasters, we read in the same data chunk from both the sources, add them together, and then write that chunk to the result raster. This is fine when you have a one-to-one mapping of input to output pixels, but for a convolution matrix you need to load more data than the result needs, other wise you get edge artefacts because the matrix will read null values when you go along the edges of the input layer and the matrix goes over said edges. Thus I needed to break that assumption that a chunk of data in the S-expression is always the same size wherever you are in the expression: if you go past a convolution matrix operation then the chunk window needs to expand all the way to the leaf notes of the expression.

Tedious stuff, but this again is why I hide all this stuff in Yirgacheffe: I do it once, then I never have to think about it again now matter how often I need it!

Nordic-RSE follow up

I wrote a blog post summarising Nordic RSE 2025 - this took a good amount of time, and I've no idea how Anil and others live blog events!

Geocaml

I had a good catchup with Patrick about OCaml-TIFF, our OCaml library for working with GeoTIFF files. TIFF is a somewhat awkward format to deal with in terms of modifications, and so we drew out a plan for a minimal set of features we'd support for writing TIFF files, as that's the big blocker right now in using this for anything useful.

Experiments with TIFF formats

In an idle moment I did some small initial experiments with trying to see if changing the way the rasters for LIFE are stored would make a performance change to the various Area of Habitat (AoH) based pipelines I maintain. I'd wondered if switching from using the default TIFF storage format of storing the image by rows to storing it by tiles might make sense, as normally we're reading data for a species from a narrow band within the overall image width, but the results were actually slightly slower when I did this. I suspect this is down to Yirgacheffe, which reads in chunks of data based on a fixed height size, and unless that happens to align with the tile boundaries, you'll end up reading a lot of the tiles twice. Trying to solve this for an arbitrary set of input files with different tile layouts will be a pain, but perhaps I can just special case the version where all rasters have the same tile layout, which is something I can ensure in my pipelines.

Self-hosting

I did a little bit of learning LWT, which is a promise based concurrency library for OCaml, after that's what Dream (the web framework I use) is based on, and I wanted to see if I could at least improve on the current performance issues I'm having by moving image processing from the request handler directly onto a concurrent promise. I did a minimal job, but indeed using Lwt_process to invoke GraphicsMagick does seem to have improved responsiveness on the more image heavy pages I have on my personal site. The processing time for images that have yet to be cached is still poor, but at least the rest of the site doesn't bog down too much whilst it's doing that. There's more I can do now I understand LWT a little, but I can now chip away at those over time.

Next week

I'll be working from The Wirral for at least the first half of the week.

  • Write up my discussion session from Nordic-RSE - hopefully will take a little less time as it's more on a topic I'm familiar with, but then there were lots of new things shared, so maybe not.
  • Try apply my new convolution code to LIFE.
  • I have an Outreachy intern starting this week to do fun stuff with Claudius, so looking forward to seeing what new features we get from that.

Tags: nordic-rse, yirgacheffe, ocaml-tiff, outreachy