Weeknotes: 18th August 2025
Last week
Yirgacheffe Paper
My PROPL paper submission on Yirgacheffe was accepted, with some good review comments, and nothing substantial needs redoing. I'll address the comments this coming week. Overall it seemed the reviewers got what we were trying to do.
The continued path to Yirgacheffe 2.0
The focus for the upcoming 2.0 release of Yirgacheffe, my declarative geospatial library that I've used to build various geospatial data-science pipelines, is simplifying the API to make it more accessible. Last week I realised there was another trick that is now possible that wasn't before, and which I've not seen done in other geospatial libraries either: autoscaling polygon data to match the rasters used in a computation without the user having to explicitly state it.
For example, after my recent round of updates, a simplified AoH calculation looks like:
import json
import yirgacheffe as yg

species_data = json.load(open("species_data.geojson"))  # species metadata; property names below are illustrative
with yg.read_raster("habitat.tif") as habitat_map:
    with yg.read_raster("elevation.tif") as elevation_map:
        with yg.read_shape_like("species_data.geojson", like=habitat_map) as range:
            habitat = habitat_map.isin(species_data["habitats"])
            elevation = (elevation_map > species_data["lower_elevation"]) & \
                        (elevation_map < species_data["upper_elevation"])
            aoh = range * habitat * elevation
            aoh.to_geotiff("species_aoh.tif")
One detail to note here is that when a polygon or shape file is loaded, as per the call to read_shape_like
on line 7 there, you either need to specify the pixel scale and map projection to use explicitly, or you need to provide a reference GeoTIFF layer for it to take those values from; then, when we rasterize the data internally, we can make sure we do so to match the other GeoTIFFs being used.
This API dates from the very early versions of Yirgacheffe, where you manually had to call read_array
on all the layers and then manipulate the data with numpy, so it was important that the range
layer in the above knew what the expected pixel scale and projection were. But now that you can just do math directly on the layers like you see above, Yirgacheffe actually has all the data it needs to figure this out! You can see the range is used in the penultimate line to calculate the AOH by combining it with other rasters, which will always have pixel scale and projection data because they come from GeoTIFFs, which mandate that information.
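For contrast, here's a rough sketch of that older style, where you pull raw numpy chunks out of each layer yourself; the read_array arguments and the toy calculation here are illustrative rather than lifted from a real pipeline:

import numpy as np
import yirgacheffe as yg

with yg.read_raster("habitat.tif") as habitat_map:
    with yg.read_raster("elevation.tif") as elevation_map:
        # Manually fetch matching chunks of pixels from each layer...
        habitat_chunk = habitat_map.read_array(0, 0, 1024, 1024)
        elevation_chunk = elevation_map.read_array(0, 0, 1024, 1024)
        # ...and then do the bookkeeping and numpy math yourself.
        mask = np.logical_and(habitat_chunk == 42, elevation_chunk > 100.0)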
So it's a small but important quality-of-life improvement then to simplify line 7 there to be:
with yg.read_shape_like("species_data.geojson") as range:
Anyway, I feel strongly about the usability of the API, so I was keen to implement this once I realised we had enough information to do it. That said, it was a lot of plumbing (27 files changed in the PR!) to make it work, as I needed to make Yirgacheffe relaxed about the lack of pixel scale and projection on certain layers until the point of final calculation - Yirgacheffe tries to fail early in general, so a lot of checks needed to be updated.
It's a simple thing, but it again saves the ecologist from doing bookkeeping for the computer so they can concentrate more on implementing the methodology. It also means the code looks more like the methodology for people coming to read it later.
When I was testing this last week I also realised that it's silly that in the above example we have to load the GeoJSON twice: once as a Yirgacheffe shape layer and then again as pure JSON to get the metadata. So I now have a ticket to expose that data from the shape layer somehow, which will simplify that code even further.
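To give a feel for what that might look like, here's a purely hypothetical sketch - none of these accessors exist yet, and the eventual API may well look different - where the shape layer exposes its feature properties directly, so the separate json.load disappears:

import yirgacheffe as yg

with yg.read_raster("habitat.tif") as habitat_map:
    with yg.read_raster("elevation.tif") as elevation_map:
        with yg.read_shape_like("species_data.geojson") as range:
            props = range.properties  # hypothetical accessor for the feature metadata
            habitat = habitat_map.isin(props["habitats"])
            elevation = (elevation_map > props["lower_elevation"]) & \
                        (elevation_map < props["upper_elevation"])
            aoh = range * habitat * elevation
            aoh.to_geotiff("species_aoh.tif")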
I was chatting to Jovana about Tessera, the foundation model that people in the EEG have derived for making global predictions from Sentinel satellite observation data, which I'd like to work with.
Tessera data comes in tiles, and I was thinking that, well, Yirgacheffe has a mode, accessed via read_rasters
where you can give it a bunch of GeoTIFFs and get it to work as a single virtual layer, such that you can then clip out just the bit you need:
import yirgacheffe as yg

with yg.read_rasters(...) as tessera:  # ... being the set of Tessera tile GeoTIFFs
    with yg.read_shape_like("area_of_interest.geojson") as area_of_interest:
        clipped = tessera * area_of_interest
        clipped.to_geotiff("result.tif")
However, I realised that Tessera tiles have 128 bands to them, not the more normal single band that most GeoTIFFs have. If you imagine a TIFF of a photo, then it'll typically have three bands: red, green, and blue. Rather than mixing that data into RGB pixels in a single image, as per most image formats, TIFF stores the three colours as separate monochrome images within the TIFF file. In GeoTIFFs you can use this same feature to store related data: say, different satellite imaging frequencies, or in biodiversity I've used it to store data for different animal taxa per band.
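If you want to poke at that structure yourself, you can do so with GDAL's Python bindings (shown here rather than Yirgacheffe, just to illustrate the file layout); the filename is made up:

from osgeo import gdal

dataset = gdal.Open("tessera_tile.tif")
print(dataset.RasterCount)        # e.g. 128 for a Tessera tile, 1 for most GeoTIFFs
band = dataset.GetRasterBand(1)   # GDAL bands are 1-indexed
data = band.ReadAsArray()         # a 2D numpy array for just that band
print(data.shape)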
Whilst Yirgacheffe works with GeoTIFF bands, it'll only work with one at a time, and so that code above doesn't actually do anything useful, as the result will only have one band of Tessera data, not all 128, which is what anyone doing this would want.
Not only is fixing this possible (albeit with a bunch of nuance that needs to be worked out first, as noted on the ticket), but I realised it might also unlock a better API for describing parallel tasks for non-computer scientists using Yirgacheffe, which is an exciting thought!
At the moment, if you want to utilise parallelism in a pipeline using Yirgacheffe there are two ways: you can use internal parallelism within Yirgacheffe, which you do by adding a parallelism=True
flag to the to_geotiff
call, or you can externalise the parallelism and call your Python script that uses Yirgacheffe many times for different input data. This latter model is what happens in the LIFE and STAR pipelines I've implemented: in both of these we process tens of thousands of species, so I have a GeoJSON per species and use an external tool (e.g., littlejohn or GNU parallel) to run many copies of the script at once.
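As a concrete example of the internal route, the flag just goes on the final write; a minimal sketch, reusing the AoH example from earlier:

import yirgacheffe as yg

with yg.read_raster("habitat.tif") as habitat_map:
    with yg.read_shape_like("species_data.geojson", like=habitat_map) as range:
        aoh = range * habitat_map
        # Ask Yirgacheffe to split the calculation across CPU cores internally
        aoh.to_geotiff("species_aoh.tif", parallelism=True)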
So, going back to the Tessera example, with multi-band data we can imagine that we could automatically process each band in parallel, as the bands are related but independent from a computational point of view. This means we can start to take a data-driven approach to parallelism rather than relying on the user to identify it in their workloads. But I think we can apply the multi-band approach to non-GeoTIFF data too. When you download the species data from the IUCN Red List, that data comes in a single data unit with all the species in it, so you effectively have a multi-band shape file! The AOH pipeline could then become something like this, without the need for any external tooling:
import yirgacheffe as yg

# Speculative future API: bands=ALL would load every species as its own band,
# with the per-species metadata coming along with the layer.
with yg.read_raster("habitat.tif") as habitat_map:
    with yg.read_raster("elevation.tif") as elevation_map:
        with yg.read_shape_like("species.gpkg", bands=ALL) as species:
            elevation = (elevation_map > species.lower_elevation) & \
                        (elevation_map < species.upper_elevation)
            habitat = habitat_map.isin(species.habitats)
            aohs = species * habitat * elevation
            aohs.to_geotiff("aohs.tif")
The idea of a GeoTIFF with tens of thousands of bands in it might be a bit silly, but it's not technically impossible (just a little impractical to work with in existing tools like QGIS). But here we have the code matching the method very closely, no external tools to confuse the situation, and we can infer parallelism without needing to burden the non-computer scientist with that implementation detail.
This would also be a better level of parallelism for Yirgacheffe to work at. Currently the internal parallelism tries to take a single band of computation and slice it into chunks that it distributes to different CPU cores, but the limitations of Python really take a toll here, and performance is notably worse than using external parallelism (better than single threaded, mind, but just not what it could be in a language better suited to parallelism). Band-level parallelism would remove most of the pain of having to chunk up and stitch back together the data from the parallel workers, and might put Yirgacheffe on par with external parallelism again. Perhaps you can see now why this is exciting to me?
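To make the distinction concrete, here's a rough sketch of band-level fan-out done externally with GDAL and multiprocessing - the filename is made up and the per-band computation is just a stand-in - which is essentially the unit of work Yirgacheffe could infer and manage for you:

from multiprocessing import Pool

import numpy as np
from osgeo import gdal

TILE = "tessera_tile.tif"  # hypothetical 128-band Tessera tile

def process_band(band_index):
    # Each worker opens the file itself, as GDAL dataset handles don't pickle.
    dataset = gdal.Open(TILE)
    data = dataset.GetRasterBand(band_index).ReadAsArray()
    return float(np.mean(data))  # stand-in for the real per-band calculation

if __name__ == "__main__":
    band_count = gdal.Open(TILE).RasterCount
    with Pool() as pool:
        results = pool.map(process_band, range(1, band_count + 1))  # bands are 1-indexed
    print(results)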
Slurm
I did some more Slurm testing and I generated some more LIFE data for different curves, but nothing major here. I did get a lot of useful feedback from the Nordic-RSE folks when I asked about Slurm social norms, including this page by Richard Darst.
I still feel that what is missing from Slurm is some kind of easy container backend: one of the curses of a cluster like ours is that we have a bunch of different standard data-science toolchains people want to use (R, Python via Conda, or Python via pip, etc.), and often people can't install the tools they want because they're not root. My dream data-science machine setup is something similar to Vanilla OS, whereby you never log into the machine directly; instead you log into container environments seamlessly. If Slurm had that feature then it'd be ideal, as we'd be able to have a set of standard container images for the different use cases and then let people specialise them if they do have unique tools they need. I need to go through the Slurm container docs and see how close they let us get to that ideal, and see if Mark is of a like mind or not.
This week
- Paper updates for the Friday deadline
- Get back on top of my email backlog
- Edge effects - we had a LIFE edge effects meeting which reminded me that I need to look at improving the binary edge-effects work I did a month or so back so that it deals with fractional edges.
Tags: yirgacheffe, propl