Weeknotes: Remembering to turn on the tap

15 May 2026

I've been working on an update to the LIFE pipeline for a few months now, which has led me to refactor parts of the pipeline, including moving some stages from Python code to calling GDAL command line tools instead. What follows is an error in translation from that process, but I thought I'd document it, as it concerns a flag I don't see used often, yet one that can have a real impact on your results despite not seeming to do much.

When I sent the most recent version of the LIFE maps from my revised pipeline out to the group for review, one of the group's PhD students, Emilio Luz-Ricca, spotted an interesting oddity: whilst most of the ocean was converted to NODATA, there was an odd ring of zero values around all the land masses.

A black and white image showing the southern end of South America as black with the ocean white, but there is a clear ring of black pixels as if there was some barrier out at sea.

Emilio also observed that it happened to coincide with the transition between Marine (Neritic) and Deep Ocean Floor in the Jung habitat map we use:

A false colour image showing the same portion of South America where different colours represent the different habitats. In the ocean you can see two distinct shades of blue, and the boundary between the two is clearly where the "barrier" line was in the previous map.

As I mentioned, all the values in this mystery buffer were zero, so in theory they have no significance: in the LIFE metric, zero means no change to the extinction risk it is attempting to measure. Not the end of the world, then, but in my experience if something isn't doing quite what you expect in these data-science pipelines you really should dig in, because whilst the bit you have observed isn't doing any harm, it might be doing harm elsewhere.

One of the changes I've had to make is to the point at which we downsample data. Like many data-science pipelines, LIFE ingests high resolution data (global 100m per pixel rasters), but produces lower resolution outputs that better suit the accuracy of the overall data set and method. That is to say, whilst we want high resolution input maps, the species data we work with is not designed to be accurate at such fine scales, so we downsample to 2km per pixel before publishing to reflect that.
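To put a number on that resolution change, a quick back-of-envelope sketch (using the 100m and 2km figures above):

```python
# Going from 100m pixels to 2km pixels shrinks the data by the
# square of the resolution ratio, as each output pixel covers a
# 20x20 block of input pixels.
high_res_m = 100     # input: metres per pixel
low_res_m = 2_000    # published output: metres per pixel
ratio = low_res_m / high_res_m    # 20 pixels per side
pixel_saving = ratio ** 2         # 400x fewer pixels overall
```

Four hundred times fewer pixels to read, process, and write, which is why where in the pipeline you downsample matters so much.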

One of the challenges of doing any work with the high resolution rasters, though, is that they take a lot of memory and CPU time to work with. Like, a lot a lot. To quote the master of hyperbole:

"Space is big. Really big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space."

The Hitchhiker’s Guide to the Galaxy, by Douglas Adams

However, there is something we can do to try to rein in the computational burden of all this high resolution data. Working with downsampled data is a lot cheaper, and since we know we are going to downsample before publishing, we can move the point in the pipeline at which we do the downsampling as close to the start as we can without sacrificing the accuracy the method needs. That way we only use the high resolution data where its value is significant, and we save costs elsewhere by avoiding running later stages at an unnecessarily high resolution.

One of my early wins when I took over building the LIFE pipeline was aligning how LIFE did downsampling with how the IUCN STAR metric pipeline, as implemented by Chess Ridley, did it. Thanks to certain tricks in how other layers were handled, the STAR pipeline had moved the downsampling ahead of all per-species data processing, rather than after it as was the case in LIFE. In LIFE we also deal with counterfactual generation, and I was able to do that at the downsampled resolution too, which saved even more time. This was a significant part of getting the LIFE pipeline down from a couple of weeks of execution time to a day or two.

Unfortunately for me, science doesn't sit still, and the LIFE team recently changed how we process the input layers, which meant I've had to do more of the counterfactual map generation at the full 100m per pixel resolution, adding about a day back to the from-cold runtime of the LIFE pipeline. The change is for good reason: the new version of LIFE will do an even better job of predicting land change impacts, particularly for rewilding, but it does make my life a little more tedious!

So, back to the error Emilio spotted.

The error wasn't introduced by rewriting the counterfactual code to work at higher resolutions; that was fine. What was broken was that I had failed to migrate one of the many flags I feed to GDAL warp. The old LIFE pipeline used the habitat processing script from my AOH library to do the downsampling, but due to how we process the counterfactuals now, I do all that downsampling using the gdalwarp command line tool. This is the warp invocation in the habitat processing script:

gdal.Warp(
    warped_map_path,
    filter_map_path,
    options=gdal.WarpOptions(
        creationOptions=[],
        multithread=True,
        dstSRS=target_projection,
        outputType=gdal.GDT_Float32,
        xRes=pixel_scale,
        yRes=((0.0 - pixel_scale) if pixel_scale is not None else pixel_scale),
        resampleAlg="average",
        outputBounds=(reprojected_area.left, reprojected_area.bottom,
            reprojected_area.right, reprojected_area.top),
        targetAlignedPixels=pixel_scale is not None,  # the flag I failed to migrate
        warpOptions=[f'NUM_THREADS={max_threads}'],
        warpMemoryLimit=available_mem,
        workingType=gdal.GDT_Float32
    )
)

The main thing to spot here is that there are a lot of options passed to GDAL warp. In the command line version I replaced this with:

gdalwarp \
    -t_srs EPSG:4326 \
    -tr {params.pixel_scale} -{params.pixel_scale} \
    -r average \
    -multi \
    -co COMPRESS=LZW \
    -co NUM_THREADS={threads} \
    -wo NUM_THREADS={threads} \
    "$d" \
    {params.output_dir}/"$basename"

There are fewer options here. Some of that is valid, as gdalwarp can infer options from the input layer, whereas the Python code is doing more generation work and so needs to be more specific, and some options were research constraints that I didn't need to propagate. But amongst all that I missed one important flag: Target Aligned Pixels, or -tap if you're using the command line version. This option forces the reprojected map onto a regular grid based on the pixel size, rather than just being referenced from the top left corner. This matters because whilst all the inputs to LIFE are at the same pixel size, they have slightly different dimensions (most maps, for instance, do not go all the way to the poles, and different maps will cut off at different points). Without -tap, although my counterfactual maps are often identical to the current map in certain areas, and so when downsampled should also be identical, the slight wobble in where the origin for downsampling sat led to slight value differences between the downsampled maps, as certain pixels were shuffled to the left or right when doing the averaging.
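As a rough sketch of what that alignment does (my own illustration, not GDAL's actual code): -tap snaps the output extent outward so every edge lands on a multiple of the pixel size, meaning any two layers warped at the same resolution end up on the same global grid, whatever their original extents were.

```python
import math

def tap_bounds(minx, miny, maxx, maxy, res):
    """Snap an extent outward onto multiples of `res` -- roughly the
    grid alignment that gdalwarp's -tap flag enforces alongside -tr."""
    return (
        math.floor(minx / res) * res,
        math.floor(miny / res) * res,
        math.ceil(maxx / res) * res,
        math.ceil(maxy / res) * res,
    )

# Two layers with the same pixel size but slightly different extents:
a = tap_bounds(-70.013, -56.007, -66.994, -52.001, 0.02)
b = tap_bounds(-70.005, -55.998, -67.002, -52.013, 0.02)

# Every edge is now an integer multiple of the pixel size, so the
# pixel grids of the two layers line up exactly where they overlap.
assert all(abs(c / 0.02 - round(c / 0.02)) < 1e-9 for c in a + b)
```

The extents still differ, but because every boundary sits on the same 0.02-degree lattice, a pixel in one layer covers exactly the same patch of ground as the corresponding pixel in the other.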

The consequence of this is relatively minor, as it's only a slight misalignment: most pixels are not boundary pixels, so are not impacted, and for those that are, the values will be close to correct. BUT they are wrong, and despite showing up at sea, the problem impacts land pixels just as much; the noise in the species data simply makes it far less obvious there. And it's wrong in some of the most interesting places: land fragmentation is a real problem in biodiversity protection, and fragmentation means borders between habitat classes, which is exactly where the error is introduced!
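To make that boundary-pixel shuffle concrete, here's a toy one-dimensional example (my own illustration, not pipeline code): averaging the same land/sea row in blocks of four, but with the grid origin shifted by one pixel, changes the value of the block straddling the boundary while leaving the interior untouched.

```python
# A row of habitat pixels: land (1) then sea (0).
row = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

def block_averages(pixels, block, offset):
    """Downsample by averaging fixed-size blocks, starting at `offset`."""
    return [sum(pixels[i:i + block]) / block
            for i in range(offset, len(pixels) - block + 1, block)]

aligned = block_averages(row, 4, 0)  # blocks: [1,1,1,1] [1,1,0,0] [0,0,0,0]
shifted = block_averages(row, 4, 1)  # blocks: [1,1,1,1] [1,0,0,0]

# The all-land block is unchanged, but the block straddling the
# land/sea boundary averages to 0.5 on one grid and 0.25 on the other.
```

Same data, same pixel size, same resampling algorithm: only the origin moved, and only the boundary blocks disagree.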

All of which is to say I owe Emilio a cake for flagging what seemed like an oddly trivial thing, but actually does have an impact on our results.

I have to confess that because I tend to use Yirgacheffe, my geospatial library, as a place to hide things I don't want to have to bother with, I have a tendency to forget I have to bother with them at all, which is really what bit me here. If you give Yirgacheffe a series of rasters where the pixel scales are the same (say, all 10m per pixel), then, given how unusual it is for all sources of rasters to have exactly the same grid alignment, it looks at all the input layers, works out the closest alignment for all of them, and then proceeds with that. This is a very forgiving approach in general, but it does mean I propagate the same issue: layers generated by Yirgacheffe will have yet another grid alignment rather than following a TAP scheme. Had Yirgacheffe defaulted to TAP in its internal processing, then by the time I got to downsampling I'd not have had an issue. The only reason I don't do that is that if you feed Yirgacheffe a layer with some vector data, or do some other non-alignment related work, I want the output layer to match the input, not to TAP the output.

As ever, there is no perfect default behaviour, but it is one I should probably reflect upon. Which is handy, as I now have a couple of days of compute ahead of me whilst I recalculate LIFE from scratch with correctly aligned base layers 🤦

Tags: weeknotes, life, gdal