Weeknotes: 19th May 2025
A photo of the Danish island of Læsø taken from the plane as we descended towards Gothenburg airport.
Last Week
Nordic-RSE
I did my prep for the Nordic-RSE conference, or at least as much as I feel I can, where I'll be hosting a discussion session towards the end of the conference on lineage in data-science pipelines. I have limited experience hosting discussion panels, and so I've been reading through Gamestorming by Gray et al, a book we happened to have at home. It's one of those books where a lot of what it says (at least in the opening chapters) is perhaps somewhat obvious (e.g., a session should have an opening, an exploration, and a closing), but it's really useful to have this spelled out and formalised a little, and will hopefully make my facilitation of the session a bit more deliberate and focussed.
My ultimate aim for this is to try to tease out what tools and techniques other RSEs have been using to help preserve lineage and ensure repeatability and reproducibility of the projects they work on. At the end of this we'll hopefully have a bunch of suggestions, which I'll then write up and host in a git repo somewhere so that other participants can fill in bits I missed or add more detail. If at the end of the process we have a page of things that the community can refer to in future to make it easier to ensure lineage is preserved, then this will be a success. If we don't achieve that (assuming my bad facilitating isn't to blame), then this will show that there is a gap here that could be used to direct our research, so that's also a success.
STAR
I squished the last known issue with my STAR implementation, which was another GDAL oddity throwing an error that got lost in the sheer volume of species we process. By default GDAL will not load GeoJSON polygons that are over 200MB in size. The fix is simple: you remove the limit by setting the OGR_GEOJSON_MAX_OBJ_SIZE environment variable to 0. Note that the files themselves aren't over 200MB in size, so I assume the limit refers to GDAL's in-memory representation.
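For anyone hitting the same error, the workaround can also be applied programmatically. A minimal sketch in Python; GDAL reads its config options from the environment, so this must happen before the GeoJSON driver opens the file:

```python
import os

# Remove GDAL's 200MB GeoJSON object size limit. Setting this to
# "0" disables the limit entirely; it must be set before GDAL's
# GeoJSON driver reads the file.
os.environ["OGR_GEOJSON_MAX_OBJ_SIZE"] = "0"

# Equivalently, once osgeo is imported, you can use
# gdal.SetConfigOption("OGR_GEOJSON_MAX_OBJ_SIZE", "0").
```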
I'd applied this earlier in the pipeline to get the AoHs to work, but the problem was I'd not set it for one of the later stages. If I actually used the dockerised version of the pipeline this wouldn't be an issue, as I'd have set it in the environment once and could forget about it, but because I tend to run in my native environment it has to be set in every script it might affect. I should probably just punt this into yirgacheffe, as working with the IUCN range data you regularly end up with polygons that exceed this limit for species with coastal ranges (see my recent rant on this).
With that out of the way, I sat down with Simon Tarr from the IUCN and we ran through getting the dockerised version of my STAR pipeline to run on his laptop. I normally don't use the docker version myself, and so there were inevitably a few teething issues, but we got Simon to the point where he was generating AoH maps, and more importantly the process helped him understand a little of how the pipeline works internally.
I spent a little more time afterwards and got other parts of the pipeline working in docker too, like the model validation checks on the AoH maps. This required installing R in the container, as I had forgotten that the Python stats package I ended up using when porting over Chess's R code actually just calls R under the hood 🤦 Still, good progress.
I do think I need to add some better CI around the docker images, and we should push both the LIFE and STAR docker images to a registry for people.
OCaml TIFF
Still tip-toeing my way around getting to the writing of files from OCaml-TIFF; instead this week I moved specifying the type of the data to be read from a file to the time you open it, and then worked on a couple of quite useful GDAL-specific GeoTIFF extensions. In fact, I have to confess that I'd naively assumed these GDALisms were part of the GeoTIFF standard, as I've come across them in datasets from others, and popular tools like QGIS seem to honour them (though I suspect that's because QGIS uses GDAL under the hood).
The extensions are: setting a NODATA value, and sparse TIFFs. NODATA lets you nominate a value in the TIFF file that should act as a sort of mask: if I set NODATA to 42, all values of 42 in the image are ignored, and won't be displayed in tools like QGIS. I see this used a lot, say to mask out the ocean or areas outside a given spatial range of interest. All fine, until you look at how it's been added to GeoTIFF by GDAL: the value is not stored encoded as the same type as the data in the file (e.g., an int value for int files, or float for float files), but rather as an ASCII string of the value. This means a bunch of inference has to happen. For example, if you have a uint8 TIFF, here are some examples of the data GDAL synthesises for different NODATA strings:
| NODATA | synthesised data |
|---|---|
| 3.175 | 3 |
| 3.9 | 4 |
| -32 | 0 |
| 321 | 255 |
| nan | 0 |
For our library, which is meant to be promoting type safety, I think we'll not try to mimic however GDAL does the conversion, and instead just throw an error if, for example, you have an unsigned integer layer and you provide a negative NODATA value.
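The table above suggests GDAL parses the ASCII string as a float, rounds to the nearest integer, and clamps into the pixel type's range, with NaN falling back to 0. A sketch of that inference for uint8 data; this is my guess at the behaviour from the observed values, not GDAL's actual code:

```python
import math

def synthesise_uint8_nodata(nodata: str) -> int:
    # Hypothetical reconstruction of the observed behaviour: parse
    # the ASCII NODATA string as a float, round to the nearest
    # integer, then clamp into uint8's range. NaN appears to map
    # to 0.
    value = float(nodata)
    if math.isnan(value):
        return 0
    return max(0, min(255, round(value)))
```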
The other extension is sparse TIFF files. TIFF files store the image data in strips or tiles, and in the header of the file have a table indicating the offset and length of those blocks within the file itself. GDAL has the nice extension that if you set the offset and length of a block to 0, then it'll synthesise the data for that block rather than reading it from the file. So if you have a block that's all a default value, or your NODATA value, you don't need to put that in the file. In particular, if you're using tiles and have areas of ocean and all you care about is land, this seems a neat saving. The synthesised block is initialised with either zero, or if one is specified, the NODATA value.
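In code terms, the reader's block-fetch path just needs one extra check. A minimal pure-Python sketch of the convention (the function and parameter names here are mine, not from any library):

```python
def read_block(offsets, counts, index, shape, read_fn, nodata=None):
    # offsets/counts mirror the TIFF TileOffsets/TileByteCounts (or
    # StripOffsets/StripByteCounts) tags. GDAL's sparse extension:
    # an entry of (0, 0) means the block is absent from the file
    # and should be synthesised as all-zero (or all-NODATA) data.
    if offsets[index] == 0 and counts[index] == 0:
        fill = 0 if nodata is None else nodata
        rows, cols = shape
        return [[fill] * cols for _ in range(rows)]
    # Otherwise read the block's bytes from the file as usual.
    return read_fn(offsets[index], counts[index])
```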
I've not yet knowingly run into sparse TIFFs in the wild (unlike NODATA values, which I see frequently), but I definitely intend to use them now that I know of them.
Also, excitingly, Patrick took my LZW implementation and sped it up, replacing my list-based implementation with one based on Strings and some magic calls to internal functions to reduce allocations.
Edge effects
I've been reading some papers on "edge effects" - that is, how a species interacts with the edge of its habitat range. For example, if a species likes forest, it won't necessarily live in every part of the forest, keeping away from the edges where it transitions to other habitats it doesn't like. I've been asked to implement edge effects for my AoH code, and I have a general idea of how I'd implement this using something like a standard image-processing convolution, but I wanted to know how others have implemented it, to see if I was missing anything about the problem. To this end I'm currently reading my way through the original paper Andrew worked on, To What Extent Could Edge Effects And Habitat Fragmentation Diminish The Potential Benefits Of Land Sparing? by Lamb et al, and a more recent look at the topic that attempts to add more nuance to how the edge effects are implemented (which I suspect is then going to be computationally more complicated), A Mechanistic Approach To Weighting Edge‑effects In Landscape Connectivity Assessments by Dennis et al.
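The convolution idea I have in mind is roughly: treat the habitat map as a binary mask and, for each pixel, measure how much of its neighbourhood is also habitat, so pixels near a habitat edge score lower than interior ones. A toy numpy sketch of that, using a fixed 3×3 window; a real implementation would size the kernel to the species' edge-sensitivity distance:

```python
import numpy as np

def interior_fraction(habitat: np.ndarray) -> np.ndarray:
    # habitat: binary mask, 1.0 = suitable habitat, 0.0 = other.
    # Returns, per pixel, the fraction of its 3x3 neighbourhood
    # that is also habitat: 1.0 deep inside a patch, lower at edges.
    padded = np.pad(habitat.astype(float), 1, mode="edge")
    rows, cols = habitat.shape
    acc = np.zeros((rows, cols), dtype=float)
    # Sum the nine shifted copies of the mask (a box convolution).
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + rows, dx:dx + cols]
    return acc / 9.0
```

The result could then be thresholded, or used as a per-pixel weight when summing AoH area.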
This Week
This week I'll be working from Gothenburg, with the Nordic-RSE conference taking up Tuesday and Wednesday. Monday I'll be doing more prep for my discussion session on Wednesday on lineage in scientific systems, Thursday I'll hopefully do a little exploring and practising my Swedish, and Friday I'll be travelling back.
Tags: ocaml, star, self-hosting, weeknotes