Weeknotes: 26th May 2025

Last Week

Nordic-RSE 2025

A photo of a name-badge saying "Nordic RSE Conference" and "Michael Dales"

I have a lot of notes from the Nordic-RSE conference that I hope to turn into a blog post, so I won't say much here other than that it was a great event: I met interesting folk who work as RSEs in other science disciplines, I saw a bunch of interesting talks and learned a lot, and once again got to appreciate Gothenburg. Very much worth going for me, and hopefully I can join again next year, when it'll be in Tromsø.

LIFE

Before heading out to the conference, Ali came to me with a bit of weirdness she was seeing with one of the tools in the LIFE pipeline. The LIFE pipeline does all its work in the WGS84 map projection, where the pixel grid aligns with the latitude/longitude grid on the globe. This is a popular projection, I assume on account of it being easy to reason about, but it's also quite a distorted one, given a map in it is the same number of pixels wide at the poles as it is at the equator. This means pixels at the equator cover a much larger area of land than those at the poles; other projections like Mollweide attempt to keep a roughly equal area per pixel, but at the cost of being less easy to work with in other ways.

Because of this WGS84 area-per-pixel distortion, you can't compare pixels in a map directly without taking their area into account. So when we do the Area of Habitat calculations in LIFE, we multiply the contents of each pixel by another raster that contains just the approximate area of each pixel. I say approximate because, to avoid excessive work at the resolution we use, we make the simplifying assumption that every pixel at the same latitude has the same area, and there are nice formulas for calculating that, which I codified into a simple script.
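The per-latitude formula can be sketched like this (a spherical-Earth approximation for illustration; the actual script may well use proper WGS84 ellipsoid maths, and the function name here is mine, not the script's):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def pixel_area_m2(lat_top_deg: float, pixel_size_deg: float) -> float:
    """Approximate area of one pixel whose top edge sits at lat_top_deg.

    Uses the spherical band formula R^2 * dlon * (sin(lat_top) - sin(lat_bottom)),
    so every pixel at a given latitude gets the same area.
    """
    lat_top = math.radians(lat_top_deg)
    lat_bottom = math.radians(lat_top_deg - pixel_size_deg)
    dlon = math.radians(pixel_size_deg)
    return EARTH_RADIUS_M ** 2 * dlon * (math.sin(lat_top) - math.sin(lat_bottom))


# A ~0.0009 degree (~100 m) pixel shrinks towards the poles:
print(pixel_area_m2(0.0009, 0.0009))  # near the equator: ~10,000 m^2
print(pixel_area_m2(80.0, 0.0009))    # at 80 degrees north: much smaller
```

The key property for what follows is that the result depends only on latitude, which is what makes a one-column raster sufficient.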

So far, so good.

However, if you look at the script, you'll see it makes an optimisation based on that simplifying assumption: the map it generates is only one pixel wide. There are two reasons for this. Firstly, LIFE works at a resolution of 100 m per pixel side at the equator, which means a global map is around 150 GB per byte per pixel; storing the area as a 32-bit float, as we do, that would be 600 GB uncompressed. Secondly, the most commonly used compression in a TIFF file, LZW, requires that to read a pixel in a row of the image you decompress all the preceding pixels in that row: fine if you're calculating values for Alaska (on the left of the map, so its values come early in the row), but not so much for New Zealand (on the right of the map, so you need to decompress everything to its left first). Very early on in my time on the LIFE project I spotted that this was causing a lot of slowdown, so I added a special mode to my geospatial library Yirgacheffe that lets you provide it a one pixel wide image, and it'll just extrapolate that out to fill each row with that row's value. A simple trick, and one which gave the pipeline a significant performance boost.
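The idea behind that special mode can be sketched with NumPy (an illustration of the technique only, not Yirgacheffe's actual code): keep one value per latitude row, then broadcast it across however many columns a calculation needs, without ever materialising the full-width raster.

```python
import numpy as np

# One value per row (latitude band): shape (height, 1), tiny on disk and in memory.
area_column = np.array([[1000.0], [2500.0], [2500.0], [1000.0]])

# Expand to the full raster width on demand. broadcast_to returns a read-only
# view over the original column, so no full-width array is ever allocated.
width = 6
area_block = np.broadcast_to(area_column, (area_column.shape[0], width))

# Multiply a habitat block by per-pixel area, as in the AoH calculation.
habitat_block = np.ones((4, width))
print((habitat_block * area_block).sum())  # 42000.0
```

Because the broadcast is just a view, the memory cost stays proportional to the raster's height rather than its full pixel count.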

So far, still so good.

The problem came when Ali tried to use that same script to generate an area layer for some analysis she was doing at 10 m per pixel side at the equator: rather than generating an area-per-pixel raster that was 1 pixel wide, it made one that was 2 pixels wide. Yirgacheffe does some sanity checks when you use the special mode for these maps, verifying that what you've passed it is a single pixel wide, and so it was rejecting the new map. I dug into this, and once again it was floating point weirdness that was biting me.

When you generate a GeoTIFF you need to specify the spatial location of the pixels. So when I generated my 1 pixel wide image, I was giving it a location of -180˚ longitude, which sort of made sense to me, as in Yirgacheffe I then expand all the pixels out to the right. This meant that in the internal logic I was generating a map that goes from -180˚ to (-180˚ + size of pixel), and despite Ali using a pixel size of 0.0001, which seems to a human like a nice round number, when pushed back and forth through floating point it turns out that -180.0 + 0.0001 comes out ever so slightly larger than -179.9999, and Yirgacheffe, when sizing GeoTIFFs based on area, will always round up so as to not lose data, and thus we end up at 2 pixels. To make things more icky, if you specify a pixel scale of 0.000100000000001 it all works, as the floating point approximations play nice.
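You can reproduce the off-by-one at a Python prompt (a minimal illustration of the arithmetic; the actual sizing logic lives inside Yirgacheffe):

```python
import math

left = -180.0
pixel_scale = 0.0001
right = left + pixel_scale  # intended to be -179.9999

# The computed width comes out a hair over one pixel's worth of degrees...
width_degrees = right - left
print(width_degrees)  # slightly more than 0.0001

# ...so rounding up "so as to not lose data" yields a 2 pixel wide raster.
print(math.ceil(width_degrees / pixel_scale))  # 2
```

Neither -179.9999 nor 0.0001 is exactly representable in binary floating point, and at a magnitude of 180 the representation error is large enough to tip the division just over 1.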

Fixing this properly is awkward: I think I need to track in Yirgacheffe whether you really wanted to make a raster that runs from -180.0 to -179.9999-and-a-bit, or whether you just wanted to make something 1 pixel wide, and I didn't really have time for that plumbing. So I filed a bug against myself, and just moved to anchoring the area-per-pixel raster at 0˚ longitude, as the maths works there, and internally the special mode in Yirgacheffe for expanding these area maps never checks the longitude, just the latitude. Not proud, but it got Ali unblocked before I vanished for a week.
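For what it's worth, the same arithmetic anchored at 0˚ behaves itself (again just an illustration, assuming the same round-up-based sizing): adding 0.0001 to 0.0 introduces no new rounding error, so the width divides out to exactly one pixel.

```python
import math

left = 0.0
pixel_scale = 0.0001
# 0.0001 isn't exactly representable either, but adding it to 0.0 is exact,
# so the width is bit-for-bit the same value as the pixel scale.
right = left + pixel_scale

width_degrees = right - left
print(math.ceil(width_degrees / pixel_scale))  # 1
```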

Self-hosting fails

I got bitten once more by my choice of a Raspberry Pi as my self-hosting platform, when I tried to share my slides from Nordic-RSE, which take the form of a 17 MB PDF. It turns out that my quick and admittedly bad approach to handling static files, "read all the data into memory then pass it to the web framework", broke, as OCaml's default maximum size for a string on a 32-bit system (which Raspbian Linux is, despite the CPU being 64 bit) is 16 MB. I could just change an environment variable for this, but I really shouldn't be loading files like this anyway, so I thought (sat in Gothenburg airport, where they have eduroam) I'd try to do this properly and serve the data in a more stream-like way.

Looking at the Dream source code (Dream is the OCaml web framework I use), it uses Lwt under the hood for this. Lwt is one of the many ways of doing concurrent work in the OCaml ecosystem, and I've been trying to avoid learning it: if I tried to learn every competing concurrency framework for OCaml I'd be late for dinner, and our group is in team Eio, so that's where I was going to invest my time at some point. Anyway, the code was small enough that I could just borrow the Lwt bit of the static loader from the Dream implementation (which I'm not using directly because it doesn't set Last-Modified headers).

This works, but I still can't share my slides, as now I get an error of Invalid_argument("Bytes.create") from somewhere within Dream when I use a large file - I assume I'm hitting a similar limit to the one for strings - which implies the Dream/Lwt implementation I based my updated static file handler on isn't as clever as I hoped.

And there the matter rests for now. My slides weren't that interesting anyway (as I led a discussion session).

This Week

Open Hardware Summit

This will be another somewhat short week, as I need to head up to Edinburgh for Open Hardware Summit 2025, an event I last attended in Denver in 2017. Most of it isn't directly EEG related, though there is a panel on environmental monitoring which might relate to the Terracorder work Josh is doing, so I'll try to get to that.

Edge effects

I need to take a stab at implementing edge effects on habitat rasters. I've been reading papers to see what others do, and I still have some unanswered questions about the nuances of this, but I think it's probably at the stage where I should make a thing just so I can get a sense of how good or bad it is, and then refine from there.

Write up Nordic-RSE

I need to write up both my discussion session, and a general overview. So many notes, so many good ideas and learnings.

GBIF processing

If there's any time left, I still need to get into processing occurrence data for species based on GBIF data. This might be a good chance to see if the performance improvements DuckDB announced for spatial joins are meaningful for the sort of thing I do.

Tags: nordic-rse, self-hosting, yirgacheffe