Weeknotes: 12th May 2025

Last week

Part II project submissions

I'm supervising a couple of part II projects (for those not at Cambridge, part II is what they call the third and final undergraduate year here), and the submission deadline is at the end of this week, so last week I was reviewing draft reports for both of them.

Storage

Mark, Anil, and Patrick had a useful discussion about our plans for large-scale storage of data in our group. When I started in the EEG a few years ago we had a 128 TB disk that seemed like it'd last forever: six months ago it filled up, and despite attempts to garbage collect, it has remained stubbornly full ever since. We could just build it up to be even bigger, but I think we'd rather learn from this and try to do something else that straddles the line between "not being annoying overhead when your supervisor is demanding progress updates" and "why is our large/expensive storage system full of data no one needs or cares about any more, and how do we weed that out from the precious data we must preserve".

We came up with some plans around how we can use ZFS to create datasets for people and then realise them on demand on our various compute servers; these can then act as a unit of either garbage collection or publication as and when a project wraps up. I think this is fine in theory, but it still needs a bit more detail put into the design, and I'm worried that (at least speaking for myself) it's too easy for this project to be pushed back compared to other near-term demands on our time. To be clear, this isn't a criticism of others, just of myself, as I know I'm a bit overloaded for the next month or so, and I'm trying to use my weeknotes here as a stick with which to beat myself later :) Or it's another way of saying Mark and I should have coffee soon :)

OCaml GeoTIFF progress

I stalled a bit on the OCaml GeoTIFF work, as it turns out trying to write to a TIFF file is a messy business. I alluded to some of this last week, in terms of the challenges with data fragmentation within a TIFF if you are using compression and update the file, but I realised the same is true of the tagged metadata too: tags whose values don't fit within the directory entry itself don't store the value inline, but rather an offset to where the value is stored, and that makes it awkward to build up the metadata block on disk incrementally. The flexibility of TIFF is clearly a feature, but it does make it more challenging to write.
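
To make that concrete, here is a rough sketch of the shape of a TIFF directory entry (illustrative types of my own, not OCamlTIFF's actual ones): each 12-byte entry either packs the value into its last four bytes or, when it doesn't fit, stores a file offset to it instead, and it's that offset you have to know before you can write the directory.

```ocaml
(* Illustrative sketch of a TIFF IFD entry, not the actual OCamlTIFF types.
   Each entry is 12 bytes: tag, field type, count, then a 4-byte field that
   holds the value itself if it fits, or an offset to it elsewhere in the file. *)
type field_type = Byte | Ascii | Short | Long | Rational

let size_of_field_type = function
  | Byte | Ascii -> 1
  | Short -> 2
  | Long -> 4
  | Rational -> 8

type value_location =
  | Inline of int32   (* value packed into the entry's last 4 bytes *)
  | Offset of int32   (* file offset of the value, somewhere else on disk *)

type ifd_entry = {
  tag : int;
  field_type : field_type;
  count : int;
  location : value_location;
}

(* A value only fits inline if count * size <= 4 bytes; anything bigger has to
   be written out elsewhere first so its offset is known before the IFD can be
   emitted, which is what makes incremental writing awkward. *)
let fits_inline ft count = count * size_of_field_type ft <= 4
```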

The first stage, then, in getting the OCamlTIFF library ready for writing is to change how it does reading. Firstly, it currently loads metadata on demand from disk when a getter function is called, but I think we need to move to loading all the metadata into a struct, so that we can conversely build that struct up for writing and emit it in a single pass. Similarly, I started to add data-strip caching, which is actually a useful feature anyway: currently the data is fetched from disk each time it is accessed, so a block cache (which is configurable) will be useful for applications where you read the same data a lot (e.g., the base habitat maps in the AoH calculations I do), and it also gives me a place to store data being written to the TIFF before it is flushed out to disk.
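
As a sketch of the direction I mean (simplified, with made-up field names rather than the real OCamlTIFF types): parse all the metadata into one record up front, and keep recently read strips in a small cache keyed by strip index.

```ocaml
(* A simplified sketch, not the real OCamlTIFF interface: parse the metadata
   into one record up front, rather than seeking to each tag on demand. The
   same record can then be populated in memory and written out in one pass. *)
type metadata = {
  width : int;
  height : int;
  bits_per_sample : int;
  rows_per_strip : int;
  strip_offsets : int array;
  strip_byte_counts : int array;
}

(* A minimal strip cache: strip index -> decoded bytes. A real implementation
   would bound its size and evict, but a Hashtbl shows the shape of the idea. *)
type strip_cache = (int, bytes) Hashtbl.t

let get_strip ~(cache : strip_cache) ~read_from_disk index =
  match Hashtbl.find_opt cache index with
  | Some data -> data
  | None ->
      let data = read_from_disk index in
      Hashtbl.add cache index data;
      data
```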

I also fell over on some of the clever typing that OCamlTIFF uses, which is inherited from the typing used by the OCaml Bigarray library, and I need to sit down with Patrick at some point and make sure I understand what's going on there.
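
For my own future reference, here is a tiny illustration of the sort of Bigarray typing I mean (my sketch, not OCamlTIFF's actual signatures): the kind value carries both the OCaml element type and the underlying representation as type parameters, so the caller's choice of kind flows through to the type of the array you get back.

```ocaml
(* The Bigarray kind ties together the OCaml type you see ('a) and the
   underlying element representation ('b), e.g. float vs float64 storage.
   A function polymorphic over kinds returns data whose element type is
   driven by the caller, which is (roughly) the trick OCamlTIFF inherits. *)
let make_band (type a b) (kind : (a, b) Bigarray.kind) len
    : (a, b, Bigarray.c_layout) Bigarray.Array1.t =
  Bigarray.Array1.create kind Bigarray.c_layout len

let floats = make_band Bigarray.float64 1024       (* float elements *)
let bytes_ = make_band Bigarray.int8_unsigned 1024 (* int elements, byte storage *)
```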


As a bit of fun I also continued to tie together my little bits of map visualisation code, which still proves to be a good debugging tool: adding the incremental loading of raster data you see in the last example in this video showed up some subtle bugs in the OCamlTIFF library's handling of loading data in chunks that don't necessarily align with the way the data is striped in the file. It was also a good excuse to refresh my memory on parallel programming in OCaml, running the loading and the rendering in parallel.
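
The structure of that is roughly the classic producer/consumer split; here is a minimal sketch, assuming OCaml 5 domains, and not my actual visualiser code:

```ocaml
(* A minimal sketch of the shape of it: one domain loads raster chunks while
   the main domain renders whatever has arrived so far, handing chunks over
   via a mutex-protected queue. *)
let run_parallel ~load_chunk ~render_chunk ~chunk_count =
  let mutex = Mutex.create () in
  let pending : (int * bytes) Queue.t = Queue.create () in
  let finished = ref false in
  (* Loader domain: fetch every chunk and queue it for rendering. *)
  let loader () =
    for i = 0 to chunk_count - 1 do
      let data = load_chunk i in
      Mutex.lock mutex;
      Queue.push (i, data) pending;
      Mutex.unlock mutex
    done;
    Mutex.lock mutex;
    finished := true;
    Mutex.unlock mutex
  in
  (* Render loop on the main domain: draw chunks as they become available. *)
  let rec render () =
    Mutex.lock mutex;
    let item = if Queue.is_empty pending then None else Some (Queue.pop pending) in
    let all_done = !finished && Queue.is_empty pending in
    Mutex.unlock mutex;
    match item with
    | Some (i, data) -> render_chunk i data; render ()
    | None -> if not all_done then (Domain.cpu_relax (); render ())
  in
  let d = Domain.spawn loader in
  render ();
  Domain.join d
```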

I suspect the slowness you see with respect to the data loading here is down to my somewhat naive implementation of LZW. I do wish I had a bit of time to explore the TIFF compression space in general. I've occasionally done point explorations of the different options for how TIFF stores data, but it'd be good to do a full matrix of the different compression options and tiled versus striped layouts for the AoH-style calculations I do.

In particular, I'd also throw into this what seems to be an ad-hoc standard that GDAL has adopted for sparse file support (at least, a cursory glance through the official GeoTIFF specification didn't show this as being a standard feature). It seems that if you just encode the data offset and length of a strip as zero, then GDAL will assume the strip to be zero, or whatever NODATA value the file has specified. Given that I spend a lot of time working with terrestrial species, it feels like switching from strips to tiling and not encoding the oceans would be a nice little earner, both performance- and storage-wise.
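
In code, the convention as I understand it from GDAL's behaviour (this is my own sketch of it, assuming 8-bit samples for simplicity, not anything blessed by the spec) would look something like:

```ocaml
(* Sketch of the sparse-strip convention as I understand GDAL's behaviour:
   if a strip's offset and byte count are both zero, treat it as implicitly
   filled with the file's NODATA value (or zero if none is set), rather than
   reading anything from disk. Assumes 8-bit samples to keep the fill simple. *)
let read_strip ~strip_offsets ~strip_byte_counts ~read_at ~nodata ~strip_size index =
  let offset = strip_offsets.(index)
  and length = strip_byte_counts.(index) in
  if offset = 0 && length = 0 then
    (* Sparse strip: never touch the disk, just synthesise the fill value. *)
    Bytes.make strip_size (Char.chr (nodata land 0xff))
  else
    read_at ~offset ~length
```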

STAR

In June we have an IUCN Red List workshop, and one of my goals is to ensure that my STAR pipeline is runnable by Simon Tarr by then. We started on this the week before, and last week I finally updated my copy of the IUCN Red List from the 2023-1 release to the 2025-1 release and did a full run through. This shook out a couple of manual bits of the pipeline I'd still not added to the shell scripts I provide for people to run it.

Self hosting

Back in March I accidentally made my self-hosting world worse by migrating from a well-specced VPS to a Raspberry Pi - I wrote about this at the time, but whilst I thought the Pi might be a little slower, I wasn't prepared for it to be an order of magnitude slower, even after I'd taken steps to speed things up for the new hosting setup.

As an aside, in those weeknotes I talked about how moving from using Camlimages in my OCaml-based blog hosting software to just calling out to ImageMagick took image processing down from 29 seconds to 4 seconds an image. Since then I've switched from ImageMagick to GraphicsMagick, after seeing a Reddit post suggesting it was faster on a Raspberry Pi, and it did indeed take me down from around 4 seconds to 3.5 seconds, but that's still an order of magnitude too slow.
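
The shelling out is nothing clever; a simplified sketch (not my actual site code) looks something like this:

```ocaml
(* A simplified sketch of shelling out to GraphicsMagick to resize an image;
   real code would handle more sizes and errors, but the shape is this. *)
let resize_image ~src ~dst ~width =
  let cmd =
    Printf.sprintf "gm convert %s -resize %dx %s"
      (Filename.quote src) width (Filename.quote dst)
  in
  match Sys.command cmd with
  | 0 -> Ok ()
  | code -> Error (Printf.sprintf "gm exited with %d" code)
```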

At the time I thought I'd just live with it, but the sucky performance isn't just making me sad about how my websites behave, it's also stopping me working on other improvements I had planned for them. We have a bunch of momentum around blogging culture in our group at the Computer Lab now that is exciting, and I was on the leading edge of that, but I've since fallen behind.

At the weekend I started to migrate my long-term backups from Backblaze B2 to Hetzner Cloud Storage, and whilst clicking around Hetzner's VPS offerings I noticed that their ARM Ampere based servers are reasonably priced/specced, and are green-energy certified! So I'll try kicking the tyres on one of those at some point soon, only this time I won't pay for a year in advance as I did for the Pi :)

Weeknotes meta

On the topic of weeknotes, here are some other weeknotes I've read this week that talk about weeknotes.

Firstly this, by Jon Sterling, where he muses on how our group's weeknotes culture fosters a sense of team despite us not often all being in the office together. I particularly liked this quote:

Blogging is not an alternative to meeting and talking in person; but I am starting to think that it is a prerequisite for the moments of serendipity that the latter can engender, because the ongoing dialogue of blogs and weeknotes makes me sufficiently informed to have a conversation that goes beyond the superficial.

Another frequent weeknoter I follow is Adrian McEwen, and in his latest update he shared a link to 28 slightly rude notes on writing by Adam Mastroianni. Point 18 stuck with me: it's a response to an earlier point about what motivates people to write beyond "my course assessor said I had to". I felt I had a good handle on that, as I try to focus on writing things where the reader will take something away they didn't know before they started (hence heavy linking, a focus on failure as being more interesting, etc.), but Adam challenges this:

Usually, we try to teach motive by asking: “Why should I, the reader, care about this?”

This is reasonable advice, but it’s also wrong. You, the writer, don’t know me. You don’t have a clue what I care about. The only reasons you can give me are the reasons you could give to literally anyone. “This issue is important because understanding it could increase pleasure and reduce pain.” Uh huh, cool!

What I really want to know is: why do you care? You could have spent your time knitting a pair of mittens or petting your cat or eating a whole tube of Pringles. Why did you do this instead? What kind of sicko closes the YouTube tab and types 10,000 words into a Google doc? What’s wrong with you? If you show me that—implicitly, explicitly, I don’t care—I might just close my own YouTube tab and read what you wrote.

I suspect you need to go through caring about the reader to get to the first level of writing (see: having read two part II report drafts this week...), but I think it'll be interesting to try to be more conscious/deliberate about why I think things are interesting, rather than just relying on that happening naturally when I blog.

Which I didn't do this week, obviously.

This week

Next week I'll be hosting a discussion on lineage at the Nordic RSE conference, so outside of other scheduled duties this week I need to be preparing for it. If you see that I'm not, please ask me why, as currently I just have a vague idea in my head of how it'll go, and that won't cut it when stood in front of a large audience of RSEs and consuming an hour of their time!

Tags: ocaml, star, self-hosting, weeknotes