Weeknotes: 17th March 2025
The week before last
Life maps
As the outputs of the LIFE project we generate two maps: one showing the impact on extinction risk of converting land to farmland (aka the arable scenario), and another showing the impact of converting human-changed land back to its predicted natural state (aka the restore scenario). Those maps, though, are data maps, containing fractional values in each pixel, so they aren't something you can look at directly; instead, when presenting the work I tend to lead with a "pop-science" version generated by QGIS when you load the maps. Here you can see the map for the arable scenario, showing that putting a farm in a tropical forest is a sub-optimal idea:

And here is the map for the restore scenario, which has less visible change, as there is less area to convert:

The different colours come from the different channels in the rasters, which are split out by taxonomic group: amphibians, birds, mammals, and reptiles. However, there are two problems with using QGIS for this task:
- Getting the maps out of QGIS at their full resolution of 21600x10798 seemed a task beyond me, and I'd had requests for something better than just a desktop-sized screenshot of the QGIS rendering.
- The rendering just maps three of the layers to the red, green, and blue channels of each pixel, but there are four layers, which means reptiles are not actually in the pop-sci maps.
I've been meaning for a while to write my own rendering, so I finally spent some time trying to solve this, and it was a bit trickier than I'd naively assumed. My assumption of how QGIS was doing this was that it'd take the min and max values for each channel, set those to the 0 and 255 points on each colour channel, and then map between them. This however produced very pale maps with details hard to see. I realised that the problem was that QGIS and my code disagreed on what the min and max were: I had both a lower min and a higher max. I thought at first this was because GDAL, which QGIS uses for a lot of this stuff, doesn't usually get the absolute min and max; it uses an approximate guess based on reading just a subset of the data. So I changed my code to use GDAL's approximate min/max rather than the actual min/max, but that didn't fix it.
In the end credit goes to Alison, who pointed out that by default QGIS also throws away the top and bottom two percentiles of the data (a configurable option), which gets rid of a bunch of extreme values at both ends. Thankfully I could use numpy's nanpercentile for this, and finally I got matching maps based on just using the first three taxa.
It's easy to see why using nanpercentile generates more pleasing maps, as you get more of the data at the two ends of the spectrum close to the visual ends of the appropriate colour channel. However, I do need to find an alternative approach, as nanpercentile relies on knowing all the data beforehand, whereas for performance and resource reasons I really want to be processing the files in chunks, so I have a todo there still.
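For what it's worth, here's a rough sketch (not the actual project code, and the band and function names are mine) of the scaling described above, using numpy's nanpercentile to clip each band before stretching it onto a colour channel:

```python
import numpy as np

def band_to_channel(band, lower=2.0, upper=98.0):
    """Clip a taxon band to its 2nd/98th percentiles (ignoring NaNs),
    then stretch that range onto 0-255, mimicking QGIS's default
    cumulative-count cut."""
    low, high = np.nanpercentile(band, [lower, upper])
    scaled = (np.clip(band, low, high) - low) / (high - low)
    return np.nan_to_num(scaled * 255.0).astype(np.uint8)

# amphibians, birds, mammals stand in for three raster bands read as float
# arrays; stacking them gives an RGB image with reptiles still left out.
def render_three_taxa(amphibians, birds, mammals):
    return np.dstack([band_to_channel(amphibians),
                      band_to_channel(birds),
                      band_to_channel(mammals)])
```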
I also failed to get it to look good when I mixed in the fourth taxon (The 4th Taxa is my next band name). My (again naive) assumption was that I'd just map each taxon to an RGB value, and scale all three channels of that RGB by the taxon's value within its range, but that doesn't work well when you then add the multiple scaled taxon values together, as some pixels will blow out, and if I make the RGB values lower so that doesn't happen, the map looks dull and muddy. I need to add some non-linear scaling function here I fear, but I ran out of time on this, so that's my second todo here.
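To illustrate that second todo, here's a hedged sketch of the naive additive blend plus one possible non-linear squash; the per-taxon anchor colours are made up, and this isn't what the pipeline currently does:

```python
import numpy as np

# Hypothetical anchor colours per taxon, as 0.0-1.0 RGB triples.
TAXON_COLOURS = {
    "amphibians": (0.0, 0.6, 0.2),
    "birds":      (0.2, 0.2, 0.9),
    "mammals":    (0.9, 0.3, 0.1),
    "reptiles":   (0.8, 0.8, 0.1),
}

def blend_taxa(scaled_bands):
    """scaled_bands maps taxon name to a 0.0-1.0 array (already
    percentile-scaled as above)."""
    height, width = next(iter(scaled_bands.values())).shape
    rgb = np.zeros((height, width, 3))
    for taxon, band in scaled_bands.items():
        rgb += band[..., np.newaxis] * np.array(TAXON_COLOURS[taxon])
    # The naive sum above can exceed 1.0 where taxa overlap; rather than
    # clipping (blow-out) or shrinking the anchor colours (muddy), one
    # option is a smooth non-linear compression of the sums towards 1.0.
    rgb = rgb / (1.0 + rgb)
    return (rgb * 255.0).astype(np.uint8)
```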
Last week
Nordic-RSE
I made three submissions to Nordic-RSE: a discussion-session proposal on building lineage into pipelines (I'm interested in what, if anything, others do about this), a lightning talk on Yirgacheffe, and a full talk giving a retrospective on the LIFE pipeline as a lessons-learned sort of thing.
The most interesting thing about the submission process was that a few days before the deadline they organised a "night of unfinished abstracts", which was a two-hour Zoom call where people could come along, discuss submission ideas, ask questions of the organisers, and work on their abstracts. This feels like such a good idea that I'm surprised I don't see it more: it's relatively low cost to run, it allows those feeling unsure about their ideas to get encouragement and help, and in general it helps everyone focus their submissions better. I went in with two submission ideas, and got some good feedback on the discussion proposal (had I thought about how it'd be facilitated, what the outcomes would be, etc.), and I learned that retrospectives are of interest, so I added that last idea.
Should I have the misfortune of organising a conference in the future, it's definitely something I'd attempt to run.
Habitat maps
This week I re-read the papers behind the two habitat maps used in the Area of Habitat calculations by LIFE and STAR respectively:
- A global map of terrestrial habitat types by Jung et al
- Translating habitat class to land cover to map area of habitat of terrestrial vertebrates by Lumbierres et al
Like many papers in this field, I read them a while ago when I was starting out, but it's only now, re-reading them after being immersed in the field for a couple of years, that I appreciate what they're saying and why.
To calculate the area of habitable land for a species, one of the obvious inputs is a map of what habitats there are on the planet: because we use species data from the IUCN Red List, we want a map where we know the equivalent IUCN habitat classification for each pixel. Habitats are funny old things, and I don't claim to know even enough to match a first-year ecology undergrad here, so please take all this with a pinch of salt, but the important thing to note is that habitats aren't just about what is on the ground (is it sand or plants or concrete, etc.), but also about the conditions around it: is it usually damp or dry, and so forth. So you can't just take a photo of the earth and declare habitat from it; you need to know more, and that's what both these papers address, each using a different technique.
Jung et al take a land-cover map, which is the layer that says what the ground is like (the "sand or plants or concrete" layer from above), specifically the same Copernicus Land Cover at 100m layer that Lumbierres et al also use, combine it with other data sources on climate and biomes, and use a simple decision tree to decide, for each pixel, whether it meets the requirements for a specific IUCN habitat class. Once they have the map, they take a set of species point-observation data (i.e., data that says "oh, I saw a lesser spotted warbler at this location"), look up the habitat preference data for that species on the IUCN Red List, and attempt to validate their map from that (i.e., at the location of the observation, does the map match one of the habitat classes of the species). The top-level results are not amazing, at 62% for the top-level categories of the IUCN habitat classification scheme, and 55% for the more detailed breakdown - however, animals do tend to move around, and the observation data is notoriously difficult to assess.
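To make the decision-tree idea concrete, here's a toy version; the rules and labels are invented for illustration and bear no relation to the actual Jung et al rule set:

```python
def classify_pixel(land_cover, climate, is_wetland):
    """Map one pixel's land cover plus extra context (climate, wetness)
    to an IUCN-style habitat label. Purely illustrative rules."""
    if is_wetland:
        return "Wetlands (inland)"
    if land_cover == "forest":
        if climate == "tropical":
            return "Forest - Subtropical/Tropical Moist Lowland"
        return "Forest - Temperate"
    if land_cover == "herbaceous":
        return "Grassland"
    return "Unclassified"
```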
Lumbierres et al, on the other hand, don't actually produce a new map; rather, they create a crosswalk table that lets you go from IUCN species habitat classifications to the land classes in that Copernicus land-cover map. That is, if a species likes "rocky areas" in IUCN terms, the table tells you the set of codes in the Copernicus map it will be able to use. This is interesting as it means several classifications in the IUCN species data can map to a single pixel in the Copernicus map, which isn't the case with the Jung et al map (as I understand it anyway). The way Lumbierres et al build this table is by taking a lot of species observation data, doing a bunch of hygiene work to try to improve its quality, and then using statistics to say "well, these species are found in this pixel, and they prefer these habitats, so it's probably one of these classes". Their validation also used observation data held aside for this purpose, and seemed to show accuracy of between 64% and 94% depending on the land class.
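In code terms a crosswalk is essentially just a lookup table; here's a minimal sketch with placeholder codes, not the published Lumbierres et al values:

```python
# Placeholder mapping from IUCN habitat classes to Copernicus land-cover
# codes; the real values come from the published Lumbierres et al table.
IUCN_TO_COPERNICUS = {
    "Forest - Temperate": {111, 114, 116},
    "Grassland": {30},
    "Rocky areas": {60},
}

def usable_landcover_codes(species_habitats):
    """Union of land-cover codes a species can use, given its list of IUCN
    habitat preferences (the 'rocky areas' example from the text)."""
    codes = set()
    for habitat in species_habitats:
        codes |= IUCN_TO_COPERNICUS.get(habitat, set())
    return codes
```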
Whilst the Lumbierres et al results look like better numbers than Jung et al's, I don't think they're necessarily comparable due to the different techniques. Given Lumbierres et al have a process for fitting observation data to a map and then testing it with yet more observation data, I'd expect a high correlation when tested with that data (otherwise they'd have tried again): basically, the Lumbierres et al approach accepts the uncertainty in the observation data (albeit after cleaning it up), and so it's a mapping that allows for that. Which could be totally fine, as the species were seen there and the data may be perfect, but I guess it doesn't allow for someone, say, spotting me in London when I've gone to see something there, when I really only habitat a small town in the fens. The Jung et al map, I feel, is trying to build a habitat model and then test it, but we can't really say how well it's done either, for similar reasons. It does feel like the decision tree is a bit simplistic given the diversity of landscapes and climates we have, and I'd expect Lumbierres et al to capture more of that complexity through their statistical methods, allowing for leopards nipping to Tate Modern for a day out, etc.
I have no real sense of which is better: I'm just a compsci pretending to be an ecologist some of the time, but I do think it's been useful to understand where both come from and start to ponder what else might be done between these two methods.
Self hosting
Self-hosting is liberating, but comes at the cost of having no one to blame but yourself when things go a bit Pete Tong.
My plan was simple: I wanted to consolidate my two VPSs down to a single instance, and to move from Linode to Mythic Beasts, as I already host some other bits there. The two VPSs had started small but, as tends to happen, had grown over time as my needs changed, so I was paying over the odds and having to keep two machines up to date, and a bit of a reset felt in order.
My mistake was, as ever, getting sucked into trying something new. I spotted Mythic Beasts do hosted Raspberry Pis, which in theory should be enough for what I need: a few sites run with my own OCaml-based semi-static-site framework, my GoToSocial instance, my Matrix homeserver, and the PostgreSQL database instance for those last two. I felt a 4-core Pi with 8GB of RAM was plenty for this, and so decided to give that a whirl rather than go for another VPS.
This was probably a poor choice.
As I've written up in the past, my website service is semi-dynamic: it is laid out on disk like a static website, similar to Jekyll or Hugo (and indeed for many years it was a Hugo static site), but I don't compile images to the right sizes ahead of time, as that's a bit of a waste for a site like mine: I have thousands of images, but most of my website is never visited, so it's a waste of energy and disk space doing all that work up front. Instead, Webplats renders images to the appropriate format (thumbnail or article image, retina or non-retina version, etc.) on demand. It renders into a small cache, so frequently used images are fast, but otherwise the first time an image is requested there will be a little delay - but these are my personal websites, they don't need to be super responsive.
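The pattern itself is simple enough to sketch; Webplats is OCaml, but roughly speaking (in Python, with made-up names and paths) it boils down to:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/cache/webplats")  # hypothetical cache location

def rendered_image(source, width, retina, render):
    """Return the path to a resized image, rendering it only on first request.
    `render(source, destination, pixel_width)` is whatever resizer is plugged
    in (Camlimages originally, ImageMagick after the workaround below)."""
    pixel_width = width * (2 if retina else 1)
    key = hashlib.sha256(f"{source}:{pixel_width}".encode()).hexdigest()
    destination = CACHE_DIR / f"{key}{source.suffix}"
    if not destination.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        render(source, destination, pixel_width)  # slow path, first hit only
    return destination
```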
However, I noticed things on my new Pi server were incredibly slow, and this then exposed another problem: Dream, the OCaml library I'm using in Webplats to do all the HTTP bits, does not support parallelism, so any slow image rendering basically stopped my site working at all.
I did some benchmarking to try to work out what was going on: is the Pi really that slow? Are the OCaml image libraries just bad on the Pi? Is it something else, like disk access speed on the Pi?
The first test was to benchmark resizing an image with ImageMagick on my M1 MacBook Pro laptop, my old Linode VPS, and my new hosted Pi 4. I picked ImageMagick as it's been around a good while and I trust it to have been sensibly written with respect to performance. I used this image as my test, resizing it from its native 3581x5371 down to the 1033x1600 it's shown at on that page. The numbers were:
- M1 MacBook Pro: 0.8 seconds
- Linode VPS: 0.8 seconds
- Raspberry Pi 4: 3.9 seconds
Ouch, so yes, quite a bit slower. I'm impressed at the speed of the Linode VPS, and when I had a look at what it's hosted on, it turns out to be an AMD EPYC server, similar to our big compute machines in the lab, so pretty well spec'd.
However, I felt 3.9 seconds was still a lot better than what I was seeing with Webplats, so I dug some more. Webplats uses Camlimages for image resizing, which is the only image library I could find in OCaml that supports all the common image types I need to handle. I had a look at the image-resize code in Camlimages, and it's nicely written, but clearly not performance optimised (and to be clear, it never claims to be). So I made a quick standalone example and ran that, and then I could see why my website was so bad. Doing the same image test as before:
- M1 MacBook Pro: 0.9 seconds
- Raspberry Pi 4: 29 seconds
Oooooh, that explains what I'm seeing then :) It seems I have a double whammy of problems: the Pi is slower anyway (as shown by ImageMagick's numbers), and the behaviour of OCaml in this context makes it much worse. Like I say, the code in Camlimages isn't written for performance, but rather for clarity, and so isn't trying to be efficient about how often it accesses memory and so on, but even so that's quite the slowdown.
I need to dig more into this, but my Sunday afternoon was also vanishing away from me, so for now I've enacted the obvious workaround rather than investing time in trying to understand where the OCaml-on-Pi hit is coming from: I changed Webplats to use ImageMagick :) My reasoning is that Webplats was writing the new image to disk anyway, as it caches the image, and the ImageMagick code is likely to be pretty good performance-wise, so I was unlikely to beat it with just an hour of hacking at OCaml.
So not a great result: the Pi is still slower, so I need to make Webplats use OCaml domains at some point to push image processing off the main thread, and I have an unsolved mystery about why Camlimages is so slow in this context, but there are only so many hours in the day, and I've still not migrated the rest of my bits off my old VPSs!
Welcome to the self-hosting rabbit hole :)
This week
I'm remote all this week, working from up on The Wirral. My main todos are:
- I've got a bunch of habitat data (aha) for Brazil that claims to be more detailed than either of the above global maps, so I need to look at whether this is something we can technically work in, as well as talk to ecologists about whether it's something we should work in (cue quote from Dr Ian Malcolm).
- Last week I finally ported over to the STAR pipeline the decision-reporting code I added to LIFE to record why species were accepted or rejected, and I can now try to use this to get to the bottom of why Chess and I get slightly different answers from our respective implementations of the STAR species-selection process (to be clear, Chess is the original author, so her implementation is definitively correct, and I've got a bug somewhere!).
- I had an idea for some work that might take advantage of what Frank has been working on with foundational models, so I need to get him some data for that so we can do an initial test of that idea.
- Outreachy has started, so I'll be helping candidates for that find things to work on as I'm running one of the projects there.
- Finish moving my world over to my new hosting setup if there's any time.
Tags: life, self-hosting, habitats