The trials of area per pixel maps, or Michael's Holistic Detective Agency

17 Feb 2026

Tags: yirgacheffe, life, nemesis, dirk gently

This is a bit of a meandering blog post, which was meant to be about one thing, but then I had to pull in some other bits to give context, and now it feels a little incoherent. However, I do think that is somewhat part of the broader point I'll try to make at the end, about the challenge of solving problems with lots of little bits to them. So sorry, not sorry; but I'll understand if you get bored and wander off halfway through :)


After my undergrad degree I worked for a year as a research assistant before going on to do my PhD in the same department. I was part of the Systems Research Group (SRG), working for one of my undergrad lecturers, Richard Black, on the Nemesis operating system (the project that would pave the way for the Xen hypervisor). I was one of the top students in my year, so felt I knew a thing or two, but whilst I knew a lot of book-work, what I didn’t realise was how much I lacked in applying that knowledge.

I remember in particular one group meeting early on in my RA tenure: we were discussing some performance problem - I can’t remember the specifics, other than that one of my colleagues was struggling with some code running poorly on their computer. Richard took one look at the problem, and immediately identified the issue as being a buffer size in the network layer on another computer that the first computer was talking to. My mind was blown: how could he do that so quickly? He spotted not only the root cause of the issue, but also that it was on some other machine, not even the one where the symptoms were showing! I was in awe of this feat of deduction.

Whilst Richard is indeed quite clever, what enabled him to make this quick deduction was a combination of experience and thinking of the problem not in isolation but as part of a broader context: that is to say, it was a systems approach. When I started in the SRG I used to think of the “Systems” part as being somehow a contraction of “Operating Systems”, as I was working on an operating system project there. However, I came to realise that “Systems” is more about how everything fits together: your application doesn’t run in isolation, it’s just a small part of a set of software and hardware all designed in isolation and all trying to work together, and occasionally there’s a small kink where they don’t line up and things start to misbehave or, in this case, behave slowly. A chain is as strong as its weakest link and all that jazz. An important lesson for young Michael, which is why that meeting has stuck in my memory so well.

Recently I was reminded of Richard’s feat of deduction once more when talking to Alison Eyres, my regular ecology collaborator. Ali has been preparing to give a talk looking back at some of the work we did together over the last few years speeding up the LIFE biodiversity metric data-science pipeline. Back when I started on the project it would take a few weeks to run on a Very Large Computer™, and now from cold it takes a little over a day, and if you’ve run it before, then just a few hours might be sufficient.

There was no one big win to achieve that significant improvement in performance: Ali and I worked together over several iterations, and we collaborated with other teams doing similar projects, such as the IUCN’s STAR team, to figure out how to get things to run quicker. However, in prep for her talk, Ali reminded me of one of the very early improvements I was able to make, which I’d forgotten about; it wasn’t a particularly challenging bit of computer science to solve, it was just a bit of systems thinking that showed that a problem which appeared to be over here was in fact caused by an innocuous little detail all the way over there.


At the core of the LIFE biodiversity metric is calculating the Area Of Habitat (AOH) of a particular species. An AOH calculation isn’t particularly complicated: you mix together a species’ rough range (where on the globe it might be found) along with high-resolution habitat and elevation maps, and use those to work out which land the species actually relies on. What is challenging though is the scale at which this calculation happens: the LIFE pipeline does around 120,000 AOH calculations at a resolution of around 100m per pixel (at the equator). The habitat and elevation rasters are roughly 400,000 by 200,000 pixels in size. If you consider that the extreme high-detail camera on top-end phones is 50 megapixels (MP), and more regular photos are 12 to 14 MP, these rasters are 80,000 MP, or indeed 80 gigapixels. Depending on how many bytes each pixel takes, these rasters need 150 to 300 gigabytes of RAM to hold.
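
If you want to sanity-check those numbers, a quick back-of-the-envelope calculation does the job (the two bytes per pixel here is just an assumption to illustrate; wider pixel types push you towards the top of that range):

width, height = 400_000, 200_000
pixels = width * height                    # 80,000,000,000 pixels in total
print(pixels / 1_000_000)                  # 80,000 MP, i.e. 80 gigapixels
bytes_per_pixel = 2                        # assume a 16-bit value per pixel
print(pixels * bytes_per_pixel / 1e9)      # ~160 GB uncompressed; 4-byte pixels double that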

Obviously with such large numbers, people want to find ways to save space. There are a couple of obvious tricks to this, but this is where things start to interact in unfortunate ways. The first trick is using compression on the image data. Whilst the uncompressed size of the rasters is in the hundreds of gigabytes, the image format used to store them on disk, GeoTIFF, typically stores the data using LZW compression, squashing them down quite considerably: looking at the data directory for my LIFE pipeline, the elevation raster is down to 10GB and the habitat map down to just 2.5GB! That’s quite a saving, but it comes at a cost, as we’ll see shortly.
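
If you're curious, GDAL's Python bindings will tell you how a given GeoTIFF is stored; the file name here is just a stand-in for whichever raster you want to inspect:

from osgeo import gdal

dataset = gdal.Open("habitat_map.tif")              # hypothetical file name
structure = dataset.GetMetadata("IMAGE_STRUCTURE")
print(structure.get("COMPRESSION"))                 # e.g. "LZW"
print(dataset.RasterXSize, dataset.RasterYSize)     # the full raster dimensions in pixels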

The other way we can save on space is by realising that most species we work with live in a small area of the world. For instance, the UK is a relatively small landmass, so if we pick a species that is endemic to the UK (that is a fancy way of saying only found in the UK), such as the Scottish Crossbill:

When processing the Scottish Crossbill we can just draw a box round its range, and we find that with our 100m per pixel (at the equator) habitat and elevation maps we now need to load just 6 MP of data - less than 0.01% of the original data! Even for creatures with a larger range, such as the cute Bush Hyrax, which lives along the eastern edge of Africa:

We're now up to 2,000 MP, which is a lot of mobile phone pictures, but just 2.5% of our original detailed rasters. And even the Moose, which is at times my friend but in this case my enemy, as their range covers a significant portion of the northern hemisphere:

Their range requires me to load 30,000 MP worth of data, and they are very slow to process as a result compared to the Scottish Crossbill, but it's still a significant saving over loading the entire satellite layers.
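
In GDAL terms this bounding-box trick is just a windowed read: you convert the geographic bounds of the range into pixel offsets and only pull that window into memory. A rough sketch, where the file name and the crossbill's bounding box are made up for illustration:

from osgeo import gdal

dataset = gdal.Open("habitat_map.tif")    # hypothetical global habitat raster
origin_x, pixel_width, _, origin_y, _, pixel_height = dataset.GetGeoTransform()

# A made-up bounding box around the Scottish Crossbill's range, in degrees.
min_x, max_x, min_y, max_y = -6.0, -2.0, 56.5, 58.5

# Convert the geographic bounds into pixel offsets and a window size.
x_off = int((min_x - origin_x) / pixel_width)
y_off = int((max_y - origin_y) / pixel_height)   # pixel_height is negative for north-up maps
x_size = int((max_x - min_x) / pixel_width)
y_size = int((min_y - max_y) / pixel_height)

# Only this window ends up in memory, not the whole 80,000 MP raster.
window = dataset.GetRasterBand(1).ReadAsArray(x_off, y_off, x_size, y_size)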


So far, so sensible: we can both compress the data so it takes up less room on disk, and we can cut out just the bit we need so it takes less room in computer memory. So what's the problem?

Well, there's one final layer the LIFE team were using when I started, which is a full 400,000 x 200,000 pixel raster in which each pixel stores the area in square meters that it covers. You might be scratching your head as to what this means, and even if you do understand the words, it might not be obvious why you'd want this. So let me quickly explain.

Any attempt to convert a roughly spherical object, like the surface of our planet, into a flat rectangle that you could put on a table, in a book, or onto a web page, will result in significant distortion. This is evident in the maps above when you see that Greenland looks bigger than either South America or Africa, but is in fact significantly smaller: Greenland is 2,166,086 km^2, South America is 17,840,000 km^2, and Africa is 30,370,000 km^2. This is because the width of a rectangular map is defined by the circumference of the planet at the equator, but the map insists we still fill edge to edge as we move to the poles, despite the circumference of the planet getting smaller as we move away from the equator. If we think of this as a bitmap image on a computer, it's just a grid of square pixels, so there are the same number of pixels in each horizontal row, but less land for them to share as you move to the top and bottom of the image. This is why I've been pedantically pointing out that the maps I use are 100m per pixel AT THE EQUATOR, because by the time you get to, say, Inverness in Scotland (where we might set off on an expedition to find some Scottish Crossbills), whilst the pixels are still 100m on the north-south axis, they're just ~55m wide on the east-west axis - almost half the area of the pixels at the equator, despite looking the same on a computer screen.
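
That shrinking width is just the cosine of the latitude at work, which you can check in a couple of lines (taking Inverness as roughly 57.5 degrees north, which is my approximation rather than a figure from the pipeline):

import math

# East-west pixel width shrinks with the cosine of the latitude.
print(100 * math.cos(math.radians(57.5)))   # ~53.7m, close to the ~55m above, versus 100m at the equator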

It's all a mess, and why we need 3D displays to stop us making these distorted map projections :) There are less distorted map projections one can use, but LIFE was using one of the quite distorted ones, so that is what I had to live with. The implication is that when calculating the Area Of Habitat for a species based on one of these distorted projections, we need to take each pixel we know the species inhabits, multiply it by its specific surface area, and then add them all up to get the total area.

Hence why the LIFE team had this extra area-per-pixel raster, where the value in each pixel is just the area in square meters that pixel covers. It's quite a simple image: although it is the same size as the other high-resolution habitat and elevation rasters, the values along each horizontal row are the same, as the projection distortion only varies on the north-south axis.
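
To give a sense of where those per-pixel values come from: for a degrees-based projection the area only depends on which row (which latitude) a pixel sits in, and a reasonable approximation is the standard formula for a patch of a sphere. This is just a sketch (assuming a spherical Earth and a 0.0009 degree pixel, which is roughly 100m at the equator), not the exact calculation the LIFE pipeline uses:

import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; treating the planet as a sphere keeps this simple

def pixel_area_m2(lat_top_degrees: float, pixel_size_degrees: float) -> float:
    # Area of the patch between two latitudes and two longitudes on a sphere.
    lat_top = math.radians(lat_top_degrees)
    lat_bottom = math.radians(lat_top_degrees - pixel_size_degrees)
    width = math.radians(pixel_size_degrees)
    return (EARTH_RADIUS_M ** 2) * width * (math.sin(lat_top) - math.sin(lat_bottom))

# Every pixel in the same row has the same area, so one value per row is enough.
print(pixel_area_m2(0.0, 0.0009))    # a pixel touching the equator: roughly 100m x 100m
print(pixel_area_m2(57.5, 0.0009))   # a pixel up near Inverness: roughly half that area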


So now we have all our pieces, and this is where I came in, trying to look at where one might speed up the LIFE pipeline. I did some initial profiling, and quickly realised that the area-per-pixel layer was actually a significant contributor to the run time: if we skipped this step the pipeline was notably quicker. But it's just another image, so why is that? This was my turn to apply systems thinking, as I'd watched Richard do all those years before. The moment I spotted that this was the source of the performance loss, I knew that what had seemed like two sensible space-saving tricks were in fact combining to slow things down.

I mentioned before that GeoTIFFs are typically compressed with LZW compression. This algorithm works by taking each horizontal row of the image and going left to right along that row, looking for common patterns in the data. As it goes it notes the patterns it sees, assigning each one a code number in a dictionary, and if it sees that pattern again it just needs to write out "this is the same as pattern 42", rather than the full pattern again. Statistically most data has enough repeated patterns in it that this makes the result smaller. The really clever bit about LZW, which I'm going to skip going into details on, is that you don't need to store the dictionary of patterns alongside the data: the decompression algorithm can figure it out on its own as it scans the row of code numbers, which saves yet more space.
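
To make that a little more concrete, here's a toy version of the compression side - not the actual codec a GeoTIFF library would use, just enough to show the dictionary being built up as we scan left to right:

def lzw_compress(data: bytes) -> list[int]:
    # Start with a dictionary containing every single-byte pattern (codes 0-255).
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            # We've seen this pattern before, so keep extending it.
            current = candidate
        else:
            # Emit the code for the longest pattern we already knew about...
            codes.append(dictionary[current])
            # ...and remember the new, longer pattern for later in the row.
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes

print(lzw_compress(b"ABABABABABABABAB"))   # repeated patterns collapse down to a handful of codes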

The problem with the clever decompression of LZW is that to work out how to decode any given part of a row of data, you need to have processed all the data to the left of it first, as that is how the dictionary gets built up. This internal detail of LZW makes somewhat of a mockery of being selective about which area of the image you want to read: to decode the pixels for Inverness we'll still have to decode a chunk of the Pacific Ocean, a swathe of Canada, all the pixels over the Atlantic, then the Outer Hebrides, Skye, and the Highlands to the left of Inverness. If the data hadn't been compressed, then when loading data from the image you could just skip to the right part, but then the image would be infeasibly large.
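
The decompression side shows exactly why you can't just jump into the middle of a row: the dictionary is rebuilt incrementally as the codes are read, so every code depends on all the codes before it. Again, just a toy sketch to pair with the compressor above:

def lzw_decompress(codes: list[int]) -> bytes:
    # Rebuild the same starting dictionary the compressor used.
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # Special case: the code being defined right now is referenced immediately.
            entry = previous + previous[:1]
        output.append(entry)
        # The crucial bit: each new dictionary entry depends on what came before it,
        # so decoding pixel N means having already decoded pixels 0 to N-1.
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)

print(lzw_decompress(lzw_compress(b"ABABABABABABABAB")))   # round-trips back to the original bytes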

Once you know this, it's an easy enough problem to solve, but knowing that requires that blend of experience (I like computer graphics, and so was familiar with image formats and LZW compression), and systems thinking (what seemed to be a geospatial pipeline problem is actually an image file format problem). The fun part of computer science I guess is being a holistic detective, to quote Dirk Gently:

| The term `holistic' refers to my conviction that what we are concerned with here is the fundamental interconnectedness of all things. I do not concern myself with such petty things as fingerprint powder, telltale pieces of pocket fluff and inane footprints. I see the solution to each problem as being detectable in the pattern and web of the whole. The connections between causes and effects are often much more subtle and complex than we with our rough and ready understanding of the physical world might naturally suppose.

I'm not claiming this example is the most complex bit of detective work, but I think it shows how challenging solving computer problems can be at times because of this holistic or systems nature. I think it's useful for me to reflect upon that as I try to position my work as "how do I remove the computer scientist from data-science". I can't encapsulate the entire system such that this sort of broader-context problem solving will never be needed. By definition, when I make tools and libraries to simplify certain ecology problems, they will only encapsulate the nearby problems, and not solve these kinds of logically distant issues. For that perhaps you will always need an experienced engineer - which offends me somewhat, even if it does give me some kind of job security.


In the end, my solution to the poorly performing area-per-pixel raster was just to not store it as 400,000 pixels wide, but as a single pixel wide, and then expand it later when needed. We know that all the pixels along a horizontal row have the same value, so there's no point storing them all. This trick clearly stuck in Ali's mind in a similar fashion to how Richard's insight had stuck in mine. What's perhaps ironic, in a sort of Alanis Morissette definition of the term, is that this week I went one better and removed the need for the area-per-pixel raster entirely from the pipeline, just as Ali was reminding me of what we'd done.
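
If you're wondering what "expand it later" looks like in practice, with numpy it's essentially free: broadcasting repeats the single column across whatever window width you need without copying the data up front. A sketch, with made-up shapes and values:

import numpy as np

# The stored raster is one value per row of the global map: shape (200_000, 1).
area_column = np.full((200_000, 1), 1.0)   # made-up values, just for illustration

def expand_area_column(column: np.ndarray, window_width: int) -> np.ndarray:
    # Broadcasting repeats the column across the window width without copying it.
    return np.broadcast_to(column, (column.shape[0], window_width))

window = expand_area_column(area_column, 4_000)
print(window.shape)   # (200000, 4000), but still backed by the original single column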

In the latest update to my geospatial library Yirgacheffe (which is what this post was originally meant to be about!) I added a virtual raster layer where you give it a map projection and pixel size, and it calculates the pixel area on demand for you, so there's no need to have any images, 1 pixel wide or otherwise, in your pipeline.

import yirgacheffe as yg

with yg.read_raster("species_range.tif") as range:
    with yg.area_raster(range.map_projection) as area:
        total_area = (range * area).sum()

It wins on storage (there's no data to store, as it calculates the area of each pixel as needed from a fairly simple formula), it doesn't need compression, and indeed it will work with almost any map projection. Finally I have a proper solution to the original problem!


As a footnote, whilst we're on the topic of Dirk Gently quotes, I spotted this one as well whilst finding the one above:

| What I mean is that if you really want to understand something, the best way is to try and explain it to someone else. That forces you to sort it out in your own mind.

Which is both why I enjoy blogging and why I'm very slow at doing so.