Weeknotes: 10th February 2025

Last week

Yirgacheffe

I hadn't planned this, but I spent a day reworking the Yirgacheffe APIs towards what I've always thought of as the next major version interface, something I'd filed under "mañana" due to other work being more important. But inspired by one of the Part II projects I'm supervising (final-year undergrad, for those not from Cambridge), I decided I really needed to fix up the APIs to make them cleaner.

Yirgacheffe has some clunky parts to the API where it fails to meet my aim that working on maps should result in Python code that clearly follows the methodology, rather than dealing with housekeeping around the data. In a lot of areas I like to think I've been successful; for example, it will deal with loading data in small chunks so as to fit within memory. But other parts still betray the API's origins: although you can combine layers with simple operators, you had to manually tell Yirgacheffe how to align the data you were using.

What do I mean by this? Let's pick two examples of working with GeoTIFF raster maps: generating a species Area of Habitat (AOH) map, and generating an overall species richness map.

For the per-species AOH map, we take two global layers, the habitat classification map and the land elevation map, and then we have a species range map which is localized to just where that species is. In this case you want the output map's area to be the intersection of the three input maps' areas, as it'd be hugely wasteful to calculate over the entire globe when we know the species doesn't live there.

Once we have a bunch of AOH maps, each of which is a GeoTIFF raster that tells us where an animal is likely to be found, we can sum them all up to make a species richness map that tells us the density of species across the globe. In this case each individual AOH map is a raster covering just that specific bit of the planet, but as we sum them we need to end up with a global map, so we need the resulting area to be the union of all the AOH maps.

Until now, making that decision in Yirgacheffe has been the burden of the user, as can be seen in this example AOH implementation:

import numpy as np

from yirgacheffe.layer import RasterLayer, VectorLayer, YirgacheffeLayer

habitat_map = RasterLayer.layer_from_file("habitats.tif")
elevation_map = RasterLayer.layer_from_file("elevation.tif")
range_polygon = VectorLayer.layer_from_file("species123.geojson", raster_like=habitat_map)

layers = [habitat_map, elevation_map, range_polygon]
intersection_area = YirgacheffeLayer.find_intersection(layers)
for layer in layers:
	layer.set_window_for_intersection(intersection_area)

refined_habitat = habitat_map.isin([...species habitat codes...])
refined_elevation = (elevation_map >= species_min) & (elevation_map <= species_max)

aoh = refined_habitat * refined_elevation * range_polygon

# Create a layer that is just the size of the intersection to store the results
with RasterLayer.empty_raster_layer_like(range_polygon, datatype=np.float64, filename="result.tif") as result:
	aoh.save(result)

You can see that a good third of the code here is dealing with raster bookkeeping rather than methodology.

I've always wanted to get rid of this, and it occurred to me that the answer is actually quite simple: we can infer the likely choice of intersection or union from the methodology itself. If you're using operations like multiply or logical AND, where a zero input will result in a zero output, then intersection makes sense, as the areas you're removing would otherwise just be zeroed out. For operators like add or logical OR, you want a union, as every source pixel counts, even if it's matched with out-of-area pixels from other layers.
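As a toy sketch of that inference rule, you could model each layer's area as a set of pixel coordinates and walk the expression graph; all the names here are illustrative, not Yirgacheffe's internals:

```python
from dataclasses import dataclass

# Toy model: a layer's "area" is just a set of pixel coordinates here,
# whereas real Yirgacheffe works with georeferenced windows.

@dataclass
class ToyLayer:
    area: set

@dataclass
class ToyOp:
    op: str       # "mul"/"and" are zero-preserving; "add"/"or" accumulate
    left: object
    right: object

def inferred_area(node):
    """Walk the calculation graph, combining child areas with an
    intersection for zero-preserving operators and a union otherwise."""
    if isinstance(node, ToyLayer):
        return node.area
    left = inferred_area(node.left)
    right = inferred_area(node.right)
    if node.op in ("mul", "and"):
        return left & right  # zero inputs would zero these pixels anyway
    return left | right      # every source pixel counts towards the result

habitat = ToyLayer({(0, 0), (0, 1), (1, 0)})
species_range = ToyLayer({(0, 1), (1, 0), (1, 1)})

# An AOH-style multiply shrinks to the intersection of the input areas,
# while a richness-style sum grows to their union.
aoh_area = inferred_area(ToyOp("mul", habitat, species_range))
richness_area = inferred_area(ToyOp("add", habitat, species_range))
```

Here `aoh_area` comes out as the two shared pixels, while `richness_area` covers all four.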

Because Yirgacheffe lazily evaluates equations, we can examine the graph of any calculation before we execute it and work out the desired strategy just by looking at the operations used. This means my AOH code is now much clearer about what it's doing (thanks also to the effort I made a few weeks ago to expand the native operators Yirgacheffe supports):

from yirgacheffe.layer import RasterLayer, VectorLayer

habitat_map = RasterLayer.layer_from_file("habitats.tif")
elevation_map = RasterLayer.layer_from_file("elevation.tif")
range_polygon = VectorLayer.layer_from_file("species123.geojson", raster_like=habitat_map)

refined_habitat = habitat_map.isin([...species habitat codes...])
refined_elevation = (elevation_map >= species_min) & (elevation_map <= species_max)

aoh = refined_habitat * refined_elevation * range_polygon

# Create a layer that is just the size of the intersection to store the results
with RasterLayer.empty_raster_layer_like(aoh, filename="result.tif") as result:
	aoh.save(result)

I still think I can do better - it's a bit clunky, for instance, that we have to create the empty raster rather than just saving to a filename. So I'm not quite ready to declare a 1.0 release, but that is all relatively trivial compared to the more fundamental change I've just made.


Part of why I like using Yirgacheffe to do a lot of the heavy lifting in my code, and why I can make big changes like the one above quite quickly, is that Yirgacheffe has a lot of unit tests. This means I have higher confidence that geospatial code I've written using Yirgacheffe is correct than if I'd just used GDAL or other APIs directly. We process so much data in these ecology pipelines, and do so much aggregation, that it can be hard to spot small issues at the per-pixel level, and this is how I've tried to combat that.

At the time of writing there are 577 unit tests on Yirgacheffe, which isn't a particularly meaningful number beyond telling you that I've put some effort in here. There can still be bugs, but each test case makes them less likely, which means I can sleep slightly easier at night. But as I made this big change I was curious: how much coverage do I have? Is my sense of safety false, or meaningful?
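As an illustration of the kind of per-pixel check I mean, here's a toy sketch using plain lists; `multiply_rasters` is a stand-in helper, not a Yirgacheffe API:

```python
# Toy sketch: aggregation can hide a single wrong pixel, so tests should
# assert on values pixel by pixel rather than on summary statistics.

def multiply_rasters(a, b):
    """Element-wise multiply of two equally-sized 2D grids."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def test_multiply_masks_out_zero_pixels():
    habitat = [[1, 0], [1, 1]]
    elevation_ok = [[1, 1], [0, 1]]
    # A pixel-level expectation: the aggregate sum (2) would also be
    # satisfied by several wrong answers.
    assert multiply_rasters(habitat, elevation_ok) == [[1, 0], [0, 1]]

test_multiply_masks_out_zero_pixels()
```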

To answer that, I finally added test-coverage output to my builds using pytest-cov, and the answer is that I'm not doing too badly, but I could be doing better in places:

Name                             Stmts   Miss  Cover
————————————————————————————————————————————————————
yirgacheffe/__init__.py              9      0   100%
yirgacheffe/constants.py             1      0   100%
yirgacheffe/h3layer.py               1      0   100%
yirgacheffe/layers/__init__.py      32      4    88%
yirgacheffe/layers/area.py          51     28    45%
yirgacheffe/layers/base.py         127     17    87%
yirgacheffe/layers/constant.py      24      2    92%
yirgacheffe/layers/group.py        209     18    91%
yirgacheffe/layers/h3layer.py       95     20    79%
yirgacheffe/layers/rasters.py      177     18    90%
yirgacheffe/layers/rescaled.py      36      4    89%
yirgacheffe/layers/vectors.py      194     39    80%
yirgacheffe/operators.py           411     50    88%
yirgacheffe/rounding.py             33      0   100%
yirgacheffe/window.py               63      1    98%
————————————————————————————————————————————————————
TOTAL                             1463    201    86%

Given I was just adding tests as I went along, I'm pretty pleased with that coverage score, and now I have something I can just chip away at each time I make a change.

Species selection for STAR and LIFE

I spent a couple of days digging down more into the species selection filters for both the LIFE biodiversity metric pipeline and my work-in-progress STAR species threat assessment code.

For the LIFE metric I'm updating it to use a newer version of the IUCN Red List, and I'm trying to compare the result with the original version by Alison Eyres (the original used data from 2021; I'm using data from 2023). For STAR this is my first implementation, but Chess Ridley already has a workflow for this, as she maintains the actual IUCN STAR pipeline, and I can compare against that. In both cases I'm being repeatedly schooled in a bunch of subtle issues I miss through not being a zoologist, and I'm very glad and grateful to have Ali and Chess to compare notes with.

The majority of the time, if we find a difference it's down to me failing to understand some detail, or, being slightly outside the working group, missing something that was agreed upon that is perhaps obvious to a zoologist but not to me, who is just learning as I go. For instance, I was including subpopulations of species when building my lists, which is not what you do, but because that data is tucked away in a JSON field in the database I'm working from, I never noticed.

And at other times I've been able to find issues in the original data that were missed by the zoologists for the same sort of reason: a trust that the data is as expected. Because I have to nose around in it more, I've spotted, say, an issue in the elevation map we're using, or that a species has its polygon encoded in such a way that GDAL silently ignores it.

It's clear to me that in the past I've neglected storing negative results in my pipelines, which would make it more evident when things fall out that shouldn't. Previously, if a species had zero area of habitat (AOH) I just wouldn't generate any data for that species; then, when quizzed on why the output has some feature, I'd have to work back to discover that a species was excluded from the final result, and the lack of information on why bites me: did the species have a zero AOH? Did it fail our selection criteria? Did we have a bug in the program so it just failed to process? So now I'm learning to record data for every decision step in the pipeline for each species.
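A minimal sketch of what that recording might look like; the function names and decision labels are hypothetical, and the real pipelines are more involved:

```python
# Hypothetical sketch: record a decision for every species, including the
# negative outcomes, instead of silently dropping them from the output.

def classify_species(species_ids, passes_selection, compute_aoh):
    """Return a (species_id, decision) pair for every input species."""
    decisions = []
    for sid in species_ids:
        if not passes_selection(sid):
            decisions.append((sid, "failed selection criteria"))
            continue
        try:
            aoh = compute_aoh(sid)
        except Exception as exc:
            decisions.append((sid, f"processing error: {exc}"))
            continue
        if aoh == 0:
            decisions.append((sid, "zero AOH"))
        else:
            decisions.append((sid, "included"))
    return decisions
```

Writing these pairs out alongside the results means "why is species X missing?" becomes a lookup rather than an archaeology exercise.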

Weeknotes meta

As I try to keep up with Anil in terms of regularly documenting progress, I made two changes to webplats, the software that hosts this site:

  • Whilst the OCaml web framework Dream has been a great boost to getting this site up and running under OCaml quickly, there's a bunch of stuff that I feel it should do out of the box but doesn't: in particular, it doesn't support HEAD requests and it doesn't do anything to help with content caching of static data. I fixed the second of those quickly, as I spotted that every page load was re-serving all the assets. I'll add HEAD responses in the near future, as it's my personal opinion that any URI that responds to a GET request should also respond to a HEAD request.

  • I started adding a custom label resolver, but failed to finish it in time for this week. The idea, which Anil already has working, is that rather than use markdown links with URLs, we instead keep internal data for contacts, projects, publications, etc. that we can refer to with a single slug, and it'll magically pull out the right text and URL. I figured out the slightly obtuse API Cmarkit has for letting you add and mangle labels in a document, but I ran out of time to hook it up.

Next week

LIFE

Two LIFE related todos this week:

  1. For the follow-on paper to LIFE I need to get Ali some deep dives on specific areas, breaking down the LIFE scores in those areas. This is something I've automated enough to pull together the data myself, but not enough to hand over to someone else. Partly this is a "one, two, many" problem, and partly I'm loath to make a more general tool here when I know we can't use it beyond this particular work due to the terms of the IUCN data licensing: I'm fairly sure that if we made a LIFE inspection tool that let you drill down per species, it'd count as republishing the raw data, which is against the terms of use as I understand it.

  2. We want to publish an updated version of the LIFE maps to correct some issues we've found (related to the above discussion of details I'd misunderstood, and to oddities we've also found in the original pipeline that have since been tidied up). The code has been updated as we go in the public repo, but the Zenodo maps will need updating too.

Yirgacheffe

After all the work described above, the main thing I want to do is tidy up a couple of the other rough API edges that are just down to poor naming rather than being particularly structural. If I can get those done, then I think a 1.0 party can occur in the next week or so.

Tags: star, life, redlist, yirgacheffe, testing, webplats