Weeknotes: 17th February 2025

Last Week

LIFE

For the LIFE extinction-risk metric pipeline, all the species data comes from the IUCN Red List, and this week I began a job I'd been intending to do for a while: making that stage practically accessible to people trying to run our pipeline externally.

Originally, as LIFE was being developed, the species data was manually downloaded from the Red List website using their bulk download facility, which gives you a ZIP file of CSVs for the empirical data, and then another set of ZIPs that contain the polygon range data. These were then run through a mixture of hand processing and Python scripting to filter them down to the appropriate species for the LIFE metric. As we came to publication I wanted to automate this, which I did. But at that stage I was also working closely with the IUCN on implementing their STAR metric, which, like LIFE, starts from the raw Red List data, so I had all the information stored in a PostGIS database, and that is what my code currently runs against. The contents of the database are the same raw Red List data from the bulk downloads; it's just easier to work with than a bunch of different files in different formats to ingest.

However, this does mean that if you want to run the LIFE pipeline yourself you need a similar database set up, and I don't think that makes our code very reproducible. Technically, everything you need is in our current scripts, so there's nothing hidden: you can see the methodology we use to filter down species, so it's all scrutable, but you can't just run it. That seems like complying with the letter of the reproducibility law rather than the spirit of it, and so I've wanted to fix this for a while.

My plan was to use the new Red List API the IUCN have on their roadmap, but that's still in development, so I have finally gone back and spent time writing a second set of scripts that mirror my PostGIS-based ones, taking in the data downloaded manually from the website and running from that. This way we've minimised manual intervention (you just do the download, and the scripts do the cleaning/filtering), and someone without access to a PostGIS database with the Red List in it can run LIFE.


In doing this, I was super pleased to discover DuckDB, which I think Anil pointed me to a while ago. DuckDB lets you treat random data file formats as if they were an SQL database. The IUCN Red List data download comes as a ZIP file with 5 CSV files in it, and I was not looking forward to having to migrate the SQL statements I use to extract just the right species from my Red List database into pandas dataframe logic: I personally find that format more cumbersome, and it would mean we had the same logic in two different forms, making it harder to spot any discrepancies. But thanks to DuckDB, I can still use SQL queries to access CSV files, including doing joins across the different files:

import pandas as pd
import duckdb

# Load two of the CSVs from the Red List bulk download as regular dataframes.
taxons_df = pd.read_csv("taxonomy.csv")
assessments_df = pd.read_csv("assessments.csv")

# DuckDB can query pandas dataframes in scope by their variable name,
# so the filtering logic stays as SQL, joins and all.
results = duckdb.query("""
SELECT
	assessments_df.assessmentId,
	assessments_df.season,
	taxons_df.internalTaxonId,
	taxons_df.scientificName
FROM
	assessments_df
	LEFT JOIN taxons_df ON assessments_df.internalTaxonId = taxons_df.internalTaxonId
WHERE
	taxons_df.className = 'AVES'
""")

The CSV files you download from the IUCN have a slightly different structure from the PostGIS database version I'm using, so I have ended up duplicating the SQL statements. But they all follow a very similar structure, so I can directly compare them for review purposes, and all the subsequent logic for working with the data is shared between the two implementations.

Thankfully DuckDB also has a spatial extension, so I can even load a GeoPackage file and then join it with data from a CSV file!

import pandas as pd
import duckdb

# Make sure the spatial extension, which provides the ST_ functions below,
# is installed and loaded.
duckdb.query("INSTALL spatial")
duckdb.query("LOAD spatial")

taxons_df = pd.read_csv("taxonomy.csv")
# Pull the range polygons from the GeoPackage into a DuckDB table.
duckdb.query("CREATE TABLE ranges AS SELECT * FROM '/home/michael/downloads/ranges.gpkg'")

result = duckdb.query("""
SELECT
	ST_AsText(ST_UNION_AGG(ranges.geom::geometry))
FROM
	ranges
	LEFT JOIN taxons_df ON taxons_df.internalTaxonId = ranges.id_no
WHERE
	taxons_df.scientificName = 'Regulus regulus'
""")

The spatial operators DuckDB supports are named slightly differently from PostGIS, but I found equivalents for all the ones I use, like the above query that creates a unified range for the Goldcrest. Again this necessitates that I have two sets of SQL, one for my PostGIS database and one for the DuckDB version, but I was already doing that, and the queries are near identical in structure so they're easily comparable for review purposes.
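
For instance, the aggregate union is the main difference in the query above; as a rough illustration (the PostGIS column name here is just illustrative, not necessarily my actual schema), the same aggregation in the two dialects looks like:

# PostGIS: ST_Union acts as an aggregate over a geometry column.
postgis_union = "SELECT ST_AsText(ST_Union(ranges.geometry)) FROM ranges"

# DuckDB spatial extension: the aggregate form is ST_UNION_AGG.
duckdb_union = "SELECT ST_AsText(ST_UNION_AGG(ranges.geom::geometry)) FROM ranges"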

As good as DuckDB is, it's not as fast as using PostGIS, but I don't think it's any slower than if I just used pandas. And as someone who knows SQL reasonably well, it's just a useful tool generally that I suspect I'll use a lot to query pandas dataframes when exploring datasets, even if my eventual scripts just use pandas directly. And it's just a library, so it doesn't require me to install any services to do all this, which makes it a great thing to throw into a portable pipeline, where databases are just a pain for people to have to set up and manage versus using files.
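
As a taste of the sort of exploratory query I mean, here's a toy example using the same taxonomy CSV as above:

import pandas as pd
import duckdb

taxons_df = pd.read_csv("taxonomy.csv")

# How many assessed species are there in each taxonomic class?
counts = duckdb.query("""
SELECT
    className,
    COUNT(*) AS species_count
FROM
    taxons_df
GROUP BY
    className
ORDER BY
    species_count DESC
""").to_df()

print(counts)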

STAR

I had a productive catchup with Simon Tarr from the IUCN about enabling him to run my implementation of the first half of the STAR pipeline. Mostly it made me realise that whilst I think of that as being mostly done, it doesn't really run well if you don't have a well-configured large compute server, and as a result I've taken the action to try to run the whole thing in a Docker container on my laptop. Mostly this has involved writing a Dockerfile that brings in the appropriate dependencies, catching where I've naively assumed you have a machine with 128 cores and 1.5TB of RAM (ahem), and writing a run.sh script that will do sensible things if you've already run it part way through, rather than doing a lot of repeat work each time.
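
The logic of that last part is nothing fancier than checking whether a stage's output already exists before running it; a minimal sketch of the idea in Python (the stage names and paths are made up, and the real thing is a shell script):

import os
import subprocess

# Hypothetical pipeline stages: each output path is paired with the command
# that produces it. If the output survived an earlier partial run, skip it.
stages = [
    ("outputs/species_list.csv", ["python3", "extract_species.py"]),
    ("outputs/aoh_rasters", ["python3", "calculate_aoh.py"]),
]

for output, command in stages:
    if os.path.exists(output):
        print(f"skipping {command[1]}: {output} already exists")
        continue
    subprocess.run(command, check=True)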

Shark

All of which made me lament that Shark, our tooling for repeatable pipelines driven by documentation as code, isn't usable yet. I had to park Shark last year as I was trying to innovate on too many fronts, and the biodiversity work had a higher priority than the comp-sci work, so I just went back to running the pipeline from shell scripts.

Shell scripts are obviously hardly ideal, but they do have the advantage that data-science folk know how to read and write them, so in terms of handover it was a better choice than my writing a clever Makefile, however strong that temptation is. This week I did a little looking around to see if I could find a middle ground, and despite finding some interesting links to other tools in this space, nothing I found met my apparently unreasonable requirements:

  • Lightweight: I want to install some simple tools, not a service that needs to be maintained and perhaps requires admin privileges to run.
  • Easy to write: has a simple language to express the pipeline, and isn't lots of YAML with lots of boilerplate.
  • Makes it easy to use without the tooling: it's common in data-science that you want to work on just part of the pipeline, so being able to step outside the workflow and run that particular bit by hand seems key.

As it stands, the prototype implementation of Shark we have today works for some of this, but definitely not all of it. But I feel at least motivated to try and fix it, as it'd make my own life easier.

Yirgacheffe in OCaml

One of my objectives for 2025 is to start to implement some of the ecology code I normally write in Python in OCaml. Yirgacheffe is my attempt to make working with geospatial layers more declarative, but it still does so in Python, as I feel I need to write code that I can expect ecologists to take and run, and besides R, a lot of data-science folk also write in Python (and Julia).

A fine plan, but in practice I think it fails because the kind of Python I write is not the kind of Python a data-scientist would write. Even though it's technically the same language, the approaches I take as a "Software Engineer", trying to write something well structured and maintainable as a software artefact, are quite far removed from those of someone who's using Python as a means to an end for processing a bit of data. I realise I'm at risk of being patronising here, which is not my intent: I fully subscribe to Mary Shaw's definition of data-scientists as "vernacular software developers" - domain experts in another field who write software out of necessity to get their jobs done. But the tooling we've made for software development means that writing code to get a task done and writing code where the code is the product end up, in my mind (and I feel in the mind of Mary Shaw), being quite removed from each other. So the impression I'm left with is that my code, when viewed by an ecologist, might as well be written in any other language for all the sense its structure and flow makes to them.

Ultimately, to my mind, we want a declarative implementation of these pipelines that can more obviously be aligned with the methodology description that will be in the resultant publication; see the efforts I went to in the previous weeknotes to strip back yet more boilerplate from my Python code to let the methodology shine through more clearly. Thus, I have a goal for 2025 to move to OCaml for at least one pipeline, based on the theory that OCaml is well suited to this: being a functional language, the code is declarative in nature anyway, but it's also a pragmatic language, so those less familiar with it can still throw in bits of imperative code as they need to.

To this end I've been co-supervising a Part II (final-year undergrad at Cambridge) project which aims to get an initial version of Yirgacheffe bootstrapped in OCaml. This brings with it an interesting set of challenges, particularly around how the typed nature of OCaml potentially means more boilerplate, moving me away from the aims I've stated. But I think it's a worthy challenge to try and solve! And the exciting milestone this week was that the student not only has something working, they actually ran a standard Area of Habitat (AoH) calculation through their code. An AoH is a very standard ecological method, and key to one of the bigger projects I work on, so this is exciting news both for the student and for myself.
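
For the unfamiliar, an AoH calculation boils down to intersecting a species' range with rasters of suitable habitat and elevation. A toy sketch of the idea in Python (nothing to do with the student's OCaml or the real Yirgacheffe API; the file names, habitat codes, and elevation limits are made up):

import numpy as np
import rasterio

# Toy Area of Habitat calculation. Assumes the three rasters are already
# aligned to the same grid, and that the CRS units are metres.
with rasterio.open("habitat_map.tif") as src:
    habitat = src.read(1)
    pixel_area_km2 = abs(src.transform.a * src.transform.e) / 1_000_000

with rasterio.open("elevation.tif") as src:
    elevation = src.read(1)

with rasterio.open("species_range.tif") as src:
    in_range = src.read(1) > 0

suitable_habitat = np.isin(habitat, [101, 102])                # made-up suitable habitat codes
suitable_elevation = (elevation >= 200) & (elevation <= 1500)  # made-up elevation limits

aoh_pixels = in_range & suitable_habitat & suitable_elevation
print("Area of Habitat (km^2):", aoh_pixels.sum() * pixel_area_km2)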

Next Week

LIFE

Finish off the new species filter code, and improve the logging of decisions made during filtering. One of the things I end up doing whenever I touch this code is working out why certain species are or aren't included, and I predict I'll have to do that here too. So I want to follow the lead of Chess Ridley (who maintains the R-based STAR pipeline at the IUCN) and have my species filtering script spit out a report of all the species it considered and where in the filtering process they dropped out (lack of habitats, lack of appropriate ranges, etc.).
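
Roughly the shape I have in mind, sketched with made-up filter names and dataframe columns:

import pandas as pd

# Sketch of a per-species audit trail for the filtering stage; the filter
# names and dataframe columns are made up for illustration.
report_rows = []

def apply_filter(df, reason, predicate):
    """Apply one filtering step, recording which species drop out and why."""
    keep = predicate(df)
    for taxon_id in df.loc[~keep, "internalTaxonId"]:
        report_rows.append({"internalTaxonId": taxon_id, "dropped_at": reason})
    return df[keep]

species = pd.read_csv("candidate_species.csv")
species = apply_filter(species, "no habitats", lambda d: d["habitat_count"] > 0)
species = apply_filter(species, "no appropriate ranges", lambda d: d["range_count"] > 0)

for taxon_id in species["internalTaxonId"]:
    report_rows.append({"internalTaxonId": taxon_id, "dropped_at": "retained"})

pd.DataFrame(report_rows).to_csv("species_filter_report.csv", index=False)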

Plants

We are starting up some new research threads around plants, and I have a bunch of meetings on this topic this week, so I'll need to switch my brain over to prep work for those meetings.

Yirgacheffe/Shark coding

A lapsed hacker can dream, right?

Tags: star, shark, life, duckdb, yirgacheffe, ocaml, iucn