Weeknotes: 10th March 2025

Last week

iNaturalist

I've continued to do a little playing around with iNaturalist data, both the range data and the occurrence data for birds. Birds, aka Aves, are a good group to pick as they're fairly well represented in IUCN data (which I understand to have all known species of bird in it), and well covered in observations.

I read up on the "range" maps that iNaturalist have started to share, which they describe in this blog post, linking to this paper on the methods used. The paper describes how they use a neural network model to generate the data, and how this differs from standard modelling approaches because they only have presence data and not absence data. Anil then pointed me at a newer paper by the same group that looks specifically at modelling species with very few occurrence points using machine learning techniques.

I was slightly bemused by this quote from the second paper:

Identifying locations where under-observed species can be found is a time consuming and laborious process, often requiring long expeditions to remote locations to search for species that are hard to find. Consequently, there is a pressing need for computational methods that can reliably estimate the spatial distributions of species using only a small number of observations.

I'd argue there is a pressing need for both! Of all the species distribution modelling I've seen, I've yet to see any that isn't improved by having more than a few points of data for each species. Clearly that data is hard to get, so in lieu of it we can use these techniques, but real observations really do seem to me to be vital.

Analysis-wise I'm just trying to get my head around what they have and how it's catalogued compared to other things. I think my main learning this week, both from this and from another project on plants I'm working on, is that there is no one universal identifier for a species: each collector of data has their own "taxonomy ID". This is partly because different systems started at different points and so assigned IDs independently, but also because taxonomies are difficult and not everyone agrees where species should be filed.
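To make the identifier problem concrete, here is a minimal sketch of the kind of crosswalk you end up building when two datasets each have their own taxonomy IDs. Everything in it is an assumption for illustration: the source names are just labels, the IDs are invented placeholders rather than real iNaturalist or IUCN identifiers, and joining on a normalised scientific name is only one possible approach.

```python
from collections import defaultdict

# Hypothetical records: (source, source-specific taxon ID, scientific name).
# The IDs are invented placeholders, not real identifiers from either dataset.
RECORDS = [
    ("inaturalist", "A-101", "Turdus merula"),
    ("iucn", "B-9001", "Turdus  merula"),
    ("inaturalist", "A-102", "Erithacus rubecula"),
    ("iucn", "B-9002", "erithacus rubecula"),
    ("inaturalist", "A-103", "Corvus cornix"),
]


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_crosswalk(records):
    """Group each source's own ID under a shared, normalised scientific name."""
    crosswalk = defaultdict(dict)
    for source, taxon_id, name in records:
        crosswalk[normalise(name)][source] = taxon_id
    return dict(crosswalk)


def unmatched(crosswalk, sources):
    """Return names that lack an ID from at least one of the given sources."""
    wanted = set(sources)
    return sorted(name for name, ids in crosswalk.items() if not wanted <= set(ids))


crosswalk = build_crosswalk(RECORDS)
missing = unmatched(crosswalk, ["inaturalist", "iucn"])
print(crosswalk["turdus merula"])
print(missing)
```

A name-based join like this only handles spelling noise. It does nothing for synonyms or for cases where two sources disagree about whether something is a species at all, which is exactly where the "not everyone agrees" part bites, so the unmatched list is where the real work starts.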

LIFE

I ran some analysis for Alison based on assessing some projects using LIFE, as we look in a little more detail at the LIFE map in specific areas rather than as a global aggregate.

I ran through the pipeline with Emilio, working out where he can fit in the changes he'll need to make for his PhD work to the pipeline, and I owe him a summary email of that conversation with links to files etc.

Outreachy

Patrick has been talking about Outreachy, which provides summer internships in open source projects for those underrepresented in the tech industry, and after chatting to him about what I could do to get involved, I've submitted a project proposal around Claudius.

Claudius is my OCaml library for making generative art and retro-style computer graphics work. It's a project I love to work on but have very little time to tinker with myself, and there are lots of opportunities for taking it further: say, making it useful for making games, turning it into a GUI toolkit, or porting it to WebAssembly to let you more readily share the art you've made.

The deadline for projects was last Friday, so I quickly wrote up a whole bunch of issues so I had some ideas to point at, and in the coming week I'll try to make a little home page for it so it's clearer what the project is for.

Elevation maps: science versus licensing

For LIFE we've been using an SRTM-derived Digital Elevation Model (DEM), which for a long time was the best source of elevation data at fine detail (30m per pixel at the equator). Since then a number of improved maps have come out, and my intent was to switch LIFE to use those, after being introduced to them by working on the STAR pipeline with Chess Ridley.
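The "at the equator" caveat matters because these maps are gridded in degrees rather than metres. As a rough sketch of the arithmetic, assuming a 1 arc-second grid and a spherical Earth (both my simplifications here, and pixel_width_m is just a name made up for this example), the east-west width of a pixel shrinks with the cosine of the latitude:

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS84 equatorial radius, used here as a sphere radius
ARC_SECOND_DEG = 1.0 / 3600.0  # assumed pixel size: one arc-second in degrees


def pixel_width_m(latitude_deg: float, pixel_deg: float = ARC_SECOND_DEG) -> float:
    """Approximate east-west width in metres of one pixel at a given latitude."""
    return (
        EARTH_RADIUS_M
        * math.radians(pixel_deg)
        * math.cos(math.radians(latitude_deg))
    )


for lat in (0, 30, 60):
    print(f"{lat:>2} degrees: {pixel_width_m(lat):.1f} m")
```

The north-south height of a pixel stays roughly constant under this approximation, so pixels get narrower rather than uniformly smaller as you move away from the equator, which is worth remembering when summing areas across a global map.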

The plan was to use FABDEM, which used machine learning to remove forests and buildings. However, we found a couple of issues with FABDEM, one of which the authors have fixed as a set of complementary tiles, but one of which they don't plan to fix (there is a gap around Azerbaijan as at the time of the original publication the source data they used was missing it - that data has now been updated, but FABDEM has not). When we spoke to the authors, they pointed out they have a new better elevation map we should be using instead, FathomDEM.

Unfortunately, whilst I'm happy that the new map is another step forward technically, it's published under a Creative Commons CC-BY-NC-SA license, which in theory is great as it encourages open research, but in practice means we can't use it on projects such as LIFE. The challenge is that the "SA" part of the license name means "share alike" - so any work you use FathomDEM for must be published under the same open license. Unfortunately for LIFE (and other biodiversity metrics I've worked on) the species data comes from other sources with license terms that mean we can't just openly share the results, and so we can't "share alike", which means we can't currently use FathomDEM, no matter how good it is.

I have a strong sympathy for the ideals of licenses like the GNU General Public License and CC-BY-NC-SA, but unfortunately their viral nature can at times inhibit as much as it enables. They assume that everything else a project depends on can be brought under the same terms, and that isn't something many authors can control. So on a pragmatic "trying to save the planet" basis I find it sad that we can't use this work.

To be clear, this isn't the only frustration I have with data licensing in the climate science space, so I don't want to single the FathomDEM data out in particular, just to flag that data licensing is a critical topic in climate science that I don't think gets as much attention as it deserves. I hope to write a bit more up on this in the near future as I have some related things brewing.

This week

I need to get together my submissions for the Nordic-RSE conference. I plan on putting in both a discussion topic on lineage in pipelines and a lightning talk on Yirgacheffe.
I need to write up notes for Emilio on running the LIFE pipeline.
Anil pointed me at the call that just went up for the Research Software Maintenance Fund, and it'd be interesting to put in something around our Area of Habitat code that underpins LIFE and is also being retargeted at other biodiversity metrics.
The deadline for submission to the first issue of the good internet magazine is the end of the week, and I want to write up some words about what we've been doing in our little corner of the EEG with blogging and diversity in implementations.
A couple of weeks ago we looked at doing a version of LIFE that used more detailed information for specific countries, and I have to do some preliminary looking at data for that to see how it'll work.

Tags: inaturalist, life, licensing, data

Tech notes by Michael Winston Dales