Weeknotes: 24th April 2023

The week in review

Swift persistence calculator

I finally got an AoH (Area of Habitat) calculation out of my Swift code, and after some coercion it even ran faster than the Python version.

[Image: test.png]

In retrospect it was probably hubris that led me to think my Swift implementation would run many times faster than the Python version, rather than just the 50% faster it managed to generate the above. I'd guessed that in the Python version the data shuffling between GDAL and numpy would be heavy, but ultimately I suspect the workflow overall is bound by IO rather than compute, as you can see in this profile of my Swift code: there are spikes of CPU usage as each chunk is processed, and then a bunch of time spent writing out that chunk and reading it back in. On a single thread that pattern will exist regardless of which platform you're using.

[Image: Screenshot 2023-04-18 at 14.01.58.png, a profile of the Swift AoH calculator]
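To make that pattern concrete, here's roughly the shape of the per-chunk loop, sketched in Python with GDAL rather than the actual Swift code; the file names, habitat codes, and elevation range are all made up for illustration:

```python
from osgeo import gdal
import numpy as np

habitat = gdal.Open("habitat.tif")      # hypothetical inputs
elevation = gdal.Open("elevation.tif")
driver = gdal.GetDriverByName("GTiff")
result = driver.Create("aoh.tif", habitat.RasterXSize, habitat.RasterYSize,
                       1, gdal.GDT_Byte)

CHUNK_ROWS = 512
for y in range(0, habitat.RasterYSize, CHUNK_ROWS):
    rows = min(CHUNK_ROWS, habitat.RasterYSize - y)
    # IO: read this chunk of each input raster
    hab = habitat.GetRasterBand(1).ReadAsArray(0, y, habitat.RasterXSize, rows)
    ele = elevation.GetRasterBand(1).ReadAsArray(0, y, habitat.RasterXSize, rows)
    # CPU spike: is each pixel suitable habitat within the elevation range?
    mask = np.isin(hab, [100, 110]) & (ele >= 200) & (ele <= 1800)
    # IO again: write the chunk of results back out
    result.GetRasterBand(1).WriteArray(mask.astype(np.uint8), 0, y)
```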

Still, that does mean that parallelisation attempts will be interesting.

The remaining work here:

  • Do a broad sweep over a range of species with different scales of AoH to check for performance impact (e.g., with the GPU version we know that species with a small AoH are faster on the CPU, as you don't amortise the cost of the data transfer under CUDA)
  • I want to then compare Python+LittleJohn, Swift+LittleJohn, and Swift+GCD (there's a rough sketch of the general approach after this list).
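Given each chunk is independent, the obvious first attempt is to farm the chunks out to a pool of workers. I'm not going to show LittleJohn or GCD here; as a hedged stand-in for the comparison, this is the same split done with Python's multiprocessing, where process_chunk is a hypothetical version of the loop body from the earlier sketch:

```python
from multiprocessing import Pool

RASTER_HEIGHT = 20_000  # hypothetical raster size in pixels
CHUNK_ROWS = 512

def process_chunk(y_offset: int) -> None:
    # Hypothetical worker: opens the rasters itself (GDAL datasets don't
    # pickle across processes), reads its chunk, computes the suitability
    # mask, and writes to its own output tile to avoid concurrent writes
    # to a single file.
    pass

if __name__ == "__main__":
    with Pool() as pool:
        pool.map(process_chunk, range(0, RASTER_HEIGHT, CHUNK_ROWS))
```

The interesting question is whether the IO stalls in the profile above overlap usefully once several workers are in flight, or whether we just end up contending on the disk.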

Alison is going to help me build a list of 100 species to let us do some repeatable testing here.

IUCN data to SQLite, looking into habitat conversions

One bit that's still hacked into my Swift code is the species data: I used canned results in my tests, so it only ever works for species 9997 and for ESACCI. Because the IUCN API can't be expected to be 100% reliable (nor can the internet, etc.), but we want certainty of execution for our multi-day pipelines, we use a cached version of a lot of the IUCN data. Currently though that cache is sat in a bunch of folders, one per classification (mammal, reptile, etc.), each containing five CSV files in a somewhat normalised table form.

In the end I wrote a small script to pull all that data into a SQLite database, which means we can have a single blob to store alongside experiments. In theory this could also go into Postgres and then be made available on the 4C cluster (subject to licensing), just so we have one canonical source.
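The script itself is just a sweep over those folders; something along these lines, though the paths and layout here are assumptions rather than the real repository structure:

```python
import sqlite3
from pathlib import Path

import pandas as pd

DATA_ROOT = Path("iucn_cache")  # hypothetical: one folder per classification
OUTPUT = "iucn.db"

with sqlite3.connect(OUTPUT) as db:
    for class_dir in sorted(DATA_ROOT.iterdir()):
        if not class_dir.is_dir():
            continue
        for csv_path in sorted(class_dir.glob("*.csv")):
            frame = pd.read_csv(csv_path)
            frame["class_name"] = class_dir.name  # mammal, reptile, ...
            # One table per CSV name, accumulating rows across classes
            frame.to_sql(csv_path.stem, db, if_exists="append", index=False)
```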


The other part of this I discovered is the problem of habitat conversions, which Alison has been trying to school me on. IUCN, ESA CCI, and the Jung dataset all use different habitat classifications on their maps. Usually there's some mixed level of refinement in each one ("Forest, Temperate", "Forest, Boreal", etc.), but the types used by the three sources don't align. So the IUCN modlib has code, for both Jung and ESA CCI, to do what feels to an outsider like a crude reclassification of the data down to just the top-level categories common to all: "Forest", "Desert", etc. Daniele's code just has a lot of constants for this.
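To show what I mean, the crosswalk amounts to a table collapsing each source's classes down to the shared major categories, something like the following; these particular codes are illustrative rather than the modlib's actual constants:

```python
# Illustrative crosswalk only, not the real constants. IUCN habitat codes
# map sub-classes like "Forest - Boreal" (1.1) to a shared major category.
IUCN_TO_MAJOR = {
    "1.1": "Forest",   # Forest - Boreal
    "1.4": "Forest",   # Forest - Temperate
    "8.1": "Desert",   # Desert - Hot
}

# ESA CCI land-cover codes collapsed the same way.
ESACCI_TO_MAJOR = {
    60: "Forest",      # Tree cover, broadleaved, deciduous
    190: "Urban",      # Urban areas
    200: "Desert",     # Bare areas
}
```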

Alison points out that this is what people do, but it's a source of contention as there's no one true way to do it. So, at least for our code, I'm going to try to track down the origins of what we do, as right now all I have to go on is some constants in a Python file with no references.

Ultimately, not being the ecologist, it isn't my place to say what is good or bad here; my point is just that things like this need to be made clear in the code and ideally the README, so the methodology is transparent.

Helping out with other geospatial code tasks

I helped out Kate with some data manipulation for her current work, pulling together a little script for tiling different maps together using Yirgacheffe. At first it didn't work, as the data wasn't in the WGS84 projection we usually use, but in Mollweide, which is notionally a projection where each pixel has the same area, but which seems to only offer that promise locally. In the end, to get things working, Kate reprojected her data to WGS84 and we got her plots done (given the final result needed to be in WGS84 anyway).
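For reference, the reprojection step itself is a one-liner with GDAL; a minimal sketch with made-up filenames, assuming the input is in Mollweide:

```python
from osgeo import gdal

# Warp the Mollweide input to WGS84 (EPSG:4326); filenames are hypothetical.
gdal.Warp("map_wgs84.tif", "map_mollweide.tif", dstSRS="EPSG:4326")
```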

Still, a nice reminder to me that WGS84 isn't the be-all and end-all of projections in our community.

Meeting with Tom/Patrick re methodology

I read through Keshav's methodology document again, and Patrick and I had a meeting with Tom Swinfield about it. I know that since then Keshav has factored in a bunch of changes suggested by Patrick, so I'll need to re-read it.

The push seems to be to implement this as code ASAP, though that’s at odds with thinking about HotCarbon, which has a deadline less than two weeks away.

Patrick and I are going to get the ball rolling on this, but the first task is to identify whether we have all the inputs we need, which Patrick is looking into.

HotCarbon

I forked the HotOS paper in Overleaf, as there's a hope we can submit a variation on it to HotCarbon. I chatted to Patrick about this a bit, but haven't made significant progress here. One thing Patrick flagged is that the HotCarbon CFP includes this line:

“We solicit position papers that address sustainability and/or the carbon footprint of computer and network systems.”

This makes me think we should perhaps lean into the CI aspects of Ark a little for this paper. A change in an algorithm can cause a massive amount of data to be re-calculated, which uses quite a lot of energy; by having a properly tracked dependency graph, as in a CI or build system, we can make the claim that Ark would reduce the carbon costs of doing the sort of analysis we do.
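As a toy illustration of that claim (this is in no way Ark's design, just the build-system idea in miniature): if you hash each stage's code and inputs, you only pay the compute cost when something actually changed, the same way a build system avoids recompiling unchanged targets.

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("stage_cache")
CACHE.mkdir(exist_ok=True)

def run_stage(name, func, *inputs):
    # Key the stage on its name, its code, and its (JSON-serialisable) inputs.
    key = hashlib.sha256(
        json.dumps([name, func.__code__.co_code.hex(), inputs]).encode()
    ).hexdigest()
    marker = CACHE / key
    if marker.exists():           # nothing changed: skip the recomputation
        return json.loads(marker.read_text())
    result = func(*inputs)        # something changed: pay the compute cost
    marker.write_text(json.dumps(result))
    return result
```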

The week coming

  • Chat to Amelia/Anil about HotCarbon
  • Chat to Patrick about methodology implementation again
  • Code up some of these habitat smooshing methods
  • Watch out for meese


Tags: swift, weeknotes, life