Weeknotes: 5th February 2024
Last week
Shark
I spent a little time trying to close the loop between FSARK, my system for putting things like python into containers without the user needing to know, and Pyshark, my python module that automatically attaches provenance information to results as they’re written.
It turns out there’s a set of standards for labelling container images with provenance information set by the OCI, and it seems people do use these (e.g., GDAL’s official container images are all appropriately labelled). So for now if you use a container name (rather than an instance image) with FSARK, then FSARK will pull that metadata and place it into the guest’s environment, and Pyshark now knows to look for this information and creates a “container” metadata section containing it. You can see an example of the result here.
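To make that flow a bit more concrete, here’s a minimal sketch in python of the general idea, not the actual FSARK or Pyshark code. It assumes the labels come from the JSON that `docker image inspect` prints (where they live under `Config.Labels`), and the `EXAMPLE_CONTAINER_` environment variable prefix and all the function names are made up purely for illustration.

```python
import json

# Standard prefix for the OCI pre-defined image annotation keys.
OCI_PREFIX = "org.opencontainers.image."
# Hypothetical prefix for passing labels into the guest's environment;
# this is an assumption for the sketch, not what FSARK really uses.
ENV_PREFIX = "EXAMPLE_CONTAINER_"


def oci_labels_from_inspect(inspect_json: str) -> dict[str, str]:
    """Pull the OCI labels out of `docker image inspect` JSON output."""
    images = json.loads(inspect_json)
    # Labels can be missing or null if the image author set none.
    labels = (images[0].get("Config") or {}).get("Labels") or {}
    return {
        key[len(OCI_PREFIX):]: value
        for key, value in labels.items()
        if key.startswith(OCI_PREFIX)
    }


def labels_to_env(labels: dict[str, str]) -> dict[str, str]:
    """Host side: turn labels into environment variables for the guest."""
    return {
        ENV_PREFIX + key.upper().replace(".", "_"): value
        for key, value in labels.items()
    }


def container_section_from_env(environ: dict[str, str]) -> dict[str, str]:
    """Guest side: rebuild a "container" metadata section from the environment.

    Note this is lossy for dotted keys such as "base.name", which come
    back as "base_name".
    """
    return {
        key[len(ENV_PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(ENV_PREFIX)
    }
```

On the host you’d feed the result of `labels_to_env` into whatever launches the container, and inside the guest you’d call `container_section_from_env(dict(os.environ))` when writing out results.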
This means that if you use both these tools, not only does the file contain provenance information about the program used (e.g., git commit IDs etc.) and the python libraries installed, but it also now contains the information about the container image used (its git version info etc.). Given what we know about things like system library versions changing results (e.g., Patrick’s GDAL version changes example), this gives a much stronger chance of reproducibility.
The downside with the OCI labelling is that you have to trust they got it right, and more vexingly you need to hope they did something sane if their image is based on an upstream image, as there’s no hierarchy to the labelling - you can replace labels used by your parent image, but if you don’t explicitly override them, the parent’s values will propagate through, and you could end up with mixed values that then make no sense. Still, having something here is way better than having nothing.
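As a small illustration of how those mixed values come about, here’s a python sketch that models label inheritance as a flat dictionary merge, where later layers win and nothing records which image set which label. The label values are invented for the example, and the helper names are mine rather than anything from a real tool.

```python
def effective_labels(*layers: dict[str, str]) -> dict[str, str]:
    """Merge label sets from base image to final image; later layers win."""
    merged: dict[str, str] = {}
    for labels in layers:
        merged.update(labels)
    return merged


def inherited_keys(parent: dict[str, str], child: dict[str, str]) -> list[str]:
    """Labels the child never overrode, so they still describe the parent."""
    return sorted(key for key in parent if key not in child)


# Made-up example values: the child image overrides source and version,
# but forgets about revision.
PARENT = {
    "org.opencontainers.image.source": "https://example.com/parent",
    "org.opencontainers.image.version": "3.8.0",
    "org.opencontainers.image.revision": "1111111",
}
CHILD = {
    "org.opencontainers.image.source": "https://example.com/child",
    "org.opencontainers.image.version": "0.1.0",
}

if __name__ == "__main__":
    for key, value in sorted(effective_labels(PARENT, CHILD).items()):
        print(f"{key} = {value}")
    print("inherited from parent:", inherited_keys(PARENT, CHILD))
```

The merged result claims the child’s source repository but the parent’s revision, which is exactly the sort of combination that makes no sense. If you have access to the parent image’s labels, something like `inherited_keys` at least lets you flag which values to be suspicious of.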
Partly I did this as I wanted to reach out to Vashti Galpin at Edinburgh University, who I’d chatted to at PROPL after she gave a talk on the role of provenance in environmental science, and I wanted to be able to send her an example of the kind of things we’d been working on to have some basis for a conversation.
I did a bunch of thinking/diagramming time around how all these parts we have (FSARK, Pyshark, and Anil’s virtualised references) link together, as I feel there’s an end-to-end story that’s starting to emerge here now, and I want to work out if this is a point at which we declare we made a thing (i.e., it’s paper time) or there’s still more to do.
Mostly I realise that I still don’t quite grasp all of Anil’s ideas about [redacted research idea], and alas Patrick and I didn’t manage to sync up before he went away (he’d spoken to Anil more recently than I had on this), so now I feel a little lost on that topic again, and need to have another chat with Anil I think.
LIFE
The reviews came back for the LIFE paper and it needs some work, but it doesn’t look like we’ll need to do a full run of all the data again.
As per last week’s todos, I updated the LIFE data publishing site I made with the ability to download the data layers in GeoTIFF format.
I’m going to park that work there now until we’re ready to move forward once reviews are done.
Simon Tarr from the IUCN got back in touch about us moving forward attempting to use my LIFE pipeline code for AoH rasters to also produce STAR layers, and so I need to follow up with him. When last we spoke I’d asked for more corner case examples of species to test our code with after he was happy with the basic tests, and I guess I still want to figure that out. But the other part of this is what’s the long term plan, given that they will need to periodically run things using our infra.
Misc
I checked out OCurrent (the thing that runs the pipeline for the TMF) and tried to get it to run, which was quite a frustrating experience as the docs don’t talk about how to get it to run, they just assume a bunch of knowledge/experience. Two lines in the README would have saved me a bunch of time, maybe three if we assume it doesn’t like Docker on macOS. Anyway, I shall make a PR to add those lines somewhere.
I made a lot of marmalade at the weekend, which made the house smell very nice.
This week
- Catch up with Abby about how she can automate her running of the calc_k stage of the TMFEMI code.
- Get back to Simon Tarr about AoH things.
- FSARK can be a bit slow to start, so I want to add some progress info when it has to fetch the container image from a remote.