Weeknotes: 27th November 2023

Last week

Data sharing

I failed to get the data shared. I’d come up with a strategy for sharing it via one of our machines, and had played around with ZFS and EROFS to see how we might distribute the data as packages, but lab IT decided they didn’t like the idea of opening up one of our machines like this, and so in the end they created me a share on the existing webspace.

On the plus side, I did finally get to play with ZFS and EROFS in context, so that’s not wasted effort given they seem to be two of the options for distributing results blocks in the shark world, but it does mean I moved all that data to the wrong machine.

So I still need to copy the data from the machine I set it all up on over to the lab public hosting machine.

Tropical Moist Forest Evaluation Method Implementation

I updated the docs for the TMFEMI to include a fuller description of the inputs required, including things not previously listed, like Random Seed. Sadiq had mentioned we should specify the random algorithm used too, but right now I think that’s too specific for where we’re at, and something we should add as we refine further (he’s totally correct, it’s just that not even our Python implementation is ready for that yet, alas).
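To illustrate Sadiq’s point, a minimal numpy sketch (not what our pipeline currently does): a seed alone only pins results if the underlying algorithm is pinned too, which you can do by naming the bit generator explicitly.

>>> import numpy as np
>>> rng = np.random.Generator(np.random.PCG64(42))  # seed and algorithm both explicit
>>> type(rng.bit_generator).__name__
'PCG64'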

I did a small performance optimisation pass, as we were running out of memory processing some of the larger projects, and I spotted some very sparse data structures that could be simplified, making things run more efficiently.
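For flavour, a hypothetical illustration of the kind of simplification (the names and shapes here are invented, not our actual structures): when only a small fraction of grid cells carry a value, storing just the occupied cells beats holding the whole grid in memory.

import numpy as np

# Dense representation: one float64 slot per grid cell, nearly all unused.
# A 20,000 x 20,000 grid is ~3.2GB before it holds anything useful.
# dense = np.zeros((20_000, 20_000))

# Compact representation: parallel arrays of just the occupied cells.
rows = np.array([10, 5_000, 19_999])
cols = np.array([42, 7, 123])
vals = np.array([0.3, 0.7, 0.1])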

I did another review of the old and new methodology documents based on Tom’s comments at the previous week’s meeting, and I spotted that, with the code having passed through many hands, a subtlety of the original methodology had been lost: we are generating the set of potential matches S using Ksub rather than the full K. I did try to fix it, but that made the script much slower and hungrier for memory, so I’ve asked Robin to see if he can make this change and make find_pairs.py more performant at the same time, regardless of this switch. He did a good job on find_potential_matches.py, so hopefully he can work similar magic here.
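For anyone following along, here’s a hypothetical sketch of the distinction (the function and the matching criterion are invented placeholders): Ksub is a subsample of the project pixels K, and the set of potential matches S drawn from the matching area should be built against the full K, not the subsample.

def is_match(k_pixel, m_pixel):
    # placeholder criterion; the real method matches on a set of covariates
    return abs(k_pixel - m_pixel) < 0.1

def build_potential_matches(project_pixels, matching_area):
    return [m for m in matching_area if any(is_match(k, m) for k in project_pixels)]

# s = build_potential_matches(k_sub, m)  # what the code had drifted into doing
# s = build_potential_matches(k, m)      # what the methodology specifies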

It also emerged at last week’s meeting that the numbers Tom uses to assess a project aren’t actually listed as outputs in the methodology document, nor are some key intermediaries that are useful things to have. I need to add these to the outputs of the document, as at least one of them is something I’d happily have removed as a performance optimisation, given it wasn’t listed as an explicit requirement.

Biodiversity

I did some results spelunking for Margaux to work out why, when you run the trash scenario (habitat is changed from usable to something the species definitely doesn’t like), we still see some positive results. After much digging into the code, it turns out to be down to the poor behaviour of floating point numbers: at some point we get values that should be zero but instead end up as positive numbers smaller than the machine’s epsilon value, and when you add those layers together you get numbers bigger than epsilon, so definite non-zero numbers :/

>>> old_p
array(0.96596205)
>>> new_p_layer.read_array(0, 0, 1, 1)
array([[0.96596205]])
>>> new_p_layer.read_array(0, 0, 1, 1) - old_p
array([[1.11022302e-16]])

and we can see epsilon is suspiciously twice that.

>>> import sys
>>> sys.float_info.epsilon
2.220446049250313e-16
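As a sanity check on the accumulation theory, here’s a toy demonstration (made-up values, not project data): a handful of sub-epsilon residuals sum to something comfortably above epsilon.

>>> tiny = sys.float_info.epsilon / 2  # matches the residual above
>>> 0 < tiny < sys.float_info.epsilon
True
>>> tiny + tiny + tiny + tiny
4.440892098500626e-16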

This is somewhat frustrating, and I assume there must be some best practice for dealing with this sort of thing already - I need to do some reading to work out what that is.
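One mitigation I’ve seen suggested elsewhere, pending that reading, is to clamp anything smaller than epsilon to zero before the layers get summed; a sketch using numpy:

>>> import numpy as np
>>> layer = np.array([[1.1102230246251565e-16, 0.25]])
>>> cleaned = np.where(np.abs(layer) < sys.float_info.epsilon, 0.0, layer)
>>> float(cleaned[0, 0])
0.0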

Blockchain paper

I read Sadiq’s draft of the blockchain paper, and there is a gap in it that needs a summary of the TMFEMI, so I need to dump some words in there this week.

General repository tidying

Patrick and I moved most TMF and Biodiversity related things to the new Quantify Earth organisation on GitHub, leaving the carbon credits organisation more for 4C blockchain/offsetting specific things.

Shark thoughts

I had a good catch-up with Patrick on things sharky, and we decided that the shell I keep pushing on is not really the shark shell that he and Ryan are working on, so I need a new name for mine :) They’re close enough to be useful for comparison, but I suspect we’re approaching things from opposite ends of the determinism spectrum.

I did build a small prototype of a library that lets me generate experiment manifests painlessly from Python. In my mind the container runtime would do this eventually, but for now, just so I can start reasoning about it, I’ve made a Python module that shims all the common libraries we use in order to generate the manifest, and I tweaked fsark to automatically inject this into Python processes, so the user doesn’t need to do anything to get an experiment manifest such as this one for an AoH calculation.
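The shape of the shim, very roughly (a standalone sketch with invented names, not the actual module, which wraps our GIS libraries rather than builtins): intercept file opens, note them in a manifest, and write the manifest out when the process exits.

import atexit
import builtins
import json

_manifest = {"inputs": [], "outputs": []}
_real_open = builtins.open

def _recording_open(path, mode="r", *args, **kwargs):
    # crude heuristic: plain reads are inputs, anything else is an output
    bucket = "inputs" if "r" in mode and "+" not in mode else "outputs"
    _manifest[bucket].append(str(path))
    return _real_open(path, mode, *args, **kwargs)

builtins.open = _recording_open

@atexit.register
def _dump_manifest():
    # restore the real open first so writing the manifest isn't itself recorded
    builtins.open = _real_open
    with _real_open("experiment-manifest.json", "w") as f:
        json.dump(_manifest, f, indent=2)

The nice property of injecting something of this shape at interpreter start is that user code never has to know it’s there.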

In theory I should be able to make this work with both the biodiversity code and the TMFEMI, but the problem at the moment is multiprocessing, which breaks my shims and which we use heavily in the TMFEMI (the biodiversity code uses littlejohn instead, which will also need a bodge, but should be an easier one).

Even as a bodge like this, it solves a real problem for me: I regularly get asked by ecologists to help find old data and the context in which it was run, so I have motivation to get this working more broadly!

This week

  • Write words on TMFEMI into Sadiq’s paper
  • Document all the machines - we have quite a few now
  • Get that data shared
  • Add the new outputs to the TMFEMI
  • Try again to tidy up the biodiversity pipeline code for both publication and getting a single common codebase for use with the IUCN
  • Rerun data for Margaux to see if I can tidy up these troublesome non-zeros

Tags: weeknotes, tmfemi, pyshark, life