Weeknotes: 20th November 2023

Last week

Tropical Moist Forest Evaluation Method Implementation

As intended, we got a new revision of the TMF Evaluation Methodology document out to the group, albeit a couple of days later than I’d hoped: we found another place where the implementation had diverged from the original in a way I wasn’t aware of, which meant a mix of more document writing and code tidying from both Patrick and myself. We got there in the end, though.

Whilst most of the response has been positive, I think Tom is a little uncomfortable with how we end up doing a double sub-sampling of the project area. We do an initial spatial sampling of the project area (resulting in a set called K), which is an area-based reduction from the original methodology that I assume was done for performance reasons in the initial GEE implementation. Then later on, when doing pixel matching, we take a further 10% sample of K before we do the final matching; this is referred to in the new methodology and implementation as Ksub, but never actually named in the original methodology.

Going back and reviewing the original methodology document, we do the following:

  1. Make K as a sample of pixels in the project area at an area density based on the size of the project. (6.5.1)
  2. We build up S, the set of pixels from which we do the final matching, sized to be 10 times the size of K (6.5.4)
  3. We use Ksub for the final round of finding pairs based on S (6.5.7)

In the new methodology, however, there is a subtle shift:

  1. Make K as a sample of pixels in the project area at an area density based on the size of the project. (6.6.1)
  2. We build up M, the set of all possible counterfactual pixels from which S is made, based on K (6.6.3)
  3. We build up S from M, the set of pixels from which we do the final matching, sized to be 100 times the size of Ksub (6.7.4)
  4. We use Ksub for the final round of finding pairs based on S (6.7.7)

Both Ksub and M are artefacts of the code of the implementation, but ones I’ve explicitly named in the new methodology: they’ve been useful discussion points, so I felt they needed names. I feel both do exist in the old methodology, they’re just not named: M is the superset of all possible S values (a subset of R, if you remove the “ten times the size of K” clause), and Ksub is what you get from 6.5.7 when you take a 10% sample of K.

Naming aside, then, the main difference between the old methodology and the new one is that S is now calculated from a smaller set of pixels, Ksub. I can see this has the potential to change the results, as it potentially narrows the set of pixels from M that go into S. So I want to check with Tom that 1) I have understood what is concerning him, and 2) if that’s the case, try to engage Robin to implement this and to look again at the performance problems we’re having with find_pairs.
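To keep the set names straight, here is a rough sketch of the two flows. The helper names and bodies are hypothetical, not the real implementation, and it glosses over how M and S are actually filtered against matching criteria; the point is only where Ksub enters relative to S.

```python
import random

# Illustrative stand-ins only: these names and bodies are hypothetical,
# not the functions from the actual implementation.

def sample_project_area(project_pixels, density):
    """K: an area-density based sample of the project pixels (6.5.1 / 6.6.1)."""
    return random.sample(project_pixels, max(1, int(len(project_pixels) * density)))

def ten_percent_sample(k):
    """Ksub: the 10% drop of K used for the final round of pairing."""
    return random.sample(k, max(1, len(k) // 10))

def build_s(anchor_set, candidates, multiplier):
    """S: the pixels used for final matching, sized relative to an anchor set."""
    return random.sample(candidates, min(len(candidates), multiplier * len(anchor_set)))

def old_flow(project_pixels, candidate_pixels, density=0.01):
    k = sample_project_area(project_pixels, density)
    s = build_s(k, candidate_pixels, multiplier=10)   # 6.5.4: S sized from K
    ksub = ten_percent_sample(k)                      # 6.5.7: 10% sample of K
    return ksub, s                                    # pair Ksub against S

def new_flow(project_pixels, candidate_pixels, density=0.01):
    k = sample_project_area(project_pixels, density)
    m = candidate_pixels                              # 6.6.3: M, all possible counterfactual pixels (unfiltered here)
    ksub = ten_percent_sample(k)                      # Ksub is now named explicitly
    s = build_s(ksub, m, multiplier=100)              # 6.7.4: S sized from Ksub, drawn from M
    return ksub, s                                    # 6.7.7: pair Ksub against S
```

Since Ksub is a 10% sample of K, 100 times the size of Ksub and 10 times the size of K come out the same; the shift is in which set S is anchored to when it is drawn from M.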

Biodiversity

I did a review pass as requested on the LIFE paper, mostly just providing accurate data on how many different species were used. I also attempted to look into why New Zealand had got lost from the diagrams in the paper, as I was worried it was similar to an incident a while ago where Ireland fell off the map. I spent a bunch of time going through the different layers that build up to the final results, and eventually concluded that the data was there all the way through:

[Screenshot from 16th November: New Zealand present in the intermediate data layers]

The root cause, found by Tom after I’d ruled out the earlier stages, turned out to be in one of the very final stages, where the data is summed across taxa: some NaNs were getting in, and it seems that in New Zealand there is always some taxon that isn’t present for any given pixel.
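As an aside, the failure mode is easy to reproduce in miniature. This is an illustrative numpy example rather than the pipeline code: a single NaN in one taxon’s layer turns a plain per-pixel sum into NaN, whereas treating a missing taxon as contributing nothing needs an explicit nansum (or a fill beforehand).

```python
import numpy as np

# Hypothetical per-taxon score layers for a tiny 2x2 raster window;
# NaN marks pixels where that taxon simply has no data.
birds      = np.array([[0.2, 0.5], [0.1, 0.3]])
mammals    = np.array([[0.4, np.nan], [0.2, 0.1]])
amphibians = np.array([[np.nan, 0.1], [0.05, 0.2]])

layers = np.stack([birds, mammals, amphibians])

# A plain sum propagates NaN: any pixel missing any taxon becomes NaN,
# which is the shape of the bug that wiped New Zealand from the maps.
print(layers.sum(axis=0))         # top row comes out as all NaN

# Treating a missing taxon as absent keeps the pixel in the result.
print(np.nansum(layers, axis=0))  # every pixel gets a value
```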

I’ve started moving repositories around between GitHub organisations ready for publication - we’re moving non-blockchain-related work out of the carboncredits GitHub org into one called quantifyearth, as that’s a bit more decoupled from the application of the research to a particular niche.

Data sharing

I still haven’t managed to get this AoH data shared, but I’m now not blocked on anything else. Anil has set up data.quantify.earth for me, pointed at the usual spot, and I’ve taken one of the “new” machines and prepped it ready. All three of the new machines are running ZFS on their additional storage, which is a good opportunity for me to learn about it; I also migrated the scratch spinning disk on my home Linux box to ZFS.

This week I just need to get the data onto the machine, set up nginx/caddy to serve it, and ask Patrick to forward from the quantify.earth caddy to the new spot, possibly asking Malcolm to open a port along the way.

IBAT meeting

There was another IBAT meeting, this time focussed on tree biodiversity, where apparently there is a problem with getting good range data, on which the STAR ranking is based. It was a useful meeting for me, not knowing much about the domain, and I learned about BIEN, a ten-year-old project to track plant biodiversity. This is interesting to EEG types because not only do they have a lot of data, but they’ve apparently also versioned the dataset as it has evolved and kept the old data around (I’ve not yet checked this myself). They also apparently take a lot of care in how they filter the data coming in, trying to check for both technical and cultural biases.

I struggled to find the API documentation, so I dropped the person from BIEN who was on the call a note asking for pointers to the docs, and suggested that if anyone from the BIEN group was free we’d definitely be interested in having them give a lunchtime seminar.

This week

  • Proofread the paper Sadiq’s written on the Tezos carbon credit contract work
  • Document all the machines - we have quite a few now
  • Get that data shared
  • Push on with trying to clear up this methodology misunderstanding
  • Try again to tidy up the biodiversity pipeline code for both publication and getting a single common codebase for use with the IUCN
  • If I get the chance, play with ZFS more in a shark context

Tags: weeknotes, life, tmfemi