Weeknotes: 17th July 2023
Last week I had fun working from Amsterdam, in a co-working space that was a weird pastiche of start-up land circa 2007, and this week I’m actually in Cambridge all week for a change. That is, before this weekend, when I head off to Oxfordshire to help maintain a 3000-year-old monument.

And people think Cambridge is old!
Last week
Tropical Moist Forest Evaluation Methodology Implementation
A slow-progress week, as I tried to juggle making progress on pixel matching, where I hit unexpected issues, with having to change track as priorities shifted and go back to earlier parts of the pipeline because of an unexpected demand for early results.
In terms of the pixel matching algorithm, as outlined in section 6.5 of the methodology document, I now have both K (the pixels from which to match in the project) and S (all the potential matching pixels in the expanded zone around the project) calculated and saved with stats. This sounds like no progress, as last week I had the pixels in S, but back then I had not started to try to save the data about the pixels that is needed in the later stages of both the pixel matching and then the additionality calculations.
A full S took longer than I thought because I kept running out of memory, which was unexpected, as even before I de-duplicate matches, 3 billion pixels should have happily fit in memory. I tried refactoring the way S is calculated several ways to trade off concurrency mechanisms against memory (disk files, in-memory queues, etc.), and in the end I think it was just GDAL being a memory hog again: when I set limits on GDAL’s memory usage it works. Annoyingly this means I’m probably no longer optimal in my approach, having changed how I tackled the problem while chasing red herrings, and I’d want to go back a few steps to fix that, but I have something that works, so I’ll leave it at that and get onto the random sampling stage.
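For reference, the kind of limit I mean looks something like this (the cap value below is illustrative, not the one I actually used):

```python
from osgeo import gdal

# Cap GDAL's raster block cache before opening any datasets. Left uncapped,
# GDAL grows its cache towards a fraction of system RAM, which across many
# large rasters can look a lot like a memory leak.
gdal.SetCacheMax(512 * 1024 * 1024)  # value in bytes; 512MiB is illustrative
```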
Now that’s done, the next step is the random sampling and nearest-distance matching, but that will need to wait, as priorities have shifted.
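For a flavour of what that stage involves (this is not the methodology’s actual matching logic, just the general shape of sample-then-nearest-match, with made-up data and characteristics):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the real data: rows are pixels, columns are the
# matching characteristics used to compare them.
k_pixels = rng.normal(size=(10_000, 4))      # K: pixels within the project
s_pixels = rng.normal(size=(1_000_000, 4))   # S: candidate pixels around the project

# Randomly sample from S rather than matching against every candidate.
sample_indices = rng.choice(len(s_pixels), size=100_000, replace=False)
s_sample = s_pixels[sample_indices]

# Build a KD-tree over the sample and find the nearest candidate for each
# pixel in K by distance in characteristic space.
tree = cKDTree(s_sample)
distances, nearest = tree.query(k_pixels, k=1)
matched = s_sample[nearest]
```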
The priority shift is that we now need to get all the GEDI-data-based AGB values from section 6.1 of the methodology for the project by Friday 21st of July.
Although in theory we’d completed the work for section 6.1, in that we had code to do it, we were not in a position to do a bulk calculation, because we’d not yet made any stage production ready: the original plan we’d all agreed to was to complete an end-to-end pass to de-risk the methodology first, using made-up data if necessary. That approach has now been thrown out, so I had to go back and fill in the gaps around the AGB process to make it production ready.
The big gap here was having a database in which to work with the GEDI data. Technically a database isn’t needed for the methodology, but using PostGIS makes spatial queries over very large datasets so much easier that it makes sense to use it rather than write our own code to trawl over all the GEDI HDF5 files directly.
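To give a flavour of why PostGIS earns its keep here, finding every GEDI shot inside a project boundary becomes a single spatial query rather than a trawl over thousands of HDF5 files. The table and column names below are made up for illustration, not our actual schema:

```python
import psycopg2

# Hypothetical schema: a table of GEDI shots with a PostGIS geometry column,
# and the project boundary passed in as GeoJSON.
QUERY = """
    SELECT shot_number, agbd
    FROM gedi_shots
    WHERE ST_Contains(
        ST_SetSRID(ST_GeomFromGeoJSON(%s), 4326),
        gedi_shots.geometry
    );
"""

def shots_within(connection_string: str, boundary_geojson: str):
    with psycopg2.connect(connection_string) as conn:
        with conn.cursor() as cursor:
            cursor.execute(QUERY, (boundary_geojson,))
            return cursor.fetchall()
```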
For the demonstration thus far we’d been using the GEDI database another researcher had set up, but that was built from V1 GEDI data, while we’re downloading V2 data; not only are the versions incompatible, that database doesn’t have all the granules we need. I set up a new database on the same PostGIS instance for the TMFEMI pipeline to use, and downloaded and imported data for the Gola project to convince myself it was set up and working properly. I also wrote a new ingest script: before, I’d been using the one Amelia had written, just with the Spark code hacked out of it, and we now needed a script that could go into Patrick’s pipeline, so that is also done.
This inevitably turned up a couple of issues in the earlier code, which I fixed, and so I believe GEDI ingest is now ready to go at scale for TMFEMI.
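The ingest script itself is conceptually simple: walk the beam groups in each granule and insert a point per shot. A heavily simplified sketch, where the dataset and table names are my assumptions for illustration rather than what the real script uses:

```python
import h5py
import psycopg2

INSERT = """
    INSERT INTO gedi_shots (shot_number, agbd, geometry)
    VALUES (%s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326));
"""

def ingest_granule(path: str, connection_string: str) -> None:
    with h5py.File(path, "r") as granule, psycopg2.connect(connection_string) as conn:
        with conn.cursor() as cursor:
            # Each granule contains a group per beam; skip anything else
            # (e.g., top-level metadata groups).
            for name, beam in granule.items():
                if not name.startswith("BEAM"):
                    continue
                shots = beam["shot_number"][:]       # assumed dataset names
                agbd = beam["agbd"][:]
                lats = beam["lat_lowestmode"][:]
                lons = beam["lon_lowestmode"][:]
                for shot, value, lon, lat in zip(shots, agbd, lons, lats):
                    cursor.execute(INSERT, (int(shot), float(value), float(lon), float(lat)))
```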
Related to this, we will also need to get the project data into the pipeline, and for that Tom and/or someone on the ecology side of things needs to know how to get it into the right form. To this end I updated the documentation in the tmf-data repository where this is stored, and wrote a set of checks so that pull requests can be validated before merging. We’d already written a JSON Schema spec for the input files, so all I needed to do was automate validating the JSON against that, plus an additional check to ensure that the shapefile referenced in the project spec actually exists.
The checks are very rudimentary right now, and the errors will be opaque I suspect, but having some validation is better than none, and we can improve on this over time.
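The validation itself is nothing more exotic than running the existing JSON Schema over each project file and checking the referenced shapefile is present. Roughly (the file layout and the “boundary” key are illustrative, not the actual tmf-data structure):

```python
import json
import sys
from pathlib import Path

import jsonschema

def validate_project(project_path: Path, schema_path: Path) -> None:
    schema = json.loads(schema_path.read_text())
    project = json.loads(project_path.read_text())

    # Raises jsonschema.ValidationError (with a somewhat opaque message)
    # if the project file doesn't conform to the schema.
    jsonschema.validate(instance=project, schema=schema)

    # Additional check: the shapefile the project points at must exist.
    shapefile = project_path.parent / project["boundary"]
    if not shapefile.exists():
        raise FileNotFoundError(f"{project_path}: referenced shapefile {shapefile} not found")

if __name__ == "__main__":
    validate_project(Path(sys.argv[1]), Path(sys.argv[2]))
```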
Unrelated to the priority change: the week before, I’d written some code to generate rasters of the eco-regions data for pixel matching, and when integrating that into the pipeline Patrick found that certain regions didn’t render correctly, seeming to lack data. It turned out that Yirgacheffe was setting the datatype of the raster layer to byte, as I’d forgotten to fix this when adding a dynamic burn value to the rasterisation of vector layers, and there are more than 255 eco-regions. That is now fixed in Yirgacheffe: you can now specify a target data type explicitly, and otherwise I infer one from the type information I can get from the GPKG data. I’m a bit wary about the type inference, as I have to assume the largest possible datatype when the burn value is dynamic (e.g., float64 rather than float32 if the GPKG “Real” data type is used), and given how large rasters get, this can lead to unnecessarily large files; but I figure that for ecologist users some inference is better than none, and if it becomes a problem I could start trying to walk the GPKG tables to work out the maximum possible values.
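The underlying issue is just integer width: an unsigned byte tops out at 255, so burning eco-region IDs above that can’t work. With plain GDAL the equivalent fix is to pass an explicit output type when rasterising (the attribute name and resolution below are illustrative); Yirgacheffe now does this for you, either from an explicit datatype argument or by inference from the GPKG field type:

```python
from osgeo import gdal

# Burn eco-region IDs from a vector GPKG into a raster. With a byte output
# type, IDs above 255 cannot be represented; a wider integer type avoids
# the "missing" regions.
gdal.Rasterize(
    "ecoregions.tif",
    "ecoregions.gpkg",
    format="GTiff",
    outputType=gdal.GDT_Int32,   # wide enough for >255 distinct region IDs
    attribute="ECO_ID",          # illustrative attribute name for the region identifier
    xRes=0.01,
    yRes=0.01,
)
```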
AoH processing for Florian Gollnow
Alison had been helping Florian Gollnow with some work evaluating AoH data in Brazil, and he asked for some help because the downstream processing of that data was quite slow in R. I spent a little time looking at what he was trying to do, and wrote him a simple script using Yirgacheffe that lets him further divide the AoH data by sub-region within Brazil, based on a GPKG of said regions. I didn’t really have much time to spend on this, so it was a bit of a crude script: I’d like to have worked out which regions matched which raster, as the data was there, but there weren’t the hours in the day. Hopefully it’ll be enough to help Florian move forward.
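The idea was nothing fancier than clipping the AoH raster once per sub-region. The actual script used Yirgacheffe, but the same idea in plain GDAL looks roughly like this (file names and the attribute identifying a sub-region are placeholders):

```python
from osgeo import gdal, ogr

AOH_RASTER = "aoh_brazil.tif"            # placeholder input raster
REGIONS_GPKG = "brazil_subregions.gpkg"  # placeholder GPKG of sub-regions

regions = ogr.Open(REGIONS_GPKG)
layer = regions.GetLayer()

for feature in layer:
    # "region_name" is a placeholder for whatever attribute identifies the sub-region.
    name = feature.GetField("region_name")
    # Clip the AoH raster to this sub-region's boundary and write it out.
    gdal.Warp(
        f"aoh_{name}.tif",
        AOH_RASTER,
        cutlineDSName=REGIONS_GPKG,
        cutlineWhere=f"region_name = '{name}'",
        cropToCutline=True,
        dstNodata=0,
    )
```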
This coming week
- Catch up with Patrick about how much we still need to do to hit this AGB deadline
- Explain to Tom what we need in terms of data
- Get back to pixel matching when I can