Weeknotes: 4th September 2023

Last week

Hextile visualisation tools

The week before I’d started trying to visualise the H3 data that we have as part of the biodiversity work, and that continued to soak up most of my time last week as I tried to get a video ready for Anil’s ICFP talk. Previously I had managed to get a couple of examples of data from static JSON files into the quantify.earth deck.gl setup, and noted that Safari struggled a little with this. I’d said I had two objectives: push more data through it to see where it falls over, and try to generate the data dynamically from queries against a PostGIS database.

At the start of the week I had a few species in a PostGIS database, and using a small backend service I wrote in Go I was able to generate summary layers dynamically and display them. In the database we have all the raw per-species (hextile, area) tuples, and in the display we’re summing up area and species-count information per tile (with the computation being done in PostGIS).

The idea here is that by using both colour and height you can start to gain insight into the data that one axis alone would not give you: the colour of a hex tile is defined by the sum of the areas of all species in it, so you can see densely populated tiles versus sparsely populated ones, but the height then tells you whether that’s because there are many small species there or a few bigger ones.
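
As a rough sketch, the summary endpoint on the backend boils down to a single aggregate query pushed into Postgres; the table and column names below are illustrative rather than our actual schema:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

// TileSummary is one hexagon's aggregate as handed to the deck.gl layer.
type TileSummary struct {
	TileID       string  `json:"tile"`    // H3 index of the hexagon
	TotalArea    float64 `json:"area"`    // drives the tile colour
	SpeciesCount int     `json:"species"` // drives the tile height
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Push the aggregation down into Postgres: total area and species
	// count per tile, computed over the raw (species, tile, area) tuples.
	rows, err := conn.Query(ctx, `
		SELECT tile_id, SUM(area), COUNT(DISTINCT species_id)
		FROM species_tiles
		GROUP BY tile_id`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	var summaries []TileSummary
	for rows.Next() {
		var s TileSummary
		if err := rows.Scan(&s.TileID, &s.TotalArea, &s.SpeciesCount); err != nil {
			log.Fatal(err)
		}
		summaries = append(summaries, s)
	}
	if err := json.NewEncoder(os.Stdout).Encode(summaries); err != nil {
		log.Fatal(err)
	}
}
```

On the client side, deck.gl then just has to map the area field onto a colour ramp and the species field onto elevation.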

I then scaled this up to a larger dataset. The full hextile dataset that Alison has generated is about 2.5 billion tiles per experiment, so I didn’t want to flood our PostGIS server with that just yet, and restricted the data to just Brazil again. But even here I hit two performance issues:

  1. Importing the data was slow.
  2. The summary query was relatively slow at scale.

In terms of import, I lost a bunch of time when it turned out the Go data frames library I was using is fast under macOS but slow under Linux (3 minutes vs 16 minutes for the same dataset). In the end I ditched the data frame library and used the raw parquet library underneath it directly, and my Linux imports were down to a minute for my test case. That was good, but a lesson that you need to check why things are slow: initially I had assumed the Postgres instance was the bottleneck.
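
Something like the following is the shape of the final import path, sketched here with the parquet-go and pgx libraries; the struct fields and table name are illustrative:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/parquet-go/parquet-go"
)

// SpeciesTile mirrors one row of the hextile parquet file; field names
// are illustrative rather than the real schema.
type SpeciesTile struct {
	SpeciesID int64   `parquet:"species_id"`
	TileID    string  `parquet:"tile_id"`
	Area      float64 `parquet:"area"`
}

func main() {
	ctx := context.Background()

	// Read the parquet file straight into structs, with no data frame
	// layer in between.
	tiles, err := parquet.ReadFile[SpeciesTile]("brazil_tiles.parquet")
	if err != nil {
		log.Fatal(err)
	}

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// COPY the rows in bulk rather than INSERTing them one at a time.
	copyRows := make([][]any, len(tiles))
	for i, t := range tiles {
		copyRows[i] = []any{t.SpeciesID, t.TileID, t.Area}
	}
	n, err := conn.CopyFrom(ctx,
		pgx.Identifier{"species_tiles"},
		[]string{"species_id", "tile_id", "area"},
		pgx.CopyFromRows(copyRows))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("imported %d rows", n)
}
```

For the full 2.5 billion tile dataset you’d want to stream row groups rather than slurp the whole file into memory, but the shape is the same.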

But even once my import code was performant, the database import slowed down over time, which, based on discussions with Amelia, I assume is down to indexing. Amelia told me that when doing bulk imports she will often delete indexes beforehand, and I have quite a few indexes on this data to make querying faster. Next time I do an import I’ll follow suit, though it makes a bit of a mockery of using an ORM to manage the schema.
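
Sketched as a helper, the drop-and-recreate dance looks roughly like this (index and table names illustrative again):

```go
package importer

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// bulkImport wraps a load function with an index drop and rebuild.
// The index and table names are illustrative, not our actual schema.
func bulkImport(ctx context.Context, conn *pgx.Conn, load func() error) error {
	// With the index gone, Postgres doesn't pay the incremental
	// index-maintenance cost on every inserted row.
	if _, err := conn.Exec(ctx, `DROP INDEX IF EXISTS species_tiles_tile_idx`); err != nil {
		return err
	}
	if err := load(); err != nil {
		return err
	}
	// Rebuilding once at the end is much cheaper than updating the
	// index millions of times during the load.
	_, err := conn.Exec(ctx,
		`CREATE INDEX species_tiles_tile_idx ON species_tiles (tile_id)`)
	return err
}
```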

Even for the 181 million tiles I did manage to load, the summary query required to display the data took two minutes to run. For exploratory querying of the data I think that’s acceptable, but for a pretty demo it is not. So in the end my demo for Anil was the result of me taking the output of the API call to my little backend service and saving it to a static file. I suspect for quantify.earth we’d use the static file for the initial overlay, and then go back to the database when you click on tiles to query which species are in them.
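
The per-tile lookup on click is a different beast from the global summary: it can be served by an index and only touches one tile’s rows, so it should be fine to run live. A sketch, with illustrative names:

```go
package tiles

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// speciesInTile returns the IDs of all species recorded in one H3 tile.
// Unlike the global summary, this query hits the tile_id index and reads
// only one tile's rows, so it is cheap enough to serve live.
// Table and column names are illustrative.
func speciesInTile(ctx context.Context, conn *pgx.Conn, tileID string) ([]int64, error) {
	rows, err := conn.Query(ctx,
		`SELECT species_id FROM species_tiles WHERE tile_id = $1`, tileID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var species []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		species = append(species, id)
	}
	return species, rows.Err()
}
```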


By the end of the week I had a nice demo video ready for Anil.

I think there’s promise here in terms of getting this to become a useful exploration tool for ecologists, but that’ll need to wait as this week I need to return to TMFEMI.

Tropical Moist Forest Evaluation Methodology Implementation

Patrick has been plugging away with Tom to identify where the current implementation is wrong, and where the methodology needs more clarity because what we’d implemented was technically correct but not what was wanted. From that I have a bunch of todos that I need to get through this week.

Placement student

Last week was the final week of our student’s work placement. By the end of it she’d written us a tool that should make importing the IUCN data more repeatable, though because the hextile work was the priority I didn’t get a chance to integrate it with the AoH pipeline yet. Still, we did get her to make her first pull request via GitHub, and her code is now part of the persistence code-base, which is a great outcome.

This coming week

  • TMFEMI: work through my todos there
  • Talk to Simon from IUCN about next steps on setting up their AoH pipeline
  • Write up Julia notes, as I suspect it’ll be a little while before I get to do more there

Tags: weeknotes, tmfemi, h3, postgis