Weeknotes: 13th March 2023
Week in review
Getting biodiversity results queryable
The main thing I worked on this week was setting up PostGIS and testing it with a large subset of the total LIFE experiment data. Alison had been doing some analysis of data around Brazil, and so we needed all the persistence data for species with any overlap with Brazil, which came to just over 300M rows of data. This is around 5% to 10% of the total I’d expect, so not quite representative, but a useful start. As a reminder, I had originally ruled out PostGIS because in my experience with production web services it starts to struggle at around the few-hundred-million rows per table mark, but as Anil pointed out, that’s when you have churn in the data, and here things will be quite static. And on the plus side, I’ve tested a bunch of Alison’s queries and they seem to be expressible in raw SQL, something that ClickHouse couldn’t do (no subquery support) and Elasticsearch could do, but not simply.
Example 1: Percentage of the total area for a given species in each tile
postgres=# select tile, area*100/(select sum(area) from geotest where species=10357
and experiment='current') from geotest where species=10357 and experiment='current' limit 10;
      tile       |      ?column?
-----------------+---------------------
87a8910c3ffffff | 0.5029916932543952
87a891725ffffff | 0.7866602243330656
87a8a3d0bffffff | 0.06943932072465422
87a8a3c0cffffff | 0.8023344475477204
87a8910c2ffffff | 1.1606584738475267
87a8a3d50ffffff | 1.7868703799034376
87a89170dffffff | 1.8646149531041507
87a89172dffffff | 1.7349703995514922
87a89172bffffff | 0.35479811204679734
87a8a3d53ffffff | 3.7004814323134965
Example 2: Number of species per tile:
postgres=# select tile, count(species) from geotest where
experiment = 'current' group by tile limit 10;
tile | count
-----------------+-------
874424800ffffff | 3
874424801ffffff | 3
874424802ffffff | 3
874424803ffffff | 3
874424804ffffff | 3
874424805ffffff | 3
87442480cffffff | 3
87442480effffff | 3
874424811ffffff | 3
874424813ffffff | 3
I set up a simple schema and loaded our target data in using a (mostly) naive Python script over the course of a day. It wasn’t fast to ingest, but then I made no real attempt to optimise the process; given that the data generation takes many days, the slow ingest isn’t really a problem, as the data could be fed in incrementally anyway. Running queries of the sort Alison would like wasn’t instantaneous, but nothing was super slow, taking at most tens of seconds. A sketch of the sort of schema involved is just below.
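For concreteness, here’s a minimal sketch of the kind of table the example queries above imply. The column names and types are my inferences from those queries rather than the actual schema, and the CSV path is made up:

-- Minimal sketch only: columns inferred from the example queries.
CREATE TABLE geotest (
    species    integer          NOT NULL,  -- species identifier
    experiment text             NOT NULL,  -- e.g. 'current'
    tile       text             NOT NULL,  -- tile identifier
    area       double precision NOT NULL   -- species presence area in this tile
);

-- Indexes to support the per-species and per-tile queries above.
CREATE INDEX ON geotest (species, experiment);
CREATE INDEX ON geotest (tile);

-- Bulk loading via COPY is typically much faster than row-by-row
-- INSERTs from a script; the file path here is hypothetical.
COPY geotest (species, experiment, tile, area)
    FROM '/data/brazil_subset.csv' WITH (FORMAT csv, HEADER true);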
I also added the centre point of each tile to the schema, so we can do queries restricted to, say, project zones, and because point data works well with QGIS (somewhat; see below). A sketch of what that enables follows.
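As a hedged example of why tile centres are useful, something like the following (untested, with made-up coordinates and a 50 km radius) should restrict a query to tiles near a project site:

-- Hypothetical sketch: store tile centres as PostGIS points, then
-- count species only for tiles near a project site.
ALTER TABLE geotest ADD COLUMN centre geometry(Point, 4326);
CREATE INDEX ON geotest USING GIST (centre);

SELECT tile, count(species)
FROM geotest
WHERE experiment = 'current'
  AND ST_DWithin(
        centre::geography,
        ST_SetSRID(ST_MakePoint(-55.0, -10.0), 4326)::geography,  -- made-up project centre
        50000)  -- distance in metres, since we cast to geography
GROUP BY tile;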
Data size: the 300M rows took around 30 GB of disk space, which seems good, but then the indexes I added took twice that again, so we’re looking at closer to 100 GB in total. Thus for the full dataset we’re probably looking at around 1 to 2 TB of storage. This isn’t the end of the world, but it does present a short-term problem (see the section on Docker below).
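For what it’s worth, stock Postgres can report where the space goes, so a table-versus-index breakdown like the above is easy to check:

-- Report table, index, and combined on-disk sizes for the table.
SELECT pg_size_pretty(pg_relation_size('geotest'))       AS table_size,
       pg_size_pretty(pg_indexes_size('geotest'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('geotest')) AS total_size;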
Next steps on this:
- I’ve given Alison a tutorial on SQL to run through, and I’ll set her up with access to the database so she can try prodding it herself. If she finds that more advanced SQL is needed I can always pitch in, but I think it’d be good to tool Alison up so she doesn’t need me where possible.
- Try to load another significant chunk of data to see how PostGIS copes.
Docker volumes vs Ark
Getting PostGIS to run in Docker was surprisingly frustrating, based in part on my lack of understanding of how Docker volumes work. I’d naively assumed that Docker volumes would be a bit like Docker images, or perhaps VHD files: opaque blobs with a file system within. It turns out that this isn’t the case, and they’re just stored as regular file trees on disk. Whilst this is sad for certain reasons to do with our Ark goals, it’s also worse from a security/admin point of view, as Docker will honour the file permissions set in the container on the host: if the container says “chown to UID 999”, it’ll do so on the host. This means that with rootful Docker, user 999 on the host gets access to files they otherwise oughtn’t, and with rootless Docker it’ll just fail (unless you happen to be user 999). I’d not realised that Docker had this sort of container leakage before.
This is made worse when we come to the large NAS we use for significant data storage, as that volume shouldn’t let you run chown. That means things either fail, as in the rootless Docker scenario, or, as with my error on our compute server the week before, the chown succeeds due to a bug in the NAS server and then no one can access or fix the file. Sigh.
After digging around with options like Docker volume plugins, in the end I just cheated: I spotted that our second compute server had unallocated disk space left on its root volume, and that the Docker data folder was set up as its own logical volume, so after checking with Anil I assigned another 100 GB to the Docker volume and just used that. This got me up and running, but it doesn’t scale to the kinds of data we eventually want to host.
For that I suspect the easiest thing is to tweak the Postgres entrypoint script to stop it trying to chown files, or to change the script to chown them to the UID/GID that /maps uses.
Other than this being a time sink, it was also sad to note that, in terms of Ark process data flows, I’d kinda hoped we’d be able to use Docker images or volumes as a data-container format for results, as we did, say, for the forest map tiles for quantify.earth. Unfortunately I can’t currently see a way to generate an opaque container from one Docker image for another Docker image to then consume as the CI progresses, though in theory a Docker volume plugin might solve this.
QGIS
Whilst QGIS does drive me up the wall at times, it is also quite neat that you can point it at a PostGIS database, and let it display data from a table either directly or via a simple select query.
[Screenshot: QGIS displaying data from the PostGIS database]
This was generated on my local machine using an SSH port-forward to Kinabalu. What currently stops this being an actually useful tool is:
- At the scale of data we work with, if you accidentally cause it to try to render the entire 300M rows (which is very easy to do), then QGIS effectively locks up as it processes the data on the main thread, and you need to force quit it. I became very familiar with this dialog last week:
[Screenshot: the dialog QGIS shows when it stops responding]
- You can only do simple select queries AFAICT, not anything with GROUP BY or subqueries, which is what we need for the biodiversity work (though see the sketch below for a possible workaround).
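One possible workaround I’ve not yet tried: hide the aggregation behind a Postgres view, so that from QGIS’s point of view it’s just a plain table to select from. The view and column names here are made up:

-- Untested idea: wrap the GROUP BY in a view so QGIS only ever
-- sees a simple select.
CREATE VIEW species_per_tile AS
    SELECT tile, count(species) AS species_count
    FROM geotest
    WHERE experiment = 'current'
    GROUP BY tile;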
At the EEG meeting last week it was hinted that one of the students with more geospatial experience might do an intro to QGIS for those of us on the compsci side of things. I feel there’s a lot of power locked in QGIS that we don’t leverage, because we’ve not come from a GIS background and have other ad-hoc tools we can use instead.
Coming week
- Work on the above todos
- Learn a bit about how Python notebooks work: Anil is a big fan of them, and I have to confess I find they don’t work for me, so I’ve ignored them, but I don’t disagree that for ecologists they’re potentially a nice UI.
- I’d like to write some more OCaml at some point. I bailed on using OCaml for my import script this week because I was already dealing with enough unknowns, and my OCaml isn’t yet at the point where the language wouldn’t become the primary focus of any tool I wrote in it.
Interesting links
- Does 4C want to pick up a programme to develop solar-powered EVs? :)