Weeknotes: 13th February 2023

Week in review

Testing databases for the biodiversity data

In the biodiversity pipeline spec I wrote out a while ago for our group, one of the unknowns was how to host the data to make it explorable by the ecologists’ team. The data set is outside the range of what I’d normally consider for a regular relational database: my rough back-of-the-envelope calculation is 2.7 billion rows, or about 100GB of data, in a single table, and in the production environments I’ve worked in, RDBMSs stop being manageable at around 100M rows per table.
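To put those rough figures in perspective, here is a minimal sketch of the arithmetic. It uses only the numbers quoted above, and it assumes "100GB" means decimal gigabytes, which is my guess rather than anything measured.

```python
# Back-of-the-envelope sizing, using only the rough figures quoted above.
ROWS = 2_700_000_000            # ~2.7 billion rows in the single table
TOTAL_BYTES = 100 * 10**9       # ~100GB, assumed to be decimal gigabytes
COMFORTABLE_ROWS = 100_000_000  # ~100M rows, where an RDBMS stops feeling manageable

# Average size of one row implied by the two estimates.
bytes_per_row = TOTAL_BYTES / ROWS

# How far past the "comfortable" row count this table would be.
times_over = ROWS / COMFORTABLE_ROWS

print(f"{bytes_per_row:.0f} bytes per row on average")
print(f"{times_over:.0f}x the comfortable row count")
```

So the rows themselves are small (tens of bytes each); it is the row count, not the row width, that pushes this outside familiar territory.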

Alison went through the spec last week and identified the queries she’d like to run as part of the biodiversity paper, so I’ve started working through my list of databases, loading up sample data, and checking that each one can run those queries. At the moment I’m comparing ClickHouse (hat tip to Sadiq for that one), Elasticsearch, and old faithful Postgres. I didn’t finish this work in time for my weeknotes, so you’ll get the exciting summary at some point in the near future.

Drawing arcs for Ark

Off the back of discussing our OCurrent-style pipeline plans for the ecology workflows, I started to work out how to align the biodiversity work I’ve been doing with the work Patrick’s been doing on the pixel matching algorithm. I’ve made a start on plotting the full biodiversity calculation pipeline, and Alison did the same. My work-in-progress graph looks like this (but is not finished!):

flow.png

Alison and I had a catch-up about this at the end of the week, and my plan is to work into my graph all the subtleties I missed that Alison’s captured, so we have it all in one place. The result will be a YAML description of the pipeline that is currently just being rendered by my second OCaml program, but will longer term move (incrementally) to OCurrent.
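As an illustration of the "description in, graph out" idea, here is a minimal sketch in Python rather than OCaml. Everything in it is hypothetical: the stage names, the `PIPELINE` mapping (standing in for a parsed YAML file), and the `to_dot` helper are made up for this example and are not the real pipeline or the real renderer.

```python
# Hypothetical pipeline description: each stage maps to the stages it depends on.
# In practice this would be loaded from the YAML file; a dict literal keeps the
# sketch free of third-party dependencies. Stage names are illustrative only.
PIPELINE = {
    "fetch_species_ranges": [],
    "fetch_habitat_map": [],
    "aoh": ["fetch_species_ranges", "fetch_habitat_map"],
    "tile": ["aoh"],
}


def to_dot(pipeline):
    """Render a {stage: [dependencies]} mapping as a Graphviz DOT digraph."""
    lines = ["digraph pipeline {"]
    # Declare every stage as a node, so stages with no edges still appear.
    for stage in pipeline:
        lines.append(f'  "{stage}";')
    # Draw one edge per dependency, pointing from the input to its consumer.
    for stage, deps in pipeline.items():
        for dep in deps:
            if dep not in pipeline:
                raise ValueError(f"{stage} depends on unknown stage {dep}")
            lines.append(f'  "{dep}" -> "{stage}";')
    lines.append("}")
    return "\n".join(lines)


print(to_dot(PIPELINE))
```

The printed text can be fed to Graphviz to draw the picture. Keeping the description as plain data is what should make an incremental move to a different executor possible, since the graph is not tied to the program that renders it.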

On that front I’ve made a start on containerisation of the AoH and tile pipeline stages. My plan currently is to try to stop people having to mess around with Python virtualenvs when running this stuff, and instead have them run it locally straight from the canonical image, just as it would run in the fixed pipeline.

Sysadmin

Our main compute server has been a pain point for people this week, with us hitting disk space issues twice, and some of the other students hitting issues and coming to me. I’ve taken to being more proactive at pushing people to our infrastructure Slack of late, which seems to be working well - I’m trying to get the users to be more like a community than see it as a service that I run (as that isn't my job, I just do it to try to be helpful!).

The week coming

  • I had a sync up with Anil on Friday, and he wanted the team to shift focus to push the MVP over the line, so this coming week that’s my top priority. I have a couple of tickets on Sadiq’s board, and I’ll aim to get both done this week.
  • Additionally, Anil wants some notes for the annual report to Tezos, so I plan to sit down with Patrick and run through our dev-side woes from last year, which I’ll decant into a document for Anil to work into his report.
  • Any time left over will be wrapping up the database work I started last week.

Interesting links

Tags: weeknotes, life, tezos, 4C