Weeknotes: 23rd January 2024

Last week

PROPL

Going backward through the week somewhat, it ended with PROPL, where Patrick, Ryan, and I were part of the crew: I both chaired a session and was a discussant (seeding discussion topics during the open discussion section). It went well AFAICT, though I failed to chat to as many people as I probably should have. Having never chaired a session before, I was focussed on my tasks: finding my speakers, checking what they needed, and making sure I had some way to intro them.

The main person I want to follow up with is Vashti Galpin, who has done some work on provenance that I feel links to my pyshark efforts.

EEG talk

I wasn’t expecting to give a talk to the Energy and Environment Group until May, but a speaker dropped out, so I was drafted in. I spent an embarrassing amount of time preparing, as I realised the talk I was meant to recycle (which I gave to the FIDE2 CDT last term) was now stale, and this audience knew more about the area, so I reworked a chunk of it.

The useful part, though, was that it forced me to look at the line running through all the little projects I’ve done to try to make the ecologists’ lives easier (littlejohn, yirgacheffe, etc.), work out what worked and what fell flat, and think about what that means.

Littlejohn: naive/basic concurrency tool - lots of usage despite clearly minimal technical work.
Yirgacheffe: does a lot to make life easier, I use it all the time, never seen an ecologist use it.
Fsark: container wrapper that makes it look like you’re using python that has magically all of GDAL etc. installed - some uptake
Pyshark - new, so no use, but adds provenance tracking with hopefully little burden on user

The lesson I think is that ecologists are so busy they don’t have time to learn new things, and so the only things that I’ve done that have been successful are those that fit straight into their current workflows: littlejohn just requires you have your script take args and you know how to make a CSV file. Yirgacheffe, which in theory does a lot, requires you to learn a new API.
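To make the "script takes args plus a CSV file" workflow concrete, here is a minimal sketch of the idea in Python. It is not littlejohn's actual code or command-line interface: the function names, the convention that CSV header cells are the flag names, and the `model.py` script are all assumptions made for illustration.

```python
import csv
import io
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(csv_text, script):
    """Turn each CSV row into one command line.

    Assumption for this sketch: the header row holds the flag names
    (e.g. "--year") and every other row holds one set of values.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        ["python3", script]
        + [part for flag, value in row.items() for part in (flag, value)]
        for row in rows
    ]


def run_all(commands, jobs=4):
    """Run the commands with at most `jobs` in flight; return exit codes in order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

The point of the sketch is how little the user has to learn: the script is unchanged, and the only new artefact is a CSV file with one row per run.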

There’s a common mantra for startups that to get people to move from product A to product B you can’t give them something that is 2x as good, you have to give them something that is 10x as good, and I guess yirgacheffe either doesn’t deliver that or I’ve done a bad job of communicating that.

The lesson then for the Shark shell is that we need to ensure it either slots in with no new knowledge required, or clearly communicates a 10x benefit.

Tropical Moist Forest Evaluation Methodology Implementation

I did two bits on the TMFEMI work:

I stated in a recent weeknotes that my personal opinion is that we should take find_pairs and not implement it in Python, as we lost the code transparency/performance battle a while ago on that and find_potential_matches. Python in particular limits our ability to share data between processes, so we end up running very few concurrent instances of find_pairs, underutilising the many cores on our server. I did at some point look into using shared memory, and whilst numpy does support that, it doesn’t work with numba (at least not for me).
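For reference, the numpy side of the shared memory approach looks roughly like the sketch below, using the standard library's `multiprocessing.shared_memory`. This is not code from find_pairs: `publish` and `worker_sum` are hypothetical names, and the sketch only covers plain numpy, not the numba combination that didn't work for me.

```python
import numpy as np
from multiprocessing import shared_memory


def publish(array):
    """Copy `array` into a new shared memory block and return the block."""
    shm = shared_memory.SharedMemory(create=True, size=array.nbytes)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array  # one copy in; readers then use the block without copying
    del view  # release the buffer export so the block can be closed later
    return shm


def worker_sum(name, shape, dtype):
    """What a worker process would do: attach by name and read in place."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        total = float(view.sum())
        del view  # must go before close(), or close() raises BufferError
        return total
    finally:
        shm.close()
```

The parent would call `publish` once, pass `shm.name` plus the shape and dtype to each worker, and call `shm.close()` and `shm.unlink()` when everything has finished.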

Out of curiosity I took a stab at rewriting the current code in Julia, with the assumption that even if Julia was slightly slower than numba (I know from past experience Julia is generally faster than python + pandas), we’d make up for it in ability to use shared data between threads. Unfortunately the Julia implementation of calculating S is significantly slower than the numba implementation Sadiq (IIRC) did in find_pairs, and even allowing for 200 threads, my Julia version would still be slower.

This is both a nice reminder that the numba stuff is doing some serious work performance wise, and that my limited Julia skills probably don’t help. I didn’t really expect to get close in just a day, but I didn’t expect to fall so far short of where we needed to be.

At some point the TMFEMI has to be handed over to the ecologists to use. To that end I spent some time documenting how to manually run the pipeline without ocurrent, the CI system we on the compsci side have been using to run it, which is too complex to be something we could expect ecologists to want to learn how to manage. Running it manually is something I do regularly, it’s just that there are a lot of parts to remember. Tom felt the docs were going in the right direction, so now I need to get Abby or Charlotte to walk through them.

This week

Finish TMFEMI documentation
Catch up on Life visualisations comments