Yirgacheffe: trying to do The Right Thing with geospatial data
11 Nov 2022
Tags: python, gdal, geotiff, geospatial, design
I’ve been trying to find a narrative for writing up this little library/idea I’ve been working on the last few months, to try and tie together this post, and the ideas have finally found form thanks to the the latest episode of The Future of Coding podcast, where they go over the “Worst is Better” trilogy of papers by Richard P. Gabriel, the synopsis of which is the tension that exists when designing a system between:
- “The right thing” - having an elegant interface with all the icky complexity of dealing with complexities hidden inside
- “Worse is better” - having an elegant implementation that exposes the underlying complexities to the system’s user
This is something I’ve been trying to reason about in my time at the Cambridge Center for Carbon Credits (4C) as I assist the ecologists on the project with their computational challenges. A lot of what they’re trying to do not computationally complex, but becomes challenging when you try to work at the scales they do. For example, they want to reason about the habitat of tens of thousands of species with multiple layers of raster maps that go down to 100m2 per pixel (at the equator) - that’s up to 73GB per layer. It’s not just about how do you make such computation performant at that point, but how do you make it safely scale at all without exhausting all the resources you have. Linux, like all unixes, suffers poorly when it runs out of resources, and it’s too easy to take down a server to the point it needs to be power-cycled if you aren’t careful.
I see my role as trying to build bits that go between the ecologists and the machine: not just can I take their code or algorithms and implement them to scale, but can I make it so that they can write the higher-level code to do the interesting planet-saving bit, and I write a system that take that higher-level expression and execute it efficiently and safely on the available computing resources. It's a common lens I use to frame problems like this: how do I make it so that you don’t need someone like me to do your job?
To this end I’ve been working on a small, evolving, geospatial library I’ve named Yirgacheffe that acts as a place for me to give home some of these ideas. As is often the case with such middleware projects, I realised as I listened to the aforementioned podcast, the question that I was trying to answer without knowing it was the same one expressed in “Worse is Better”: where do I hide away all the complexity from the user, and where is it better to accept that there a things you just have to know about the scaling challenges. And further: as one commits more to making something "The Right Thing" are you just making it worse when you do hit the limits of the system than if you'd been up front about it to start with?
The genesis of Yirgacheffe was a failed attempt to add some “The Right Thing” code to an existing project. I was helping tidy up a Python codebase worked on by a couple of the ecologists for a 4C related project, which was processing multiple geospatial datasets loaded using GDAL, the common library used when working geospatial data.
One of the pain-points I found as a software engineer was that the code was littered with repeated frame of reference transforms that I wanted to tidy away. They task they were working on involved using several raster map layers with different bounds - some planet wide, some with a region restricted to where a particular animal lived, and so when combining them they had to do some offset math; this math was then visible on every raster access, and whilst I’m not religious about DRY, it just made the code hard to read (to me at least). An example snippet of the type of thing:
 = {	# up here is a bunch of logic to work out the window of interest
Whilst this code works, carrying around that window struct and applying it repeatedly quickly makes the code harder to read, and it's more fun yet as it's not applied everywhere all the time, just most of it. To solve this I wrote a small wrapper-class for the GDAL raster maps, called Layer, that would let you work out the intersection or union of a set of maps, and then set the offsets per layer once, and then never have to see them again.
 = [,,]
 = Layer.()
for  in :
	layer.()
 = ...
for  in range():
	 = (
		0,
		,
		intersection.x,
		1
	)
	 = np.((*))
	 = elv.(*)
	 = np.(>=min()<=max())
	 = mask.(*)
	 =  *  * 
	 += 
My code looks a little more dense, but hopefully you get the idea: now I can just define the area I'm interested in at the start, and never reference it again; each layer will do the offset math required for me, whereas before you had to remember that different layers had different offsets and there was a lot of duplicated code.
Fortunately for me this pull request was rejected: the original author was concerned about me adding classes to his code that didn't generally use them, and that's understandable - they have to live with the code beyond my little contribution and so need to be happy with it. But I was convinced that this kind of simplification would be needed as we did more work of this nature of 4C, so I moved this code to my own library, and then used it myself when working on GDAL rasters, and it suited me to have these complexities abstracted away where possible.
I say fortunately, as it wasn’t long before more opportunities to try and hide some of the complexities with working with geospatial data presented themselves, and I've found myself more and more leaning on Yirgcheffe to hide the rough edges of working with GDAL as time goes on.
The ecologists I’ve been working with are not coding naive: dealing with large data sets is something they do a lot, so they end up writing data processing scripts in either R, a popular free-software statistics package, or in Python using packages like numpy. Whilst I’d come across the Numpy library as a thing scientists use with Python, it was never something I’d had cause to dabble in before now. For those in a similar boat, the high level synopsis is that Numpy lets you do direct computation on large arrays of data. It turns out it’s popular for good reason: not just that it performs well when handling said large arrays, but also because it makes it quite easy for the data scientist to express their ideas without getting stuck into the tedious programming of working with array - that sounds a bit like somone is trying to do "The Right Thing".
As a programmer, when facing two arrays of data that I need to combine the elements of, there are several ways I’d think to approach this. If I was being particularly old-school might use a for-loop to add together each element in turn:
>>>  = [1,2,3]
>>>  = [4,5,66]
>>>  = []
>>> for ,  in zip():
...   result.(+)
...
>>> 
[5,7,69]
Or these days I’d use a approach like list comprehension or map, basically whatever language feature lets me apply a function to every element in an array without worrying about the size of the array (this time in Swift):
  1> let x = [1, 2, 3]
x: [Int] = 3 values {
  [0] = 1
  [1] = 2
  [2] = 3
}
  2> let y = [4, 5, 6]
y: [Int] = 3 values {
  [0] = 4
  [1] = 5
  [2] = 6
}
  3> let result = zip(x, y).map { $0 + $1}
result: [Int] = 3 values {
  [0] = 5
  [1] = 7
  [2] = 9
}
But I’d always be thinking element wise, as that’s how most programming languages work.
Numpy on the other hand lets you just add the two arrays directly, without thinking about each element individually:
>>> import  as 
>>>  = np.([1,2,3])
>>>  = np.([4,5,6])
>>>  =  + 
>>> 
([5,7,9])
Thus your top level code better approximates whatever algorithm you’re trying to articulate, removing all the gubbins like loops and lists and maps and zips that you’d otherwise need that distract from the overall view of what you’re trying to do. As someone who grew up low-level programming this feels somewhat alien, but I do love the idea that it lets you concentrate on the interesting bits of the original problem you’re trying to solve, rather than getting lost in the details. And given that even seasoned professional programmers are notoriously bad at checking array limits, having numpy worry about those details for the data scientists seems quite a win.
The problem here is that trying to do “The Right Thing” is very hard and you eventually hit a place where it doesn't work. As good as Numpy is, it doesn’t take care of everything for you, and when you hit those limits sometimes it'll let you know directly, and sometimes things get messy. Here's a real example of where we hit that hard enough it took down one of our servers.
As mentioned in the opening, these ecologists work with very big data sets; on the project I've been assisting with a full base map of the world is 400,000 x 200,000 pixels, which is 73 GB of data. This is why, in that first code snippet, you can see that the data is being processed a line at a time: to allow the problem to fit into memory on a regular computer.
At 4C we have a couple of compute servers for tackling some of these bigger jobs: when you're trying to process tens of thousands of species over multiple map layers, you start to push what can be done on your desktop machine in a reasonable amount of time. These machines have 1 Terabyte of RAM, and so are specced well beyond any computer I've used before. But even so, when your map-layers are 73 GB each, you need to be careful still about how you use what hitherto felt like limitless memory.
And this is where Numpy's making it so easy to forget what's in your data arrays can come back to bite you. One of our ecologists wrote a small script to process some maps, forgetting to add the tedious bit about loading data a line a time, and it worked fine for one species of animal, but when they tried to run multiple species at once, the server quickly ran out of memory, and Linux not being that good when you run out of resources we had to end up asking one of the lab admins to reboot the server to get it back. On a group server that sort of incident can not just be annoying, but cause others to lose work. Through no fault of their own, the ecologist had taken out a bit of infrastructure, because Numpy did The Right Thing, but only up to a point.
And to be clear, this is Numpy's fault, not the ecologist's. Numpy tells you not to worry about the array contents, let it do that. But it doesn't do any attempt to fail gracefully when things start to go wrong memory wise. The ecologist is trying to work with data, and is not a systems-level programmer who understands operating system behaviour under low memory conditions. Indeed, if we want to look at who else to blame here, if not at Numpy we can point the finger at the operating system who's virtual-memory system is clearly not working as desired.
And so back to Yirgacheffe, where I try and build a new "Right Thing". This time I add a couple more complexities behind the scenes to free the ecologist users to focus on saving the world and not on how rubbish computers are.
I can hide memory-saving tricks like dynamically turning vector based maps into rasters on demand, rather than pre-computing the entire raster in memory as they did before. From a user's perspective it looks like I've done the same thing, they can just call read_array on chunks of the vector map and get back pixels, but I'm hiding the complexity of only doing on demand rasters to reduce memory consumption.
But that doesn't help with the fact that the ecologist is still worrying about reading data a line at a time. And indeed, whilst we're at it, why a line at a time? If we're offloading this to the GPU, as we could well do given the type of computation involved, then one line is often too small a unit to get a benefit - so do we now need ecologists to be GPU experts too? Even without a GPU, using too small a chunk size is inefficient, and yet it shouldn't be the job of the ecologist to know about disk bandwidth transfer tradeoffs.
Instead, Yirgacheffe now takes the same approach that Numpy does, and overloads the arithmetic operators on its Layer objects, so now our opening example would look like this:
 = [,,]
 = Layer.()
for  in :
	layer.()
 = elv.(lambda(>=min()<=max()))
 =  *  * 
Internally Yirgacheffe will do the chunking to ensure that you don't run out of memory, letting the ecologist worry about the bit they're good at, whilst I worry about the I'm good at :)
From this you might think I'm pitching Yirgacheffe as the be-all and end-all of geospatial libraries, but it's early days, so we're still a distance from that. Every time I reach for Yirgacheffe I find myself expanding on it as I hit new use cases I've not been exposed to - I'm still new to the kind of processing people do in this big data world. But it's a sufficiently "Right Thing" that I do find myself using it when I need to do any geospatial coding myself: I now refuse to work with GDAL directly given that I have something better.
But there is the nagging question in my mind: what will it be that I've failed to take into account in my model that will cause it to fall down? I'm fairly convinced you can't have a perfect abstraction, and the nice thing (as a systems programmer) about "Wrong is better" is that it forces you to address everything that can go wrong up front. This is why I prefer languages like Swift and Go over Python in my own work. The best I can do is try to spot where Yirgacheffe will fail and make it fail fast in such cases.
But at the same time I'm quite excited about the opportunities of having a systems-programming layer here. For example:
- The most pressing is output - I'm still manually creating GDAL datasets to write data into, and I have enough information to remove that tedious task, I've just not had the time yet.
- As alluded to, I'm a big fan of moving data to the GPU for tasks like these, and Yirgacheffe is the perfect place for me to hide these things. I did some experiments with using CUPY as a Numpy compatible way to use Nvidia's CUDA earlier this year, and you get some good results if you shape your data correctly.
- I currently haven't yet pushed concurrency down into Yirgacheffe, but when our servers have 256 cores, that's a definite win, and something that aligns well with the current data chunking model it uses behind the scenes. In particular, when I joined 4C one of my early tasks was to write a small parallel job scheduler, that let the ecologists run the same script many times, but I think for them writing a single script that scales over cores fits better with how they currently work.
- Hiding GDAL's memory leak problem. One issue with scaling these tasks is that the GDAL SWIG bindings leak memory, and for some reason also start to get slow. Currently I'm avoiding that when scaling up tasks by pushing Yirgacheffe into Python child threads, which run in different processes, so when the child thread exists the leaked memory is cleaned up. This is another bit of systems-programmer think that needs to be hidden away.
- For a lot of what GDAL is used for I could actually swap it out for a TIFF image library that is less leaky too - by moving the ecologists away from thinking in direct GDAL terms I can be free to swap out that part where appropriate.
There's lots of opportunity to empower the users here, and I'm looking forward to expanding on Yirgacheffe over the coming months as we work on new tasks.