Yirgacheffe vs The Stack
I've not had time to focus on Yirgacheffe directly of late; although I have some ideas I want to build on in terms of finishing the 2.0 release and then moving into better parallelism support, I've mostly just been a consumer of Yirgacheffe as I've been working on AOH-related things (which will appear in their own weeknote shortly). Still, as the most advanced user of the library, I continue to push into corner cases that need work, and so there has been a little progress in the last few weeks. That also included hitting a random round number of commits:
I'll raise a cup of coffee in celebration, though I failed to get the right beans in time!
A few weeks ago I wrote about how I started chasing a bug to do with Yirgacheffe complaining about mismatched projections, and then ended up hunting down another issue which turned out to be GDAL running out of C library file pointers. That is now fixed in GDAL, so I look forward to picking it up in a future release. However, I did find time to turn my attention back to the original bug, which hadn't gone away.
The superficial symptom appeared when calculating a species richness map, in which we sum up a set of per-species Area of Habitat (AOH) maps to work out how many species overlap each pixel. The method is simple: we take an AOH raster map, turn any pixel that has some area for the species into a 1 and any empty pixel into a 0, and then add all those binary layers together. The problem was that somewhere in all this Yirgacheffe complained that two of the rasters I was trying to sum had different map projections. This was surprising, as I'd generated all the AOHs in a single job, so a projection mismatch felt unlikely; and if I took smaller samples of the files that between them covered all the inputs, I didn't get the error, so clearly something else was going on.
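As a toy illustration of that calculation (plain numpy on tiny made-up arrays, purely to show the idea; the real pipeline works over large GeoTIFF rasters via Yirgacheffe):

import numpy as np

# Two hypothetical per-species AOH rasters, fractional area per pixel
aohs = [
    np.array([[0.0, 0.3], [0.0, 0.9]]),   # species 1
    np.array([[0.5, 0.0], [0.0, 0.2]]),   # species 2
]

# Binarise each AOH, then add them all up to get species richness per pixel
richness = sum((aoh != 0.0).astype(np.uint16) for aoh in aohs)
# richness is now [[1, 1], [0, 2]]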
The problem was I believed my own hype :)
Originally, Yirgacheffe wasn't so much a declarative library for geospatial work as a library that handled the raster-alignment bookkeeping for you, so you could pretend all your maps had the same geospatial extent. As such, the original species richness code was written to incrementally build up the species richness raster one species at a time. This is a simplification of the original code that removes the seasonality handling, but hopefully you get the idea:
merged_result = None
while True:
    raster_path = input_queue.get()
    if raster_path is None:
        break

    # Convert the AOH raster, which has fractional area per pixel, to a binary layer
    raster = RasterLayer.layer_from_file(raster_path)
    calc = raster != 0.0
    partial = RasterLayer.empty_raster_layer_like(raster, datatype=DataType.UInt16)
    calc.save(partial)

    # Store the new binary layer into the incremental merged result
    if merged_result is None:
        merged_result = partial
    else:
        merged_result.reset_window()

        # Find the union area of the new update and the incremental version,
        # and then make both rasters use that as their virtual area extent
        union_area = RasterLayer.find_union([merged_result, partial])
        partial.set_window_for_union(union_area)
        merged_result.set_window_for_union(union_area)

        # Just add the two layers together
        merged_calc = partial + merged_result
        temp = RasterLayer.empty_raster_layer_like(merged_result)
        merged_calc.save(temp)
        merged_result = temp

if merged_result is not None:
    final = RasterLayer.empty_raster_layer_like(merged_result, filename=output_tif)
    merged_result.save(final)
A lot of code, but hopefully you can see how it works: it loads one AOH raster at a time (taking the file names from a queue), converts it to a binary layer, and keeps an incremental result in merged_result, adding each new layer to that and saving the running total in memory. Then finally, once done, we save the result to a file.
But as Yirgacheffe got more clever, I realised I didn't need to do all this saving of intermediate results as I went; I could just keep adding layers as if they were numbers in a simple expression:
merged_result = 0
while True:
    raster_path = input_queue.get()
    if raster_path is None:
        break
    raster = yg.read_raster(raster_path)
    calc = raster != 0.0
    merged_result = merged_result + calc
if merged_result:
    merged_result.to_geotiff(output_tif)
So much shorter and cleaner thanks to Yirgacheffe's improvements over the last year. And it worked, at least initially. But it hides a bunch of cleverness, which would be fine if it was very clever, or I was very clever, but it wasn't quite clever enough, and that's what bit me.
Whilst the two bits of code look very similar structurally, there is a big difference in when things are evaluated. In the first example, every time we load a new species AOH raster we immediately add it to the "rolling total" raster stored in merged_result. That means we have a raster map that has to exist entirely in memory whilst this is going on, and it can be quite large: we use 100m per pixel maps for some of this work, and storing one of those in memory requires 300 GB of RAM! This is fine on our big compute servers, which have 1TB of RAM each, but no good on, say, my laptop. Yirgacheffe is clever enough to avoid needing all that memory, and the second version of the code will run quite comfortably on a laptop even with such high resolution rasters. It can do this because when you add two rasters together in Yirgacheffe it doesn't actually do anything; it just notes that at some point you'll want to know the outcome of that request, and defers the actual work until you save the file in the last line of the code snippet. This is what's called lazy evaluation in computer science. And how does laziness help avoid needing all that memory? Well, we only ever store the result to disk, and it's only on disk that the entire final result needs to materialise, so Yirgacheffe knows it can split the job into smaller chunks, say working out the first row of pixels, then the second, then the third, and so forth, writing each to disk before moving on to the next.
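To make the laziness concrete, here is a minimal sketch of the idea in plain Python and numpy. This is purely illustrative and not Yirgacheffe's actual internals: the point is simply that the addition is recorded rather than performed, and only one row of the result is ever materialised at a time when it is finally written out.

import numpy as np

class Raster:
    """Stand-in for an on-disk raster that can be read one row at a time."""
    def __init__(self, data):
        self.data = data

    def read_row(self, y):
        return self.data[y]

class LazyAdd:
    """Records an addition; no work happens until the result is written out."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def read_row(self, y):
        return self.lhs.read_row(y) + self.rhs.read_row(y)

    def write_to(self, destination, height):
        for y in range(height):
            # Only one row of the final result is in memory at any moment
            destination[y] = self.read_row(y)

a = Raster(np.ones((4, 3)))
b = Raster(np.full((4, 3), 2.0))
expr = LazyAdd(a, b)           # nothing is calculated here
out = np.zeros((4, 3))         # in reality this would be a file on disk
expr.write_to(out, height=4)   # rows are evaluated and written one at a time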
It's great, it works, clearly I'm a genius for making it all so simple and efficient.
Only I forgot how programming languages work under the hood, literally beginner course in computer science stuff, and that's why it broke 🤦
If we think about how that lazy expression works, I've built up this in memory expression behind the scenes that looks like:
(((raster_1 != 0) + (raster_2 != 0)) + (raster_3 != 0)) + (raster_4 != 0) ...
Which we can visualise like so:
This is for ten rasters, but we process thousands of species, so the graph gets quite big. The thing to note is that this is an expression of nested expressions, which is a complicated way of saying that to evaluate it Python tries to do the first add in the tree, sees that it needs the result of the second add, so calls that, which in turn realises it needs the next add, and so on all the way down, until you have a large stack of adds. Only once it reaches the final add at the end of the expression can it unwind the stack and produce the result.
I'm deliberately using the term stack here, as that's actually what is being built in memory each time you call a function: every time you call a function some memory is allocated for that function's variables, and that goes onto what is referred to in computer science as the stack. So you can see that in our little example, to add ten rasters together, I'd have a stack at least ten calls deep. The problem is that the stack can't grow forever, as it is only allocated a finite amount of space, so at some point you run out of memory. To guard against that, Python has a limit on how deep a stack you can create, which defaults to 1000 calls:
>>> import sys
>>> print(sys.getrecursionlimit())
1000
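As a purely illustrative toy (again, not Yirgacheffe's actual code), you can see the problem with any object that evaluates itself by recursing into its children: chain enough of them together and that limit bites.

class Add:
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def evaluate(self):
        # Evaluating a node means evaluating its children first,
        # costing one stack frame per nested Add
        lhs = self.lhs.evaluate() if isinstance(self.lhs, Add) else self.lhs
        rhs = self.rhs.evaluate() if isinstance(self.rhs, Add) else self.rhs
        return lhs + rhs

expr = 0
for _ in range(2000):   # build a chain far deeper than the default limit
    expr = Add(expr, 1)

expr.evaluate()         # blows past the limit and raises RecursionError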
However, what was odd about the issue I was hitting was that I wasn't getting the expected error of:
RecursionError: maximum recursion depth exceeded
So I dug in more, and actually the error message, which had unfortunately been tidied up and hidden by Yirgacheffe, was very weird:
...
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/layers/base.py", line 101, in …
    if … is not None and … is not None and … != …:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_datatypes/mapprojection.py", line 93, in __eq__
    if self.c != other.c:
       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_datatypes/mapprojection.py", line 93, in __eq__
    if self.c != other.c:
       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/pyproj/crs/crs.py", line 1594, in __eq__
    return self.equals(other)
           ~~~~~~~~~~~^^^^^^^
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/pyproj/crs/crs.py", line 1594, in __eq__
    return self.equals(other)
           ~~~~~~~~~~~^^^^^^^
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/pyproj/crs/crs.py", line 982, in equals
    other = CRS.from_user_input(other)
  File "/Users/michael/venvs/aoh/lib/python3.13/site-packages/pyproj/crs/crs.py", line 982, in equals
    other = CRS.from_user_input(other)
TypeError: CRS.from_user_input() takes 2 positional arguments but 3 were given
It's not actually saying that the two rasters don't match, but that some function within pyproj, the map projection library I'm using, is being called with the wrong number of arguments! This is weird for two reasons: firstly, Yirgacheffe is a paranoid library that is constantly checking that map projections match, so we'd called this function thousands of times with no ill effect before this point; and secondly, if I look at the source code for pyproj I can see that the code is correct.
However, a few lines above the error I can see:
"/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_operators/__init__.py", 629, in
= self.rhs.()
"/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_operators/__init__.py", 629, in
= self.rhs.()
[981]
"/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_operators/__init__.py", 627, in
= self.lhs.()
"/Users/michael/venvs/aoh/lib/python3.13/site-packages/yirgacheffe/_operators/__init__.py", 627, in
= self.lhs.()
We can see that we are over 981 calls deep into evaluating the expression, so we're close to hitting Python's stack limit. Is it just that we're unlucky and actually run out of stack memory ahead of Python's built-in protection, and the stack is getting corrupted? It certainly feels like a stack corruption issue when the error says the code is broken but we know categorically that it isn't.
The first thing I check is raising the stack limit. The default maximum stack size on macOS is quite generous at 8 MB, or 8 KB per stack frame at the default 1000-frame limit, so this feels unlikely; but given what seems to me to be clear stack corruption, I wonder whether, with all the calls I'm making into C libraries in GDAL and pyproj and elsewhere, I have a leak somewhere that's pushing me up against the stack limit. So I increase the stack limit to 64 MB, the most you can set it to, but that has no impact. This means either my theory is wrong, or Python has some other stack memory limit that it uses instead of the system-provided one - it's been a while since I played with macOS internals and I've no memory of how the stack limit is enforced, if at all.
Instead I try the easier option, which is telling Python to lower its limit on stack frames from 1000 to 500; we know from the previous error message that we can get beyond 500 nested function calls. However, when I do that I get the same result, only the repeated frame is now 481 calls deep rather than 981. From this we can infer that whatever corruption is happening is related to how Python handles its recursion limit, rather than us exceeding the stack memory limit.
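For reference, that limit is just a one-line change:

import sys
sys.setrecursionlimit(500)   # the default is 1000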
Unfortunately at this point I ran out of time to push much further with debugging this. I did try to create a stand-alone reproduction, but that just hit Python's expected recursion limit error. It did tell me, though, that I have a bigger problem here from Yirgacheffe's point of view: even if we weren't seeing this memory corruption problem, we'd be hitting stack depth limits all the same! I need to add up a total of 34K AOHs for a pipeline like STAR (which is where I first hit this), so clearly just naively adding together 34K rasters in the way I had hoped wasn't going to work.
This is why I say I believed my own hype. I built Yirgacheffe to take care of hard problems, and most of the time it does a good enough job of it that I can forget all that's going on under the hood and just be a user of this magic system that takes care of all the resource issues that come from computering. Which is kind of good, until I try to do something silly like adding together petabytes of data as if it were nothing, and start to find the limits of what I've built.
Or at least, what I've built so far :)
There is a short-term and a long-term fix for this. The longer-term one will unfortunately have to wait, as I have people on the STAR and LIFE teams expecting me to deliver awesomeness, and I've already spent a day spelunking computer things they're not fussed about. However, both solutions follow the same underlying principle, so we can talk about that briefly.
If we look at our expression graph again:
We can see this is what we'd call an "unbalanced" graph, which comes from how it was built up in the code: the while loop just keeps adding each new raster to the running total. However, because of how addition behaves, we could just as well have a balanced graph that gets us the same result:
These two graphs give us the same result, because addition is both commutative (i.e., a + b is the same as b + a) and associative (i.e., a + (b + c) is the same as (a + b) + c). The interesting thing is that the call depth of the second version is only four additions rather than ten. That doesn't seem like much of a win, but it scales with powers of two: two rasters are one add deep, four rasters are two deep, eight are three deep, and if we had 1000 rasters to add up the call depth would only be ten. So if we balance our trees we can easily process the 34K rasters for STAR's species richness map (the aforementioned GDAL issue aside).
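As a rough sketch of the balancing idea (my own illustration of the principle, not Yirgacheffe's actual implementation), you can build such a tree by repeatedly adding neighbouring pairs, so that n inputs need only around log2(n) levels of addition:

def balanced_sum(layers):
    """Combine layers pairwise, giving an expression ~log2(n) additions deep.
    Assumes at least one layer; works on anything that supports +."""
    layers = list(layers)
    while len(layers) > 1:
        paired = []
        for i in range(0, len(layers) - 1, 2):
            paired.append(layers[i] + layers[i + 1])   # one level of the tree
        if len(layers) % 2 == 1:
            paired.append(layers[-1])                  # odd one out carries over
        layers = paired
    return layers[0]

For the 34K rasters in the STAR pipeline that gives an expression only around sixteen additions deep, instead of around 34,000 deep with the naive left-to-right chain.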
Now, obviously we could ask the user of Yirgacheffe to make sure they create balanced expressions, but really Yirgacheffe should be doing this: the point of Yirgacheffe is to deal with computer limitations rather than force them upon the user. Yirgacheffe needs just a little more information about which operators support re-ordering (i.e., which are associative) and it could then balance the tree automatically without troubling the user. But that's work I don't have time for right now, so I did a quick workaround to get me to the same place: I added a sum operator that takes a list of rasters and adds them together by building a balanced graph like the one displayed above (indeed, both graphs you see are debug output from Yirgacheffe that I use to show me what's going on!). I also added any and all operators, similar to those in Python and numpy, which have similar properties to sum.
So now my new species richness implementation is just:
rasters = []
while True:
    raster_path = input_queue.get()
    if raster_path is None:
        break
    raster = yg.read_raster(raster_path)
    binary = raster != 0
    rasters.append(binary)
if rasters:
    merged_result = yg.sum(rasters)
    merged_result.to_geotiff(output_tif)
Am I satisfied with this? No, for multiple reasons: I suspect not everyone will think to look for a sum operator rather than just adding rasters together, and I really do need both to work; and there are other operators these rules could apply to that aren't covered by just sum, any, and all. But I have added documentation for the new methods, so hopefully they are at least somewhat discoverable, and that will have to do until I get time to come back and fix this properly.
Tags: weeknotes, yirgacheffe, aoh, star