Weeknotes: OCaml and point clouds
7 May 2026
I made some small progress towards my dream of a global point cloud viewer this week. Not much progress, but a start. It was riffing off a conversation I had with Anil Madhavapeddy a couple of weeks ago, where we talked about how we could have a single index for all the geospatial data we wanted. Whilst I started with trying to load all of OSM into PostGIS, I then stalled as I realised I didn't really 1) have a motivating example, and 2) didn't understand how spacial indexes work. So with apologies to Anil, I set out of a side-quest from this, but one I hope will make it easier for me to frame that original task when I return to it.
An ever expanding viewer
The motivating example I'm going to start with is my Lidar viewer which you can play with here. This started out as just showing a tiny area of Lidar data to being something where now I can pull in many different types of aligned data to let me explore and compare datasets.
You can read more about this here, but the tl;dr is that it's showing a 7.5km x 7.5km area of Swedish forest with data from Landmäteriet, and doing a bunch of (admittedly shonky right now) juggling of the point data so that you don't load all 120 million or so cloud points, rather trying to keep it under 10 million, which seems to work okay on recent computers and phones. It is also mixing in data from other data sources: you can choose to colour the points based on the Nationella Marktäckedata land cover map. In addition to that it has some data from a couple of runs of a tree detection algorithm with different parameters. Basically, I started with a single source of data, but slowly I'm throwing in other bits of interest to make it a more powerful/useful tool.
Imagine one day we get to QGIS, but 3D as the primary UI. That's not to say 3D is "better", but for a certain class of inspection it definitely is (ideally in the end you need both). I already had someone from a hobby group in Sweden that collect unusual trees reach out to me as they found features of the Kullberg forest not visible on the DSÄ live streams via my little viewer above. The shapes of trees become visible in 3D, but would not be found in a top-down 2D projection.
I said at the time I posted the original viewer that I'd like to eventually scale this up to display the entire country, not just the 9 tiles of point cloud data I have in the current (at time of writing) example. And for that I'll need to start actually having some smarts on the server side; the current implementation is just a single HTML page with some javascript that fetches a static list of tiles. You can change the list of tiles, and it does kinda work, as shown in this example where I'm rendering the area where I live, using data from the UK Environment Agency.
However, it's just another fixed list, and you can't go beyond these little areas. And whilst in theory I could just have a larger static list, at some point that gets silly and you want an active thing on the backend where you tell it where you are and get a list of what layers are available (e.g., the Swedish land cover map is no good in Merseyside), and the local subset of that information for where you're looking, thus we end up with our motivating example for trying to build a single large index of all geospatial data ever :)
R-Trees and LAS files
One thing Anil mentioned in our chat, which I made a note to look up, was R-trees, which is a datastructure I'd heard mention of before but never dug into. It is basically a way to do spacial indexing of data by building a tree whereby data points should be stored based on locality. This is how PostGIS builds its spatial indexes. In trying to understand them I had a little play with this OCaml implementation. One thing I think it's worth noting about R-trees is there is a couple of tuning parameters to play with around how many sub items you allow in each node of the tree before you create another level of the hierarchy. Anil had said that when he tried the aforementioned loading all of OSM into PostGIS the performance wasn’t what he expected, so tuning options is something I want to play with there.
But given we’re in OCaml, and given it’s been a while since I wrote any, I took this opportunity to also dig more into how point cloud files are stored, and started implementing an OCaml LAS file format reader. There’s three related file formats at play here, but thankfully all have the same header structures. You have LAS format, which is an uncompressed file format for storing point cloud data. Within that you then have many different sub formats, depending on what data you want to store alongside just the raw x, y, z data. LAS itself has also been updated over the years, and we’re now at version 1.5, and some of the older internal formats are now deprecated. In practice I’ve found that the UK EA data was in LAS format 1.4, but the Lantmäteriet data is using a quite old version of the file format, 1.2 - which is just to say you need to be actually flexible in dealing with LAS files as there’s many versions in play out there, even for contemporary datasets.
Now, none of the data I get from national portals actually is in LAS, all are using LAZ, which is a compressed version of LAS developed by Martin Isenburg. LAZ is interesting in that it doesn’t compress the entire file, leaving the header data compatible with LAS file readers, and just compresses the data sections, with a tweak to the headers to shop LAS readers attempting to decode LAZ data in error. Despite the library to process these files being called LasZip, the compression isn’t just “apply zip to data”, but is tailored specifically for the way LAS files store point cloud records, and is designed to allow for random access within the data itself (a notable failing of popular LZW compression option often used in GeoTIFFs). I suspect when I want to decode LAZ data I’ll just do a bridge to one of the existing C libraries, but it might make a fun summer student project to write a native OCaml version. The paper does a good job of explaining how the compression works to those not deep into compression, so I do recommend reading it if you’re at all interested.
The third format I was referring to is Cloud Optimised Point Clouds (COPC, which I lean on heavily in my existing viewer. In regular LAS or LAZ files there’s no ordering constraint on the points stored, but in COPC they are organised so you can quickly get either a course high level overview or zoom in on just one subregion without having to read the entire file. COPC files are just LAS or LAZ files, but with that extra constraint applied to the data, and the way LAZ files chunk the compression to allow random access aligns with this.
All of which means that with just a little bit of work I’m able to read the metadata for all three file formats as they’re essentially the same file format :)
Pulling it together
Armed with a library to read point cloud file metadata, and an R-tree implementation, I was set to start building a geodata server in OCaml. I’ve built web stuff in OCaml before - this website for instance - but rather than repeat that, I felt this was finally time to try get into the EIO world my colleagues have been building via cohttp-eio and try to learn what that’s all about. Thus I’ve begin to code up what is currently quite a naive thing that will serve up metadata on a collection of lidar tiles, as well as indexing them in an R-tree so the front end can indicate the current area of view and get the right Lidar data back for it, letting me start to expand out from where I was originally looking, for instance here is the upstream power station from the river I keep showing:
And the dam to the south:
I talked in my last post about how the "river" we are looking at in this series of posts is actually a reservoir, and here is the proof :)
This is just a minor expansion from a 3x3 grid to a roughly 7x7 tile grid, not quite an order of magnitude increase yet, but that’s all the tiles I’ve already downloaded. I’m currently downloading a further 112GB of Sweden to let me try fit the entire high coast (Höga Kusten) region in there. I was going to throw in more UK data here (as that would probably make it more interesting to local collaborators), but I struggled to find primary Lidar data for the UK rather than products derived from the Lidar data (e.g., both raster surface and terrain elevation maps I can get for most the UK, but I can only seem to download point clouds for certain small regions of the UK). I’m probably just looking in the wrong place, so I need to resolve that. We do have point cloud data for Cambridge and London from another project, so I should try starting with that perhaps, but I was hoping to find something more scenic :)
Lots of todos
So a start, rather than a finished thing (the joy of weeknotes rather than proper blog posts). But it has been refreshing to write some OCaml for the last few days having not touched it for several months. Things to do in the near future:
- Get the 112GB of Lidar data into an R-tree and see how it performs, and just keep pushing that up until things break
- Add the OSM data in there too
- Look into tuning options in PostGIS and the Ocaml version
- Improve the rendering so the zoom is less janky
I also want to throw raster tiles in there so I can add appropriate land cover.
One challenge will be that the UK and SE datasets use their own local projection system, so I suspect I'll end up having to map them both to a common projection in the index. Another complicated use case for this will be Tessera tiles, which are all in a local UTM space. In my mind for now my megaindex will just return data in whatever format it's stored in and make dealing with that the client's problem. Down the line we can perhaps address that, but I think the key bit right now is to have something that can ingest data from different local reference systems but still answer a general query about give me all the data around where I am now.
Tags: weeknotes, ocaml, point clouds, LIDAR, geocaml, rtrees