Working with Wikibase from Go

26 Nov 2018

Tags: api, golang, mediawiki, wikibase

This post is a quick review of working with the APIs for MediaWiki and, more specifically, Wikibase, a data management extension to MediaWiki that turns a wiki into a pseudo-database for humans to edit. The API hasn’t been the easiest to work with, and to manage some of those issues I’ve resorted to writing a little abstraction layer in Go that looks like the traditional database abstractions used in modern web development. This post captures the things I struggled with: in part to save other people from hitting these issues themselves, and in part in case I’ve missed something and it’s all easier than I think and someone can correct me :)

I’ve been working on a Wikimedia Foundation supported project, ScienceSource, to help get more verified medical literature into Wikidata (the structured data repository that sits alongside Wikipedia, built on MediaWiki extended by Wikibase). The first stage of that work has been helping ingest a selection of open access publications into a custom Wikibase-enabled server, doing some text mining along the way to highlight common drug and disease terms, so that people can review the papers quickly and methodically.

Wikibase is an extension to a normal MediaWiki installation, designed to allow structured knowledge to be represented. Although the name implies a portmanteau of “wiki” and “database” (at least to an engineer like me), it’s much less structured than a database: you have items on which you can record properties (cf. records and fields in a database), but there’s no schema defining what you can and can’t add to an item, so you can add as many or as few properties to an item as you like. This, I think, matches its intended use: a place where humans, rather than computer programs, manage the knowledge stored in Wikibase. But, as with other wikis, there does come a point where software starts being used to manage the Wikibase contents (bots, in the MediaWiki parlance). MediaWiki has an API that lets you manage wiki pages, and Wikibase extends that API to let you manage knowledge items and their properties.

Unfortunately, as a programmer I’ve found the APIs for MediaWiki in general, and Wikibase specifically, not the easiest to work with reliably. As someone interested in writing well-understood, well-structured code, ideally in a type-safe language, they seem almost designed to work against me at times. What I describe below could just be my limited experience, or I may have missed some key bits of documentation, so please do correct me if so, but from where I sit it’s been a frustrating experience using these APIs.

Here are some of the issues I’ve hit along the way, starting with MediaWiki in general:

  • Authentication-wise you can use one of two approaches: username/password or OAuth, neither of which I’d consider ideal. The OAuth support is still at version 1 which, being aimed at browser flows, isn’t very nice for people writing bots. There is a way to generate a one-off set of OAuth consumer and access tokens per user, which is some relief, but it’s not a very good workflow from the user’s perspective, as it requires admin access; it’d be nice for some bots to be runnable by non-admin wiki editors.
  • If you want to edit anything, rather than just read items, your bot first needs to request an edit token which it must present with each edit operation (see the sketch after this list). If you’re using OAuth (as my bot does) this is unnecessary extra overhead, as client permissions are decided at the point the OAuth consumer token is created, so the server should already know what I can and can’t do. However, the fact that the server returns this token as a “csrftoken” is a hint that perhaps the API is just a thin veneer on top of the regular web server code, as a CSRF token is what web pages use to prevent cross-site request forgery attacks.
  • The documentation asks that your bot serialise all requests, which strikes me as making the client complex in order to keep the server simple. If the client makes too many concurrent requests, I’d certainly expect the server to tell me to back off, but serialising requests shouldn’t be a documented API requirement, particularly when clients exist to do bulk edits and computers are quite good at doing work in parallel.

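As a concrete illustration of that token dance, here’s a minimal sketch in Go that fetches an edit (CSRF) token over an OAuth-signed connection. The wiki URL and credentials are placeholders, and the OAuth 1 signing is delegated to a third-party library (github.com/dghubble/oauth1 here, as an example of the sort of thing you’d use); the endpoint itself is the standard action=query&meta=tokens call.

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/dghubble/oauth1" // assumption: any OAuth 1 client library would do
)

// tokenResponse models just the part of the action=query&meta=tokens
// reply we care about: the CSRF ("edit") token.
type tokenResponse struct {
    Query struct {
        Tokens struct {
            CSRFToken string `json:"csrftoken"`
        } `json:"tokens"`
    } `json:"query"`
}

func main() {
    // Consumer and access tokens generated via the wiki's OAuth consumer
    // registration (the values here are placeholders).
    config := oauth1.NewConfig("consumer-key", "consumer-secret")
    token := oauth1.NewToken("access-token", "access-secret")
    client := config.Client(oauth1.NoContext, token)

    // Every subsequent edit request must carry this token, even though the
    // OAuth credentials already establish what the client is allowed to do.
    resp, err := client.Get("https://wiki.example.org/w/api.php?action=query&meta=tokens&type=csrf&format=json")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var tr tokenResponse
    if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
        log.Fatal(err)
    }
    fmt.Println("edit token:", tr.Query.Tokens.CSRFToken)
}
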
On the Wikibase API specifically:

  • The documentation for the Wikibase API is not comprehensive. To find an item you’re simply directed to play with the sandbox rather than given a complete list of options in one place, and for creating an item there’s no explanation of the fact that the API will let you create properties on an item at the same time as you create it (see the sketch after this list); in the end I had to use a combination of detective work and trial and error to get tricks like that last one to work.
  • The reason that last trick is so important is that the API is very slow to use, so anything you can do to minimise the number of requests is definitely a plus. When you’re trying to upload 7,000 items, each with perhaps ten or so properties, the fact that you can manage only about one API request a second starts to add up significantly (and remember, no sneaky concurrent requests!).
  • Outside of that initial create there’s no obvious API for bulk editing, so as far as I can tell you’re forced into one request per property for any further updates.
  • The type system for property values is not well structured, much like the API overall. A single API call argument may accept or return many different shapes of JSON, which makes implementing clients in type-safe languages such as Go and Swift a bit of a trial.
  • There are also seemingly arbitrary limits on the data you can store: you can’t have zero-length strings, and instead have to make a differently structured API request to record “no value”. Strings can’t have whitespace at either end either, so if that’s important to what you’re trying to store in the item then you’re out of luck.
  • Properties with the “TimeData” type require you to send dates in a format that includes hours/minutes/seconds (RFC 3339), but your request will fail if any of those fields aren’t zero, as internally Wikibase just stores dates, not times (despite the name!).

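To make that concrete, here’s a rough sketch of what creating an item together with a couple of claims in a single request can look like, using the wbeditentity module. The property IDs (P123 for a string property, P456 for a time property) and the API URL are placeholders, and client and editToken are assumed to come from the earlier token sketch; note the shape of the time value, where precision 11 means day precision and the clock portion has to be all zeros.

package example

import (
    "net/http"
    "net/url"
    "strings"
)

// The JSON document passed in the "data" parameter: a label plus two
// claims, created alongside the item itself in one request.
const itemData = `{
    "labels": {"en": {"language": "en", "value": "Example item"}},
    "claims": [
        {
            "mainsnak": {
                "snaktype": "value",
                "property": "P123",
                "datavalue": {"value": "An example string", "type": "string"}
            },
            "type": "statement",
            "rank": "normal"
        },
        {
            "mainsnak": {
                "snaktype": "value",
                "property": "P456",
                "datavalue": {
                    "value": {
                        "time": "+2018-11-26T00:00:00Z",
                        "timezone": 0,
                        "before": 0,
                        "after": 0,
                        "precision": 11,
                        "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                    },
                    "type": "time"
                }
            },
            "type": "statement",
            "rank": "normal"
        }
    ]
}`

// createItem makes a new item and its claims in a single API round trip.
func createItem(client *http.Client, apiURL, editToken string) (*http.Response, error) {
    form := url.Values{}
    form.Set("action", "wbeditentity")
    form.Set("new", "item")      // create a new item rather than edit an existing one
    form.Set("data", itemData)   // labels and claims together in one request
    form.Set("token", editToken) // the CSRF/edit token from earlier
    form.Set("format", "json")
    return client.Post(apiURL, "application/x-www-form-urlencoded",
        strings.NewReader(form.Encode()))
}
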
I could go on, but you get the idea. To me, these limits imply a combination of human editing being the primary use case for Wikibase and little real thought being given to how bots will cope with it, coupled most likely with a volunteer labour force doing bits and pieces as time allows, without anyone in a position to enforce consistency across the API.

But there’s a more troubling consequence of some of these limits for a programmer: although Wikibase looks like a database, there’s no concept of transactions, so if I need to create a group of items together and fail half way through (say, due to a network error), well, tough luck. A real database will let you queue up a bunch of changes and guarantee they either all succeed or all fail, so it never ends up in a partially updated state, but there’s no such luck with Wikibase. It could be that this wasn’t the original intention, but given how people seem to use it now, with data items pointing to one another (relations, in database parlance), it does seem quite a limitation. It’s not just failure that’s an issue here: concurrent updates from multiple agents can also fight each other and leave the data in the wiki in an inconsistent state. Again, you can see this system was aimed at humans editing a knowledge base rather than at software, but it’s rare I hear talk of things like Wikipedia without bots being involved; at its core Wikipedia relies on an army of bots to police edits, revert malicious changes, and so on, so bots should be designed for more consistently (though it’s hard to get too angry given Wikipedia is not a well-funded enterprise; from a technical point of view, though, it’s far from optimal how it works today).


To help me abstract all this when working with Wikibase, I’ve written the beginnings of a pseudo-ORM that lets me create Wikibase entries in Go (my language of choice for things like this), which is now up on GitHub. An ORM, or object-relational mapper, is a common tool in the web world that lets you map objects in your program to entries in a database. My pseudo-ORM is there to let me work with traditional Go structures, pass those structures to my Wikibase library, and have it create all the correct items and properties in Wikibase. Because I’m working with Go structures, which you can readily serialise to JSON, I’ve designed it so that if you store any of these objects as JSON, all the Wikibase IDs of the items and properties are stored too, so you don’t have to rebuild your knowledge of where you put things in Wikibase each time you talk to it.

You can see this in the main ScienceSource ingest tool code base, but as a simple example it lets you define a data model as Go structs, store data in those structs, and then have them sync to Wikidata as items with the appropriate properties on them:

// Each field's property tag names the Wikibase property label it maps to.
type ExampleWikibaseItem struct {
    wikibase.ItemHeader // holds the Wikibase IDs once the item exists

    Name             string                 `property:"Name"`
    Birthday         time.Time              `property:"Date of birth"`
    NextOfKin        *wikibase.ItemProprety `property:"Next of kin,omitoncreate"`
    SkateboardsOwned int                    `property:"Skateboards owned"`
}

Having defined the above, you can then create and update Wikibase items in a more natural, ORM-like manner by calling create and update methods. Some of the complexity of how updates are batched still bleeds through to the developer though: the “omitoncreate” option above is really a hack to let the developer minimise the number of API calls made, given how expensive they are, but it’s at least a start in trying to rein in some of the complexity.
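
In rough terms, usage then looks something like the following sketch, which continues the struct above and assumes a configured client and the usual imports; treat the method names as illustrative rather than the library’s exact API.

// A simplified sketch; CreateItem and UpdateItem are illustrative names.
person := &ExampleWikibaseItem{
    Name:             "Ada Lovelace",
    Birthday:         time.Date(1815, time.December, 10, 0, 0, 0, 0, time.UTC),
    SkateboardsOwned: 0,
}

// Create the item and its properties in as few API calls as possible. The
// item ID assigned by Wikibase ends up in the embedded ItemHeader, so
// serialising the struct to JSON preserves it for later runs.
if err := client.CreateItem(person); err != nil {
    log.Fatal(err)
}

// Later changes go through an update call, which is where the
// one-request-per-property cost of the API shows up.
person.SkateboardsOwned = 2
if err := client.UpdateItem(person); err != nil {
    log.Fatal(err)
}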

Properties in Wikibase are normally identified by a unique ID like “P12”, which might stand for the “Name” property. However, these numbers are automatically assigned by the software and you can’t control how they’re allocated, which probably seemed very reasonable to the Wikibase creators; but if you then try to set up production, staging, and test servers, you have no real way of guaranteeing that the property IDs are in sync between instances. Thus my library identifies properties by label rather than ID, and requires you to go through a mapping stage before you use it (sketched below). This is why having a schema is nice in databases…
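
As a rough illustration of what that mapping involves under the hood, here’s one way to resolve a property label to its ID using the standard wbsearchentities module (the API URL is a placeholder, and the HTTP client is an authenticated one as before):

package example

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

// searchResponse models the part of a wbsearchentities reply we need.
type searchResponse struct {
    Search []struct {
        ID    string `json:"id"`
        Label string `json:"label"`
    } `json:"search"`
}

// propertyIDForLabel asks the wiki which property ID corresponds to a
// human-readable label, so the same code can run against production,
// staging, and test instances whose IDs have drifted apart.
func propertyIDForLabel(client *http.Client, apiURL, label string) (string, error) {
    query := url.Values{}
    query.Set("action", "wbsearchentities")
    query.Set("search", label)
    query.Set("language", "en")
    query.Set("type", "property")
    query.Set("format", "json")

    resp, err := client.Get(apiURL + "?" + query.Encode())
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var sr searchResponse
    if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
        return "", err
    }

    // The search is fuzzy, so insist on an exact label match.
    for _, hit := range sr.Search {
        if hit.Label == label {
            return hit.ID, nil
        }
    }
    return "", fmt.Errorf("no property found with label %q", label)
}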

At the moment the library is heavily biased towards creating and updating records, as that’s all I’ve needed to do so far and managing that has taken up most of my time. But if you’re reading this, you like using Go, and you need to work with Wikibase, please do consider trying it, and make pull requests if you need extra features. I’ll continue to update it as ScienceSource grows (I do think I’ll need more read support before this project is done), but anything that helps build a reliable abstraction over Wikibase and hides some of the oddities of using it is welcome.
