Planet RDF

It's triples all the way down

February 07

Libby Miller: Updated real_libby GPT-2 chatbot

real_libby, my retrained GPT-2 slack chatbot hosted on a raspberry pi 4, eventually corrupted her SD card. Then I couldn’t find her brain (I could only find business-slack real_libby, who is completely different). And so, since I was rebuilding her anyway, I thought I’d get her up to date with Covid and the rest of it.

For the fine-tuning data: since I made the first version in 2019 I’ve more or less stopped using irc ( :’-( ) and instead use Signal. I still use iMessage, and use Whatsapp more. I couldn’t figure out for a while how to get hold of my Signal data, so I first built an iMessage / Whatsapp version, as that’s pretty easy with my setup (details below – basically sqlite3 databases from an unencrypted backup of my iPhone). I had about 30K lines to retrain with, which I did using this as before, on my M1 macbook pro.

The text/whatsapp version uses exclamation marks too much and goes on about trains excessively. Not super interesting.

It is in fact possible to get Signal messages out as long as you use the desktop client, which I do (although it doesn’t transfer messages between clients, only ones received while that device was authorised). But I still had 5K lines to play with.

I think Signal-libby is more interesting, though she also seems closer to the source crawl, so I’m more nervous about letting her loose. But she’s not said anything bad so far.

Details below for the curious. It’s much like my previous attempt but there were a few fiddly bits.

The Signal version is a bit more apt, I think, and says longer and more complex things.

She makes up urls quite a bit; Signal’s where I share links most often.

Maybe I’ll try a combo version next, see if there’s any improvement.

Getting data

A note on getting data from your phone – unencrypted backups are bad. Most of your data is just there lying about, a bit obfuscated, but trivially easy to get at. The commands below just get out your own data. Baskup helps you get out more.

iMessages are in

/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/[short number]/3d0d7e5fb2ce288813306e4d4636395e047a3d28

sqlite3 3d0d7e5fb2ce288813306e4d4636395e047a3d28
.once libby-imessage.txt
select text from message where is_from_me = 1 and text not like 'Liked%';
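For illustration, the same filter can be run from Python’s sqlite3 module. The sketch below runs it against a tiny in-memory stand-in for the `message` table (the real backup database has many more columns – only `text` and `is_from_me` matter for this query):

```python
import sqlite3

# A tiny in-memory stand-in for the iMessage backup database,
# using just the two columns the query above relies on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (text TEXT, is_from_me INTEGER)")
conn.executemany("INSERT INTO message VALUES (?, ?)", [
    ("hello from me", 1),
    ("Liked \u201chello\u201d", 1),       # tapback reactions start with 'Liked'
    ("a message from someone else", 0),
])

# Same filter as the sqlite3 CLI query: my own messages, minus tapbacks.
rows = conn.execute(
    "SELECT text FROM message WHERE is_from_me = 1 AND text NOT LIKE 'Liked%'"
).fetchall()
lines = [r[0] for r in rows]
print(lines)  # ['hello from me']
```

Against the real backup file you’d pass its path to `sqlite3.connect` instead of `":memory:"`.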

Whatsapp are in

/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/[short number]/7c7fba66680ef796b916b067077cc246adacf01d

sqlite3 7c7fba66680ef796b916b067077cc246adacf01d
.once libby-whatsapp.txt

Signal’s desktop backups are encrypted, so you need to use this, which I could only get to work using docker. Signal doesn’t back up from your phone.

Tweaks for finetuning on a M1 mac

git clone
git checkout finetuning
cd gpt-2
mkdir data
mv libby*.txt data/
pip3 install -r requirements.txt
python3 ./ 117M
pip3 install tensorflow-macos # for the M1
PYTHONPATH=src ./ --model_name=117M --dataset data/

tensorflow-macos is tf 2, but that seems ok, even though I only run tf 1.13 on the Pi.

rename the model and get the bits you need from the initial model

cp -r checkpoint/run1 models/libby
cp models/117M/{encoder.json,hparams.json,vocab.bpe} models/libby/

Pi 4

The only new part on the Pi 4 was that I had to install a specific version of numpy – the rest is the same as my original instructions here.

pip3 install flask numpy==1.20
curl -O
pip3 install tensorflow-1.13.1-cp37-none-linux_armv7l.whl

Posted at 16:10

Libby Miller: Interactive Flowerbots and legobots with a bit of machine vision

I made a bunch of robots recently, building on some of the ideas from libbybot and as an excuse to play with some esp32 cams that Tom introduced me to in the midst of his Goblin Frenzy.

The esp32 cams are kind of amazing and very, very cheap. They have a few quirks, but once you get them going it’s easy to do a bit of machine vision with them. In my case, being new to this kind of thing, I used previous work by Richard to find the weight of change between two consecutive images. Dirk has since made better versions using a PID library. I made some felt-faced ones using meccano, and then some lego ones using some nice lego-compatible servos from Pimoroni. Because esp32s can run their own webserver, you can use websockets or mjpeg to debug them and see what they are doing.
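For the curious, here’s a toy Python version of that change-weight idea (the real thing runs as C++ on the esp32 – the function name and the normalisation are mine, just to show the principle):

```python
# Treat two consecutive grayscale frames as flat lists of pixel
# values (0-255) and score how much changed between them.

def change_weight(prev, curr):
    """Sum of absolute per-pixel differences, normalised to 0..1."""
    assert len(prev) == len(curr)
    total = sum(abs(a - b) for a, b in zip(prev, curr))
    return total / (255 * len(prev))

# Identical frames -> no change at all.
print(change_weight([10, 20, 30], [10, 20, 30]))  # 0.0

# One of three pixels flips fully -> a third of the maximum change.
print(change_weight([0, 0, 0], [255, 0, 0]))
```

On the microcontroller you’d compare the camera’s current frame buffer against the previous one in the same way, and steer the servos when the weight crosses a threshold.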

The code’s here (here’s Dirk’s), and below are a few pictures. There are a couple of videos in the github repo.

Posted at 16:10

Libby Miller: Time squish

I keep seeing these two odd time effects in my life and wondering if they are connected.

The first is that my work-life has become either extremely intense – and I don’t mean long hours, I mean intense brainwork for maybe a week – that wipes me out – and then the next is inevitably slower and less intense. Basically everything gets bunched up together. I feel like this has something to do with everyone working from home, but I’m not really sure how to explain it (though it reminds me of my time at Joost where we’d have an intense series of meetings with everyone together every few months, because we were distributed. But this type is not organised, it just happens). My partner pointed out that this might simply be poor planning on my part (thanks! I’m quite good at planning actually).

The second is something we’ve noticed at the Cube – people are not committing to doing stuff (coming to an event, volunteering etc) until very close to the event. Something like 20-30% of our tickets for gigs are being sold the day before or on the day. I don’t think it’s people waiting for something better. I wonder if it’s Covid-related uncertainty? (also 10-15% don’t turn up, not sure if that’s relevant).

Anyone else seeing this type of thing?

Posted at 16:10

Libby Miller: Sparkfun Edge, MacOS X, FTDI

More for my reference than anything else. I’ve been trying to get the toolchain set up to use a Sparkfun Edge. I had the Edge, the Beefy3 FTDI breakout, and a working USB cable.

Blurry pic of cats taken using Sparkfun Edge and HIMAX camera

The speech example worked great for me (although the actual tensorflow part never understands my “yes”, “no” etc. – but anyway, I was able to upload it successfully)

$ git clone --depth 1
$ cd tensorflow
$ gmake -f tensorflow/lite/micro/tools/make/Makefile TARGET=sparkfun_edge micro_speech_bin
$ cp tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/ tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/ --bin tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4_micro/bin/micro_speech.bin --load-address 0xC000 --magic-num 0xCB -o main_nonsecure_ota --version 0x0
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/ --load-address 0x20000 --bin main_nonsecure_ota.bin -i 6 -o main_nonsecure_wire --options 0x1
$ export BAUD_RATE=921600
$ export DEVICENAME=/dev/cu.usbserial-DN06A1HD
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/ -b ${BAUD_RATE} ${DEVICENAME} -r 1 -f main_nonsecure_wire.bin -i 6

But then I couldn’t figure out how to generalise it to use other examples – I wanted to use the camera because ages ago I bought a load of tiny cameras to use with the Edge.

So I tried this guide, but couldn’t figure out where the installer had put the compiler. Seems basic, but…??

So in the end I used the first instructions to download the tools, and then the second to actually do the compilation and installation on the board.

$ find . | grep lis2dh12_accelerometer_uart
# you might need this - 
# mv tools/apollo3_scripts/ tools/apollo3_scripts/ 
$ cd ./tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/boards_sfe/edge/examples/lis2dh12_accelerometer_uart/gcc/
$ export PATH="/Users/libbym/personal/mayke2021/tensorflow/tensorflow/lite/micro/tools/make/downloads/gcc_embedded/bin/:$PATH"
$ make clean
$ make COM_PORT=/dev/cu.usbserial-DN06A1HD bootload_asb ASB_UPLOAD_BAUD=921600

etc. Your COM port will be different, find it using

ls /dev/cu*

If like me the FTDI serial port KEEPS VANISHING ARGH – this may help (I’d installed 3rd-party FTDI drivers ages ago and they were conflicting with Apple’s ones. Maybe. Or the reboot fixed it. No idea).

Then you have to use a serial programme to get the image. I used the arduino serial monitor since it was there, and then copied and pasted the output into a textfile, at which point you can use


to convert it to a png. Palavers.

Posted at 16:10

Libby Miller: Sock-puppet – an improved, simpler presence robot

Makevember and lockdown have encouraged me to make an improved version of libbybot, which is a physical version of a person for remote participation. I’m trying to think of a better name – she’s not all about representing me, obviously, but anyone who can’t be somewhere but wants to participate. [update Jan 15: she’s now called “sock_puppet”].

This one is much, much simpler to make, thanks to the addition of a pan-tilt hat and a simpler body. It’s also more expressive thanks to these lovely little 5*5 led matrixes.

Her main feature is that – using a laptop or phone – you can see, hear and speak to people in a different physical place to you. I used to use a version of this at work to be in meetings when I was the only remote participant. That’s not much use now of course. But perhaps in the future it might make sense for some people to be remote and some present.

New recent features:

  • easy to make*
  • wears clothes**
  • googly eyes
  • expressive mouth (moves when the remote participant is speaking, can be happy, sad, etc, whatever can be expressed in 25 pixels)
  • can be “told” wifi details using QR codes
  • can move her head a bit (up / down / left / right)

* ish
**a sock
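As an aside, 25 pixels really is enough for a face. A hypothetical Python sketch of how an expression might be stored and flattened into the pixel list a 5*5 matrix library typically expects (the bitmap and names here are mine, not from the repo):

```python
# One expression as 5 rows of '0'/'1' characters; 1 = lit pixel.
HAPPY = [
    "00000",
    "10001",
    "10001",
    "01010",
    "00100",
]

def flatten(rows):
    """Turn 5 strings of '0'/'1' into a flat list of 25 ints."""
    return [int(c) for row in rows for c in row]

pixels = flatten(HAPPY)
print(len(pixels), sum(pixels))  # 25 pixels, 7 of them lit
```

Sad, talking, etc. would just be more bitmaps in the same shape, swapped in as the remote participant speaks.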

I’m still writing docs, but the repo is here.

Libbybot-lite – portrait by Damian

Posted at 16:10

Libby Miller: MyNatureWatch with a High Quality Raspberry Pi camera

I’ve been using MyNatureWatch setup on my bird table for ages now, and I really love it (you should try it). The standard setup is with a pi zero (though it works fine with other versions of the Pi too). I’ve used the recommended, very cheap, pi zero camera with it, and also the usual pi camera (you can fit it to a zero using a special cable). I got myself one of the newish high quality Pi cameras (you need a lens too, I got this one) to see if I could get some better pics.

I could!

Pigeon portrait using the Pi HQ camera with wide angle lens

I was asked on twitter how easy it is to set up with the HQ camera, so here are some quick notes on how I did it. Short answer – if you use a recent version of the MyNatureWatch downloadable image, it works just fine with no changes. If you are on the older version, you need to upgrade it, which is a bit fiddly because of the way it works (it creates its own wifi access point that you can connect to, so it’s never usually online). It’s perfectly doable with some fiddling, but you need to share your laptop’s network and use ssh.

Blackbird feeding its young, somewhat out of focus

Update (May 2022) – I’d just suggest using a newish release of MyNatureWatch, which works perfectly.

MyNatureWatch Beta – this is much the easiest option. The beta is downloadable here (more details) and has some cool new features such as video. Just install as usual and connect the HQ camera using the zero cable (you’ll have to buy this separately – the HQ camera comes with an ordinary cable). It is a beta, and I had a networking problem with it the first time I installed it (the second time it was fine). You could always put it on a new SD card if you don’t want to blat a working installation. Pimoroni have 32GB cards for £9.

The only fiddly bit after that is adjusting the focus. If you are not used to it, the high quality camera CCTV lens is a bit confusing, but it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).

MyNatureWatch older version – to make this work with the HQ camera you’ll need to be comfortable with sharing your computer’s network over USB, and with using ssh. Download the img here, and install on an SD card as usual. Then, connect the camera to the zero using the zero cable (we’ll need it connected to check things are working).

Next, share your network with the Pi. On a mac it’s like this:

Sharing network using system preferences on a Mac

You might not have the RNDIS/Ethernet gadget option there on yours – I just ticked all of them the first time and *handwave* it worked after a couple of tries.

Now connect your zero to your laptop using the zero’s USB port (not its power port) – we’re going to be using the zero as a gadget (which the MyNatureWatch people have already kindly set up for you).

Once it’s powered up as usual, use ssh to login to the pi, like this:

ssh pi@camera.local
password: badgersandfoxes

On a mac, you can always ssh in but can’t necessarily reach the internet from the device. Test that the internet works like this:


This sort of thing means it’s working:

PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=116 time=19.5 ms
64 bytes from ( icmp_seq=2 ttl=116 time=19.6 ms

If it just hangs, try unplugging the zero and trying again. I’ve no idea why it works sometimes and not others.

Once you have it working, stop mynaturewatch using the camera temporarily:

sudo systemctl stop nwcameraserver.service

and try taking a picture:

raspistill -o tmp.jpg

you should get this error:

mmal: Cannot read camera info, keeping the defaults for OV5647
mmal: mmal_vc_component_create: failed to create component '' (1:ENOMEM)
mmal: mmal_component_create_core: could not create component '' (1)
mmal: Failed to create camera component
mmal: main: Failed to create camera component
mmal: Camera is not detected. Please check carefully the camera module is installed correctly

Ok so now upgrade:

sudo apt-get update
sudo apt-get upgrade

you will get a warning about hostapd – press q when you see this. The whole upgrade took about 20 minutes for me.

When it’s done, reboot

sudo reboot

ssh in again, and test again if you want

sudo systemctl stop nwcameraserver.service
raspistill -o tmp.jpg

reenable hostapd:

sudo systemctl unmask hostapd.service
sudo systemctl enable hostapd.service

reboot again, and then you should be able to use it as usual (i.e. connect to its own wifi access point etc).

The only fiddly bit after that is adjusting the focus. I used a gnome for that, but still sometimes get it wrong. If you are not used to it, the high quality camera CCTV lens is a bit confusing – it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).

A gnome

Here’s a few more pictures from the camera.

Posted at 16:10

Libby Miller: Zoom on a Pi 4 (4GB)

It works using chromium, not the Zoom app (which only runs on x86, not ARM). I tested it with a two-person, two-video-stream call. You need a screen (I happened to have a spare 7″ touchscreen). You also need a keyboard for the initial setup, and a mouse if you don’t have a touchscreen.

The really nice thing is that Video4Linux (bcm2835-v4l2) support has improved so it works with both v1 and v2 raspi cameras, and no need for options bcm2835-v4l2 gst_v4l2src_is_broken=1 🎉🎉



  • Install Raspbian Buster
  • Connect the screen keyboard, mouse, camera and speaker/mic. I used a Sennheiser usb speaker / mic, and a standard 2.1 Raspberry pi camera.
  • Boot up. I had to add lcd_rotate=2 in /boot/config.txt for my screen to rotate it 180 degrees.
  • Don’t forget to enable the camera in raspi-config
  • Enable bcm2835-v4l2 – add it to sudo nano /etc/modules
  • I increased swapsize using sudo nano /etc/dphys-swapfile -> CONF_SWAPSIZE=2000 -> sudo /etc/init.d/dphys-swapfile restart
  • I increased GPU memory using sudo nano /boot/config.txt -> gpu_mem=512
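The module, swap and GPU-memory bullets can be consolidated into a few commands – a rough sketch assuming Raspbian Buster’s default file locations, not a tested script:

```shell
# Enable the V4L2 camera driver at boot
echo 'bcm2835-v4l2' | sudo tee -a /etc/modules

# Increase swap to 2000MB, then restart the swap service
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=2000/' /etc/dphys-swapfile
sudo /etc/init.d/dphys-swapfile restart

# Give the GPU 512MB (takes effect after a reboot)
echo 'gpu_mem=512' | sudo tee -a /boot/config.txt
sudo reboot
```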

You’ll need to set up Zoom and pass captchas using the keyboard and mouse. Once you have logged into Zoom, you can often ssh in and start it remotely like this:

export DISPLAY=:0.0
/usr/bin/chromium-browser --kiosk --disable-infobars --disable-session-crashed-bubble --no-first-run

Note the url format – this is what you get when you click “join from my browser”. If you use the standard Zoom url you’ll need to click this url yourself, ignoring the Open xdg-open prompts.


You’ll still need to select the audio and start the video, including allowing it in the browser. You might need to select the correct audio and video, but I didn’t need to.

I experimented a bit with an ancient logitech webcam-speaker-mic and the speaker-mic part worked and video started but stalled – which made me think that a better / more recent webcam might just work.

Posted at 16:10

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as in Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom found.

Different reasons might exist for why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is: do systems support this (transparently), or are developers forced to code it in the application logic?

At the IaaS level – especially in this category, file storage for app development – one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3) and concerning Google’s App Engine, good practices for data encryption only seem to emerge.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive seem to not offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.

Posted at 16:10

Leigh Dodds: Reflecting on 2022

I’ve decided to keep doing the annual end of the year reflections, iterating further on the structure I used in 2021 and 2020.

What follows is a mix of personal reflections on the year, as well as brief lists of the things I’ve enjoyed watching, reading and playing.


This year I’ve split my time between working as CTO at Energy Sparks, and continuing to do some freelancing.

Energy Sparks

I’m really enjoying this role. It’s great to be doing technical and product work again. And it’s been an interesting year for both myself and the charity.

When I started in the role in mid-2020 I focused on ensuring I understood the application from an architectural and operational point of view. While we use freelance developers, including the very capable Julian Higman, I was doing not just the majority of the development work, but also had sole responsibility for keeping everything up and running.

The true test came when I had to take us through a full platform upgrade which completed in January 2022. It all went relatively smoothly, so I’m pleased that I prioritised investing time in the right areas.

This year, I’ve been starting to develop a deeper understanding of the energy analysis side of Energy Sparks. To date, most of this has been the responsibility of a single analyst/developer who has now largely left the organisation. It’s a big hole to fill even though we’re using him on a freelance basis to help with knowledge transfer.

It’s been really interesting learning more about the UK’s energy data infrastructure as well as energy data analysis in general. But I’ve still got a long way to go here.

This year we managed to win some significant funding, including:

A video about Energy Sparks in which I’m interviewed

We’re ending 2022 with over 641 schools on Energy Sparks which is more than double where we were last year. This is a great achievement and, as you might imagine, ensuring that the service scales as we grow has been at the forefront of my mind. I planned out a series of improvements early in the year which we’ve now delivered.

To extend the service to this many schools, we needed to grow the team. We’re now a team of eight, with a number of freelancers and partners supporting our work. I spent a lot of time in 2022 interviewing for new roles. And I now have a small team of developers (Deb Bassett, IanT).

It’s nice to be back leading a team again. But, even with a small team this means doing less hands-on work and focusing more on enabling others. I’m enjoying it.

Next year will be a big year for us as we’ve got several projects to land and we need to show that we’re having a real impact. If so, then we should be able to get some additional funding. Fingers crossed!


I wrote a summary of my recent freelance work back in October, so I’ll link to that as a summary of my recent projects.

Since then I’ve started a second project with CABI, supporting a team that are providing advice about publishing FAIR data to a number of Gates Foundation projects.

As I wrote in October, I’m enjoying being able to use the experience I’ve developed around open data and data infrastructure to support and mentor teams who are building data infrastructure and ecosystems. Hopefully I can do more of that next year.

I’ve got availability for freelance work starting in February, if you need some help?

Work feedback

I’ve had two occasions this year when I’ve had unsolicited feedback on my management style. The first from a former colleague at the Open Data Institute, the second from a recent joiner to the Energy Sparks team.

The feedback was really positive, totally out of the blue, and really caught me off guard.

If I’m honest, things didn’t end particularly well at the Open Data Institute. I left feeling pretty deflated. It wasn’t clear that I fitted into the organisation and I ended my time there — which spanned most of its first 10 years — doubting my abilities as a senior leader, questioning whether I’d had any impact (on the organisation or elsewhere) and with a massive boost to my imposter syndrome.

This feedback has really helped me work through a lot of that. And I’m feeling more confident that I’m doing the right things in my current role.

I won’t share any more than that, but I’m going to try and pay things forward and do the same for other people.


My github contribution graph for 2022

Unsurprisingly, I’ve written a lot of code this year.

At the start of the year I wanted to try and do some more creative coding. I don’t think I really achieved that, but there are a few things I’m pleased with:

The last one of those hit the Hacker News front page for a couple of days, so got a lot of traffic. That was fun.

I’ve got a couple of other projects I’ve been tinkering with this year. One of these was a Twitter bot, but that’s going to move to Mastodon now.


A view across the fields in Newton St. Loe

I didn’t manage to start running again in 2022. I’m really not sure why I stopped because I was really enjoying it.

Actually I do: it got cold and wet, and I’d mostly hit my weight loss target. So I just let it taper off. Let’s try again next year.

What I did do in 2022 was a lot of walking. Four of us fell into a regular schedule of weekend walks through the countryside which I really, really loved. I always forget how much being outdoors lifts my mood.

We’ll be doing a lot more of that in 2023.


As before I’ve been tweeting what I’ve been cooking in 2022. Although I largely stopped doing this in November after I shifted over to Mastodon.

I still bookmark recipes here.

Of the recipes and cocktails I logged, they break down as follows:

This year was mostly Sichuan dishes as I had a new recipe book. I also continued my love affair with the dirty martini.

I usually try out a new recipe on a Saturday night, along with a cocktail or two. But I’ve ended up cooking less this year as we shifted up our Saturday night routine. We now frequently nip over to visit a friend in Bradford-on-Avon on a Saturday night, so have ended up taking turns with the cooking.


I’ve already published a blog post with my gardening retro for 2022. Beans for the win.

These aren’t beans. They’re potatoes


At the beginning of 2022 I jumped from Spotify to Tidal. This was for several reasons.

Spotify ditched the subscription tier that I’d been on pretty much since they launched. This was heavily discounted, so my monthly cost was going to go up. I was also increasingly frustrated with them constantly pushing podcasts. I just don’t get on with podcasts. The whole Alex Jones/Infowars thing was the icing on the cake.

Once I realised that Tidal was a serious alternative — same price, better cut for artists, same coverage and no podcasts — it was a no-brainer to switch over. I’ve not had any issues with the service at all.

It’s a shame that they don’t do an end of year wrap-up. They do give you a playlist of your most listened tracks, though. So here’s my Most Listened 2022. There’s a lot of Wet Leg, Moderat, µ-Ziq, Rival Consoles and Cymande on there.

I’ve also kept up my habit of creating a playlist of “tracks that I loved on first listen, which were released this year“. Here’s my 2022 Tracked playlist.

It contains 176 tracks totalling 13 hours, 56 minutes and 44 seconds of music. The tracks are in order of when I heard them.


I’m continuing to publicly bookmark the articles and papers I’ve read. And I’m now using StoryGraph to log my reading. You can follow me there if you’re interested.

I’ve been through and imported the last few years of reading data I’d collected in a custom spreadsheet. As a service it’s got some limitations and rough edges, but I’m finding it useful so far. StoryGraph tells me that, as of this morning, I’ve read 106 books, which totals 23,238 pages.

I’ve kept up the habit of having one comic book, one novel and one non-fiction book on the go. I’ve read a lot more comics than anything else.

That big dip in reading from March to May was due to Elden Ring.


My favourites this year were:


My favourites this year have been:

Affinities is just lovely. I’m a big fan of the Public Domain Review, so it was nice to have a print copy of so many gorgeous images. It led me down a lot of interesting rabbit holes, one of which ended up with me reading Cartographies of Time, which is also fascinating.


I’m still working my way through a lot of big Humble Bundle collections I picked up in the last few years, but am increasingly buying new stuff via the Comixology app (which is terrible).

My favourites this year have been:

Notable mentions go to the G. Willow Wilson Ms Marvel books, the Dan Slott She-Hulk collection, and the first two volumes of the latest Swamp Thing reboot by Ram V.

I read the entire run of East of West, which was…OK? Gave up on Casanova as an incoherent mess.


Managed to finally get myself a PS5 this year, so most of my gaming time has been spent on that rather than the PC. I’ve also now got a Playdate, which is a gorgeous little device. I want to build something for it.

I’m ldodds on both Steam and PSN if you want to add me there.


Haven’t played many board games this year. Although I did pick up a copy of Quacks of Quedlinburg which has become an instant family favourite.

Most of my table-top (“zoom top”?) gaming this year has been:

  • A campaign of Masks which I’ve been running for the last 2 years and which is about to come to a close. I’ve had a lot of fun running it, but think I’m still feeling my way with the system a bit
  • Playing two stories of Good Society. The first using the core ruleset, the second with the “Downstairs at the Abbey” ruleset from the expansion which took on a Lovecraftian tone
  • Brindlewood Bay, which is simply brilliant

I’m really enjoying playing TTRPGs again. The new rulesets are so story focused and so accessible to newcomers, that I’m not sure I’d ever want to play something like D&D again.

I’ve also restarted the adjacent hobby of collecting RPG rulebooks. So I’ve got a growing mass of PDFs and Hardbacks, most of which I probably won’t end up playing, but who cares?!


Favourites this year:

  • Elden Ring, continuing my love affair with FromSoftware
  • Pentiment, which was beautiful and engrossing
  • Deathloop. I loved the Dishonored games and this is just more of that. But with guns and 60s-70s stylings

Only three as I didn’t play many games. I sank a lot of hours into Elden Ring.

I really enjoyed Returnal, but the grind became too frustrating and I put it down. I’ve also been playing a bit of Darkest Dungeon again whilst waiting for the full release of Darkest Dungeon 2.


I started using Letterboxd this year to record the films I’ve been watching. And by “started using” I mean:

  • worked through my entire twitter archive to log dates I watched films over the past few years
  • mined my email archives for cinema tickets, to do the same over a longer period
  • worked through long lists of films, actors and directors to at least log that I’ve watched a film sometime in the last 30-40 years, even if I don’t have a date

It’s not comprehensive, obviously, but I’ve now logged 1693 films.

This year I watched 84 films, 15 of which I’d watched before.

A bar chart of films I watched in 2022

I watched a lot of films in February. I had the half-term week as a holiday and everyone in the house came down with Covid. So I just watched films whilst the rest of the family were in solitary confinement.

Not all of my film viewing has been sofa based. I managed to get out to the cinema a few times this year (Nope, Bullet Train, Everything Everywhere All At Once). And I also went to the Forbidden Worlds Film Festivals which are my new favourite events.


My favourite TV series this year were:

  • Andor
  • The Peripheral
  • Tokyo Vice
  • Sandman
  • She Hulk / Paper Girls

No real surprises in that list. Although I don’t think I’ve seen many people talking about the Paper Girls adaptation. I thought it was brilliant, so it’s tied with She Hulk for me, which was clever and funny, but uneven in places.

I should mention that Masterchef remains one of my favourite programmes ever. I don’t really watch reality shows, but I’m always glued to both the amateur and professional series. We did also watch Bake-Off as a family this year. It’s an excuse to bake something for every episode.

I also somehow ended up watching Mortimer & Whitehouse: Gone Fishing this year, despite not being interested in fishing. But at 50 and a half, I guess I’m in the demographic now? It’s a bit too melancholy at times though!

Years after everyone else I also watched all of Succession. 😲 is all I can say.


Unlikely to be many surprises here either, but:

  • Nope
  • Everything Everywhere All At Once
  • Uncut Gems
  • The French Dispatch
  • Boiling Point

Special mention for Unbearable Weight of Massive Talent which was hilarious. And Barbarian which was bonkers.


Favourite channels:

  • Zemalf – thoughtful playthroughs of strategy games. The only channel I’ve ever subscribed to on Twitch and I often tune into the live streams. There’s a friendly community of old gamers there and in the Discord
  • Decino. A doomtuber.
  • StezStixFix. Apparently I like watching someone repair things?
  • 야미보이 Yummyboy. And I also like watching people make street food?


I wrote 33 blog posts in 2022, totalling 23,370 words.

The three new articles that got the most views were:

The three articles across the entire history of my blog that got the most views were:

These are all really low view counts in the scheme of things. But I’m not writing for the views.

I’ve been noodling on a couple of writing projects this year which I’m hoping to make some proper headway on next year.

Everything Else

What about everything else?

There’s still a lot going on at home that I don’t want to write about in detail here.

Watching and supporting my kids as they try to become the people they want to be remains the most challenging and rewarding thing I’ve ever done. I just didn’t realise it would still be this hard. I have to remind myself that parenting is a rollercoaster. There’s no downhill: just surprising twists and turns.

I jumped to Mastodon in November, as part of the big Twitter migration. Although to be honest, I had been feeling really disconnected and frustrated with Twitter for some time. Mastodon won’t solve all of that, but it’s less angry and political which is partly what I needed.

My twitter posts are now limited to auto-posts from this blog.

What I’ve been wrestling with this year might be summed up as this: “who are my community, and how do I connect with them?”.

I wasn’t feeling a sense of community on Twitter. Mastodon might offer something different, but I don’t think it will. It’s still social media. I think I need to find other ways of connecting, both online and in person.

I’m using DuoLingo to learn Welsh. Two hundred day streak at the moment.

I’ve still not had Covid. That’s good.

Posted at 16:05

Leigh Dodds: Round up of some current energy sector data infrastructure projects

Now that I work in the energy sector I’m trying to pay closer attention to how the data infrastructure in that area is evolving.

Here’s a round up of some current and recent projects that I’ve been keeping an eye on. Along with some thoughts on their scope, overlaps and potential outcomes.

Ofgem review of data best practices

In 2021 Ofgem published a set of principles, called the “Data Best Practice” guidance. They are a set of eleven principles intended to guide organisations in the energy sector towards publishing more open data.

It encourages a “presumed open” approach and recommends the types of general best practice that you can see in other principles, e.g. FAIR.

The principles are binding for a small number of organisations (those operating the UK’s energy networks) but are otherwise voluntary.

Ofgem recently asked for feedback on the principles, with comments closing at the end of October. The feedback is phrased as a kind of retrospective: what’s going well, what could be improved, what would encourage organisations to adopt them, etc. There’s one specific question about whether providing more concrete guidance on data formats would be helpful.

It will be interesting to see what kind of feedback Ofgem have received. My hope is that it will prompt:

  • An expansion of which organisations should be following the principles. I see no reason why they shouldn’t apply to every licensed organisation in the sector
  • Specific requirements for organisations to publish specific datasets, in well-defined formats, rather than encouraging a “presumed open” approach and then leaving them to work through the details on what can be published and how
  • A proactive regulatory approach to assessing whether organisations are actually applying the principles as expected, e.g. what data should we now expect to see under shared and open licences?

I’ve done quite a bit of work in the last few years supporting organisations in adopting the FAIR data principles. What I’ve learned is that while principles can provide a good basis for building a shared vision, they always need to be supported by specific actionable guidance.

You cannot assume that everyone knows how to put those principles into practice.

Without detailed, sector- and dataset-specific guidance — use these formats, this metadata, this data should be open, while this data should be shared — people are just left to work through all of the details for themselves. This creates friction. And that friction results in data being published badly, and plenty of room for excuses and uncertainty that leaves data not being published at all.

Digital spine feasibility study

In January 2022 the Energy Digitalisation Taskforce published a report that provided a set of recommendations aimed at creating a digitalised Net Zero energy system.

The report includes a number of recommendations aimed at improving the data infrastructure in the sector. Including creating a “data sharing fabric”, an energy asset register, a data catalogue, improving data standards and creating a “digital spine”.

Unfortunately the report doesn’t clearly define what it means by a “fabric” or a “spine”, just that the former is intended to support data sharing and the latter to improve interoperability. In practice there’s a lot of different ways this technical infrastructure might be delivered.

The spine appears like it’s intended to be middleware that sits between individual organisations and the broader energy network making it easier to expose data in standard formats.

This looks to be different to, for example, the NHS spine, which fulfils a similar role but includes some centrally coordinated services to ensure connectivity and interoperability across the sector. I’ve not been able to find examples of “spines” in other sectors.

BEIS have commissioned a study to determine “the needs case, benefits, scope and costs of an energy system ‘digital spine’”. That procurement completed at the end of November, but I don’t believe the winners have been announced yet.

Unfortunately, from the outside, it feels a bit like a broad solution has been proposed (open-source middleware) which the procurement is then focused on scoping, rather than starting from a broader user research question of “what is required to improve interoperability in the energy sector?” and then identifying the most useful interventions.

For example, I’ve previously written about some low-hanging fruit that would increase interoperability of half-hourly meter data.

An open source middleware layer might be a useful intervention, but there’s risks that other necessary work is overlooked. For example, if an open source middleware is going to convert data into standardised forms, then what do those look like? Do they need to be developed first?

Does the UK energy sector even have a good track record of creating and adopting open source infrastructure? Or is there ground work required to build that acceptance and capacity first?

Maybe all this was covered in the research behind the Taskforce report that recommended the spine, but it’s not clear at this point. It will be interesting to see the results of the study.

Smart Meter Energy Data Repository Programme

The Smart Meter Energy Data Repository (SEDR) programme is intended to “determine the technical and commercial feasibility of a smart meter energy data repository, quantify the benefits and costs of such a smart meter energy data repository, and simulate how it could work”.

The procurement for this piece of work closed in July, but again I don’t think the winner has been announced. I’ve been interviewed by them as part of their user research, so I know the project is up and running!

Update, 3rd January 2023: the three funded projects have been announced.

The current smart meter network has at most 2 years of data distributed across every smart meter in the UK. Access to data requires querying individual meters by sending them messages.

Would it be useful to have a single repository of data, making it easier to query and work with? What types of applications would benefit from that data infrastructure? What are the privacy implications? What would the technical infrastructure look like? These are all questions that will be considered in this project.

I’ve written before about why the UK energy sector should learn from open banking and just develop standardised APIs to provide a high-level interface to the smart meter network.

I’ve also written about the problems of trying to access half-hourly data for the non-domestic market.

I won’t repeat that here, but will note a few things that I raised in the user research:

  • People are very nervous about the use of smart meter data. For example, see the recent row over the government plans to harvest this data for the Energy Price Guarantee scheme. The current data infrastructure was designed not to have a single large dataset. There are some big trust, privacy and security issues to navigate here
  • The focus should be on improving access to half-hourly data, not just smart meter data. The smart meter roll-out is not finished and smart meters are not mandated (and in some cases not suitable) for non-domestic customers. There’s nothing specific about the data from smart meters that makes it different to process, store or analyse than other sources of half-hourly data.
  • Non-domestic use cases need to be considered too
  • Would running this repository be part of the DCC’s remit, or do we need another data institution with responsibility for managing that infrastructure?
  • Ofgem previously consulted around the benefits of providing a repository of half-hourly data, citing lack of innovation and lack of support for data sharing as a rationale. This highlighted some potential benefits, some resistance in the market but creating it was ultimately kicked down the line. There are things to learn from that project.
  • Providing access to a repository of data could be useful to support prototyping and development of new services. That type of repository might hold synthetic or anonymised data, rather than live, recent data. But there needs to be a pathway to allow developers to move from this type of infrastructure to integrating with the live system. At the moment there aren’t many “DCC Other Providers” offering higher level APIs and these aren’t standardised
  • If the repository did provide access to live data, there’s still a decision to be made about how much is stored. E.g. is data only stored for a limited time, or will it be an historical archive too? These enable different use cases and have different scalability, sustainability and security issues
  • When it comes to analysing smart meter data, you need more than just the half-hourly readings. You need a range of other sources: (historic) tariff information, descriptions of alternative tariffs (to support switching use cases), links between a customer/property and the meters, geospatial data, weather data, etc. Not all of this exists in shared or open formats in the existing infrastructure. How much of this would be part of a planned repository?

This is potentially a critical piece of infrastructure, so it needs some careful planning and execution.

Smart Meter System based Internet of Things applications programme

Another current BEIS programme is the Smart Meter System based Internet of Things applications programme.

This one is looking at whether it is possible to use the existing Smart Meter communications network and infrastructure run by the DCC to support other IoT applications. For example, to support monitoring of “smart buildings” or other parts of the energy data system.

Update, 3rd January 2023: the three funded projects have been announced.

This seems like a sensible approach, as we don’t necessarily need separate infrastructure for what might be very similar requirements.

But the Smart Meter Energy Data Repository project shows that the current infrastructure is not meeting existing needs, so it’s reasonable to assume that there will be additional requirements for these IoT use cases too. Hopefully considerations of these other use cases are at least on the radar of the SEDR review as they might offer additional insights.

As you can see there’s a lot happening around the UK’s energy data infrastructure. I’ve not even touched on the work of Icebreaker One or the SmartDCC “Data for Good” Project.


Leigh Dodds: Gardening Retro 2022

I’ve written a retrospective about growing vegetables for each of the last two years (2020, 2021) so I’m going to keep going. It’s useful to plan ahead. And it’s nice to think about the spring and summer when it’s so cold and dark outside!

What did I set out to do this year?

My goals for this year were to:

  • Rotate what I was growing across each of the vegetable patches
  • Try out companion planting
  • Look at ways to improve the soil
  • Look into replacing my slowly rotting raised beds
  • Grow more chillies
  • Not get distracted by bees

As always, I did some but not all of these.

What changes did I make?

Cucumber plants growing up a raised frame

Unlike last year I didn’t add any new growing areas or buy new pots. This year was mostly about using the space better.

For example, as shown in the photo above, I tried growing the cucumber plants up a raised frame rather than letting them sprawl all over the beds and path. This freed up a lot of space and I was even able to grow some spring onions and lettuces under the frame. They cropped before the frame was completely overgrown.

I rotated the crops through the beds

I planted more densely across all the beds, and interleaved smaller, faster growing crops (radishes, lettuce, spring onions) amongst slower growing veg (potatoes, sweetcorn). This isn’t quite companion planting, but worked well.

I gave up an entire bed to potatoes. And also tried growing some in pots.

I tried to plant up the peas more densely, to fully fill the space under the frame.

I also made sure I planted up all the pots I have which meant getting some more soil and compost.

I didn’t replace any of the beds, as I figured they’d got at least another year or so in them. I did repair one of them though.

I didn’t look into soil improvers. And I didn’t grow many more chillies, but I did give up on strawberries.

I continued to get distracted by bees.

What did we grow?


The final list for this year was (new things in bold).

Basil, Beans, Blueberries, Butternut Squash, Carrots (2 varieties), Cucumber, Jalapeños, Lettuce (2 varieties), Peas, Mint, Potato, Radish, Shallots, Scotch Bonnets, a “Snacking Pepper” (not sure of the variety), Spinach, Spring Onion, Sweetcorn, Swiss Chard, Thyme, Tomatoes (3 varieties)

Things I didn’t grow this year: Strawberries.

What didn’t go so well?

  • We didn’t get any crop from the Blueberries last year, which I’d expected, so I was hoping we’d get some this year. They flowered, we had some berries but just as they got ripe enough to pick…the birds ate ’em all
  • Just couldn’t get the peas to grow well this year. I did 2-3 plantings and only ended up with a few scrawny plants. Not sure if it’s the soil, the heat this year, or lack of water
  • Tried a few different varieties of tomatoes and while they cropped, it was just a few tomatoes at a time. Meaning that we just had a handful occasionally. I’d kept them well fed and watered, and pruned as advised, so not sure what we’re doing wrong here
  • Butternut squash completely failed, as did a few of the cucumber plants
  • The Snacking Pepper only had a few fruits and they were disappointingly tasteless
  • Sweetcorn crop wasn’t great. One good cob on each plant, rather than two
  • Scotch Bonnets weren’t as feisty as I hoped. Maybe they needed more sun?

It was a bit frustrating this year that so many things failed or produced a lacklustre crop.

It was incredibly hot this year, so I’m putting at least some of these issues down to problems keeping everything well watered. Also I think investing in some soil improvement would be sensible now.

What went well?

  • Decent crop of Jalapeños for the second year running. They seem to do well in pots. There were enough for me to freeze a bunch of them
  • I tried growing some basil in a pot, as well as in the ground. They both grew well, and the basil in the pot is now on the window ledge in the kitchen. Still growing well
  • We had a good crop of potatoes, although they were quite small
  • Making better use of the space helped get more out of the ground. And the frame for the cucumber plants worked brilliantly
  • In desperation after the peas almost completely failing, I tried planting some beans. Never grown them before and we don’t regularly eat them. They worked brilliantly. Had a huge crop and they were delicious. Used quite a few in stir-fries. Definitely one for next year.

What will I do differently?

  • Look at some soil improvers
  • It was good to have the basil, thyme and mint available. I want to look at growing more herbs next year
  • I lost a few seedlings to frost. I need to either acclimate them more before planting out and/or put them in my small grow tunnel in pots before planting out. I think I’m rushing to get things into the ground, whereas in the first years of growing I was more cautious
  • If summers are going to continue to be this warm, and we have more hosepipe bans, I’ve got to think about water storage. I’m not sure I can fit in another water butt, but need to look at options there, as well as maybe adding a second compost bin

Having the space to grow vegetables is a privilege and I’m very glad and very lucky to have the opportunity.

Gardening can be time consuming and frustrating, but I love being able to cook with what I’ve grown myself. Getting out into the garden, doing something physical, seeing things grow is also a nice balm.

Looking forward to next year.


Leigh Dodds: Useful resources for designing data rich pages

One of the big projects we’ve currently got under way at Energy Sparks is redesigning the collection of pages that present the results of our detailed analysis of their energy data to school users.

The existing pages have been around for a few years and our metrics and user testing have shown that they aren’t really performing well. They need a clearer content model, better navigation, and to do a better job of highlighting the key insights to users.

We also need to review and translate all of the content into Welsh as part of the next phase of translating the service into Welsh. As a result we’ve been doing quite a bit of prototyping, user research and testing over the past few months. It’s something I’m really enjoying.

To help inform our thinking I’ve been looking for existing guidance around presenting charts, data and supporting analysis to users.

I thought I’d share a few of the resources that I’ve found useful.

The GDS Design System and Advice

The GDS Design System is a resource that I frequently use as a reference when doing any UX/UI work. Although I’ll often look at other design systems too.

While the design system seems primarily geared towards the development of transactional services, it has some useful patterns that are applicable to the thinking we’ve been doing around our advice pages.

For example we’re considering how to use progressive disclosure to avoid drowning users in details. Patterns like Details, Inset Text and Warning Text are helpful references.
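The Details pattern, for instance, is built on the native HTML `<details>` element. As an illustrative sketch only (the class names follow the GOV.UK Design System's conventions, but check the current release before copying; the text is a made-up example):

```html
<!-- Progressive disclosure: the summary line is always visible,
     the body is only revealed when the user expands it -->
<details class="govuk-details">
  <summary class="govuk-details__summary">
    <span class="govuk-details__summary-text">
      How do we calculate your school's energy savings?
    </span>
  </summary>
  <div class="govuk-details__text">
    The detailed explanation lives here, out of the way of
    time-poor users who just want the headline insight.
  </div>
</details>
```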

The GDS guidance on planning and writing content and recommendations for publishing statistics, making tables accessible and presenting numbers are all useful and relevant resources.

Our users have a wide range of backgrounds and knowledge, but their common characteristic is that they’re all time-poor. We need our analysis to be clear and accessible.

ONS Style Guide

Another resource that I’m frequently referencing is the ONS Style Guide, and in particular, the detailed recommendations around presenting charts and tables.

We need to ensure that our analysis is well-presented, but also has the appropriate footnotes and details that will allow our more advanced users to dig into the details. Reading how the ONS approaches presenting often very detailed statistical data is very helpful.

We also want our charts and tables to demonstrate good practice, as often they’re being looked at by children. It’s important that the service reinforces the good practices they’re being taught around data literacy.

Plain numbers and plain English

I’m lumping these examples together as I discovered them via a helpful Mastodon post:

For obvious reasons we want our advice to be as widely accessible as possible. We’ve got a lot to improve on the site in that area, but I’m confident we can make some quick progress as we build out our new pages and iterate further on existing pages across the site.

As a small team, it’s really helpful to be able to benefit from the insights, user testing and design skills of much larger organisations. It’s always important to filter that advice through the lens of your own users and product, but when others work in the open in this way, we can all benefit.

If you have other resources that you think I should be looking at, then leave a comment or drop me a message on Mastodon.


Leigh Dodds: Recreating sci-fi terminals using VHS

I heard about VHS recently. It’s a tool for creating recordings of command-line tools, so you can create little demos and tutorials about how to use them.

You can write a script to run commands, manipulate and theme the terminal and produce output in a range of formats.
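For a flavour of the format, here's a minimal sketch of a tape using VHS's core commands (Output, Set, Type, Sleep, Enter); the file names, dimensions and text are made up for illustration:

```
# demo.tape -- a minimal VHS script
Output demo.gif

Set FontSize 22
Set Width 800
Set Height 400

Type "echo 'Wake up, Neo...'"
Sleep 1s
Enter
Sleep 2s
```

Running `vhs demo.tape` then renders the scripted session out to `demo.gif`.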

I started thinking about how I could enhance some of the technical documentation I’ve been writing recently with little videos. It’d be a nice way to provide an alternative way for people to learn a process or a new tool.

But then I realised I could use it to do something much more fun: recreate some scenes from some sci-fi films.

So I give you sci-fi terminals. A little github repo with some VHS tapes that produce the following output.

Neo receiving messages from Trinity in the Matrix

Trinity hacking using NMAP

Dennis Nedry’s terminal in Jurassic Park

Hacking WOPR in War Games

WOPR realising Thermonuclear war is a mug’s game

Ripley asking about Special Order 937

Check the repo if you’re interested in how the tapes work. I’m pretty pleased with the results!


Leigh Dodds: Viewing historical maps of Bath in Google Earth

I’ve been tidying up some of my online presence this week, including getting rid of a server I wasn’t using any more and moving some projects around.

One of those was a project I did a few years ago to digitise some historical maps of Bath, georeference them so they can be overlaid onto current web maps, and then publish them for use in Google Earth.
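The KML mechanics are simple enough to sketch: each georeferenced image becomes a GroundOverlay, with a LatLonBox giving the bounds to drape it over. A minimal illustration in Python (the map name, image URL and coordinates below are made-up placeholders, not the real published files):

```python
# Sketch only: generate a KML GroundOverlay for a georeferenced map image.
# The name, image URL and bounding box are hypothetical placeholders.

def ground_overlay_kml(name: str, image_href: str,
                       north: float, south: float,
                       east: float, west: float) -> str:
    """Return a KML document that drapes an image over a lat/lon box."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <GroundOverlay>
    <name>{name}</name>
    <Icon><href>{image_href}</href></Icon>
    <LatLonBox>
      <north>{north}</north>
      <south>{south}</south>
      <east>{east}</east>
      <west>{west}</west>
    </LatLonBox>
  </GroundOverlay>
</kml>"""

doc = ground_overlay_kml(
    "Bath, 1900 (example)",
    "https://example.org/maps/bath-1900.png",
    north=51.40, south=51.36, east=-2.33, west=-2.39,
)
```

Google Earth fetches the image from the `href` and stretches it over the box, which is why a problem serving the image files stops the overlays loading.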

Having fixed up some SSL issues with the serving of the KML and image files, the maps are now working again.

I plan to do some more work on this as I’ve since found a number of other maps that I’ve also been digitising. There are also better ways to publish these maps today.

But for now I thought I’d quickly write up how to use them using current versions of Google Earth.

Google Earth Pro

The desktop version is now called Google Earth Pro. It’s still free.

If you want to view an individual map, then click one of the links in the table and your browser will download a file called doc.kml.

With the desktop application installed, double-clicking should open any KML file in Google Earth. So just click the file to open it.

You’ll probably want to turn off some of the default map layers like 3D Buildings as otherwise the map will look odd.

You can then explore the file using the normal navigation controls.

One thing I like to do is use the Opacity setting to fade out the historical map slightly so you can see the modern day features underneath.

The best way to import all of the maps into your application is to:

The application will then add a new folder which contains a sub-folder for each of the historical maps. You can then choose which ones to switch on and off. By playing with the opacity you can explore several maps in one go.

Here’s a video of me doing that.

Google Earth on the Web

This version of Google Earth doesn’t support as many features. So it just doesn’t work as well as the desktop version. But you can still view the maps.

  • Visit the website at
  • From the left hand menu click the “Projects” icon
  • Choose “New Project” and “Import KML file from computer”
  • You can then choose one of the doc.kml files you’ve downloaded from the website

You can turn off 3D Buildings by going to the “Map Styles” menu option and choosing the “Clean” style.

Unfortunately I can’t find a way to change the opacity of layers in this version, so you’re limited in how you can explore the maps.

The web version also has limited KML support so you can’t open the complete folder of maps, which is a shame.

So, there you have it. I’m pleased the maps are working again but lots to do to make the whole experience better.


Leigh Dodds: Highlighting harms when writing design patterns

I enjoy writing design patterns.

I find them a useful way to clarify my thinking around different solutions to problems across a whole range of areas. A well-named pattern can also help to clarify and focus discussion.

I’ve written a whole book of Linked Data patterns and led a team that produced a set of design patterns for collaborative maintenance of data.

I’ve been planning to revisit some writing and thinking I’ve been doing around capturing design patterns for different models of data access, sharing, governance and modelling. Given all the confusing jargon that is thrown around in this space, I think writing some design patterns might help.

When I’ve been writing design patterns in the past I’ve used a fairly common template:

  • a short description of a problem
  • the context in which this problem might surface
  • an outline of a solution
  • some examples
  • a discussion which might discuss variations of the solution or its limitations
  • links to other patterns

But I’ve been thinking about iterating on this to add a new section: harms.

This would lay out the potential consequences or unintended side effects of adopting the pattern. Both to the system in which they are implemented, but also more broadly.
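Concretely, the extended template might look something like this (the wording of each prompt is my own):

```
Pattern: <name>

Problem:    the recurring problem being addressed
Context:    when and where the problem surfaces
Solution:   an outline of the approach
Examples:   one or two concrete uses
Discussion: variations on the solution and its limitations
Harms:      potential consequences and unintended side effects,
            both within the system and when the data or design
            is reused outside its original context
Related:    links to other patterns
```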

I started thinking about this after reading a paper that discusses the downsides of poor modelling of sex and gender in health datasets used in machine-learning. I’d highly recommend reading the paper, regardless of whether you work in health or machine-learning. This paper about database design in public services is also worth a look while you’re at it. I wrote a summary.

While the sex/gender paper doesn’t describe the issues in terms of design patterns, it’s largely a discussion of the impacts of specific data modelling decisions.

Some of these decisions are just poor. Capturing unnecessary personal data. Simplistic approaches to describing sex and gender.

Work on design patterns has long attempted to highlight poor designs. For example by describing “anti-patterns” or “deceptive design patterns” (don’t call them dark patterns).

But some of the design decisions highlighted in that research paper are more nuanced. Decisions which may have been justified within the scope of a specific system, where their limitations may be understood and minimised, but whose impacts are greatly amplified as data is lifted out of its original context and reused.

This means that there’s not a simple good vs bad decision to record as a pattern. We need an understanding of the potential consequences and harms as an integral part of any pattern.

Some pattern templates include sections for “resulting context” which can be used to capture side effects. But I think a more clearly labelled “harms” section might be better.

If you’ve seen good examples of design patterns that also discuss harms, I’d be interested to read them.


Leigh Dodds: Reflecting on Energy Sparks as a “Community Tech” project

I recently attended the launch event for the new Power To Change Community Tech Fund and have been reading through the essays and report on the Community Tech network website.

It’s great to see this topic getting some attention and much needed funding. It’s also prompted me to reflect a bit on my own experience with community-led projects.

For a few years I was leading Bath: Hacked which was a community group and then a small CIC, run by volunteers, who were trying to foster use of open data by the local communities of Bath & North East Somerset.

We managed to support the council and others in publishing open data. And we ran a lot of events, meetups and hack days to support and encourage use of the data.

For example, Accessible Bath involved us mapping accessibility of shops, restaurants and other locations around the city. We consulted with local people who had mobility issues to identify some useful actions we could undertake, rather than just jumping into the technology. And in the end we worked largely within the technical infrastructure of other existing platforms rather than creating our own.

However, it’s Energy Sparks that has been the lasting legacy of Bath: Hacked.

There’s a blog post by the ODI that talks about the history, so I’m not going to write a full origin story. I just wanted to highlight the community tech aspects.

Energy Sparks was initially a local project. Philip, the founder, had already been working with local schools for some time through Transition Bath. He was using spreadsheets to analyse energy data for them and the council.

A Bath: Hacked hack day provided the opportunity for Philip to work with others from the local tech community to prototype an online service. We then took it from there to something that was eventually launched to a range of schools in B&NES.

The team was originally all local: Philip, the energy analyst, working with developers, students, educators and others. A mixture of local expertise. The focus was very much on delivering benefits for our local schools.

This feels like it firmly fits within the definition of “community tech”.

My original idea was to offer Energy Sparks as an open source platform that other communities could deploy and use for themselves. With very limited funding, and with the initial team being largely volunteers, I wasn’t sure we could scale up the service to support other areas. Or even get access to the necessary data to make that possible.

So I was interested in exploring a more decentralised model that would allow other areas to run the service for themselves.

While I was working at the Open Data Institute I’d helped lead a short research project looking at how to scale local, open data enabled innovation. I’d seen so many interesting things created for local areas, but also felt frustrated when trying to replicate them locally. Recognising that others might build on your work, and planning for that to happen, seemed to be an important part of making that successful.

So I was keen to see if this model might work for Energy Sparks. Ultimately it didn’t. For several reasons.

Firstly, while we had interest from other areas, there wasn’t the technical capability necessary to actually launch and run a local version of the service. All of the code was available. It was possible to run it on free or very cheap infrastructure, but it still needed some technical skill to customise and deploy. And those who were interested in reusing the service weren’t necessarily technical, and didn’t have access to developers or a local community who could support them.

Secondly, it proved easier to secure funding to scale than it was to help others secure the funding they needed locally. I think this is because funders seem mostly interested in scaling things up, rather than in seeing things replicated.

We’d pieced together funding (and a lot of volunteer effort) from the “ODI Summer Showcase”, Bath & West Community Energy, the Nature Save Trust and others to support the early development.

The scale up support came from the BEIS NDSEMIC innovation competition. That supported hiring a team and investing further in the technology. Allowing more of that spreadsheet based analysis to be turned into an automated tool.

There are clearly economies of scale to be had in scaling up. But it also has its own challenges. While it might not have been right for Energy Sparks, I still think there are benefits to be had from replicating technology locally and not just scaling upwards. It just needs a good alignment of motivation, funding and skills. Something I hadn’t fully appreciated at the outset.

It was through that BEIS funding, and then later support from the Ovo Foundation, Centrica and more recently the Department for Education, that we’re now able to offer a national service, which is free for state schools. Today we have a core team of eight people and a network of freelancers.

The code is still open source. But it’s now more of a transparency measure or insurance policy than an attempt to collaboratively build the technology.

Our shift to a national service means our notion of community has changed.

For example, we’re no longer primarily place-based. But there is still place-based activity and engagement through our partnership with Egni. Egni are running a package of educational and creative workshops with a range of schools in Wales. Energy Sparks is a core component of that. Together we’re able to deliver on our goals whilst supporting them in achieving their goals of creating more community-owned clean energy. Working together rather than in competition.

If Energy Sparks is a “platform”, this is the kind of platform we want it to be.

Our community is now the schools and teachers using Energy Sparks to tackle the energy crisis and educate young people about climate change. We’re creating steering groups in Wales and England to involve them more with our planning, whilst continuing to engage through existing networks of sustainability and climate groups.

I’m proud of having had the opportunity to be involved in the project and to continue to be part of the team working to deliver more impact through our work.

I’m looking forward to seeing what other projects spring up from the Power To Change fund.

Posted at 16:05

Leigh Dodds: It’s just a spreadsheet, but it’s still data infrastructure

I’ve found my new favourite example of a well documented, tiny slice of data infrastructure.

I’m going to hazard a guess that it’s probably the simplest dataset that is designated as national statistics. If you can think of one simpler, then let me know.

It’s the weekly road fuel prices data on

This data has been updated every week, without fail, since September 2013.

The CSV has got seven columns in it. You can download it in XML if you want. Or you can grab the Excel which comes with some fancy clipart of petrol pumps.

This is national statistics, so of course there’s a document that describes the methodology for how it is compiled. I’ve read it. At four pages, it’s short, clear and to the point.

There’s a bit more to it, but basically, every Monday someone at BEIS emails six companies and asks them for their prices. By the end of the day they put the responses in the spreadsheet and then it’s published on the Tuesday.


Someone, or more likely some team, at BEIS has been doing that for at least nine years. No one has bothered to automate it away. Probably because it’s not a lot of effort to keep updating the spreadsheet.

If we were designing this from scratch, we’d probably immediately start thinking about services and APIs and data formats. But none of that is really needed.

It just needed a spreadsheet and a commitment to keep publishing the data.

That’s what makes it data infrastructure. The commitment, not the technology.

Posted at 16:05

Leigh Dodds: Notice of plans to erect…another big database

This privacy notice went past in my twitter stream earlier.

It announces that the UK government is planning to create a new database that will hold some quite detailed data about every electricity meter in the UK. In particular it’ll combine information about the meter, the energy consumption and billing details associated with that energy supply, and detailed information about the person paying those bills.

Apparently it’s intended to help support fraud prevention around the Energy Price Guarantee (EPG) Scheme.

Unsurprisingly there’s not a great deal of detail beyond a broad outline of the data to be aggregated. But it looks to be a database that will consist of data about all meters, not just smart meters. And, while it says “each and every electricity and gas meter”, I suspect they actually mean every domestic meter.

The database will also apparently contain data about electricity consumption. But not gas? (I suspect that’s an oversight). It’s unclear what granularity of consumption it’ll contain, but I’d hazard that it’ll be daily, monthly or quarterly rather than detailed half-hourly readings.

Reading the notice, my big unanswered question was: OK, but why build a new database?

Specifically, what are the technical requirements of the service to be built around that data, that means that it needs to be held in one big database?

The UK’s smart metering data infrastructure was designed to avoid having a single big database. So why do that here?

Is it really easier to merge and aggregate all this data into one pot than, say, carrying out some kind of integration with the data already held in energy company systems?

It probably is easier to aggregate. And it’ll probably be easier to build a system around that, than a bunch of loosely joined parts.

But, given the government’s own desire to have a “digital spine” to support sharing data across energy companies, and key elements of its own data strategy, shouldn’t it be considering all of the options?

And maybe it has. That’s the problem with privacy notices: they just give you the results of a decision. The decision to process some data. We get no insight into why it needs to be done in this specific way. Even though building trust starts with being transparent from the start.

I found myself thinking about planning notices rather than privacy notices. And then remembered that Dan Hon had written about this recently.

In a world where we have increasingly sophisticated means of securely sharing data without it having to be moved around, tell me why you need to build another great big database rather than use any of those other solutions.

Posted at 16:05

Datagraph: What's New in RDF.rb 0.3.0

It has now been nine months since the initial public release of RDF.rb, our RDF library for Ruby, and today we're happy to announce the release of RDF.rb 0.3.0, a significant milestone.

As the changelog attests, this has been a long release cycle, incorporating 170 commits by 6 different authors. The major new features include transactions and basic graph pattern (BGP) queries, as well as robust and fast parser/serializer plugins for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing the previously supported formats. In addition, many bugs have been fixed and general improvements made, including significant performance gains.

RDF.rb 0.3.0 is immediately available via RubyGems, and can be installed or upgraded to as follows on any Unix box with Ruby and RubyGems:

$ [sudo] gem install rdf

In all the code examples that follow below, we will assume that RDF.rb and the built-in N-Triples parser have already been loaded up like so:

require 'rdf'
require 'rdf/ntriples'

# Enable facile references to standard vocabularies:
include RDF

RDFa, N3, Turtle, and RDF/XML Support

While RDF.rb 0.3.0 continues with our minimalist policy of only supporting the N-Triples serialization format in the core library itself, support for every widely-used RDF serialization format is now available in the form of plugins.

Thanks to the hard work of Gregg Kellogg, the author of RdfContext, there are now RDF.rb 0.3.0-compatible plugins available for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing the previously available plugins for the RDF/JSON and TriX formats. See Gregg's blog post for more details on the particulars of these plugins.

We are also pleased to announce that Gregg has joined the RDF.rb core development team, which now consists of him, Ben Lavender, and myself. This merger between the RDF.rb and RdfContext efforts is a perfect match, given that Ben and I have been focused more on storing and querying RDF data while Gregg has been busy single-handedly solving all RDF serialization questions.

To facilitate typical Linked Data use cases, we now also provide a metadistribution of RDF.rb that includes a full set of parsing/serialization plugins; the following will install all of the rdf, rdf-isomorphic, rdf-json, rdf-n3, rdf-rdfa, rdf-rdfxml, and rdf-trix gems in one go:

$ [sudo] gem install linkeddata

Similarly, instead of loading up support for each RDF serialization format one at a time, you can simply use the following to load them all; this is helpful e.g. for the automatic selection of an appropriate parser plugin given a particular file name or extension:

require 'linkeddata'

For a tutorial introduction to RDF.rb's reader and writer APIs, please refer to my previous blog post Parsing and Serializing RDF Data with Ruby.

Query API: Basic Graph Patterns (BGPs)

The query API in RDF.rb 0.3.0 now includes basic graph pattern (BGP) support, which has been a much-requested feature. BGP queries will already be a familiar concept to anyone using SPARQL, and in RDF.rb they are constructed and executed like this:

# Load some RDF.rb project information into an in-memory graph:
graph = RDF::Graph.load("")

# Construct a BGP query for obtaining developers' names and e-mails:
query = RDF::Query.new({
  :person => {
    RDF.type  => FOAF.Person,
    FOAF.name => :name,
    FOAF.mbox => :email,
  }
})

# Execute the query on our in-memory graph, printing out solutions:
query.execute(graph).each do |solution|
  puts "name=#{solution.name} email=#{solution.email}"
end

Executing a BGP query returns a solution sequence, encapsulated as an instance of the RDF::Query::Solutions class. Solution sequences provide a number of convenient methods for further narrowing down the returned solutions to what you're actually looking for:

# Filter solutions using a hash:
solutions.filter(:author  => RDF::URI(""))
solutions.filter(:author  => "Arto Bendiken")
solutions.filter(:updated => RDF::Literal(Date.today))

# Filter solutions using a block:
solutions.filter { |solution| solution.author.literal? }
solutions.filter { |solution| solution.title =~ /^SPARQL/ }
solutions.filter { |solution| solution.price < 30.5 }
solutions.filter { |solution| solution.bound?(:date) }
solutions.filter { |solution| solution.age.datatype == XSD.integer }
solutions.filter { |solution| solution.name.language == :es }

# Reorder solutions based on a variable:
solutions.order_by(:updated, :created)

# Select particular variables only:
solutions.select(:title, :description)

# Eliminate duplicate solutions:
solutions.distinct

# Limit the number of solutions:
solutions.limit(10)

# Count the number of matching solutions:
solutions.count { |solution| solution.price < 30.5 }
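
To make the semantics concrete, here is a toy illustration of what BGP evaluation does conceptually: matching triple patterns against a graph and joining the resulting variable bindings. This is plain Ruby with made-up data, entirely independent of RDF.rb's actual implementation:

```ruby
# Toy BGP matcher: triples are plain arrays, symbols act as query variables.
TRIPLES = [
  ["alice", "type", "Person"],
  ["alice", "name", "Alice"],
  ["alice", "mbox", "alice@example.org"],
  ["bob",   "type", "Person"],
  ["bob",   "name", "Bob"],
]

# Match a single triple pattern, returning one bindings hash per match.
def match_pattern(pattern, triples)
  triples.filter_map do |triple|
    bindings = {}
    matched = pattern.zip(triple).all? do |term, value|
      if term.is_a?(Symbol)
        bindings[term] = value
        true
      else
        term == value
      end
    end
    bindings if matched
  end
end

# Evaluate several patterns, joining solutions on shared variables.
def match_bgp(patterns, triples)
  patterns.reduce([{}]) do |solutions, pattern|
    solutions.flat_map do |solution|
      bound = pattern.map { |t| t.is_a?(Symbol) && solution.key?(t) ? solution[t] : t }
      match_pattern(bound, triples).map { |b| solution.merge(b) }
    end
  end
end

solutions = match_bgp(
  [[:person, "type", "Person"],
   [:person, "name", :name]],
  TRIPLES
)
#=> [{:person=>"alice", :name=>"Alice"}, {:person=>"bob", :name=>"Bob"}]
```

A real store does the same join, but against indexes rather than a linear scan, which is exactly the kind of optimization the `query_execute` hook below exists for.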

BGP-capable storage adapters should override and implement the following RDF::Queryable method in order to provide storage-specific optimizations for BGP query evaluation:

class MyRepository < RDF::Repository
  def query_execute(query, &block)
    # ...
  end
end

Repository API: Transactions

The repository API in RDF.rb 0.3.0 now includes basic transaction support:

# Load some RDF.rb project information into an in-memory repository:
repository = RDF::Repository.load("")

# Delete one statement and insert another, atomically:
repository.transaction do |tx|
  subject = RDF::URI('')

  tx.delete [subject, DC.title, nil]
  tx.insert [subject, DC.title, "RDF.rb 0.3.0"]
end

As you would expect, if the transaction block raises an exception, the current transaction will be aborted and rolled back; otherwise, the transaction is automatically committed when the block returns.

Transaction-capable storage adapters should override and implement the following three RDF::Repository methods:

class MyRepository < RDF::Repository
  def begin_transaction(context)
    # ...
  end

  def rollback_transaction(tx)
    # ...
  end

  def commit_transaction(tx)
    # ...
  end
end

The RDF::Transaction objects passed to these methods consist of a sequence of RDF statements to delete from, and a sequence of RDF statements to insert into, a given graph. The default transaction implementation in RDF::Repository simply builds up a transaction object in memory, buffering all inserts/deletes until the transaction is committed, at which point the operations are then executed against the repository.
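
That buffering behavior can be sketched in a few lines of plain Ruby. This is a conceptual toy model, not RDF.rb's actual classes; all names are invented for the sketch:

```ruby
# Toy model of buffered transactions: deletes and inserts accumulate in
# memory and are applied to the store only when the block returns
# normally; an exception propagates before the commit, leaving the
# repository untouched.
class ToyTransaction
  attr_reader :inserts, :deletes

  def initialize
    @inserts = []
    @deletes = []
  end

  def insert(statement)
    @inserts << statement
  end

  def delete(statement)
    @deletes << statement
  end
end

class ToyRepository
  attr_reader :statements

  def initialize
    @statements = []
  end

  def transaction
    tx = ToyTransaction.new
    yield tx                   # if this raises, we never reach the commit
    @statements -= tx.deletes  # "commit": apply the buffered operations
    @statements += tx.inserts
    self
  end
end

repo = ToyRepository.new
repo.transaction do |tx|
  tx.insert ["rdf", "title", "RDF.rb 0.3.0"]
end
repo.statements #=> [["rdf", "title", "RDF.rb 0.3.0"]]
```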

Note that whether transactions are actually executed atomically depends on the particulars of the storage adapter you're using. For instance, the RDF::DataObjects plugin, which provides a storage adapter supporting SQLite, PostgreSQL, MySQL, and other RDBMS solutions, will certainly be able to offer ACID transaction support (though it has not been updated for that, or other 0.3.x features, just yet).

On the other hand, not all NoSQL solutions support transactions, so storage adapters for such solutions may choose to omit explicit transaction support and have it supplied by RDF.rb's default implementation.

Performance & Scalability Improvements

In earlier RDF.rb releases, our focus was strongly centered on defining the core APIs that have enabled the thriving plugin ecosystem we can witness today. The focus was not so much, therefore, on the performance of the bundled default implementations of those APIs; in some cases, these implementations could have been described as being of only proof-of-concept quality.

In particular, the in-memory graph and repository implementations were suboptimal in RDF.rb 0.1.x, and only somewhat improved in 0.2.x. However, reflecting the increasing production-readiness of RDF.rb in general, matters have been much improved in RDF.rb 0.3.0.

Of course, performance improvements are an open-ended task, and I'm sure we'll see more work on this front in the future as need arises and time permits. But it's likely that RDF.rb 0.3.0 now offers a sufficient out-of-the-box performance level for many if not most common use cases.

Scalability has also been addressed by making use of enumerators throughout the APIs defined by RDF.rb. That means that all operations are generally performed in a streaming fashion, enabling you to build pipelines for hundreds of millions of RDF statements to flow through while still maintaining constant memory usage by ensuring that the statements are processed one by one.
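
The streaming idea here is just Ruby's enumerator protocol. A stdlib-only sketch of a constant-memory pipeline (the data and stage logic are made up for illustration):

```ruby
# A constant-memory pipeline built on Ruby's lazy enumerators: statements
# are generated and flow through the stages one at a time, never
# materialized as a full in-memory list.
source = Enumerator.new do |yielder|
  100_000.times { |i| yielder << ["s#{i}", "p", "o#{i}"] }
end

first_subjects = source.lazy
  .select { |_s, p, _o| p == "p" }   # keep statements with predicate "p"
  .map    { |s, _p, _o| s }          # project out the subject
  .first(3)                          # stops the pipeline after 3 results
#=> ["s0", "s1", "s2"]
```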

RSpec 2.x Compatibility

Lastly, RDF.rb 0.3.0 has been upgraded to use and depend on RSpec 2.x instead of the previous 1.3.x branch. This requires minor changes to the spec/spec_helper.rb file in any project that relies on the RDF::Spec library. The most minimal spec_helper.rb contents are now as follows:

require 'rdf/spec'

RSpec.configure do |config|
  config.include RDF::Spec::Matchers
end

Kudos to Our Contributors

In tandem with the soon-to-be 10,000 downloads of RDF.rb, a very positive sign of all the interest and ongoing work around RDF.rb is our growing contributor list. We thank everyone who has sent in bug reports, and in particular the following people who have contributed patches to RDF.rb and/or an RDF.rb plugin; in alphabetical order:

Călin Ardelean, Christoph Badura, John Fieber, Joey Geiger, James Hetherington, Gabriel Horner, Nicholas Humfrey, Fumihiro Kato, David Nielsen, Thamaraiselvan Poomalai, Keita Urashima, Pius Uzamere, and Hellekin O. Wolf.

(My apologies if I have inadvertently omitted anyone from the previous list; please let me know if so.)

Looking Forward to Hearing From You

As always, if you have feedback regarding RDF.rb please contact us either privately or via the mailing list. Plain and simple bug reports, however, should preferably go directly to the issue queue on GitHub.

Be sure to follow @datagraph, @bendiken, @bhuga, and @gkellogg on Twitter for the latest updates on RDF.rb as they happen.

Posted at 16:05

Datagraph: Spira: A Linked Data ORM for Ruby

I've just released Spira, a first draft of an RDF ORM, where the 'R' can mean RDF or Resource at your pleasure. It's an easy way to create Ruby objects out of RDF data. The name is from Latin, for 'breath of life'--it's time to give those resource URIs some character. It looks like this (feel free to copy-paste):

require 'spira'
require 'rdf/ntriples'

repo = ""
Spira.add_repository(:default, RDF::Repository.load(repo))

class Person
  include Spira::Resource

  property :name,  :predicate => FOAF.name
  property :nick,  :predicate => FOAF.nick
end

jhacker = RDF::URI("").as(Person)
jhacker.name  #=> "J. Random Hacker"
jhacker.nick  #=> "jhacker"

jhacker.name = "Some Other Hacker"
jhacker.save!

Why a new project?

I try not to start new projects lightly. There's plenty of good stuff out there. But there wasn't quite what I wanted.

First of all, I want to program in Ruby, so it needed to be Ruby. Spira, while different, has a lot of overlap with a traditional ORM, and I was on the fence for a while about starting Spira or trying to implement things in DataMapper. There's already an RDF.rb backend for DataMapper, which is cool, but using it really cuts you off from RDF as RDF. It's more about making RDF work how DataMapper likes it. DataMapper's storage adapter interface is an implicit data model, one that is not RDF's, and it is not quite what I wanted.

On the RDF-specific front, there's ActiveRDF. ActiveRDF is based on SPARQL directly, and thus, while not hiding RDF from you, only gives you access via Redland. The Redland Ruby bindings have problems, and do not represent the entire RDF ecosystem. I wanted to start on something that completely abstracted away the data model, so I could focus on the problem at hand, which means RDF.rb. The difference is in allowing me to focus on what I'm focusing on: there exists a perfectly good, working SPARQL client storage adapter for RDF.rb, but it's one of many pluggable backends instead of a requirement.

Lastly, while both of those projects would represent a workable starting point, this was something of a journey of exploration in terms of semantics. Spira was going to be 'open model' from the start; I specifically wanted something that could read foreign data. By 'open model' I mean that Spira does not expect that a class definition is the authoritative, exclusive, or complete definition of a model class. That turns out to make Spira have some important semantic differences from ORMs oriented around object or relational databases. Stumbling on them was part of the fun, and even if I could have twisted DataMapper around the problem, I'm not sure that starting from there would have had me focusing on the core semantics.

So I decided to start something new. To be fair, Spira would suck a lot more were it not for the projects that came before it. In particular, it owes an intellectual debt to DataMapper, which has a generally sane model, readable code, and had to cover a lot of ground that any object-whatever-mapper would. It takes some digging, but as an example, one can find IRC logs where the DataMapper team discusses the ups and downs of identity map implementations in Ruby. That stuff is amazing to have available without spending hundreds of hours fighting it yourself, and again, it saves me a lot of trial and error on ancillary considerations.

Making things simple

Spira's core use case is allowing programmers to create Ruby objects representing an aspect of an RDF resource. I'm still working on which terminology I like best, but I am leaning towards calling instances of Spira classes 'a projection of a given RDF resource as a Spira resource.' In the simplest of terms, Spira tries to let you create classes that easily get and set values for properties that correspond to RDF predicates. The README will explain it better than I want to in this post (now available in Github and Yardoc flavors).

The hopeful end result is a way to access the RDF data model in a way that agile web programmers have come to expect, without forcing them to get bogged down into a world of ontologies, rule languages, inference levels, and lord knows what-all else. RDF has taken off in the enterprise because of power user features, and we're approaching a critical mass of RDFa publishing, but it's not yet on anyone's radar as a data model for their next weekend project. I think that's a shame--RDF's schema-free model should be the easiest thing in the world to get started on. So in addition to hopefully being an open-model ORM, here's hoping Spira is a step in the adoption of RDF as a day-to-day data model.

So what's 'Open Model' mean?

Any useful abstraction layer is about applying constraints. Normal ORMs hide the power of relational databases to make them into proper object databases. Spira constrains you to a particular aspect of a resource. That means that in the aspect of 'Person', a resource's name is a given predicate, and they only have one. A person might also have a label, multiple names, a comment, function as a category or tag, have friends, have accounts, have tons of other stuff, but if all you want is their age, you just want to say person.name and person.age. The goal here is to let you use data (or at least, to have defined behavior for data) that you cannot say for sure meets any sort of criteria you set in Spira. Spira will have defined behavior for when data does not match a model class, and will still let you use that data easily, pretending it came from a closed system. That's good enough surprisingly often.
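
In plain Ruby terms, a projection is just a narrow accessor layer over a wider bag of properties. Here is an informal toy sketch of the idea (not Spira's implementation; all class and property names are invented):

```ruby
# Toy "aspect" projection: the class declares which properties it exposes
# and silently ignores whatever else the underlying, possibly wild-caught,
# data happens to carry.
class Aspect
  def self.property(name)
    define_method(name) { @data[name.to_s] }
  end

  def initialize(data)
    @data = data
  end
end

class PersonAspect < Aspect
  property :name
  property :age
end

# Data found in the wild, with more properties than the aspect declares:
data = {
  "name"    => "Alice",
  "age"     => 30,
  "label"   => "alice",
  "friends" => ["bob"],
}

person = PersonAspect.new(data)
person.name #=> "Alice"
person.age  #=> 30
```

The extra "label" and "friends" properties are still there in the data; the aspect simply does not expose them, which is the open-model stance in miniature.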

That open-model part is where tough semantics come in. As an example, I had intended to publish, with Spira, a reference implementation of SIOC. The SIOC core classes are in widespread use, so surely this would find some use, I figured. But it's not so simple to make a reference implementation unless you limit your possibilities. For example, a SIOC post can have topics (a sub-class of dcterms:subject). These topics are RDF resources which may be one (or, I suppose, both, or neither) of two classes defined in the SIOC types ontology, Category or Tag. These two classes have completely different semantics. Now, a Spira class could be created to deal with either of them, but to use that class usefully, you'd always be checking what it is, since the semantics are different. Spira will eventually have helpers to help you decide what to do here, but the point is that in RDF, a 'reference implementation' often doesn't make sense as a concept. However, this is at least in principle representable in Spira--I'm not sure it could be done in a traditional ORM, as it doesn't really match the single-table inheritance model.

Instead, I hope Spira classes are simple enough--throw away, even--that you can define them when you need them. Indeed, defining them programmatically is obvious with the framework in place; I just haven't done it yet.

Another example of differing semantics would be instance creation. An RDF resource does not 'exist or not'. It's either the subject of triples or not. So what would it mean to create an instance of a Spira resource and save it when it had no fields? Would one save a triple declaring the resource to be an RDF resource? How about saving the RDF type, should that happen if one has not saved fields? There are good arguments for several options. It's just not the same model as the 'find, create, find_or_create' trio of constructors that the world has grown used to, since the identifiers are global and always exist. Primary keys do not come into existence to allow reference to an object, the key is the object. I dodged the question and now do construction based on RDF::URIs.

Instantiation looks either like this:

Person.for(RDF::URI.new('http://example.org/bob'))

or like this:

RDF::URI.new('http://example.org/bob').as(Person)
There's no finding or creating. Resources just are. Creating a Spira object is creating the projection of that resource as a class. If you've told Spira about a repository where some information about that resource may or may not exist, great, but it's not required.

As another example, I see a lot of need for validations on creating an instance, not just saving one, as in traditional ORMs. RDF is not like the data fed to a traditional ORM, which is generally created by that ORM or by a known list of applications, managed by a set of hard constraints and schema. RDF data is often found, and used, in the wild.

There's still a ton left to do, but lots of stuff already works. The README has a good rundown of where things stand. I'd enumerate the to-do list, but I'd rather not feed that to Google, and it's long enough anyway that if certain deficiencies quickly become obvious, I'd attack them first.

Anyways, hope someone has fun with it. gem install spira are the magic words. If you want to spoil the magic, the code is on Github.

The original version of this post used the term 'Open World' instead of 'Open Model' willy-nilly throughout, but I was corrected from using the term outside its strict meaning in terms of inference. See the comments. If a term exists for what I'm describing at this level of abstraction, I'm all ears.

Posted at 16:05

Datagraph: How RDF Databases Differ from Other NoSQL Solutions

This started out as an answer at Semantic Overflow on how RDF database systems differ from other currently available NoSQL solutions. I've here expanded the answer somewhat and added some general-audience context.

RDF database systems are the only standardized NoSQL solutions available at the moment, being built on a simple, uniform data model and a powerful, declarative query language. These systems offer data portability and toolchain interoperability among the dozens of competing implementations that are available at present, avoiding any need to bet the farm on a particular product or vendor.

In case you're not familiar with the term, NoSQL ("Not only SQL") is a loosely-defined umbrella moniker for describing the new generation of non-relational database systems that have sprung up in the last several years. These systems tend to be inherently distributed, schema-less, and horizontally scalable. Present-day NoSQL solutions can be broadly categorized into four groups: key-value stores, document databases, column-family stores, and graph databases.

RDF database systems form the largest subset of this last NoSQL category. RDF data can be thought of in terms of a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values. Other equally valid ways to understand RDF data include the resource-centric approach (which maps well to object-oriented programming paradigms and to RESTful architectures) and the statement-centric view (the object-attribute-value or EAV model).
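
The statement-centric and resource-centric views are two shapes of the same data. A stdlib-only Ruby sketch, with made-up example URIs, to illustrate the duality:

```ruby
# The same RDF data in its two common views, using plain Ruby structures.
# Statement-centric: a flat list of subject-predicate-object triples.
statements = [
  ["http://example.org/alice", "http://xmlns.com/foaf/0.1/name",  "Alice"],
  ["http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"],
  ["http://example.org/bob",   "http://xmlns.com/foaf/0.1/name",  "Bob"],
]

# Resource-centric: the triples grouped by subject into property maps,
# the shape that fits object-oriented code and RESTful resources.
resources = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = [] } }
statements.each do |s, p, o|
  resources[s][p] << o
end

resources["http://example.org/alice"]["http://xmlns.com/foaf/0.1/name"]
#=> ["Alice"]
```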

Without just now extolling too much the virtues of RDF as a particular data model, the key differentiator here is that RDF database systems embrace and build upon W3C's Linked Data technology stack and are the only standardized NoSQL solutions available at the moment. This means that RDF-based solutions, when compared to run-of-the-mill NoSQL database systems, have benefits such as the following:

  • A simple and uniform standard data model. NoSQL databases typically have one-off, ad-hoc data models and capabilities designed specifically for each implementation in question. As a rule, these data models are neither interoperable nor standardized. Take e.g. Cassandra, which has a somewhat baroque data model that "can most easily be thought of as a four or five dimensional hash" and the specifics of which are described in a wiki page, blog posts here and there, and ultimately only nailed down in version-specific API documentation and the code base itself. Compare to RDF database systems that all share the same well-specified and W3C-standardized data model at their base.

  • A powerful standard query language. NoSQL databases typically do not provide any high-level declarative query language equivalent of SQL. Querying these databases is a programmatic data-model-specific, language-specific and even application-specific affair. Where query languages do exist, they are entirely implementation-specific (think SimpleDB or GQL). SPARQL is a very big win for RDF databases here, providing a standardized and interoperable query language that even non-programmers can make use of, and one which meets or exceeds SQL in its capabilities and power while retaining much of the familiar syntax.

  • Standardized data interchange formats. RDBMSes have (somewhat implementation-specific) SQL dumps, and some NoSQL databases have import/export capability from/to implementation-specific structures expressed in an XML or JSON format. RDF databases, by contrast, all have import/export capability based on well-defined, standardized, entirely implementation-agnostic serialization formats such as N-Triples and N-Quads.
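
Part of what makes N-Triples so implementation-agnostic is its line-oriented simplicity. A deliberately minimal stdlib-only sketch of recognizing one URI-only triple line (a real parser also handles literals, blank nodes, escapes, and comments):

```ruby
# Extremely simplified N-Triples line matcher for URI-only triples.
NT_LINE = /\A<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+\.\z/

def parse_nt_line(line)
  m = NT_LINE.match(line.strip)
  m && m.captures
end

parse_nt_line("<http://example.org/a> <http://example.org/p> <http://example.org/b> .")
#=> ["http://example.org/a", "http://example.org/p", "http://example.org/b"]
```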

From the preceding points it follows that RDF-based NoSQL solutions enjoy some very concrete advantages such as:

  • Data portability. Should you need to switch between competing database systems in-house, to make use of multiple different solutions concurrently, or to share data with external parties, your data travels with you without needing to write and utilize any custom glue code for converting some ad-hoc export format and data structure into some other incompatible ad-hoc import format and data structure.

  • Toolchain interoperability. The RDBMS world has its various database abstraction layers, but the very concept is nonsensical for NoSQL solutions in general (see "ad-hoc data model"). RDF solutions, however, represent a special case: libraries and toolchains for RDF are typically only loosely coupled to any particular DBMS implementation. Learn to use and program with Jena or Sesame for Java and Scala, RDFLib for Python, or RDF.rb for Ruby, and it generally doesn't matter which particular RDF-based system you are accessing. Just as with RDBMS-based database abstraction layers, your RDF-based code does not need to change merely because you wish to do the equivalent of switching from MySQL to PostgreSQL.

  • No vendor or product lock-in. If the RDF database solution A was easy to get going with but eventually for some reason hits a brick wall, just switch to RDF database solution B or C or any other of the many available interoperable solutions. Unlike switching between two non-RDF solutions, this does not have to be a big deal. Needless to say there are also ecosystem benefits with regards to the available talent pool and the commercial support options.

  • Future proof. With RDF now emerging as the definitive standard for publishing Linked Data on the web, and being entirely built on top of indelibly-established lower-level standards like URIs, it's not an unreasonable bet that your RDF data will still be usable as-is by, say, 2038. It's not at all evident, however, that the same could be asserted for any of the other NoSQL solutions out there at the moment, many of which will inevitably prove to be rather short-lived in the big picture.

RDF-based systems also offer unique advantages such as support for globally-addressable row identifiers and property names, web-wide decentralized and dynamic schemas, data modeling standards and tooling for creating and publishing such schemas, metastandards for being able to declaratively specify that one piece of information entails another, and inference engines that implement such data transformation rules.

All these features are mainly due to the characteristics and capabilities of RDF's data model, though, and have already been amply described elsewhere, so I won't go further into them just here and now. If you wish to learn more about RDF in general, a great place to start would be the excellent RDF in Depth tutorial by Joshua Tauberer.

And should you be interested in the growing intersection between the NoSQL and Linked Data communities, you will be certain to enjoy the recording of Sandro Hawke's presentation Toward Standards for NoSQL (slides, blog post) at the NoSQL Live in Boston conference in March 2010.

Posted at 16:05

Datagraph: Parsing and Serializing RDF Data with Ruby

In this tutorial we'll learn how to parse and serialize RDF data using the RDF.rb library for Ruby. There exist a number of Linked Data serialization formats based on RDF, and you can use most of them with RDF.rb.

To follow along and try out the code examples in this tutorial, you need only a computer with Ruby and RubyGems installed. Any recent Ruby 1.8.x or 1.9.x version will do fine, as will JRuby 1.4.0 or newer.

Supported RDF formats

These are the RDF serialization formats that you can parse and serialize with RDF.rb at present:

Format      | Implementation        | RubyGems gem
N-Triples   | RDF::NTriples         | rdf
Turtle      | RDF::Raptor::Turtle   | rdf-raptor
RDF/XML     | RDF::Raptor::RDFXML   | rdf-raptor
RDFa        | RDF::Raptor::RDFa     | rdf-raptor
RDF/JSON    | RDF::JSON             | rdf-json
TriX        | RDF::TriX             | rdf-trix

RDF.rb in and of itself is a relatively lightweight gem that includes built-in support only for the N-Triples format. Support for the other listed formats is available through add-on plugins such as RDF::Raptor, RDF::JSON and RDF::TriX, each one packaged as a separate gem. This approach keeps the core library fleet on its metaphorical feet and avoids introducing any XML or JSON parser dependencies for RDF.rb itself.

Installing support for all these formats in one go is easy enough:

$ sudo gem install rdf rdf-raptor rdf-json rdf-trix
Successfully installed rdf-0.1.9
Successfully installed rdf-raptor-0.2.1
Successfully installed rdf-json-0.1.0
Successfully installed rdf-trix-0.0.3
4 gems installed

Note that the RDF::Raptor gem requires that the Raptor RDF Parser library and command-line tools be available on the system where it is used. Here follow quick and easy Raptor installation instructions for the Mac and the most common Linux and BSD distributions:

$ sudo port install raptor             # Mac OS X with MacPorts
$ sudo fink install raptor-bin         # Mac OS X with Fink
$ sudo aptitude install raptor-utils   # Ubuntu / Debian
$ sudo yum install raptor              # Fedora / CentOS / RHEL
$ sudo zypper install raptor           # openSUSE
$ sudo emerge raptor                   # Gentoo Linux
$ sudo pkg_add -r raptor               # FreeBSD
$ sudo pkg_add raptor                  # OpenBSD / NetBSD

For more information on installing and using Raptor, see our previous tutorial RDF for Intrepid Unix Hackers: Transmuting N-Triples.

Consuming RDF data

If you're in a hurry and just want to get to consuming RDF data right away, the following is really the only thing you need to know:

require 'rdf'
require 'rdf/ntriples'

graph = RDF::Graph.load("")

In this example, we first load up RDF.rb as well as support for the N-Triples format. After that, we use a convenience method on the RDF::Graph class to fetch and parse RDF data directly from a web URL in one go. (The load method can take either a file name or a URL.)

All RDF.rb parser plugins declare which MIME content types and file extensions they are capable of handling, which is why in the above example RDF.rb knows how to instantiate an N-Triples parser to read the foaf.nt file at the given URL.

In the same way, RDF.rb will auto-detect any other RDF file formats as long as you've loaded up support for them using one or more of the following:

require 'rdf/ntriples' # Support for N-Triples (.nt)
require 'rdf/raptor'   # Support for RDF/XML (.rdf) and Turtle (.ttl)
require 'rdf/json'     # Support for RDF/JSON (.json)
require 'rdf/trix'     # Support for TriX (.xml)

Note that if you need to read RDF files containing multiple named graphs (in a serialization format that supports named graphs, such as TriX), you probably want to be using RDF::Repository instead of RDF::Graph:

repository = RDF::Repository.load("")

The difference between the two is that RDF statements in RDF::Repository instances can contain an optional context (i.e. they can be quads), whereas statements in an RDF::Graph instance always have the same context (i.e. they are triples). In other words, repositories contain one or more graphs, which you can access as follows:

repository.each_graph do |graph|
  puts graph.inspect
Introspecting RDF formats

RDF.rb's parsing and serialization APIs are based on the following three base classes:

  • RDF::Format is used to describe particular RDF serialization formats.
  • RDF::Reader is the base class for RDF parser implementations.
  • RDF::Writer is the base class for RDF serializer implementations.

If you know something about the file format you want to parse or serialize, you can obtain a format specifier class for it in any of the following ways:

require 'rdf/raptor'

RDF::Format.for(:rdfxml)       #=> RDF::Raptor::RDFXML::Format
RDF::Format.for(:file_name      => "input.rdf")
RDF::Format.for(:file_extension => "rdf")
RDF::Format.for(:content_type   => "application/rdf+xml")

Once you have such a format specifier class, you can then obtain the parser/serializer implementations for it as follows:

format = RDF::Format.for("input.nt")   #=> RDF::NTriples::Format
reader = format.reader                 #=> RDF::NTriples::Reader
writer = format.writer                 #=> RDF::NTriples::Writer

There also exist corresponding factory methods on RDF::Reader and RDF::Writer directly:

reader = RDF::Reader.for("input.nt")   #=> RDF::NTriples::Reader
writer = RDF::Writer.for("output.nt")  #=> RDF::NTriples::Writer

The above is what RDF.rb relies on internally to obtain the correct parser implementation when you pass in a URL or file name to RDF::Graph.load -- or indeed to any other method that needs to auto-detect a serialization format and to delegate responsibility for parsing/serialization to the appropriate implementation class.

Parsing RDF data

If you need to be more explicit about parsing RDF data, for instance because the dataset won't fit into memory and you wish to process it statement by statement, you'll need to use RDF::Reader directly.

Parsing RDF statements from a file

RDF parser implementations generally support a streaming-compatible subset of the RDF::Enumerable interface, all of which is based on the #each_statement method. Here's how to read in an RDF file enumerated statement by statement:

require 'rdf/raptor'"foaf.rdf") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect

Using with a Ruby block ensures that the input file is automatically closed after you're done with it.

Parsing RDF statements from a URL

As before, you can generally use an http:// or https:// URL anywhere that you could use a file name:

require 'rdf/json'"") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect

Parsing RDF statements from a string

Sometimes you already have the serialized RDF contents in a memory buffer somewhere, for example as retrieved from a database. In such a case, you'll want to obtain the parser implementation class as shown before, and then instantiate it directly:

require 'rdf/ntriples'

input = open('').read

RDF::Reader.for(:ntriples).new(input) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect

The RDF::Reader constructor uses duck typing and accepts any input (for example, IO or StringIO objects) that responds to the #readline method. If no input argument is given, input data will by default be read from the standard input.

Serializing RDF data

Serializing RDF data works much the same way as parsing: when serializing to a named output file, the correct serializer implementation is auto-detected based on the given file extension.

Serializing RDF statements into an output file

RDF serializer implementations generally support an append-only subset of the RDF::Mutable interface, primarily the #insert method and its alias #<<. Here's how to write out an RDF file statement by statement:

require 'rdf/ntriples'
require 'rdf/raptor'

data = RDF::Graph.load("")"output.rdf") do |writer|
  data.each_statement do |statement|
    writer << statement

Once again, using with a Ruby block ensures that the output file is automatically flushed and closed after you're done writing to it.

Serializing RDF statements into a string result

A common use case is serializing an RDF graph into a string buffer, for example when serving RDF data from a Rails application. RDF::Writer has a convenience buffer class method that builds up output in a StringIO under the covers and then returns a string when all is said and done:

require 'rdf/ntriples'

output = RDF::Writer.for(:ntriples).buffer do |writer|
  subject =
  writer << [subject, RDF.type, RDF::FOAF.Person]
  writer << [subject,, "J. Random Hacker"]
  writer << [subject, RDF::FOAF.mbox, RDF::URI("")]
  writer << [subject, RDF::FOAF.nick, "jhacker"]

Customizing the serializer output

If a particular serializer implementation supports options such as namespace prefix declarations or a base URI, you can pass in those options as keyword arguments to or"output.ttl", :base_uri => "")
RDF::Writer.for(:rdfxml).new($stdout, :base_uri => "")

Support channels

That's all for now, folks. For more information on the APIs touched upon in this tutorial, please refer to the RDF.rb API documentation. If you have any questions, don't hesitate to ask for help on #swig or the mailing list.

Posted at 16:05

Datagraph: How to Build an SQL Storage Adapter for RDF Data with Ruby

RDF.rb is approaching two thousand downloads on RubyGems, and while it has good documentation it could still use some more tutorials. I recently needed to get RDF.rb working with a PostgreSQL storage backend in order to work with RDF data in a Rails 3.0 application hosted on Heroku. I thought I'd keep track of what I did so that I could discuss the notable parts.

In this tutorial we'll be implementing an RDF.rb storage adapter called RDF::DataObjects::Repository, which is a simplified version of what I eventually ended up with. If you want the real thing, check it out on GitHub and read the docs. This tutorial will only cover the SQLite backend and won't concern itself with database indexes, performance tweaks, or any other distractions from the essential RDF.rb interfaces we'll focus on. There's a copy of the simplified code used in the tutorial at the tutorial's project page. And should you be inspired to build something similar of your own, I have set up an RDF.rb storage adapter skeleton at GitHub. Click fork, grep for lines containing a TODO comment, and dive right in.

I'll mention, briefly, that I chose DataObjects as the database abstraction layer, but I don't want to dwell on that -- this post is about RDF. DataObjects is just a way to use common methods to talk to different databases at the SQL level. It's a leaky abstraction, because we'll want to be using some SQL constraints to enforce statement uniqueness but those constraints need to be done differently for different databases. That means we still have to get down to the level of database-specific SQL, distasteful as that may be in this day and age. However, given that I wanted to be able to target PostgreSQL and SQLite both, DataObjects is still helpful.


You need just a few gems for the example repository; the following ought to get you going. Even if you already have these, make sure you have the latest versions, as RDF.rb is updated frequently.

$ sudo gem install rdf rdf-spec rspec do_sqlite3

Testing First

So where do we start? Tests, of course. RDF.rb has factored out its mixin specs to the RDF::Spec gem, which provides the RSpec shared example groups that are also used by RDF.rb for its own tests. Thus, here is the complete spec file for the in-memory reference implementation of RDF::Repository:

require File.join(File.dirname(__FILE__), 'spec_helper')
require 'rdf/spec/repository'

describe RDF::Repository do
  before :each do
    @repository =

  # @see lib/rdf/spec/repository.rb
  it_should_behave_like RDF_Repository
If you haven't seen something like this before, that's an RSpec shared example group, and it's awesome. Anything can use the same specs as RDF.rb itself to verify that it conforms to the interfaces defined by RDF.rb, and that's exactly what we'll be doing in this tutorial. Let's implement that for our repository:

# spec/sqlite3.spec
$:.unshift File.dirname(__FILE__) + "/../lib/"

require 'rdf'
require 'rdf/do'
require 'rdf/spec/repository'
require 'do_sqlite3'

describe RDF::DataObjects::Repository do
  context "The SQLite adapter" do
    before :each do
      @repository = "sqlite3::memory:"

    after :each do
      # DataObjects pools connections, and only allows 8 at once.  We have
      # more than 60 tests.

    it_should_behave_like RDF_Repository

If you're new to RSpec, run the tests with the spec command:

$ spec -cfn spec/sqlite3.spec

These fail miserably right now, of course, since we don't have an implementation. So let's make one.

Initial implementation

RDF.rb's interface for an RDF store is RDF::Repository. That interface is itself composed of a number of mixins: RDF::Enumerable, RDF::Queryable, RDF::Mutable, and RDF::Durable.

RDF::Queryable has a base implementation that works on anything which implements RDF::Enumerable. And RDF::Durable only provides boolean methods for clients to ask if it is durable? or not; the default is that a repository reports that it is indeed durable, so we don't need to do anything there.

The takeaway is that to create an RDF.rb storage adapter, we need to implement RDF::Enumerable and RDF::Mutable, and the rest will fall into place. Indeed, the reference implementation is little more than an array which implements these interfaces.

It turns out we can get away with just three methods to implement those two interfaces: RDF::Enumerable#each, RDF::Mutable#insert_statement, and RDF::Mutable#delete_statement. The default implementations will use these to build up any missing methods. That means we need to implement those first, so that we have a base to pass our tests. Then we can iterate further, replacing methods which iterate over every statement with methods more appropriate for our backend.

Here's a repository which doesn't implement much more than those three methods. We'll use it as a starting point.

# lib/rdf/do.rb

require 'rdf'
require 'rdf/ntriples'
require 'data_objects'
require 'do_sqlite3'
require 'enumerator'

module RDF
  module DataObjects
    class Repository < ::RDF::Repository

      def initialize(options)
        @db =
        exec('CREATE TABLE IF NOT EXISTS quads (
              `subject` varchar(255),
              `predicate` varchar(255),
              `object` varchar(255),
              `context` varchar(255),
              UNIQUE (`subject`, `predicate`, `object`, `context`))')
      end

      # @see RDF::Enumerable#each.
      def each(&block)
        if block_given?
          reader = result('SELECT * FROM quads')
          while reader.next!
       => unserialize(reader.values[0]),
                                :predicate => unserialize(reader.values[1]),
                                :object    => unserialize(reader.values[2]),
                                :context   => unserialize(reader.values[3])))
          end
        else
          ::Enumerable::Enumerator.new(self, :each)
        end
      end

      # @see RDF::Mutable#insert_statement
      def insert_statement(statement)
        sql = 'REPLACE INTO `quads` (subject, predicate, object, context) VALUES (?, ?, ?, ?)'
        exec(sql, serialize(statement.subject), serialize(statement.predicate),
                  serialize(statement.object), serialize(statement.context))
      end

      # @see RDF::Mutable#delete_statement
      def delete_statement(statement)
        sql = 'DELETE FROM `quads` WHERE (subject = ? AND predicate = ? AND object = ? AND context = ?)'
        exec(sql, serialize(statement.subject), serialize(statement.predicate),
                  serialize(statement.object), serialize(statement.context))
      end

      ## These are simple helpers to serialize and unserialize component
      # fields.  We use an explicit empty string for null values for clarity
      # in this example; we cannot use NULL, as SQLite considers NULLs as
      # distinct from each other under the uniqueness constraint we added
      # when we created the table, which would let us insert duplicate
      # statements with a NULL context.
      def serialize(value)
        RDF::NTriples::Writer.serialize(value) || ''
      end

      def unserialize(value)
        value == '' ? nil : RDF::NTriples::Reader.unserialize(value)
      end

      ## These are simple helpers for DataObjects
      def exec(sql, *args)
        @db.create_command(sql).execute_non_query(*args)
      end

      def result(sql, *args)
        @db.create_command(sql).execute_reader(*args)
      end

    end
  end
end

And we have a repository. Poof, done, that's it. You can get a copy of this intermediate repository at the tutorial page and run the specs for yourself. It's not very efficient for SQL yet, but this is all it takes, strictly speaking.

Since they are so important, the three main methods deserve a little more attention:


#each is the only method we have to implement to get information out after we've put it in. RDF::Enumerable will provide us with tons of methods like each_subject, has_subject?, each_predicate, has_predicate?, etc. If you were watching the spec output, you'll have noticed that we also ran tests for RDF::Queryable. The default implementation uses RDF::Enumerable's methods to implement basic querying. This means we can already do things like:

# Note that #load actually comes from insert_statement, see below
repo.query(:subject => RDF::URI.new(''))
#=> RDF::Enumerable of statements with the given URI as subject

Note that if no block is given, #each is defined to return an Enumerable::Enumerator.

RDF::Queryable, which defines #query, is probably the thing we can improve the most on with SQL as opposed to the reference implementation. We'll revisit it below.


insert_statement inserts an RDF::Statement into the repository. It's pretty straightforward. It gives us access to default implementations of things like RDF::Mutable#load, which will load a file by name or import a remote resource:

repo.load("foaf.nt")   # file name is illustrative
repo.count             #=> 10


delete_statement deletes an RDF::Statement. Again, it's straightforward, and it's used to implement things like RDF::Mutable#clear, which empties the repository:

repo.count   #=> 0

Iterate and Improve

Since we already have a nice test suite that we can pass, we can add functionality incrementally. For example, let's implement RDF::Enumerable#count in a fashion that does not require us to enumerate each statement, which is clearly not ideal for a SQL-based system:

# lib/rdf/do.rb

def count
  result = result('SELECT COUNT(*) FROM quads')!
  result.values.first
end

The tests still pass, so we can move on. Wash, rinse, repeat; probably every method in RDF::Enumerable and RDF::Mutable can be implemented more efficiently with SQL.


RDF::Queryable is worth mentioning on its own, because the interface takes a lot of options. Specifically, it can take a Hash, a smashed Array, an RDF::Statement, or a Query object. Fortunately, we can call super to defer to the reference implementation if we get arguments we don't understand, so we can again be iterative here.

We can start by implementing the hash version, which is the most convenient for doing the actual SQL query later. The hash version takes a hash which may have keys for :subject, :predicate, :object, and :context, and returns an RDF::Enumerable containing all statements matching those parameters.

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when Hash
            statements = []
            reader = query_hash(pattern)
            while reader.next!
              statements <<
                      :subject   => unserialize(reader.values[0]),
                      :predicate => unserialize(reader.values[1]),
                      :object    => unserialize(reader.values[2]),
                      :context   => unserialize(reader.values[3]))
            end
            case block_given?
              when true
                statements.each(&block)
              else
                statements.extend(RDF::Enumerable, RDF::Queryable)
            end
          else
            super(pattern)
        end
      end

      def query_hash(hash)
        conditions = []
        params = []
        [:subject, :predicate, :object, :context].each do |resource|
          unless hash[resource].nil?
            conditions << "#{resource.to_s} = ?"
            params     << serialize(hash[resource])
          end
        end
        where = conditions.empty? ? "" : "WHERE "
        where << conditions.join(' AND ')
        result('SELECT * FROM quads ' + where, *params)
      end

Our specs still pass. Note this trick:

statements.extend(RDF::Enumerable, RDF::Queryable)

RDF::Queryable is defined to return something which implements RDF::Enumerable and RDF::Queryable. Since the only thing we need to implement RDF::Enumerable is #each, and Array already implements that, we can simply extend this Array instance with the mixins and return it.

Note also that while we have taken care of the hard part, we're still calling the reference implementation if we don't know how to handle our arguments. Now we can start adding those other query arguments:

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when RDF::Statement
            query(pattern.to_hash)
          when Array
            query(RDF::Statement.new(*pattern))
          when Hash
            # ...the implementation from above...
          else
            super(pattern)
        end
      end

Our specs still pass! Moving on, there's a lot more we can implement. And once we have implemented it in a straightforward way, we can still add things like multiple inserts, paging, and more, all transparent to the user. You can see the full list of methods to implement in the docs, but don't be afraid to dive into the code.

If you do, don't forget that RDF.rb is completely public domain, so if you want to copy-paste to bootstrap your implementation, feel free.

Any questions?

Hopefully this is enough to get you started. Remember, the code is at the tutorial page, and don't forget to check out the storage adapter skeleton. The RDF.rb documentation has a lot of information on the APIs you'll be using.

And last but not least, a good place to ask questions or leave a comment is on the W3C RDF-Ruby mailing list.

Posted at 16:05

Datagraph: RDF for Intrepid Unix Hackers: Transmuting N-Triples

This is the second part in an ongoing RDF for Intrepid Unix Hackers article series. In the previous part, we learned how to process RDF data in the line-oriented, whitespace-separated N-Triples serialization format by pipelining standard Unix tools such as grep and awk.

That was all well and good, but what to do if your RDF data isn't already in N-Triples format? Today we'll see how to install and use the excellent Raptor RDF Parser Library to convert RDF from one serialization format to another.

Installing the Raptor RDF Parser tools

The Raptor toolkit includes a handy command-line utility called rapper, which can be used to convert RDF data between most of the various popular RDF serialization formats.

Installing Raptor is straightforward on most development and deployment platforms; here's how to install Raptor on Mac OS X with MacPorts and on any of the most common Linux and BSD distributions:

$ [sudo] port install raptor             # Mac OS X with MacPorts
$ [sudo] fink install raptor-bin         # Mac OS X with Fink
$ [sudo] aptitude install raptor-utils   # Ubuntu / Debian
$ [sudo] yum install raptor              # Fedora / CentOS / RHEL
$ [sudo] zypper install raptor           # openSUSE
$ [sudo] emerge raptor                   # Gentoo Linux
$ [sudo] pacman -S raptor                # Arch Linux
$ [sudo] pkg_add -r raptor               # FreeBSD
$ [sudo] pkg_add raptor                  # OpenBSD / NetBSD

The subsequent examples all assume that you have successfully installed Raptor and thus have the rapper utility available in your $PATH. To make sure that rapper is indeed available, just ask it to output its version number as follows:

$ rapper --version

We'll be using version 1.4.21 for this tutorial, but any 1.4.x release from 1.4.5 onwards should do fine for present purposes -- so don't worry if your distribution provides a slightly older version.

Should you have any trouble getting rapper set up, you can ask for help on the #swig channel on IRC or on the Raptor mailing list.

Transmuting RDF/XML into N-Triples

RDF/XML is the standard RDF serialization specified by W3C back before the dot-com bust. Despite some newer, more human-friendly formats, a great deal of the RDF data out there in the wild is still made available in this format.

For example, every valid RSS 1.0-compatible feed is, in principle, also a valid RDF/XML document (but note that the same is not true for non-RDF formats like RSS 2.0 or Atom). So, let's grab the RSS feed for this blog and define a Bash shell alias for converting RDF/XML into N-Triples using rapper:

$ alias rdf2nt="rapper -i rdfxml -o ntriples"

$ curl > index.rdf

$ rdf2nt index.rdf > index.nt
rapper: Parsing URI file://index.rdf with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples

Pretty easy, huh? It gets even easier, because rapper actually supports fetching URLs directly. Typically Raptor is built with libcurl support, so it supports the same set of URL schemes as the curl command itself. This means that e.g. any http://, https:// and ftp:// input arguments will work right out of the box, letting us combine the previous two commands as follows:

$ rdf2nt > index.nt
rapper: Parsing URI with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples

Transmuting Turtle into N-Triples

After RDF/XML, Turtle is probably the most widespread RDF format out there. It is a subset of Notation3 and a superset of N-Triples, hitting a sweet spot for both expressiveness and conciseness. It is also much more pleasant to write by hand than XML, so personal FOAF files in particular tend to be authored in Turtle and then converted, e.g. using rapper, into a variety of formats when published on the Linked Data web.

For this next example, let's grab my FOAF file in Turtle format and convert it into N-Triples:

$ alias ttl2nt="rapper -i turtle -o ntriples"

$ ttl2nt > foaf.nt
rapper: Parsing URI with parser turtle
rapper: Serializing with serializer ntriples
rapper: Parsing returned 16 triples

Just as easy as with RDF/XML. And you'll notice that this time around we did the downloading and the conversion in a single step by letting rapper worry about fetching the data directly from the URL in question.

Transmuting N-Triples into other formats

Conversely, you can of course also use rapper to convert any N-Triples input data into other RDF serialization formats such as Turtle, RDF/XML and RDF/JSON. You need only swap the arguments to the -i and -o options and you're good to go.

So, let's define a couple more handy aliases:

$ alias nt2ttl="rapper -i ntriples -o turtle"
$ alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev"
$ alias nt2json="rapper -i ntriples -o json"

Now we can quickly and easily convert any N-Triples data into other RDF formats:

$ nt2ttl  index.nt > index.ttl
$ nt2rdf  index.nt > index.rdf
$ nt2json index.nt > index.json

We can define similar aliases for any input/output permutation provided by rapper. To find out the full list of input and output RDF serialization formats supported by your version of the program, run rapper --help:

$ rapper --help
Main options:
  -i FORMAT, --input FORMAT   Set the input format/parser to one of:
    rdfxml          RDF/XML (default)
    ntriples        N-Triples
    turtle          Turtle Terse RDF Triple Language
    trig            TriG - Turtle with Named Graphs
    rss-tag-soup    RSS Tag Soup
    grddl           Gleaning Resource Descriptions from Dialects of Languages
    guess           Pick the parser to use using content type and URI
    rdfa            RDF/A via librdfa
  -o FORMAT, --output FORMAT  Set the output format/serializer to one of:
    ntriples        N-Triples (default)
    turtle          Turtle
    rdfxml-xmp      RDF/XML (XMP Profile)
    rdfxml-abbrev   RDF/XML (Abbreviated)
    rdfxml          RDF/XML
    rss-1.0         RSS 1.0
    atom            Atom 1.0
    dot             GraphViz DOT format
    json-triples    RDF/JSON Triples
    json            RDF/JSON Resource-Centric

Defining more rapper aliases

Copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with RDF data on the command line:

# rapper aliases from
alias any2nt="rapper -i guess -o ntriples"         # Anything to N-Triples
alias any2ttl="rapper -i guess -o turtle"          # Anything to Turtle
alias any2rdf="rapper -i guess -o rdfxml-abbrev"   # Anything to RDF/XML
alias any2json="rapper -i guess -o json"           # Anything to RDF/JSON
alias nt2ttl="rapper -i ntriples -o turtle"        # N-Triples to Turtle
alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev" # N-Triples to RDF/XML
alias nt2json="rapper -i ntriples -o json"         # N-Triples to RDF/JSON
alias ttl2nt="rapper -i turtle -o ntriples"        # Turtle to N-Triples
alias ttl2rdf="rapper -i turtle -o rdfxml-abbrev"  # Turtle to RDF/XML
alias ttl2json="rapper -i turtle -o json"          # Turtle to RDF/JSON
alias rdf2nt="rapper -i rdfxml -o ntriples"        # RDF/XML to N-Triples
alias rdf2ttl="rapper -i rdfxml -o turtle"         # RDF/XML to Turtle
alias rdf2json="rapper -i rdfxml -o json"          # RDF/XML to RDF/JSON
alias json2nt="rapper -i json -o ntriples"         # RDF/JSON to N-Triples
alias json2ttl="rapper -i json -o turtle"          # RDF/JSON to Turtle
alias json2rdf="rapper -i json -o rdfxml-abbrev"   # RDF/JSON to RDF/XML

Since each of these aliases is a mnemonic patterned after the file extensions for the input and output formats involved, remembering them is as easy as pie. Note also that I've included four any2* aliases that specify guess as the input format, letting rapper try to automatically detect the serialization format of the input stream.

A big thanks goes out to Dave Beckett for having developed Raptor and for giving us the superbly useful N-Triples and Turtle serialization formats. I personally use rapper and these aliases just about every single day, and I hope you find them as useful as I have.

Stay tuned for more upcoming installments of RDF for Intrepid Unix Hackers.

Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.

Posted at 16:05

Datagraph: The Curious Case of RDF Graph Isomorphism

The first time I ever sat down to write some real RDF code, I started, as one always should, with some tests. Most of them went fine, but then I had to write a test that compared the equality of two graphs; I think this was for a parser in Scala, sometime last year, but I've lost track of what exactly I was looking at. In any case, what a can of worms I opened.

It turns out that graph equality in RDF is hard. The combination of blank and non-blank nodes makes it a graph isomorphism problem that I have not found an exact equivalence for in straight-up graph theory. Graphs with named vertices and edges have easy solutions, graphs with unnamed vertices and edges have other, difficult solutions. The difference, depending on the type of graph, can be between O(n) and O(n!) on the number of nodes, so when selecting a possible solution, we'd like to avoid solutions that don't take naming into account.

The isomorphism problem is hard enough that many popular RDF implementations don't even include a solution for it. RDFLib for Python has an approximation with a to-do note, I don't see an appropriate function in Redland's model API, and Sesame has an implementation with the following comment:

// FIXME: this recursive implementation has a high risk of
// triggering a stack overflow

My Java is rusty and I have no intention of polishing it up for this blog post, but I believe Sesame's implementation has factorial complexity.

Now, don't get me wrong. Those are all free projects, and it's a tough problem to do right. We over at Datagraph just made do without an isomorphism function in either Scala or Ruby for several months rather than solve it. So this is not intended to be a cheap shot at those projects -- in fact, we use both Redland and Sesame, and quite happily. And if I'm wrong on the sparse nature of this landscape, someone please correct me.

However, we're developing a new RDF library for Ruby, so when it came time to really solve the problem, we wanted to solve it right. Like most problems in computer science, it's actually old news. Jeremy Carroll solved it and implemented it for Jena either before or after writing a great paper on the topic. What I'm about to describe is more or less his algorithm, and while I slightly adjusted the following to my style, I'm not about to say much that his paper doesn't. So just go read the paper if that's your preference.

The algorithm can be described as a refinement of a naive O(n!) graph isomorphism algorithm, in which each blank node is mapped onto each other blank node, followed by a consistency check. The magic stems from RDF having these nifty global identifiers for most vertices and all edges. If we're smart about it, we can eliminate substantially all of the possible mappings before we try even our first speculative mapping.

I haven't done the math, but it would seem that one could generate a pathological case graph which would be O(n!). On the other hand, since RDF does not allow blank node predicates, and because the algorithm terminates on the first match, I haven't yet figured out how to create such a pathological graph for this algorithm. Graphs tend to be either open enough to have a large number of solutions, one of which will be found quickly, or tight enough to have only one.

The algorithm works as follows:

  1. Compare graph sizes and all statements without blank nodes. If they do not match, fail.
  2. Repeat, for each graph:
    1. Repeat, for each blank node:
      1. Mark the node as grounded or not. A grounded node has only non-blank nodes or grounded nodes in statements in which it appears.
      2. Create a signature for the node. A signature consists of a canonical representation of all of the statements a node appears in.
    2. Terminate unless we marked a node as grounded on this run.
  3. Map grounded blank nodes to the other graph's grounded blank nodes where signatures match.
  4. If all nodes are mapped, we have a bijection, which we can return.
  5. Select ungrounded nodes from each graph with identical signatures. Mark them as grounded, then recurse to step 2.
  6. If no ungrounded nodes have the same signature, or we have tried all matching pairs, a bijection does not exist. Fail.

In something approaching day-to-day English, what's happening here is that after eliminating the simple possibilities, we're generating a hash of all of the elements that appear with a given node in a graph. We then create a node-to-hash mapping. As the hashes will be the same for blank nodes on both input graphs, we use that hash to eliminate possible matchings before we try them. Instead of trying every mapping, we try mappings only on nodes with the same signature. The end result is an algorithm that requires a fairly pathological case to recurse at all, let alone to recurse deeply. Nice.
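In case the hashing idea is easier to see in code, here is a toy sketch of just the signature step. To be clear, this is illustrative Ruby of my own, not the RDF::Isomorphic implementation: blank nodes are stand-in Ruby symbols, URIs and literals are plain strings, and the "canonical representation" is simply the statement with blank node positions masked out.

```ruby
require 'digest'

# Compute a signature for each blank node (Symbol) in a toy graph,
# where a graph is an array of [subject, predicate, object] triples.
def blank_node_signatures(graph)
  stmts = Hash.new { |hash, key| hash[key] = [] }
  graph.each do |s, p, o|
    # Canonical form of the statement, with blank node positions masked.
    canonical = [s, p, o].map { |term| term.is_a?(Symbol) ? '_' : term }.join(' ')
    [s, o].each { |term| stmts[term] << canonical if term.is_a?(Symbol) }
  end
  # Hash the sorted statement list each blank node appears in.
  stmts.each_with_object({}) do |(node, lines), sigs|
    sigs[node] = Digest::SHA1.hexdigest(lines.sort.join("\n"))
  end
end

graph_a = [[:x, "dc:creator", "Arto"], [:x, "dc:title", "RDF.rb"]]
graph_b = [[:y, "dc:creator", "Arto"], [:y, "dc:title", "RDF.rb"]]

sigs_a = blank_node_signatures(graph_a)
sigs_b = blank_node_signatures(graph_b)

# :x and :y appear in structurally identical statements, so their
# signatures match and :x => :y is the only candidate mapping to try.
sigs_a[:x] == sigs_b[:y] #=> true
```

Only nodes whose signatures match across the two graphs ever become candidate pairs, which is exactly how the algorithm prunes the O(n!) search space before any speculative mapping is attempted.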

At any rate, you can see the details, along with some test cases to play with, in RDF::Isomorphic for RDF.rb. This blog post coincides with release 0.1.0, which features a slightly improved signature algorithm, reducing the number of rounds required in some cases. The documentation is also greatly improved -- I spent more time on this problem than I ever intended to, so I hope this can be a readable summary of the algorithm for anyone coming across this in the future. Of course, RDF.rb's structure means almost anything using RDF.rb can be tested for isomorphism now, so hopefully it won't ever occur to you to read the code.

Of course, RDF::Isomorphic is in the public domain, so should you find my implementation worthy, feel free to copy the code as directly as your framework or programming language allows. And please feel free to do that without any obligation to provide attribution or any such silliness.

Posted at 16:05

Datagraph: RDF.rb: A Public-Domain RDF Library for Ruby

We have just released version 0.1.0 of RDF.rb, our RDF library for Ruby. This is the first generally useful release of the library, so I will here introduce the design philosophy and object model of the library as well as provide a tutorial to using its core classes.

RDF.rb has extensive API documentation with many inline code examples, enjoys comprehensive RSpec coverage, and is immediately available via RubyGems:

$ [sudo] gem install rdf

Once installed, to load up the library in your own Ruby projects you need only do:

require 'rdf'

The RDF.rb source code repository is hosted on GitHub. You can obtain a local working copy of the source code as follows:

$ git clone git://

The Design Philosophy

The design philosophy for RDF.rb differs somewhat from previous efforts at RDF libraries for Ruby. Instead of a feature-packed RDF library that attempts to include everything but the kitchen sink, we have rather aimed for something like a lowest common denominator with well-defined, finite requirements.

Thus, RDF.rb is perhaps quickest described in terms of what it isn't and what it hasn't:

  • RDF.rb does not have any dependencies other than the Addressable gem which provides improved URI handling over Ruby's standard library. We also guarantee that RDF.rb will never add any hard dependencies that would compromise its use on popular alternative Ruby implementations such as JRuby.

  • RDF.rb does not provide any resource-centric, ORM-like abstractions to hide the essential statement-oriented nature of the API. Such abstractions may be useful, but they are beyond the scope of RDF.rb itself.

  • RDF.rb does not, and will not, include built-in support for any RDF serialization formats other than N-Triples and N-Quads. However, it does define a DSL and common API for adding support for other formats via third-party plugin gems. There presently exist RDF.rb-compatible RDF::JSON and RDF::TriX gems that add initial RDF/JSON and TriX support, respectively.

  • RDF.rb does not, and will not, include built-in support for any particular persistent RDF storage systems. However, it does define the interfaces that such storage adapters could be written to. Again, add-on gems are the way to go, and there already exists an in-the-works RDF.rb-compatible RDF::Sesame gem that enables using Sesame 2.0 HTTP endpoints with the repository interface defined by RDF.rb.

  • RDF.rb does not, and will not, include any built-in RDF Schema or OWL inference capabilities. There exists an in-the-works RDF.rb-compatible RDFS gem that is intended to provide a naive proof-of-concept implementation of a forward-chaining inference engine for the RDF Schema entailment rules.

  • RDF.rb does not include any built-in SPARQL functionality per se, though it will soon provide support for basic graph pattern (BGP) matching and could thus conceivably be used as the basis for a SPARQL engine written in Ruby.

  • RDF.rb does not come with a license statement, but rather with the stringent hope that you have a nice day. RDF.rb is 100% free and unencumbered public domain software. You can copy, modify, use, and hack on it without any restrictions whatsoever. This means that authors of other RDF libraries for Ruby are perfectly welcome to steal any of our code, with or without attribution. So, if some code snippet or file may be of use to you, feel free to copy it and relicense it under whatever license you have released your own library with -- no need to include any copyright notices from us (since there are none), or even to mention us in the credits (we won't mind).

So that's what RDF.rb is not, but perhaps more important is what we want it to be. There's no reason for simple RDF-based solutions to require enormous complex libraries, storage engines, significant IDE configuration or XML pushups. We're hoping to bring RDF to a world of agile programmers and startups, and to bring existing Linked Data enthusiasts to a platform that encourages rapid innovation and programmer happiness. And maybe everyone can have some fun along the way!

It is also our hope that the aforementioned minimalistic design approach and extremely liberal licensing can help lead to the emergence of a semi-standard Ruby object model for RDF, that is, a common core class hierarchy and API that could be largely interoperable between a number of RDF libraries for Ruby.

With that in mind, let's proceed to have a look at RDF.rb's core object model.

The Object Model

While RDF.rb is built to take full advantage of Ruby's duck typing and mixins, it does also define a class hierarchy of RDF objects. If nothing else, this inheritance tree is useful for case/when matching and also adheres to the principle of least surprise for developers hailing from less dynamic programming languages.

The RDF.rb core class hierarchy looks like the following, and will seem instantly familiar to anyone acquainted with Sesame's object model:

RDF.rb class hierarchy

The five core RDF.rb classes, all of them ultimately inheriting from RDF::Value, are:

  • RDF::Literal represents plain, language-tagged or datatyped literals.
  • RDF::URI represents URI references (URLs and URNs).
  • RDF::Node represents anonymous nodes (also known as blank nodes).
  • RDF::Statement represents RDF statements (also known as triples).
  • RDF::Graph represents anonymous or named graphs containing zero or more statements.

In addition, the two core RDF.rb interfaces (known as mixins in Ruby parlance) are:

  • RDF::Enumerable provides RDF-specific iteration methods for any collection of RDF statements.
  • RDF::Queryable provides RDF-specific query methods for any collection of RDF statements.

Let's take a quick tour of each of these aforementioned core classes and mixins.

Working with RDF::URI

URI references (URLs and URNs) are represented in RDF.rb as instances of the RDF::URI class, which is based on the excellent Addressable::URI library.

Creating a URI reference

The RDF::URI constructor is overloaded to take either a URI string (anything that responds to #to_s, actually) or an options hash of URI components. This means that the following are two equivalent ways of constructing the same URI reference:

uri = RDF::URI.new("http://rubygems.org/")

uri = RDF::URI.new({
  :scheme => 'http',
  :host   => 'rubygems.org',
  :path   => '/',
})

The supported URI components are explained in the API documentation for the RDF::URI constructor.

Getting the string representation of a URI

Turning a URI reference back into a string works as usual in Ruby:

uri.to_s        #=> "http://rubygems.org/"

Navigating URI hierarchies

RDF::URI supports the same set of instance methods as does Addressable::URI. This means that the following methods, and many more, are available:

uri = RDF::URI.new("http://rubygems.org/gems/rdf")

uri.absolute?   #=> true
uri.relative?   #=> false
uri.scheme      #=> "http"
uri.authority   #=> "rubygems.org"
uri.host        #=> "rubygems.org"
uri.port        #=> nil
uri.path        #=> "/gems/rdf"
uri.basename    #=> "rdf"

In addition, RDF::URI supports several convenience methods that can help you navigate URI hierarchies without breaking a sweat:

uri = RDF::URI.new("http://rubygems.org/")
uri = uri.join("gems", "rdf")

uri.to_s        #=> "http://rubygems.org/gems/rdf"

uri.parent      #=> RDF::URI.new("http://rubygems.org/gems/")
uri.root        #=> RDF::URI.new("http://rubygems.org/")

Working with RDF::Node

Blank nodes are represented in RDF.rb as instances of the RDF::Node class.

Creating a blank node with an implicit identifier

The simplest way to create a new blank node is as follows:

bnode = RDF::Node.new

This will create a blank node with an identifier based on the internal Ruby object ID of the RDF::Node instance. This nicely serves as a unique identifier for the duration of the Ruby process:

bnode.id   #=> "2158816220"
bnode.to_s #=> "_:2158816220"

Creating a blank node with a UUID identifier

You can also provide an explicit blank node identifier to the RDF::Node constructor. This is particularly useful when serializing or parsing RDF data, where you generally need to maintain a mapping of blank node identifiers to blank node instances.

The constructor argument can be any string or any object that responds to #to_s. For example, say that you wanted to create a blank node instance having a globally-unique UUID as its identifier. Here's how you would do this with the help of the UUID gem:

require 'uuid'

bnode = RDF::Node.new(UUID.generate)

The above is a fairly common use case, so RDF.rb actually provides a convenience class method for creating UUID-based blank nodes. The following will use either the UUID or the UUIDTools gem, whichever happens to be available:

bnode = RDF::Node.uuid
bnode.to_s #=> "_:504c0a30-0d11-012d-3f50-001b63cac539"

Working with RDF::Literal

All three types of RDF literals -- plain, language-tagged and datatyped -- are represented in RDF.rb as instances of the RDF::Literal class.

Creating a plain literal

Create plain literals by passing in a string to the RDF::Literal constructor:

hello = RDF::Literal.new("Hello, world!")

hello.plain?         #=> true
hello.has_language?  #=> false
hello.has_datatype?  #=> false

Note, however, that in most RDF.rb interfaces you will not in fact need to wrap language-agnostic, non-datatyped strings into RDF::Literal instances; this is done automatically when needed, allowing you the convenience of, say, passing in a plain old Ruby string as the object value when constructing an RDF::Statement instance.

Creating a language-tagged literal

To create language-tagged literals, pass in an additional ISO language code to the :language option of the RDF::Literal constructor:

hello = RDF::Literal.new("Hello!", :language => :en)

hello.has_language?  #=> true
hello.language       #=> :en

The language code can be given as either a symbol, a string, or indeed anything that responds to the #to_s method:

RDF::Literal.new("Hello!", :language => :en)
RDF::Literal.new("Wazup?", :language => :"en-US")
RDF::Literal.new("Hej!",   :language => "sv")
RDF::Literal.new("¡Hola!", :language => ["es"])

Creating an explicitly datatyped literal

Datatyped literals are created similarly, by passing in a datatype URI to the :datatype option of the RDF::Literal constructor:

date = RDF::Literal.new("2010-12-31", :datatype => RDF::XSD.date)

date.has_datatype?   #=> true
date.datatype        #=> RDF::XSD.date

The datatype URI can be given as any object that responds to either the #to_uri method or the #to_s method. In the example above, we've called the #date method on the RDF::XSD vocabulary class which represents the XML Schema datatypes vocabulary; this returns an RDF::URI instance representing the URI for the xsd:date datatype.

Creating implicitly datatyped literals

You'll be glad to hear that you don't necessarily have to always explicitly specify a datatype URI when creating a datatyped literal. RDF.rb supports a degree of automatic mapping between Ruby classes and XML Schema datatypes.

In most common cases, you can just pass in the Ruby value to the RDF::Literal constructor as-is, with the correct XML Schema datatype being automatically set by RDF.rb:

today = RDF::Literal.new(Date.today)

today.has_datatype?  #=> true
today.datatype       #=> RDF::XSD.date

The following implicit datatype mappings are presently supported by RDF.rb:

RDF::Literal.new(false)        #=> RDF::XSD.boolean
RDF::Literal.new(true)         #=> RDF::XSD.boolean
RDF::Literal.new(123)          #=> RDF::XSD.integer
RDF::Literal.new(9_223_372_036_854_775_807) #=> RDF::XSD.integer
RDF::Literal.new(3.1415)       #=> RDF::XSD.double
RDF::Literal.new(Date.today)   #=> RDF::XSD.date
RDF::Literal.new(DateTime.now) #=> RDF::XSD.dateTime
RDF::Literal.new(Time.now)     #=> RDF::XSD.dateTime

Working with RDF::Statement

RDF statements are represented in RDF.rb as instances of the RDF::Statement class. Statements can be triples -- constituted of a subject, a predicate, and an object -- or they can be quads that also have an additional context indicating the named graph that they are part of.

Creating an RDF statement

Creating a triple works exactly as you'd expect:

subject   = RDF::URI.new("http://rubygems.org/gems/rdf")
predicate = RDF::DC.creator
object    = RDF::URI.new("http://ar.to/#self")

statement = RDF::Statement.new(subject, predicate, object)

The subject should be an RDF::Resource, the predicate an RDF::URI, and the object an RDF::Value. These constraints are not enforced, however, allowing you to use any duck-typed equivalents as components of statements.

Creating an RDF statement with a context

Pass in a URI reference in an extra :context option to the RDF::Statement constructor to create a quad:

context   ="")
subject   ="")
predicate = RDF::DC.creator
object    =""), predicate, object, :context => context)

Creating an RDF statement from a hash

It's also worth mentioning that the RDF::Statement constructor is overloaded to enable instantiating statements from an options hash, as follows:

statement = RDF::Statement.new({
  :subject   => RDF::URI.new("http://rubygems.org/gems/rdf"),
  :predicate => RDF::DC.creator,
  :object    => RDF::URI.new("http://ar.to/#self"),
})

The :context option can also be given, as before. Use whichever method of instantiating statements you happen to prefer.

Statement objects also support a #to_hash method that provides the inverse operation:

statement.to_hash   #=> { :subject   => ...,
                    #     :predicate => ..., 
                    #     :object    => ... }

Accessing RDF statement components

Access the RDF statement components -- the subject, the predicate, and the object -- as follows:

statement.subject   #=> an RDF::Resource
statement.predicate #=> an RDF::URI
statement.object    #=> an RDF::Value

Since statements can also have an optional context, the following will return either nil or else an RDF::Resource instance:

statement.context   #=> an RDF::Resource or nil

Working directly with triples and quads

Because RDF.rb is duck-typed, you can often directly use a three- or four-item Ruby array in place of an RDF::Statement instance. This can sometimes feel less cumbersome than instantiating a statement object, and it may also save some memory if you need to deal with a very large number of in-memory RDF statements. We'll see some examples of doing this later on.

Converting from statement objects to Ruby arrays is trivial:

statement.to_triple #=> [subject, predicate, object]
statement.to_quad   #=> [subject, predicate, object, context]

Likewise, instantiating a statement object from a triple represented as a Ruby array is straightforward enough:

statement = RDF::Statement.new(*[subject, predicate, object])

Working with RDF::Graph

RDF graphs are represented in RDF.rb as instances of the RDF::Graph class. Note that most of the functionality in this class actually comes from the RDF::Enumerable and RDF::Queryable mixins, which we'll examine further below.

Creating an anonymous graph

Creating a new unnamed graph works just as you'd expect:

graph = RDF::Graph.new

graph.named? #=> false
graph.to_uri #=> nil

Creating a named graph

To create a named graph, just pass in a blank node or a URI reference to the RDF::Graph constructor:

graph = RDF::Graph.new("http://rubygems.org/")

graph.named? #=> true
graph.to_uri #=> RDF::URI.new("http://rubygems.org/")

Adding statements to a graph

To insert RDF statements into a graph, use the #<< operator or the #insert method:

graph << statement


Let's add some RDF statements to an unnamed graph, taking advantage of the aforementioned duck-typing convenience that lets us represent triples directly using Ruby arrays, and plain literals directly using Ruby strings:

rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")
arto  = RDF::URI.new("http://ar.to/#self")

graph = RDF::Graph.new do
  self << [rdfrb, RDF::DC.title,   "RDF.rb"]
  self << [rdfrb, RDF::DC.creator, arto]
end

If you prefer, you can also be more explicit and use the equivalent #insert method form instead of the #<< operator:

graph.insert([rdfrb, RDF::DC.title,   "RDF.rb"])
graph.insert([rdfrb, RDF::DC.creator, arto])

Deleting statements from a graph

To delete RDF statements from a graph, use the #delete method:

graph.delete(statement)
Deleting the statements we inserted in the previous example works like so:

graph.delete([rdfrb, RDF::DC.title,   "RDF.rb"])
graph.delete([rdfrb, RDF::DC.creator, arto])

Alternatively, we can use wildcard matching (where nil stands for a "match anything" wildcard) to simply delete every statement in the graph that has a particular subject:

graph.delete([rdfrb, nil, nil])

For even more convenience, since non-existent array subscripts in Ruby return nil, the following abbreviation is exactly equivalent to the previous example:

graph.delete([rdfrb])
Working with RDF::Enumerable

RDF::Enumerable is a mixin module that provides RDF-specific iteration methods for any object capable of yielding RDF statements.

In what follows we will consider some of the key RDF::Enumerable methods specifically as used in instances of the RDF::Graph class.

Checking whether any statements exist

Just as with most of Ruby's built-in collection classes, graphs support an #empty? predicate method that returns a boolean:

graph.empty?      #=> true or false

Checking how many statements exist

You can use #count -- or, if you prefer, the equivalent alias #size -- to return the number of RDF statements in a graph:

graph.count  #=> an Integer
Checking whether a specific statement exists

If you need to check whether a specific RDF statement is included in the graph, use the following method:

graph.has_statement?(RDF::Statement.new(subject, predicate, object))

There also exists an otherwise equivalent convenience method that takes a Ruby array as its argument instead of an RDF::Statement instance:

graph.has_triple?([subject, predicate, object])

Checking whether a specific value exists

If you need to check whether a particular value is included in the graph as a component of one or more statements, use one of the following three methods:

graph.has_subject?(RDF::URI.new("http://rubygems.org/gems/rdf"))
graph.has_predicate?(RDF::DC.title)
graph.has_object?(RDF::Literal.new("Hello!", :language => :en))

Enumerating all statements

The following method yields every statement in the graph as an RDF::Statement instance:

graph.each_statement do |statement|
  puts statement.inspect
end

You can also use #each as a shorter alias for #each_statement, though we ourselves consider using the more explicit form to be stylistically preferred.

If you don't require RDF::Statement instances and simply want to get directly at the triple components of statements, do the following instead:

graph.each_triple do |subject, predicate, object|
  puts [subject, predicate, object].inspect
end

Similarly, you can enumerate the graph using quads as well:

graph.each_quad do |subject, predicate, object, context|
  puts [subject, predicate, object, context].inspect
end

Note that for unnamed graphs, the yielded context will always be nil; for named graphs, it will always be the same RDF::Resource instance as would be returned by calling graph.context.

Obtaining all statements

If instead of enumerating statements one-by-one you wish to obtain all the data in a graph in one go as an array of statements, the following method does just that:

graph.statements  #=> [RDF::Statement(subject1, predicate1, object1), ...]

Naturally, there also exist the usual alternative methods that give you the statements in the form of raw triples or quads represented as Ruby arrays:

graph.triples     #=> [[subject1, predicate1, object1], ...]
graph.quads       #=> [[subject1, predicate1, object1, context1], ...]

Enumerating all values

A particularly useful set of methods is the following, which yield unique statement components from a graph:

graph.each_subject   { |value| puts value.inspect }
graph.each_predicate { |value| puts value.inspect }
graph.each_object    { |value| puts value.inspect }

For instance, #each_subject yields every unique statement subject in the graph, never yielding the same subject twice.

Obtaining all unique values

Again, instead of yielding unique values one-by-one, you can obtain them in one go with the following methods:

graph.subjects    #=> [subject1, subject2, subject3, ...]
graph.predicates  #=> [predicate1, predicate2, predicate3, ...]
graph.objects     #=> [object1, object2, object3, ...]

Here, #subjects returns an array containing all unique statement subjects in the graph, and #predicates and #objects do the same for statement predicates and objects respectively.

Working with RDF::Queryable

RDF::Queryable is a mixin that provides RDF-specific query methods for any object capable of yielding RDF statements. At present this means simple subject-predicate-object queries, but extended basic graph pattern matching will be available in a future release of RDF.rb.

In what follows we will consider RDF::Queryable methods specifically as used in instances of the RDF::Graph class.

Querying for specific statements

The simplest type of query is one that specifies all statement components, as in the following:

statements = graph.query([subject, predicate, object])

The result set here would contain either no statements if the query didn't match (that is, the given statement didn't exist in the graph), or else at most the single matched statement.

The #query method can also take a block, in which case matching statements are yielded to the block one after another instead of returned as a result set:

graph.query([subject, predicate, object]) do |statement|
  puts statement.inspect
end

Querying with wildcard components

You can replace any of the query components with nil to perform a wildcard match. For example, in the following we query for all dc:title values for a given subject resource:

rdfrb ="")

graph.query([rdfrb, RDF::DC.title, nil]) do |statement|
  puts "dc:title = #{statement.object.inspect}"
end

We can also query for any and all statements related to a given subject resource:

graph.query([rdfrb, nil, nil]) do |statement|
  puts "#{statement.predicate.inspect} = #{statement.object.inspect}"
end

The result sets returned by #query also implement RDF::Enumerable and RDF::Queryable, so it is possible to chain several queries to incrementally refine a result set:

graph.query([rdfrb]).query([nil, RDF::DC.title])

Likewise, it is of course possible to chain RDF::Queryable operations with methods from RDF::Enumerable:

graph.query([nil, RDF::DC.title]).each_subject do |subject|
  puts subject.inspect
end

The Mailing List

If you have feedback regarding RDF.rb, please contact us either privately or via the mailing list. Bug reports should go to the issue queue on GitHub.

Coming Up

In upcoming RDF.rb tutorials we will see how to work with existing RDF vocabularies, how to serialize and parse RDF data using RDF.rb, how to write an RDF.rb plugin, how to use RDF.rb with Ruby on Rails 3.0, and much more. Stay tuned!

Posted at 16:05

Datagraph: RDF for Intrepid Unix Hackers: Grepping N-Triples

The N-Triples format is the lowest common denominator for RDF serialization formats, and turns out to be a very good fit to the Unix paradigm of line-oriented, whitespace-separated data processing. In this tutorial we'll see how to process N-Triples data by pipelining standard Unix tools such as grep, wc, cut, awk, sort, uniq, head and tail.

To follow along, you will need access to a Unix box (Mac OS X, Linux, or BSD) with a Bash-compatible shell. We'll be using curl to fetch data over HTTP, but you can substitute wget or fetch if necessary. A couple of the examples require a modern AWK version such as gawk or mawk; on Linux distributions you should be okay by default, but on Mac OS X you will need to install gawk or mawk from MacPorts as follows:

$ sudo port install mawk
$ alias awk=mawk

Grokking N-Triples

Each N-Triples line encodes one RDF statement, also known as a triple. Each line consists of the subject (a URI or a blank node identifier), one or more characters of whitespace, the predicate (a URI), some more whitespace, and finally the object (a URI, blank node identifier, or literal) followed by a dot and a newline. For example, the following N-Triples statement asserts the title of my website:

<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .

This is an almost perfect format for Unix tooling; the only possible further improvement would have been to define the statement component separator to be a tab character, which would have simplified obtaining the object component of statements -- as we'll see in a bit.
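To make that concrete, here's a quick sketch (in Ruby, with a made-up example line rather than the tutorial's dataset) of how the whitespace-separated layout splits apart. Only the object component can contain internal whitespace, which is exactly why a tab separator would have been nicer:

```ruby
line = '<http://example.org/> <http://purl.org/dc/terms/title> "Hello, world!" .'

# Subject and predicate never contain whitespace, so a bounded split is safe;
# the object (which may contain spaces) is whatever remains before the final dot.
subject, predicate, rest = line.split(/\s+/, 3)
object = rest.sub(/\s*\.\s*\z/, '')

subject   #=> "<http://example.org/>"
predicate #=> "<http://purl.org/dc/terms/title>"
object    #=> "\"Hello, world!\""
```

The bounded three-way split mirrors what `cut -d' ' -f 1` and `awk '{ print $1 }'` do further down, and shows why grabbing the object with field-based tools takes a little extra care.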

Getting N-Triples

Many RDF data dumps are made available as compressed N-Triples files. DBpedia, the RDFization of Wikipedia, is a prominent example. For purposes of this tutorial I've prepared an N-Triples dataset containing all Drupal-related RDF statements from DBpedia 3.4, which is the latest release at the moment and reflects Wikipedia as of late September 2009.

I prepared the sample dataset by downloading all English-language core datasets (20 N-Triples files totaling 2.1 GB when compressed) and crunching through them as follows:

$ bzgrep Drupal *.nt.bz2 > drupal.nt

To save you from gigabyte-sized downloads and an hour of data crunching, you can just grab a copy of the resulting drupal.nt file as follows:

$ curl > drupal.nt

The sample dataset totals 294 RDF statements and weighs in at 70 KB.

Counting N-Triples

The first thing we want to do is count the number of triples in an N-Triples dataset. This is straightforward to do, since each triple is represented by one line in an N-Triples input file and there are a number of Unix tools that can be used to count input lines. For example, we could use either of the following commands:

$ cat drupal.nt | wc -l
294

$ cat drupal.nt | awk 'END { print NR }'
294

Since we'll be using a lot more of AWK throughout this tutorial, let's stick with awk and define a handy shell alias for this operation:

$ alias rdf-count="awk 'END { print NR }'"

$ cat drupal.nt | rdf-count
294

Note that, for reasons of comprehensibility, the previous examples as well as most of the subsequent ones assume that we're dealing with "clean" N-Triples datasets that don't contain comment lines or other miscellanea. The DBpedia data dumps fit this bill very well. However, further onwards I will give "fortified" versions of these commands that can correctly deal with arbitrary N-Triples files.
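As an illustration of what such fortification has to handle (a Ruby sketch of my own with made-up input, not the AWK one-liners themselves), a robust statement count must skip blank lines and comment lines:

```ruby
ntriples = <<~NT
  # This is a comment line.
  <http://example.org/a> <http://example.org/b> "c" .

  <http://example.org/d> <http://example.org/e> "f" .
NT

# Count only lines that carry a statement: non-blank and not a '#' comment.
count = ntriples.each_line.count do |line|
  line =~ /\S/ && !line.lstrip.start_with?('#')
end

count #=> 2
```

The same non-blank, non-comment predicate is what a fortified AWK pattern would express before incrementing its counter.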

Measuring N-Triples

We at Datagraph frequently use the N-Triples representation as the canonical lexical form of an RDF statement, and work with content-addressable storage systems for RDF data that in fact store statements using their N-Triples representation. In such cases, it is often useful to know some statistical characteristics of the data to be loaded in a mass import, so as to e.g. be able to fine-tune the underlying storage for optimum space efficiency.

A first useful statistic is to know the typical size of a datum, i.e. the line length of an N-Triples statement, in the dataset we're dealing with. AWK yields us N-Triples line lengths without much trouble:

$ alias rdf-lengths="awk '{ print length }'"

$ cat drupal.nt | rdf-lengths | head -n5

Note that N-Triples is an ASCII format, so the numbers above reflect both the byte sizes of input lines as well as the ASCII character count of input lines. All non-ASCII characters are escaped in N-Triples, and for present purposes we'll be talking in terms of ASCII characters only.

The above list of line lengths in and of itself won't do us much good; we want to obtain aggregate information for the whole dataset at hand, not for individual statements. It's too bad that Unix doesn't provide commands for simple numeric aggregate operations such as the minimum, maximum and average of a list of numbers, so let's see if we can remedy that.

One way to define such operations would be to pipe the above output to an RPN shell calculator such as dc and have it perform the needed calculations. The complexity of this would go somewhat beyond mere shell aliases, however. Thankfully, it turns out that AWK is well-suited to writing these aggregate operations as well. Here's how we can extend our earlier pipeline to boil the list of line lengths down to an average:

$ alias avg="awk '{ s += \$1 } END { print s / NR }'"

$ cat drupal.nt | rdf-lengths | avg

The above, incidentally, is an example of a simple map/reduce operation: a sequence of input values is mapped through a function, in this case length(line), to give a sequence of output values (the line lengths) that is then reduced to a single aggregate value (the average line length). Though I won't go further into this just now, it is worth mentioning in passing that N-Triples is an ideal format for massively parallel processing of RDF data using Hadoop and the like.
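The same map/reduce shape can be sketched in a few lines of Ruby (operating on an in-memory array of hypothetical N-Triples lines rather than on drupal.nt):

```ruby
lines = [
  '<http://example.org/a> <http://example.org/b> "c" .',
  '<http://example.org/d> <http://example.org/e> "fgh" .',
]

lengths = lines.map { |line| line.length }  # the "map" step
average = lengths.sum / lengths.size.to_f   # the "reduce" step

lengths  #=> [51, 53]
average  #=> 52.0
```

Because the map step is independent per line, the input can be partitioned arbitrarily across workers, which is what makes line-oriented N-Triples so amenable to Hadoop-style processing.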

Now, we can still optimize and simplify the above some by combining both steps of the operation into a single alias that outputs an average line length for the given input stream, like so:

$ alias rdf-length-avg="awk '\
  { s += length } \
  END { print s / NR }'"

Likewise, it doesn't take much more to define an alias for obtaining the maximum line length in the input dataset:

$ alias rdf-length-max="awk '\
  BEGIN { n = 0 } \
  { if (length > n) n = length } \
  END { print n }'"

Getting the minimum line length is only slightly more complicated. Instead of comparing against a zero baseline like above, we need to instead define a "roof" value to compare against. In the following, I've picked an arbitrarily large number, making the (at present) reasonable assumption that no N-Triples line will be longer than a billion ASCII characters, which would amount to somewhat less than a binary gigabyte:

$ alias rdf-length-min="awk '\
  BEGIN { n = 1e9 } \
  { if (length > 0 && length < n) n = length } \
  END { print (n < 1e9 ? n : 0) }'"
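
A quick sanity check of the min logic on made-up input (inlining the AWK program, since aliases aren't expanded inside scripts):

```shell
# The shortest non-empty line here is 'ab', so the minimum should be 2.
printf 'ab\nabcdef\nabcd\n' \
  | awk 'BEGIN { n = 1e9 } { if (length > 0 && length < n) n = length } END { print (n < 1e9 ? n : 0) }'
# prints 2
```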

Now that we have some aggregate operations to crunch N-Triples data with, let's analyze our sample DBpedia dataset using the three aliases defined above:

$ cat drupal.nt | rdf-length-avg

$ cat drupal.nt | rdf-length-max

$ cat drupal.nt | rdf-length-min

We can see from the output that N-Triples line lengths in this dataset vary considerably: from less than a hundred bytes to several kilobytes, averaging around two hundred bytes. This variability is to be expected for DBpedia data, given that many RDF statements in such a dataset contain a long textual description as their object literal whereas others contain merely a simple integer literal.

Many other statistics, such as the median line length or the standard deviation of the line lengths, could conceivably be obtained in a manner similar to what I've shown above. I'll leave those as exercises for the reader, however, as further stats regarding the raw N-Triples lines are unlikely to be all that generally interesting.

Parsing N-Triples

It's time to move on to getting at the three components -- the subject, the predicate and the object -- that constitute RDF statements.

We have two straightforward choices for obtaining the subject and predicate: the cut command and good old awk. I'll show both aliases:

$ alias rdf-subjects="cut -d' ' -f 1 | uniq"
$ alias rdf-subjects="awk '{ print \$1 }' | uniq"

While cut might shave off some microseconds compared to awk here, AWK is still the better choice for the general case, as it allows us to expand the alias definition to ignore empty lines and comments, as we'll see later. On our sample data, though, either form works fine.

You may have noticed and wondered about the pipelined uniq after cut and awk. This is simply a low-cost, low-grade deduplication filter: it drops consecutive duplicate values. For an ordered dataset (where the input N-Triples lines are already sorted in lexical order), it will get rid of all duplicate subjects. In an unordered dataset, it won't do much good, but it won't do much harm either (what's a microsecond here or there?).
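
The difference is easy to demonstrate on a toy input:

```shell
printf 'a\na\nb\na\n' | uniq          # prints a, b, a: only adjacent duplicates dropped
printf 'a\na\nb\na\n' | sort | uniq   # prints a, b: full deduplication after sorting
```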

To fully deduplicate the list of subjects for a (potentially) unordered dataset, apply another uniq filter after a sort operation as follows:

$ cat drupal.nt | rdf-subjects | sort | uniq | head -n5

I've not made sort an integral part of the rdf-subjects alias because sorting the subjects is an expensive operation with resource usage proportional to the number of statements processed; when processing a billion-triple N-Triples stream, it is usually simply better to not care too much about ordering.

Getting the predicates from N-Triples data works exactly the same way as getting the subjects:

$ alias rdf-predicates="cut -d' ' -f 2 | uniq"
$ alias rdf-predicates="awk '{ print \$2 }' | uniq"

Again, you can apply sort in conjunction with uniq to get the list of unique predicate URIs in the dataset:

$ cat drupal.nt | rdf-predicates | sort | uniq | tail -n5

Obtaining the object component of N-Triples statements, however, is somewhat more complicated than getting the subject or the predicate. This is due to the fact that object literals can contain whitespace that will throw off the whitespace-separated field handling of cut and awk that we've relied on so far. Not to worry, AWK can still get us the results we want, but I won't attempt to explain how the following alias works; just be happy that it does:

$ alias rdf-objects="awk '{ ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"

The output of rdf-objects is the N-Triples encoded object URI, blank node identifier or object literal. URIs are output in the same format as subjects and predicates, with enclosing angle brackets; language-tagged literals include the language tag, and datatyped literals include the datatype URI:
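
To see what that loop is doing, here it is run over a single hand-written triple (the example.org URIs are made up): fields 3 through NF-1 are glued back together with single spaces, reconstituting the whitespace-containing literal while dropping the terminating dot. Note that, because AWK's field splitting collapses whitespace, runs of spaces inside a literal would be normalized to single spaces.

```shell
echo '<http://example.org/s> <http://example.org/p> "hello world"@en .' \
  | awk '{ ORS=""; for (i=3;i<=NF-1;i++) print $i " "; print "\n" }'
# prints "hello world"@en (with a trailing space, harmless for our purposes)
```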

$ cat drupal.nt | rdf-objects | sort | uniq | head -n5

Another very useful operation to have is getting the list of object literal datatypes used in an N-Triples dataset. This is also a somewhat involved alias definition, and requires a modern AWK version such as gawk or mawk:

$ alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"

$ cat drupal.nt | rdf-datatypes | sort | uniq

As we can see, most object literals in this dataset are untyped strings, but there are some decimal and integer values as well as year + month literals.
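
For the curious, here is how the datatype alias dissects a single typed literal (made-up subject and predicate; -F'^' is just the unescaped form of the alias's -F'\x5E'): splitting the line on the caret leaves the datatype URI plus the trailing " ." in the third field, which substr then trims off.

```shell
echo '<http://example.org/s> <http://example.org/p> "42"^^<http://www.w3.org/2001/XMLSchema#integer> .' \
  | awk -F'^' '/"\^\^</ { print substr($3, 1, length($3)-2) }'
# prints <http://www.w3.org/2001/XMLSchema#integer>
```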

Aliasing N-Triples

As promised, here follow more robust versions of all the aforementioned Bash aliases. Just copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with N-Triples data on the command line.

# N-Triples aliases from
alias rdf-count="awk '/^\s*[^#]/ { n += 1 } END { print n }'"
alias rdf-lengths="awk '/^\s*[^#]/ { print length }'"
alias rdf-length-avg="awk '/^\s*[^#]/ { n += 1; s += length } END { print s/n }'"
alias rdf-length-max="awk 'BEGIN { n=0 } /^\s*[^#]/ { if (length>n) n=length } END { print n }'"
alias rdf-length-min="awk 'BEGIN { n=1e9 } /^\s*[^#]/ { if (length>0 && length<n) n=length } END { print (n<1e9 ? n : 0) }'"
alias rdf-subjects="awk '/^\s*[^#]/ { print \$1 }' | uniq"
alias rdf-predicates="awk '/^\s*[^#]/ { print \$2 }' | uniq"
alias rdf-objects="awk '/^\s*[^#]/ { ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"
alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 2, length(\$3)-4) }' | uniq"

I should also note that though I've spoken throughout only in terms of N-Triples, most of the above aliases will also work fine for input in N-Quads format.
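
One caveat worth checking: the /^\s*[^#]/ guard in the aliases above relies on \s, a GNU awk extension that not every awk supports; with a strictly POSIX awk you'd write [[:space:]] instead. Either way, a quick test confirms that comment and blank lines are skipped:

```shell
# Two statements, one comment, one blank line: only the statements are counted.
printf '# a comment\n\n<s> <p> <o> .\n<s2> <p2> "x" .\n' \
  | awk '/^[[:space:]]*[^#]/ { n += 1 } END { print n }'
# prints 2
```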

In the next installments of RDF for Intrepid Unix Hackers, we'll attempt something a little more ambitious: building an rdf-query alias to perform subject-predicate-object queries on N-Triples input. We'll also see what to do if your RDF data isn't already in N-Triples format, learning how to install and use the Raptor RDF Parser Library to convert RDF data between the various popular RDF serialization formats. Stay tuned.

Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.

Posted at 16:05

Datagraph: Hacking on RDF in Ruby

RDF.rb is easily the most fun RDF library I've used. It uses Ruby's dynamic system of mixins to create a library that's very easy to use.

If you're new to Ruby, you might know mixins from other languages--Scala traits, for example, are almost exactly equivalent in function. They're distinctly more powerful than Java interfaces or abstract classes: a mixin is basically an interface and an abstract class rolled into one. Rather than extending an abstract class, one includes a mixin into one's own class. A mixin will usually require that an including class implement a particular method. Ruby's own Enumerable module, for example, requires that including classes implement #each. For that tiny bit of trouble, you get a ton of methods (listed here), including iterators, mapping, partitions, conversion to arrays, and more. (If you're new to Ruby, it might also help you to know that #method_name means 'an instance method named method_name'.)

RDF.rb uses the principle extensively. RDF::Repository is, in fact, little more than an in-memory reference implementation for 4 traits: RDF::Enumerable, RDF::Mutable, RDF::Queryable, and RDF::Durable. RDF::Sesame::Repository has the exact same interface as the in-memory representation, but is based entirely on a Sesame server. In order to work as a repository, RDF::Sesame::Repository only had to extend the reference implementation and implement #each, #insert_statement, and #delete_statement. Nice! Of course, implementing those took some doing, but it's still exceedingly easy.

RDF::Enumerable is the key here. By implementing an #each that yields RDF::Statement objects, one gains a ton of functionality: #each_subject, #each_predicate, #each_object, #each_context, #has_subject?, #has_triple?, and more. It's a key abstraction, and it does most of the heavy lifting.

But the module system goes the other way--not only is it easy to implement new RDF models, existing ones are easily extended. I recently wrote RDF::Isomorphic, which extends RDF::Enumerable with #bijection_to and #isomorphic_with? methods. The module-based system provided by RDF.rb means that my isomorphic methods are now available on RDF::Sesame::Repositories, and indeed anything which includes RDF::Enumerable. This is everything from repositories to graphs to query results! In fact, query results themselves implement RDF::Enumerable, and thus implement RDF::Queryable and can be checked for isomorphism, or whatever else you want to add. This is functionality that Sesame does not have natively, and which I wrote for a completely different purpose (testing parsers). Every RDF::Enumerable gets it for free because I wanted to compare 2 textual formats. Neat!

For example, here's what it takes to extend any RDF collection, from RDF::Isomorphic:

require 'rdf'

module RDF
  # Isomorphism for RDF::Enumerables
  module Isomorphic

    def isomorphic_with?(other)
      # code that uses #each, or any other method from RDF::Enumerable, goes here
    end

    def bijection_to(other)
      # code that uses #each, or any other method from RDF::Enumerable, goes here
    end
  end

  # re-open RDF::Enumerable and add the isomorphic methods
  module Enumerable
    include RDF::Isomorphic
  end
end

Of course, this just can't be done without monkey patching. Mixins and monkey patching together make for a powerful toolkit. To my knowledge, this is the first RDF library that takes advantage of these features.

It's possible to provide powerful features to a wide range of implementations with this. RDF.rb does not yet have an inference layer, but any such layer would instantly work for any store which implements RDF::Enumerable. Want to prototype some custom business logic that operates over existing RDF data? Copy it into a local repository and hack away. No need for the production RDF store to be the same at all, but you can still apply the same code.

As a counter-example, compare this to the Java RDF ecosystem. There are some excellent implementations (RDF::Isomorphic is heavily in debt to Jena), but they're all incompatible. Jena's check for isomorphism is not really translatable to Sesame, or anything else. RDF.rb, in addition to providing a reference implementation, acts as an abstraction layer for underlying RDF implementations. The difference is night and day--with RDF.rb, you only need to implement a feature once, at the API layer, to have it apply to any implementation. This is not a knock at the very talented people behind those Java implementations; making this happen is a lot of work in a language without monkey patching, and RDF.rb is only as good as it is because of the significant influence those projects have had on Arto's design.

The end result of the mixin-based approach is a system that is incredibly easy to extend, and just downright fun. It would be a fairly simple task to extend a Ruby class completely unrelated to RDF with an #each method that yields statements, allowing it to work in RDF::Enumerable. Voila, your existing classes now have an RDF representation. Along the same lines, if one is bothered by the statement-oriented nature of RDF.rb, building a system which took a resource-oriented view would not require one to 'break away' from the RDF.rb ecosystem. Just build your subject-oriented model objects and implement #each, and away you go--you can now run RDF queries and test isomorphism on your model. Build it to accept an RDF::Enumerable in the constructor and you can use any existing repository or query to initialize your model.

RDF.rb is not yet ready for production use, but it's under heavy development and already quite useful. Give it a shot. You can post any issues in the GitHub issue queue.

Posted at 16:05

Datagraph: Using jQuery with Rails 3.0 Beta

One of the most talked about features in Rails 3 is its plug & play architecture, which lets you swap in frameworks like DataMapper in place of ActiveRecord for the ORM, or jQuery for JavaScript. However, I've yet to see much info on how to actually do this with the JavaScript framework.

Fortunately, it looks like a lot of the hard work has already been done. Rails now emits HTML that is compatible with the unobtrusive approach to JavaScript: instead of seeing a delete link like this:

<a href="/users/1" onclick="if (confirm('Are you sure?')) { var f = document.createElement('form'); f.style.display = 'none'; this.parentNode.appendChild(f); f.method = 'POST'; f.action = this.href;var m = document.createElement('input'); m.setAttribute('type', 'hidden'); m.setAttribute('name', '_method'); m.setAttribute('value', 'delete'); f.appendChild(m);f.submit(); };return false;">Delete</a>

you'll now see it written as

<a rel="nofollow" data-method="delete" data-confirm="Are you sure?" class="delete" href="/user/1">Delete</a>

This makes it very easy for a javascript driver to come along, pick out and identify the relevant pieces, and attach the appropriate handlers.

So, enough blabbing. How do you get jQuery working with Rails 3? I'll try to make this short and sweet.

Grab the jQuery driver and put it in your javascripts directory. In the repository, the file is at src/rails.js

Include jQuery (I just use the Google-hosted version) and the driver in your application layout or view. In HAML it would look something like:

= javascript_include_tag ""
= javascript_include_tag 'rails'

Rails requires an authenticity token to do form posts back to the server. This helps protect your site against CSRF attacks. In order to handle this requirement the driver looks for two meta tags that must be defined in your page's head. This would look like:

<meta name="csrf-token" content="<%= form_authenticity_token %>" />
<meta name="csrf-param" content="authenticity_token" />

In HAML this would be:

%meta{:name => 'csrf-token', :content => form_authenticity_token}
%meta{:name => 'csrf-param', :content => 'authenticity_token'}

Update: Jeremy Kemper points out that the above meta tags can be written out with a single call to "csrf_meta_tag".

That should be all you need. Remember, this is still a work in progress, so don't be surprised if there are a few bugs. Please also note this has been tested with Rails 3.0.0.beta.

Posted at 16:05

Datagraph: Is W3C Going the Wrong Direction with SPARQL 1.1?

The W3C SPARQL working group (previously the Data Access Working Group) has recently released their first versions of the updated SPARQL standards, or SPARQL 1.1. The group's roadmap has these finalized a year from now, but they have asked for comments and I suppose these are mine.

I believe that these documents are a step further down a wrong path for SPARQL and, to a lesser degree, for RDF in general.

The latest round includes a number of changes to SPARQL: aggregate functions, subqueries, projection expressions, negation, updates and deletions, more specific HTTP protocol bindings, service discovery, entailment regimes, and a RESTful protocol for managing RDF graphs (the last one is not really just SPARQL, but it's in the updates).

So I'll start with my comments, which are mostly critical.

To start, an RDF-specific complaint, not really related to the rest of the post: why would the one mandated format in the new RESTful RDF graph management interface be RDF/XML? What would it take for the semweb community to move on from this failed standard, which has had known issues for more than 5 years? (Those two issues were raised in 2001 and are currently marked 'postponed'.) Why should such an increasingly irrelevant standard as RDF/XML be chosen instead of the widely supported and easy-to-implement N3, N-Triples, or Turtle?

As for SPARQL, the 1.1 standards continue to give named graphs first-class citizen status, both in the web APIs and in more SPARQL syntax than before. It's not so much triples as quads these days. Other meta-metadata, such as time of assertion or validity time, is not covered. While named graphs are admittedly a particularly common case, why do they need to invade the syntax of SPARQL? Not every use case needs named graphs, but every SPARQL implementor must support them. The 1.1 standard now includes precedence rules for named graph and base URIs when they conflict between HTTP query options and the query itself, attempting to solve this self-created problem.

How about subqueries? What about variables during insertions? What about subqueries during insertions? Do we really need implementors to consider these kinds of things for every SPARQL endpoint on the web?

None of these things is really all that bad by itself, but one must consider the bigger picture. SPARQL 1.0 was released in January of 2008 (with some comment period before that) and there is still no implementation of a SPARQL engine in PHP or Ruby (exceptions apply, see [1]). One does not increase the participation of that ecosystem by adding a selection of entailment regimes to the standard.

While a SPARQL implementation exists for the excellent RDFLib in Python, Python is only one of the current big three web-development languages (alongside Ruby and PHP), and that implementation stands alone. The fact that no SPARQL engines exist for Ruby or PHP should be considered a failure of the standard. Why are we adding complexity when there is no SQLite for SPARQL? Why are there at least 3 monolithic Java implementations (Jena, Sesame, Boca), all financially sponsored to some degree or another, but so little 'in the wild'? How long can RDFLib herd 16 cats as committers on the project? While I don't have a lot of direct experience with RDFLib, I pity the project 'leads' (I cannot find evidence that the project is sponsored or that anyone is 'in charge') trying to look towards the future of implementing 6 working papers' worth of new standards.

One of the biggest success stories for semweb in widespread use is the Drupal RDF module, which has found wide acceptance in the Drupal community and started an ecosystem of modules. Drupal 7 will output RDFa by default and Drupal 6 supports a ton of wonderful features, including reversing the RSS 1.0 to 2.0 downgrade back to RDF. But Drupal remains a producer of simple triples and a consumer of SPARQL queries generated by other endpoints. Data in those sites remains locked down. Why? Because implementing SPARQL in PHP is nontrivial, and in a chicken-egg problem, nobody's paying for it before someone has a need for SPARQL.

I could go on, but these are symptoms (well, not that RDF/XML thing; I don't think there's a good reason for that). I feel that the working group is attempting to solve the wrong problem. Namely, it is attempting to define a somewhat-human-readable query language, SPARQL, that works for almost all use cases. But why must the whole 'kitchen sink' be well-defined? Such a standards body should be attempting to define the easiest possible thing to implement and extend, not the last tool anyone would ever use.

The SPARQL 1.0 standard's grammar was well-defined as a context free grammar. It also had extension functions, which were uniquely defined by URIs. Why the distinction between CFG elements and extension functions? Why not make syntax elements like named graphs and aggregate functions as discoverable as extensions? Well, the reason is that it's hard to write a parser of a human-readable format and make those things optional and discoverable. (Here's a SPARQL parser implementation in Scala, a language with powerful pattern matching features for good parsing, and it's 500 lines of code. It compiles to S-expressions, the parsing of which is about 30 lines. Hmm.)

If the protocol had been defined as S-expressions, the distinction would not exist and the syntax could be as expandable as the current functions (the current syntax would just be more functions). The new 1.1 service discovery mechanism is excellent and extendible and would allow the standard to grow dynamically instead of becoming bogged down in features for particular use cases. New baseline implementations of SPARQL would be easy to implement and grow incrementally, and the current human-readable format can be implemented in terms of these expressions.

The web of ontologies has grown with ad-hoc definitions created by people used to fill their needs. Standards grow organically around the ones that are needed most, others languish. Why should SPARQL functions have this kind of flexibility, but not the syntax? The distinction makes implementation overly difficult and is slowing the expansion of the Semantic Web.

In fact, it turns out that Jena has been parsing to S-expressions for some time. If you're an implementor, why would you do it any other way, especially when the standard can change as much as it does in 1.1? Any implementation will have to come up with something equivalent to S-expressions if it is going to keep up with standards like this as they are finalized. If people are doing it anyway, why not just make it the standard?

The SPARQL Working Group should be working on a definition for a function list and discovery protocol for S-expressions, and not for what we currently call SPARQL. What we call SPARQL is something that should compile to a simpler standard if various vendors want to implement it. S-expressions allow maximally simple parsing, maximally simple serialization, and the ability to do feature discovery on core features of the language, not just on the portions which are blessed with the ability to be extended. S-expressions are also easier for machines to generate, for a wide variety of automated use cases; far wider, I would venture, than the set of use cases for the human-readable queries.

Please, please, please do not doom the world to write the SPARQL equivalent of SQLAlchemy and ActiveRecord for the next 20 years! We can define a standard that machines can use natively. Now's the time.

At any rate, that's my beef in a nutshell. The working group won't come up with a successful standard until it's easy enough to implement that workable implementations appear in the languages that are defining the web today, and until people can implement it in those languages without an army of VC-funded engineers.

The SPARQL 1.1 proposals make the standard better than before, but it's not the standard we need. The SPARQL algebra is what needed expansion and specification, not the syntax.

[1]: The PHP ARC project has an implementation, but it attempts to directly convert SPARQL to an SQL query on a particular table layout in MySQL, and is difficult to adapt for general use. Despite SPARQL's complexity, ARC managed to implement this in just 6400 lines of code: the parser alone is 2000 lines and the engine another 4400. The serialization/parsing libraries, however, are fine, and were integrated successfully into the Drupal RDF module. The PHP RAP project has also done some good work and is perhaps more wrappable than ARC, but implements only a subset of SPARQL.

Posted at 16:05

Copyright of the postings is owned by the original blog authors. Contact us.