It's triples all the way down
real_libby, my retrained-GPT-2 Slack chatbot hosted on a Raspberry Pi 4, eventually corrupted her SD card. Then I couldn’t find her brain (I could only find business-slack real_libby, who is completely different). Since I was rebuilding her anyway, I thought I’d get her up to date with Covid and the rest of it.
For the fine-tuning data: since I made the first version in 2019 I’ve more or less stopped using irc ( :’-( ) and instead use Signal. I still use iMessage, and use Whatsapp more. For a while I couldn’t figure out how to get hold of my Signal data, so I first built an iMessage / Whatsapp version, as that’s pretty easy with my setup (details below; basically sqlite3 databases from an unencrypted backup of my iPhone). I had about 30K lines to retrain with, which I did using this as before, on my M1 macbook pro.
The text/whatsapp version uses exclamation marks too much and goes on about trains excessively. Not super interesting.
It is in fact possible to get Signal messages out as long as you use the desktop client, which I do (although it doesn’t transfer messages between clients, only ones received while that device was authorised). But I still had 5K lines to play with.
I think Signal-libby is more interesting, though she also seems closer to the source crawl, so I’m more nervous about letting her loose. But she’s not said anything bad so far.
Details below for the curious. It’s much like my previous attempt but there were a few fiddly bits.
The Signal version is a bit more apt, I think, and says longer and more complex things.
She makes up urls quite a bit; Signal’s where I share links most often.
Maybe I’ll try a combo version next, see if there’s any improvement.
Getting data
A note on getting data from your phone – unencrypted backups are bad. Most of your data is just there lying about, a bit obfuscated, but trivially easy to get at. The commands below just get out your own data. Baskup helps you get out more.
iMessages are in
/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/[short number]/3d0d7e5fb2ce288813306e4d4636395e047a3d28
sqlite3 3d0d7e5fb2ce288813306e4d4636395e047a3d28
.once libby-imessage.txt
select text from message where is_from_me = 1 and text not like 'Liked%';
Whatsapp messages are in
/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/[short number]/7c7fba66680ef796b916b067077cc246adacf01d
sqlite3 7c7fba66680ef796b916b067077cc246adacf01d
.once libby-whatsapp.txt
SELECT ZTEXT from ZWAMESSAGE where ZISFROMME='1' and ZTEXT!='';
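If you want to combine the two exports into a single fine-tuning file, something like this works (a minimal sketch; the filenames come from the commands above, and the exact cleaning you want is up to you):
# combine both exports, trim whitespace and drop empty lines
cat libby-imessage.txt libby-whatsapp.txt \
  | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
  | grep -v '^$' > libby-messages.txt
wc -l libby-messages.txt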
Signal’s desktop backups are encrypted, so you need to use this, which I could only get to work using docker. Signal doesn’t back up from your phone.
Tweaks for finetuning on an M1 Mac
git clone git@github.com:nshepperd/gpt-2.git
cd gpt-2
git checkout finetuning
mkdir data
mv libby*.txt data/
pip3 install -r requirements.txt
python3 ./download_model.py 117M
pip3 install tensorflow-macos # for the M1
PYTHONPATH=src ./train.py --model_name=117M --dataset data/
tensorflow-macos is tf 2, but that seems ok, even though I only run tf 1.13 on the pi.
Rename the model and copy over the bits you need from the initial model:
cp -r checkpoint/run1 models/libby
cp models/117M/{encoder.json,hparams.json,vocab.bpe} models/libby/
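Before moving the model over to the Pi it’s worth generating a few samples to check it. The gpt-2 repo comes with sampling scripts, so something along these lines should work (the flags are the usual gpt-2 sampling options; adjust to taste):
PYTHONPATH=src python3 src/interactive_conditional_samples.py --model_name=libby --top_k=40 --length=100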
Pi 4
The only new part on the Pi 4 was that I had to install a specific version of numpy – the rest is the same as my original instructions here.
pip3 install flask numpy==1.20
curl -O https://www.piwheels.org/simple/tensorflow/tensorflow-1.13.1-cp37-none-linux_armv7l.whl
pip3 install tensorflow-1.13.1-cp37-none-linux_armv7l.whl
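A quick sanity check that the wheel installed and imports cleanly (just my assumption about how you’d verify it):
python3 -c "import tensorflow as tf; print(tf.__version__)"  # should print 1.13.1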
Posted at 16:10
I made a bunch of robots recently, building on some of the ideas from libbybot and as an excuse to play with some esp32 cams that Tom introduced me to in the midst of his Gobin Frenzy.
The esp32 cams are kind of amazing and very, very cheap. They have a few quirks, but once you get them going it’s easy to do a bit of machine vision with them. In my case, being new to this kind of thing, I used previous work by Richard to find the weight of change between two consecutive images. Dirk has since made better versions using a PID library. I made some felt-faced ones using meccano and then some lego ones using some nice lego-compatible servos from Pimoroni. Because esp32s can run their own webserver, you can use websockets or mjpeg to debug them and see what they are doing.
The code’s here (here’s Dirk’s), and below are a few pictures. There are a couple of videos in the github repo.
Posted at 16:10
I keep seeing these two odd time effects in my life and wondering if they are connected.
The first is that my work life has split into extremes: an extremely intense stretch – not long hours, but intense brainwork for maybe a week – that wipes me out, and then a week that is inevitably slower and less intense. Basically everything gets bunched up together. I feel like this has something to do with everyone working from home, but I’m not really sure how to explain it (though it reminds me of my time at Joost, where we’d have an intense series of meetings with everyone together every few months because we were distributed – but this isn’t organised, it just happens). My partner pointed out that this might simply be poor planning on my part (thanks! I’m quite good at planning actually).
The second is something we’ve noticed at the Cube – people are not committing to doing stuff (coming to an event, volunteering etc) until very close to the event. Something like 20-30% of our tickets for gigs are being sold the day before or on the day. I don’t think it’s people waiting for something better. I wonder if it’s Covid-related uncertainty? (also 10-15% don’t turn up, not sure if that’s relevant).
Anyone else seeing this type of thing?
Posted at 16:10
More for my reference than anything else. I’ve been trying to get the toolchain set up to use a Sparkfun Edge. I had the Edge, the Beefy3 FTDI breakout, and a working USB cable.
This worked great for the speech example (although the actual tensorflow part never understands my “yes”, “no” etc; but anyway, I was able to upload it successfully):
$ git clone --depth 1 https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ gmake -f tensorflow/lite/micro/tools/make/Makefile TARGET=sparkfun_edge micro_speech_bin
$ cp tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/keys_info0.py tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/keys_info.py
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/create_cust_image_blob.py --bin tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4_micro/bin/micro_speech.bin --load-address 0xC000 --magic-num 0xCB -o main_nonsecure_ota --version 0x0
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/create_cust_wireupdate_blob.py --load-address 0x20000 --bin main_nonsecure_ota.bin -i 6 -o main_nonsecure_wire --options 0x1
$ export BAUD_RATE=921600
$ export DEVICENAME=/dev/cu.usbserial-DN06A1HD
$ python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/tools/apollo3_scripts/uart_wired_update.py -b ${BAUD_RATE} ${DEVICENAME} -r 1 -f main_nonsecure_wire.bin -i 6
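Once the upload has finished you can watch the board’s output with any serial terminal; for example (same device name as above; the 115200 baud rate is an assumption, so check the example’s docs if you get garbage):
$ screen ${DEVICENAME} 115200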
But then I couldn’t figure out how to generalise it to use other examples – I wanted to use the camera because ages ago I bought a load of tiny cameras to use with the Edge.
So I tried this guide, but couldn’t figure out where the installer had put the compiler. Seems basic, but….??
So in the end I used the first instructions to download the tools, and then the second to actually do the compilation and installation on the board.
$ find . | grep lis2dh12_accelerometer_uart
# you might need this -
# mv tools/apollo3_scripts/keys_info0.py tools/apollo3_scripts/keys_info.py
$ cd ./tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/boards_sfe/edge/examples/lis2dh12_accelerometer_uart/gcc/
$ export PATH="/Users/libbym/personal/mayke2021/tensorflow/tensorflow/lite/micro/tools/make/downloads/gcc_embedded/bin/:$PATH"
$ make clean
$ make COM_PORT=/dev/cu.usbserial-DN06A1HD bootload_asb ASB_UPLOAD_BAUD=921600
etc. Your COM port will be different; find it using
ls /dev/cu*
If, like me, the FTDI serial port KEEPS VANISHING ARGH – this may help (I’d installed 3rd party FTDI drivers ages ago and they were conflicting with Apple’s ones. Maybe. Or the reboot fixed it. No idea).
Then you have to use a serial programme to get the image. I used the Arduino serial monitor since it was there, and then copied and pasted the output into a textfile, at which point you can use
tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.2.0/boards_sfe/common/examples/hm01b0_camera_uart/utils/raw2bmp.py
to convert it to a png. Palavers.
Posted at 16:10
Makevember and lockdown have encouraged me to make an improved version of libbybot, which is a physical version of a person for remote participation. I’m trying to think of a better name – she’s not all about representing me, obviously, but anyone who can’t be somewhere but wants to participate. [update Jan 15: she’s now called “sock_puppet”].
This one is much, much simpler to make, thanks to the addition of a pan-tilt hat and a simpler body. It’s also more expressive thanks to these lovely little 5×5 LED matrices.
Her main feature is that – using a laptop or phone – you can see, hear and speak to people in a different physical place to you. I used to use a version of this at work to be in meetings when I was the only remote participant. That’s not much use now of course. But perhaps in the future it might make sense for some people to be remote and some present.
New recent features:
* ish
**a sock
I’m still writing docs, but the repo is here.
Posted at 16:10
I’ve been using the MyNatureWatch setup on my bird table for ages now, and I really love it (you should try it). The standard setup is with a pi zero (though it works fine with other versions of the Pi too). I’ve used the recommended, very cheap, pi zero camera with it, and also the usual pi camera (you can fit it to a zero using a special cable). I got myself one of the newish high quality Pi cameras (you need a lens too; I got this one) to see if I could get some better pics.
I could!
I was asked on twitter how easy it is to set up with the HQ camera, so here are some quick notes on how I did it. Short answer – if you use a recent version of the MyNatureWatch downloadable image, it works just fine with no changes. If you are on the older version, you need to upgrade it, which is a bit fiddly because of the way it works (it creates its own wifi access point that you can connect to, so it’s never usually online). It’s perfectly doable with some fiddling, but you need to share your laptop’s network and use ssh.
Update (May 2022) – I’d just suggest using a newish release of MyNatureWatch, which works perfectly.
MyNatureWatch Beta – this is much the easiest option. The beta is downloadable here (more details) and has some cool new features such as video. Just install as usual and connect the HQ camera using the zero cable (you’ll have to buy this separately; the HQ camera comes with an ordinary cable). It is a beta, and I had a networking problem with it the first time I installed it (the second time it was fine). You could always put it on a new SD card if you don’t want to blat a working installation. Pimoroni have 32GB cards for £9.
The only fiddly bit after that is adjusting the focus. If you are not used to it, the high quality camera CCTV lens is a bit confusing, but it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).
MyNatureWatch older version – to make this work with the HQ camera you’ll need to be comfortable with sharing your computer’s network over USB, and with using ssh. Download the img here, and install on an SD card as usual. Then, connect the camera to the zero using the zero cable (we’ll need it connected to check things are working).
Next, share your network with the Pi. On a mac it’s like this:
You might not have the RNDIS/Ethernet gadget option there on yours – I just ticked all of them the first time and *handwave* it worked after a couple of tries.
Now connect your zero to your laptop using the zero’s USB port (not its power port) – we’re going to be using the zero as a gadget (which the MyNatureWatch people have already kindly set up for you).
Once it’s powered up as usual, use ssh to login to the pi, like this:
ssh pi@camera.local
password: badgersandfoxes
On a mac, you can always ssh in but can’t necessarily reach the internet from the device. Test that the internet works like this:
ping www.google.com
This sort of thing means it’s working:
PING www.google.com (216.58.204.228) 56(84) bytes of data.
64 bytes from lhr48s22-in-f4.1e100.net (216.58.204.228): icmp_seq=1 ttl=116 time=19.5 ms
64 bytes from lhr48s22-in-f4.1e100.net (216.58.204.228): icmp_seq=2 ttl=116 time=19.6 ms
If it just hangs, try unplugging the zero and trying again. I’ve no idea why it works sometimes and not others.
Once you have it working, temporarily stop mynaturewatch from using the camera:
sudo systemctl stop nwcameraserver.service
and try taking a picture:
raspistill -o tmp.jpg
you should get this error:
mmal: Cannot read camera info, keeping the defaults for OV5647
mmal: mmal_vc_component_create: failed to create component 'vc.ril.camera' (1:ENOMEM)
mmal: mmal_component_create_core: could not create component 'vc.ril.camera' (1)
mmal: Failed to create camera component
mmal: main: Failed to create camera component
mmal: Camera is not detected. Please check carefully the camera module is installed correctly
Ok so now upgrade:
sudo apt-get update
sudo apt-get upgrade
you will get a warning about hostapd – press q when you see this. The whole upgrade took about 20 minutes for me.
When it’s done, reboot
sudo reboot
ssh in again, and test again if you want:
sudo systemctl stop nwcameraserver.service
raspistill -o tmp.jpg
Then re-enable hostapd:
sudo systemctl unmask hostapd.service
sudo systemctl enable hostapd.service
reboot again, and then you should be able to use it as usual (i.e. connect to its own wifi access point etc).
The only fiddly bit after that is adjusting the focus. I used a gnome for that, but still sometimes get it wrong. If you are not used to it, the high quality camera CCTV lens is a bit confusing – it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).
Here are a few more pictures from the camera.
Posted at 16:10
It works using Chromium, not the Zoom app (which only runs on x86, not ARM). I tested it with a two-person, two-video stream call. You need a screen (I happened to have a spare 7″ touchscreen). You also need a keyboard for the initial setup, and a mouse if you don’t have a touchscreen.
The really nice thing is that Video4Linux (bcm2835-v4l2) support has improved, so it works with both v1 and v2 raspi cameras, and there’s no need for
options bcm2835-v4l2 gst_v4l2src_is_broken=1
So:
You’ll need to set up Zoom and pass captchas using the keyboard and mouse. Once you have logged into Zoom you can often ssh in and start it remotely like this:
export DISPLAY=:0.0
/usr/bin/chromium-browser --kiosk --disable-infobars --disable-session-crashed-bubble --no-first-run https://zoom.us/wc/XXXXXXXXXX/join/
Note the url format – this is what you get when you click “join from my browser”. If you use the standard Zoom url you’ll need to click this url yourself, ignoring the Open xdg-open prompts.
You’ll still need to select the audio and start the video, including allowing it in the browser. You might need to select the correct audio and video devices, but I didn’t need to.
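If the camera doesn’t show up in the call, it’s worth checking that it’s visible as a V4L2 device first; a quick sketch (v4l2-ctl comes from the v4l-utils package, which I’m assuming you may need to install):
ls /dev/video*
sudo apt install v4l-utils
v4l2-ctl --list-devices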
I experimented a bit with an ancient logitech webcam-speaker-mic and the speaker-mic part worked and video started but stalled – which made me think that a better / more recent webcam might just work.
Posted at 16:10
… or, the lack of it.
A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as in Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather rare.
There are different reasons why one might want to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is: do systems support this (transparently), or are developers forced to code it into the application logic?
At the IaaS level – especially in this category, file storage for app development – one would expect wide support for built-in encryption.
At the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and concerning Google’s App Engine, good practices for data encryption only seem to be emerging.
Offerings on the SaaS level provide an equally poor picture:
In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.
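For illustration, the S3 route is about the simplest back-end option: you can ask S3 to encrypt an object server-side at upload time. A minimal sketch with the AWS CLI (bucket and key names are made up):
# request server-side encryption for a single object
aws s3 cp data.csv s3://my-bucket/data.csv --sse AES256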
Posted at 16:10
I’ve decided to keep doing the annual end of the year reflections, iterating further on the structure I used in 2021 and 2020.
What follows is a mix of personal reflections on the year, as well as brief lists of the things I’ve enjoyed watching, reading and playing.
This year I’ve split my time between working as CTO at Energy Sparks, and continuing to do some freelancing.
I’m really enjoying this role. It’s great to be doing technical and product work again. And it’s been an interesting year for both myself and the charity.
When I started in the role in mid-2020 I focused on ensuring I understood the application from an architectural and operational point of view. While we use freelance developers, including the very capable Julian Higman, I was not just doing the majority of the development work, but also had sole responsibility for keeping everything up and running.
The true test came when I had to take us through a full platform upgrade which completed in January 2022. It all went relatively smoothly, so I’m pleased that I prioritised investing time in the right areas.
This year, I’ve been starting to develop a deeper understanding of the energy analysis side of Energy Sparks. To date, most of this has been the responsibility of a single analyst/developer who has now largely left the organisation. It’s a big hole to fill even though we’re using him on a freelance basis to help with knowledge transfer.
It’s been really interesting learning more about the UK’s energy data infrastructure as well as energy data analysis in general. But I’ve still got a long way to go here.
This year we managed to win some significant funding, including:
We’re ending 2022 with over 641 schools on Energy Sparks which is more than double where we were last year. This is a great achievement and, as you might imagine, ensuring that the service scales as we grow has been at the forefront of my mind. I planned out a series of improvements early in the year which we’ve now delivered.
To extend the service to this many schools, we needed to grow the team. We’re now a team of eight, with a number of freelancers and partners supporting our work. I spent a lot of time in 2022 interviewing for new roles. And I now have a small team of developers (Deb Bassett, IanT).
It’s nice to be back leading a team again. But, even with a small team this means doing less hands-on work and focusing more on enabling others. I’m enjoying it.
Next year will be a big year for us as we’ve got several projects to land and we need to show that we’re having a real impact. If so, then we should be able to get some additional funding. Fingers crossed!
I wrote a summary of my recent freelance work back in October, so I’ll link to that as a summary of my recent projects.
Since then I’ve started a second project with CABI, supporting a team that are providing advice about publishing FAIR data to a number of Gates Foundation projects.
As I wrote in October, I’m enjoying being able to use the experience I’ve developed around open data and data infrastructure, to support and mentor teams who are building data infrastructure, ecosystems and infrastructure. Hopefully I can do more of that next year.
I’ve got availability for freelance work starting in February, if you need some help?
I’ve had two occasions this year when I’ve had unsolicited feedback on my management style. The first from a former colleague at the Open Data Institute, the second from a recent joiner to the Energy Sparks team.
The feedback was really positive, totally out of the blue, and really caught me off guard.
If I’m honest, things didn’t end particularly well at the Open Data Institute. I left feeling pretty deflated. It wasn’t clear that I fitted into the organisation and I ended my time there — which spanned most of its first 10 years — doubting my abilities as a senior leader, questioning whether I’d had any impact (on the organisation or elsewhere) and with a massive boost to my imposter syndrome.
This feedback has really helped me work through a lot of that. And I’m feeling more confident that I’m doing the right things in my current role.
I won’t share any more than that, but I’m going to try and pay things forward and do the same for other people.
Unsurprisingly, I’ve written a lot of code this year.
At the start of the year I wanted to try and do some more creative coding. I don’t think I really achieved that, but there are a few things I’m pleased with:
The last one of those hit the Hacker News front page for a couple of days, so got a lot of traffic. That was fun.
I’ve got a couple of other projects I’ve been tinkering with this year. One of these was a Twitter bot, but that’s going to move to Mastodon now.
I didn’t manage to start running again in 2022. I’m really not sure why I stopped because I was really enjoying it.
Actually I do: it got cold and wet, and I’d mostly hit my weight loss target. So I just let it taper off. Let’s try again next year.
What I did do in 2022 was a lot of walking. Four of us fell into a regular schedule of weekend walks through the countryside which I really, really loved. I always forget how much being outdoors lifts my mood.
We’ll be doing a lot more of that in 2023.
As before I’ve been tweeting what I’ve been cooking in 2022. Although I largely stopped doing this in November after I shifted over to Mastodon.
I still bookmark recipes here.
Of the recipes and cocktails I logged, they break down as follows:
This year was mostly Sichuan dishes as I had a new recipe book. I also continued my love affair with the dirty martini.
I usually try out a new recipe on a Saturday night, along with a cocktail or two. But I’ve ended up cooking less this year as we changed our Saturday night routine. We now frequently nip over to visit a friend in Bradford-on-Avon on a Saturday night, so have ended up taking turns with the cooking.
I’ve already published a blog post with my gardening retro for 2022. Beans for the win.
At the beginning of 2022 I jumped from Spotify to Tidal. This was for several reasons.
Spotify ditched the subscription tier that I’d been on pretty much since they launched. This was heavily discounted, so my monthly cost was going to go up. I was also increasingly frustrated with them constantly pushing podcasts. I just don’t get on with podcasts. The whole Alex Jones/Infowars thing was the icing on the cake.
Once I realised that Tidal was a serious alternative — same price, better cut for artists, same coverage and no podcasts — it was a no-brainer to switch over. I’ve not had any issues with the service at all.
It’s a shame that they don’t do an end of year wrap-up, though they do give you a playlist of your most listened tracks. So here’s my Most Listened 2022. There’s a lot of Wet Leg, Moderat, µ-Ziq, Rival Consoles and Cymande on there.
I’ve also kept up my habit of creating a playlist of “tracks that I loved on first listen, which were released this year“. Here’s my 2022 Tracked playlist.
It contains 176 tracks totalling 13 hours, 56 minutes and 44 seconds of music. The tracks are in order of when I heard them.
I’m continuing to publicly bookmark the articles and papers I’ve read. And I’m now using StoryGraph to log my reading. You can follow me there if you’re interested.
I’ve been through and imported the last few years of reading data I’d collected in a custom spreadsheet. As a service it’s got some limitations and rough edges, but I’m finding it useful so far. StoryGraph tells me that, as of this morning, I’ve read 106 books, which totals 23,238 pages.
I’ve kept up the habit of having one comic book, one novel and one non-fiction book on the go. I’ve read a lot more comics than anything else.
That big dip in reading from March to May was due to Elden Ring.
My favourites this year were:
My favourites this year have been:
Affinities is just lovely. I’m a big fan of the Public Domain Review, so it was nice to have a print copy of so many gorgeous images. It led me down a lot of interesting rabbit holes, one of which ended up with me reading Cartographies of Time, which is also fascinating.
I’m still working my way through a lot of big Humble Bundle collections I picked up in the last few years, but am increasingly buying new stuff via the Comixology app (which is terrible).
My favourites this year have been:
Notable mentions go to the G. Willow Wilson Ms Marvel books, the Dan Slott She-Hulk collection, and the first two volumes of the latest Swamp Thing reboot by Ram V.
I read the entire run of East of West, which was…OK? Gave up on Casanova as an incoherent mess.
Managed to finally get myself a PS5 this year, so most of my gaming time has been spent on that rather than the PC. I’ve also now got a Playdate, which is a gorgeous little device. I want to build something for it.
I’m ldodds on both Steam and PSN if you want to add me there.
Haven’t played many board games this year. Although I did pick up a copy of Quacks of Quedlinburg which has become an instant family favourite.
Most of my table-top (“zoom top”?) gaming this year has been:
I’m really enjoying playing TTRPGs again. The new rulesets are so story focused and so accessible to newcomers, that I’m not sure I’d ever want to play something like D&D again.
I’ve also restarted the adjacent hobby of collecting RPG rulebooks. So I’ve got a growing mass of PDFs and Hardbacks, most of which I probably won’t end up playing, but who cares?!
Favourites this year:
Only three as I didn’t play many games. I sank a lot of hours into Elden Ring.
I really enjoyed Returnal, but the grind became too frustrating and I put it down. I’ve also been playing a bit of Darkest Dungeon again whilst waiting for the full release of Darkest Dungeon 2.
I started using Letterboxd this year to record the films I’ve been watching. And by “started using” I mean:
It’s not comprehensive, obviously, but I’ve now logged 1693 films.
This year I watched 84 films, 15 of which I’d watched before.
I watched a lot of films in February. I had the half-term week as a holiday and everyone in the house came down with Covid. So I just watched films whilst the rest of the family were in solitary confinement.
Not all of my film viewing has been sofa based. I managed to get out to the cinema a few times this year (Nope, Bullet Train, Everything Everywhere All At Once). And I also went to the Forbidden Worlds Film Festivals which are my new favourite events.
My favourite TV series this year were:
No real surprises in that list. Although I don’t think I’ve seen many people talking about the Paper Girls adaptation. I thought it was brilliant, so it’s tied with She Hulk for me, which was clever and funny, but uneven in places.
I should mention that Masterchef remains one of my favourite programmes ever. I don’t really watch reality shows but I’m always glued to both the amateur and professional series. But we did also watch Bake-Off as a family this year. It’s an excuse to bake something for every episode.
I also somehow ended up watching Mortimer & Whitehouse: Gone Fishing this year, despite not being interested in fishing. But at 50 and a half, I guess I’m in the demographic now? It’s a bit too melancholy at times though!
Years after everyone else I also watched all of Succession.
… is all I can say.
Unlikely to be many surprises here either, but:
Special mention for Unbearable Weight of Massive Talent which was hilarious. And Barbarian which was bonkers.
Favourite channels:
I wrote 33 blog posts in 2022, totalling 23,370 words.
The three new articles that got the most views were:
The three articles across the entire history of my blog that got the most views were:
These are all really low view counts in the scheme of things. But I’m not writing for the views.
I’ve been noodling on a couple of writing projects this year which I’m hoping to make some proper headway on next year.
What about everything else?
There’s still a lot going on at home that I don’t want to write about in detail here.
Watching and supporting my kids as they try to become the people they want to be remains the most challenging and rewarding thing I’ve ever done. I just didn’t realise it would still be this hard. I have to remind myself that parenting is a rollercoaster. There’s no downhill: just surprising twists and turns.
I jumped to Mastodon in November, as part of the big Twitter migration. Although to be honest, I had been feeling really disconnected and frustrated with Twitter for some time. Mastodon won’t solve all of that, but it’s less angry and political which is partly what I needed.
My twitter posts are now limited to auto-posts from this blog.
What I’ve been wrestling with this year might be summed up as this: “who are my community, and how do I connect with them?”.
I wasn’t feeling a sense of community on Twitter. Mastodon might offer something different, but I don’t think it will. It’s still social media. I think I need to find other ways of connecting, both online and in person.
I’m using DuoLingo to learn Welsh. Two hundred day streak at the moment.
I’ve still not had Covid. That’s good.
Posted at 16:05
Now that I work in the energy sector I’m trying to pay closer attention to how the data infrastructure in that area is evolving.
Here’s a round up of some current and recent projects that I’ve been keeping an eye on. Along with some thoughts on their scope, overlaps and potential outcomes.
In 2021 Ofgem published the “Data Best Practice” guidance: a set of eleven principles intended to guide organisations in the energy sector towards publishing more open data.
It encourages a “presumed open” approach and recommends the types of general best practice that you can see in other principles, e.g. FAIR.
The principles are binding for a small number of organisations (those operating the UK’s energy networks) but are otherwise voluntary.
Ofgem recently asked for feedback on the principles, with comments closing at the end of October. The feedback is phrased as a kind of retrospective: what’s going well, what could be improved, what would encourage organisations to adopt them, etc. There’s one specific question about whether providing more concrete guidance on data formats would be helpful.
It will be interesting to see what kind of feedback Ofgem have received. My hope is that it will prompt:
I’ve done quite a bit of work in the last few years supporting organisations in adopting the FAIR data principles. What I’ve learned is that while principles can provide a good basis for building a shared vision, they always need to be supported by specific actionable guidance.
You cannot assume that everyone knows how to put those principles into practice.
Without detailed, sector- and dataset-specific guidance — use these formats and this metadata; this data should be open, while this data should be shared — people are just left to work through all of the details for themselves. This creates friction. And that friction results in data not being published well, and plenty of room for excuses and uncertainty that leaves data not being published at all.
In January 2022 the Energy Digitalisation Taskforce published a report that provided a set of recommendations aimed at creating a digitalised Net Zero energy system.
The report includes a number of recommendations aimed at improving the data infrastructure in the sector. Including creating a “data sharing fabric”, an energy asset register, a data catalogue, improving data standards and creating a “digital spine”.
Unfortunately the report doesn’t clearly define what it means by a “fabric” or a “spine”, just that the former is intended to support data sharing and the latter to improve interoperability. In practice there’s a lot of different ways this technical infrastructure might be delivered.
The spine appears to be intended as middleware that sits between individual organisations and the broader energy network, making it easier to expose data in standard formats.
This looks to be different to, for example, the NHS spine, which fulfils a similar role but includes some centrally coordinated services to ensure connectivity and interoperability across the sector. I’ve not been able to find examples of “spines” in other sectors.
BEIS have commissioned a study to determine “the needs case, benefits, scope and costs of an energy system ‘digital spine’”. That procurement completed at the end of November, but I don’t believe the winners have been announced yet.
Unfortunately, from the outside, it feels a bit like a broad solution has been proposed (open-source middleware) which the procurement is then focused on scoping, rather than starting from a broader user research question of “what is required to improve interoperability in the energy sector?” and then identifying the most useful interventions.
For example, I’ve previously written about some low-hanging fruit that would increase interoperability of half-hourly meter data.
An open source middleware layer might be a useful intervention, but there’s risks that other necessary work is overlooked. For example, if an open source middleware is going to convert data into standardised forms, then what do those look like? Do they need to be developed first?
Does the UK energy sector even have a good track record of creating and adopting open source infrastructure? Or is there ground work required to build that acceptance and capacity first?
Maybe all this was covered in the research behind the Taskforce report that recommended the spine, but it’s not clear at this point. It will be interesting to see the results of the study.
The Smart Meter Energy Data Repository (SEDR) programme is intended to “determine the technical and commercial feasibility of a smart meter energy data repository, quantify the benefits and costs of such a smart meter energy data repository, and simulate how it could work”.
The procurement for this piece of work closed in July, but again I don’t think the winner has been announced. I’ve been interviewed by them as part of their user research, so I know the project is up and running!
Update: 3rd January 2023 the three funded projects have been announced.
The current smart meter network has at most 2 years of data distributed across every smart meter in the UK. Access to data requires querying individual meters by sending them messages.
Would it be useful to have a single repository of data, making it easier to query and work with? What types of applications would benefit from that data infrastructure? What are the privacy implications? What would the technical infrastructure look like? These are all questions that will be considered in this project.
I’ve written before about why the UK energy sector should learn from open banking and just develop standardised APIs to provide a high-level interface to the smart meter network.
I’ve also written about the problems of trying to access half-hourly data for the non-domestic market.
I won’t repeat that here, but will note a few things that I raised in the user research:
This is potentially a critical piece of infrastructure, so it needs some careful planning and execution.
Another current BEIS programme is the Smart Meter System based Internet of Things applications programme.
This one is looking at whether it is possible to use the existing Smart Meter communications network and infrastructure run by the DCC to support other IoT applications. For example, to support monitoring of “smart buildings” or other parts of the energy data system.
Update: 3rd January 2023 the three funded projects have been announced.
This seems like a sensible approach, as we don’t necessarily need separate infrastructure for what might be very similar requirements.
But the Smart Meter Energy Data Repository project shows that the current infrastructure is not meeting existing needs, so it’s reasonable to assume that there will be additional requirements for these IoT use cases too. Hopefully considerations of these other use cases are at least on the radar of the SEDR review as they might offer additional insights.
As you can see there’s a lot happening around the UK’s energy data infrastructure. I’ve not even touched on the work of Icebreaker One or the SmartDCC “Data for Good” Project.
Posted at 16:05
I’ve written a reflection point about growing vegetables for the last two years (2020, 2021) so I’m going to keep going. It’s useful to plan ahead. And it’s nice to think about the spring and summer when it’s so cold and dark outside!
My goals for this year were to:
As always, I did some but not all of these.
Unlike last year I didn’t add any new growing areas or buy new pots. This year was mostly about using the space better.
For example, as shown in the photo above, I tried growing the cucumber plants up a raised frame rather than letting them sprawl all over the beds and path. This freed up a lot of space and I was even able to grow some spring onions and lettuces under the frame. They cropped before the frame was completely overgrown.
I rotated the crops through the beds.
I planted more densely across all the beds, and interleaved smaller, faster growing crops (radishes, lettuce, spring onions) amongst slower growing veg (potatoes, sweetcorn). This isn’t quite companion planting, but worked well.
I gave up an entire bed to potatoes. And also tried growing some in pots.
Tried to plant up the peas more densely to fully fill the space under the frame.
I also made sure I planted up all the pots I have which meant getting some more soil and compost.
I didn’t replace any of the beds, I figured they’d got at least another year or so in them. I did repair one of them though.
I didn’t look into soil improvers. And I didn’t grow many more chillies but gave up on strawberries.
I continued to get distracted by bees.
The final list for this year was (new things in bold).
Basil, Beans, Blueberries, Butternut Squash, Carrots (2 varieties), Cucumber, Jalapeños, Lettuce (2 varieties), Peas, Mint, Potato, Radish, Shallots, Scotch Bonnets, a “Snacking Pepper” (not sure of the variety), Spinach, Spring Onion, Sweetcorn, Swiss Chard, Thyme, Tomatoes (3 varieties)
Things I didn’t grow this year: Strawberries.
It was a bit frustrating this year that so many things failed or produced a lacklustre crop.
It was incredibly hot this year, so I’m putting at least some of these issues down to problems keeping everything well watered. Also, I think investing in some soil improvement would be sensible now.
Having the space to grow vegetables is a privilege and I’m very glad and very lucky to have the opportunity.
Gardening can be time consuming and frustrating, but I love being able to cook with what I’ve grown myself. Getting out into the garden, doing something physical, seeing things grow is also a nice balm.
Looking forward to next year.
Posted at 16:05
One of the big projects we’ve currently got under way at Energy Sparks is redesigning the collection of pages that present the results of our detailed analysis of their energy data to school users.
The existing pages have been around for a few years, and our metrics and user testing have shown that they aren’t really performing well. They need a clearer content model, better navigation, and to do a better job of highlighting the key insights to users.
We also need to review and translate all of the content into Welsh as part of the next phase of translating the service into Welsh. As a result we’ve been doing quite a bit of prototyping, user research and testing over the past few months. It’s something I’m really enjoying.
To help inform our thinking I’ve been looking for existing guidance around presenting charts, data and supporting analysis to users.
I thought I’d share a few of the resources that I’ve found useful.
The GDS Design System is a resource that I frequently use as a reference when doing any UX/UI work. Although I’ll often look at other design systems too.
While the design system seems primarily geared towards the development of transactional services, it has some useful patterns that are applicable to the thinking we’ve been doing around our advice pages.
For example we’re considering how to use progressive disclosure to avoid drowning users in details. Patterns like Details, Inset Text and Warning Text are helpful references.
The GDS guidance on planning and writing content and recommendations for publishing statistics, making tables accessible and presenting numbers are all useful and relevant resources.
Our users have a wide range of backgrounds and knowledge, but their common characteristic is that they’re all time-poor. We need our analysis to be clear and accessible.
Another resource that I’m frequently referencing is the ONS Style Guide, and in particular, the detailed recommendations around presenting charts and tables.
We need to ensure that our analysis is well-presented, but also has the appropriate footnotes and details that will allow our more advanced users to dig into the details. Reading how the ONS approaches presenting often very detailed statistical data is very helpful.
We also want our charts and tables to demonstrate good practice, as often they’re being looked at by children. It’s important that the service reinforces the good practices they’re being taught around data literacy.
I’m lumping these examples together as I discovered them via a helpful Mastodon post:
For obvious reasons we want our advice to be as widely accessible as possible. We’ve got a lot to improve on the site in that area, but I’m confident we can make some quick progress as we build out our new pages and iterate further on existing pages across the site.
As a small team, it’s really, really helpful to be able to benefit from the insights, user testing and design skills of much larger organisations. It’s always important to filter that advice through the lens of your own users and product, but when others work in the open in this way, we can all benefit.
If you have other resources that you think I should be looking at, then leave a comment or drop me a message on Mastodon.
Posted at 16:05
I heard about VHS recently. It’s a tool for creating recordings of command-line tools, so you can create little demos and tutorials about how to use them.
You can write a script to run commands, manipulate and theme the terminal and produce output in a range of formats.
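A tape is just a plain text file of commands. Here’s a minimal sketch of what one might look like and how you’d run it (this is my own illustration rather than one of my actual tapes; check the VHS docs for the full command set):
# write a tiny tape that types a command and captures a GIF
cat > demo.tape <<'EOF'
Output demo.gif
Set FontSize 18
Type "echo 'hello from VHS'"
Enter
Sleep 2s
EOF
vhs demo.tape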
I started thinking about how I could enhance some of the technical documentation I’ve been writing recently with little videos. It’d be a nice way to provide an alternative way for people to learn a process or a new tool.
But then I realised I could use it to do something much more fun: recreate some scenes from some sci-fi films.
So I give you sci-fi terminals. A little github repo with some VHS tapes that produce the following output.
Check the repo if you’re interested in how the tapes work. I’m pretty pleased with the results!
Posted at 16:05
I’ve been tidying up some of my online presence this week, including getting rid of a server I wasn’t using any more and moving some project around.
One of those was a project I did a few years ago to digitise some historical maps of Bath, georeference them so they can be overlaid onto current web maps, and then publish them for use in Google Earth.
Having fixed up some SSL issues with the serving of the KML and image files, the maps are now working again.
I plan to do some more work on this as I’ve since found a number of other maps that I’ve also been digitising. There are also better ways to publish these maps today.
But for now I thought I’d quickly write up how to use them using current versions of Google Earth.
Google Earth Pro
The desktop version is now called Google Earth Pro. It’s still free.
If you want to view an individual map, then click one of the links in the table and your browser will download a file called doc.kml.
With the desktop application installed, double-clicking should automatically open any KML file in Google Earth. So just double-click the file to open it.
You’ll probably want to turn off some of the default map layers, like 3D Buildings, as otherwise the map will look odd.
You can then explore the file using the normal navigation controls.
One thing I like to do is use the Opacity setting to fade out the historical map slightly so you can see the modern day features underneath.
The best way to import all of the maps into your application is to:
The application will then add a new folder which contains a sub-folder for each of the historical maps. You can then choose which ones to switch on and off. By playing with the opacity you can explore several maps in one go.
Here’s a video of me doing that.
Google Earth on the Web
This version of Google Earth doesn’t support as many features. So it just doesn’t work as well as the desktop version. But you can still view the maps.
You can turn off 3D Buildings by going to the “Map Styles” menu option and choosing the “Clean” style.
Unfortunately I can’t find a way to change the opacity of layers in this version, so you’re limited in how you can explore the maps.
The web version also has limited KML support so you can’t open the complete folder of maps, which is a shame.
So, there you have it. I’m pleased the maps are working again but lots to do to make the whole experience better.
Posted at 16:05
I enjoy writing design patterns.
I find them a useful way to clarify my thinking around different solutions to problems across a whole range of areas. A well-named pattern can also help to clarify and focus discussion.
I’ve written a whole book of Linked Data patterns and led a team that produced a set of design patterns for collaborative maintenance of data.
I’ve been planning to revisit some writing and thinking I’ve been doing around capturing design patterns for different models of data access, sharing, governance and modelling. Given all the confusing jargon that is thrown around in this space, I think writing some design patterns might help.
When I’ve been writing design patterns in the past I’ve used a fairly common template:
But I’ve been thinking about iterating on this to add a new section: harms.
This would lay out the potential consequences or unintended side effects of adopting the pattern, both for the system in which it is implemented and more broadly.
I started thinking about this after reading a paper that discusses the downsides of poor modelling of sex and gender in health datasets used in machine-learning. I’d highly recommend reading the paper, regardless of whether you work in health or machine-learning. This paper about database design in public services is also worth a look while you’re at it. I wrote a summary.
While the sex/gender paper doesn’t describe the issues in terms of design patterns, it’s largely a discussion of the impacts of specific data modelling decisions.
Some of these decisions are just poor. Capturing unnecessary personal data. Simplistic approaches to describing sex and gender.
Work on design patterns has long attempted to highlight poor designs. For example by describing “anti-patterns” or “deceptive design patterns” (don’t call them dark patterns).
But some of the design decisions highlighted in that research paper are more nuanced. Decisions which may have been justified within the scope of a specific system, where their limitations may be understood and minimised, but whose impacts are greatly amplified as data is lifted out of its original context and reused.
This means that there’s not a simple good vs bad decision to record as a pattern. We need an understanding of the potential consequences and harms as an integral part of any pattern.
Some pattern templates include sections for “resulting context” which can be used to capture side effects. But I think a more clearly labelled “harms” section might be better.
If you’ve seen good examples of design patterns that also discuss harms, I’d be interested to read them.
Posted at 16:05
I recently attended the launch event for the new Power To Change Community Tech Fund and have been reading through the essays and report on the Community Tech network website.
It’s great to see this topic getting some attention and much needed funding. It’s also prompted me to reflect a bit on my own experience with community lead projects.
For a few years I led Bath: Hacked, which was a community group and then a small CIC, run by volunteers, trying to foster use of open data by the local communities of Bath & North East Somerset.
We managed to support the council and others in publishing open data. And we ran a lot of events, meetups and hack days to support and encourage use of the data.
For example, Accessible Bath involved us mapping accessibility of shops, restaurants and other locations around the city. We consulted with local people who had mobility issues to identify some useful actions we could undertake, rather than just jumping into the technology. And in the end we worked largely within the technical infrastructure of other existing platforms rather than creating our own.
However, it’s Energy Sparks that has been the lasting legacy of Bath: Hacked.
There’s a blog post by the ODI that talks about the history, so I’m not going to write a full origin story. I just wanted to highlight the community tech aspects.
Energy Sparks was initially a local project. Philip, the founder, had already been working with local schools for some time through Transition Bath. He was using spreadsheets to analyse energy data for them and the council.
A Bath: Hacked hack day provided the opportunity for Philip to work with others from the local tech community to prototype an online service. We then took it from there to something that was eventually launched to a range of schools in B&NES.
The team was originally all local: Philip, the energy analyst, working with a mixture of developers, students, educators and others. A mixture of local expertise. The focus was very much on delivering benefits for our local schools.
This feels like it firmly fits within the definition of “community tech”.
My original idea was to offer Energy Sparks as an open source platform that other communities could deploy and use for themselves. With very limited funding, and with the initial team being largely volunteers, I wasn’t sure we could scale up the service to support other areas. Or even get access to the necessary data to make that possible.
So I was interested in exploring a more decentralised model that would allow other areas to run the service for themselves.
While I was working at the Open Data Institute I’d helped lead a short research project looking at how to scale local, open data enabled innovation. I’d seen so many interesting things created for local areas, but also felt frustrated when trying to replicate them locally. Recognising that others might build on your work, and planning for that to happen, seemed to be an important part of making that successful.
So I was keen to see if this model might work for Energy Sparks. Ultimately it didn’t. For several reasons.
Firstly, while we had interest from other areas, there wasn’t the technical capability necessary to actually launch and run a local version of the service. All of the code was available. It was possible to run it on free or very cheap infrastructure, but it still needed some technical skill to customise and deploy. And those who were interested in reusing the service weren’t necessarily technical, and didn’t necessarily have access to developers or a local community who could support them.
Secondly, it proved easier to secure funding to scale, than it was to help others secure the funding they needed locally. I think this is because funders seem mostly interested in scaling things up, rather than in seeing things replicated.
We’d pieced together funding (and a lot of volunteer effort) from the “ODI Summer Showcase”, Bath & West Community Energy, the Nature Save Trust and others to support the early development.
The scale up support came from the BEIS NDSEMIC innovation competition. That supported hiring a team and investing further in the technology. Allowing more of that spreadsheet based analysis to be turned into an automated tool.
There are clearly economies of scale to be had in scaling up. But it also has its own challenges. While it might not have been right for Energy Sparks, I still think there are benefits to be had from replicating technology locally and not just scaling upwards. It just needs a good alignment of motivation, funding and skills. Something I hadn’t fully appreciated at the outset.
It was through that BEIS funding and then later support from the Ovo Foundation, Centrica and more recently the Department for Education that we’re now able to offer a national service, which is free for state schools. Today we have a core team of eight people and a network of freelancers.
The code is still open source. But it’s now more of a transparency measure or insurance policy than an attempt to collaboratively build the technology.
Our shift to a national service means our notion of community has changed.
For example, we’re no longer primarily place-based. But there is still place-based activity and engagement through our partnership with Egni. Egni are running a package of educational and creative workshops with a range of schools in Wales. Energy Sparks is a core component of that. Together we’re able to deliver on our goals whilst supporting them in achieving their goals of creating more community-owned clean energy. Working together rather than in competition.
If Energy Sparks is a “platform”, this is the kind of platform we want it to be.
Our community is now the schools and teachers using Energy Sparks to tackle the energy crisis and educate young people about climate change. We’re creating steering groups in Wales and England to involve them more with our planning, whilst continuing to engage through existing networks of sustainability and climate groups.
I’m proud of having had the opportunity to be involved in the project and to continue to be part of the team working to deliver more impact through our work.
I’m looking forward to seeing what other projects spring up from the Power To Change fund.
Posted at 16:05
I’ve found my new favourite example of a well documented, tiny slice of data infrastructure.
I’m going to hazard a guess that it’s probably the simplest dataset that is designated as national statistics. If you can think of one simpler, then let me know.
It’s the weekly road fuel prices data on gov.uk.
This data has been updated every week, without fail, since September 2013.
The CSV has got seven columns in it. You can download it in XML if you want. Or you can grab the Excel which comes with some fancy clipart of petrol pumps.
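If you want to poke at it from the command line, something like this works (paste in the CSV link from the gov.uk page; I’ve left it as a placeholder here):
CSV_URL='<link to the latest CSV from the weekly road fuel prices page>'
curl -sL "$CSV_URL" -o fuel.csv
head -3 fuel.csv   # peek at the seven columns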
This is national statistics, so of course there’s a document that describes the methodology for how it is compiled. I’ve read it. At four pages, it’s short, clear and to the point.
There’s a bit more to it, but basically, every Monday someone at BEIS emails six companies and asks them for their prices. By the end of the day they put the responses in the spreadsheet and then it’s published on the Tuesday.
Simple.
Someone, or more likely some team, at BEIS has been doing that for at least nine years. No one has bothered to automate it away. Probably because it’s not a lot of effort to keep updating the spreadsheet.
If we were designing this from scratch, we’d probably immediately start thinking about services and APIs and data formats. But none of that is really needed.
It just needed a spreadsheet and a commitment to keep publishing the data.
That’s what makes it data infrastructure. The commitment, not the technology.
Posted at 16:05
This privacy notice went past in my twitter stream earlier.
It announces that the UK government is planning to create a new database that will hold some quite detailed data about every electricity meter in the UK. In particular, it’ll combine information about the meter, the energy consumption and billing details associated with that energy supply, and detailed information about the person paying those bills.
Apparently it’s intended to help support fraud prevention around the Energy Price Guarantee (EPG) Scheme.
Unsurprisingly there’s not a great deal of detail beyond a broad outline of the data to be aggregated. But it looks to be a database that will consist of data about all meters, not just smart meters. And, while it says “each and every electricity and gas meter”, I suspect they actually mean every domestic meter.
The database will also apparently contain data about electricity consumption. But not gas? (I suspect that’s an oversight). It’s unclear what granularity of consumption it’ll contain, but I’d hazard that it’ll be daily, monthly or quarterly rather than detailed half-hourly readings.
Reading the notice, my big unanswered question was: OK, but why build a new database?
Specifically, what are the technical requirements of the service to be built around that data, that means that it needs to be held in one big database?
The UK’s smart metering data infrastructure was designed to avoid having a single big database. So why do that here?
Is it really easier to merge and aggregate all this data into one pot than, say, carry out some kind of integration with the data already held in energy company systems?
It probably is easier to aggregate. And it’ll probably be easier to build a system around that, than a bunch of loosely joined parts.
But given the government’s own desire to have a “digital spine” to support sharing data across energy companies, or key elements of its own data strategy, shouldn’t it be considering all of the options?
And maybe it has. That’s the problem with privacy notices: they just give you the results of a decision. The decision to process some data. We get no insight into why it needs to be done in this specific way. Even though building trust starts with being transparent from the start.
I found myself thinking about planning notices rather than privacy notices. And then remembered that Dan Hon had written about this recently.
In a world where we have some increasingly sophisticated means of securely sharing data without it having to be moved around, tell me why you need to build another great big database rather than any of those other solutions.
Posted at 16:05
It has now been nine months since the initial public release of RDF.rb, our RDF library for Ruby, and today we're happy to announce the release of RDF.rb 0.3.0, a significant milestone.
As the changelog attests, this has been a long release cycle that incorporates 170 commits by 6 different authors. The major new features include transactions and basic graph pattern (BGP) queries, as well as the availability of robust and fast parser/serializer plugins for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing other already previously supported formats. In addition, many bugs have been fixed and general improvements, including significant performance improvements, have been implemented.
RDF.rb 0.3.0 is immediately available via RubyGems, and can be installed or upgraded to as follows on any Unix box with Ruby and RubyGems:
$ [sudo] gem install rdf
In all the code examples that follow, we will assume that RDF.rb and the built-in N-Triples parser have already been loaded up like so:
require 'rdf'
require 'rdf/ntriples'
# Enable facile references to standard vocabularies:
include RDF
While RDF.rb 0.3.0 continues with our minimalist policy of only supporting the N-Triples serialization format in the core library itself, support for every widely-used RDF serialization format is now available in the form of plugins.
Thanks to the hard work of Gregg Kellogg, the author of RdfContext, there are now RDF.rb 0.3.0-compatible plugins available for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing the already previously-available plugins for the RDF/JSON and TriX formats. See Gregg's blog post for more details on the particulars of these plugins.
We are also pleased to announce that Gregg has joined the RDF.rb core development team, which now consists of him, Ben Lavender, and myself. This merger between the RDF.rb and RdfContext efforts is a perfect match, given that Ben and I have been focused more on storing and querying RDF data while Gregg has been busy single-handedly solving all RDF serialization questions.
To facilitate typical Linked Data use cases, we now also provide a metadistribution of RDF.rb that includes a full set of parsing/serialization plugins; the following will install all of the rdf, rdf-isomorphic, rdf-json, rdf-n3, rdf-rdfa, rdf-rdfxml, and rdf-trix gems in one go:
$ [sudo] gem install linkeddata
Similarly, instead of loading up support for each RDF serialization format one at a time, you can simply use the following to load them all; this is helpful e.g. for the automatic selection of an appropriate parser plugin given a particular file name or extension:
require 'linkeddata'
For a tutorial introduction to RDF.rb's reader and writer APIs, please refer to my previous blog post Parsing and Serializing RDF Data with Ruby.
The query API in RDF.rb 0.3.0 now includes basic graph pattern (BGP) support, which has been a much-requested feature. BGP queries will already be a familiar concept to anyone using SPARQL, and in RDF.rb they are constructed and executed like this:
# Load some RDF.rb project information into an in-memory graph:
graph = RDF::Graph.load("http://rdf.rubyforge.org/doap.nt")
# Construct a BGP query for obtaining developers' names and e-mails:
query = RDF::Query.new({
  :person => {
    RDF.type  => FOAF.Person,
    FOAF.name => :name,
    FOAF.mbox => :email,
  }
})
# Execute the query on our in-memory graph, printing out solutions:
query.execute(graph).each do |solution|
puts "name=#{solution.name} email=#{solution.email}"
end
Executing a BGP query returns a solution sequence, encapsulated as an instance of the RDF::Query::Solutions class. Solution sequences provide a number of convenient methods for further narrowing down the returned solutions to what you're actually looking for:
# Filter solutions using a hash:
solutions.filter(:author => RDF::URI("http://ar.to/#self"))
solutions.filter(:author => "Arto Bendiken")
solutions.filter(:updated => RDF::Literal(Date.today))
# Filter solutions using a block:
solutions.filter { |solution| solution.author.literal? }
solutions.filter { |solution| solution.title =~ /^SPARQL/ }
solutions.filter { |solution| solution.price < 30.5 }
solutions.filter { |solution| solution.bound?(:date) }
solutions.filter { |solution| solution.age.datatype == XSD.integer }
solutions.filter { |solution| solution.name.language == :es }
# Reorder solutions based on a variable:
solutions.order_by(:updated)
solutions.order_by(:updated, :created)
# Select particular variables only:
solutions.select(:title)
solutions.select(:title, :description)
# Eliminate duplicate solutions:
solutions.distinct
# Limit the number of solutions:
solutions.offset(20).limit(10)
# Count the number of matching solutions:
solutions.count
solutions.count { |solution| solution.price < 30.5 }
BGP-capable storage adapters should override and implement the following RDF::Queryable method in order to provide storage-specific optimizations for BGP query evaluation:
class MyRepository < RDF::Repository
  def query_execute(query, &block)
    # ...
  end
end
The repository API in RDF.rb 0.3.0 now includes basic transaction support:
# Load some RDF.rb project information into an in-memory repository:
repository = RDF::Repository.load("http://rdf.rubyforge.org/doap.nt")
# Delete one statement and insert another, atomically:
repository.transaction do |tx|
  subject = RDF::URI('http://rubygems.org/gems/rdf')
  tx.delete [subject, DOAP.name, nil]
  tx.insert [subject, DOAP.name, "RDF.rb 0.3.0"]
end
As you would expect, if the transaction block raises an exception, the current transaction will be aborted and rolled back; otherwise, the transaction is automatically committed when the block returns.
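To make the rollback behaviour concrete, here's a small sketch (not from the original announcement) of a transaction aborted by an exception; it assumes the same require/include RDF setup as the earlier examples, and that the exception propagates out of the block:

require 'rdf'
require 'rdf/ntriples'
include RDF

repository = RDF::Repository.load("http://rdf.rubyforge.org/doap.nt")
before = repository.count

begin
  repository.transaction do |tx|
    # The delete is only buffered; raising below should abort the transaction.
    tx.delete [RDF::URI('http://rubygems.org/gems/rdf'), DOAP.name, nil]
    raise "never mind"
  end
rescue RuntimeError
  # Assuming the exception is re-raised to the caller, we end up here.
end

repository.count == before  #=> true, since the buffered delete was rolled back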
Transaction-capable storage adapters should override and implement the following three RDF::Repository methods:
class MyRepository < RDF::Repository
  def begin_transaction(context)
    # ...
  end

  def rollback_transaction(tx)
    # ...
  end

  def commit_transaction(tx)
    # ...
  end
end
The RDF::Transaction objects passed to these methods consist of a sequence of RDF statements to delete from, and a sequence of RDF statements to insert into, a given graph. The default transaction implementation in RDF::Repository simply builds up a transaction object in memory, buffering all inserts/deletes until the transaction is committed, at which point the operations are then executed against the repository.
Note that whether transactions are actually executed atomically depends on the particulars of the storage adapter you're using. For instance, the RDF::DataObjects plugin, which provides a storage adapter supporting SQLite, PostgreSQL, MySQL, and other RDBMS solutions, will certainly be able to offer ACID transaction support (although it has not been updated for that, or other 0.3.x features, just yet).
On the other hand, not all NoSQL solutions support transactions, so storage adapters for such solutions may choose to omit explicit transaction support and have it supplied by RDF.rb's default implementation.
In earlier RDF.rb releases, our focus was strongly centered on defining the core APIs that have enabled the thriving plugin ecosystem we can witness today. The focus was not so much, therefore, on the performance of the bundled default implementations of those APIs; in some cases, these implementations could have been described as being of only proof-of-concept quality.
In particular, the in-memory graph and repository implementations were suboptimal in RDF.rb 0.1.x, and only somewhat improved in 0.2.x. However, reflecting the increasing production-readiness of RDF.rb in general, matters have been much improved in RDF.rb 0.3.0.
Of course, performance improvements are an open-ended task, and I'm sure we'll see more work on this front in the future as need arises and time permits. But it's likely that RDF.rb 0.3.0 now offers a sufficient out-of-the-box performance level for many if not most common use cases.
Scalability has also been addressed by making use of enumerators throughout the APIs defined by RDF.rb. That means that all operations are generally performed in a streaming fashion, enabling you to build pipelines for hundreds of millions of RDF statements to flow through while still maintaining constant memory usage by ensuring that the statements are processed one by one.
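As a rough sketch of what that streaming style looks like in practice (the file name here is just a stand-in for any large N-Triples dump), statements can be counted or filtered one at a time without ever materializing the whole dataset:

require 'rdf'
require 'rdf/ntriples'

# Count foaf:name statements in a (hypothetical) large N-Triples file,
# processing one statement at a time so memory usage stays constant.
count = 0
RDF::Reader.open("huge-dataset.nt") do |reader|
  reader.each_statement do |statement|
    count += 1 if statement.predicate == RDF::FOAF.name
  end
end
puts "#{count} foaf:name statements"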
Lastly, RDF.rb 0.3.0 has been upgraded to use and depend on RSpec 2.x instead of the previous 1.3.x branch. This requires minor changes to the spec/spec_helper.rb file in any project that relies on the RDF::Spec library. The most minimal spec_helper.rb contents are now as follows:
require 'rdf/spec'
RSpec.configure do |config|
  config.include RDF::Spec::Matchers
end
In tandem with the soon-to-be 10,000 downloads of RDF.rb on RubyGems.org, a very positive sign of all the interest and ongoing work around RDF.rb is our growing contributor list. We thank everyone who has sent in bug reports, and in particular the following people who have contributed patches to RDF.rb and/or an RDF.rb plugin, in alphabetical order:
Călin Ardelean, Christoph Badura, John Fieber, Joey Geiger, James Hetherington, Gabriel Horner, Nicholas Humfrey, Fumihiro Kato, David Nielsen, Thamaraiselvan Poomalai, Keita Urashima, Pius Uzamere, and Hellekin O. Wolf.
(My apologies if I have inadvertently omitted anyone from the previous list, and please let me know about it.)
As always, if you have feedback regarding RDF.rb please contact us either privately or via the public-rdf-ruby@w3.org mailing list. Plain and simple bug reports, however, should preferably go directly to the issue queue on GitHub.
Be sure to follow @datagraph, @bendiken, @bhuga, and @gkellogg on Twitter for the latest updates on RDF.rb as they happen.
Posted at 16:05
I've just released Spira, a first draft of an RDF ORM, where the 'R' can mean RDF or Resource at your pleasure. It's an easy way to create Ruby objects out of RDF data. The name is from Latin, for 'breath of life'--it's time to give those resource URIs some character. It looks like this (feel free to copy-paste):
require 'spira'
require 'rdf/ntriples'
repo = "http://datagraph.org/jhacker/foaf.nt"
Spira.add_repository(:default, RDF::Repository.load(repo))
class Person
  include Spira::Resource
  property :name, :predicate => FOAF.name
  property :nick, :predicate => FOAF.nick
end
jhacker = RDF::URI("http://datagraph.org/jhacker/#self").as(Person)
jhacker.name #=> "J. Random Hacker"
jhacker.nick #=> "jhacker"
jhacker.name = "Some Other Hacker"
jhacker.save!
I try not to start new projects lightly. There's plenty of good stuff out there. But there wasn't quite what I wanted.
First of all, I want to program in Ruby, so it needed to be Ruby. Spira, while different, has a lot of overlap with a traditional ORM, and I was on the fence for a while about starting Spira or trying to implement things in DataMapper. There's already an RDF.rb backend for DataMapper, which is cool, but using it really cuts you off from RDF as RDF. It's more about making RDF work how DataMapper likes it. DataMapper's storage adapter interface is an implicit data model, one that is not RDF's, and it is not quite what I wanted.
On the RDF-specific front, there's ActiveRDF. ActiveRDF is based on SPARQL directly, and thus, while not hiding RDF from you, only gives you access via Redland. The Redland Ruby bindings have problems, and do not represent the entire RDF ecosystem. I wanted to start on something that completely abstracted away the data model, so I could focus on the problem at hand, which means RDF.rb. The difference is in allowing me to focus on what I'm focusing on: there exists a perfectly good, working SPARQL client storage adapter for RDF.rb, but it's one of many pluggable backends instead of a requirement.
Lastly, while both of those projects would represent a workable starting point, this was something of a journey of exploration in terms of semantics. Spira was going to be 'open model' from the start; I specifically wanted something that could read foreign data. By 'open model' I mean that Spira does not expect that a class definition is the authoritative, exclusive, or complete definition of a model class. That turns out to make Spira have some important semantic differences from ORMs oriented around object or relational databases. Stumbling on them was part of the fun, and even if I could have twisted DataMapper around the problem, I'm not sure that starting from there would have had me focusing on the core semantics.
So I decided to start something new. To be fair, Spira would suck a lot more were it not for the projects that came before it. In particular, it owes an intellectual debt to DataMapper, which has a generally sane model, readable code, and had to cover a lot of ground that any object-whatever-mapper would. It takes some digging, but as an example, one can find IRC logs where the DataMapper team discusses the ups and downs of identity map implementations in Ruby. That stuff is amazing to have available without spending hundreds of hours fighting it yourself, and again, it saves me a lot of trial and error on ancillary considerations.
Spira's core use case is allowing programmers to create Ruby objects representing an aspect of an RDF resource. I'm still working on which terminology I like best, but I am leaning towards calling instances of Spira classes 'a projection of a given RDF resource as a Spira resource.' In the simplest of terms, Spira tries to let you create classes that easily get and set values for properties that correspond to RDF predicates. The README will explain it better than I want to in this post (now available in Github and Yardoc flavors).
The hopeful end result is a way to access the RDF data model in a way that agile web programmers have come to expect, without forcing them to get bogged down into a world of ontologies, rule languages, inference levels, and lord knows what-all else. RDF has taken off in the enterprise because of power user features, and we're approaching a critical mass of RDFa publishing, but it's not yet on anyone's radar as a data model for their next weekend project. I think that's a shame--RDF's schema-free model should be the easiest thing in the world to get started on. So in addition to hopefully being an open-model ORM, here's hoping Spira is a step in the adoption of RDF as a day-to-day data model.
Any useful abstraction layer is about applying constraints. Normal ORMs hide the power of relational databases to make them into proper object databases. Spira constrains you to a particular aspect of a resource. That means that in the aspect of 'Person', a resource's name is a given predicate, and they only have one. A person might also have a label, multiple names, a comment, function as a category or tag, have friends, have accounts, have tons of other stuff, but if all you want is their name and age, you just want to say person.name and person.age. The goal here is to let you use data (or at least, to have defined behavior for data) that you cannot say for sure meets any sort of criteria you set in Spira. Spira will have defined behavior for when data does not match a model class, and will still let you use that data easily, pretending it came from a closed system. That's good enough surprisingly often.
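As a purely illustrative sketch (mirroring the class definition style of the opening example, and assuming the same repository setup), a class that only cares about two properties simply ignores everything else the data might say about the resource:

require 'spira'
require 'rdf/ntriples'

Spira.add_repository(:default, RDF::Repository.load("http://datagraph.org/jhacker/foaf.nt"))

# The 'Person' aspect below only knows about foaf:name and foaf:age;
# any other triples about the resource are not part of this projection.
class Person
  include Spira::Resource
  property :name, :predicate => FOAF.name
  property :age,  :predicate => FOAF.age
end

person = RDF::URI("http://datagraph.org/jhacker/#self").as(Person)
person.name  #=> the foaf:name from the data
person.age   #=> presumably nil if the data has no foaf:age triple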
That open-model part is where tough semantics come in. As an example, I had intended to publish, with Spira, a reference implementation of SIOC. The SIOC core classes are in widespread use, so surely this would find some use, I figured. But it's not so simple to make a reference implementation unless you limit your possibilities. For example, a SIOC post can have topics (a sub-class of dcterms:subject). These topics are RDF resources which may be one (or, I suppose, both, or neither) of two classes defined in the SIOC types ontology, Category or Tag. These two classes have completely different semantics. Now, a Spira class could be created to deal with either of them, but to use that class usefully, you'd always be checking what it is, since the semantics are different. Spira will eventually have helpers to help you decide what to do here, but the point is that in RDF, a 'reference implementation' often doesn't make sense as a concept. However, this is at least in principle representable in Spira--I'm not sure it could be done in a traditional ORM, as it doesn't really match the single-table inheritance model.
Instead, I hope Spira classes are simple enough--throw away, even--that you can define them when you need them. Indeed, defining them programmatically is obvious with the framework in place; I just haven't done it yet.
Another example of differing semantics would be instance creation. An RDF resource does not 'exist or not'. It's either the subject of triples or not. So what would it mean to create an instance of a Spira resource and save it when it had no fields? Would one save a triple declaring the resource to be an RDF resource? How about saving the RDF type, should that happen if one has not saved fields? There are good arguments for several options. It's just not the same model as the 'find, create, find_or_create' trio of constructors that the world has grown used to, since the identifiers are global and always exist. Primary keys do not come into existence to allow reference to an object, the key is the object. I dodged the question and now do construction based on RDF::URIs.
Instantiation looks either like this:
RDF::URI("http://example.org/bob").as(Person)
or like this:
Person.for(RDF::URI("http://example.org/bob"))
There's no finding or creating. Resources just are. Creating a Spira object is creating the projection of that resource as a class. If you've told Spira about a repository where some information about that resource may or may not exist, great, but it's not required.
As another example, I see a lot of need for validations on creating an instance, not just saving one, as in traditional ORMs. RDF is not like the data fed to a traditional ORM, which is generally created by that ORM or by a known list of applications, managed by a set of hard constraints and schema. RDF data is often found, and used, in the wild.
There's still a ton left to do, but lots of stuff already works. The README has a good rundown of where things stand. I'd enumerate the to-do list, but I'd rather not feed that to Google, and it's long enough anyway that if certain deficiencies quickly become obvious, I'd attack them first.
Anyways, hope someone has fun with it. gem install spira are the magic words. If you want to spoil the magic, the code is on Github.
The original version of this post used the term 'Open World' instead of 'Open Model' willy-nilly throughout, but I was corrected from using the term outside its strict meaning in terms of inference. See the comments. If a term exists for what I'm describing at this level of abstraction, I'm all ears.
Posted at 16:05
This started out as an answer at Semantic Overflow on how RDF database systems differ from other currently available NoSQL solutions. I've here expanded the answer somewhat and added some general-audience context.
RDF database systems are the only standardized NoSQL solutions available at the moment, being built on a simple, uniform data model and a powerful, declarative query language. These systems offer data portability and toolchain interoperability among the dozens of competing implementations that are available at present, avoiding any need to bet the farm on a particular product or vendor.
In case you're not familiar with the term, NoSQL ("Not only SQL") is a loosely-defined umbrella moniker for describing the new generation of non-relational database systems that have sprung up in the last several years. These systems tend to be inherently distributed, schema-less, and horizontally scalable. Present-day NoSQL solutions can be broadly categorized into four groups:
Key-value databases are familiar to anyone who has worked with the likes of the venerable Berkeley DB. These systems are about as simple as databases get, being in essence variations on the theme of a persistent hash table. Current examples include MemcacheDB, Tokyo Cabinet, Redis and SimpleDB.
Document databases are key-value stores that treat stored values as semi-structured data instead of as opaque blobs. Prominent examples at the moment include CouchDB, MongoDB and Riak.
Wide-column databases tend to draw inspiration from Google's BigTable model. Open-source examples include Cassandra, HBase and Hypertable.
Graph databases include generic solutions like Neo4j, InfoGrid and HyperGraphDB as well as all the numerous RDF-centric solutions out there: AllegroGraph, 4store, Virtuoso, and many, many others.
RDF database systems form the largest subset of this last NoSQL category. RDF data can be thought of in terms of a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values. Other equally valid ways to understand RDF data include the resource-centric approach (which maps well to object-oriented programming paradigms and to RESTful architectures) and the statement-centric view (the object-attribute-value or EAV model).
Without just now extolling too much the virtues of RDF as a particular data model, the key differentiator here is that RDF database systems embrace and build upon W3C's Linked Data technology stack and are the only standardized NoSQL solutions available at the moment. This means that RDF-based solutions, when compared to run-of-the-mill NoSQL database systems, have benefits such as the following:
A simple and uniform standard data model. NoSQL databases typically have one-off, ad-hoc data models and capabilities designed specifically for each implementation in question. As a rule, these data models are neither interoperable nor standardized. Take e.g. Cassandra, which has a somewhat baroque data model that "can most easily be thought of as a four or five dimensional hash" and the specifics of which are described in a wiki page, blog posts here and there, and ultimately only nailed down in version-specific API documentation and the code base itself. Compare to RDF database systems that all share the same well-specified and W3C-standardized data model at their base.
A powerful standard query language. NoSQL databases typically do not provide any high-level declarative query language equivalent of SQL. Querying these databases is a programmatic data-model-specific, language-specific and even application-specific affair. Where query languages do exist, they are entirely implementation-specific (think SimpleDB or GQL). SPARQL is a very big win for RDF databases here, providing a standardized and interoperable query language that even non-programmers can make use of, and one which meets or exceeds SQL in its capabilities and power while retaining much of the familiar syntax.
Standardized data interchange formats. RDBMSes have (somewhat implementation-specific) SQL dumps, and some NoSQL databases have import/export capability from/to implementation-specific structures expressed in an XML or JSON format. RDF databases, by contrast, all have import/export capability based on well-defined, standardized, entirely implementation-agnostic serialization formats such as N-Triples and N-Quads.
From the preceding points it follows that RDF-based NoSQL solutions enjoy some very concrete advantages such as:
Data portability. Should you need to switch between competing database systems in-house, to make use of multiple different solutions concurrently, or to share data with external parties, your data travels with you without needing to write and utilize any custom glue code for converting some ad-hoc export format and data structure into some other incompatible ad-hoc import format and data structure.
Toolchain interoperability. The RDBMS world has its various database abstraction layers, but the very concept is nonsensical for NoSQL solutions in general (see "ad-hoc data model"). RDF solutions, however, represent a special case: libraries and toolchains for RDF are typically only loosely coupled to any particular DBMS implementation. Learn to use and program with Jena or Sesame for Java and Scala, RDFLib for Python, or RDF.rb for Ruby, and it generally doesn't matter which particular RDF-based system you are accessing. Just as with RDBMS-based database abstraction layers, your RDF-based code does not need to change merely because you wish to do the equivalent of switching from MySQL to PostgreSQL.
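For instance, here's a minimal RDF.rb sketch of that point; the commented-out line stands in for a hypothetical third-party storage adapter, and swapping it in is, ideally, the only change the application code would need:

require 'rdf'
require 'rdf/ntriples'

repository = RDF::Repository.new                               # in-memory reference implementation
# repository = SomeVendor::Repository.new("http://db.local/")  # hypothetical drop-in backend

repository.load("http://datagraph.org/jhacker/foaf.nt")
repository.query(:predicate => RDF::FOAF.name).each do |statement|
  puts statement.object
end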
No vendor or product lock-in. If the RDF database solution A was easy to get going with but eventually for some reason hits a brick wall, just switch to RDF database solution B or C or any other of the many available interoperable solutions. Unlike switching between two non-RDF solutions, this does not have to be a big deal. Needless to say there are also ecosystem benefits with regards to the available talent pool and the commercial support options.
Future proof. With RDF now emerging as the definitive standard for publishing Linked Data on the web, and being entirely built on top of indelibly-established lower-level standards like URIs, it's not an unreasonable bet that your RDF data will still be usable as-is by, say, 2038. It's not at all evident, however, that the same could be asserted for any of the other NoSQL solutions out there at the moment, many of which will inevitably prove to be rather short-lived in the big picture.
RDF-based systems also offer unique advantages such as support for globally-addressable row identifiers and property names, web-wide decentralized and dynamic schemas, data modeling standards and tooling for creating and publishing such schemas, metastandards for being able to declaratively specify that one piece of information entails another, and inference engines that implement such data transformation rules.
All these features are mainly due to the characteristics and capabilities of RDF's data model, though, and have already been amply described elsewhere, so I won't go further into them just here and now. If you wish to learn more about RDF in general, a great place to start would be the excellent RDF in Depth tutorial by Joshua Tauberer.
And should you be interested in the growing intersection between the NoSQL and Linked Data communities, you will be certain to enjoy the recording of Sandro Hawke's presentation Toward Standards for NoSQL (slides, blog post) at the NoSQL Live in Boston conference in March 2010.
Posted at 16:05
In this tutorial we'll learn how to parse and serialize RDF data using the RDF.rb library for Ruby. There exist a number of Linked Data serialization formats based on RDF, and you can use most of them with RDF.rb.
To follow along and try out the code examples in this tutorial, you need only a computer with Ruby and RubyGems installed. Any recent Ruby 1.8.x or 1.9.x version will do fine, as will JRuby 1.4.0 or newer.
These are the RDF serialization formats that you can parse and serialize with RDF.rb at present:
Format | Implementation | RubyGems gem
------------|-----------------------|-------------
N-Triples | RDF::NTriples | rdf
Turtle | RDF::Raptor::Turtle | rdf-raptor
RDF/XML | RDF::Raptor::RDFXML | rdf-raptor
RDFa | RDF::Raptor::RDFa | rdf-raptor
RDF/JSON | RDF::JSON | rdf-json
TriX | RDF::TriX | rdf-trix
RDF.rb in and of itself is a relatively lightweight gem that includes built-in support only for the N-Triples format. Support for the other listed formats is available through add-on plugins such as RDF::Raptor, RDF::JSON and RDF::TriX, each one packaged as a separate gem. This approach keeps the core library fleet on its metaphorical feet and avoids introducing any XML or JSON parser dependencies for RDF.rb itself.
Installing support for all these formats in one go is easy enough:
$ sudo gem install rdf rdf-raptor rdf-json rdf-trix
Successfully installed rdf-0.1.9
Successfully installed rdf-raptor-0.2.1
Successfully installed rdf-json-0.1.0
Successfully installed rdf-trix-0.0.3
4 gems installed
Note that the RDF::Raptor gem requires that the Raptor RDF Parser library and command-line tools be available on the system where it is used. Here follow quick and easy Raptor installation instructions for the Mac and the most common Linux and BSD distributions:
$ sudo port install raptor # Mac OS X with MacPorts
$ sudo fink install raptor-bin # Mac OS X with Fink
$ sudo aptitude install raptor-utils # Ubuntu / Debian
$ sudo yum install raptor # Fedora / CentOS / RHEL
$ sudo zypper install raptor # openSUSE
$ sudo emerge raptor # Gentoo Linux
$ sudo pkg_add -r raptor # FreeBSD
$ sudo pkg_add raptor # OpenBSD / NetBSD
For more information on installing and using Raptor, see our previous tutorial RDF for Intrepid Unix Hackers: Transmuting N-Triples.
If you're in a hurry and just want to get to consuming RDF data right away, the following is really the only thing you need to know:
require 'rdf'
require 'rdf/ntriples'
graph = RDF::Graph.load("http://datagraph.org/jhacker/foaf.nt")
In this example, we first load up RDF.rb as well as support for the N-Triples format. After that, we use a convenience method on the RDF::Graph class to fetch and parse RDF data directly from a web URL in one go. (The load method can take either a file name or a URL.)
All RDF.rb parser plugins declare which MIME content types and file extensions they are capable of handling, which is why in the above example RDF.rb knows how to instantiate an N-Triples parser to read the foaf.nt file at the given URL.
In the same way, RDF.rb will auto-detect any other RDF file formats as long as you've loaded up support for them using one or more of the following:
require 'rdf/ntriples' # Support for N-Triples (.nt)
require 'rdf/raptor' # Support for RDF/XML (.rdf) and Turtle (.ttl)
require 'rdf/json' # Support for RDF/JSON (.json)
require 'rdf/trix' # Support for TriX (.xml)
Note that if you need to read RDF files containing multiple named graphs (in a serialization format that supports named graphs, such as TriX), you probably want to be using RDF::Repository instead of RDF::Graph:
repository = RDF::Repository.load("http://datagraph.org/jhacker/foaf.nt")
The difference between the two is that RDF statements in RDF::Repository instances can contain an optional context (i.e. they can be quads), whereas statements in an RDF::Graph instance always have the same context (i.e. they are triples). In other words, repositories contain one or more graphs, which you can access as follows:
repository.each_graph do |graph|
  puts graph.inspect
end
RDF.rb's parsing and serialization APIs are based on the following three base classes:
RDF::Format is used to describe particular RDF serialization formats.
RDF::Reader is the base class for RDF parser implementations.
RDF::Writer is the base class for RDF serializer implementations.
If you know something about the file format you want to parse or serialize, you can obtain a format specifier class for it in any of the following ways:
require 'rdf/raptor'
RDF::Format.for(:rdfxml) #=> RDF::Raptor::RDFXML::Format
RDF::Format.for("input.rdf")
RDF::Format.for(:file_name => "input.rdf")
RDF::Format.for(:file_extension => "rdf")
RDF::Format.for(:content_type => "application/rdf+xml")
Once you have such a format specifier class, you can then obtain the parser/serializer implementations for it as follows:
format = RDF::Format.for("input.nt") #=> RDF::NTriples::Format
reader = format.reader #=> RDF::NTriples::Reader
writer = format.writer #=> RDF::NTriples::Writer
There also exist corresponding factory methods on RDF::Reader and RDF::Writer directly:
reader = RDF::Reader.for("input.nt") #=> RDF::NTriples::Reader
writer = RDF::Writer.for("output.nt") #=> RDF::NTriples::Writer
The above is what RDF.rb relies on internally to obtain the correct parser implementation when you pass in a URL or file name to RDF::Graph.load -- or indeed to any other method that needs to auto-detect a serialization format and to delegate responsibility for parsing/serialization to the appropriate implementation class.
If you need to be more explicit about parsing RDF data, for instance because the dataset won't fit into memory and you wish to process it statement by statement, you'll need to use RDF::Reader directly.
RDF parser implementations generally support a streaming-compatible subset of the RDF::Enumerable interface, all of which is based on the #each_statement method. Here's how to read in an RDF file enumerated statement by statement:
require 'rdf/raptor'
RDF::Reader.open("foaf.rdf") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end
Using RDF::Reader.open with a Ruby block ensures that the input file is automatically closed after you're done with it.
As before, you can generally use an http:// or https:// URL anywhere that you could use a file name:
require 'rdf/json'
RDF::Reader.open("http://datagraph.org/jhacker/foaf.json") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end
Sometimes you already have the serialized RDF contents in a memory buffer somewhere, for example as retrieved from a database. In such a case, you'll want to obtain the parser implementation class as shown before, and then use RDF::Reader.new directly:
require 'rdf/ntriples'
input = open('http://datagraph.org/jhacker/foaf.nt').read
RDF::Reader.for(:ntriples).new(input) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end
The RDF::Reader constructor uses duck typing and accepts any input (for example, IO or StringIO objects) that responds to the #readline method. If no input argument is given, input data will by default be read from the standard input.
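For example, here's a small sketch of feeding the parser from a StringIO buffer instead of a file or URL (the triple itself is made up for illustration):

require 'stringio'
require 'rdf'
require 'rdf/ntriples'

data = StringIO.new('<http://example.org/s> <http://example.org/p> "o" .' + "\n")
RDF::Reader.for(:ntriples).new(data) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end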
Serializing RDF data works much the same way as parsing: when serializing to a named output file, the correct serializer implementation is auto-detected based on the given file extension.
RDF serializer implementations generally support an append-only subset of the RDF::Mutable interface, primarily the #insert method and its alias #<<. Here's how to write out an RDF file statement by statement:
require 'rdf/ntriples'
require 'rdf/raptor'
data = RDF::Graph.load("http://datagraph.org/jhacker/foaf.nt")
RDF::Writer.open("output.rdf") do |writer|
  data.each_statement do |statement|
    writer << statement
  end
end
Once again, using RDF::Writer.open with a Ruby block ensures that the output file is automatically flushed and closed after you're done writing to it.
A common use case is serializing an RDF graph into a string buffer, for example when serving RDF data from a Rails application. RDF::Writer has a convenience buffer class method that builds up output in a StringIO under the covers and then returns a string when all is said and done:
require 'rdf/ntriples'
output = RDF::Writer.for(:ntriples).buffer do |writer|
  subject = RDF::Node.new
  writer << [subject, RDF.type, RDF::FOAF.Person]
  writer << [subject, RDF::FOAF.name, "J. Random Hacker"]
  writer << [subject, RDF::FOAF.mbox, RDF::URI("mailto:jhacker@example.org")]
  writer << [subject, RDF::FOAF.nick, "jhacker"]
end
If a particular serializer implementation supports options such as namespace prefix declarations or a base URI, you can pass in those options to RDF::Writer.open or RDF::Writer.new as keyword arguments:
RDF::Writer.open("output.ttl", :base_uri => "http://rdf.rubyforge.org/")
RDF::Writer.for(:rdfxml).new($stdout, :base_uri => "http://rdf.rubyforge.org/")
That's all for now, folks. For more information on the APIs touched upon in this tutorial, please refer to the RDF.rb API documentation. If you have any questions, don't hesitate to ask for help on #swig or the public-rdf-ruby@w3.org mailing list.
Posted at 16:05
RDF.rb is approaching two thousand downloads on RubyGems, and while it has good documentation it could still use some more tutorials. I recently needed to get RDF.rb working with a PostgreSQL storage backend in order to work with RDF data in a Rails 3.0 application hosted on Heroku. I thought I'd keep track of what I did so that I could discuss the notable parts.
In this tutorial we'll be implementing an RDF.rb storage adapter called RDF::DataObjects::Repository, which is a simplified version of what I eventually ended up with. If you want the real thing, check it out on GitHub and read the docs. This tutorial will only cover the SQLite backend and won't concern itself with database indexes, performance tweaks, or any other distractions from the essential RDF.rb interfaces we'll focus on. There's a copy of the simplified code used in the tutorial at the tutorial's project page. And should you be inspired to build something similar of your own, I have set up an RDF.rb storage adapter skeleton at GitHub. Click fork, grep for lines containing a TODO comment, and dive right in.
I'll mention, briefly, that I chose DataObjects as the database abstraction layer, but I don't want to dwell on that -- this post is about RDF. DataObjects is just a way to use common methods to talk to different databases at the SQL level. It's a leaky abstraction, because we'll want to be using some SQL constraints to enforce statement uniqueness but those constraints need to be done differently for different databases. That means we still have to get down to the level of database-specific SQL, distasteful as that may be in this day and age. However, given that I wanted to be able to target PostgreSQL and SQLite both, DataObjects is still helpful.
You just need a few gems for the example repository. This ought to get you going. Even if you have these, make sure you have the latest; RDF.rb gets updated frequently.
$ sudo gem install rdf rdf-spec rspec do_sqlite3
So where do we start? Tests, of course. RDF.rb has factored out its mixin specs to the RDF::Spec gem, which provides the RSpec shared example groups that are also used by RDF.rb for its own tests. Thus, here is the complete spec file for the in-memory reference implementation of RDF::Repository:
require File.join(File.dirname(__FILE__), 'spec_helper')
require 'rdf/spec/repository'
describe RDF::Repository do
  before :each do
    @repository = RDF::Repository.new
  end

  # @see lib/rdf/spec/repository.rb
  it_should_behave_like RDF_Repository
end
If you haven't seen something like this before, that's an RSpec shared example group, and it's awesome. Anything can use the same specs as RDF.rb itself to verify that it conforms to the interfaces defined by RDF.rb, and that's exactly what we'll be doing in this tutorial. Let's implement that for our repository:
# spec/sqlite3.spec
$:.unshift File.dirname(__FILE__) + "/../lib/"
require 'rdf'
require 'rdf/do'
require 'rdf/spec/repository'
require 'do_sqlite3'
describe RDF::DataObjects::Repository do
context "The SQLite adapter" do
before :each do
@repository = RDF::DataObjects::Repository.new "sqlite3::memory:"
end
after :each do
# DataObjects pools connections, and only allows 8 at once. We have
# more than 60 tests.
DataObjects::Sqlite3::Connection.__pools.clear
end
it_should_behave_like RDF_Repository
end
end
If you're new to RSpec, run the tests with the spec command:
$ spec -cfn spec/sqlite3.spec
These fail miserably right now, of course, since we don't have an implementation. So let's make one.
RDF.rb's interface for an RDF store is RDF::Repository. That interface is itself composed of a number of mixins: RDF::Enumerable, RDF::Queryable, RDF::Mutable, and RDF::Durable. RDF::Queryable has a base implementation that works on anything which implements RDF::Enumerable. And RDF::Durable only provides boolean methods for clients to ask if it is durable? or not; the default is that a repository reports that it is indeed durable, so we don't need to do anything there.
The takeaway is that to create an RDF.rb storage adapter, we need to implement RDF::Enumerable and RDF::Mutable, and the rest will fall into place. Indeed, the reference implementation is little more than an array which implements these interfaces.
It turns out we can get away with just three methods to implement those two interfaces: RDF::Enumerable#each, RDF::Mutable#insert_statement, and RDF::Mutable#delete_statement. The default implementations will use these to build up any missing methods. That means we need to implement those first, so that we have a base to pass our tests. Then we can iterate further, replacing methods which iterate over every statement with methods more appropriate for our backend.
Here's a repository which doesn't implement much more than those three methods. We'll use it as a starting point.
# lib/rdf/do.rb
require 'rdf'
require 'rdf/ntriples'
require 'data_objects'
require 'do_sqlite3'
require 'enumerator'
module RDF
  module DataObjects
    class Repository < ::RDF::Repository
      def initialize(options)
        @db = ::DataObjects::Connection.new(options)
        exec('CREATE TABLE IF NOT EXISTS quads (
              `subject` varchar(255),
              `predicate` varchar(255),
              `object` varchar(255),
              `context` varchar(255),
              UNIQUE (`subject`, `predicate`, `object`, `context`))')
      end

      # @see RDF::Enumerable#each.
      def each(&block)
        if block_given?
          reader = result('SELECT * FROM quads')
          while reader.next!
            block.call(RDF::Statement.new(
              :subject   => unserialize(reader.values[0]),
              :predicate => unserialize(reader.values[1]),
              :object    => unserialize(reader.values[2]),
              :context   => unserialize(reader.values[3])))
          end
        else
          ::Enumerable::Enumerator.new(self, :each)
        end
      end

      # @see RDF::Mutable#insert_statement
      def insert_statement(statement)
        sql = 'REPLACE INTO `quads` (subject, predicate, object, context) VALUES (?, ?, ?, ?)'
        exec(sql, serialize(statement.subject), serialize(statement.predicate),
             serialize(statement.object), serialize(statement.context))
      end

      # @see RDF::Mutable#delete_statement
      def delete_statement(statement)
        sql = 'DELETE FROM `quads` where (subject = ? AND predicate = ? AND object = ? AND context = ?)'
        exec(sql, serialize(statement.subject), serialize(statement.predicate),
             serialize(statement.object), serialize(statement.context))
      end

      ## These are simple helpers to serialize and unserialize component
      #  fields. We use an explicit empty string for null values for clarity in
      #  this example; we cannot use NULL, as SQLite considers NULLs as
      #  distinct from each other when using the uniqueness constraint we
      #  added when we created the table. It would let us insert duplicates
      #  with a NULL context.
      def serialize(value)
        RDF::NTriples::Writer.serialize(value) || ''
      end

      def unserialize(value)
        value == '' ? nil : RDF::NTriples::Reader.unserialize(value)
      end

      ## These are simple helpers for DataObjects
      def exec(sql, *args)
        @db.create_command(sql).execute_non_query(*args)
      end

      def result(sql, *args)
        @db.create_command(sql).execute_reader(*args)
      end
    end
  end
end
And we have a repository. Poof, done, that's it. You can get a copy of this intermediate repository at the tutorial page and run the specs for yourself. It's not very efficient for SQL yet, but this is all it takes, strictly speaking.
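For a quick sanity check outside the specs, here's a hedged usage sketch of the adapter as it stands, reusing the example FOAF file from the earlier tutorials:

require 'rdf'
require 'rdf/do'
require 'do_sqlite3'

repo = RDF::DataObjects::Repository.new "sqlite3::memory:"
repo.load('http://datagraph.org/jhacker/foaf.nt')   # provided by RDF::Mutable
puts repo.count                                     # still enumerates every statement for now
repo.each_statement { |statement| puts statement.inspect }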
Since they are so important, the three main methods deserve a little more attention:
each
Each is the only thing we have to implement to get information out after we've put it in. RDF::Enumerable will provide us tons of things like each_subject, has_subject?, each_predicate, has_predicate?, etc. If you were watching the spec output, you'll notice we ran tests for RDF::Queryable. The default implementation will use RDF::Enumerable's methods to implement basic querying. This means we can already do things like:
# Note that #load actually comes from insert_statement, see below
repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.query(:subject => RDF::URI.new('http://datagraph.org/jhacker/foaf'))
#=> RDF::Enumerable of statements with given URI as subject
Note that if a block is not sent, it's defined to return an Enumerable::Enumerator. RDF::Queryable, which defines #query, is probably the thing we can improve the most on with SQL as opposed to the reference implementation. We'll revisit it below.
insert_statement
insert_statement inserts an RDF::Statement into the repository. It's pretty straightforward. It gives us access to default implementations of things like RDF::Mutable#load, which will load a file by name or import a remote resource:
repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.count
#=> 10
delete_statement
delete_statement deletes an RDF::Statement. Again, it's straightforward, and it's used to implement things like RDF::Mutable#clear, which empties the repository:
repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.clear
repo.count
#=> 0
Since we already have a nice test suite that we can pass, we can add functionality incrementally. For example, let's implement RDF::Enumerable#count in a fashion that does not require us to enumerate each statement, which is clearly not ideal for a SQL-based system:
# lib/rdf/do.rb
def count
  result = result('SELECT COUNT(*) FROM quads')
  result.next!
  result.values.first
end
The tests still pass, we can move on. Wash, rinse, repeat; probably every method in RDF::Enumerable and RDF::Mutable can be done more efficiently with SQL.
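For example, here are two more overrides in the same spirit; these are sketches rather than part of the original tutorial code, with #empty? reusing the #count we just wrote and #has_subject? being a hypothetical push-down into SQL using the helpers defined above:

# lib/rdf/do.rb (inside RDF::DataObjects::Repository)
def empty?
  count == 0
end

def has_subject?(subject)
  reader = result('SELECT 1 FROM quads WHERE subject = ? LIMIT 1', serialize(subject))
  !!reader.next!
end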
RDF::Queryable
RDF::Queryable is worth mentioning on its own, because the interface takes a lot of options. Specifically, it can take a Hash, a smashed Array, an RDF::Statement, or a Query object. Fortunately, we can call super to defer to the reference implementation if we get arguments we don't understand, so we can again be iterative here.
We can start by implementing the hash version, which is the most convenient for doing the actual SQL query later. The hash version takes a hash which may have keys for :subject, :predicate, :object, and :context, and returns an RDF::Enumerable which contains all statements matching those parameters:
# lib/rdf/do.rb
def query(pattern, &block)
  case pattern
  when Hash
    statements = []
    reader = query_hash(pattern)
    while reader.next!
      statements << RDF::Statement.new(
        :subject   => unserialize(reader.values[0]),
        :predicate => unserialize(reader.values[1]),
        :object    => unserialize(reader.values[2]),
        :context   => unserialize(reader.values[3]))
    end
    case block_given?
    when true
      statements.each(&block)
    else
      statements.extend(RDF::Enumerable, RDF::Queryable)
    end
  else
    super(pattern)
  end
end

def query_hash(hash)
  conditions = []
  params = []
  [:subject, :predicate, :object, :context].each do |resource|
    unless hash[resource].nil?
      conditions << "#{resource.to_s} = ?"
      params << serialize(hash[resource])
    end
  end
  where = conditions.empty? ? "" : "WHERE "
  where << conditions.join(' AND ')
  result('SELECT * FROM quads ' + where, *params)
end
Our specs still pass. Note this trick:
statements.extend(RDF::Enumerable, RDF::Queryable)
RDF::Queryable is defined to return something which implements RDF::Enumerable and RDF::Queryable. Since the only thing we need to implement RDF::Enumerable is #each, and Array already implements that, we can simply extend this Array instance with the mixins and return it.
Note also that while we have taken care of the hard part, we're still calling the reference implementation if we don't know how to handle our arguments. Now we can start adding those other query arguments:
# lib/rdf/do.rb
def query(pattern, &block)
  case pattern
  when RDF::Statement
    query(pattern.to_hash)
  when Array
    query(RDF::Statement.new(*pattern))
  when Hash
    .
    .
    .
Our specs still pass! Moving on, there's a lot more we can implement. And once we have implemented it in a straightforward way, we can still implement things like multiple inserts, paging, and more, all transparent to the user. You can see the full list of methods to implement in the docs, but don't be afraid to dive into the code.
If you do, don't forget that RDF.rb is completely public domain, so if you want to copy-paste to bootstrap your implementation, feel free.
Hopefully this is enough to get you started. Remember, the code is at the tutorial page, and don't forget to check out the storage adapter skeleton. The RDF.rb documentation has a lot of information on the APIs you'll be using.
And last but not least, a good place to ask questions or leave a comment is on the W3C RDF-Ruby mailing list.
Posted at 16:05
This is the second part in an ongoing RDF for Intrepid Unix Hackers article series. In the previous part, we learned how to process RDF data in the line-oriented, whitespace-separated N-Triples serialization format by pipelining standard Unix tools such as grep and awk.
That was all well and good, but what to do if your RDF data isn't already in N-Triples format? Today we'll see how to install and use the excellent Raptor RDF Parser Library to convert RDF from one serialization format to another.
The Raptor toolkit includes a handy command-line utility called rapper, which can be used to convert RDF data between most of the various popular RDF serialization formats.
Installing Raptor is straightforward on most development and deployment platforms; here's how to install Raptor on Mac OS X with MacPorts and on any of the most common Linux and BSD distributions:
$ [sudo] port install raptor # Mac OS X with MacPorts
$ [sudo] fink install raptor-bin # Mac OS X with Fink
$ [sudo] aptitude install raptor-utils # Ubuntu / Debian
$ [sudo] yum install raptor # Fedora / CentOS / RHEL
$ [sudo] zypper install raptor # openSUSE
$ [sudo] emerge raptor # Gentoo Linux
$ [sudo] pacman -S raptor # Arch Linux
$ [sudo] pkg_add -r raptor # FreeBSD
$ [sudo] pkg_add raptor # OpenBSD / NetBSD
The subsequent examples all assume that you have successfully installed Raptor and thus have the rapper utility available in your $PATH. To make sure that rapper is indeed available, just ask it to output its version number as follows:
$ rapper --version
1.4.21
We'll be using version 1.4.21 for this tutorial, but any 1.4.x release from 1.4.5 onwards should do fine for present purposes -- so don't worry if your distribution provides a slightly older version.
Should you have any trouble getting rapper set up, you can ask for help on the #swig channel on IRC or on the Raptor mailing list.
RDF/XML is the standard RDF serialization specified by W3C back before the dot-com bust. Despite some newer, more human-friendly formats, a great deal of the RDF data out there in the wild is still made available in this format.
For example, every valid RSS 1.0-compatible feed is, in principle, also a valid RDF/XML document (but note that the same is not true for non-RDF formats like RSS 2.0 or Atom). So, let's grab the RSS feed for this blog and define a Bash shell alias for converting RDF/XML into N-Triples using rapper:
$ alias rdf2nt="rapper -i rdfxml -o ntriples"
$ curl http://blog.datagraph.org/index.rss > index.rdf
$ rdf2nt index.rdf > index.nt
rapper: Parsing URI file://index.rdf with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples
Pretty easy, huh? It gets even easier, because rapper actually supports fetching URLs directly. Typically Raptor is built with libcurl support, so it supports the same set of URL schemes as does the curl command itself. This means that e.g. any http://, https:// and ftp:// input arguments will work right out of the box, so that we can combine our previous two commands as follows:
$ rdf2nt http://blog.datagraph.org/index.rss > index.nt
rapper: Parsing URI http://blog.datagraph.org/index.rss with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples
After RDF/XML, Turtle is probably the most widespread RDF format out there. It is a subset of Notation3 and a superset of N-Triples, hitting a sweet spot for both expressiveness and conciseness. It is also much more pleasant to write by hand than XML, so personal FOAF files in particular tend to be authored in Turtle and then converted, e.g. using rapper, into a variety of formats when published on the Linked Data web.
For this next example, let's grab my FOAF file in Turtle format and convert it into N-Triples:
$ alias ttl2nt="rapper -i turtle -o ntriples"
$ ttl2nt http://datagraph.org/bendiken/foaf.ttl > foaf.nt
rapper: Parsing URI http://datagraph.org/bendiken/foaf.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Parsing returned 16 triples
Just as easy as with RDF/XML. And you'll notice that this time around we did the downloading and the conversion in a single step by letting rapper worry about fetching the data directly from the URL in question.
Conversely, you can of course also use rapper to convert any N-Triples input data into other RDF serialization formats such as Turtle, RDF/XML and RDF/JSON. You need only swap the arguments to the -i and -o options and you're good to go.
So, let's define a couple more handy aliases:
$ alias nt2ttl="rapper -i ntriples -o turtle"
$ alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev"
$ alias nt2json="rapper -i ntriples -o json"
Now we can quickly and easily convert any N-Triples data into other RDF formats:
$ nt2ttl index.nt > index.ttl
$ nt2rdf index.nt > index.rdf
$ nt2json index.nt > index.json
We can define similar aliases for any input/output permutation provided by rapper. To find out the full list of input and output RDF serialization formats supported by your version of the program, run rapper --help:
$ rapper --help
...
Main options:
-i FORMAT, --input FORMAT Set the input format/parser to one of:
rdfxml RDF/XML (default)
ntriples N-Triples
turtle Turtle Terse RDF Triple Language
trig TriG - Turtle with Named Graphs
rss-tag-soup RSS Tag Soup
grddl Gleaning Resource Descriptions from Dialects of Languages
guess Pick the parser to use using content type and URI
rdfa RDF/A via librdfa
...
-o FORMAT, --output FORMAT Set the output format/serializer to one of:
ntriples N-Triples (default)
turtle Turtle
rdfxml-xmp RDF/XML (XMP Profile)
rdfxml-abbrev RDF/XML (Abbreviated)
rdfxml RDF/XML
rss-1.0 RSS 1.0
atom Atom 1.0
dot GraphViz DOT format
json-triples RDF/JSON Triples
json RDF/JSON Resource-Centric
...
rapper aliases
Copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with RDF data on the command line:
# rapper aliases from http://blog.datagraph.org/2010/04/transmuting-ntriples
alias any2nt="rapper -i guess -o ntriples" # Anything to N-Triples
alias any2ttl="rapper -i guess -o turtle" # Anything to Turtle
alias any2rdf="rapper -i guess -o rdfxml-abbrev" # Anything to RDF/XML
alias any2json="rapper -i guess -o json" # Anything to RDF/JSON
alias nt2ttl="rapper -i ntriples -o turtle" # N-Triples to Turtle
alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev" # N-Triples to RDF/XML
alias nt2json="rapper -i ntriples -o json" # N-Triples to RDF/JSON
alias ttl2nt="rapper -i turtle -o ntriples" # Turtle to N-Triples
alias ttl2rdf="rapper -i turtle -o rdfxml-abbrev" # Turtle to RDF/XML
alias ttl2json="rapper -i turtle -o json" # Turtle to RDF/JSON
alias rdf2nt="rapper -i rdfxml -o ntriples" # RDF/XML to N-Triples
alias rdf2ttl="rapper -i rdfxml -o turtle" # RDF/XML to Turtle
alias rdf2json="rapper -i rdfxml -o json" # RDF/XML to RDF/JSON
alias json2nt="rapper -i json -o ntriples" # RDF/JSON to N-Triples
alias json2ttl="rapper -i json -o turtle" # RDF/JSON to Turtle
alias json2rdf="rapper -i json -o rdfxml-abbrev" # RDF/JSON to RDF/XML
Since each of these aliases is a mnemonic patterned after the
file extensions for the input and output formats involved,
remembering these is easy as pie. Note also that I've included four
any2*
aliases that specify guess
as the
input format to let rapper
try and automatically
detect the serialization format for the input stream.
A big thanks goes out to Dave
Beckett for having developed Raptor and for giving us the
superbly useful N-Triples and Turtle serialization formats. I
personally use rapper
and these aliases just about
every single day, and I hope you find them as useful as I have.
Stay tuned for more upcoming installments of RDF for Intrepid Unix Hackers.
Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.
Posted at 16:05
The first time I ever sat down to write some real RDF code, I started, as one always should, with some tests. Most of them went fine, but then I had to write a test that compared the equality of two graphs; I think this was for a parser in Scala, sometime last year, but I've lost track of what exactly I was looking at. In any case, what a can of worms I opened.
It turns out that graph equality in RDF is hard. The combination of blank and non-blank nodes makes it a graph isomorphism problem that I have not found an exact equivalence for in straight-up graph theory. Graphs with named vertices and edges have easy solutions, graphs with unnamed vertices and edges have other, difficult solutions. The difference, depending on the type of graph, can be between O(n) and O(n!) on the number of nodes, so when selecting a possible solution, we'd like to avoid solutions that don't take naming into account.
The isomorphism problem is hard enough that many popular RDF implementations don't even include a solution for it. RDFLib for Python has an approximation with a to-do note, I don't see an appropriate function in Redland's model API, and Sesame has an implementation with the following comment:
// FIXME: this recursive implementation has a high risk of
// triggering a stack overflow
My Java is rusty and I have no intention of polishing it up for this blog post, but I believe Sesame's implementation has factorial complexity.
Now, don't get me wrong. Those are all free projects, and it's a tough problem to do right. We over at Datagraph just made do without an isomorphism function in either Scala or Ruby for several months rather than solve it. So this is not intended to be a cheap shot at those projects -- in fact, we use both Redland and Sesame, and quite happily. And if I'm wrong on the sparse nature of this landscape, someone please correct me.
However, we're developing a new RDF library for Ruby, so when it came time to really solve the problem, we wanted to solve it right. Like most problems in computer science, it's actually old news. Jeremy Carroll solved it and implemented it for Jena either before or after writing a great paper on the topic. What I'm about to describe is more or less his algorithm, and while I slightly adjusted the following to my style, I'm not about to say much that his paper doesn't. So just go read the paper if that's your preference.
The algorithm can be described as a refinement of a naive O(n!) graph isomorphism algorithm, in which each blank node is mapped onto each other blank node, followed by a consistency check. The magic stems from RDF having these nifty global identifiers for most vertices and all edges. If we're smart about it, we can eliminate substantially all of the possible mappings before we try even our first speculative mapping.
I haven't done the math, but it would seem that one could generate a pathological case graph which would be O(n!). On the other hand, since RDF does not allow blank node predicates, and because the algorithm terminates on the first match, I haven't yet figured out how to create such a pathological graph for this algorithm. Graphs tend to be either open enough to have a large number of solutions, one of which will be found quickly, or tight enough to have only one.
In something approaching day-to-day English, the algorithm works as follows: after eliminating the simple possibilities, we generate a hash of all of the elements that appear with a given node in a graph. We then create a node-to-hash mapping. As the hashes will be the same for blank nodes on both input graphs, we use that hash to eliminate possible matchings before we try them. Instead of trying every mapping, we try mappings only on nodes with the same signature. The end result is an algorithm that requires a fairly pathological case to recurse at all, let alone to recurse deeply. Nice.
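To make the signature idea concrete, here is a rough Ruby sketch of the hashing step. This is not the actual RDF::Isomorphic code: the representation (plain arrays for triples, symbols for blank nodes) and the helper names (blank_signatures, candidate_mappings) are invented purely for illustration.
require 'digest'

# Illustrative only: compute a per-blank-node "signature" from the grounded
# parts of every triple the node appears in. Triples are [subject, predicate,
# object] arrays; blank nodes are represented as symbols such as :_a.
def blank_signatures(triples)
  sigs = Hash.new { |hash, key| hash[key] = [] }
  triples.each do |s, p, o|
    [s, o].each do |term|
      next unless term.is_a?(Symbol) # only blank nodes get signatures
      # Blank node labels are replaced by a placeholder so the hash depends
      # only on the grounded (named) parts of the surrounding triples.
      grounded = [s, p, o].map { |t| t.is_a?(Symbol) ? '_' : t }
      sigs[term] << grounded.join(' ')
    end
  end
  sigs.transform_values { |rows| Digest::SHA1.hexdigest(rows.sort.join("\n")) }
end

# Speculative bijections are only attempted between blank nodes that share a
# signature, which prunes nearly all of the n! candidate mappings up front.
def candidate_mappings(triples_a, triples_b)
  sigs_a = blank_signatures(triples_a)
  sigs_b = blank_signatures(triples_b)
  sigs_a.map { |node, sig| [node, sigs_b.select { |_, s| s == sig }.keys] }.to_h
end
Only when two or more blank nodes end up with identical signatures does anything resembling the naive trial-and-error search kick in, which is why the recursion is so rarely triggered in practice.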
At any rate, you can see the details, along with some test cases
to play with, in RDF::Isomorphic
for RDF.rb. This blog post coincides with release 0.1.0, which
features a slightly improved signature algorithm, reducing the
number of rounds required in some cases. The documentation is also
greatly improved -- I spent more time on this problem than I ever
intended to, so I hope this can be a readable summary of the
algorithm for anyone coming across this in the future. Of course,
RDF.rb's structure means almost anything using RDF.rb can be
tested for isomorphism now, so hopefully it won't ever occur to
you to read the code.
Of course, RDF::Isomorphic
is in the public domain, so should you find my
implementation worthy, feel free to copy the code as directly as
your framework or programming language allows. And please feel free
to do that without any obligation to provide attribution or any
such silliness.
Posted at 16:05
We have just released version 0.1.0 of RDF.rb, our RDF library for Ruby. This is the first generally useful release of the library, so I will here introduce the design philosophy and object model of the library as well as provide a tutorial to using its core classes.
RDF.rb has extensive API documentation with many inline code examples, enjoys comprehensive RSpec coverage, and is immediately available via RubyGems:
$ [sudo] gem install rdf
Once installed, to load up the library in your own Ruby projects you need only do:
require 'rdf'
The RDF.rb source code repository is hosted on GitHub. You can obtain a local working copy of the source code as follows:
$ git clone git://github.com/bendiken/rdf.git
The design philosophy for RDF.rb differs somewhat from previous efforts at RDF libraries for Ruby. Instead of a feature-packed RDF library that attempts to include everything but the kitchen sink, we have rather aimed for something like a lowest common denominator with well-defined, finite requirements.
Thus, RDF.rb is perhaps quickest described in terms of what it isn't and what it doesn't have:
RDF.rb does not have any dependencies other than the Addressable gem which provides improved URI handling over Ruby's standard library. We also guarantee that RDF.rb will never add any hard dependencies that would compromise its use on popular alternative Ruby implementations such as JRuby.
RDF.rb does not provide any resource-centric, ORM-like abstractions to hide the essential statement-oriented nature of the API. Such abstractions may be useful, but they are beyond the scope of RDF.rb itself.
RDF.rb does not, and will not, include built-in support for any RDF serialization formats other than N-Triples and N-Quads. However, it does define a DSL and common API for adding support for other formats via third-party plugin gems. There presently exist RDF.rb-compatible RDF::JSON and RDF::TriX gems that add initial RDF/JSON and TriX support, respectively.
RDF.rb does not, and will not, include built-in support for any particular persistent RDF storage systems. However, it does define the interfaces that such storage adapters could be written to. Again, add-on gems are the way to go, and there already exists an in-the-works RDF.rb-compatible RDF::Sesame gem that enables using Sesame 2.0 HTTP endpoints with the repository interface defined by RDF.rb.
RDF.rb does not, and will not, include any built-in RDF Schema or OWL inference capabilities. There exists an in-the-works RDF.rb-compatible RDFS gem that is intended to provide a naive proof-of-concept implementation of a forward-chaining inference engine for the RDF Schema entailment rules.
RDF.rb does not include any built-in SPARQL functionality per se, though it will soon provide support for basic graph pattern (BGP) matching and could thus conceivably be used as the basis for a SPARQL engine written in Ruby.
RDF.rb does not come with a license statement, but rather with the stringent hope that you have a nice day. RDF.rb is 100% free and unencumbered public domain software. You can copy, modify, use, and hack on it without any restrictions whatsoever. This means that authors of other RDF libraries for Ruby are perfectly welcome to steal any of our code, with or without attribution. So, if some code snippet or file may be of use to you, feel free to copy it and relicense it under whatever license you have released your own library with -- no need to include any copyright notices from us (since there are none), or even to mention us in the credits (we won't mind).
So that's what RDF.rb is not, but perhaps more important is what we want it to be. There's no reason for simple RDF-based solutions to require enormous complex libraries, storage engines, significant IDE configuration or XML pushups. We're hoping to bring RDF to a world of agile programmers and startups, and to bring existing Linked Data enthusiasts to a platform that encourages rapid innovation and programmer happiness. And maybe everyone can have some fun along the way!
It is also our hope that the aforementioned minimalistic design approach and extremely liberal licensing can help lead to the emergence of a semi-standard Ruby object model for RDF, that is, a common core class hierarchy and API that could be largely interoperable between a number of RDF libraries for Ruby.
With that in mind, let's proceed to have a look at RDF.rb's core object model.
While RDF.rb is built to take full advantage of Ruby's duck typing and
mixins, it does
also define a class hierarchy of RDF objects. If nothing else, this
inheritance tree is useful for case/when
matching and
also adheres to the principle
of least surprise for developers hailing from less dynamic
programming languages.
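For instance, dispatching on the kind of term you were handed might look like the following -- a small sketch of mine, using the classes introduced just below, rather than an example from the RDF.rb documentation:
require 'rdf'

# Branch on the class hierarchy of an RDF term using case/when.
def describe(value)
  case value
  when RDF::URI       then "a URI reference: #{value}"
  when RDF::Node      then "a blank node: #{value}"
  when RDF::Literal   then "a literal: #{value.inspect}"
  when RDF::Statement then "a statement about #{value.subject}"
  else                     "something else entirely"
  end
end

describe(RDF::URI.new("http://rdf.rubyforge.org/")) #=> "a URI reference: http://rdf.rubyforge.org/"
describe(RDF::Node.new)                             #=> "a blank node: _:..."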
The RDF.rb core class hierarchy looks like the following, and will seem instantly familiar to anyone acquainted with Sesame's object model:
The five core RDF.rb classes, all of them ultimately inheriting from RDF::Value, are:
RDF::Literal represents plain, language-tagged or datatyped literals.
RDF::URI represents URI references (URLs and URNs).
RDF::Node represents anonymous nodes (also known as blank nodes).
RDF::Statement represents RDF statements (also known as triples).
RDF::Graph represents anonymous or named graphs containing zero or more statements.
In addition, the two core RDF.rb interfaces (known as mixins in Ruby parlance) are:
RDF::Enumerable provides RDF-specific iteration methods for any collection of RDF statements.
RDF::Queryable provides RDF-specific query methods for any collection of RDF statements.
Let's take a quick tour of each of these aforementioned core classes and mixins.
URI references (URLs and URNs) are represented in RDF.rb as
instances of the RDF::URI
class, which is based on the
excellent Addressable::URI
library.
The RDF::URI
constructor is overloaded to take
either a URI string (anything that responds to #to_s, actually) or an options hash of URI components. This means that the
following are two equivalent ways of constructing the same URI
reference:
uri = RDF::URI.new("http://rdf.rubyforge.org/")
uri = RDF::URI.new({
  :scheme => 'http',
  :host   => 'rdf.rubyforge.org',
  :path   => '/',
})
The supported URI components are explained in the API documentation for Addressable::URI.new.
Turning a URI reference back into a string works as usual in Ruby:
uri.to_s #=> "http://rdf.rubyforge.org/"
RDF::URI supports the same set of instance methods as does Addressable::URI. This means that the following methods, and many more, are available:
uri = RDF::URI.new("http://rubygems.org/gems/rdf")
uri.absolute? #=> true
uri.relative? #=> false
uri.scheme #=> "http"
uri.authority #=> "rubygems.org"
uri.host #=> "rubygems.org"
uri.port #=> nil
uri.path #=> "/gems/rdf"
uri.basename #=> "rdf"
In addition, RDF::URI
supports several convenience
methods that can help you navigate URI hierarchies without breaking
a sweat:
uri = RDF::URI.new("http://rubygems.org/")
uri = uri.join("gems", "rdf")
uri.to_s #=> "http://rubygems.org/gems/rdf"
uri.parent #=> RDF::URI.new("http://rubygems.org/gems/")
uri.root #=> RDF::URI.new("http://rubygems.org/")
Blank nodes are represented in RDF.rb as instances of the
RDF::Node
class.
The simplest way to create a new blank node is as follows:
bnode = RDF::Node.new
This will create a blank node with an identifier based on the
internal Ruby object ID of the RDF::Node
instance.
This nicely serves us as a unique identifier for the duration of
the Ruby process:
bnode.id #=> "2158816220"
bnode.to_s #=> "_:2158816220"
You can also provide an explicit blank node identifier to the
RDF::Node
constructor. This is particularly useful
when serializing or parsing RDF data, where you generally need to
maintain a mapping of blank node identifiers to blank node
instances.
The constructor argument can be any string or any object that responds to #to_s. For example, say that you wanted to
create a blank node instance having a globally-unique UUID as its identifier. Here's
how you would do this with the help of the UUID gem:
require 'uuid'
bnode = RDF::Node.new(UUID.generate)
The above is a fairly common use case, so RDF.rb actually provides a convenience class method for creating UUID-based blank nodes. The following will use either the UUID or the UUIDTools gem, whichever happens to be available:
bnode = RDF::Node.uuid
bnode.to_s #=> "_:504c0a30-0d11-012d-3f50-001b63cac539"
All three types of RDF literals -- plain, language-tagged and
datatyped -- are represented in RDF.rb as instances of the
RDF::Literal
class.
Create plain literals by passing in a string to the
RDF::Literal
constructor:
hello = RDF::Literal.new("Hello, world!")
hello.plain? #=> true
hello.has_language? #=> false
hello.has_datatype? #=> false
Note, however, that in most RDF.rb interfaces you will
not in fact need to wrap language-agnostic, non-datatyped
strings into RDF::Literal
instances; this is done
automatically when needed, allowing you the convenience of, say,
passing in a plain old Ruby string as the object value when
constructing an RDF::Statement
instance.
To create language-tagged literals, pass in an additional
ISO language
code to the :language
option of the
RDF::Literal
constructor:
hello = RDF::Literal.new("Hello!", :language => :en)
hello.has_language? #=> true
hello.language #=> :en
The language code can be given as either a symbol, a string, or
indeed anything that responds to the #to_s
method:
RDF::Literal.new("Hello!", :language => :en)
RDF::Literal.new("Wazup?", :language => :"en-US")
RDF::Literal.new("Hej!", :language => "sv")
RDF::Literal.new("¡Hola!", :language => ["es"])
Datatyped literals are created similarly, by passing in a
datatype URI to the :datatype
option of the
RDF::Literal
constructor:
date = RDF::Literal.new("2010-12-31", :datatype => RDF::XSD.date)
date.has_datatype? #=> true
date.datatype #=> RDF::XSD.date
The datatype URI can be given as any object that responds to
either the #to_uri
method or the #to_s
method. In the example above, we've called the #date
method on the RDF::XSD
vocabulary class which
represents the XML Schema
datatypes vocabulary; this returns an RDF::URI
instance representing the URI for the xsd:date
datatype.
You'll be glad to hear that you don't necessarily have to always explicitly specify a datatype URI when creating a datatyped literal. RDF.rb supports a degree of automatic mapping between Ruby classes and XML Schema datatypes.
In most common cases, you can just pass in the Ruby value to the
RDF::Literal
constructor as-is, with the correct XML
Schema datatype being automatically set by RDF.rb:
today = RDF::Literal.new(Date.today)
today.has_datatype? #=> true
today.datatype #=> RDF::XSD.date
The following implicit datatype mappings are presently supported by RDF.rb:
RDF::Literal.new(false).datatype #=> RDF::XSD.boolean
RDF::Literal.new(true).datatype #=> RDF::XSD.boolean
RDF::Literal.new(123).datatype #=> RDF::XSD.integer
RDF::Literal.new(9223372036854775807).datatype #=> RDF::XSD.integer
RDF::Literal.new(3.1415).datatype #=> RDF::XSD.double
RDF::Literal.new(Date.new(2010)).datatype #=> RDF::XSD.date
RDF::Literal.new(DateTime.new(2010)).datatype #=> RDF::XSD.dateTime
RDF::Literal.new(Time.now).datatype #=> RDF::XSD.dateTime
RDF statements are represented in RDF.rb as instances of the
RDF::Statement
class. Statements can be
triples -- constituted of a subject, a
predicate, and an object -- or they can be
quads that also have an additional context
indicating the named graph that they are part of.
Creating a triple works exactly as you'd expect:
subject = RDF::URI.new("http://rubygems.org/gems/rdf")
predicate = RDF::DC.creator
object = RDF::URI.new("http://ar.to/#self")
RDF::Statement.new(subject, predicate, object)
The subject should be an RDF::Resource, the predicate an RDF::URI, and the object an RDF::Value. These constraints are not enforced, however, allowing you to use any duck-typed equivalents as components of statements.
Pass in a URI reference in an extra :context
option
to the RDF::Statement
constructor to create a
quad:
context = RDF::URI.new("http://rubygems.org/")
subject = RDF::URI.new("http://rubygems.org/gems/rdf")
predicate = RDF::DC.creator
object = RDF::URI.new("http://ar.to/#self")
RDF::Statement.new(subject, predicate, object, :context => context)
It's also worth mentioning that the RDF::Statement
constructor is overloaded to enable instantiating statements from
an options hash, as follows:
RDF::Statement.new({
  :subject   => RDF::URI.new("http://rubygems.org/gems/rdf"),
  :predicate => RDF::DC.creator,
  :object    => RDF::URI.new("http://ar.to/#self"),
})
The :context
option can also be given, as before.
Use whichever method of instantiating statements that you happen to
prefer.
Statement objects also support a #to_hash
method
that provides the inverse operation:
statement.to_hash #=> { :subject => ...,
# :predicate => ...,
# :object => ... }
Access the RDF statement components -- the subject, the predicate, and the object -- as follows:
statement.subject #=> an RDF::Resource
statement.predicate #=> an RDF::URI
statement.object #=> an RDF::Value
Since statements can also have an optional context, the
following will return either nil
or else an
RDF::Resource
instance:
statement.context #=> an RDF::Resource or nil
Because RDF.rb is duck-typed, you can often directly use a
three- or four-item Ruby array in place of an
RDF::Statement
instance. This can sometimes feel less
cumbersome than instantiating a statement object, and it may also
save some memory if you need to deal with a very large amount of
in-memory RDF statements. We'll see some examples of doing this later on.
Converting from statement objects to Ruby arrays is trivial:
statement.to_triple #=> [subject, predicate, object]
statement.to_quad #=> [subject, predicate, object, context]
Likewise, instantiating a statement object from a triple represented as a Ruby array is straightforward enough:
RDF::Statement.new(*[subject, predicate, object])
RDF graphs are represented in RDF.rb as instances of the
RDF::Graph
class. Note that most of the functionality
in this class actually comes from the RDF::Enumerable
and RDF::Queryable
mixins, which we'll examine further
below.
Creating a new unnamed graph works just as you'd expect:
graph = RDF::Graph.new
graph.named? #=> false
graph.to_uri #=> nil
To create a named
graph, just pass in a blank node or a URI reference to the
RDF::Graph
constructor:
graph = RDF::Graph.new("http://rubygems.org/")
graph.named? #=> true
graph.to_uri #=> RDF::URI.new("http://rubygems.org/")
To insert RDF statements into a graph, use the
#<<
operator or the #insert
method:
graph << statement
graph.insert(*statements)
Let's add some RDF statements to an unnamed graph, taking advantage of the aforementioned duck-typing convenience that lets us represent triples directly using Ruby arrays, and plain literals directly using Ruby strings:
rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")
arto = RDF::URI.new("http://ar.to/#self")
graph = RDF::Graph.new do
  self << [rdfrb, RDF::DC.title, "RDF.rb"]
  self << [rdfrb, RDF::DC.creator, arto]
end
If you prefer, you can also be more explicit and use the
equivalent #insert
method form instead of the
#<<
operator:
graph.insert([rdfrb, RDF::DC.title, "RDF.rb"])
graph.insert([rdfrb, RDF::DC.creator, arto])
To delete RDF statements from a graph, use the
#delete
method:
graph.delete(*statements)
Deleting the statements we inserted in the previous example works like so:
graph.delete([rdfrb, RDF::DC.title, "RDF.rb"])
graph.delete([rdfrb, RDF::DC.creator, arto])
Alternatively, we can use wildcard matching (where
nil
stands for a "match anything" wildcard) to simply
delete every statement in the graph that has a particular
subject:
graph.delete([rdfrb, nil, nil])
For even more convenience, since non-existent array subscripts
in Ruby return nil
, the following abbreviation is
exactly equivalent to the previous example:
graph.delete([rdfrb])
RDF::Enumerable
is a mixin module that provides
RDF-specific iteration methods for any object capable of yielding
RDF statements.
In what follows we will consider some of the key
RDF::Enumerable
methods specifically as used in
instances of the RDF::Graph
class.
Just as with most of Ruby's built-in collection classes, graphs
support an #empty?
predicate method that returns a
boolean:
graph.empty? #=> true or false
You can use #count
-- or if you prefer, the
equivalent alias #size
-- to return the number of RDF
statements in a graph:
graph.count
If you need to check whether a specific RDF statement is included in the graph, use the following method:
graph.has_statement?(RDF::Statement.new(subject, predicate, object))
There also exists an otherwise equivalent convenience method
that takes a Ruby array as its argument instead of an
RDF::Statement
instance:
graph.has_triple?([subject, predicate, object])
If you need to check whether a particular value is included in the graph as a component of one or more statements, use one of the following three methods:
graph.has_subject?(RDF::URI.new("http://rdf.rubyforge.org/"))
graph.has_predicate?(RDF::DC.creator)
graph.has_object?(RDF::Literal.new("Hello!", :language => :en))
The following method yields every statement in the graph as an
RDF::Statement
instance:
graph.each_statement do |statement|
  puts statement.inspect
end
You can also use #each as a shorter alias for #each_statement, though we ourselves consider using the more explicit form to be stylistically preferred.
If you don't require RDF::Statement
instances and
simply want to get directly at the triple components of statements,
do the following instead:
graph.each_triple do |subject, predicate, object|
  puts [subject, predicate, object].inspect
end
Similarly, you can enumerate the graph using quads as well:
graph.each_quad do |subject, predicate, object, context|
  puts [subject, predicate, object, context].inspect
end
Note that for unnamed graphs, the yielded context will always be nil; for named graphs, it will always be the same RDF::Resource instance as would be returned by calling graph.context.
If instead of enumerating statements one-by-one you wish to obtain all the data in a graph in one go as an array of statements, the following method does just that:
graph.statements #=> [RDF::Statement(subject1, predicate1, object1), ...]
Naturally, there also exist the usual alternative methods that give you the statements in the form of raw triples or quads represented as Ruby arrays:
graph.triples #=> [[subject1, predicate1, object1], ...]
graph.quads #=> [[subject1, predicate1, object1, context1], ...]
A particularly useful set of methods is the following, which yield unique statement components from a graph:
graph.each_subject { |value| puts value.inspect }
graph.each_predicate { |value| puts value.inspect }
graph.each_object { |value| puts value.inspect }
For instance, #each_subject
yields every unique
statement subject in the graph, never yielding the same subject
twice.
Again, instead of yielding unique values one-by-one, you can obtain them in one go with the following methods:
graph.subjects #=> [subject1, subject2, subject3, ...]
graph.predicates #=> [predicate1, predicate2, predicate3, ...]
graph.objects #=> [object1, object2, object3, ...]
Here, #subjects
returns an array containing all
unique statement subjects in the graph, and
#predicates
and #objects
do the same for
statement predicates and objects respectively.
RDF::Queryable
is a mixin that provides
RDF-specific query methods for any object capable of yielding RDF
statements. At present this means simple subject-predicate-object
queries, but extended basic graph pattern matching will be
available in a future release of RDF.rb.
In what follows we will consider RDF::Queryable
methods specifically as used in instances of the
RDF::Graph
class.
The simplest type of query is one that specifies all statement components, as in the following:
statements = graph.query([subject, predicate, object])
The result set here would contain either no statements if the query didn't match (that is, the given statement didn't exist in the graph), or otherwise at the most the single matched statement.
The #query
method can also take a block, in which
case matching statements are yielded to the block one after another
instead of returned as a result set:
graph.query([subject, predicate, object]) do |statement|
  puts statement.inspect
end
You can replace any of the query components with
nil
to perform a wildcard match. For example, in the
following we query for all dc:title
values for a given
subject resource:
rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")
graph.query([rdfrb, RDF::DC.title, nil]) do |statement|
  puts "dc:title = #{statement.object.inspect}"
end
We can also query for any and all statements related to a given subject resource:
graph.query([rdfrb, nil, nil]) do |statement|
  puts "#{statement.predicate.inspect} = #{statement.object.inspect}"
end
The result sets returned by #query also implement RDF::Enumerable and RDF::Queryable, so it is possible to chain several queries to incrementally refine a result set:
graph.query([rdfrb]).query([nil, RDF::DC.title])
Likewise, it is of course possible to chain RDF::Queryable operations with methods from RDF::Enumerable:
graph.query([nil, RDF::DC.title]).each_subject do |subject|
  puts subject.inspect
end
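To wrap up the tour, here is a short end-to-end sketch that strings together the pieces shown above -- building a small graph, checking it with RDF::Enumerable methods, and querying it with RDF::Queryable (the resource URIs are the same illustrative ones used throughout):
require 'rdf'

rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")
arto  = RDF::URI.new("http://ar.to/#self")

# Build a small unnamed graph using array-based triples and a plain Ruby string.
graph = RDF::Graph.new
graph.insert([rdfrb, RDF::DC.title,   "RDF.rb"])
graph.insert([rdfrb, RDF::DC.creator, arto])

graph.count               #=> 2
graph.has_subject?(rdfrb) #=> true

# Enumerate every statement about the rdfrb resource.
graph.query([rdfrb, nil, nil]) do |statement|
  puts "#{statement.predicate} = #{statement.object.inspect}"
end

# Chain Queryable and Enumerable: the subjects of all dc:title statements.
graph.query([nil, RDF::DC.title]).each_subject do |subject|
  puts subject.inspect
end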
If you have feedback regarding RDF.rb, please contact us either privately or via the public-rdf-ruby@w3.org mailing list. Bug reports should go to the issue queue on GitHub.
In upcoming RDF.rb tutorials we will see how to work with existing RDF vocabularies, how to serialize and parse RDF data using RDF.rb, how to write an RDF.rb plugin, how to use RDF.rb with Ruby on Rails 3.0, and much more. Stay tuned!
Posted at 16:05
The N-Triples format is
the lowest common denominator for RDF serialization formats, and turns
out to be a very good fit to the Unix paradigm of line-oriented,
whitespace-separated data processing. In this tutorial we'll see how to process N-Triples data by pipelining standard Unix tools such as grep, wc, cut, awk, sort, uniq, head and tail.
To follow along, you will need access to a Unix box (Mac OS X,
Linux, or BSD) with a Bash-compatible shell.
We'll be using curl to fetch data over HTTP, but you can substitute wget or fetch if necessary. A couple of the examples require a modern AWK version such as gawk or mawk; on Linux distributions you should be okay by default, but on Mac OS X you will need to install gawk or mawk from MacPorts as follows:
$ sudo port install mawk
$ alias awk=mawk
Each N-Triples line encodes one RDF statement, also known as a triple. Each line consists of the subject (a URI or a blank node identifier), one or more characters of whitespace, the predicate (a URI), some more whitespace, and finally the object (a URI, blank node identifier, or literal) followed by a dot and a newline. For example, the following N-Triples statement asserts the title of my website:
<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .
This is an almost perfect format for Unix tooling; the only possible further improvement would have been to define the statement component separator to be a tab character, which would have simplified obtaining the object component of statements -- as we'll see in a bit.
Many RDF data dumps are made available as compressed N-Triples files. DBpedia, the RDFization of Wikipedia, is a prominent example. For purposes of this tutorial I've prepared an N-Triples dataset containing all Drupal-related RDF statements from DBpedia 3.4, which is the latest release at the moment and reflects Wikipedia as of late September 2009.
I prepared the sample dataset by downloading all English-language core datasets (20 N-Triples files totaling 2.1 GB when compressed) and crunching through them as follows:
$ bzgrep Drupal *.nt.bz2 > drupal.nt
To save you from gigabyte-sized downloads and an hour of data
crunching, you can just grab a copy of the resulting
drupal.nt
file as follows:
$ curl http://blog.datagraph.org/2010/03/grepping-ntriples/drupal.nt > drupal.nt
The sample dataset totals 294 RDF statements and weighs in at 70 KB.
The first thing we want to do is count the number of triples in an N-Triples dataset. This is straightforward to do, since each triple is represented by one line in an N-Triples input file and there are a number of Unix tools that can be used to count input lines. For example, we could use either of the following commands:
$ cat drupal.nt | wc -l
294
$ cat drupal.nt | awk 'END { print NR }'
294
Since we'll be using a lot more of AWK throughout this
tutorial, let's stick with awk
and define a handy
shell alias for this operation:
$ alias rdf-count="awk 'END { print NR }'"
$ cat drupal.nt | rdf-count
294
Note that, for reasons of comprehensibility, the previous examples as well as most of the subsequent ones assume that we're dealing with "clean" N-Triples datasets that don't contain comment lines or other miscellanea. The DBpedia data dumps fit this bill very well. However, further on I will give "fortified" versions of these commands that can correctly deal with arbitrary N-Triples files.
We at Datagraph frequently use the N-Triples representation as the canonical lexical form of an RDF statement, and work with content-addressable storage systems for RDF data that in fact store statements using their N-Triples representation. In such cases, it is often useful to know some statistical characteristics of the data to be loaded in a mass import, so as to e.g. be able to fine-tune the underlying storage for optimum space efficiency.
A first useful statistic is to know the typical size of a datum, i.e. the line length of an N-Triples statement, in the dataset we're dealing with. AWK yields us N-Triples line lengths without much trouble:
$ alias rdf-lengths="awk '{ print length }'"
$ cat drupal.nt | rdf-lengths | head -n5
162
150
155
137
150
Note that N-Triples is an ASCII format, so the numbers above reflect both the byte sizes of input lines as well as the ASCII character count of input lines. All non-ASCII characters are escaped in N-Triples, and for present purposes we'll be talking in terms of ASCII characters only.
The above list of line lengths in and of itself won't do us much good; we want to obtain aggregate information for the whole dataset at hand, not for individual statements. It's too bad that Unix doesn't provide commands for simple numeric aggregate operations such as the minimum, maximum and average of a list of numbers, so let's see if we can remedy that.
One way to define such operations would be to pipe the above
output to an RPN
shell calculator such as dc
and have it perform the
needed calculations. The complexity of this would go somewhat
beyond mere shell aliases, however. Thankfully, it turns out that
AWK is well-suited to writing these aggregate operations as well.
Here's how we can extend our earlier pipeline to boil the list of
line lengths down to an average:
$ alias avg="awk '{ s += \$1 } END { print s / NR }'"
$ cat drupal.nt | rdf-lengths | avg
242.517
The above, incidentally, is an example of a simple map/reduce operation: a sequence of input values is mapped through a function, in this case length(line), to give a sequence of output values (the line lengths) that is then reduced to a single aggregate value (the average line length). Though I won't go further into this just now, it is worth mentioning in passing that N-Triples is an ideal format for massively parallel processing of RDF data using Hadoop and the like.
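Purely to illustrate that map/reduce shape (the shell aliases above remain the tool of choice on the command line), the same computation might be sketched in a few lines of Ruby:
# Map each N-Triples line to its length, then reduce the lengths to an average.
lengths = File.readlines("drupal.nt").map { |line| line.chomp.length }
average = lengths.inject(0) { |sum, n| sum + n } / lengths.size.to_f
puts average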
Now, we can still optimize and simplify the above some by combining both steps of the operation into a single alias that outputs an average line length for the given input stream, like so:
$ alias rdf-length-avg="awk '\
{ s += length }
END { print s / NR }'"
Likewise, it doesn't take much more to define an alias for obtaining the maximum line length in the input dataset:
$ alias rdf-length-max="awk '\
BEGIN { n = 0 } \
{ if (length > n) n = length } \
END { print n }'"
Getting the minimum line length is only slightly more complicated. Instead of comparing against a zero baseline like above, we need to instead define a "roof" value to compare against. In the following, I've picked an arbitrarily large number, making the (at present) reasonable assumption that no N-Triples line will be longer than a billion ASCII characters, which would amount to somewhat less than a binary gigabyte:
$ alias rdf-length-min="awk '\
BEGIN { n = 1e9 } \
{ if (length > 0 && length < n) n = length } \
END { print (n < 1e9 ? n : 0) }'"
Now that we have some aggregate operations to crunch N-Triples data with, let's analyze our sample DBpedia dataset using the three aliases defined above:
$ cat drupal.nt | rdf-length-avg
242.517
$ cat drupal.nt | rdf-length-max
2179
$ cat drupal.nt | rdf-length-min
84
We can see from the output that N-Triples line lengths in this dataset vary considerably: from less than a hundred bytes to several kilobytes, but being on average in the range of two hundred bytes. This variability is to be expected for DBpedia data, given that many RDF statements in such a dataset contain a long textual description as their object literal whereas others contain merely a simple integer literal.
Many other statistics, such as the median line length or the standard deviation of the line lengths, could conceivably be obtained in a manner similar to what I've shown above. I'll leave those as exercises for the reader, however, as further stats regarding the raw N-Triples lines are unlikely to be all that generally interesting.
It's time to move on to getting at the three components -- the subject, the predicate and the object -- that constitute RDF statements.
We have two straightforward choices for obtaining the subject and predicate: the cut command and good old awk. I'll show both aliases:
$ alias rdf-subjects="cut -d' ' -f 1 | uniq"
$ alias rdf-subjects="awk '{ print \$1 }' | uniq"
While cut
might shave off some microseconds
compared to awk
here, AWK is still the better choice
for the general case, as it allows us to expand the alias
definition to ignore empty lines and comments, as we'll see later.
On our sample data, though, either form works fine.
You may have noticed and wondered about the pipelined uniq after cut and awk. This is simply a low-cost, low-grade deduplication filter: it drops consecutive duplicate values. For an ordered dataset (where the input N-Triples lines are already sorted in lexical order), it will get rid of all duplicate subjects. In an unordered dataset, it won't do much good, but it won't do much harm either (what's a microsecond here or there?)
To fully deduplicate the list of subjects for a (potentially)
unordered dataset, apply another uniq
filter after a
sort
operation as follows:
$ cat drupal.nt | rdf-subjects | sort | uniq | head -n5
<http://dbpedia.org/resource/Acquia_Drupal>
<http://dbpedia.org/resource/Adland>
<http://dbpedia.org/resource/Advomatic>
<http://dbpedia.org/resource/Apadravya>
<http://dbpedia.org/resource/Application_programming_interface>
I've not made sort
an integral part of the
rdf-subjects
alias because sorting the subjects is an
expensive operation with resource usage proportional to the number
of statements processed; when processing a billion-triple N-Triples
stream, it is usually simply better to not care too much about
ordering.
Getting the predicates from N-Triples data works exactly the same way as getting the subjects:
$ alias rdf-predicates="cut -d' ' -f 2 | uniq"
$ alias rdf-predicates="awk '{ print \$2 }' | uniq"
Again, you can apply sort
in conjunction with
uniq
to get the list of unique predicate URIs in the
dataset:
$ cat drupal.nt | rdf-predicates | sort | uniq | tail -n5
<http://www.w3.org/2000/01/rdf-schema#label>
<http://www.w3.org/2004/02/skos/core#subject>
<http://xmlns.com/foaf/0.1/depiction>
<http://xmlns.com/foaf/0.1/homepage>
<http://xmlns.com/foaf/0.1/page>
Obtaining the object component of N-Triples statements, however,
is somewhat more complicated than getting the subject or the
predicate. This is due to the fact that object literals can contain
whitespace that will throw off the whitespace-separated field
handling of cut
and awk
that we've relied
on so far. Not to worry, AWK can still get us the results we want,
but I won't attempt to explain how the following alias works; just
be happy that it does:
$ alias rdf-objects="awk '{ ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"
The output of rdf-objects
is the N-Triples encoded
object URI, blank node identifier or object literal. URIs are
output in the same format as subjects and predicates, with
enclosing angle brackets; language-tagged literals include the
language tag, and datatyped literals include the datatype URI:
$ cat drupal.nt | rdf-objects | sort | uniq | head -n5
"09"^^<http://www.w3.org/2001/XMLSchema#integer>
"16"^^<http://www.w3.org/2001/XMLSchema#integer>
"2001-01"^^<http://www.w3.org/2001/XMLSchema#gYearMonth>
"2009"^^<http://www.w3.org/2001/XMLSchema#integer>
"6.14"^^<http://www.w3.org/2001/XMLSchema#decimal>
Another very useful operation to have is getting the list of object literal datatypes used in an N-Triples dataset. This is also a somewhat involved alias definition, and requires a modern AWK version such as gawk or mawk:
$ alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"
$ cat drupal.nt | rdf-datatypes | sort | uniq
<http://www.w3.org/2001/XMLSchema#decimal>
<http://www.w3.org/2001/XMLSchema#gYearMonth>
<http://www.w3.org/2001/XMLSchema#integer>
As we can see, most object literals in this dataset are untyped strings, but there are some decimal and integer values as well as year + month literals.
As promised, here follow more robust versions of all the
aforementioned Bash aliases. Just copy and paste the following code
snippet into your ~/.bash_aliases
or
~/.bash_profile
file, and you will always have these
aliases available when working with N-Triples data on the command
line.
# N-Triples aliases from http://blog.datagraph.org/2010/03/grepping-ntriples
alias rdf-count="awk '/^\s*[^#]/ { n += 1 } END { print n }'"
alias rdf-lengths="awk '/^\s*[^#]/ { print length }'"
alias rdf-length-avg="awk '/^\s*[^#]/ { n += 1; s += length } END { print s/n }'"
alias rdf-length-max="awk 'BEGIN { n=0 } /^\s*[^#]/ { if (length>n) n=length } END { print n }'"
alias rdf-length-min="awk 'BEGIN { n=1e9 } /^\s*[^#]/ { if (length>0 && length<n) n=length } END { print (n<1e9 ? n : 0) }'"
alias rdf-subjects="awk '/^\s*[^#]/ { print \$1 }' | uniq"
alias rdf-predicates="awk '/^\s*[^#]/ { print \$2 }' | uniq"
alias rdf-objects="awk '/^\s*[^#]/ { ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"
alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 2, length(\$3)-4) }' | uniq"
I should also note that though I've spoken throughout only in terms of N-Triples, most of the above aliases will work fine also for input in N-Quads format.
In the next installments of RDF for Intrepid Unix
Hackers, we'll attempt something a little more ambitious:
building an rdf-query
alias to perform
subject-predicate-object queries on N-Triples input. We'll also see
what to do
if your RDF data isn't already in N-Triples format, learning
how to install and use the Raptor RDF Parser Library to
convert RDF data between the various popular RDF serialization
formats. Stay
tuned.
Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.
Posted at 16:05
RDF.rb is easily the most fun RDF library I've used. It uses Ruby's dynamic system of mixins to create a library that's very easy to use.
If you're new to Ruby, you might know about mixins from other languages--Scala traits, for example, are almost exactly functionally equivalent. They're distinctly more powerful than Java interfaces or abstract classes. A mixin is basically an interface and an abstract class rolled into one. Rather than extend an abstract class, you include a mixin in your own class. A mixin will usually require that an including class implement a particular method. Ruby's own Enumerable module, for example, requires that including classes implement #each. For that tiny bit of trouble, you get a ton of methods (listed here), including iterators, mapping, partitions, conversion to arrays, and more. (If you're new to Ruby, it might also help you to know that #method_name means 'an instance method named method_name'.)
RDF.rb uses the principle extensively.
RDF::Repository is, in fact, little more than an in-memory reference implementation for 4 traits: RDF::Enumerable, RDF::Mutable, RDF::Queryable, and RDF::Durable. RDF::Sesame::Repository has the exact same interface as the in-memory representation, but is based entirely on a Sesame server. In order to work as a repository, RDF::Sesame::Repository only had to extend the reference implementation and implement #each, #insert_statement, and #delete_statement. Nice! Of course, implementing those took some doing, but it's still exceedingly easy.
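To make that concrete, here is roughly what such a repository looks like in outline. This is a hand-wavy sketch rather than RDF::Sesame's actual code; MyRepository and backend_statements are made-up names, and the storage details are elided:
require 'rdf'

# Sketch of a custom repository: implement three methods and inherit the rest
# of the Enumerable/Queryable/Mutable behaviour from the reference implementation.
class MyRepository < RDF::Repository
  # Yield an RDF::Statement for every statement in the backing store.
  def each(&block)
    backend_statements.each(&block)
  end

  # Write a single RDF::Statement to the backing store.
  def insert_statement(statement)
    # elided
  end

  # Remove a single RDF::Statement from the backing store.
  def delete_statement(statement)
    # elided
  end

  private

  # Placeholder for whatever the real storage backend would return.
  def backend_statements
    []
  end
end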
RDF::Enumerable is the key here. By implementing an #each that yields RDF::Statement objects, one gains a ton of functionality: #each_subject, #each_predicate, #each_object, #each_context, #has_subject?, #has_triple?, and more. It's a key abstraction that provides huge amounts of functionality.
But the module system goes the other way--not only is it easy to implement new RDF models, existing ones are easily extended. I recently wrote RDF::Isomorphic, which extends RDF::Enumerable with #bijection_to and #isomorphic_with? methods. The module-based system provided by RDF.rb means that my isomorphic methods are now available on RDF::Sesame::Repositories, and indeed anything which includes RDF::Enumerable. This is everything from repositories to graphs to query results! In fact, query results themselves implement RDF::Enumerable, and thus implement RDF::Queryable and can be checked for isomorphism, or whatever else you want to add. This is functionality that Sesame does not have natively, and which I wrote for a completely different purpose (testing parsers). Every RDF::Enumerable gets it for free because I wanted to compare 2 textual formats. Neat!
For example, here's what it takes to extend any RDF collection, from RDF::Isomorphic:
require 'rdf'

module RDF
  ##
  # Isomorphism for RDF::Enumerables
  module Isomorphic
    def isomorphic_with(other)
      # code that uses #each, or any other method from RDF::Enumerable goes here
      ...
    end

    def bijection_to(other)
      # code that uses #each, or any other method from RDF::Enumerable goes here
      ...
    end
  end

  # re-open RDF::Enumerable and add the isomorphic methods
  module Enumerable
    include RDF::Isomorphic
  end
end
Of course, this just can't be done without monkey patching. Mixins and monkey patching together make for a powerful toolkit. To my knowledge, this is the first RDF library that takes advantage of these features.
It's possible to provide powerful features to a wide range of implementations with this. RDF.rb does not yet have an inference layer, but any such layer would instantly work for any store which implements RDF::Enumerable. Want to prototype some custom business logic that operates over existing RDF data? Copy it into a local repository and hack away. No need for the production RDF store to be the same at all, but you can still apply the same code.
As a counter-example, compare this to the Java RDF ecosystem.
There are some excellent implementations
(RDF::Isomorphic
is heavily indebted to Jena), but
they're all incompatible. Jena's check for isomorphism is not
really translatable to Sesame, or anything else. RDF.rb, in
addition to providing a reference implementation, acts as an
abstraction layer for underlying RDF implementations. The
difference is night and day--with RDF.rb, you only need to
implement a feature once, at the API layer, to have it apply to any
implementation. This is not a knock at the very talented people
behind those Java implementations; making this happen is a lot of
work in a language without monkey patching, and RDF.rb is only as good as it is because of the significant influence those projects have had on Arto's design.
The end result of the mixin-based approach is a system that is incredibly easy to extend, and just downright fun. It would be a fairly simple task to extend a Ruby class completely unrelated to RDF with an #each method that yields statements, allowing it to work in RDF::Enumerable. Voila, your existing classes now have an RDF representation. Along the same lines, if one is bothered by the statement-oriented nature of RDF.rb, building a system which took a resource-oriented view would not require one to 'break away' from the RDF.rb ecosystem. Just build your subject-oriented model objects and implement #each, and away you go--you can now run RDF queries and test isomorphism on your model. Build it to accept an RDF::Enumerable in the constructor and you can use any existing repository or query to initialize your model.
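As a final illustration of that point, here is a tiny hypothetical Bookmark class (not part of RDF.rb or any real project) that gains the whole RDF::Enumerable repertoire simply by yielding statements from #each:
require 'rdf'

# A plain-old Ruby model object that exposes its data as RDF statements.
class Bookmark
  include RDF::Enumerable

  def initialize(url, title)
    @url, @title = url, title
  end

  # The one method RDF::Enumerable asks for: yield RDF::Statement objects.
  # dc:title and dc:identifier are just example properties here.
  def each
    subject = RDF::URI.new(@url)
    yield RDF::Statement.new(subject, RDF::DC.title, @title)
    yield RDF::Statement.new(subject, RDF::DC.identifier, @url)
  end
end

bookmark = Bookmark.new("http://rdf.rubyforge.org/", "RDF.rb")
bookmark.has_predicate?(RDF::DC.title) #=> true
bookmark.each_subject { |s| puts s }   # courtesy of RDF::Enumerable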
RDF.rb is not yet ready for production use, but it's under heavy development and already quite useful. Give it a shot. You can post any issues in the GitHub issue queue.
Posted at 16:05
One of the most talked about features in Rails 3 is its plug & play architecture with various frameworks like Datamapper in place of ActiveRecord for the ORM or jQuery for javascript. However, I've yet to see much info on how to actually do this with the javascript framework.
Fortunately, it looks like a lot of the hard work has already been done. Rails now emits HTML that is compatible with the unobtrusive approach to javascript. Meaning, instead of seeing a delete link like this:
<a href="/users/1" onclick="if (confirm('Are you sure?')) { var f = document.createElement('form'); f.style.display = 'none'; this.parentNode.appendChild(f); f.method = 'POST'; f.action = this.href;var m = document.createElement('input'); m.setAttribute('type', 'hidden'); m.setAttribute('name', '_method'); m.setAttribute('value', 'delete'); f.appendChild(m);f.submit(); };return false;">Delete</a>
you'll now see it written as
<a rel="nofollow" data-method="delete" data-confirm="Are you sure?" class="delete" href="/user/1">Delete</a>
This makes it very easy for a javascript driver to come along, pick out and identify the relevant pieces, and attach the appropriate handlers.
So, enough blabbing. How do you get jQuery working with Rails 3? I'll try to make this short and sweet.
Grab the jQuery driver at http://github.com/rails/jquery-ujs and put it in your javascripts directory. The file is at src/rails.js
Include jQuery (I just use the Google-hosted version) and the driver in your application layout or view. In HAML it would look something like:
= javascript_include_tag "http://ajax.googleapis.com/ajax/libs/jquery/1.4.1/jquery.min.js"
= javascript_include_tag 'rails'
Rails requires an authenticity token to do form posts back to the server. This helps protect your site against CSRF attacks. In order to handle this requirement the driver looks for two meta tags that must be defined in your page's head. This would look like:
<meta name="csrf-token" content="<%= form_authenticity_token %>" />
<meta name="csrf-param" content="authenticity_token" />
In HAML this would be:
%meta{:name => 'csrf-token', :content => form_authenticity_token}
%meta{:name => 'csrf-param', :content => 'authenticity_token'}
Update: Jeremy Kemper points out that the above meta tags can be written out with a single call to "csrf_meta_tag".
That should be all you need. Remember, this is still a work in progress, so don't be surprised if there's a few bugs. Please also note this has been tested with Rails 3.0.0.beta.
Posted at 16:05
The W3C SPARQL working group (previously the Data Access Working Group) has recently released their first versions of the updated SPARQL standards, or SPARQL 1.1. The group's roadmap has these finalized a year from now, but they have asked for comments and I suppose these are mine.
I believe that these documents are a step further down a wrong path for SPARQL and, to a lesser degree, for RDF in general.
The latest round includes a number of changes to SPARQL: aggregate functions, subqueries, projection expressions, negation, updates and deletions, more specific HTTP protocol bindings, service discovery, entailment regimes, and a RESTful protocol for managing RDF graphs (the last one is not really just SPARQL, but it's in the updates).
So I'll start with my comments, which are mostly critical.
To start, an RDF-specific complaint, not really related to the rest of the post. Why would the one mandated format to be supported in the new RESTful RDF graph management interface be RDF/XML? What would it take for the semweb community to move on from this failed standard, which has had known issues for more than 5 years? (Those two issues were raised in 2001 and are currently marked 'postponed'.) Why should such an increasingly irrelevant standard as RDF/XML be chosen instead of the widely supported and easy-to-implement N3, N-Triples, or Turtle?
As for SPARQL, the 1.1 standards continue to give named graphs first-class citizen status, both in the web APIs and in more SPARQL syntax than they had before. It's not so much triples as quads these days. Other meta-metadata, such as time of assertion or validity time, is not covered. While named graphs are admittedly a particularly common case, why do they need to invade the syntax of SPARQL? Not every use case needs named graphs, but every SPARQL implementor must support them. The 1.1 standard now includes precedence rules for named graph and base URIs when they conflict between HTTP query options and the query itself, attempting to solve this self-created problem.
How about subqueries? What about variables during insertions? What about subqueries during insertions? Do we really need implementors to consider these kinds of things for every SPARQL endpoint on the web?
None of these things is really all that bad by itself, but one must consider the bigger picture. SPARQL 1.0 was released in January of 2008 (with some comment period before that) and there is still no implementation of a SPARQL engine in PHP or Ruby (exceptions apply, see [1]). One does not increase the participation of that ecosystem by adding a selection of entailment regimes to the standard.
While a SPARQL implementation exists for the excellent RDFLib in Python, it's only one of the current big 3 (with Ruby and PHP) in web development, and there's only one. The fact that no SPARQL engines exist for Ruby or PHP should be considered a failure of the standard. Why are we adding complexity when there is no SQLite for SPARQL? Why are there at least 3 monolithic Java implementations (Jena, Sesame, Boca), all financially sponsored to some degree or another, but so little 'in the wild'? How long can RDFLib herd 16 cats as committers on the project? While I don't have a lot of direct experience with RDFLib, I pity the project 'leads' (I cannot find evidence that the project is sponsored or that anyone is 'in charge') trying to look towards the future of implementing 6 working papers of new standards.
One of the biggest success stories for semweb in widespread use is the Drupal RDF module, which has found wide acceptance in the Drupal community and started an ecosystem of modules. Drupal 7 will output RDFa by default and Drupal 6 supports a ton of wonderful features, including reversing the RSS 1.0 to 2.0 downgrade back to RDF. But Drupal remains a producer of simple triples and a consumer of SPARQL queries generated by other endpoints. Data in those sites remains locked down. Why? Because implementing SPARQL in PHP is nontrivial, and in a chicken-egg problem, nobody's paying for it before someone has a need for SPARQL.
I could go on, but these are symptoms (well, not the RDF/XML thing; I don't think there's a good reason for that). I feel that the working group is attempting to solve the wrong problem. Namely, it is attempting to define a somewhat-human-readable query language, SPARQL, that works for almost all use cases. But why must the whole 'kitchen sink' be well-defined? Such a standards body should be attempting to define the easiest possible thing to implement and extend, not the last tool anyone would ever use.
The SPARQL 1.0 standard's grammar was well-defined as a context free grammar. It also had extension functions, which were uniquely defined by URIs. Why the distinction between CFG elements and extension functions? Why not make syntax elements like named graphs and aggregate functions as discoverable as extensions? Well, the reason is that it's hard to write a parser of a human-readable format and make those things optional and discoverable. (Here's a SPARQL parser implementation in Scala, a language with powerful pattern matching features for good parsing, and it's 500 lines of code. It compiles to S-expressions, the parsing of which is about 30 lines. Hmm.)
If the protocol had been defined as S-expressions, the distinction would not exist and the syntax could be as expandable as the current functions (the current syntax would just be more functions). The new 1.1 service discovery mechanism is excellent and extendible and would allow the standard to grow dynamically instead of becoming bogged down in features for particular use cases. New baseline implementations of SPARQL would be easy to implement and grow incrementally, and the current human-readable format can be implemented in terms of these expressions.
The web of ontologies has grown with ad-hoc definitions created by people used to fill their needs. Standards grow organically around the ones that are needed most, others languish. Why should SPARQL functions have this kind of flexibility, but not the syntax? The distinction makes implementation overly difficult and is slowing the expansion of the Semantic Web.
In fact, it turns out that Jena has been parsing to S-expressions for some time. If you're an implementor, why would you do it any other way, especially when the standard can change as much as it does in 1.1? Any implementation will have to come up with something equivalent to S-expressions if you are going to be able to upgrade your engine implementation to meet standards like this when they are finalized. If people are doing it anyway, why not just make it the standard?
The SPARQL Working Group should be working on a definition for a function list and discovery protocol for S-expressions, and not for what we currently call SPARQL. What we call SPARQL is something that should compile to a simpler standard if various vendors want to implement it. S-expressions allow maximally simple parsing, maximally simple serialization, and the ability to do feature discovery on core features of the language, not just portions which are blessed with the ability to be extended. S-expressions are easier for machines to generate for a wide variety of automated use cases, far wider, I would venture, than the set of use cases for the human-readable queries.
Please, please, please do not doom the world to write the SPARQL equivalent of SQLAlchemy and ActiveRecord for the next 20 years! We can define a standard that machines can use natively. Now's the time.
At any rate, that's my beef in a nutshell. The working group won't come up with a successful standard until it's easy enough to implement it that workable implementations appear in the languages that are defining the web today. And when people can use those languages to implement that standard without an army of VC-funded engineers.
The SPARQL 1.1 proposals make the standard better than before, but it's not the standard we need. The SPARQL algebra is what needed expansion and specification, not the syntax.
[1]: The PHP ARC project has an implementation, but it attempts to directly convert SPARQL to an SQL query on a particular table layout in MySQL, and is difficult to adapt for general use. Despite SPARQL's complexity, ARC managed to implement this in just 6400 lines of code. The parser alone is 2000 lines and the engine another 4400. The serialization/parsing libraries, however, are fine, and were integrated successfully into the Drupal RDF module. The PHP RAP project has also done some good work and is perhaps more wrappable than ARC, but implements only a subset of SPARQL.
Posted at 16:05