November 07

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion at a customer made me take a closer look at support for encryption in XaaS cloud service offerings as well as in Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rarely found.

Different reasons might exist why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to code it in the application logic.

IaaS-level. Especially in this category, file storage for app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3) and concerning Google’s App Engine, good practices for data encryption seem to be only just emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive seem to not offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are few activities around making HDFS or the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.

October 31

Egon Willighagen: SARS-CoV-2, COVID-19, and Open Science

WP4846, which I started on March 16. It will see a massive overhaul in the next weeks.
Voices are getting stronger over how important Open Science is. Insiders have known the advantages for decades. We also know the issues in the transition, but the transition has been steady. Contributing to Open Science is simple: there are plenty of projects where you can contribute without jeopardizing your own research (funding or prestige). Myself, my small contributions have been done without funding too. But I needed to do something. I have been mostly self-quarantined since March 6, with only very few exceptions. And I'm so done with it. Like so many other people. It won't stop me wearing masks when I go shopping (etc).

Reflecting on the past eight months, particularly the last two months have been tough. It's easier to sit at home and in the garden when it is warm outside, light. But for another 7 weeks or so, the days will only get darker. The past two months were also so busy with grant reporting that I did not get around to much else, even with an uncommonly long stretch of long working weeks, with about 8 weeks of 70-80 hrs of active work in that period. In fact, the past two weeks, with most of the deadlines past, I had a physical reset, and was happy if I made 40 hrs a week.

So, where is my COVID-19 work now, where is it going?

Molecular Pathways

First, what did we reach? Leveraging the Open Science community I am involved in, I started collaborating. With old friends and making new friends. I was delighted to see I was not the only one. In fact, somewhere in May/June, I had to give up following all Open Science around COVID-19, because there was too much.

For example, I was not the only one wanting to describe our slowly developing molecular knowledge of the SARS-CoV-2 virus. While my pathway focused specifically on the confirmed processes for SARS-CoV-2, my colleague Freddie digitized a recent review about other coronaviruses. Check out her work: WP4863, WP4864, WP4877, WP4880, and WP4912. In fact, so much was done by so many people in such a short time, that the WikiPathways COVID-19 Portal was set up.

Further reading:
  • Ostaszewski M. COVID-19 Disease Map, a computational knowledge repository of SARS-CoV-2 virus-host interaction mechanisms. bioRxiv. 2020 Oct 28; 10.1101/2020.10.26.356014v1 (and unversioned 10.1101/2020.10.26.356014)
  • Ostaszewski M, Mazein A, Gillespie ME, Kuperstein I, Niarakis A, Hermjakob H, et al. COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms. Sci Data. 2020 May 5;7(1):136 10.1038/s41597-020-0477-8

Interoperability with Wikidata

Because I see an essential role for Wikidata in Open Science, and because regular databases did not provide identifiers for the molecular building blocks, we created them in Wikidata. This was essential, because I wanted to use Scholia (see screenshot on the right) to track the research output (something that by now has become quite a challenge; btw, check out Lauren's tutorial on this). This too was still in March. However, because Scholia itself is a general tool, I needed shortlists of all SARS-CoV-2 genes, all proteins, etc. So, I created this book. It's autogenerated and auto-updated by taking advantage of SPARQL queries against Wikidata. And I am so excited the book has been translated into Japanese, Portuguese, and Spanish. The i18n work is thanks to the virtual BioHackathon in April, where Yayamamo bootstrapped the framework to localize the content.
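To give an idea of what such a shortlist query can look like, here is a minimal sketch in Python. It is not the query the book actually uses. It only builds a SPARQL query and the matching request URL for the public Wikidata query service. The identifiers (P31 "instance of", P703 "found in taxon", Q7187 for gene, Q82069695 for SARS-CoV-2) are my assumptions and should be verified in Wikidata before use; the function names are made up for this example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers; verify them in Wikidata before relying on them.
SARS_COV_2 = "Q82069695"  # assumed item for the SARS-CoV-2 virus
GENE = "Q7187"            # assumed item for the class "gene"


def build_shortlist_query(taxon: str, kind: str, limit: int = 50) -> str:
    """Return a SPARQL query listing items of one kind found in one taxon."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{kind} ;\n"    # instance of <kind>
        f"        wdt:P703 wd:{taxon} .\n"  # found in taxon <taxon>
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request that asks for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_shortlist_query(SARS_COV_2, GENE)
url = build_request_url(query)
print(url)
```

Fetching the URL and turning the JSON bindings into a book page is left out on purpose, since that part depends on the publishing setup.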

Also during that BioHackathon, we started a collaboration with Complex Portal's Birgit, because the next step was to have identifiers for (bio)molecular complexes. This work is still ongoing, but using a workaround we developed for WikiPathways (because complexes in GPML currently cannot have identifiers), we can now link out to Complex Portal, as visible in this screenshot:

The autophagy initiation complex has the CPX-373 identifier in Complex Portal.

Joining the Wikidata effort is simple. Just visit Wikidata:WikiProject_COVID-19 and find your thing of interest. Because the past two months have been so crowded, I still did not get around to exploring the kg-covid-19 project, but it sounds very interesting too!

Further reading:
  • Waagmeester A, Stupp G, Burgstaller-Muehlbacher S, Good BM, Griffith M, Griffith OL, et al. Wikidata as a knowledge graph for the life sciences. eLife. 2020 Mar 17;9:e52614. 10.7554/eLife.52614
  • Waagmeester A, Willighagen EL, Su AI, Kutmon M, Labra Gayo JE, Fernández-Álvarez D, et al. A protocol for adding knowledge to Wikidata, a case report. bioRxiv [Internet]. 2020 Apr 7 [cited 2020 Apr 17]; 10.1101/2020.04.05.026336

Computer-assisted data curation

For some years now, I have been working on computer-assisted data curation of WikiPathways, but also Wikidata (for chemical compounds). Once your biological knowledge is machine readable, you can teach machines to recognize common mistakes. Some are basically simple checks, like missing information. But it gets exciting if we take advantage of linked data, and we can have machines check consistency between two or more resources. The better our annotation, the more powerful this computer-assisted data curation becomes. Chris has been urging me to publish this, but I haven't gotten around to this yet.

As part of my COVID-19 work, I have started making curation reports for specific WikiPathways. To enable this, I worked out how to reuse the testing without JUnit, allowing the tests to be used as a library. That allows creating the reports, but in the future will also allow use directly in PathVisio. A second improvement to the testing stack is that tests are now more easily annotated. That allows specifying tests only to be run for a certain WikiPathways portal.
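The idea of tests that work as a library and carry portal annotations can be sketched roughly as follows. This is a Python illustration only, not the actual testing code (which grew out of JUnit tests); the decorator, the check names and the dictionary layout of a pathway are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Check:
    name: str
    func: Callable[[Dict], Optional[str]]
    portals: Set[str]  # empty set means "run for every pathway"


REGISTRY: List[Check] = []


def check(*portals: str):
    """Register a curation check, optionally annotated with portal names."""
    def decorator(func):
        REGISTRY.append(Check(func.__name__, func, set(portals)))
        return func
    return decorator


@check()
def has_title(pathway: Dict) -> Optional[str]:
    return None if pathway.get("title") else "pathway has no title"


@check("covid19")
def has_literature(pathway: Dict) -> Optional[str]:
    return None if pathway.get("references") else "pathway cites no literature"


def run_checks(pathway: Dict, portal: Optional[str] = None) -> List[str]:
    """Run all applicable checks and return report lines for the failures."""
    report = []
    for c in REGISTRY:
        if c.portals and portal not in c.portals:
            continue  # check is annotated for other portals only
        problem = c.func(pathway)
        if problem:
            report.append(f"{c.name}: {problem}")
    return report


# Hypothetical pathway record, only for illustration.
example = {"id": "WP0000", "title": "Example pathway", "references": []}
print(run_checks(example, portal="covid19"))
```

Because the checks are plain functions in a registry, the same code could feed a report generator or an editor plug-in without a test runner in between.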

But a lot remains to be done. I think at this moment I only migrated, perhaps, some 5% of all tests. So, this is very much on my "what is next?" list.

What is next?

There is a lot I need, want, and should do. Here are some ideas. Maybe you want to beat me to it. Really, I don't mind being scooped, when it comes to public health. Here goes:
  1. file SARS-CoV-2 book translation update requests for some recent updates
  2. update the SARS-CoV-2 book with a list of important SNPs
  3. add Bioschemas to the SARS-CoV-2 book for individual proteins, genes, etc
  4. update WP4846 with recent literature
  5. have another 'main subject' annotation round for SARS-CoV-2 proteins
  6. migrate more pathways tests from JUnit into the testing library
  7. write a new test to detect preprints in pathway literature lists and check for journal article versions
  8. finish the Dutch translation of the SARS-CoV-2 book
  9. write a tool to recognize WikiPathways complexes with matches in Complex Portal
  10. write a tool to generate markdown for any WikiPathways with curation suggestions based on content in other resources
  11. develop a few HTML+JavaScript pages to summarize WikiPathways COVID-19 Portal content
Am I missing anything? Tweet me or leave a comment here.
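For idea 7, a first step could look like the Python sketch below. It assumes that a DOI with the 10.1101 prefix and a date-stamped suffix (as in the bioRxiv references cited in this post) is a preprint; that is a heuristic, and the function name is made up. Checking whether a journal version exists would need a lookup against an external service and is not shown.

```python
import re
from typing import Iterable, List

# Assumption: DOIs of the form 10.1101/YYYY.MM.DD.NNNNNN (optionally with a
# version suffix such as v1) are bioRxiv/medRxiv preprints.
PREPRINT_DOI = re.compile(r"^10\.1101/\d{4}\.\d{2}\.\d{2}\.\d+(v\d+)?$")


def find_preprints(dois: Iterable[str]) -> List[str]:
    """Return the DOIs in a literature list that look like preprints."""
    return [doi for doi in dois if PREPRINT_DOI.match(doi.strip().lower())]


# DOIs taken from the "Further reading" lists in this post.
literature = [
    "10.1101/2020.10.26.356014",
    "10.1038/s41597-020-0477-8",
    "10.1101/2020.04.05.026336",
]
print(find_preprints(literature))
```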

October 30

Leigh Dodds: Brief review of revisions and corrections policies for official statistics

In my earlier post on the importance of tracking updates to datasets I noted that the UK Statistics Authority Code of Practice includes a requirement that publishers of official statistics must publish a policy that describes their approach to revisions and corrections.

See 3.9 in T3: Orderly Release, which states: “Scheduled revisions or unscheduled corrections to the statistics and data should be released as soon as practicable. The changes should be handled transparently in line with a published policy.”

The Code of Practice includes definitions of both Scheduled Revisions and Unscheduled Corrections.

Scheduled Revisions are defined as: “Planned amendments to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication”.

Whereas Unscheduled Corrections are: “Amendments made to published statistics in response to the identification of errors following their initial publication”.

I decided to have a read through a bunch of policies to see what they include and how they compare.

Here are some observations based on a brief survey of this list of 15 different policies including those by the Office for National Statistics, the FSA, Gambling Commission, CQC, DfE, PHE, HESA and others.


The Code of Practice applies to official statistics. Some organisations publishing official statistics also publish other statistical datasets.

In some cases organisations have written policies that apply:

  • to all their statistical outputs, regardless of designation
  • only to those outputs that are official statistics
  • only to specific datasets, through individual policies

There’s some variation in the amount of detail provided.

Some read as basic compliance documents with basic statements of intent to follow the recommendations of the code of practice. They include, for example, a note that revisions and corrections will be handled transparently, in a timely way and with general notes about how that will happen.

Others are more detailed, giving more insight into how the policy will actually be carried out in practice. From a data consumer perspective these feel a bit more useful as they often include timescales for reporting, lines of responsibility and notes about how changes are communicated.


Some policies elaborate on the definitions in the code of practice, providing a bit more breakdown on the types of scheduled revisions and sources of error.

For example some policies indicate that changes to statistics may be driven by:

  • access to new or corrected source data
  • routine recalculations, as per methodologies, to establish baselines
  • improvements to methodologies
  • corrections to calculations

Some organisations publish provisional releases of these statistics. So their policies discuss Scheduled Revisions in this light: a dataset is published in one or more provisional releases before being finalised. During those updates the organisation may have been in receipt of new or updated data that impacts how the statistics are calculated. Or it may have fixed errors.

Other organisations do not publish provisional statistics so their datasets do not have scheduled revisions.

A few policies include a classification of the severity of errors, along the lines of:

  • major errors that impact interpretation or reuse of data
  • minor errors in statistics, which may include anything that is not major
  • other minor errors or mistakes, e.g. typographical errors

These classifications are used to describe different approaches to handling the errors, appropriate to their severity.
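As a rough illustration of how such a classification could drive handling, here is a small Python sketch. The three levels follow the list above, but the handling texts and the two yes/no questions are placeholders of my own and are not taken from any specific policy.

```python
from enum import Enum


class Severity(Enum):
    MAJOR = "major"
    MINOR = "minor"
    TYPO = "typographical"


# Placeholder handling rules; a real policy would define its own.
HANDLING = {
    Severity.MAJOR: "correct as soon as possible and notify users",
    Severity.MINOR: "correct at the next planned release",
    Severity.TYPO: "fix directly",
}


def classify(affects_interpretation: bool, affects_statistics: bool) -> Severity:
    """Classify an error using two simplified yes/no questions."""
    if affects_interpretation:
        return Severity.MAJOR
    if affects_statistics:
        return Severity.MINOR
    return Severity.TYPO


severity = classify(affects_interpretation=False, affects_statistics=True)
print(severity.value, "->", HANDLING[severity])
```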

Decision making

The policies frequently require decision making around how specific revisions and corrections might be handled, with implications for investment of time and resources in handling and communicating the necessary revisions and corrections.

In some cases responsibility lies with a senior leader, e.g. a Head of Profession, or other senior analyst. In some cases decision making rests with the product owner with responsibility for the dataset.

Scheduled revisions

Scheduled changes are, by definition, planned in advance. So the policy sections relating to these revisions are typically brief and tend to focus on the release process.

In general, the policies align around:

  • having clear timetables for when revisions are to be expected
  • summarising key impacts, detail and extent of revisions in the next release of a publication and/or dataset
  • clear labelling of provisional, final and revised statistics

Several of the policies include methodological changes in their handling of scheduled revisions. These explain that changes will be consulted on and clearly communicated in advance. In some cases historical data may be revised to align with the new methodology.


Handling of corrections tends to be the largest section of each policy. These sections frequently highlight that, despite rigorous quality control, errors may creep in, either because of mistakes or because of corrections to upstream data sources.

There are different approaches to how quickly errors will be handled and fixed. In some cases this depends on the severity of errors. But in others the process is based on publication schedules or organisational preference.

For example, in one case (SEPA), there is a stated preference to handle publication of unscheduled corrections once a year. In other policies corrections will be applied at the next planned (“orderly”) release of the dataset.

Impact assessments

Several policies note that there will be an impact assessment undertaken to fully understand an error before any changes are made.

These assessments include questions like:

  • does the error impact a headline figure or statistic?
  • is the error within previously reported margins of accuracy or certainty?
  • who will be impacted by the change?
  • what are the consequences of the change, e.g. does it impact the main insights from the previously published statistics or how they might be used?

Severity of errors

Major errors tend to get some special treatment. Corrections to these errors are typically made more rapidly. But there are few commitments to timeliness of publishing corrections. “As soon as possible” is a typical statement.

The two exceptions I noted are the MOD policy which notes that minor errors will be corrected within 12 months, and the CQC policy which commits to publishing corrections within 20 days of an agreement to do so. (Others may include commitments that I’ve missed.)

A couple of policies highlight that errors may be found before a fix is available. In these cases, the existence of the error will still be reported.

The Welsh Revenue Authority policy was the only one that noted that it might even retract a dataset from publication while it fixed an error.

A couple of policies noted that minor errors that did not impact interpretation may not be fixed at all. For example one ONS policy notes that errors within the original bounds of uncertainty in the statistics may not be corrected.

Minor typographic errors might just be directly fixed on websites without recording or reporting of changes.


There seems to be general consensus on the use of “p” for provisional and “r” for revised figures in statistics. Interestingly, the Welsh Revenue Authority policy notes that while there are accepted Welsh translations for “provisional” and “revised”, the marker symbols remain untranslated.

Some policies clarify that these markers may be applied at several levels, e.g. to individual cells as well as rows and columns in a table.
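To make the marker convention concrete, here is a small Python sketch of how “p” and “r” could be attached to individual cells or to a whole row. The `Cell` type and the rendering format are my own invention for illustration; the policies do not prescribe a data model.

```python
from dataclasses import dataclass
from typing import List, Optional

MARKERS = {"p": "provisional", "r": "revised"}


@dataclass
class Cell:
    value: float
    marker: Optional[str] = None  # "p", "r", or None for a final figure

    def render(self) -> str:
        """Format the value, appending the marker in brackets if present."""
        if self.marker is None:
            return f"{self.value:g}"
        if self.marker not in MARKERS:
            raise ValueError(f"unknown marker: {self.marker!r}")
        return f"{self.value:g} ({self.marker})"


def mark_row(cells: List[Cell], marker: str) -> List[Cell]:
    """Apply one marker to every cell in a row."""
    return [Cell(c.value, marker) for c in cells]


row = [Cell(10.5), Cell(12.0, "p")]
print([c.render() for c in row])
print([c.render() for c in mark_row(row, "r")])
```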

Only one policy noted a convention around adding “revised” to a dataset name.


As required by the code of practice, the policies align on providing some transparency around what has been changed and the reason for the changes. Where they differ is around how that will be communicated and how much detail is included in the policy.

In general, revisions and corrections will simply be explained in the next release of the dataset, or before if a major error is fixed. The goal is to provide users with a reason for the change, and the details of the impact on the statistics and data.

These explanations are handled by additional documentation to be included in publications, markers on individual statistics, etc. Revision logs and notices are common.

Significant changes to methodologies or major errors get special treatment, e.g. via notices on websites or announcements via Twitter.

Many of the policies also explain that known users or “key users” will be informed of significant revisions or corrections. Presumably this is via email or other communications.

One policy noted that the results of their impact assessment and decision making around how to handle a problem might be shared publicly.

Capturing lessons learned

A few of the policies included a commitment to carry out a review of how an error occurred in order to improve internal processes, procedures and methods. This process may be extended to include data providers where appropriate.

One policy noted that the results of this review and any planned changes might be published where it would be deemed to increase confidence in the data.

Wrapping up

I found this to be an interesting exercise. It isn’t a comprehensive review, but hopefully it provides a useful summary of approaches.

I’m going to resist the urge to write recommendations or thoughts on what might be added to these policies. Reading a policy doesn’t tell us how well it’s implemented, or whether users feel it is serving their needs.

I will admit to feeling a little surprised that there isn’t a more structured approach in many cases. For example, pointers to where I might find a list of recent revisions or how to sign up to get notified as an interested user of the data.

I had also expected some stronger commitments about how quickly fixes may be made. These can be difficult to make in a general policy, but are what you might expect from a data product or service.

These elements might be covered by other policies or regulations. If you know of any that are worth reviewing, then let me know.


Leigh Dodds: The importance of tracking dataset retractions and updates

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset being just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. But also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives; one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community leading to a set of “ethical open source licences”.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

  • maintain mirrors of data
  • retract or remove data that breached laws or social and ethical norms
  • update derived datasets to remove or amend data
  • re-run analyses against datasets which have seen significant corrections or revisions
  • assess the impacts of poor quality or unethically shared data
  • proactively notify relevant communities of potential impacts relating to published data
  • monitor and review the reasons why datasets get retracted
  • …etc, etc
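As a thought experiment, a change notice could be a very small record that mirrors and derived datasets check against the version they hold. The Python sketch below is one possible shape, not an existing standard; the field names, the three kinds of change and the example dataset names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChangeNotice:
    dataset: str
    version: int
    kind: str    # "revision", "correction" or "retraction"
    reason: str


def pending_actions(dataset: str, held_version: int,
                    notices: List[ChangeNotice]) -> List[str]:
    """List what a mirror holding `held_version` of a dataset should do."""
    actions = []
    for n in sorted(notices, key=lambda n: n.version):
        if n.dataset != dataset or n.version <= held_version:
            continue
        if n.kind == "retraction":
            # A retraction overrides any earlier pending updates.
            return [f"remove {dataset}: {n.reason}"]
        actions.append(f"update {dataset} to v{n.version}: {n.reason}")
    return actions


# Hypothetical feed of notices, only for illustration.
feed = [
    ChangeNotice("example-faces", 2, "correction", "fixed mislabelled rows"),
    ChangeNotice("example-faces", 3, "retraction", "consent concerns"),
    ChangeNotice("example-stats", 2, "revision", "new source data"),
]
print(pending_actions("example-faces", 1, feed))
print(pending_actions("example-stats", 1, feed))
```

Even something this simple would let a mirror or a derived dataset find out that the source it copied has since been withdrawn.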

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3: Orderly Release, of the UK Statistics Authority code of practice explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.

October 27

Libby Miller: MyNatureWatch with a High Quality Raspberry Pi camera

I’ve been using a MyNatureWatch setup on my bird table for ages now, and I really love it (you should try it). The standard setup is with a pi zero (though it works fine with other versions of the Pi too). I’ve used the recommended, very cheap, pi zero camera with it, and also the usual pi camera (you can fit it to a zero using a special cable). I got myself one of the newish high quality Pi cameras (you need a lens too, I got this one) to see if I could get some better pics.

I could!

Pigeon portrait using the Pi HQ camera with wide angle lens

I was asked on twitter how easy it is to set up with the HQ camera, so here are some quick notes on how I did it. Short answer – if you use the newish beta version of the MyNatureWatch downloadable image it works just fine with no changes. If you are on the older version, you need to upgrade it, which is a bit fiddly because of the way it works (it creates its own wifi access point that you can connect to, so it’s never usually online). It’s perfectly doable with some fiddling, but you need to share your laptop’s network and use ssh.

Blackbird feeding its young, somewhat out of focus

MyNatureWatch Beta – this is much the easiest option. The beta is downloadable here (more details) and has some cool new features such as video. Just install as usual and connect the HQ camera using the zero cable (you’ll have to buy this separately, the HQ camera comes with an ordinary cable). It is a beta and I had a networking problem with it the first time I installed it (the second time it was fine). You could always put it on a new SD card if you don’t want to blat a working installation. Pimoroni have 32GB cards for £9.

The only fiddly bit after that is adjusting the focus. If you are not used to it, the high quality camera CCTV lens is a bit confusing, but it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).

MyNatureWatch older version – to make this work with the HQ camera you’ll need to be comfortable with sharing your computer’s network over USB, and with using ssh. Download the img here, and install on an SD card as usual. Then, connect the camera to the zero using the zero cable (we’ll need it connected to check things are working).

Next, share your network with the Pi. On a mac it’s like this:

Sharing network using system preferences on a Mac

You might not have the RNDIS/Ethernet gadget option there on yours – I just ticked all of them the first time and *handwave* it worked after a couple of tries.

Now connect your zero to your laptop using the zero’s USB port (not its power port) – we’re going to be using the zero as a gadget (which the MyNatureWatch people have already kindly set up for you).

Once it’s powered up as usual, use ssh to login to the pi, like this:

ssh pi@camera.local
password: badgersandfoxes

On a mac, you can always ssh in but can’t necessarily reach the internet from the device. Test that the internet works like this:


This sort of thing means it’s working:

PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=116 time=19.5 ms
64 bytes from ( icmp_seq=2 ttl=116 time=19.6 ms

If it just hangs, try unplugging the zero and trying again. I’ve no idea why it works sometimes and not others.

Once you have it working, stop mynaturewatch using the camera temporarily:

sudo systemctl stop nwcameraserver.service

and try taking a picture:

raspistill -o tmp.jpg

you should get this error:

mmal: Cannot read camera info, keeping the defaults for OV5647
mmal: mmal_vc_component_create: failed to create component '' (1:ENOMEM)
mmal: mmal_component_create_core: could not create component '' (1)
mmal: Failed to create camera component
mmal: main: Failed to create camera component
mmal: Camera is not detected. Please check carefully the camera module is installed correctly

Ok so now upgrade:

sudo apt-get update
sudo apt-get upgrade

you will get a warning about hostapd – press q when you see this. The whole upgrade took about 20 minutes for me.

When it’s done, reboot

sudo reboot

ssh in again, and test again if you want

sudo systemctl stop nwcameraserver.service
raspistill -o tmp.jpg

reenable hostapd:

sudo systemctl unmask hostapd.service
sudo systemctl enable hostapd.service

reboot again, and then you should be able to use it as usual (i.e. connect to its own wifi access point etc).

The only fiddly bit after that is adjusting the focus. I used a gnome for that, but still sometimes get it wrong. If you are not used to it, the high quality camera CCTV lens is a bit confusing – it’s possible to lock all the rings so that you can set the focus while it’s in a less awkward position if you like. Here are the instructions for that (pages 9 and 10).

A gnome

October 24

Libby Miller: Libbybot – a posable remote presence bot made from a Raspberry Pi 3 – updates

A couple of people have asked me about my presence-robot-in-a-lamp, libbybot – unsurprising at the moment maybe – so I’ve updated the code in GitHub to use the most recent RTCMultiConnection (WebRTC) library and done a general tidy up.

I gave a presentation at EMFCamp about it a couple of years ago – here are the slides:


October 22

Leigh Dodds: Consulting Spreadsheet Detective, Season 1

I was very pleased to announce my new TV series today, loosely based on real events. More details here in the official press release.


Coming to all major streaming services in 2021 will be the exciting new series: “Turning the Tables”.

Exploring the murky corporate world of poorly formatted spreadsheets and nefarious macros, each episode of this new series will explore another unique mystery.

When the cells lie empty, who can help the CSV:PI team pivot their investigation?

When things don’t add up, who can you turn to but an experienced solver?

Who else but Leigh Dodds, Consulting Spreadsheet Detective?

This smart, exciting and funny new show throws deductive reasoner Dodds into the mix with Detectives Rose Cortana and Colm Bing, part of the crack new CSV:PI squad.

Rose: the gifted hacker. Quick to fire up an IDE, but slow to validate new friends.

Colm: the user researcher. Strong on empathy but with an enigmatic past that hints at time in the cells.

What can we expect from Season 1?

Episode 1: #VALUE!

In his first case, Dodds has to demonstrate his worth to a skeptical Rose and Colm, by fixing a corrupt formula in a startup valuation.

Episode 2: #NAME?

A personal data breach leaves the team in a race against time to protect the innocent. A mysterious informant known as VLOOKUP leaves Dodds a note.

Episode 3: #REF!

A light-hearted episode where Dodds is called in to resolve a mishap with a 5-a-side football team matchmaking spreadsheet. Does he stay between the lines?

Episode 4: #NUM?

A misparsed gene name leads a researcher into recommending the wrong vaccine. It’s up to Dodds to fix the formula.

Episode 5: #NULL!

Sometimes it’s not the spreadsheet that’s broken. Rose and Colm have to educate a researcher on the issue of data bias, while Dodds follows up references to the mysterious Macro corporation.

Episode 6: #DIV/0?

Chasing down an internationalisation issue Dodds, Rose and Colm race around the globe following a trail of error messages. As Dodds gets unexpectedly separated from the CSV:PI team, Rose and Colm unmask the hidden cell containing the mysterious VLOOKUP.

In addition to the six episodes in season one, a special feature length episode will air on National Spreadsheet Day 2021:

Feature Episode: #####

Colm’s past resurfaces. Can he grow enough to let the team see the problem, and help him validate his role in the team?

Having previously only anchored documentaries, like “Around the World with 80,000 Apps” and “Great Data Journeys”, taking on the eponymous role will be Dodds’ first foray into fiction. We’re sure he’ll have enough pizazz to wow even the harshest critics.

“Turning the Tables” will feature music composed by Dan Barrett.

October 14

Sandro Hawke: Elevator Pitch for the Semantic Web

SemanticWeb.Com invited people to make video elevator pitches for the Semantic Web, focused on the question “What is the Semantic Web?”. I decided to give it a go.

I’d love to hear comments from folks who share my motivation, trying to solve this ‘every app is a walled garden’ problem.

In case you’re curious, here’s the script I’d written down, which turned out to be wayyyy too long for the elevators in my building, and also too long for me to remember.

Eric Franzon of SemanticWeb.Com invited people to send in an elevator pitch for the Semantic Web. Here’s mine, aimed at a non-technical audience. I’m Sandro Hawke, and I work for W3C at MIT, but this is entirely my own view.

The problem I’m trying to solve comes from the fact that if you want to do something online with other people, your software has to be compatible with theirs. In practice this usually means you all have to use the same software, and that’s a problem. If you want to share photos with a group, and you use facebook, they all have to use facebook. If you use flickr, they all have to use flickr.

It’s like this for nearly every kind of software out there.

The exceptions show what’s possible if we solve this problem. In a few cases, through years of hard work, people have been able to create standards which allow compatible software to be built. We see this with email and we see this with the web. Because of this, email and the Web are everywhere. They permeate our lives and now it’s hard to imagine modern life without them.

In other areas, though, we’re stuck, because we don’t have these standards, and we’re not likely to get them any time soon. So if you want to create, explore, play a game, or generally collaborate with a group of people online, every person in the group has to use the same software you do. That’s a pain, and it seriously limits how much we can use these systems.

I see the answer in the Semantic Web. I believe the Semantic Web will provide the infrastructure to solve this problem. It’s not ready yet, but when it is, programs will be able to use the Semantic Web to automatically merge data with other programs, making them all — automatically — compatible.

If I were up to doing another take, I’d change the line about the Semantic Web not being much yet. And maybe add a little more detail about how I see it working. I suppose I’d go for this script:

Okay, elevator pitch for the Semantic Web.

What is the Semantic Web?

Well, right now, it’s a set of technologies that are seeing some adoption and can be useful in their own right, but what I want it to become is the way everyone shares their data, the way all software works together.

This is important because every program we use locks us into its own little silo, its own walled garden.

For example, imagine I want to share photos with you. If I use facebook, you have to use facebook. If I use flickr, you have to use flickr. And if I want to share with a group, they all have to use the same system.

That’s a problem, and I think it’s one the Semantic Web can solve with a mixture of standards, downloadable data mappings, and existing Web technologies.

I’m Sandro Hawke, and I work for W3C at MIT. This has been entirely my own opinion.

(If only I could change the video as easily as that text. Alas, that’s part of the magic of movies.)

So, back to the subject at hand. Who is with me on this?

Michael Hausenblas: Turning tabular data into entities

Two widely used data formats on the Web are CSV and JSON. In order to enable fine-grained access in a hypermedia-oriented fashion I’ve started to work on Tride, a mapping language that takes one or more CSV files as inputs and produces a set of (connected) JSON documents.

In the 2 min demo video I use two CSV files (people.csv and group.csv) as well as a mapping file (group-map.json) to produce a set of interconnected JSON documents.
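To make the idea concrete, here is a minimal Python sketch of this kind of CSV-to-JSON transformation. It is not Tride and does not read the mapping language; the column names (ID, first-name, last-name, group, title, homepage), the sample rows and the function name build_documents are all assumptions made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data standing in for people.csv and group.csv.
PEOPLE_CSV = """ID,first-name,last-name,group
1,Ada,Lovelace,10
2,Alan,Turing,10
"""
GROUP_CSV = """ID,title,homepage
10,Pioneers,http://example.org/pioneers
"""


def build_documents(people_csv, group_csv, base="http://localhost:8000"):
    """Return {url: document}, with people and groups linked by URL."""
    people = list(csv.DictReader(io.StringIO(people_csv)))
    groups = list(csv.DictReader(io.StringIO(group_csv)))
    docs = {}
    for g in groups:
        docs[f"{base}/group/{g['ID']}"] = {
            "title": g["title"],
            "homepage": g["homepage"],
            "members": [],
        }
    for p in people:
        person_url = f"{base}/people/{p['ID']}"
        group_url = f"{base}/group/{p['group']}"
        docs[person_url] = {
            "fname": p["first-name"],
            "lname": p["last-name"],
            "member": group_url,
        }
        # Back-link from the group to each of its members.
        if group_url in docs:
            docs[group_url]["members"].append(person_url)
    return docs


print(json.dumps(build_documents(PEOPLE_CSV, GROUP_CSV), indent=2))
```

Each row becomes its own document, and the links between documents are plain URLs, which is what makes the result navigable in a hypermedia style.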

So, the following mapping file:

{
 "input" : [
  { "name" : "people", "src" : "people.csv" },
  { "name" : "group", "src" : "group.csv" }
 ],
 "map" : {
  "people" : {
   "base" : "http://localhost:8000/people/",
   "output" : "../out/people/",
   "with" : {
    "fname" : "people.first-name",
    "lname" : "people.last-name",
    "member" : " to:group.ID"
   }
  },
  "group" : {
   "base" : "http://localhost:8000/group/",
   "output" : "../out/group/",
   "with" : {
    "title" : "group.title",
    "homepage" : "group.homepage",
    "members" : " link:group.ID to:people.ID"
   }
  }
 }
}

… produces JSON documents representing groups. One concrete example output is shown below:

John Goodwin: Using Machine Learning to write the Queen’s Christmas Message

In their excellent book “The Indisputable Existence of Santa Claus” Hannah Fry (aka @FryRsquared) and Thomas Oléron Evans (aka @Mathistopheles) talked about using Markov Chains to generate the Queen’s Christmas message. You can read a bit about that here. After reading that chapter I asked Hannah and Thomas if they had considered repeating this using recurrent neural networks. A couple of years ago Andrej Karpathy wrote a blog that he summarised as follows:

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

In his blog he posed the question:

It looks like we can learn to spell English words. But how about if there is more structure and style in the data?

and went on to train an RNN to write text in the style of Shakespeare. I recommend you read his blog to see the examples and get an idea of what is going on.

You might also want to watch this video from the excellent Computerphile on a similar subject.

In a previous blog I decided to have a go at using RNNs to generate lyrics for my favourite band New Model Army. Thomas was kind enough to send me the text for the Queen’s previous speeches that he and Hannah had used in their work, and I thought I’d see how good RNNs were at generating a Royal Christmas message. I didn’t do anything particularly fancy here… just followed the code and examples in Karpathy’s original blog on a fairly standard home desktop machine.

Here are some of the results… (I take no responsibility for any text generated by the neural network :)).

One parameter that can be adjusted to generate the output is ‘temperature’. This takes a value between 0 and 1. Lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes. Running with a temperature of 0.5 we get:

i want to send a special message of the human spirit are much to be for everyone this is a happy and part of the christmas story so in the same streed. we are all truly new zealand in an example of the commonwealth has an equal place in the lives of men and women and what they have to show that they are all life in their own peoples of the commonwealth which we have come to the sacrifice of all the fortunes of hope and there are many heart of our grandchildren the presents the prince of wales and will have a preacher group of the commonwealth. on a strong and spreading the responsibility to the commonwealth who are much on the courage of the modern world and the commonwealth games and there are many others at home and the rest of the world have been inspired by the courage and enjoyment and the present peace of a commonwealth which is how to promote the games and from all walks of life and to become a growing peace on earth which we should me to see the strength to try to do so in the world today. it was a great conflict remain inspired by the courage and prince philip and i were able to see buckingham palace which the most special year for the commonwealth is to be movement and in a different furmestand and they are all source of spirit all over the world have shared a moment in the year and the christmas message is to be consed rather a quicter and the contribution of the family of nations and i have meen and wonderful in both these ordarisasions or a common sense and encouragement. this year i am speaking to you from the same family make the world in the world. the christmas message is so many of them and i have difference to the best of ourselves. and this year there are something to look at the world they are may be a life of all the world is the commonwealth is a sense of belonging to the world i was well as they are all sort of moment the more depression

I’m not sure this makes a lot of sense, and there are a few oddities/errors in the text.

I tried lowering the temperature to 0.3:

i hope that the christmas message is to be there and it is not only a very happy christmas to you all. i am speaking to you from the commonwealth and i have been able to see that the commonwealth and i have been deeply touched this year i am speaking to you from the same family friends and communities as the future of the commonwealth and as i have always given to many of you will have heard of the more conscious of the commonwealth and around us we should be able to see that the commonwealth and i have been deeply touched the strength to try to do the same family gathering in the commonwealth and i have all been about the future as well as a great comfort to us all the problems of the commonwealth and i have seen in their own way the problems of the commonwealth and i have been deeply touched the world the progress of the commonwealth and around us we are common throughout the commonwealth who are struck by the state of the commonwealth and i have depended to see in the world today that we should remember those who have seen in their own way a celebration of the child who was born at christmas time for families and friends will never be very different problems but it is not only a time for reflection and confidence to the commonwealth and i have all been about the world that we should not be our lives and to remind us of the future. i am speaking to you from the commonwealth and i have been deeply touched the world that we can all try to make a splendid birthday and the commonwealth and i have been floods and sadness and the best of the world have been able to discuss the best of ourselves. i believe that this christmas day i want to send a special message of hope in the face of hardship is nothing new that of the commonwealth who have seen in their lives in the commonwealth and in the commonwealth and i have been able to discuss the best of ourselves. we are all live together as a great daily and its war the commonwealth

As suggested, the result here is a bit more predictable/boring.
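The effect of temperature can be sketched in a few lines of Python. This is not Karpathy’s code, just the usual way temperature is applied: the model’s scores are divided by the temperature before being turned into probabilities. The scores below are made-up numbers for three candidate characters, and the function name is my own.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next characters.
scores = [2.0, 1.0, 0.5]

for t in (1.0, 0.5, 0.3):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more of the probability piles onto the top-scoring character, which is why the 0.3 text keeps circling back to the same phrases.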

It is also possible to prime the model with some starting text. This starts out the RNN with some hardcoded characters to warm it up with some context before it starts generating text.

Using a temperature of 0.5 and prime text of ‘Each Christmas’ we get:

each christmas is a time for reflection and grandchildren the more depends of a dedication of the commonwealth and around us of the future of the commonwealth games in the rich and proud of life in a great religions and its members. but i am also speaking by instince in the right through the members of my family and i have been great and personal responsibility to our lives and they say to me that there are many happy or so many countries of the future with the prince of peace and happiness and self-respect for the commonwealth and is a happy and prosperous future of the world we will be achieved with the opportunity to help them are a contribution to the powerful ways of spirit and learning to see the problems of the commonwealth and i have seen in their inestivation the peoples of the commonwealth and i have been for the better but that is possible to the people and the christians it’s all those who are great practical form we have all been for them to be a chance to get the gift of a carren of different lives the service of their people throughout the world. we are all live in the rolin to our children and grandchildren the present generation to see this country and arrived the same for one another which is most popularity and the rest of the world and all the great commonwealth. pronomed news the responsibilities for the duty of events which we have all been the holidic of science. it is they all who are so there and shared heritage and sometimes in saint paul’s and much considerate and to communication the human spirit. and they can be a contralle commonwealth and a firm landmark in the course of the many servicemen and women who are broken for many that can be a precious given us to be a witness this continuing that we can all be the contribution of the commonwealth and as we all have all the features of the love of hope and goodwill. this year has constantly will be overcome. 
i believe that this year i am speaking to you from the hope of a determination and continues to all

Using a temperature of 0.5 and prime text of ‘This past year’ we get:

this past year has been one to see that what i have been for the commonwealth and around us we are thinking of the problems show in the face of the fact that he wanted the strongly that we should show many of them service is the birth of the commonwealth. it is a contribution of a lives of family and friends for parents and religious difficulties at the commonwealth and as we all share the future as the future of the future. it was a moment and that the christmas message of the things that have been god shown in the earliest care and for the commonwealth can give the united kingdom and across the commonwealth. it is a helping one of the march of people who are so easy to go but the rest of the world have shaped for their determination and courage of what is right the life of life is word if we can do the state of the same time last month i was welcomed as you and that the opportunity to honour the most important of the thread which have provided a strong and family. even the commonwealth is a common bond that the old games and in the courage which the generations of the commonwealth and i have those to succeed without problems which have their course all that the world has been complete strangers which could have our response. i believe that the triditings of reconciliation but there is nothing in and happy or acts of the commonwealth and around us by science the right of all the members of the world have been difficult and the benefits of dreads and happiness and service to the commonwealth of the future. i wish you all a very happy christmas to you all.

So there you have it. Not sure we’ll be replacing the Queen with an AI anytime soon. Merry Christmas!

Leigh Dodds: Tip for improving standards documentation

I love a good standard. I’ve written about them a lot here.

As it’s #WorldStandardsDay I thought I’d write a quick post to share something that I’ve learned from leading and supporting some standards work.

I’ve already shared this with a number of people who have asked for advice on standards work, and in some recent user research interviews I’ve participated in. So it makes sense to write it down.

In the ODIHQ standards guide, we explained that at the end of your initial activity to develop a standard, you should plan to produce a range of outputs. This includes a variety of tools and guidance that help people use the standard. You will need much more than just a technical specification.

To plan for the different types of documentation that you may need I recommend applying this “Grand Unified Theory of Documentation“.

That framework highlights that four different types of documentation are intended to be used by different audiences to address different needs. The content designers and writers out there reading this will be rolling their eyes at this obvious insight.

Here’s how I’ve been trying to apply it to standards documentation:


Reference

This is your primary technical specification. It’ll have all the detail about the standard, the background concepts, the conformance criteria, etc.

It’s the document of record that captures all of the hard work you’ve invested in building consensus around the standard. It fills a valuable role as the document you can point back to when you need to clarify or confirm what was agreed.

But, unless it’s a very simple standard, it’s going to have a limited audience. A developer looking to implement a conformant tool, API or library may need to read and digest all of the detail. But most people want something else.

Put the effort into ensuring it’s clear, precise and well-structured. But plan to also produce three additional categories of documentation.


Explainers

Many people just want an overview of what it is designed to do. What value will it provide? What use cases was it designed to support? Why was it developed? Who is developing it?

These are higher-level introductory questions. The type of questions that business stakeholders want to answer to sign-off on implementing a standard, so it goes onto a product roadmap.

Explainers also provide useful background for a developer ahead of taking a deeper dive. If there are some key concepts that are important to understanding the design and implementation of a standard, then write an explainer.


Tutorials

A simple, end-to-end description of how to apply the standard. E.g. how to publish a dataset that conforms to the standard, or export data from an existing system.

A tutorial will walk you through using a specific set of tools, frameworks or programming languages. The end result is a basic implementation of the standard, or a simple dataset that passes some basic validation checks. A tutorial won’t cover all of the detail; it’s just enough to get you started.

You may need several tutorials to support different types of users. Or different languages and frameworks.

If you’ve produced a tool, like a validator or a template spreadsheet to support data publication, you’ll probably need a tutorial for each of them unless they are very simple to use.

Tutorials are gold for a developer who has been told: “please implement this standard, but you only have 2 days to do it”.


How-Tos

Short, task-oriented documentation focused on helping someone apply the standard. E.g. “How to produce a CSV file from Excel”, “Importing GeoJSON data in QGIS”, “Describing a bus stop”. Make them short and digestible.

How-Tos can help developers build from a tutorial, to a more complete implementation. Or help a non-technical user quickly apply a standard or benefit from it.

You’ll probably end up with lots of these over time. Drive creating them from the types of questions or support requests you’re getting. Been asked how to do something three times? Write a How-To.

There’s lots more that can be said about standards documentation. For example you could add Case Studies to this list. And it’s important to think about whether written documentation is the right format. Maybe your Explainers and How-Tos can be videos?

But I’ve found the framework to be a useful planning tool. Have a look at the documentation for more tips.

Producing extra documentation to support the launch of a standard, and then investing in improving and expanding it over time will always be time well-spent.

October 10

Libby Miller: Zoom on a Pi 4 (4GB)

It works using chromium not the Zoom app (which only runs on x86, not ARM). I tested it with a two-person, two-video stream call. You need a screen (I happened to have a spare 7″ touchscreen). You also need a keyboard for the initial setup, and a mouse if you don’t have a touchscreen.

The really nice thing is that Video4Linux (bcm2835-v4l2) support has improved so it works with both v1 and v2 raspi cameras, and no need for options bcm2835-v4l2 gst_v4l2src_is_broken=1 🎉🎉



  • Install Raspbian Buster
  • Connect the screen, keyboard, mouse, camera and speaker/mic. I used a Sennheiser USB speaker / mic, and a standard 2.1 Raspberry Pi camera.
  • Boot up. I had to add lcd_rotate=2 in /boot/config.txt for my screen to rotate it 180 degrees.
  • Don’t forget to enable the camera in raspi-config
  • Enable bcm2835-v4l2 – add it to sudo nano /etc/modules
  • I increased swapsize using sudo nano /etc/dphys-swapfile -> CONF_SWAPSIZE=2000 -> sudo /etc/init.d/dphys-swapfile restart
  • I increased GPU memory using sudo nano /boot/config.txt -> gpu_mem=512

You’ll need to set up Zoom and pass captchas using the keyboard and mouse. Once you have logged into Zoom you can often ssh in and start it remotely like this:

export DISPLAY=:0.0
/usr/bin/chromium-browser --kiosk --disable-infobars --disable-session-crashed-bubble --no-first-run

Note the url format – this is what you get when you click “join from my browser”. If you use the standard Zoom url you’ll need to click this url yourself, ignoring the Open xdg-open prompts.


You’ll still need to select the audio and start the video, including allowing it in the browser. You might need to select the correct audio and video, but I didn’t need to.

I experimented a bit with an ancient logitech webcam-speaker-mic and the speaker-mic part worked and video started but stalled – which made me think that a better / more recent webcam might just work.

Libby Miller: Removing rivets

I wanted to stay away from the computer during a week off work so I had a plan to fix up some garden chairs whose wooden slats had gone rotten:


Looking more closely I realised the slats were riveted on. How do you get rivets off? I asked my hackspace buddies and Barney suggested drilling them out. They have an indentation in the back and you don’t have to drill very far to get them out.

The first chair took me two hours to drill out 15 rivets, and was a frustrating and sweaty experience. I checked YouTube to make sure I wasn’t doing anything stupid and tried a few different drill bits. My last chair today took 15 minutes, so! My amateurish top tips / reminder for me next time:

  1. Find a drill bit the same size as the hole that the rivet’s gone through
  2. Make sure it’s a tough drill bit, and not too pointy. You are trying to pop off the bottom end of the rivet – it comes off like a ring – and not drill a hole into the rivet itself.
  3. Wear eye protection – there’s the potential for little bits of sharp metal to be flying around
  4. Give it some welly – I found it was really fast once I started to put some pressure on the drill
  5. Get the angle right – it seemed to work best when I was drilling exactly vertically down into the rivet, and not at a slight angle.
  6. Once drilled, you might need to pop them out with a screwdriver or something of the right width plus a hammer


More about rivets.

Libby Miller: Real_libby – a GPT-2 based slackbot

In the latest of my continuing attempts to automate myself, I retrained a GPT-2 model with my iMessages, and made a slackbot so people could talk to it. Since Barney (an expert on these matters) felt it was unethical that it vanished whenever I shut my laptop, it’s now living happily(?) if a little more slowly in a Raspberry Pi 4.

It was surprisingly easy to do, with a few hints from Barney. I’ve sketched out what I did below. If you make one, remember that it can leak out private information – names in particular – and can also be pretty sweary, though mine’s not said anything outright offensive (yet).

fuck, mitzhelaists!

This work is inspired by the many brilliant Twitter bot-makers and machine-learning people out there such as Barney (who has many bots, including inspire_ration and notYourBot, and knows much more about machine learning and bots than I do), Shardcore (who made Algohiggs, which is probably where I got the idea for using GPT-2), and Janelle Shane (whose ML-generated names for e.g. cats are always an inspiration).

First, get your data

The first step was to get at my iMessages. A lot of iPhone data is backed up as sqlite, so if you decrypt your backups and have a dig round, you can use something like baskup. I had to make a few changes but found my data in

/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/3d/3d0d7e5fb2ce288813306e4d4636395e047a3d28

This number – 3d0d7e5fb2ce288813306e4d4636395e047a3d28 – seems always to indicate the iMessage database – though it moves round depending on what version of iOS you have. I made a script to write the output from baskup into a flat text file for GPT-2 to slurp up. I had about 5K lines.
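For anyone wanting to skip baskup, a script like that might look roughly like the sketch below, which reads straight from the sqlite file. It is not the script I used: the `message` table and its `text` column are assumptions about the database layout, and `dump_messages` and the file names in the comment are made up, so check the schema of your own backup first.

```python
import sqlite3


def dump_messages(db_path, out_path):
    """Write every non-empty message text to out_path, one message per line."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumed layout: a `message` table with a `text` column.
        rows = conn.execute(
            "SELECT text FROM message WHERE text IS NOT NULL ORDER BY ROWID"
        )
        count = 0
        with open(out_path, "w", encoding="utf-8") as out:
            for (text,) in rows:
                # Collapse newlines and runs of spaces so each message is one line.
                line = " ".join(text.split())
                if line:
                    out.write(line + "\n")
                    count += 1
        return count
    finally:
        conn.close()


# Example call with hypothetical file names:
#   dump_messages("3d0d7e5fb2ce288813306e4d4636395e047a3d28", "messages.txt")
```

One message per line keeps the training file simple, at the cost of losing who said what.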

Retrain GPT-2

I used this code.

python3 ./ 117M

PYTHONPATH=src ./ --dataset /Users/[me]/gpt-2/scripts/data/

I left it overnight on my laptop and by morning loss and avg were oscillating so I figured it was done – 3600 epochs. The output from training was fun, e.g.:

([2899 | 33552.87] loss=0.10 avg=0.07)

my pigeons get dandruff
treehouse actually get little pellets
little pellets of the same stuff as well, which I can stuff pigeons with
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets

Test it

I copied the checkpoint directory into the models directory

cp -r checkpoint/run1 models/libby
cp models/117M/{encoder.json,hparams.json,vocab.bpe} models/libby/

At which point I could test it using the code provided:

python3 src/ --model_name libby

This worked but spewed out a lot of text, very slowly. Adding --length 20 sped it up:

python3 src/ --model_name libby --length 20

That was the bulk of it done! I turned it into a server and then whipped up a slackbot – it responds to direct questions and occasionally to a random message.
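The reply rule can be sketched without any Slack code at all. This is a guess at the logic rather than the bot’s actual source: `should_reply`, the `<@BOT_ID>` mention check and the 5% chance are all assumptions.

```python
import random


def should_reply(text, bot_user_id, is_direct_message=False,
                 random_chance=0.05, rng=random):
    """Decide whether the bot answers a message.

    Always answer direct messages and @-mentions; otherwise answer
    only occasionally, with probability random_chance.
    """
    if is_direct_message:
        return True
    if f"<@{bot_user_id}>" in text:
        return True
    return rng.random() < random_chance


# Hypothetical bot id; a message that mentions the bot is always answered.
print(should_reply("<@U123> could you get some milk?", "U123"))
```

Keeping the random chance low stops the bot taking over a channel while still letting it chip in unprompted now and then.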

Putting it on a Raspberry Pi 4 was very very easy. Startlingly so.

It’s been an interesting exercise, and mostly very funny. These bots have the capacity to surprise you and come up with the occasional apt response (I’m cherrypicking)

We’ve been talking a lot at work about personal data and what we would do with our own, particularly messages with friends and the pleasure of scrolling back and finding old jokes and funny messages. My messages were mostly of the “could you get some milk?” “here’s a funny picture of the cat” type, but it covered a long period and there were also two very sad events in there. Parsing the data and coming across those again was a vivid reminder that this kind of personal data can be an emotional minefield and not something to be trivially messed with by idiots like me.

Also: while GPT-2 means there’s plausible deniability about any utterance, a bot like this can leak personal information of various kinds, such as names and regurgitated fragments of real messages. Unsurprisingly it’s not the kind of thing I’d be happy making public as is, and I’m not sure if it ever could be.



Libby Miller: An i2c heat sensor with a Raspberry Pi camera

I had a bit of a struggle with this so thought it was worth documenting. The problem is this – the i2c bus on the Raspberry Pi is used by the official camera to initialise it. So if you want to use an i2c device at the same time as the camera, the device will stop working after a few minutes. Here’s more on this problem.

I really wanted to use this heatsensor with mynaturewatch to see if we could exclude some of the problems with false positives (trees waving in the breeze and similar). I’ve not got it working well enough yet to look at this problem in detail. But, I did get it working on an i2c bus alongside the camera – here’s how.

It’s pretty straightforward. You need to

  • Create a new i2c bus on some different GPIOs
  • Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one
  • Fin

1. Create a new i2c bus on some different GPIOs

This is super-easy:

sudo nano /boot/config.txt

Add the following line of code, preferably in the section where spi and i2c are enabled.


This line will create an additional i2c bus (bus 3) on GPIO 23 as SDA and GPIO 24 as SCL (GPIO 23 and 24 are the defaults)
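The config line itself is missing above. Assuming the stock i2c-gpio device tree overlay and its bus, i2c_gpio_sda and i2c_gpio_scl parameters, a line matching that description can be put together like this. The helper name is made up, and the result is worth checking against the overlays README on your own Pi before relying on it.

```python
def i2c_gpio_overlay(bus=3, sda=23, scl=24):
    """Build a /boot/config.txt line for a software i2c bus.

    Assumes the i2c-gpio overlay; the pin numbers are GPIO numbers,
    not physical header pin numbers.
    """
    return f"dtoverlay=i2c-gpio,bus={bus},i2c_gpio_sda={sda},i2c_gpio_scl={scl}"


print(i2c_gpio_overlay())
```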

2. Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one

I am using this sensor, for which I need this circuitpython library (more info), installed using:

pip3 install Adafruit_CircuitPython_AMG88xx

While the Pi is switched off, plug in the i2c device using GPIO 23 for SDA and GPIO 24 for SCL, and then boot it up and check it’s working:

 i2cdetect -y 3

Make two changes:

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/

and change the SDA and SCL pins to the new pins

#SDA = Pin(2)
#SCL = Pin(3)
SDA = Pin(23)
SCL = Pin(24)

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/generic_linux/

Change line 21 or thereabouts to use the i2c bus 3 rather than the default, 1:

self._i2c_bus = smbus.SMBus(3)

3. Fin

Start up your camera code and your i2c peripheral. They should run happily together.

September 29

Leigh Dodds: A letter from the future about numbers

It’s odd now looking at early 21st century content in the Internet Archive. So little nuance.

It feels a little like watching those old black and white movies. All that colour which was just right there. But now lost. Easy to imagine that life was just monochrome. Harder to imagine the richer colours.

Or at least hard for me. There are AIs that will imagine it all for you now, of course. There have been for a while. They’ll repaint the pictures using data they’ve gleaned from elsewhere. But it’s not the film that is difficult to look at. It’s the numbers.

How did you manage with just those bare numerals?

If I showed you, a 21st century reader, one of our numbers you wouldn’t know what it was. You wouldn’t be able to read it.

Maybe you’ve seen that film Arrival? Based on a book by Ted Chiang? Remember the alien writing that was so complex and rich in meaning? That’s what our numbers might look like to you. You’d struggle to decode them.

Oh, the rest of it is much the same. The text, emojis and memes. Everything is just that bit richer, more visual. More nuanced. It’s even taught in schools now. Standardised, tested and interpreted for all. It’d be familiar enough.

You’d struggle with the numbers though. They’d take much more time to learn.

Not all of them. House numbers. Your position in the queue. The cost of a coffee. Those look exactly the same. Why would we change those?

It’s the important numbers that look different. The employment figures. Your pension value. Your expected grade. The air quality. The life-changing numbers. Those all look very different now.

At some point we decided that those numbers needed to be legible in entirely different ways. We needed to be able to see (or hear, or feel) the richness and limitations in the most important numbers. It was, it turned out, the only way to build that shared literacy.

To imagine how we got there, just think about how people have always adapted and co-opted digital platforms and media for their own ends. Hashtags and memes.

Faced with the difficulty of digging behind the numbers – the need to search for sample sizes, cite the sources, highlight the bias, check the facts –  we had to find a different way. It began with adding colour, toying with fonts and diacritics.


It took off from there. Layers of annotations becoming conventions and then standards. Whole new planes and dimensions in unicode.


All of the richness, all of the context made visible right there in the number.



Context expressed as colour and weight and strokes in the glyphs. You can just read it all right off the digits. There and there. See?

Things aren’t automatically better of course. Numbers aren’t suddenly to be more trusted. Why would they be?

It’s easier to see what’s not being said. It’s easier to demand better. It’s that little bit harder to ignore what’s before your eyes. It moves us on in our debates or just helps us recognise when the reasons for them aren’t actually down to the numbers at all.

It’s no longer acceptable to elide the detail. The numbers just look wrong. Simplistic. Black and white.

Which is why it’s difficult to read the Internet Archive sometimes.

We’ve got AIs that can dream up the missing information. Mining the Archive for the necessary provenance and adding it all back into the numbers. Just like adding colour to those old films, it can be breathtaking to see. But not in a good way. How could you have deluded yourselves and misled each other so easily?

I’ve got one more analogy for you.

Rorschach tests have long been consigned to history. But one of our numbers – the life-changing ones – might just remind you of colourful inkblots. And you might accuse us of just reading things into them. Imagining things that just aren’t there.

But numbers are just inkblots. Shapes in which we choose to see different aspects of the world. They always have been. We’ve just got a better palette.

Posted at 21:05

Leigh Dodds: Garden Retro 2020

I’ve been growing vegetables in our garden for years now. I usually end up putting the garden “to bed” for the winter towards the end of September. Harvesting the last bits of produce, weeding out the vegetable patches and covering up the earth until the Spring.

I thought I’d also do a bit of a retro to help me reflect on what worked and what didn’t work so well this year. We’ve had some mixed successes, so there are some things to reflect and improve on.

What did I set out to do this year?

This year I wanted to do a few things:

  • Grow some different vegetables
  • Get more produce out of the garden
  • Have fewer gluts of a single item (no more courgettes!) and limits on wastage
  • Have a more continuous harvest

What changes did I make?

To help achieve my goals, I made the following changes this year:

  • Make some of the planting denser, to try and get more into the same space
  • Plant some vegetables in pots and not just the vegetable patches, to make use of all available growing space
  • Tried to germinate and plant out seedlings as early as possible
  • Have several plantings of some vegetables, to allow me to harvest blocks of vegetables over a longer period. To help with this I produced a planting layout for each bed at the start of the year
  • Pay closer attention to the dates when produce was due to be ripe, by creating a Google calendar of expected harvest dates
  • Bought some new seeds, as I had a lot of older seeds

What did we grow?

The final list for this year was (new things in bold):

Basil, Butternut Squash, Carrots, Coriander, Cucumber, Lettuce, Pak Choi, Peas, Potato, Radish, Shallots, Spinach, Spring Onion, Sweetcorn

So, not as many new vegetables as I’d hoped, but I did try some different varieties.

What went well?

  • Had a great harvest overall, including about a kilo of fresh peas, 6kg of potatoes, couple of dozen cucumbers, great crop of spring onions and carrots
  • Having the calendar to help guide planting of seeds and planting in blocks across different beds. This definitely helped to limit gluts and spread out the availability of veg
  • Denser planting of peas and giving them a little more space worked well
  • Grew a really great lettuce 
  • Freezing the peas immediately after harvesting, so we could spread out use
  • Making pickled cucumbers and a carrot pickle to preserve some of the produce
  • Using Nemaslug (as usual) to keep the slugs at bay. Seriously, this is my number 1 gardening tip
  • Spring onions grew just fine in pots
  • Spinach harvest was great. None of it went to waste
  • Being able to go to the garden and pick radish, spinach, carrots, spring onion and pak choi and throw them in the wok for dinner was amazing

What didn’t go so well?

  • Germinated and planted out 2-3 different sets of sweetcorn, squash and cucumber plants. Ended up losing them all in early frosts. Nothing more frustrating than seeing things die within a day or so of planting out
  • Basil just didn’t properly germinate or grow this year. Tried 3-4 plantings, ended up with a couple of really scrawny plants. Not sure what happened there. They were in pots but were reasonably well watered.
  • Lost some decent lettuces to snails
  • Radish crop was pretty poor. Some good early harvest, but later sets were poor. I think I used some old seed. The close planting and not enough thinning also meant the plants ended up “leggy” and not growing sufficient bulbs
  • Tried coriander indoor and outdoor with mixed success. Like the radishes, they were pretty stringy. Managed to harvest some leaves but in the end, left them to go to seed and harvested those
  • Sweetcorn, after I did get some to grow, wasn’t great. Had some decent cobs on a few, but weakest harvest ever. Normally super reliable.
  • Spinach, Pak Choi and some Radishes went to bolt. So didn’t get the full harvest I might have done
  • Cucumbers I grew from seed. But ended up getting a couple of dozen from basically a single monster plant which spread all over the place. So, still had a massive glut of them. There are 7 in the kitchen right now.
  • Crap shallot harvest. Had about half a dozen

What will I do differently?

  • Thin the radishes more, use the early pickings in salads
  • Don’t rush to get the seedlings out too early in the year. This is the second year in a row where I’ve lost plants early on. Make sure to acclimatise them to the outdoors for longer
  • Apply Nemaslug at least twice, not just once a year at the start of the season
  • Try to find a way to control the slugs
  • While I watered regularly when it was very hot, I got lax when we had a wet period. Suspect this may have contributed to some plants going to bolt
  • Need to rotate stuff through the beds next year, to mix up planting
  • Look at where I can do companion planting, e.g. around the sweetcorn 
  • Going to expand the growing patch. The kids have outgrown their trampoline, so will be converting more of garden to beds next year
  • Add another 1-2 compost bins

Main thing I want to do next year is get a greenhouse. I’ve got my eye on this one. I want to grow tomatoes, chillis and peppers. It’ll also help me acclimatise some of the seedlings before properly planting out.

Having the space to grow vegetables is a privilege and I’m very glad and very lucky to have the opportunity.

Gardening can be time consuming and frustrating, but I love being able to cook with what I’ve grown myself. Getting out into the garden, doing something physical, seeing things grow is also a nice balm given everything else that is going on.

Looking forward to next year.


Posted at 18:05

September 18

Ebiquity research group UMBC: paper: Context Sensitive Access Control in Smart Home Environments

Sofia Dutta, Sai Sree Laya Chukkapalli, Madhura Sulgekar, Swathi Krithivasan, Prajit Kumar Das, and Anupam Joshi, Context Sensitive Access Control in Smart Home Environments, 6th IEEE International Conference on Big Data Security on Cloud, May 2020

The rise in popularity of Internet of Things (IoT) devices has opened doors for privacy and security breaches in Cyber-Physical systems like smart homes, smart vehicles, and smart grids that affect our daily existence. IoT systems are also a source of big data that gets shared via the cloud. IoT systems in a smart home environment have sensitive access control issues since they are deployed in a personal space. The collected data can also be of a highly personal nature. Therefore, it is critical to build access control models that govern who, under what circumstances, can access which sensed data or actuate a physical system. Traditional access control mechanisms are not expressive enough to handle such complex access control needs, warranting the incorporation of new methodologies for privacy and security. In this paper, we propose the creation of the PALS system, which builds upon existing work in an attribute-based access control model, captures physical context collected from sensed data (attributes) and performs dynamic reasoning over these attributes and context-driven policies using Semantic Web technologies to execute access control decisions. Reasoning over user context, details of the information collected by the cloud service provider, and device type, our mechanism generates access control decisions as a consequence. Our system’s access control decisions are supplemented by another sub-system that detects intrusions into smart home systems based on both network and behavioral data. The combined approach serves to determine indicators that a smart home system is under attack, as well as limit what data breach such attacks can achieve.
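To make the idea of context-driven, attribute-based decisions concrete, here is a minimal Python sketch. It is not the PALS implementation, which the abstract says relies on Semantic Web reasoning; the `Request` class, the `POLICIES` list and every attribute name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """An access request described by attributes (all names are hypothetical)."""
    user_role: str                  # who is asking, e.g. "resident" or "guest"
    device_type: str                # which device, e.g. "camera" or "thermostat"
    action: str                     # "read" sensed data or "actuate" the device
    context: dict = field(default_factory=dict)  # sensed context at request time


# Hypothetical policies: each one lists the attributes and context it requires.
POLICIES = [
    {"user_role": "resident", "device_type": "camera", "action": "read",
     "context": {}},
    {"user_role": "guest", "device_type": "thermostat", "action": "actuate",
     "context": {"resident_at_home": True}},
]


def is_allowed(request, policies=POLICIES):
    """Deny by default; allow only if some policy matches every attribute
    and every context condition of that policy holds for the request."""
    for policy in policies:
        attributes_match = (
            policy["user_role"] == request.user_role
            and policy["device_type"] == request.device_type
            and policy["action"] == request.action
        )
        context_matches = all(
            request.context.get(key) == value
            for key, value in policy["context"].items()
        )
        if attributes_match and context_matches:
            return True
    return False
```

In this sketch a guest may adjust the thermostat only while the sensed context says a resident is at home, which is the kind of "under what circumstances" condition the abstract describes.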

pals architecture

The post paper: Context Sensitive Access Control in Smart Home Environments appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: paper: Automating GDPR Compliance using Policy Integrated Blockchain

Automating GDPR Compliance using Policy Integrated Blockchain

Abhishek Mahindrakar and Karuna Pande Joshi, Automating GDPR Compliance using Policy Integrated Blockchain, 6th IEEE International Conference on Big Data Security on Cloud, May 2020.

Data protection regulations, like GDPR, mandate security controls to secure personally identifiable information (PII) of the users which they share with service providers. With the volume of shared data reaching exascale proportions, it is challenging to ensure GDPR compliance in real-time. We propose a novel approach that integrates GDPR ontology with blockchain to facilitate real-time automated data compliance. Our framework ensures data operation is allowed only when validated by data privacy policies in compliance with privacy rules in GDPR. When a valid transaction takes place the PII data is automatically stored off-chain in a database. Our system, built using Semantic Web and Ethereum Blockchain, includes an access control system that enforces data privacy policy when data is shared with third parties.

The post paper: Automating GDPR Compliance using Policy Integrated Blockchain appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: Why does Google think Raymond Chandler starred in Double Indemnity?

In my knowledge graph class yesterday we talked about the SPARQL query language and I illustrated it with DBpedia queries, including an example getting data about the movie Double Indemnity. I had brought a Google Assistant device and used it to compare its answers to those from DBpedia. When I asked the Google Assistant “Who starred in the film Double Indemnity”, the first person it mentioned was Raymond Chandler. I knew this was wrong, since he was one of its screenwriters, not an actor, and shared an Academy Award nomination for the screenplay. DBpedia’s data was correct and did not list Chandler as one of the actors.
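The exact query used in class is not shown here, so the following Python sketch is only a guess at what such a DBpedia query might look like. It assumes the film’s resource is `dbr:Double_Indemnity` and that the cast is modelled with `dbo:starring`; the helper `build_request_url` is a made-up name, and the code only builds the request URL rather than sending it.

```python
from urllib.parse import urlencode

# DBpedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumptions: the film is dbr:Double_Indemnity and actors are linked
# to it through the dbo:starring property.
STARRING_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?actor WHERE {
  dbr:Double_Indemnity dbo:starring ?actor .
}
"""


def build_request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON-formatted results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser or fetch it with any HTTP client.
    print(build_request_url(STARRING_QUERY))
```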

I did not feel too bad about this — we shouldn’t expect perfect accuracy in these huge, general purpose knowledge graphs and at least Chandler played an important role in making the film.

After class I looked at the Wikidata page for Double Indemnity (Q478209) and saw that it did list Chandler as an actor. I take this as evidence that Google’s Knowledge Graph got this incorrect fact from Wikidata, or perhaps from a precursor, Freebase.

The good news 🙂 is that Wikidata had flagged the fact that Chandler (Q180377) was a cast member in Double Indemnity with a “potential issue”. Clicking on this revealed that the issue was that Chandler was not known to have an occupation that the “cast member” property (P161) expects, which includes twelve types, such as actor, opera singer, comedian, and ballet dancer. Wikidata lists Chandler’s occupations as screenwriter, novelist, writer and poet.
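The check behind that flag can be sketched in a few lines of Python. This is only an illustration of the idea, not Wikidata’s actual constraint machinery: the data is hand-written rather than fetched, the allowed set holds just the four occupation types named above out of the twelve, and `cast_member_issues` is a hypothetical helper.

```python
# Hand-written illustrative data, not fetched from Wikidata.
# Only four of the twelve occupation types that P161 expects are listed.
ALLOWED_CAST_OCCUPATIONS = {"actor", "opera singer", "comedian", "ballet dancer"}

OCCUPATIONS = {
    "Raymond Chandler": {"screenwriter", "novelist", "writer", "poet"},
    "Barbara Stanwyck": {"actor"},
}


def cast_member_issues(cast, occupations, allowed):
    """Return the cast members who have no occupation in the allowed set."""
    return [
        person
        for person in cast
        if not (occupations.get(person, set()) & allowed)
    ]


if __name__ == "__main__":
    cast = ["Barbara Stanwyck", "Raymond Chandler"]
    # Chandler is reported because none of his occupations is an expected one.
    print(cast_member_issues(cast, OCCUPATIONS, ALLOWED_CAST_OCCUPATIONS))
```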

More good news 😀 is that the Wikidata fact had provenance information in the form of a reference stating that it came from CSFD (Q3561957), a “Czech and Slovak web project providing a movie database”. Following the link Wikidata provided led me eventually to the resource, which allowed me to search for and find its Double Indemnity entry. Indeed, it lists Raymond Chandler as one of the movie’s Hrají. All that was left to do was to ask for a translation, which confirmed that Hrají means “starring”.

Case closed? Well, not quite. What remains is fixing the problem.

The final good news 🙂 is that it’s easy to edit or delete an incorrect fact in Wikidata. I plan to delete the incorrect fact in class next Monday. Over the weekend, I’ll look into possible ways to add an annotation so that the incorrect ČSFD source for Chandler being a cast member is ignored.

Some possible bad news 🙁 is that public knowledge graphs like Wikidata might be exploited by unscrupulous groups or individuals in the future to promote false or biased information. Wikipedia is reasonably resilient to this, but the problem may be harder to manage for public knowledge graphs, which get much of their data from other sources that could be manipulated.

The post Why does Google think Raymond Chandler starred in Double Indemnity? appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition

The CCS Dashboard’s sections provide information on sources and targets of network events, file operations monitored and sub-events that are part of the APT kill chain. An alert is generated when a likely complete APT is detected after reasoning over events.

Early Detection of Cybersecurity Threats Using Collaborative Cognition

Sandeep Narayanan, Ashwinkumar Ganesan, Karuna Joshi, Tim Oates, Anupam Joshi and Tim Finin, Early detection of Cybersecurity Threats using Collaborative Cognition, 4th IEEE International Conference on Collaboration and Internet Computing, Philadelphia, October 2018.


The early detection of cybersecurity events such as attacks is challenging given the constantly evolving threat landscape. Even with advanced monitoring, sophisticated attackers can spend more than 100 days in a system before being detected. This paper describes a novel, collaborative framework that assists a security analyst by exploiting the power of semantically rich knowledge representation and reasoning integrated with different machine learning techniques. Our Cognitive Cybersecurity System ingests information from various textual sources and stores them in a common knowledge graph using terms from an extended version of the Unified Cybersecurity Ontology. The system then reasons over the knowledge graph that combines a variety of collaborative agents representing host and network-based sensors to derive improved actionable intelligence for security administrators, decreasing their cognitive load and increasing their confidence in the result. We describe a proof of concept framework for our approach and demonstrate its capabilities by testing it against a custom-built ransomware similar to WannaCry.

The post paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: paper: Attribute Based Encryption for Secure Access to Cloud Based EHR Systems

Attribute Based Encryption for Secure Access to Cloud Based EHR Systems

Maithilee Joshi, Karuna Joshi and Tim Finin, Attribute Based Encryption for Secure Access to Cloud Based EHR Systems, IEEE International Conference on Cloud Computing, San Francisco CA, July 2018


Medical organizations find it challenging to adopt cloud-based electronic medical records services, due to the risk of data breaches and the resulting compromise of patient data. Existing authorization models follow a patient centric approach for EHR management where the responsibility of authorizing data access is handled at the patients’ end. This however creates a significant overhead for the patient who has to authorize every access of their health record. This is not practical given the multiple personnel involved in providing care and that at times the patient may not be in a state to provide this authorization. Hence there is a need of developing a proper authorization delegation mechanism for safe, secure and easy cloud-based EHR management. We have developed a novel, centralized, attribute based authorization mechanism that uses Attribute Based Encryption (ABE) and allows for delegated secure access of patient records. This mechanism transfers the service management overhead from the patient to the medical organization and allows easy delegation of cloud-based EHR’s access authority to the medical providers. In this paper, we describe this novel ABE approach as well as the prototype system that we have created to illustrate it.

The post paper: Attribute Based Encryption for Secure Access to Cloud Based EHR Systems appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: Videos of ISWC 2017 talks

Videos of almost all of the talks from the 16th International Semantic Web Conference (ISWC) held in Vienna in 2017 are online. They include 89 research presentations, two keynote talks, the one-minute madness event and the opening and closing ceremonies.

The post Videos of ISWC 2017 talks appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: paper: Automated Knowledge Extraction from the Federal Acquisition Regulations System

Automated Knowledge Extraction from the Federal Acquisition Regulations System (FARS)

Srishty Saha and Karuna Pande Joshi, Automated Knowledge Extraction from the Federal Acquisition Regulations System (FARS), 2nd International Workshop on Enterprise Big Data Semantic and Analytics Modeling, IEEE Big Data Conference, December 2017.

With increasing regulation of Big Data, it is becoming essential for organizations to ensure compliance with various data protection standards. The Federal Acquisition Regulations System (FARS) within the Code of Federal Regulations (CFR) includes facts and rules for individuals and organizations seeking to do business with the US Federal government. Parsing and gathering knowledge from such lengthy regulation documents is currently done manually and is time and human intensive. Hence, developing a cognitive assistant for automated analysis of such legal documents has become a necessity. We have developed a semantically rich approach to automate the analysis of legal documents and have implemented a system to capture various facts and rules contributing towards building an efficient legal knowledge base that contains details of the relationships between various legal elements, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules. In this paper, we describe our framework along with the results of automating knowledge extraction from the FARS document (Title 48, CFR). Our approach can be used by Big Data users to automate knowledge extraction from large legal documents.

The post paper: Automated Knowledge Extraction from the Federal Acquisition Regulations System appeared first on UMBC ebiquity.

Posted at 14:13

Ebiquity research group UMBC: W3C Recommendation: Time Ontology in OWL

W3C Recommendation: Time Ontology in OWL

The Spatial Data on the Web Working Group has published a W3C Recommendation of the Time Ontology in OWL specification. The ontology provides a vocabulary for expressing facts about relations among instants and intervals, together with information about durations, and about temporal position including date-time information. Time positions and durations may be expressed using either the conventional Gregorian calendar and clock, or using another temporal reference system such as Unix-time, geologic time, or different calendars.

The post W3C Recommendation: Time Ontology in OWL appeared first on UMBC ebiquity.

Posted at 14:13

August 29

Leigh Dodds: Increasing inclusion around open standards for data

I read an interesting article this week by Ana Brandusescu, Michael Canares and Silvana Fumega. Called “Open data standards design behind closed doors?” it explores issues of inclusion and equity around the development of “open data standards” (which I’m reading as “open standards for data”).

Ana, Michael and Silvana rightly highlight that standards development is often seen and carried out as a technical process, whereas their development and impacts are often political, social or economic. To ensure that standards are well designed, we need to recognise their power, choose when to wield that tool, and ensure that we use it well. The article also asks questions about how standards are currently developed and suggests a framework for creating more participatory approaches throughout their development.

I’ve been reflecting on the article this week alongside a discussion that took place in this thread started by Ana.

Improving the ODI standards guidebook

I agree that standards development should absolutely be more inclusive. I too often find myself in standards discussions and groups with people that look like me and whose experiences may not always reflect those who are ultimately impacted by the creation and use of a standard.

In the open standards for data guidebook we explore how and why standards are developed to help make that process more transparent to a wider group of people. We also placed an emphasis on the importance of the scoping and adoption phases of standards development because this is so often where standards fail. Not just because the wrong thing is standardised, but also because the standard is designed for the wrong audience, or its potential impacts and value are not communicated.

Sometimes we don’t even need a standard. Standards development isn’t about creating specifications or technology, those are just outputs. The intended impact is to create some wider change in the world, which might be to increase transparency, or support implementation of a policy or to create a more equitable marketplace. Other interventions or activities might achieve those same goals better or faster. Some of them might not even use data(!)

But looking back through the guidebook, while we highlight in many places the need for engagement, outreach, developing a shared understanding of goals and desired impacts and a clear set of roles and responsibilities, we don’t specifically foreground issues of inclusion and equity as much as we could have.

The language and content of the guidebook could be improved. As could some prototype tools we included like the standards canvas. How would that be changed in order to foreground issues of inclusion and equity?

I’d love to get some contributions to the guidebook to help us improve it. Drop me a message if you have suggestions about that.

Standards as shared agreements

Open standards for data are reusable agreements that guide the exchange of data. They shape how I collect data from you, as a data provider. And as a data provider they shape how you (re)present data you have collected and, in many cases, will ultimately impact how you collect data in the future.

If we foreground standards as agreements for shaping how data is collected and shared, then to increase inclusion and equity in the design of those agreements we can look to existing work like the Toolkit for Centering Racial Equity which provides a framework for thinking about inclusion throughout the life-cycle of data. Standards development fits within that life-cycle, even if it operates at a larger scale and extends it out to different time frames.

We can also recognise existing work and best practices around good participatory design and research.

We should avoid standards development, as a process, being divorced from broader discussions and best practices around ethics, equity and engagement around data. Taking a more inclusive and equitable approach to standards development is part of the broader discussion around the need for more integration across the computing and social sciences.

We may also need to recognise that sometimes agreements are made that don’t provide equitable outcomes for everyone. We might not be able to achieve a compromise that works for everyone. Being transparent about the goals and aims of a standard, and how it was developed, can help to surface who it is designed for (or not). Sometimes we might just need different standards, optimised for different purposes.

Some standards are more harmful than others

There are many different types of standard. And standards can be applied to different types of data. The authors of the original article didn’t really touch on this within their framework, but I think it’s important to recognise these differences, as part of any follow-on activities.

The impacts of a poorly designed standard that classifies people or their health outcomes will be much more harmful than a poorly defined data exchange format. See all of Susan Leigh Star‘s work. Or concerns from indigenous peoples about how they are counted and represented (or not) in statistical datasets.

Increasing inclusion can help to mitigate the harmful impacts around data. So focusing on improving inclusion (or recognising existing work and best practices) around the design of standards with greater capacity for harms is important. The skills and experience required to develop a taxonomy are fundamentally different to those required to develop a data exchange format.

Recognising these differences is also helpful when planning how to engage with a wider group of people, as we can identify what help and input is needed: What skills or perspectives are lacking among those leading standards work? What help or support needs to be offered to increase inclusion? E.g. by developing skills, or choosing different collaboration tools or methods of seeking input.

Developing a community of practice

Since we launched the standards guidebook I’ve been wondering whether it would be helpful to have more of a community of practice around standards development. I found myself thinking about this again after reading Ana, Michael and Silvana’s article and the subsequent discussion on twitter.

What would that look like? Does it exist already?

Perhaps supported by a set of learning or training resources that re-purposes some of the ODI guidebook material alongside other resources to help others to engage with and lead impactful, inclusive standards work?

I’m interested to see how this work and discussion unfolds.

Posted at 12:09

August 28

Leigh Dodds: Four types of innovation around data

Vaughn Tan’s The Uncertainty Mindset is one of the most fascinating books I’ve read this year. It’s an exploration of how to build R&D teams drawing on lessons learned in high-end kitchens around the world. I love cooking and I’m interested in creative R&D and what makes high-performing teams work well. I’d strongly recommend it if you’re interested in any of these topics.

I’m also a sucker for a good intellectual framework that helps me think about things in different ways. I did that recently with the BASEDEF framework.

Tan introduces a nice framework in Chapter 4 of the book which looks at four broad types of innovation around food. These are presented as a way to help the reader understand how and where innovation creates impact in restaurants. The four categories are:

  1. New dishes – new arrangements of ingredients, where innovation might be incremental refinements to existing dishes, combining ingredients together in new ways, or using ingredients from different contexts (think “fusion”)
  2. New ingredients – coming up with new things to be cooked
  3. New cooking methods – new ways of cooking things, like spherification or sous vide
  4. New cooking processes – new ways of organising the processes of cooking, e.g. to help kitchen staff prepare a dish more efficiently and consistently

The categories at the top are more evident to the consumer, those lower down less so. But the impacts of new methods and processes are greater as they apply in a variety of contexts.

Somewhat inevitably, I found myself thinking about how these categories work in the context of data:

  1. New analyses (dishes) – New derived datasets made from existing primary sources. Or new ways of combining datasets to create insights. I’ve used the metaphor of cooking to describe data analysis before, those recipes for data-informed problem solving help to document this stage to make it reproducible
  2. New datasets and data sources (ingredients) – Finding and using new sources of data, like turning image, text or audio libraries into datasets, using cheaper sensors, finding a way to extract data from non-traditional sources, or using phone sensors for earthquake detection
  3. New methods for cleaning, managing or analysing data (cooking methods) – which includes things like Jupyter notebooks, machine learning or differential privacy
  4. New processes for organising the collection, preparation and analysis of data (cooking processes) – e.g. collaborative maintenance, developing open standards for data or approaches to data governance and collective consent?

The breakdown isn’t perfect, but I found the exercise useful to think through the types of innovation around data. I’ve been conscious recently that I’m often using the word “innovation” without really digging into what that means, how that innovation happens and what exactly is being done differently or produced as a result.

The categories are also useful, I think, in reflecting on the possible impacts of breakthroughs of different types. Or perhaps where investment in R&D might be prioritised and where ensuring the translation of innovative approaches into the mainstream might have most impact?

What do you think?

Posted at 15:07

Leigh Dodds: #TownscaperDailyChallenge

This post is a bit of a diary entry. It’s to help me remember a fun little activity that I was involved in recently.

I’d seen little gifs and screenshots of Townscaper on twitter for months. But then suddenly it was in early access.

I bought it and started playing around. I’ve been feeling like I was in a rut recently and wanted to do something creative. After seeing Jim Rossignol mention playing with Townscaper as a nightly activity, I thought I’d do similar.

Years ago I used to do lunchtime hacks and experiments as a way to be a bit more creative than I got to be in my day job. Having exactly an hour to create and build something is a nice constraint. Forces you to plan ahead and do the simplest thing to move an idea forward.

I decided to try lunchtime Townscaper builds. Each one with a different theme. I did my first one, with the theme “Bridge”, and shared it on twitter.

Chris Love liked the idea and suggested adding a hashtag so others could do the same. I hadn’t planned to share my themes and builds every day, but I thought, why not? The idea was to try doing something different after all.

So I tweeted out the first theme using the hashtag.

That tweet is the closest thing I’ve ever had to a “viral” tweet. It’s had over 53,523 impressions and over 650 interactions.

Turns out people love Townscaper. And are making lots of cool things with it.

Tweetdeck was pretty busy for the next few days. I had a few people start following me as a result, and suddenly felt a bit pressured. To help orchestrate things and manage my own peace of mind, I did a bit of forward planning.

I decided to run the activity for one week. At the end I’d either hand it over to someone or just step back.

I also spent the first evening brainstorming a list of themes. More than enough to keep me going for the week, so I could avoid the need to come up with new themes on the fly. I tried to find a mixture of words that were within the bounds of the types of things you could create in Townscaper, but left room for creativity. In the end I revised and prioritized the initial list over the course of the week based on how people engaged.

I wanted the activity to be inclusive so came up with a few ground rules: “No prizes, no winners. It’s just for fun.” And some brief guidance about how to participate (post screenshots, use the right hashtags).

I also wanted to help gather together submissions, but didn’t want to retweet or share all of them. So decided to finally try out creating twitter moments. One for each daily challenge. This added some work as I was always worrying I’d missed something, but it also meant I spent time looking at every build.

I ended up with two template tweets, one to introduce the challenge and one to publish the results. These were provided as a single thread to help weave everything together.

And over the course of a week, people built some amazing things. Take a look for yourself:

  1. Townscaper Daily Challenge #1 – Bridge
  2. Townscaper Daily Challenge #2 – Garden
  3. Townscaper Daily Challenge #3 – Neighbours
  4. Townscaper Daily Challenge #4 – Canal
  5. Townscaper Daily Challenge #5 – Eyrie
  6. Townscaper Daily Challenge #6 – Fortress
  7. Townscaper Daily Challenge #7 – Labyrinth

People played with the themes in interesting ways. They praised and commented on each other’s work. It was one of the most interesting, creative and fun things I’ve done on twitter.

By the end of the week, only a few people were contributing, so it was right to let it run its course. (Although I see that people are still occasionally using the hashtag).

It was a reminder than twitter can be and often is a completely different type of social space. A break from the doomscrolling was good.

It was also a reminded me how much I loved creating and making things. So I’m resolved to do more of that in the future.

Posted at 14:05

August 07

Sebastian Trueg: Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and cannot be found in the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quadruple store, which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that, but this view is sufficient for our purposes). Public graphs are readable and writable by anyone who has permission to read or write in general. Private graphs are only readable and writable by administrators and by those to whom named graph permissions have been granted. The latter case is what interests us today.
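As a rough mental model of the public/private distinction, here is a minimal Python sketch. It is not how Virtuoso is implemented: the `QuadStore` class, its method names and the `urn:example:...` identifiers are made up for illustration, and it assumes every caller already holds the general read permission mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Quad:
    """A triple plus the named graph it lives in."""
    subject: str
    predicate: str
    obj: str
    graph: str


@dataclass
class QuadStore:
    """Toy quad store; all names here are hypothetical, not Virtuoso APIs."""
    quads: list = field(default_factory=list)
    private_graphs: set = field(default_factory=set)
    read_grants: dict = field(default_factory=dict)  # graph -> set of agent URIs

    def insert(self, subject, predicate, obj, graph):
        self.quads.append(Quad(subject, predicate, obj, graph))

    def make_private(self, graph):
        self.private_graphs.add(graph)

    def grant_read(self, graph, agent):
        self.read_grants.setdefault(graph, set()).add(agent)

    def can_read(self, graph, agent=None, is_admin=False):
        # Public graphs: assumed readable by every caller in this sketch.
        if graph not in self.private_graphs:
            return True
        # Private graphs: administrators, or agents with an explicit grant.
        return is_admin or agent in self.read_grants.get(graph, set())

    def query_graph(self, graph, agent=None, is_admin=False):
        if not self.can_read(graph, agent, is_admin):
            return []
        return [q for q in self.quads if q.graph == graph]


store = QuadStore()
store.insert("urn:example:s", "urn:example:p", "urn:example:o", "urn:trueg:demo")
store.make_private("urn:trueg:demo")
store.grant_read("urn:trueg:demo", "urn:example:some-agent")
```

With this setup an anonymous query on `urn:trueg:demo` returns nothing, while an administrator or `urn:example:some-agent` sees the inserted quad.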

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

Virtuoso Sparql Endpoint

Sparql Result

This graph is now public and can be queried by anyone. To make it private we quickly need to change into a SQL session, since this part is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

Sparql Query
Sparql Query Result

But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.

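@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .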
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for the given agent. The only tricky part is the scope. Virtuoso has the concept of ACL scopes which group rules by their resource type. In this case the scope is private graphs; another typical scope would be DAV resources.

Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.
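If the rule is created by an application rather than by hand, the same request can be built from a script. The following is a minimal sketch using only Python's standard library. The helper names `build_rule_request` and `post_rule` are hypothetical; the URL, credentials and content type are taken from the curl call above, and the sketch assumes the endpoint accepts HTTP Basic authentication, which is what `curl -u` sends by default.

```python
import base64
import urllib.request


def build_rule_request(turtle_bytes, user="dba", password="dba",
                       url="http://localhost:8890/acl/rules"):
    """Build (but do not send) the POST request for a Turtle ACL rule."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=turtle_bytes,
        headers={
            "Content-Type": "text/turtle",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def post_rule(path="rule.ttl"):
    """Send the rule file and return the response body as text."""
    # Read the file as bytes so it is sent unmodified, like --data-binary.
    with open(path, "rb") as handle:
        request = build_rule_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `post_rule()` needs a running Virtuoso instance with the ACL API installed; `build_rule_request` alone can be used to inspect the request without any network access.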

Finally we will log in using my LinkedIn identity and are granted read access to the graph:

SPARQL Endpoint Login

We see all the original triples in the private graph. And as before with DAV resources, no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

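@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .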
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .
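To see how the pieces of this footnote fit together, here is a simplified Python model of the checks described above. It is a sketch, not Virtuoso's actual evaluation logic: the `Rule` class and both functions are hypothetical, the agent URIs are placeholders, and it assumes that rules in a scope that has not been enabled grant nothing.

```python
from dataclasses import dataclass

ANYONE = "foaf:Agent"  # stands for a rule using acl:agentClass foaf:Agent


@dataclass(frozen=True)
class Rule:
    scope: str      # e.g. "Query" or "PrivateGraphs"
    resource: str   # the acl:accessTo target
    mode: str       # e.g. "Read"
    agent: str      # an agent URI, or ANYONE


def allows(rules, enabled_scopes, scope, resource, mode, agent):
    """True if some rule in an enabled scope grants `mode` on `resource`."""
    if scope not in enabled_scopes:
        return False  # assumption: rules in a disabled scope have no effect
    return any(
        rule.scope == scope
        and rule.resource == resource
        and rule.mode == mode
        and rule.agent in (agent, ANYONE)
        for rule in rules
    )


def can_query_private_graph(rules, enabled_scopes, graph, agent):
    """Needs generic SPARQL read access plus read access to the graph."""
    return (
        allows(rules, enabled_scopes, "Query",
               "urn:virtuoso:access:sparql", "Read", agent)
        and allows(rules, enabled_scopes, "PrivateGraphs", graph, "Read", agent)
    )


rules = [
    Rule("Query", "urn:virtuoso:access:sparql", "Read", ANYONE),
    Rule("PrivateGraphs", "urn:trueg:demo", "Read", "urn:example:linkedin-user"),
]
enabled = {"Query", "PrivateGraphs"}
```

In this model `urn:example:linkedin-user` can query `urn:trueg:demo` only when both scopes are enabled and both rules exist; dropping either rule, or leaving the scopes disabled, denies access.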

I will explain these technical concepts in more detail in another article.

Posted at 00:33

Sebastian Trueg: Sharing Files With Whomever Is Simple

Dropbox, Google Drive, OneDrive: they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. At first glance this might be easy enough, but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?

I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.

Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:

Virtuoso DAV Browser

This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.

We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.

For this small demo I will share a file with my LinkedIn identity. (Virtuoso/VAL identifies people via URIs and thus has schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)

Virtuoso Share File

Now when I log out and try to access the file in question I am presented with the authentication dialog from VAL:

VAL Authentication Dialog

This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn, which results in an OAuth handshake, after which read access to the file is granted:

LinkedIn OAuth Handshake


Access to file granted

It is that simple. Of course these identifiers can also be used in groups, which allows you to share files and folders with a set of people instead of just one individual.
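The difference from public links can be made concrete with a small Python sketch. Both classes are hypothetical illustrations, not VAL code, and the `urn:example:...` identifiers stand in for real service URIs. The point is only that a link is a single shared secret, while an identity-based share keeps one entry per person or group.

```python
import secrets


class PublicLinkShare:
    """Anyone holding the link token can read; revocation hits everyone."""

    def __init__(self):
        self.token = secrets.token_urlsafe(16)

    def can_read(self, presented_token):
        return secrets.compare_digest(presented_token, self.token)

    def revoke(self):
        # The only way to lock one person out is to replace the link for all.
        self.token = secrets.token_urlsafe(16)


class IdentityShare:
    """Access is tied to identifiers, so it can be revoked per person."""

    def __init__(self):
        self.readers = set()   # individual agent URIs
        self.groups = {}       # group name -> set of agent URIs

    def share_with(self, agent_uri):
        self.readers.add(agent_uri)

    def share_with_group(self, name, members):
        self.groups[name] = set(members)

    def revoke(self, agent_uri):
        self.readers.discard(agent_uri)

    def can_read(self, agent_uri):
        in_group = any(agent_uri in members for members in self.groups.values())
        return agent_uri in self.readers or in_group


share = IdentityShare()
share.share_with("urn:example:alice")
share.share_with("urn:example:bob")
share.share_with_group("team", ["urn:example:carol", "urn:example:dave"])
share.revoke("urn:example:bob")  # alice, carol and dave keep their access
```

After the last line only `urn:example:bob` has lost access, which is exactly what a public link cannot express.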

Next up: Sharing Named Graphs via VAL.

Posted at 00:33

