Planet RDF

It's triples all the way down

August 25

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings as well as Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom found.

There are different reasons why one might want to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter the motivation, the question is: do systems support this (transparently), or are developers forced to handle it in the application logic?

At the IaaS level, especially for file storage used in app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and for Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft SkyDrive seem not to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support, it also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or Cloudfogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like support encryption, such as eCryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.


Posted at 03:06

August 24

Benjamin Nowack: Want to hire me?

I have been happily working as a self-employed semantic web developer for the last seven years. With steady progress, I dare to say, but the market is still evolving a little bit too slowly for me (well, at least here in Germany) and I can't invest any longer. So I am looking for new challenges and an employer who would like to utilize my web technology experience (semantic or not). I have created a new personal online profile with detailed information about me, my skills, and my work.

My dream job would be in the social and/or data web area, I'm particularly interested in front-end development for data-centric or stream-oriented environments. I also love implementing technical specifications (probably some gene defect).

The potential show-stopper: I can't really relocate, for private reasons. I am happy to (tele)commute or travel, though. And I am looking for full-time employment (or a full-time, longer-term contract). I am already applying for jobs, mainly here in Düsseldorf so far, but I thought I'd send out this post as well. You never know :)

Posted at 10:10

August 12

Gregory Williams: SPARQL* for Wikidata

I recently asked Olaf Hartig on twitter if he was aware of anyone using RDF* or SPARQL* for modeling qualified statements in Wikidata. These qualified statements are a feature of Wikidata that allow a statement such as “the speed limit in Germany is 100 km/h” to be qualified as applying only to “paved road outside of settlements.” Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph by Malyshev, et al. published last year at ISWC 2018 helps to visualize this data:

Visualization of Wikidata qualified statements

Although Olaf wasn’t aware of any work in this direction, I decided to look a bit into what the SPARQL* syntax might look like for Wikidata queries. Continuing with the speed limit example, we can query for German speed limits, and their qualifications:

SELECT ?speed ?qualifierLabel WHERE {
    wd:Q183
        wdt:P3086 ?speed ;
        p:P3086 [
            ps:P3086 ?speed ;
            pq:P3005 ?qualifier ;
        ] .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

This acts much like an RDF reification query. Using SPARQL* syntax to represent the same query, I ended up with:

SELECT ?speed ?qualifierLabel WHERE {
    << wd:Q183 wdt:P3086 ?speed >>
        pq:P3005 ?qualifier ;
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

This strikes me as a more appealing syntax for querying qualification statements, without requiring the repetition and understanding of the connection between wdt:P3086 and p:P3086. However, that repetition of “P3086” would still be required to access the quantityUnit and normalized values via the psi: and psn: predicate namespaces. I’m not familiar enough with the history of Wikidata to know why RDF reification wasn’t used in the modeling, but I think this shows that there are opportunities for improving the UX of the query interface (and possibly the RDF data model, especially if RDF* sees more widespread adoption in the future).

With minimal changes to my swift SPARQL parser, I made a proof-of-concept translator from Wikidata queries using SPARQL* syntax to standard SPARQL. It’s available in the sparql-star-wikidata branch, and as a docker image (kasei/swift-sparql-syntax:sparql-star-wikidata):

$ docker pull kasei/swift-sparql-syntax:sparql-star-wikidata
$ docker run -t kasei/swift-sparql-syntax:sparql-star-wikidata sparql-parser wikidata -S 'SELECT ?speed ?qualifierLabel WHERE { << wd:Q183 wdt:P3086 ?speed >> pq:P3005 ?qualifier ; SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }'
SELECT ?speed ?qualifierLabel WHERE {
    _:_blank.b1 <http://www.wikidata.org/prop/statement/P3086> ?speed .
    <http://www.wikidata.org/entity/Q183> <http://www.wikidata.org/prop/P3086> _:_blank.b1 .
    <http://www.wikidata.org/entity/Q183> <http://www.wikidata.org/prop/direct/P3086> ?speed .
    _:_blank.b1 <http://www.wikidata.org/prop/qualifier/P3005> ?qualifier .
    SERVICE <http://wikiba.se/ontology#label>
    {
        <http://www.bigdata.com/rdf#serviceParam> <http://wikiba.se/ontology#language> "en" .
    }
}
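
Since the translated output is plain SPARQL 1.1, it can be run against the public Wikidata Query Service with any client. Here is a minimal Python sketch using SPARQLWrapper (my choice of client, not something from the post), running the standard form of the speed limit query from earlier; WDQS pre-defines the wd:, wdt:, p:, ps:, pq:, wikibase: and bd: prefixes, so they can be omitted:

# Minimal sketch: run the standard-SPARQL speed limit query against the
# public Wikidata Query Service. SPARQLWrapper is an assumption on my part;
# the post itself only deals with parsing and translation.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
SELECT ?speed ?qualifierLabel WHERE {
    wd:Q183
        wdt:P3086 ?speed ;
        p:P3086 [
            ps:P3086 ?speed ;
            pq:P3005 ?qualifier ;
        ] .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="sparql-star-example/0.1")  # WDQS expects a user agent string
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["speed"]["value"], "-", row["qualifierLabel"]["value"])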

Posted at 18:06

August 10

Benjamin Nowack: I'm joining Talis!

KASABI data marketplace

I received a number of very interesting job offers when I began searching for something new last month, but there was one company that stood out, and that is Talis. Not only do I know many people there already, I also find Talis' new strategic focus and products very promising. In addition, they know and use some of my tools already, and I've successfully worked on Talis projects with Leigh and Keith before. The job interview almost felt like coming home (and the new office is just great).

So I'm very happy to say that I'm going to become part of the Kasabi data marketplace team in September where I'll help create and drupalise data management and data market tools.

BeeNode

I will have to get up to speed with a lot of new things, and the legal and travel cost overhead for Talis is significant, so I hope I can turn this into a smart investment for them as quickly as possible. I'll even rename my blog if necessary... ;-) For those wondering about the future of my other projects, I'll write about them in a separate post soon.

Can't wait to start!

Posted at 09:06

Benjamin Nowack: Semantic WYSIWYG in-place editing with Swipe

Several months ago (ugh, time flies) I posted a screencast demo'ing a semantic HTML editor. Back then I used a combination of client-side and server-side components, which I have to admit led to quite a number of unnecessary server round-trips.

In the meantime, others have shown that powerful client-side editors can be implemented on top of HTML5, and so I've now rewritten the whole thing and turned it into a pure JavaScript tool as well. It now supports inline WYSIWYG editing and HTML5 Microdata annotations.

The code is still at beta stage, but today I put up an early demo website which I'll use as a sandbox. The editor is called Swipe (like the dance move, but it's an acronym, too). What makes Swipe special is its ability to detect the caret coordinates even when the cursor is inside a text node, which is usually not possible with W3C range objects. This little difference enables several new possibilities, like precise in-place annotations or "linked-data-as-you-type" functionality for user-friendly entity suggestions. More to come soon...

Swipe - Semantic WYSIWYG in-place editor

Posted at 09:06

Benjamin Nowack: Dynamic Semantic Publishing for any Blog (Part 2: Linked ReadWriteWeb)

The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", where I wondered if it could be applied to basically any weblog.

During the last days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have any spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.



In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Trice Bot Console

Here is a quick re-cap of the proposed dynamic semantic publishing process, followed by a detailed description of the individual components:
  • Index and monitor the archives pages, build a registry of post URLs.
  • Load and parse posts into raw structures (title, author, content, ...).
  • Extract named entities from each post's main content section.
  • Build a site-optimized schema (an "ontology") from the data structures generated so far.
  • Align the extracted data structures with the target ontology.
  • Re-purpose the final dataset (widgets, entity hubs, semantic ads, authoring tools)

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

Archives triples via SPARQL

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.
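
For illustration, here is a minimal Python sketch of the indexing idea; the real bots were PHP running on Trice and stored their results as triples in the ARC store, and the archive page address below is a made-up placeholder:

# Sketch of an archives indexer/monitor: fetch a by-month archive page and
# collect post URLs matching the "YYYY/MM" pattern into a registry.
import re
import requests

ARCHIVES_URL = "https://readwriteweb.example/archives"  # hypothetical address
POST_URL_PATTERN = re.compile(r'href="([^"]*/\d{4}/\d{2}/[^"]+)"')

def index_archive_page(url, registry):
    html = requests.get(url, timeout=30).text
    for post_url in POST_URL_PATTERN.findall(html):
        registry.add(post_url)

registry = set()
index_archive_page(ARCHIVES_URL, registry)  # the monitor bot re-runs this every 10 minutes
print(len(registry), "post URLs known")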

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
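
To make the DOM-path idea concrete, here is a small Python sketch using BeautifulSoup; the CSS selectors are hypothetical placeholders for whatever the blog template actually uses (the original extractor was PHP and site-specific):

# Sketch of a template-driven post parser: pull out title, author, date,
# tags and body text using DOM paths. The selectors are made up; each blog
# template needs its own set.
import requests
from bs4 import BeautifulSoup

def parse_post(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {
        "url": url,
        "title": soup.select_one("h1.entry-title").get_text(strip=True),
        "author": soup.select_one(".byline a.author").get_text(strip=True),
        "published": soup.select_one(".post-meta time")["datetime"],
        "tags": [a.get_text(strip=True) for a in soup.select(".tags a")],
        "content": soup.select_one("div.entry-content").get_text(" ", strip=True),
    }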

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Raw post structures

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

RWW through Paggr Prospect

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can alternatively do the job, too:

RWW entity types

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

RWW ontology

Aligning the data with the target ontology

In this step, we again use a software agent and break things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data (a sketch of one such update follows below):
  • Consolidate author aliases ("richard-macmanus-1 = richard-macmanus-2" etc.).
  • Normalize author tags, Zemanta tags, OpenCalais tags, and OpenCalais "industry terms" to a single "tag" field.
  • Consolidate the various type identifiers into canonical ones.
  • For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.
  • Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).

Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful. Each optimization step adds to the network effects and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is just meant to be a proof of concept.
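
As a concrete illustration, here is a sketch of one such update: folding one of the tag vocabularies into a single canonical tag property, posted over the standard SPARQL 1.1 protocol. The endpoint URL and property URIs are assumptions, not the ones used in the demo:

# Sketch: normalise Zemanta-style tags to a single canonical tag property
# via a SPARQL 1.1 Update request. URIs and endpoint are placeholders.
import requests

UPDATE_ENDPOINT = "http://localhost:8080/sparql"  # hypothetical SPARQL endpoint

update = """
PREFIX z:   <http://s.zemanta.com/ns#>
PREFIX rww: <http://example.org/rww-schema#>

DELETE { ?post z:tag ?tag }
INSERT { ?post rww:tag ?tag }
WHERE  { ?post z:tag ?tag }
"""

resp = requests.post(UPDATE_ENDPOINT, data=update,
                     headers={"Content-Type": "application/sparql-update"})
resp.raise_for_status()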

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Dynamic RWW Entity Hub

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the ones described in this post, please get in touch.

Posted at 09:06

August 01

AKSW Group - University of Leipzig: DBpedia Day @ SEMANTiCS 2019

We are happy to announce that SEMANTiCS 2019 will host the 14th DBpedia Community Meeting on the last day of the conference, September 12, 2019.

 

 

Highlights/Sessions

  • Keynote #1: Katja Hose, Aalborg University, Denmark
  • Keynote #2: Dan Weitzner from WPSemantix
  • DBpedia Databus presentation and training session
  • DBpedia Association hour
  • DBpedia Showcase session
  • DBpedia Chapter session

Call for Contribution

Tell us what cool things you do with DBpedia:  Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.

Quick Facts

  • Web URL: https://wiki.dbpedia.org/events/14th-dbpedia-community-meeting-karlsruhe
  • When: September 12th, 2019
  • Where: Leibniz-Institut für Informationsinfrastruktur – FIZ Karlsruhe, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
  • Call for Contribution: Submit your proposal in our form
  • Registration: Attending the DBpedia Community meeting costs 90 €. You can buy your ticket on the SEMANTiCS website. DBpedia members get free admission. Please contact your nearest DBpedia chapter for a promotion code, or please contact the DBpedia Association.

Sponsors and Acknowledgments

In case you want to sponsor the 14th DBpedia Community Meeting, please contact the DBpedia Association via dbpedia@infai.org.

Organisation

  • Tina Schmeissner, DBpedia Association
  • Sandra Prätor, AKSW/KILT, DBpedia Association
  • Sebastian Hellmann, AKSW/KILT, DBpedia Association

We are looking forward to meeting you in Karlsruhe!

Your DBpedia Association

Posted at 09:05

July 24

Leigh Dodds: How do data publishing choices shape data ecosystems?

This is the latest in a series of posts in which I explore some basic questions about data.

In our work at the ODI we have often been asked for advice about how best to publish data. When trying to give helpful advice, one thing I’m always mindful of is how decisions about how data is published shape the ways in which value can be created from it. More specifically, whether those choices will enable the creation of a rich data ecosystem of intermediaries and users.

So what are the types of decisions that might help to shape data ecosystems?

To give a simple example, if I publish a dataset so it’s available as a bulk download, then you could use that data in any kind of application. You could also use it to create a service that helps other people create value from the same data, e.g. by providing an API or an interface to generate reports from the data. Publishing in bulk allows intermediaries to help create a richer data ecosystem. But if I’d just published that same data via an API, then there are limited ways in which intermediaries can add value. Instead people must come directly to my API or services to use the data.

This is one of the reasons why people prefer open data to be available in bulk. It allows for more choice and flexibility in how it is used. But, as I noted in a recent post, depending on the “dataset archetype” your publishing options might be limited.

The decision to only publish a dataset as an API, even if it could be published in other ways, is often deliberate. The publisher may want to capture more of the value around the dataset, e.g. by charging for the use of an API. Or they may feel it is important to have more direct control over who uses it, and how. These are reasonable choices and, when the data is sensitive, sensible options.

But there are a variety of ways in which the choices that are made about how to publish data can shape or constrain the ecosystem around a specific dataset. It’s not just about bulk downloads versus APIs.

The choices include:

  • the licence that is applied to the data, which might limit it to non-commercial use. Or restrict redistribution. Or impose limits on the use of derived data
  • the terms and conditions for the API or other service that provides access to the data. These terms are often conflated with data licences, but typically focus on aspects of service provision, for example rate limiting, restrictions on storage of API results, permitted uses of the API, permitted types of users, etc
  • the technology used to provide access to data. In addition to bulk downloads vs APIs, there are also details such as the use of specific standards, the types of API call that are possible, etc
  • the governance around the API or service that provides access to data, which might limit which users can get access to the service or create friction that discourages use
  • the business model that is wrapped around the API or service, which might include a freemium model, chargeable usage tiers, service level agreements, usage limits, etc

I think these cover the main areas. Let me know if you think I’ve missed something.

You’ll notice that APIs and services provide more choices for how a publisher might control usage. This can be a good or a bad thing.

The range of choices also means it’s very easy to create a situation where an API or service doesn’t work well for some use cases. This is why user research and engagement is such an important part of releasing a data product and designing policy interventions that aim to increase access to data.

For example, let’s imagine someone has published an openly licensed dataset via an API that restricts users to a maximum number of API calls per month.

These choices limit some uses of the API, e.g. applications that need to make lots of queries. They also mean that downstream users creating web applications are unable to provide a good quality of service to their own users. A popular application might just stop working at some point over the course of the month because it has hit the usage threshold.

The dataset might be technically open, but practically its use has been constrained by other choices.

Those choices might have been made for good reasons. For example as a way for the data publisher to be able to predict how much they need to invest each month in providing a free service, that is accessible to lots of users making a smaller number of requests. There is inevitably a trade-off between the needs of individual users and the publisher.

Adding on a commercial usage tier for high volume users might provide a way for the publisher to recoup costs. It also allows some users to choose what to pay for their use of the API, e.g. to more smoothly handle unexpected peaks in their website traffic. But it may sometimes be simpler to provide the data in bulk to support those use cases. Different use cases might be better served by different publishing options.

Another example might be a system that provides access to both shared and open data via a set of APIs that conform to open standards. If the publisher makes it too difficult for users to actually sign up to use those APIs, e.g. because of difficult registration or certification requirements, then only those organisations that can afford to invest the time and money to gain access might bother using them. The end result might be a closed ecosystem that is built on open foundations.

I think it’s important to understand how this range of choices can impact data ecosystems. They’re important not just for how we design products and services, but also in helping to design successful policies and regulatory interventions. If we don’t consider the full range of choices, then we may not achieve the intended outcomes.

More generally, I think it’s important to think about the ecosystems of data use. Often I don’t think enough attention is paid to the variety of ways in which value is created. This can lead to poor choices, like choosing to try and sell data for short-term gain rather than considering the variety of ways in which value might be created in a more open ecosystem.

Posted at 21:05

July 20

Libby Miller: Real_libby – a GPT-2 based slackbot

In the latest of my continuing attempts to automate myself, I retrained a GPT-2 model with my iMessages, and made a slackbot so people could talk to it. Since Barney (an expert on these matters) felt it was unethical that it vanished whenever I shut my laptop, it’s now living happily(?) if a little more slowly in a Raspberry Pi 4.

Screen Shot 2019-07-20 at 12.19.24

It was surprisingly easy to do, with a few hints from Barney. I’ve sketched out what I did below. If you make one, remember that it can leak out private information – names in particular – and can also be pretty sweary, though mine’s not said anything outright offensive (yet).

fuck, mitzhelaists!

This work is inspired by the many brilliant Twitter bot-makers  and machine-learning people out there such as Barney, (who has many bots, including inspire_ration and notYourBot, and knows much more about machine learning and bots than I do), Shardcore (who made Algohiggs, which is probably where I got the idea for using GPT-2),  and Janelle Shane, (whose ML-generated names for e.g. cats are always an inspiration).

First, get your data

The first step was to get at my iMessages. A lot of iPhone data is backed up as sqlite, so if you decrypt your backups and have a dig round, you can use something like baskup. I had to make a few changes but found my data in

/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/3d/3d0d7e5fb2ce288813306e4d4636395e047a3d28

This number – 3d0d7e5fb2ce288813306e4d4636395e047a3d28 – seems always to indicate the iMessage database – though it moves round depending on what version of iOS you have. I made a script to write the output from baskup into a flat text file for GPT-2 to slurp up. I had about 5K lines.
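
The script itself was trivial; here is a sketch of the kind of thing, assuming baskup has written the conversations out as plain-text files (adjust the glob to whatever layout your baskup version produces):

# Sketch: flatten baskup's per-conversation text output into one file of
# message lines for GPT-2 to train on. The directory layout is an assumption.
import glob

with open("data/imessages.txt", "w") as out:
    for path in sorted(glob.glob("baskup-output/**/*.txt", recursive=True)):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    out.write(line + "\n")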

Retrain GPT-2

I used this code.

python3 ./download_model.py 117M

PYTHONPATH=src ./train.py --dataset /Users/[me]/gpt-2/scripts/data/

I left it overnight on my laptop and by morning loss and avg were oscillating so I figured it was done – 3600 epochs. The output from training was fun, e.g.:

([2899 | 33552.87] loss=0.10 avg=0.07)

my pigeons get dandruff
treehouse actually get little pellets
little pellets of the same stuff as well, which I can stuff pigeons with
*little
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets

Test it

I copied the checkpoint directory into the models directory

cp -r checkpoint/run1 models/libby

At which point I could test it using the code provided:

python3 src/interactive_conditional_samples.py --model_name libby

This worked but spewed out a lot of text, very slowly. Adding --length 20 sped it up:

python3 src/interactive_conditional_samples.py --model_name libby --length 20

Screen Shot 2019-07-20 at 13.05.06

That was the bulk of it done! I turned interactive_conditional_samples.py into a server and then whipped up a slackbot – it responds to direct questions and occasionally to a random message.
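
The slackbot itself is only a few lines. Here is a rough sketch using the slackclient (v2) RTM API, assuming the model server is reachable over HTTP; the model URL, its /generate endpoint, the token and the bot user ID are all placeholders:

# Rough sketch of the slackbot: reply to direct mentions, and occasionally to
# random messages, by asking a local HTTP wrapper around the retrained model.
import random
import requests
import slack  # pip3 install slackclient (2.x)

BOT_TOKEN = "xoxb-your-token-here"
BOT_USER_ID = "U0REALLIBBY"                   # placeholder
MODEL_URL = "http://localhost:8000/generate"  # hypothetical model server

@slack.RTMClient.run_on(event="message")
def reply(**payload):
    data, web_client = payload["data"], payload["web_client"]
    text = data.get("text", "")
    if not text or data.get("bot_id"):
        return  # ignore empty messages and other bots
    if f"<@{BOT_USER_ID}>" in text or random.random() < 0.05:
        resp = requests.post(MODEL_URL, json={"prompt": text}, timeout=60)
        web_client.chat_postMessage(channel=data["channel"],
                                    text=resp.json().get("text", "..."))

slack.RTMClient(token=BOT_TOKEN).start()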

Putting it on a Raspberry Pi 4 was very very easy. Startlingly so.

Screen Shot 2019-07-20 at 13.11.10

It’s been an interesting exercise, and mostly very funny. These bots have the capacity to surprise you and come up with the occasional apt response (I’m cherrypicking)

Screen Shot 2019-07-20 at 14.25.00

We’ve been talking a lot at work about personal data and what we would do with our own, particularly messages with friends and the pleasure of scrolling back and finding old jokes and funny messages. My messages were mostly of the “could you get some milk?” “here’s a funny picture of the cat” type, but it covered a long period and there were also two very sad events in there. Parsing the data and coming across those again was a vivid reminder that this kind of personal data can be an emotional minefield and not something to be trivially messed with by idiots like me.

Also: while GPT-2 means there’s plausible deniability about any utterance, a bot like this can leak personal information of various kinds, such as names and regurgitated fragments of real messages. Unsurprisingly it’s not the kind of thing I’d be happy making public as is, and I’m not sure if it ever could be.

 

 

Posted at 19:10

Libby Miller: An i2c heat sensor with a Raspberry Pi camera

I had a bit of a struggle with this so thought it was worth documenting. The problem is this – the i2c bus on the Raspberry Pi is used by the official camera to initialise it. So if you want to use an i2c device at the same time as the camera, the device will stop working after a few minutes. Here’s more on this problem.

I really wanted to use this heatsensor with mynaturewatch to see if we could exclude some of the problems with false positives (trees waving in the breeze and similar). I’ve not got it working well enough yet to look at this problem in detail. But I did get it working on the i2c bus alongside the camera – here’s how.

Screen Shot 2019-03-22 at 12.31.04

It’s pretty straightforward. You need to

  • Create a new i2c bus on some different GPIOs
  • Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one
  • Fin

1. Create a new i2c bus on some different GPIOs

This is super-easy:

sudo nano /boot/config.txt

Add the following line, preferably in the section where SPI and I2C are enabled.

dtoverlay=i2c-gpio,bus=3,i2c_gpio_delay_us=1

This line will create an additional i2c bus (bus 3) with GPIO 23 as SDA and GPIO 24 as SCL (GPIO 23 and 24 are the overlay's defaults).

2. Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one

I am using this sensor, for which I need this circuitpython library (more info), installed using:

pip3 install Adafruit_CircuitPython_AMG88xx

While the Pi is switched off, plug in the i2c device using GPIO 23 for SDA and GPIO 24 for SCL, then boot it up and check it’s working:

 i2cdetect -y 3

Make two changes:

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/pin.py

and change the SDA and SCL pins to the new pins

#SDA = Pin(2)
#SCL = Pin(3)
SDA = Pin(23)
SCL = Pin(24)

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/generic_linux/i2c.py

Change line 21 or thereabouts to use the i2c bus 3 rather than the default, 1:

self._i2c_bus = smbus.SMBus(3)

3. Fin

Start up your camera code and your i2c peripheral. They should run happily together.
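
For reference, here is a sketch of what "camera code plus i2c peripheral" can look like with this sensor once the pin edits above are in place, reading the 8x8 thermal grid while the camera previews:

# Sketch: read the AMG88xx thermal grid while the camera is running.
# board.SCL / board.SDA now resolve to GPIO 24 / GPIO 23 thanks to the edits above.
import time
import board
import busio
import adafruit_amg88xx
import picamera

i2c = busio.I2C(board.SCL, board.SDA)
sensor = adafruit_amg88xx.AMG88XX(i2c)

with picamera.PiCamera() as camera:
    camera.start_preview()
    for _ in range(10):
        # sensor.pixels is an 8x8 grid of temperatures in degrees Celsius
        for row in sensor.pixels:
            print(" ".join("{0:4.1f}".format(t) for t in row))
        print("--------")
        time.sleep(1)
    camera.stop_preview()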

Screen Shot 2019-03-25 at 19.12.21

Posted at 14:10

Libby Miller: Balena’s wifi-connect – easy wifi for Raspberry Pis

When you move a Raspberry Pi between wifi networks and you want it to behave like an appliance, one way to let a user (rather than a developer) set the wifi network easily is to have it create an access point itself. You connect to that with a phone or laptop, enter the wifi information in a browser, and the Pi then reconnects to the proper network. Balena have a video explaining the idea.

Andrew Nicolaou has written things to do this periodically as part of Radiodan. His most recent suggestion was to try Resin (now Balena)’s wifi-connect. Since Andrew last tried, there’s a bash script from Balena to install it as well as a Docker file, so it’s super easy with just a few tiny pieces missing. This is what I did to get it working:

Provision an SD card with Stretch e.g. using Etcher or manually

Enable ssh e.g. by

touch /Volumes/boot/ssh

Share your network with the pi via ethernet, ssh in and enable wifi by setting your country:

sudo raspi-config

then Localisation Options -> Set wifi country.

Install wifi-connect

bash <(curl -L https://github.com/balena-io/wifi-connect/raw/master/scripts/raspbian-install.sh)

Add a slightly-edited version of their bash script

curl https://gist.githubusercontent.com/libbymiller/e8fe6821e122e0a0ac921c8e557320a9/raw/46138fb4d28b494728e66515e46bd7d736b19132/start.sh > /home/pi/start-wifi-connect.sh

Add a systemd script to start it on boot.

sudo nano /lib/systemd/system/wifi-connect-start.service

-> contents:

[Unit]
Description=Balena wifi connect service
After=NetworkManager.service

[Service]
Type=simple
ExecStart=/home/pi/start-wifi-connect.sh
Restart=on-failure
StandardOutput=syslog
SyslogIdentifier=wifi-connect
Type=idle
User=root

[Install]
WantedBy=multi-user.target

Enable the systemd service

sudo systemctl enable wifi-connect-start.service

Reboot the pi

sudo reboot

A wifi network should come up called “Wifi Connect”. Connect to it, add in your details into the captive portal, and wait. The portal will go away and then you should be able to ping your pi over the wifi:

ping raspberrypi.local

(You might need to disconnect your ethernet from the Pi before connecting to the Wifi Connect network if you were sharing network that way).

Posted at 14:10

Libby Miller: Cat detector with Tensorflow on a Raspberry Pi 3B+

Like this
Download Raspbian Stretch with Desktop

Burn a card with Etcher.

(Assuming a Mac) Enable ssh

touch /Volumes/boot/ssh

Put a wifi password in

nano /Volumes/boot/wpa_supplicant.conf
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="foo"
  psk="bar"
}

Connect the Pi camera, attach a dial to GPIO pin 12 and ground, boot up the Pi, ssh in, then

sudo apt-get update
sudo apt-get upgrade
sudo raspi-config # and enable camera; reboot

install tensorflow

sudo apt install python3-dev python3-pip
sudo apt install libatlas-base-dev
pip3 install --user --upgrade tensorflow

Test it

python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

get imagenet

git clone https://github.com/tensorflow/models.git
cd ~/models/tutorials/image/imagenet
python3 classify_image.py

install openCV

pip3 install opencv-python
sudo apt-get install libjasper-dev
sudo apt-get install libqtgui4
sudo apt install libqt4-test
python3 -c 'import cv2; print(cv2.__version__)'

install the pieces for talking to the camera

cd ~/models/tutorials/image/imagenet
pip3 install imutils picamera
mkdir results

download an edited version of classify_image

curl -O https://gist.githubusercontent.com/libbymiller/d542d596566774a35752d134f80b1332/raw/471f066e4dc498501bab7731a07fa0c1926c1575/classify_image_dial.py

Run it, and point at a cat

python3 classify_image_dial.py

Posted at 14:10

Libby Miller: Etching on a laser cutter

I’ve been struggling with this for ages, but yesterday at Hackspace – thanks to Barney (and I now realise, Tiff said this too and I got distracted and never followed it up) – I got it to work.

The issue was this: I’d been assuming that everything you lasercut had to be a vector DXF, so I was tracing bitmaps using Inkscape in order to make a suitable SVG, converting to DXF, loading it into the lasercut software at hackspace, downloading it and – boom – “the polyline must be closed” for etching: no workie. No matter what I did to the export in Inkscape or how I edited it, it just didn’t work.

The solution is simply to use a black and white png, with a non-transparent background. This loads directly into lasercut (which comes with JustAddSharks lasers) and…just…works.

As a bonus and for my own reference – I got good results with 300 speed / 30 power (below 30 didn’t seem to work) for etching (3mm acrylic).

 

Posted at 14:10

Libby Miller: Simulating crap networks on a Raspberry Pi

I’ve been having trouble with libbybot (my Raspberry Pi / lamp based presence robot) in some locations. I suspect this is because the Raspberry Pi 3’s inbuilt wifi antenna isn’t as strong as that in, say a laptop, so wifi problems that go unnoticed most of the time are much more obvious.

The symptoms are these:

  • Happily listening / watching remotely
  • Stream dies
  • I get a re-notification that libbybot is online, but can’t connect to it properly

My hypothesis is that the Raspberry Pi is briefly losing wifi connectivity, Chromium auto-reconnects, but the webRTC stream doesn’t re-initiate.

Anyway, the first step to mitigating the problem was to try and emulate it. There were a couple of ways I could have gone about this. One was use network shaping tools on my laptop to try and emulate the problems by messing with the receiving end. A more realistic way would be to shape the traffic on the Pi itself, as that’s where the problem is occurring.

Searching for network shaping tools – and specifically for dropped packets and network latency – led me to dummynet, the FreeBSD traffic shaper that is driven via the ipfw firewall. However, this is tightly coupled to the kernel and doesn’t seem suitable for the Raspberry Pi.

On the laptop, there is a tool for network traffic shaping on Mac OS – it used to be ipfw, but since 10.10 (details) it’s been an app called network link conditioner, available as part of Mac OS X developer tools.

Before going through the xcode palaver for something that wasn’t really what I wanted, I had one last dig for an easier way, and indeed there is: wondershaper led me to using tc to limit the bandwidth which in turn led to iptables for dropped packets.

But. None of these led to the behaviour that I wanted; in fact libbybot (which uses RTCMulticonnection for webRTC) worked perfectly under most conditions I could simulate. The same when using tc with Netem, which can emulate network-wide delays – all fine.

Finally I twigged that the problem was probably a several-second network outage, and for that you can use iptables again. In this case using it to stop the web page (which runs on port 8443) being accessed from the Pi. Using this I managed to emulate the symptoms I’d been seeing.

Here are a few of the commands I used, for future reference.

The final, useful command: emulate a dropped network on a specific port for 20 seconds using iptables output command:

#!/bin/bash
echo "stopping external to 8443"
iptables -A OUTPUT -p tcp --dport 8443 -j DROP
sleep 20
echo "restarting external to 8443"
iptables -D OUTPUT -p tcp --dport 8443 -j DROP

Other things I tried: drop 30% of (input or output) packets randomly, using iptables' statistic module

sudo iptables -A INPUT -m statistic --mode random --probability 0.30 -j DROP

sudo iptables -A OUTPUT -m statistic --mode random --probability 0.30 -j DROP

list current iptables rules

iptables -L

clear all (flush)

iptables -F

Delay all packets by 100ms using tc and netem

sudo tc qdisc add dev wlan0 root netem delay 100ms

change that to 2000ms

sudo tc qdisc change dev wlan0 root netem delay 2000ms 10ms 25%

All the tc rules go away when you reboot.

Context and links:

tc and netem: openWRT: Netem (Network emulator)

iptables: Using iptables to simulate service interruptions by Matt Parsons, and The Beginner’s guide to iptables, the linux firewall

 

Posted at 14:10

Libby Miller: Neue podcast in a box, part 1

Ages ago I wrote a post on how to create a physical podcast player (“podcast in a box”) using Radiodan. Since then, we’ve completely rewritten the software, so those instructions can be much improved and simplified. Here’s a revised technique, which will get you as far as reading an RFID card. I might write a part 2, depending on how much time I have.

You’ll need:

  • A Pi 3B or 3B+
  • An 8GB or larger class 10 microSD card
  • A cheapo USB soundcard (e.g.)
  • A speaker with a 3.5mm jack
  • A power supply for the Pi
  • An MFRC522 RFID reader
  • A laptop and microSD card reader / writer

The idea of Radiodan is that as much as possible happens inside web pages. A server runs on the Pi. One webpage is opened headlessly on the Pi itself (internal.html) – this page will play the audio; another can be opened on another machine to act as a remote control (external.html).

They are connected using websockets, so each can access the same messages – the RFID service talks to the underlying peripheral on the Pi, making the data from the reader available.

Here’s what you need to do:

1. Set up the Pi as per these instructions (“setting up your Pi”)

You need to burn a microSD card with the latest Raspbian with Desktop to act as the Pi’s brain, and the easiest way to do this is with Etcher. Once that’s done, the easiest way to do the rest of the install is over ssh, and the quickest way to get that in place is to edit two files while the card is still in your laptop (I’m assuming a Mac):

Enable ssh by typing:

touch /Volumes/boot/ssh

Add your wifi network to boot by adding a file called

/Volumes/boot/wpa_supplicant.conf

contents: (replace AP_NAME and AP_PASSWORD with your wifi details)

country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="AP_NAME"
  psk="AP_PASSWORD"
  key_mgmt=WPA-PSK
}

Then eject the card, put the card in the Pi, attach all the peripherals except for the RFID reader and switch it on. While on the same wifi network, you should be able to ssh to it like this:

ssh pi@raspberrypi.local

password: raspberry.

Then install the Radiodan software using the provisioning script like this:

curl https://raw.githubusercontent.com/andrewn/neue-radio/master/deployment/provision | sudo bash

2. Enable SPI on the Pi

Don’t reboot yet; type:

sudo raspi-config

Under interfaces, enable SPI, then shut the Pi down

sudo halt

and unplug it.

3. Test Radiodan and configure it

If all is well and you have connected a speaker via a USB soundcard, you should hear it say “hello” as it boots.

Please note: Radiodan does not work with the default 3.5mm jack on the Pi. We’re not sure yet why. But USB soundcards are very cheap, and work well.

There’s one app available by default for Radiodan on the Pi. To use it,

  1. Navigate to http://raspberrypi.local/radio
  2. Use the buttons to play different audio clips. If you can hear things, then it’s all working

 

radiodan_screenshot1

shut the Pi down and unplug it from the mains.

4. Connect up the RFID reader to the Pi

like this

Then start the Pi up again by plugging it in.

5. Add the piab app

Dan has made a very fancy mechanism for using Samba to drag and drop apps to the Pi, so that you can develop on your laptop. However, because we’re using RFID (which only works on the Pi), we may as well do everything on there. So, ssh to it again:

ssh pi@raspberrypi.local
cd /opt/radiodan/rde/apps/
git clone http://github.com/libbymiller/piab

This is currently a very minimal app, which just allows you to see all websocket messages going by, and doesn’t do anything else yet.

6. Enable the RFID service and piab app in the Radiodan web interface

Go to http://raspberrypi.local:5020

Enable “piab”, clicking ‘update’ beneath it. Enable the RFID service, clicking ‘update’ beneath it. Restart the manager (red button) and then install dependencies (green button), all within the web page.

radiodan_screenshot2

radiodan_screenshot4

Reboot the Pi (e.g. ssh in and sudo reboot). This will enable the RFID service.

7. Test the RFID reader

Open http://raspberrypi.local:5000/piab and open developer tools for that page. Place a card on the RFID reader. You should see a json message in the console with the RFID identifier.

radiodan_screenshot5

The rest is a matter of writing javascript / html code to:

  • Associate a podcast feed with an RFID (e.g. a web form in external.html that allows the user to add a podcast feed url)
  • Parse the podcast feed when the appropriate card id is detected by the reader
  • Find the latest episode and play it using internal.html (see the radio app example for how to play audio)
  • Add more fancy options, such as remembering where you were in an episode, stopping when the card is removed etc.

As you develop, you can see the internal page on http://raspberrypi.local:5001 and the external page on http://raspberrypi.local:5000. You can reload the app using the blue button on http://raspberrypi.local:5020.

Many more details about the architecture of Radiodan are available; full installation instructions and instructions for running it on your laptop are here; docs are here; code is in github.

Posted at 14:10

Libby Miller: #Makevember

@chickengrylls’ #makevember manifesto / hashtag has been an excellent experience. I’ve made maybe five nice things and a lot of nonsense, and a lot of useless junk, but that’s fine – I’ve learned a lot, mostly about servos and other motors. There’s been tons of inspiration too (check out these beautiful automata, some characterful paper sculptures, Richard’s unsuitable materials, my initial inspiration’s set of themes on a tape, and loads more). A lovely aspect was all the nice people and beautiful and silly things emerging out of the swamp of Twitter.

Screen Shot 2017-12-01 at 16.31.14

Of my own makes, my favourites were this walking creature, with feet made of crocodile clips (I was amazed it worked); a saw-toothed vertical traveller, such a simple little thing; this fast robot (I was delighted when it actually worked); some silly stilts; and (from October) this blimp / submarine pair.

I did lots of fails too – e.g. a stencil, a raspberry blower. Also lots of partial fails that got scaled back – AutoBez 1, 2, and 3; Earth-moon; a poor-quality under-water camera. And some days I just ran out of inspiration and made something crap.

Why’s it so fun? Well there’s the part about being more observant, looking at materials around you constantly to think about what to make, though that’s faded a little. As I’ve got better I’ve had more successes and when you actually make something that works, that’s amazing. I’ve loved seeing what everyone else is making, however good or less-good, whether they spent ages or five minutes on it. It feels very purposeful too, having something you have to do every day.

Downsides: I’ve spent far too long on some of these. I was very pleased with both Croc Nest, and Morse, but both of them took ages. The house is covered in bits of electronics and things I “might need” despite spending some effort tidying, but clearly not enough (and I need to have things to hand and to eye for inspiration). Oh, and I’m addicted to Twitter again. That’s it really. Small price to pay.

Posted at 14:10

Libby Miller: Capturing button presses from bluetooth hands free kits on a Raspberry Pi

Is there anything better than this wonky and unreliable hack for capturing keypresses from a handsfree kit?

sudo hciconfig hci0 down
sudo hciconfig hci0 up
sudo hcidump -l 1 | grep ACL

As the kit connects, I see in syslog

Sep 27 21:17:10 gvoice bluetoothd[532]: Unable to get connect data for Hands-Free Voice gateway: getpeername: Transport endpoint is not connected (107)

Sep 27 21:17:10 gvoice bluetoothd[532]: Unable to get connect data for Headset Voice gateway: getpeername: Transport endpoint is not connected (107)

I can see it appearing as

Sep 27 21:14:29 gvoice kernel: [  827.342038] input: B8:D5:0B:4C:CF:59 as /devices/virtual/input/input6

evtest gives

sudo evtest
No device specified, trying to scan all of /dev/input/event*
Available devices:
/dev/input/event0: B8:D5:0B:4C:CF:59
Select the device event number [0-0]: 0
[...]
    Event code 402 (KEY_CHANNELUP)
    Event code 403 (KEY_CHANNELDOWN)
    Event code 405 (KEY_LAST)
  Event type 2 (EV_REL)
Key repeat handling:
  Repeat type 20 (EV_REP)
    Repeat code 0 (REP_DELAY)
      Value    300
    Repeat code 1 (REP_PERIOD)
      Value     33
Properties:
Testing ... (interrupt to exit)

but there are never any events.

(I’m asking as I have it nicely hooked up to the Google voice recogniser via AIY. But it needs (and I want) a button press to trigger it. With a bit of twiddling, hands free kits and bluetooth headsets work nicely with Raspbian Stretch.)

Posted at 14:10

Libby Miller: Leaving Flickr

I’m very sad about this, especially because of all the friends I have made on Flickr, but with Verizon’s acquisition of Yahoo (and so Flickr) and the consequent sharing of Flickr user data with the new “Oath” “”””family”””” I feel like it’s time for me to admit just how shit Flickr has become and finally leave. I’ve been using it (and paying for it) for 10 years though, so I’ve a lot of pictures, about 13K in number, about 23G. I’ve got all my data and will delete my account tomorrow (which I think is their deadline, but they seem confused about it).

It’s been a busy week so I don’t know what I’ll replace it with yet, maybe something simple and static I’ll write myself like the thing I had 11 years ago, with some, I dunno, RSS feeds or something. But anyway, here’s the best way I’ve found to get my data back, and kudos to Flickr that the API is still there to make it possible. I tried a few things and flickrmirrorer looks best. It’s straightforward for pictures; some older videos need downloading by hand with it. As far as I can tell it gets all the metadata for each photo. No comments though, and no notes that I can see.

Because of the video issue I did images first, leaving it overnight (forgot to time it)

mkdir pictures
cd pictures/
mkdir flickr
cd flickr/
git clone https://github.com/markdoliner/flickrmirrorer
cd flickrmirrorer/
 sudo easy_install flickrapi
mkdir data
./flickrmirrorer.py --ignore-videos --statistics /Users/libby/personal/pictures/flickr/flickrmirrorer/data/

Output:

New photos / videos: 12921
Deleted photos / videos: 0
Modified photos /videos: 0
Modified albums: 198
Modified collections: 0

Check it matches the volume of data from your stats page (roughly might be all you can hope for; there’s a problem with flickr’s reporting)

du -sh .

check a couple to make sure we’re actually getting some data

open ./data/photostream/72332206.jpg
cat ./data/photostream/72332206.jpg.metadata

Then video:

./flickrmirrorer.py --ignore-photos --statistics /Users/libby/personal/pictures/flickr/flickrmirrorer/data/

downloading about 50 by hand.

I was worried I didn’t have the metadata for some of them, so I hacked together a script that just got all the video metadata – which is here.

I also wanted a list of my favourites – I rolled my own script for that, here. I hardcoded the number of pages, sorry!
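
The favourites script boils down to paging through flickr.favorites.getPublicList with the flickrapi library; a rough sketch (with the hardcoded page count, as confessed, and placeholder API key and user id):

# Sketch: list favourites via the Flickr API. The key, secret, NSID and
# page count are placeholders.
import flickrapi

flickr = flickrapi.FlickrAPI("YOUR_API_KEY", "YOUR_API_SECRET", format="parsed-json")

USER_ID = "12345678@N00"  # your NSID
PAGES = 30                # hardcoded, sorry!

for page in range(1, PAGES + 1):
    resp = flickr.favorites.getPublicList(user_id=USER_ID, per_page=500, page=page)
    for photo in resp["photos"]["photo"]:
        print(photo["id"], photo["owner"], photo.get("title", ""))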

There doesn’t seem to be any way to get notes, which sucks.

To use these two scripts you need to get an api key from flickr here.

I’m really really annoyed about all the cool urls I’ll kill because of this. Oh well.

Update: Matthew in the comments thought that notes were available from this api, and he’s right. flickrmirrorer doesn’t get them (and actually doesn’t get as much metadata as I want) so I grabbed all the ids of my photos using the dump from flickrmirrorer as a starting point:

find . | grep "\.metadata" > list_of_photos.txt

and then use this script to get as much metadata as I can.
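
That metadata script is essentially a loop over flickr.photos.getInfo, which is also where the notes live. A stripped-down sketch (the field names are the ones I'd expect in the parsed-json response; key and secret are placeholders):

# Sketch: fetch notes (and other metadata) for each photo id pulled from the
# flickrmirrorer dump listed in list_of_photos.txt.
import flickrapi

flickr = flickrapi.FlickrAPI("YOUR_API_KEY", "YOUR_API_SECRET", format="parsed-json")

with open("list_of_photos.txt") as f:
    # lines look like ./data/photostream/72332206.jpg.metadata
    photo_ids = [line.strip().rsplit("/", 1)[-1].split(".")[0] for line in f if line.strip()]

for photo_id in photo_ids:
    info = flickr.photos.getInfo(photo_id=photo_id)["photo"]
    for note in info.get("notes", {}).get("note", []):
        print(photo_id, note.get("authorname"), note.get("_content"))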

I also realised I didn’t have a list of my friends 😦 So, I wrote one more script to do that.

Posted at 14:10

July 11

Leigh Dodds: That thing we call “open”

I’ve been involved in a few conversations recently about what “open” or “being open” means in different situations.

As I’ve noted previously, when people say “open” they often mean very different things. And while there may be clear definitions of “open”, people don’t often use the terms correctly. And some phrases like “open API” are still, well, open to interpretation.

In this post I’m going to summarise some of the ways in which I tend to think about making something “open”.

Let me know if I’m missing something so I can plug gaps in my understanding.

Openness of a “thing”

Digital objects: books, documents, images, music, software and datasets can all be open.

Making things open in this sense is the most well documented, but still the most consistently misunderstood. There are clear definitions for open content and data, open source, etc. Open in these contexts provides various freedoms to use, remix, share, etc.

People often confuse something being visible or available to them as being open, but that’s not the same thing at all. Being able to see or read something doesn’t give you any legal permissions at all.

It’s worth noting that the definitions of open “things” in different communities are often overlapping. For example, the Creative Commons licences allow works to be licensed in ways that enable a wide variety of legal reuses. But the Open Definition only recognises a subset of those as being open, rather than shared.

Putting an open licence on something also doesn’t necessarily grant you the full freedom to reuse that thing. For example I could open source some machine learning software but it might only be practically reusable if you can train it on some data that I’ve chosen not to share.

Or I might use an open licence like the Open Government Licence that allows me to put an open licence on something whilst ignoring the existence of any third-party rights. No need to do my homework. Reuser beware.

Openness of a process

Processes can be open. It might be better to think about transparency (e.g. of how the process is running) or the ability to participate in a process in this context.

Anything that changes and evolves over time will have a process by which those changes are identified, agreed, prioritised and applied. We sometimes call that governance. The definition of an open standard includes defining both the openness of the standard (the thing) as well as the process.

Stewardship of a software project, a dataset, or a standard is also an example of where it might be useful for a process to be open. Questions we can ask of open processes are things like:

  • Can I contribute to the main codebase of a software package, rather than just fork it?
  • Can I get involved in the decision making around how a piece of software or standard evolves?
  • Can I directly fix errors in a dataset?
  • Can I see what decisions have been, or are being made that relate to how something is evolving?

When we’re talking about open data or open source, often we’re really talking about openness of the “thing”. But when we’re making things open to make them better, I think we’re often talking about being open to contributions and participation. Which needs something more than a licence on a thing.

There’s probably a broader category of openness here which relates to how open a process is socially. Words like inclusivity and diversity spring to mind.

Your standards process isn’t really open to all if all of your meetings are held face to face in Hawaii.

Openness of a product, system or platform

Products, platforms and systems can be open too. Here we can think of openness as relating to the degree to which the system

  • is built around open standards and open data (made from open things)
  •  is operated using open processes
  • is available for wider access and use

We can explore this by asking questions like:

  • Is it designed to run on open infrastructure or is it tied to particular cloud infrastructure or hardware?
  • Are the interfaces to the system built around open standards?
  • Can I get access to an API? Or is it invite only?
  • How do the terms of service shape the acceptable uses of the system?
  • Can I use its outputs, e.g. the data returned by a platform or an API, under an open licence?
  • Can we observe how well the system or platform is performing, or measure its impacts in different ways (e.g. socially, economically, environmentally)

Openness of an ecosystem

Ecosystems can be open too. In one sense an open ecosystem is “all of the above”. But there are properties of an ecosystem that might itself indicate aspects of openness:

  • Is there a choice in providers, or is there a monopoly provider of services or data?
  • How easy is it for new organisations to engage with the ecosystem, e.g. to provide competing or new services?
  • Can we measure the impacts and operations of the ecosystem?

When we’re talking about openness of an ecosystem we’re usually talking about markets and sectors and regulation and governance.

Applying this in practice

So when thinking about whether something is “open”, the first thing I tend to do is consider which of the above categories apply. In some cases it’s actually several.

This is evident in my attempt to define “open API”.

For example we’re doing some work @ODIHQ to explore the concept of a digital twin. According to the Gemini Principles a digital twin should be open. Here we can think of an individual digital twin as an object (a piece of software or a model), or a process (e.g. as an open source project), or an operational system or platform, depending on how it’s made available.

We’re also looking at cities. Cities can be open in the sense of the openness of their processes of governance and decision making. They might also be considered as platforms for sharing data and connecting software. Or as ecosystems of the same.

Posted at 09:05

July 08

Leigh Dodds: Lets talk about plugs

This is a summary of a short talk I gave internally at the ODI to help illustrate some of the important aspects of data standards for non-technical folk. I thought I’d write it up here too, in case it’s useful for anyone else. Let me know what you think.

We benefit from standards in every aspect of our daily lives. But because we take them for granted, we don’t tend to think about them very much. At the ODI we’re frequently talking about standards for data which, if you don’t have a technical background, might be even harder to wrap your head around.

A good example can help to illustrate the value of standards. People frequently refer to telephone lines, railway tracks, etc. But there’s an example that we all have plenty of personal experience with.

Lets talk about plugs!

You can confidently plug any of your devices into a wall socket and it will just work. No thought required.

Have you ever thought about what it would be like if plugs and wall sockets were all different sizes and shapes?

You couldn’t rely on being able to consistently plug your device into any random socket, so you’d have to carry around loads of different cables. Manufacturers might not design their plugs and sockets very well, so there might be greater risks of electrocution or fires. Or maybe the company that built your new house decided to only fit a specific type of wall socket because it’s agreed a deal with an electrical manufacturer, so when you move in you need to buy a completely new set of devices.

We don’t live in that world thankfully. As a nation we’ve agreed that all of our plugs should be designed the same way.

That’s all a standard is. A documented, reusable agreement that everyone uses.

Notice that a single standard, “how to design a really great plug“, has multiple benefits. Safety is increased. We save time and money. Manufacturers can be confident that their equipment will work in any home or office.

That’s true of different standards too. Standards have economic, policy, technical and social impacts.

Open up a UK plug and it looks a bit like this.

Notice that there are colours for different types of wires (2, 3, 4), and that fuses (5) are expected to be the same size and shape. Those are all standards too. The wiring and voltages are standardised as well.

So the wiring, wall sockets and plugs in your house are designed according to a whole family of different standards that are designed to work with one another.

We can design more complex systems from smaller standards. It helps us make new things faster, because we are reusing existing work.

That’s a lot of time and agreement that we all benefit from. Someone somewhere has invested the time and energy into thinking all of that through. Lucky us!

When we visit other countries, we learn that their plugs and sockets are different. Oh no!

That can be a bit frustrating, and means we have to spend a bit more money and remember to pack the right adapters. It’d be nice if the whole world agreed on how to design a plug. But that seems unlikely. It would cost a lot of time and money in replacing wiring and sockets.

But maybe those different designs are intentional? Perhaps there are different local expectations around safety, for example. Or in what devices people might be using in their homes. There might be reasons why different communities choose to design and adopt slightly different standards. Because they’re meeting slightly different needs. But sometimes those differences might be unnecessary. It can be hard to tell sometimes.

The people most impacted by these differences aren’t tourists; it’s the manufacturers that have to design equipment to work in different locations. Which is why your electrical devices normally have a separate cable. So, depending on whether you travel or whether you’re a device manufacturer, you’ll have different perceptions of how much of a problem that is.

All of the above is true for data standards.

Standards for data are agreements that help us collect, access, share, use and publish data in consistent ways.  They have a range of different impacts.

There are lots of different types of standard and we combine them together to create different ways to successfully exchange data. Different communities often have their own standards for similar things, e.g. for describing metadata or accessing data via an API.

Sometimes those are simple differences that an adapter can easily fix. Sometimes those differences are because the standards are designed to meet different needs.

Unfortunately we don’t live in a world of standardised data plugs and wires and fuses. We live in that other world. The one where it’s hard to connect one thing to another thing. Where the stuff coming down the wires is completely unexpected. And we get repeated shocks from accidental releases of data.

I guarantee that in every user research, interview, government consultation or call for evidence, people will be consistently highlighting the need for more standards for data. People will often say this explicitly: “We need more standards!”. But sometimes they refer to the need in other ways: “We need to make data more discoverable!” (metadata standards) or “We need to make it easier to safely release data!” (standardised codes of practice).

Unfortunately that’s not always that helpful because when you probe a little deeper you find that people are talking about lots of different things. Some people want to standardise the wiring. Others just want to agree on a voltage. While others are still debating the definition of “fuse”. These are all useful and important things. You just need to dig a little deeper to find the most useful place to start.

It’s also not always clear whose job it is to actually create those standards. Because we take standards for granted, we’re not always clear about how they get created. Or how long it takes and what process to follow to ensure they’re well designed.

The reason we published the open standards for data guidebook was to help communities get started in designing the standards they need.

Standards development needs time and investment, as someone somewhere needs to do the work of creating them. That, as ever, is the really hard part.

Standards are part of the data infrastructure that help us unlock value from data. We need to invest in creating and maintaining them like we do other parts of our infrastructure.

Don’t just listen to me; listen to some of the people who’ve been creating standards for their communities.

Posted at 21:11

June 11

Benjamin Nowack: Trice' Semantic Richtext Editor

In my previous post I mentioned that I'm building a Linked Data CMS. One of its components is a rich-text editor that allows the creation (and embedding) of structured markup.

An earlier version supported limited Microdata annotations, but now I've switched the mechanism and use an intermediate, but even simpler approach based on HTML5's handy data-* attributes. This lets you build almost arbitrary markup with the editor, including Microformats, Microdata, or RDFa. I don't know yet when the CMS will be publicly available (3 sites are under development right now), but as mentioned, I'd be happy about another pilot project or two. Below is a video demonstrating the editor and its easy customization options.
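
To give a rough idea of the approach, a data-* annotated snippet might look something like the sketch below (the attribute names are purely hypothetical and not necessarily the conventions Trice actually uses) before being mapped to RDFa or Microdata:

    <!-- hypothetical data-* annotations, shown only to illustrate the idea -->
    <article data-type="schema:BlogPosting">
      <h1 data-property="schema:headline">Trice' Semantic Richtext Editor</h1>
      <span data-property="schema:author">Benjamin Nowack</span>
    </article>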

Posted at 22:07

May 31

Leigh Dodds: The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing I’ve had a bit of head space to think.  One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan‘s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.

As Dan pointed out we often need a broader vocabulary when talking about data.  Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought (“the war of words”). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there’s quite a few.

For example there are “samples” and “sampling”. These are words used in statistics but their broader usage has the same meaning. When we talk about sampling something, whether it’s food or drink, music or perfume, it’s clear that we’re not taking the whole thing. Talking about sampling might help us be clearer that often when we’re collecting data we don’t have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people’s opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We’re all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing“?  Unlike “sensing” which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing” which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often being used to make predictions. For example, of products we might want to buy, or whether we’re going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. We talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. But they don’t really do any of those things.

What they are doing is making a “forecast“. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins“. Perhaps we should just talk more about “simulations“? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I’m going to try to deliberately use a broad range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?

Posted at 18:05

May 30

Leigh Dodds: How can we describe different types of dataset? Ten dataset archetypes

As a community, when we are discussing recommendations and best practices for how data should be published and governed, there is a natural tendency for people to focus on the types of data they are most familiar with working with.

This leads to suggestions that every dataset should have an API, for example. Or that every dataset should be available in bulk. While good general guidance, those approaches aren’t practical in every case. That’s because we also need to take into account a variety of other issues, including:

  • the characteristics of the dataset
  • the capabilities of the publishing organisation and the funding they have available
  • the purpose behind publishing the data
  • and the ethical, legal and social contexts in which it will be used

I’m not going to cover all of that in this blog post.

But it occurred to me that it might be useful to describe a set of dataset archetypes, that would function a bit like user personas. They might help us better answer some of the basic questions people have around data, discuss recommendations around best practices, inform workshop exercises or just test our assumptions.

To test this idea I’ve briefly described ten archetypes. For each one I’ve tried to describe some of its features, identified some specific examples, and briefly outlined some of the challenges that might apply in providing sustainable access to it.

Like any characterisation, detail is lost. This is not an exhaustive list. I haven’t attempted to list every possible variation based on size, format, timeliness, category, etc. But I’ve tried to capture a range that hopefully illustrates some different characteristics. The archetypes reflect my own experiences; you will have different thoughts and ideas. I’d love to read them.

The Study

The Study is a dataset that was collected to support a research project. The research group collected a variety of new data as part of conducting their study. The dataset is small, focused on a specific use case and there are no plans to maintain or update it further as the research group does not have any ongoing funding to collect or maintain the dataset. The data is provided as is for others to reuse, e.g. to confirm the original analysis of the data or to use it in other studies. To help others, and as part of writing some academic papers that reference the dataset, the research group has documented their methodology for collecting the data. The dataset is likely published in an academic data portal or alongside the academic papers that reference it.

Examples: water quality samples, field sightings of animals, laboratory experiment results, bibliographic data from a literature review, photos showing evidence of plant diseases, consumer research survey results

The Sensor Feed

The Sensor Feed is a stream of sensor readings that are produced by a collection of sensors that have been installed across a city. New readings are added to the stream at regular intervals. The feed is provided to allow a variety of applications to tap into the raw sensor readings. The data points are as directly reported by the individual sensors and are not quality controlled. The individual sensors may have been updated, re-calibrated or replaced over time. The readings are part of the operational infrastructure of the city so can be expected to be available over at least the medium term. This means the dataset is effectively unbounded: new observations will continue to be reported until the infrastructure is decommissioned.

Examples: air quality readings, car park occupancy, footfall measurements, rain gauges, traffic light queuing counts, real-time bus locations

The Statistical Index

The Statistical Index is intended to provide insights into the performance of specific social or economic policies by measuring some aspect of a local community or economy. For example a sales or well-being index. The index draws on a variety of primary datasets, e.g. on commercial activities, which are then processed according to a documented methodology to generate the index. The Index is stewarded by an organisation and is expected to be available over the long term. The dataset is relatively small and is reported against specific geographic areas (e.g. from The Register) to support comparisons. The Index is updated on a regular basis, e.g. monthly or annually. Use of the data typically involves comparing across time and location at different levels of aggregation.

Examples: street safety survey, consumer price indices, happiness index, various national statistical indexes

The Register

The Register is a set of reference data that is useful for adding context to other datasets. It consists of a list of specific things, e.g. locations, cars, services, with a unique identifier and some basic descriptive metadata for each of the entries on the list. The Register is relatively small, but may grow over time. It is stewarded by an organisation tasked with making the data available for others. The steward, or custodian, provides some guarantees around the quality of the data. It is commonly used as a means to link, validate and enrich other datasets and is rarely used in isolation, other than in reporting on changes to the size and composition of the register.

Examples: licensed pubs, registered doctors, lists of MOT stations, registered companies, a taxonomy of business types, a statistical geography, addresses

The Database

The Database is a copy or extract of the data that underpins a specific application or service. The database contains information about a variety of different types of things, e.g. musicians and their albums and songs. It is a relatively large dataset that can be used to perform a variety of different types of query and to support a variety of uses. As it is used in a live service it is regularly updated, undergoes a variety of quality checks, and is growing over time in both volume and scope. Some aspects of The Database may reference one or more Registers or could be considered as Registers in themselves.

Examples: geographic datasets that include a variety of different types of features (e.g. OpenStreetMap, MasterMap), databases of music (e.g. MusicBrainz) and books (e.g. OpenLibrary), company product and customer databases, Wikidata

The Description

The Description is a collection of a few data points relating to a single entity. Embedded into a single web page, it provides some basic information about an event, or place, or company. Individually it may be useful in context, e.g. to support a social interaction or application share. The owner of the website provides some data about the things that are discussed or featured on the website, but does not have access to a full dataset. The individual item descriptions are provided by website contributors using a CMS to add content to the website. If available in aggregate, the individual descriptions might make a useful Database or Register.

Examples: descriptions of jobs, events, stores, video content, articles

The Personal Records

The Personal Records are a history of the interactions of a single person with a product or service. The data provides insight into the individual person’s activities. The data is a slice of a larger dataset that contains data for a larger number of people. As the information contains personal information it has to be secure and the individual has various rights over the collection and use of the data as granted by GDPR (or similar local regulation). The dataset is relatively small, is focused on a specific set of interactions, but is growing over time. Analysing the data might provide useful insight to the individual that may help them change their behaviour, improve their health, etc.

Examples: bank transactions, home energy usage, fitness or sleep tracker, order history with an online service, location tracker, health records

The Social Graph

The Social Graph is a dataset that describes the relationships between a group of individuals. It is typically built-up by a small number of contributions made by individuals that provide information about their relationships and connections to others. They may also provide information about those other people, e.g. names, contact numbers, service ratings, etc. When published or exported it is typically focused on a single individual, but might be available in aggregate. It is different to Personal Records as its specifically about multiple people, rather than a history of information about an individual (although Personal Records may reference or include data about others).  The graph as a whole is maintained by an organisation that is operating a social network (or service that has social features).

Examples: social networks data, collaboration graphs, reviews and trip histories from ride sharing services, etc

The Observatory

The Observatory is a very large dataset produced by a coordinated large-scale data collection exercise, for example by a range of earth observation satellites. The data collection is intentionally designed to support a variety of down-stream uses, which informs the scale and type of data collected. The scale and type of data can make it difficult to use because of the need for specific tools or expertise. But there is a wide range of ways in which the raw data can be processed to create other types of data products, to drive a variety of analyses, or to power a variety of services. It is refreshed and re-released as required by the needs and financial constraints of the organisations collaborating on collecting and using the dataset.

Examples: earth observation data, LIDAR point clouds, data from astronomical surveys or Large Hadron Collider experiments

The Forecast

The Forecast is used to predict the outcome of specific real-world events, e.g. a weather or climate forecast. It draws on a variety of primary datasets which are then processed and analysed to produce the output dataset. The process by which the predictions are made is well-documented to provide insight into the quality of the output. As the predictions are time-based the dataset has a relatively short “shelf-life” which means that users need to quickly access the most recent data for a specific location or area of interest. Depending on the scale and granularity, Forecast datasets can be very large, making them difficult to distribute in a timely manner.

Example: weather forecasts

Let me know what you think of these. Do they provide any useful perspective? How would you use or improve them?

Posted at 13:05

May 18

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below, the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision ), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Posted at 22:06

May 16

Benjamin Nowack: Contextual configuration - Semantic Web development for visually minded webmasters

Let's face it, building semantic web sites and apps is still far from easy. And to some extent, this is due to the configuration overhead. The RDF stack is built around declarative languages (for simplified integration at various levels), and as a consequence, configuration directives often end up in some form of declarative format, too. While fleshing out an RDF-powered website, you have to declare a ton of things. From namespace abbreviations to data sources and API endpoints, from vocabularies to identifier mappings, from queries to object templates, and what have you.

Sadly, many of these configurations are needed to style the user interface, and because of RDF's open world context, designers have to know much more about the data model and possible variations than usually necessary. Or webmasters have to deal with design work. Not ideal either. If we want to bring RDF to mainstream web developers, we have to simplify the creation of user-optimized apps. The value proposition of semantics in the context of information overload is pretty clear, and some form of data integration is becoming mandatory for any modern website. But the entry barrier caused by large and complicated configuration files (Fresnel anyone?) is still too high. How can we get from our powerful, largely generic systems to end-user-optimized apps? Or the other way round: How can we support frontend-oriented web development with our flexible tools and freely mashable data sets? (Let me quickly mention Drupal here, which is doing a great job at near-seamlessly integrating RDF. OK, back to the post.)

Enter RDF widgets. Widgets have obvious backend-related benefits like accessing, combining and re-purposing information from remote sources within a manageable code sandbox. But they can also greatly support frontend developers. They simplify page layouting and incremental site building with instant visual feedback (add a widget, test, add another one, re-arrange, etc.). And, more importantly in the RDF case, they can offer a way to iteratively configure a system with very little technical overhead. Configuration options could not only be scoped to the widget at hand, but also to the context where the widget is currently viewed. Let's say you are building an RDF browser and need resource templates for all kinds of items. With contextual configuration, you could simply browse the site and at any position in the ontology or navigation hierarchy, you would just open a configuration dialog and define a custom template, if needed. Such an approach could enable systems that worked out of the box (raw, but usable) and which could then be continually optimized, possibly even by site users.

A lot of "could" and "would" in the paragraphs above, and the idea may sound quite abstract without actually seeing it. To illustrate the point I'm trying to make I've prepared a short video (embedded below). It uses Semantic CrunchBase and Paggr Prospect (our new faceted browser builder) as an example use case for in-context configuration.

And if you are interested in using one of our solutions for your own projects, please get in touch!



Paggr Prospect (part 1)


Paggr Prospect (part 2)

Posted at 23:06

Benjamin Nowack: 2011 Resolutions and Decisions

All right, this post could easily have become another rant about the ever-growing complexity of RDF specifications, but I'll turn it into a big shout-out to the Semantic Web community instead. After announcing the end of investing further time into ARC's open-source branch, I received so many nice tweets and mails that I was reminded of why I started the project in the first place: The positive vibe in the community, and the shared vision. Thank you very much everybody for the friendly reactions, I'm definitely very moved.

Some explanations: I still share the vision of machine-readable, integration-ready web content, but I have to face the fact that the current approach is getting too expensive for web agencies like mine. Luckily, I could spot a few areas where customer demands meet the cost-efficient implementation of certain spec subsets. (Those don't include comprehensive RDF infrastructure and free services here, though. At least not yet, and I just won't make further bets). The good news: I will continue working with semantic web technologies, and I'm personally very happy to switch focus from rather frustrating spec chasing to customer-oriented solutions and products with defined purposes . The downside: I have to discontinue a couple of projects and services in order to concentrate my energy and reduce (opportunity) costs. These are:
  • The ARC website, mailing list, and other forms of free support. The code and documentation get a new home on GitHub , though. The user community is already thinking about setting up a mailing list on their own. Development of ARC is going to continue internally, based on client projects (it's not dying).
  • Trice as an open-source project (lesson learned from ARC)
  • Semantic CrunchBase. I had a number of users but no paying ones. It was also one of those projects that happily burn your marketing budget while at the same time having only negative effects on the company's image because the funds are too small to provide a reliable service (similar to the flaky DBPedia SPARQL service which makes the underlying RDF store look like a crappy product although it is absolutely not).
  • Knowee, Smesher and similar half-implemented and unfunded ideas.
Looking forward to a more simplified and streamlined 2011. Lots of success to all of you, and thanks again for the nice mails!

Posted at 07:07

May 09

Sebastian Trueg: Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and can not be found in  the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quadruple-store which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that but this view on things is sufficient for our purposes), public graphs being readable and writable by anyone who has permission to read or write in general, private graphs only being readable and writable by administrators and those to which named graph permissions have been granted. The latter case is what interests us today.

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

Virtuoso Sparql Endpoint

Sparql Result
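
An insert along these lines can also be issued as a plain SPARQL 1.1 update; the triple below is just an illustrative placeholder, not the actual data from the demo:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
  GRAPH <urn:trueg:demo> {
    <urn:trueg:demo#thing1> rdfs:label "A resource in the demo graph" .
  }
}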

This graph is now public and can be queried by anyone. Since we want to make it private we quickly need to change into a SQL session since this part is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

Sparql Query
Sparql Query Result
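
A query of roughly this shape is all that is needed; while the graph is private it comes back empty (or with an access error):

SELECT ?s ?p ?o
FROM <urn:trueg:demo>
WHERE { ?s ?p ?o }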

But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <http://www.linkedin.com/in/trueg> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends on it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for agent http://www.linkedin.com/in/trueg. The only tricky part is the scope. Virtuoso has the concept of ACL scopes which group rules by their resource type. In this case the scope is private graphs, another typical scope would be DAV resources.

Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.

Finally we will login using my LinkedIn identity and are granted read access to the graph:

SPARQL Endpoint Login

We see all the original triples in the private graph. And as before with DAV resources no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
};

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .

I will explain these technical concepts in more detail in another article.

Posted at 10:06

Sebastian Trueg: Sharing Files With Whomever Is Simple

Dropbox, Google Drive, OneDrive, Box.com – they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. On first glance this might be easy enough but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?

I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.

Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:

Virtuoso DAV Browser

This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.
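
As an aside, since this is standard WebDAV, files can also be uploaded and fetched with any WebDAV-capable client such as curl; the path and credentials below are purely illustrative examples, not values from this demo:

$ curl -T notes.txt -u demo:demo http://localhost:8890/DAV/home/demo/notes.txt
$ curl -u demo:demo http://localhost:8890/DAV/home/demo/notes.txt -o notes-copy.txt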

We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.

For this small demo I will share a file with my LinkedIn identity http://www.linkedin.com/in/trueg. (Virtuoso/VAL identifies people via URIs; thus, it has schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)

Virtuoso Share File

Now when I logout and try to access the file in question I am presented with the authentication dialog from VAL:

VAL Authentication Dialog

This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn which will result in an OAuth handshake followed by the granted read access to the file:

LinkedIn OAuth Handshake

 

Access to file granted

It is that simple. Of course these identifiers can also be used in groups, allowing you to share files and folders with a set of people instead of just one individual.

Next up: Sharing Named Graphs via VAL.

Posted at 10:06

Sebastian Trueg: Digitally Sign Emails With Your X.509 Certificate in Evolution

Digitally signing emails is always a good idea. People can verify that you actually sent the mail and they can encrypt emails in return. A while ago Kingsley showed how to sign emails in Thunderbird. I will now follow up with a short post on how to do the same in Evolution.

The process begins with actually getting an X.509 certificate including an embedded WebID. There are a few services out there that can help with this, most notably OpenLink’s own YouID and ODS. The former allows you to create a new certificate based on existing social service accounts. The latter requires you to create an ODS account and then create a new certificate via Profile edit -> Security -> Certificate Generator. In any case make sure to use the same email address for the certificate that you will be using for email sending.

The certificate will actually be created by the web browser, making sure that the private key is safe.

If you are a Google Chrome user you can skip the next step since Evolution shares its key storage with Chrome (and several other applications). If you are a user of Firefox you need to perform one extra step: go to the Firefox preferences, into the advanced section, click the “Certificates” button, choose the previously created certificate, and export it to a .p12 file.

Back in Evolution’s settings you can now import this file:

To actually sign emails with your shiny new certificate stay in the Evolution settings, choose to edit the Mail Account in question, select the certificate in the Secure MIME (S/MIME) section and check “Digitally sign outgoing messages (by default)“:

The nice thing about Evolution here is that in contrast to Thunderbird there is no need to manually import the root certificate which was used to sign your certificate (in our case the one from OpenLink). Evolution will simply ask you to trust that certificate the first time you try to send a signed email:

That’s it. Email signing in Evolution is easy.

Posted at 10:06

Davide Palmisano: SameAs4J: little drops of water make the mighty ocean

A few days ago Milan Stankovich contacted the Sindice crew to let us know that he had written a simple Java library to interact with the public Sindice HTTP APIs. We always appreciate this kind of community effort to collaboratively make Sindice a better place on the Web. Agreeing with Milan, we decided to put some effort into his initial work to make the library the official open source tool for Java programmers.
That reminded me that, a few months ago, I did for sameas.org the same thing Milan did for us. But (ashamed) I never informed those guys about what I did.
Sameas.org is a great and extremely useful tool on the Web that makes it concretely possible to interlink different Linked Data clouds. Simple to use (both for humans via HTML and for machines with a simple HTTP/JSON API) and extremely reactive, it allows you to get all the owl:sameAs objects for a given URI. And, moreover, it’s based on Sindice.com.
Do you want to know the identifier of http://dbpedia.org/resource/Rome in Freebase or Yago? Just ask it to Sameas.org.
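
Purely as an illustration, such a lookup is a single HTTP request against the JSON API; the exact URL and parameter format below are an assumption based on how the service has commonly been used, so check the sameas.org documentation:

$ curl 'http://sameas.org/json?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FRome'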

So, after some months I just refined a couple of things, added some javadocs, set up a Maven repository and made SameAs4j publicly available (MIT licensed) to everyone on Google Code.
It’s a simple but reliable tiny set of Java classes that allows you to interact with sameas.org programmatically in your Java Semantic Web applications.

Back to the beginning: every piece of open source software is like a little drop of water that helps make the mighty ocean, so please submit any issues or patches if you are interested.

Posted at 10:06

Copyright of the postings is owned by the original blog authors. Contact us.