Planet RDF

It's triples all the way down

July 28

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • OpenCalais gives you more free API calls out of the box than Zemanta (50,000 vs. 1,000 per day). You can get a free upgrade to 10,000 Zemanta calls via a simple email, though (or through excessive API use; Andraž auto-upgraded my API limit when he noticed my crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated, as you have to encode certain parameters in an XML string, whereas Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . htmlspecialchars($id, ENT_QUOTES) . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.
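
As a rough illustration of the request construction above, here is a minimal Python sketch that builds the form-encoded POST body for the Zemanta call. The parameter names are taken from the PHP snippet; the function name and the placeholder key are hypothetical. The sketch passes the raw text to `urlencode()` so that it is escaped exactly once; whether the PHP `qs()` helper encodes its values again is not shown in the post, so this is an assumption about the intended wire format rather than a description of it.

```python
from urllib.parse import urlencode


def build_zemanta_body(api_key: str, text: str) -> str:
    """Return a form-encoded POST body for a zemanta.suggest call (sketch)."""
    args = {
        "method": "zemanta.suggest",
        "api_key": api_key,
        "text": text,  # raw text; urlencode() below escapes it once
        "format": "rdfxml",
        "return_rdf_links": "1",
        "return_articles": "0",
        "return_categories": "0",
        "return_images": "0",
        "emphasis": "0",
    }
    return urlencode(args)


# "YOUR_KEY" is a placeholder, not a real API key.
body = build_zemanta_body("YOUR_KEY", "Starbucks & Howard Schultz")
```

The resulting string would be sent with a `Content-Type: application/x-www-form-urlencoded` header, as in the PHP reader code.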

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
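
If the store cannot apply the confidence filter itself, the same selection can be done on the client. The following Python sketch assumes the query results arrive as a list of dictionaries keyed by the variable names used above (`id`, `cnf`, `name`); that row format, the function name, and the sample URIs are illustrative assumptions. Unlike the SPARQL query, it also keeps only the most confident recognition per entity.

```python
def select_entities(rows, min_confidence=0.4):
    """Keep one row per entity URI at or above the threshold, sorted by URI."""
    best = {}
    for row in rows:
        cnf = float(row["cnf"])
        if cnf < min_confidence:
            continue
        # Keep the most confident recognition for each entity URI.
        if row["id"] not in best or cnf > float(best[row["id"]]["cnf"]):
            best[row["id"]] = row
    return [best[uri] for uri in sorted(best)]


# Made-up sample rows for illustration only.
rows = [
    {"id": "http://dbpedia.org/resource/Starbucks", "cnf": "0.82", "name": "Starbucks"},
    {"id": "http://dbpedia.org/resource/Starbucks", "cnf": "0.55", "name": "Starbucks"},
    {"id": "http://dbpedia.org/resource/Seattle", "cnf": "0.21", "name": "Seattle"},
]
entities = select_entities(rows)
```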
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Zemanta responses, in contrast, do not (yet, Andraž told me they are working on it) contain entity types at all. You always need an additional request to retrieve type information (unless you are doing nasty URI inspection, which is what I did with detected URIs from Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer: looking up a DBpedia URI, for example, will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.
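
Until the identifiers are fixed upstream, the ".rdf#self" problem can be worked around on the client. Here is a minimal Python sketch of such a normalization step; the function name and the example URI are hypothetical, and the caller is assumed to apply it only to Semantic CrunchBase identifiers.

```python
def fix_crunchbase_uri(uri: str) -> str:
    """Rewrite identifiers ending in '.rdf#self' to the '#self' form."""
    suffix = ".rdf#self"
    if uri.endswith(suffix):
        return uri[: -len(suffix)] + "#self"
    return uri


# The host and path below are made up for illustration.
fixed = fix_crunchbase_uri("http://example.org/company/starbucks.rdf#self")
```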

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Posted at 09:50

July 27

AKSW Group - University of Leipzig: ORE 0.2 Released

Today, we released version 0.2 of the ontology repair and enrichment (ORE) tool. It is a tool for knowledge engineers to improve an OWL ontology through a wizard-like repair process, and it uses state-of-the-art ontology debugging methods. The main feature in version 0.2 is a mode for incrementally detecting inconsistencies in large knowledge bases available as SPARQL endpoints. Using this mode, we have detected inconsistencies and computed justifications in DBpedia Live and OpenCyc. To the best of our knowledge, both knowledge bases were previously too large for computing justifications on standard hardware, i.e. inconsistencies could not be fixed efficiently. A screencast illustrates this process for the case of DBpedia Live. Thanks to Lorenz Bühmann for his work on ORE.

ORE Homepage | Download | Screencast | AKSW Homepage

Posted at 21:31

Norm Walsh: Balisage 2010

Balisage 2010 is only days away. (How did that happen?)

Posted at 20:43

Norm Walsh: XML Calabash V0.9.23

Announcing XML Calabash V0.9.23.

Posted at 20:02

Norm Walsh: DocBook V5.1 Beta 2

Announcing DocBook V5.1b2, the second beta release of (what will become) DocBook V5.1. Version 5.1 includes significant new features for topic-based authoring as well as a number of bug fixes.

Posted at 19:30

Bob DuCharme: Jazz camp

Theory and practice.

Posted at 14:04

Dublin Core Metadata Initiative: DC-2010 Program updated, early-bird discount until 10 September 2010

2010-07-27, The organizing committee of DC-2010, the tenth International Conference on Dublin Core and Metadata Applications, to be held in Pittsburgh, PA, USA, 20-22 October 2010, has published an update to the program for the event. More meetings of DCMI Communities and Task Groups have been added and more details are now included for these meetings and the special sessions. Please register online; early-bird discount is available until 10 September 2010.

Posted at 00:00

July 23

Bill Roberts: Linked Data in Edinburgh and Manchester

I’ve been at a couple of great Web of Data events in the last ten days or so.

On 13 July, I organised a Linked Data meetup in Edinburgh that I’m pleased to say went very well. Around 25 people showed up to hear interesting talks from Zach Beauvais of Talis (slides) and Paola di Maio of Strathclyde University (slides). There was a good mix of people already experienced with linked data and others who wanted to learn more about it – many of them with specific potential applications in mind.

There’s a fuller write-up at the SLDIG wiki.

Then a couple of days ago I went down to Manchester for the Vision and Media Transmission 6 event “Towards a web of data?” organised by Paul Collins of the White Room.

I presented on “Publishing Linked Data: Getting Started”. The other speakers were Paul Miller of Cloud of Data (and semantic podcaster extraordinaire) and Liz Turner of Iconomical, the creator of the well-known WhereDoesMyMoneyGo. Paul’s slides are here.

I very much enjoyed the chance to be involved and to meet with some of the thriving digital media community in Manchester.

[Credits: Manchester picture by Graham Smith, via Flickr.]

Posted at 14:11

July 22

Talis: Linked Data in Libraries – Presentations

The Talis Linked Data in Libraries event, held at the British Library in London on Wednesday 21st July, was attended by 50 enthusiastic people interested in the topic.

Below you will find presentations from the day.

Introduction: Talis and the world of Linked Data – Zach Beauvais, Talis

The data.bnf.fr Project – Romain Wenz, Bibliothèque nationale de France
        (Presentation not yet available)

Linked Data, RDF, and SPARQL – Rob Styles, Talis

Linked Data in Action – Richard Wallis, Talis

Lightning Talks:
                       Neil Wilson, The British Library
                       Sally Chambers, The European Library
                       Felix Ostrowski, The North Rhine-Westphalian Library Service (hbz)

Linked Bibliographic Data – Rob Styles, Talis

W3C Library Linked Data Incubator Group – Antoine Isaac, Europeana

An overview of the Talis Platform – Richard Wallis, Talis

Watch this space for videos of some of the sessions.

Posted at 15:53

W3C QA Blog Semantic Web News: Augmented Reality: A Point of Interest for the Web

Last month's Augmented Reality on the Web workshop in Barcelona has sparked a good deal of debate within and around W3C. As the final report shows, the workshop brought together many different companies and organizations working on or with a direct interest in the field of Augmented Reality — but how can W3C help in this area?

One outcome is clear: we need a method for representing data about points of interest and proposals are advancing to achieve this in a new POI Working Group. Quite what data needs to be represented concerning Points of Interest depends on who you ask. For some it's a question of annotating a given point on the Earth's surface where the longitude, latitude and altitude are all key identifiers. For others it's more a question of the point at a given distance and angle from an object that may or may not be static as seen by an observer who may themselves also be moving.

Different communities are involved here: as well as the augmented reality community, the linked data community has a keen interest. There are other facets to the discussion too and this is what will make the POI working group's work interesting!

The workshop also recommended that a new POI WG should go further and consider the wider picture of how AR does, or might, relate to the Web. Privacy is a major concern; device APIs are critical enablers; do CSS and SVG have sufficient power to support AR functions? Even the use of HTTP as a transport mechanism is questioned by some given the real time nature of AR.

To join the debate about all this, please subscribe to the Point of Interest mailing list and keep an eye out for calls for review of the charter in the near future.

Posted at 09:26

July 19

W3C Semantic Web News: OWLlink protocol published as a W3C Member Submission

The “OWLlink Protocol” specification has been published as a W3C member submission, co-authored by experts from Clark & Parsia LLC, Creative Commons, Daimler Chrysler Research and Technology, Free University of Bozen-Bolzano, German Research Center for Artificial Intelligence (DFKI) GmbH, NTT DOCOMO, Stanford University, University of Aberdeen, Computing Science, University of Manchester, and Vrije Universiteit. The specification defines a general, implementation-neutral protocol to access the functionalities of a reasoner acting as an (OWLlink) server. This general mechanism is defined in terms of UML; separate documents define bindings of this general protocol with different syntaxes that can be used to communicate with reasoners over HTTP. Using one of these concrete protocol bindings, clients can control and query reasoners using the terms defined in the general OWLlink Structure.

Posted at 16:05

July 17

Semantic Web Company (Austria): What if the biggest web company bought one of the central semantic web players?

Well, exactly this happened yesterday: Google bought Metaweb – provider of Freebase. Freebase is an important hub in the linked data cloud, providing 12 million entities with uniform resource identifiers, most of them linked to other semantic web datasets like DBpedia or the New York Times. For example, Google's page on Freebase offers a rich source of machine-readable facts about this company.

What does this mean for the Semantic Web community, which has been working on a smarter web for the last decade?
Well, a lot… First of all, it's good to hear that Google will continue to develop Freebase as a free and open database for everyone, saying “… we would be delighted if other web companies use and contribute to the data.”

Until yesterday, a lot of companies were still not fully convinced that the Semantic Web would play a central role in the further development of the Internet. Now the game has changed. The entity-driven approach to developing web applications has only just begun.

We will keep on reporting and discussing how Google will influence the development of the Semantic Web. And if I had one free wish: please add RDF(a) to the Freebase widgets!

Posted at 10:47

W3C Semantic Web News: POWDER: Not So Quiet

Since it completed the Recommendation Track process last year, little has been said or written about POWDER. However, there have been a number of unrelated events recently that I take as evidence of a long term future. As chair of the WG that created it (and an editor of all but one of the documents and general front-person for the whole thing), this makes me happy!

One of my private measures of success for it has always been that one day, someone I don't know and who doesn't know me stands up at a big conference and says "you know this POWDER thing is really cool." That happened at SemTech last month when Matt Fisher presented it in a session called RDF Friday Part 3: Practical RDF - POWDER & Object Design Patterns. The full version of what he was saying is available in an article on his company website Putting POWDER to Work. Matt and I have exchanged e-mails since then but we hadn't had any contact before.

Secondly, my friend and WG member Andrea Perego has been cooking up some code that uses POWDER to generate RDFa in a way that could make it really easy to add all those <link /> elements to documents on the fly, under the control of a single, central POWDER file.

Suppose you want to add RDFa to all the pages on your Web site (not a bad thing to do!). One can imagine doing this for Creative Commons licences, DC metadata etc. Something like

  <link rel="cc:license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" />
  <link rel="dcterms:creator" href="http://philarcher.org/foaf.rdf#me" />

These link elements should probably be included on every page of your site. Sounds like a job for POWDER. Andrea's PHP POWDER Processor (3P) can take a POWDER file and URI as inputs and return those RDFa link elements via a RESTful API - one that could easily be called from within an authoring tool. Full documentation, including an example using the Open Graph Protocol, is available.
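
To make the idea concrete, here is a small Python sketch of the general pattern: one central rule set, keyed by URI prefix, from which the link elements for any given page URI are generated. It is only an illustration under stated assumptions. The dictionary is a stand-in for a POWDER file (it is not POWDER syntax), the `http://example.org/` prefix is made up, and the function is not the API of Andrea's 3P processor.

```python
from html import escape

# Hypothetical rule set: URI prefix -> list of (rel, href) pairs.
RULES = {
    "http://example.org/": [
        ("cc:license", "http://creativecommons.org/licenses/by-nc-nd/3.0/"),
        ("dcterms:creator", "http://philarcher.org/foaf.rdf#me"),
    ],
}


def link_elements(uri, rules=RULES):
    """Return the <link /> elements that apply to the given page URI."""
    lines = []
    for prefix, pairs in rules.items():
        if uri.startswith(prefix):
            for rel, href in pairs:
                # escape() guards against quotes or ampersands in the values
                lines.append('<link rel="%s" href="%s" />' % (escape(rel), escape(href)))
    return lines


links = link_elements("http://example.org/blog/some-post")
```

An authoring tool could call such a function for every page it renders, so that a change to the central rule set is picked up site-wide.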

Another development is still under wraps at the moment but the signs are very positive that a combination of marketing expertise, industry contacts, business dynamism and, not unimportantly, venture capital is coming together in a POWDER-fuelled start-up.

A quick reminder of the key features of POWDER:

  • it allows you to associate a bunch of predicates and objects with any number of subjects (as Dan Brickley puts it: it solves the aboutEachPrefix issue);
  • its primary format is XML, a small amount of which can describe a large amount of content;
  • it has an associated GRDDL transform that renders the data as semantically-equivalent OWL;
  • a POWDER processor returns RDF triples;
  • the provenance of the data is always declared.

If you haven't looked at POWDER before, maybe now's a good time to do so.

Posted at 08:28

July 16

Ebiquity research group UMBC: Google acquires Metaweb and Freebase

Google announced today that it has acquired Metaweb, the company behind Freebase — a free, semantic database of “over 12 million people, places, and things in the world.” This is from their announcement on the Official Google blog:

“Over time we’ve improved search by deepening our understanding of queries and web pages. The web isn’t merely words — it’s information about things in the real world, and understanding the relationships between real-world entities can help us deliver relevant information more quickly. … With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”

In their announcement, Google promises to continue to maintain Freebase “as a free and open database for the world” and invites other web companies to use and contribute to it.

Freebase is a system very much in the linked open data spirit, even though RDF is not its native representation. Its content is available as RDF and there are many links that bind it to the LOD cloud. Moreover, Freebase has a very good wiki-like interface allowing people to upload, extend and edit both its schema and data.

Here’s a video on the concepts behind Metaweb which are, of course, also those underlying the Semantic Web. What's the difference? I’d say a combination of representational details and a centralized (Metaweb) vs. distributed (Semantic Web) approach.

Posted at 19:30

Ebiquity research group UMBC: Search neutrality: Google and Danny Sullivan weigh in

Web search guru Danny Sullivan has a great response to the NYT editorial on regulating search engine algorithms: The New York Times Algorithm and Why It Needs Government Regulation. Here’s how it starts:

“The New York Times is the number one newspaper web site. Analysts reckon it ranks first in reach among US opinion leaders. When the New York Times editorial staff tweaks its supersecret algorithm behind what to cover and exactly how to cover a story — as it does hundreds of times a day — it can break a business that is pushed down in coverage or not covered at all.”

Google published its own response to the Times piece as a Financial Times op-ed and also posted it to the Google public policy blog: regulating what is “best” in search?

“Search engines use algorithms and equations to produce order and organisation online where manual effort cannot. These algorithms embody rules that decide which information is “best”, and how to measure it. Clearly defining which of any product or service is best is subjective. Yet in our view, the notion of “search neutrality” threatens innovation, competition and, fundamentally, your ability as a user to improve how you find information.”

The penultimate paragraph gives what they say is their strongest argument against mandating “search neutrality”.

“But the strongest arguments against rules for “neutral search” is that they would make the ranking of results on each search engine similar, creating a strong disincentive for each company to find new, innovative ways to seek out the best answers on an increasingly complex web. What if a better answer for your search, say, on the World Cup or “jaguar” were to appear on the web tomorrow? Also, what if a new technology were to be developed as powerful as PageRank that transforms the way search engines work? Neutrality forcing standardised results removes the potential for innovation and turns search into a commodity.”

This assumes, of course, that there is real competition among Internet search engines. Microsoft has been putting a lot of research and development into Bing with good results, and it’s been gaining market share. Yahoo is doing very interesting things as well. Consumer choice among a handful of competitors would be the best way to ensure that none abuse their customers.

Posted at 05:01

July 15

Dublin Core Metadata Initiative: Joint NISO/DCMI Webinar on 25 August 2010

2010-07-15, Tom Baker and Makx Dekkers will be presenting a Webinar on 25 August 2010 from 1:00 to 2:30 p.m. US Eastern Time (UTC 17:00-18:30) under the title "Dublin Core: The Road from Metadata Formats to Linked Data". Further details and registration information are available at the NISO Web site.

Posted at 23:59

Ebiquity research group UMBC: New York Times editorializes about the Google search ranking algorithm

In what may be a first, today’s New York Times has an editorial about an algorithm. No, they haven’t waded into the P=NP issue, but commented on Google’s algorithm for ranking search results and accusations that Google unfairly biases it for its own self interest.

“In the past few months, Google has come under investigation by antitrust regulators in Europe. Rivals have accused Google of placing the Web sites of affiliates like Google Maps or YouTube at the top of Internet searches and relegating competitors to obscurity down the list. In the United States, Google said it expects antitrust regulators to scrutinize its $700 million purchase of the flight information software firm ITA, with which it plans to enter the online travel search market occupied by Expedia, Orbitz, Bing and others.”

This issue will become more important as the companies dominating Web search (Google, Microsoft and Yahoo) continue to increase their importance and also broaden their acquisition of companies offering web services.

The NYT’s position is moderate, recommending:

Google provides an incredibly valuable service, and the government must be careful not to stifle its ability to innovate. Forcing it to publish the algorithm or the method it uses to evaluate it would allow every Web site to game the rules in order to climb up the rankings — destroying its value as a search engine. Requiring each algorithm tweak to be approved by regulators could drastically slow down its improvements. Forbidding Google to favor its own services — such as when it offers a Google Map to queries about addresses — might reduce the value of its searches. With these caveats in mind, if Google is to continue to be the main map to the information highway, it concerns us all that it leads us fairly to where we want to go.

Posted at 18:28

July 13

Talis: Open Day: Linked Data and Health

We’ve seen and reported on the rise of Linked Data from concept to practice, and our Open Days have been a great opportunity to explore and explain Linked Data very broadly. The broad discussions have allowed many people to imagine using semantics with their own data, as publishers, developers, information architects etc. across many different industries and applications. But one area in which we are particularly interested is health.

Biomedical science is full of structured and semi-structured information, much of which crosses the organising boundaries we’ve created for it. Every aspect of medical practice, research and policy makes use of (and in most cases creates supplementary) information, and it’s become plain that much of this data is stored, hidden and often inaccessible.

I attended some sessions on biomedical semantics at SemTech last month, and was hugely intrigued by the state of health data world-wide. There are many usable ontologies for medical science, for example, which show the relationships among biological knowledge and clinical use; but much of the data used on the front line is not part of this structure. There seems to be much that could be gained from taking a Linked approach to these data!

Mark Birbeck and Dr Michael Wilkinson, in last month’s Nodalities Magazine, introduced the idea of “A Linked Data Platform for Innovation,” a project of the National Innovation Centre for joining clinicians to linked visualisations through a widget-like Linked Data platform:

The NIC is committed to using Semantic Web technologies as a way to significantly improve the speed and quality of decision-making in the area of health technology innovations.

So, we’ve decided to join forces with some of these minds and host an event to explain and explore biomedical data. We’ll be at No 76 Portland Place on 19th August from 10AM to 4PM. We’ve invited Dr Nigam Shah from Stanford University to talk to us about the state of global health data, and to suggest several ways in which linking can be done in the very near future. We will also cover the topic of Linked Data (what it is, and how it works), as well as taking a quick look at how it’s being used across the web already. The people behind the NIC’s clinical widget platform will also be there to introduce their project.

Places are free of charge, but limited so make sure to sign up to reserve your place.

We’d very much like to keep the spirit of an Open Day. This event is open for discussion, examination and exploration of using the Semantic Web in life sciences, so come armed with ideas, questions and problems!

Talis will be putting on lunch, and we will also have a ready supply of coffee on hand to help the discussions.

Image: “Science is Knowledge” by Zach Beauvais, is a mashup of “3D Stone Cells” by BlueRidgeKitties, and “Glass Bottles I” by Tim O’Brien via flickr. They are used under CC: BY, NC, SA licenses.

Posted at 18:24

Yves Raimond: First BBC microsite powered by a triple-store

Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.

All this is very exciting: the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)

There are two issues we are still currently trying to solve though:

  • We need to be able to cluster our triples in several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs.
  • With regards to programme data, the main bottleneck we're facing is the number of updates per second we need to be able to process, which most of the available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but it has a negative impact on query performance, as the write operations block the reads. We were quite surprised to see that the available triple store benchmarks do not take write throughput into account!
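The "graphs within graphs" idea from the first point can be illustrated with a small in-memory sketch. It is not how any particular triple store works; the class name `NestedGraphStore` and its methods are hypothetical, and triples are modelled as plain tuples. It only shows the two levels described above: a small graph that is replaced wholesale on update, nested inside a larger per-dataset graph that keeps sources isolated.

```python
class NestedGraphStore:
    """Toy model: dataset -> small graph -> set of triples (all names hypothetical)."""

    def __init__(self):
        self._datasets = {}

    def replace_graph(self, dataset, graph, triples):
        # An update replaces the whole small graph, so it stays cheap.
        self._datasets.setdefault(dataset, {})[graph] = set(triples)

    def dataset_triples(self, dataset):
        # The larger graph is the union of the small graphs it contains.
        merged = set()
        for triples in self._datasets.get(dataset, {}).values():
            merged |= triples
        return merged


store = NestedGraphStore()
store.replace_graph("programmes", "prog-1", [("prog-1", "title", "Old title")])
store.replace_graph("programmes", "prog-1", [("prog-1", "title", "New title")])
store.replace_graph("worldcup", "match-1", [("match-1", "venue", "Durban")])
print(store.dataset_triples("programmes"))
```

Because each dataset has its own set of small graphs, an update to one programme never touches the world cup data.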

Posted at 14:46

July 12

W3C QA Blog Semantic Web News: New opportunities for linked data nose-following

For those of you interested in deploying RDF on the Web, I'd like to draw your attention to three new proposed standards from IETF, "Web Linking", "Defining Well-Known URIs", and "Web Host Metadata", that create new follow-your-nose tricks that could be used by semantic web clients to obtain RDF connected to a URI - RDF that presumably defines what the URI 'means' and/or describes the thing that the URI is supposed to refer to.

Most semantic web application developers are probably familiar with three ways to nose-follow from a URI:

  1. For # URIs - for X#F, the document X tells you about <X#F>
  2. When the response to GET X is a 303 - the redirect target tells you about <X>
  3. When the response to GET X is a 200 - the content may tell you about <X>
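As a rough sketch, the three cases above can be written as a small decision function. The function name, the case labels, and the idea of passing the status code and `Location` value in by hand are assumptions made for illustration; a real client would perform the HTTP request itself.

```python
from urllib.parse import urldefrag


def nose_follow(uri, status=None, location=None):
    """Return (case, uri_of_document_to_read) for the three cases listed above."""
    document, fragment = urldefrag(uri)
    if fragment:
        # Case 1: for X#F, the document X tells you about <X#F>.
        return ("hash", document)
    if status == 303 and location:
        # Case 2: the redirect target tells you about <X>.
        return ("see-other", location)
    if status == 200:
        # Case 3: the content itself may tell you about <X>.
        return ("web-page", uri)
    return ("unknown", None)


print(nose_follow("http://example.com/X#F"))
print(nose_follow("http://example.com/X", status=303,
                  location="http://example.com/about/X"))
print(nose_follow("http://example.com/X", status=200))
```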

In case 3, X refers to what I'll call a "web page" (a more technical term is used in the TAG's httpRange-14 resolution). One of the new RFCs extends case 3 to situations where the RDF can't be embedded in the content, either because the content-type doesn't provide a place to put it (e.g. text/plain) or because for administrative reasons the content can't be modified to include it (e.g. a web archive that has to deliver the original bytes faithfully). The others cover this case as well as offering improved performance in case 2.

Web pages as RDF subjects

Before getting into the new nose-following protocols, I'll amplify case 3 above by listing a few applications of RDF in which a web page occurs as a subject. I'll rather imprecisely call such RDF "metadata".

  1. Bibliographic metadata - tools such as Zotero might be interested in obtaining Dublin Core, BIBO, or other citation data for the web page.
  2. Stability metadata - for annotation and archiving purposes it may be useful to know whether the page's content is committed to be stable over time (e.g. this has changing content versus this has unchanging content). See TimBL's Generic Resources note.
  3. Historical and archival metadata - it is useful to have links to other versions of a document - including future versions.

All sorts of other statements can be made about a web page, such as a type (wiki page, blog post, etc.), SKOS concepts, links to comments and reviews, duration of a recording, how to edit, who controls it administratively, etc. Anything you might want to say about a web page can be said in RDF.

Embedded metadata is easy to deploy and to access, and should be used when possible. But while embedded metadata has the advantages of traveling around with the content, a protocol that allows the server responsible for the URI to provide metadata over a separate "channel" has two advantages over embedded metadata: First, the metadata doesn't have to be put into the content; and second, it doesn't have to be parsed out of the content. And it's not either/or: There is no reason not to provide metadata through both channels when possible.

Link: header

The 'Web Linking' proposed standard defines the HTTP Link: header, which provides a way to communicate links rooted at the requested resource. These links can either encode interesting information directly in the HTTP response, or provide a link to a document that packages metadata relevant to the resource.

In the former case, one might have:

Link: <http://xmlns.com/foaf/0.1/Document>;
  rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

meaning that the request URI refers to something of type foaf:Document. In the latter case one might have:

Link: <http://example.com/about/foo.rdf>;
  rel="describedby"; type=application/rdf+xml

meaning that metadata can be found in <http://example.com/about/foo.rdf>, and hinting that the latter resource might have a 'representation' with media type application/rdf+xml.
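To show how a client might read such a header, here is a minimal parsing sketch. It handles a single link value of the shape shown in the two examples (a target in angle brackets followed by `name=value` parameters) and is not a complete implementation of the Web Linking grammar. The function name `parse_link_value` is made up for this example, and the input is the header value without the leading `Link:`.

```python
import re

# Target in angle brackets, then everything that follows it.
LINK_RE = re.compile(r'\s*<([^>]*)>\s*(.*)', re.S)
# ;name=value, where value is either quoted or a bare token.
PARAM_RE = re.compile(r';\s*([A-Za-z0-9*_-]+)\s*=\s*(?:"([^"]*)"|([^;\s]+))')


def parse_link_value(value):
    """Split one link value into (target_uri, {parameter: value})."""
    match = LINK_RE.match(value)
    if match is None:
        raise ValueError("not a link value: %r" % value)
    target, rest = match.groups()
    params = {}
    for name, quoted, bare in PARAM_RE.findall(rest):
        # Keep the first occurrence of each parameter.
        params.setdefault(name.lower(), quoted or bare)
    return target, params


target, params = parse_link_value(
    '<http://example.com/about/foo.rdf>;\n  rel="describedby"; type=application/rdf+xml'
)
if params.get("rel") == "describedby":
    print("metadata document:", target, "hinted type:", params.get("type"))
```

A client that sees `rel="describedby"` would then fetch the target to get the metadata, while a `rel` that is itself a property URI (as in the first example) can be read directly as a statement about the requested resource.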

Host-wide nose-following rules

The motivation for the "well-known URIs" RFC is to collect all "well-known URIs" (analogous to "robots.txt") in a single place, a root-level ".well-known" directory, and create a registry of them to avoid collisions. The most pressing need comes from protocols such as webfinger and OpenID; see Eran Hammer-Lahav's blog post for the whole story.

For linked data, .well-known provides an opportunity for providing metadata for web pages, as well as improving the efficiency of obtaining RDF associated with other "slash URIs", which is currently done using 303 responses.

Ever since the TAG's httpRange-14 decision in 2005, there have been concerns that it takes two round trips to collect RDF associated with a slash URI. While some might question why those complaining aren't using hash URIs, in any case the "well-known URIs" mechanism gives a way to reduce the number of round trips in many cases, eliminating many GET/303 exchanges.

The trick is to obtain, for each host, a generic rule that will transform the URI at that host that you want RDF for into the URI of a document that carries that RDF. This generic rule is stored in a file residing in the .well-known space at a path that is fixed across all hosts. That is: to find RDF for http://example.com/foo, follow these steps:

  1. obtain the host name, "example.com"
  2. form the URI with that host name and path "/.well-known/host-meta", i.e. "http://example.com/.well-known/host-meta" (see here)
  3. if not already cached, fetch the document at that URI
  4. in that document find a rule generically transforming original-URI -> about-URI
  5. apply the rule to "http://example.com/foo" obtaining (say) "http://example.com/about/foo"
  6. find RDF about "http://example.com/foo" in document "http://example.com/about/foo"

The form of the about-URI is chosen by the particular host, e.g. "http://example.com/foo,about" or "http://about.example.com/foo" or whatever works best.

Why is this fewer round trips than using 303? Because you can fetch and cache the generic rule once per site. The first use of the rule still costs an extra round trip, but subsequent URIs for a given site can be nose-followed without any extra web accesses.
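The six steps and the caching argument can be sketched as follows. Two things here are assumptions made only for illustration: the rule is modelled as a plain string with a hypothetical `{path}` placeholder rather than the actual host-meta document format, and the network fetch is replaced by a function passed in by the caller so the sketch runs offline.

```python
from urllib.parse import urlsplit


class HostMetaResolver:
    """Map a URI to its about-URI using one cached rule per host."""

    def __init__(self, fetch_rule):
        self._fetch_rule = fetch_rule   # callable: host-meta URI -> rule string
        self._rules = {}                # cache: host-meta URI -> rule string
        self.fetches = 0                # counts the extra round trips

    def about_uri(self, uri):
        parts = urlsplit(uri)
        # Steps 1-2: build the well-known URI for this host.
        host_meta = "%s://%s/.well-known/host-meta" % (parts.scheme, parts.netloc)
        # Step 3: fetch the rule only if it is not already cached.
        if host_meta not in self._rules:
            self._rules[host_meta] = self._fetch_rule(host_meta)
            self.fetches += 1
        # Steps 4-5: apply the (hypothetical) template to the original URI.
        return self._rules[host_meta].replace("{path}", parts.path)


def fake_fetch(host_meta_uri):
    # Stand-in for an HTTP GET; the template syntax is an assumption.
    return "http://example.com/about{path}"


resolver = HostMetaResolver(fake_fetch)
print(resolver.about_uri("http://example.com/foo"))
print(resolver.about_uri("http://example.com/bar"))
print("rule fetches:", resolver.fetches)
```

Step 6, fetching and parsing the RDF at the about-URI, is left out. The counter shows the point of the paragraph above: the second lookup on the same host needs no extra fetch of the rule.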

A worked example can be found here.

Next steps

As with any new protocol, figuring out exactly how to apply the new proposed standards will require coordination and consensus-building. For example, the choice of the "describedby" link relation and "host-meta" well-known URI needs to be confirmed for linked data, and agreement reached on whether multiple Link: headers are in good taste or poor taste. (Link: and .well-known put interesting content in a peculiarly obscure place and it might be a good idea to limit their use.) Consideration should be given to Larry Masinter's suggestion to use multiple relations reflecting different attitudes the server might have regarding the various metadata sources: For example the server may choose to announce that it wants the Link: metadata to override any embedded metadata, or vice versa. Agreement should be reached on the use of Link: and host-meta with redirects (302 and so on) - personally I think it would be a great thing as you could then use a value-added forwarding service to provide metadata that the target host doesn't or can't provide.

This is not a particularly heavy coordination burden; the design odds-and-ends and implementations are all simple. The impetus might come from inside W3C (e.g. via SWIG) or bottom-up. All we really need to get this going are a bit of community discussion, a server, and a cooperating client, and if the protocols actually fill a need, they will take off.

For past TAG work on this topic, please see TAG issue 62 and the "Uniform Access to Metadata" memo.

Posted at 11:56

Ivan Herman: Experiences of LOD publication

Frank van Harmelen’s tweet drew my attention on a

Posted at 08:39

Kingsley Idehen: Solving Real Problems by Leveraging Linked Data: Unambiguous & Verifiable Identity for HTTP Networks

Problem: Unambiguous Verifiable Network Identity.

How Does Linked Data Address This Problem? It provides critical infrastructure for the WebID Protocol that enables an innovative tweak of SSL/TLS.

What about OpenID? The WebID Protocol embraces and extends OpenID (in an open and positive way) via the WebID + OpenID Hybrid variant of the protocol. The basic effect is that OpenID calls are re-routed to the WebID aspect, which simply removes Username and Password Authentication from the authentication challenge interaction pattern.

WebID Components

  1. X.509 Certificate and Private Key Generator
  2. Structured Profile Document (e.g. a FOAF based Profile) published to an HTTP Network (e.g. World Wide Web) and accessible at an Address (URL)
  3. An Agent Identifier aka. WebID (an HTTP Name Reference re. URI variant) that's the Subject of a Structured Profile Document (actually a Descriptor Resource)
  4. Mechanism for persisting Public Key data from X.509 Certificate to Structured Profile Document and associating it with Subject WebID (e.g. SPARUL or other HTTP based methods)
  5. Mechanism for de-referencing Public Key data associated with a WebID (from its Structured Profile Document) for comparison against Public Key data following successful standard SSL/TLS protocol handshake (e.g. via SPARQL Query).

Demo

Related

Posted at 03:25

July 11

W3C Semantic Web News: Report of the RDF Next Steps Workshop published

The last week of June, participants at the W3C RDF Next Steps Workshop concluded that support for JSON, Turtle, and for "Named Graphs" are top priorities for any future work on RDF. Participants also highlighted the importance of compatibility with existing deployment. Read about these and other topics in the Workshop report. To join the discussion about organizing future work on RDF, please share your thoughts on the Semantic Web Interest Group mailing list (with a copy to the separate RDF Comments list). W3C thanks the National Center for Biomedical Ontology at Stanford, Palo Alto, USA, for hosting the Workshop.

Posted at 08:13

July 09

Ebiquity research group UMBC: Google Open Spot Android app finds parking

Google’s Open Spot Android app lets people leaving parking spots share the information with others searching for parking nearby. Running the app shows you parking spots within 1.5km. New parking spots are assumed to be gone after 20 minutes and are removed from the system.

People who announce open spots gain karma points, while those who report false spots, known as griefers, are on notice:

“We’re watching for behavior that looks like a griefer spoofing parking spots. We have a couple of mechanisms available to make sure someone can’t leave a bunch of fake parking spots. If we see this happening we will take steps to fix it.”

This is a simple example of a context-aware mobile app that can further benefit from also knowing that you are driving, as opposed to riding, in your car and likely to want to find a parking spot, as opposed to doing 70mph on I-95 as it goes through Baltimore. Moreover, context would also inform the app that you are probably leaving a public parking spot, so that it can mark the spot automatically. However, such a feature should be smart enough to avoid being tagged by Google as a griefer and finding out what punishment Google has in store for you.

Posted at 23:02

Talis: Tom Steinberg talks about the Public Sector Transparency Board

Tom Steinberg of mySociety fame joins me on this Talking with Talis podcast to discuss the approach to open and linked data in the context of the UK Government.

We talk about his role over the years; the emergence of data.gov.uk as part of the previous administration’s Making Public Data Public initiative; and the subtle change of emphasis accompanying the new administration’s name change to the Transparency Programme.

Finally we move on to the role of the newly formed Public Sector Transparency Board of which he is a member.

Posted at 11:13

Dan Brickley: Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass’ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”. How much of that could have a UDC coding? I guess I should ask, how would you subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008”?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more than you said I can play with the following: 331.46(735.215.2/.4)"2008"

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)"2008"
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME "2008"

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

TOPIC 1
331     Labour. Employment. Work. Labour economics. Organization of  labour
331.4     Working environment. Workplace design. Occupational safety.  Hygiene at work. Accidents at work
331.46  Accidents at work ==> 614.8

TOPIC 2
614   Prophylaxis. Public health measures. Preventive treatment
614.8    Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069    Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content, i.e. if the content is focused on health and safety regulation and occupational health then the choice of numbers and their order would be different, e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read UDC number 331.46 you do not see only e.g. ‘accidents at work’ but ==> ‘accidents at work < occupational health/safety < labour economics, labour organization < economy’
and when you see UDC number 614.8 it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention’

When you see (735.2….) you do not only see Washington but also United States, North America
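To make the composed structure concrete, here is a small sketch that splits a code of exactly the shape used in this example into its facets and lists the broader classes of the main number. It is not a general UDC parser: the regular expression, the facet names and both function names are assumptions for illustration. It expects straight double quotes around the time facet, and `broader_codes` only handles simple main numbers such as 331.46 (not ones with special auxiliaries such as 614.8.069).

```python
import re

# topic, optional -special auxiliary, optional (place), optional "time"
UDC_RE = re.compile(
    r'^(?P<topic>\d[\d.]*)'
    r'(?:-(?P<special>\d[\d.]*))?'
    r'(?:\((?P<place>[^)]*)\))?'
    r'(?:"(?P<time>[^"]*)")?$'
)


def parse_udc(code):
    """Split a composed code into the facets that are present."""
    match = UDC_RE.match(code)
    if match is None:
        raise ValueError("unsupported UDC shape: %r" % code)
    return {k: v for k, v in match.groupdict().items() if v is not None}


def broader_codes(topic):
    """List a simple main number and its ancestors, broadest first."""
    digits = topic.replace(".", "")
    codes = []
    for length in range(1, len(digits) + 1):
        prefix = digits[:length]
        # UDC punctuates main numbers with a dot after every third digit.
        codes.append(".".join(prefix[i:i + 3] for i in range(0, length, 3)))
    return codes


facets = parse_udc('331.46-053.18(735.215.2/.4)"2008"')
print(facets)
print(broader_codes(facets["topic"]))
```

The list of prefixes is what lets a single number such as 331.46 also be read as a member of 331.4, 331 and the classes above them, as described in the reply above.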

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group, and I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

Posted at 09:38

Ebiquity research group UMBC: USCYBERCOM secret revealed

The secret message embedded in the USCYBERCOM logo

     9ec4c12949a4f31474f299058ce2b22a

is what the md5sum function returns when applied to the string that is USCYBERCOM’s official mission statement. Here’s a demonstration of this fact done on a Mac. On Linux, use the md5sum command instead of md5.

~> echo -n "USCYBERCOM plans, coordinates, integrates, \
synchronizes and conducts activities to: direct the \
operations and defense of specified Department of \
Defense information networks and; prepare to, and when \
directed, conduct full spectrum military cyberspace \
operations in order to enable actions in all domains, \
ensure US/Allied freedom of action in cyberspace and \
deny the same to our adversaries." | md5
9ec4c12949a4f31474f299058ce2b22a
~>

md5sum is a standard Unix command that computes a 128-bit “fingerprint” of a string of any length. It is a well designed hashing function that has the property that it’s very unlikely that any two non-identical strings in the real world will have the same md5sum value. Such functions have many uses in cryptography.

Thanks to Ian Soboroff for spotting the answer on Slashdot and forwarding it.

Someone familiar with md5 would recognize that the secret string has the same length and character mix as an md5 value — 32 hexadecimal characters. Each of the possible hex characters (0123456789abcdef) represents four bits, so 32 of them is a way to represent 128 bits.

We’ll leave it as an exercise for the reader to compute the 128 bit sequence that our secret code corresponds to.
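For readers who would rather let a script do the exercise, here is a minimal Python sketch using the standard `hashlib` module. The helper names `md5_hex` and `hex_to_bits` are made up for this example. `md5_hex` does the same job as the shell pipeline above (paste in the mission statement, with no trailing newline, to compare), and `hex_to_bits` expands each hex character into its four bits.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of text (encoded as UTF-8) as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hex_to_bits(hex_digest):
    """Expand a hex string into bits, four per hex character, keeping leading zeros."""
    width = 4 * len(hex_digest)
    return format(int(hex_digest, 16), "0%db" % width)


secret = "9ec4c12949a4f31474f299058ce2b22a"
bits = hex_to_bits(secret)
print(len(bits))   # number of bits in the sequence
print(bits)
```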

Posted at 01:11

Copyright of the postings is owned by the original blog authors. Contact us.