It's triples all the way down
function getCalaisResult($id, $text) {
  /* OpenCalais processing and user directives, sent as the paramsXML argument */
  $parms = '
    <c:params xmlns:c="http://s.opencalais.com/1/pred/"
              xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <c:processingDirectives
        c:contentType="TEXT/RAW"
        c:outputFormat="XML/RDF"
        c:calculateRelevanceScore="true"
        c:enableMetadataType="SocialTags"
        c:docRDFaccessible="false"
        c:omitOutputtingOriginalText="true"
      ></c:processingDirectives>
      <c:userDirectives
        c:allowDistribution="false"
        c:allowSearch="false"
        c:externalID="' . $id . '"
        c:submitter="http://semsol.com/"
      ></c:userDirectives>
      <c:externalMetadata></c:externalMetadata>
    </c:params>
  ';
  $args = array(
    'licenseID' => $this->a['calais_key'],
    'content' => urlencode($text),
    'paramsXML' => urlencode(trim($parms))
  );
  /* qs() is a helper defined elsewhere; the first character of its result is dropped */
  $qs = substr($this->qs($args), 1);
  $url = 'http://api.opencalais.com/enlighten/rest/';
  return $this->getAPIResult($url, $qs);
}
function getZemantaResult($id, $text) {
  /* Ask Zemanta for RDF/XML with RDF links only (no articles, categories or images) */
  $args = array(
    'method' => 'zemanta.suggest',
    'api_key' => $this->a['zemanta_key'],
    'text' => urlencode($text),
    'format' => 'rdfxml',
    'return_rdf_links' => '1',
    'return_articles' => '0',
    'return_categories' => '0',
    'return_images' => '0',
    'emphasis' => '0',
  );
  $qs = substr($this->qs($args), 1);
  $url = 'http://api.zemanta.com/services/rest/0.0/';
  return $this->getAPIResult($url, $qs);
}
function getAPIResult($url, $qs) {
  /* POST the form-encoded query string and return the raw response body */
  ARC2::inc('Reader');
  $reader = new ARC2_Reader($this->a, $this);
  $reader->setHTTPMethod('POST');
  $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
  $reader->setMessageBody($qs);
  $reader->activate($url);
  $r = '';
  /* Read the response stream chunk by chunk */
  while ($d = $reader->readStream()) {
    $r .= $d;
  }
  $reader->closeStream();
  return $r;
}
SELECT DISTINCT ?id ?obj ?cnf ?name
FROM <' . $g . '> WHERE {
?rec a z:Recognition ;
z:object ?obj ;
z:confidence ?cnf .
?obj z:target ?id .
?id z:targetType <http://s.zemanta.com/targets#rdf> ;
z:title ?name .
FILTER(?cnf >= 0.4)
} ORDER BY ?id
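
For readers following along outside PHP, here is a rough Python sketch of the request that getAPIResult sends: a form-encoded body in a POST with the matching Content-Type header. The helper names build_form_body and build_post_request and the placeholder key are made up for illustration, and the qs() helper used above is not shown in this post, so this is not a translation of it. One assumption to flag: urlencode escapes values itself, so the sketch passes raw values rather than pre-encoded ones.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_form_body(args):
    """Encode a dict of raw (unescaped) values as a form-encoded body."""
    return urlencode(args)


def build_post_request(url, args):
    """Prepare (but do not send) a POST request carrying the encoded body."""
    body = build_form_body(args).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical arguments mirroring the Zemanta call above.
zemanta_args = {
    "method": "zemanta.suggest",
    "api_key": "YOUR_KEY",  # placeholder, not a real key
    "text": "Linked data & triples",
    "format": "rdfxml",
}
request = build_post_request(
    "http://api.zemanta.com/services/rest/0.0/", zemanta_args
)

# Decoding the body again recovers the original values unchanged.
decoded = parse_qs(request.data.decode("ascii"))
```

Nothing is sent over the network here; passing the prepared request to urllib.request.urlopen would perform the call.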

Posted at 09:50
Today, we released version 0.2 of the ontology repair and enrichment (ORE) tool. It helps knowledge engineers improve an OWL ontology through a wizard-like repair process, using state-of-the-art ontology debugging methods. The main feature in version 0.2 is a mode for incrementally detecting inconsistencies in large knowledge bases available as SPARQL endpoints. Using this mode, we have detected inconsistencies and computed justifications in DBpedia Live and OpenCyc. To the best of our knowledge, both knowledge bases were previously too large for justifications to be computed on standard hardware, so their inconsistencies could not be fixed efficiently. A screencast illustrates this process for the case of DBpedia Live. Thanks to Lorenz Bühmann for his work on ORE.
Posted at 21:31
I’ve been at a couple of great Web of Data events in the last ten days or so.

On 13 July, I organised a Linked Data meetup in Edinburgh that I’m pleased to say went very well. Around 25 people showed up to hear interesting talks from Zach Beauvais of Talis (slides) and Paola di Maio of Strathclyde University (slides). There was a good mix of people already experienced with linked data and others who wanted to learn more about it – many of them with specific potential applications in mind.
There’s a fuller write-up at the SLDIG wiki.
Then a couple of days ago I went down to Manchester for the Vision and Media Transmission 6 event “Towards a web of data?” organised by Paul Collins of the White Room.
I presented on “Publishing Linked Data: Getting Started”. The other speakers were Paul Miller of Cloud of Data (and semantic podcaster extraordinaire) and Liz Turner of Iconomical, the creator of the well-known WhereDoesMyMoneyGo. Paul’s slides are here.
I very much enjoyed the chance to be involved and to meet with some of the thriving digital media community in Manchester.
[Credits: Manchester picture by Graham Smith, via Flickr.]
Posted at 14:11
The Talis Linked Data in Libraries event, held at the British Library in London on Wednesday 21st July, was attended by 50 enthusiastic people interested in the topic.
Below you will find presentations from the day.
Introduction Talis and the world of Linked Data – Zach Beauvais, Talis
The data.bnf.fr Project – Romain Wenz, Bibliothèque nationale de France (Presentation not yet available)
Linked Data, RDF, and SPARQL – Rob Styles, Talis
Linked Data in Action – Richard Wallis, Talis
Lightning Talks:
Neil Wilson, The British Library
Sally Chambers, The European Library
Felix Ostrowski, The North Rhine-Westphalian Library Service
Linked Bibliographic Data – Rob Styles, Talis
W3C Library Linked Data Incubator Group – Antoine Isaac, Europeana
An overview of the Talis Platform – Richard Wallis, Talis
Watch this space for videos of some of the sessions.
Posted at 15:53
Last month's Augmented Reality on the Web workshop in Barcelona has sparked a good deal of debate within and around W3C. As the final report shows, the workshop brought together many different companies and organizations working on or with a direct interest in the field of Augmented Reality — but how can W3C help in this area?
One outcome is clear: we need a method for representing data about points of interest and proposals are advancing to achieve this in a new POI Working Group. Quite what data needs to be represented concerning Points of Interest depends on who you ask. For some it's a question of annotating a given point on the Earth's surface where the longitude, latitude and altitude are all key identifiers. For others it's more a question of the point at a given distance and angle from an object that may or may not be static as seen by an observer who may themselves also be moving.
Different communities are involved here: as well as the augmented reality community, the linked data community has a keen interest. There are other facets to the discussion too and this is what will make the POI working group's work interesting!
The workshop also recommended that a new POI WG should go further and consider the wider picture of how AR does, or might, relate to the Web. Privacy is a major concern; device APIs are critical enablers; do CSS and SVG have sufficient power to support AR functions? Even the use of HTTP as a transport mechanism is questioned by some, given the real-time nature of AR.
To join the debate about all this, please subscribe to the Point of Interest mailing list and keep an eye out for calls for review of the charter in the near future.
Posted at 09:26
Well, exactly this happened yesterday: Google bought Metaweb – provider of Freebase. Freebase is an important hub in the linked data cloud, providing 12 million entities with uniform resource identifiers, most of them linked to other semantic web datasets like DBpedia or the New York Times. For example, Google's page on Freebase offers a rich source of machine-readable facts about the company.
What does this mean for the Semantic Web community, which has been working on a smarter web for the last decade?
Well, a lot… First of all, it's good to hear that Google will continue to develop Freebase as a free and open database for everyone, saying “… we would be delighted if other web companies use and contribute to the data.”
Until yesterday, a lot of companies were still not fully convinced that the Semantic Web would play a central role in the further development of the Internet. Now the game has changed. The entity-driven approach to developing web applications has only just begun.
We will keep reporting on and discussing how Google influences the development of the Semantic Web. And if I had one free wish: please add RDF(a) to the Freebase widgets!
Posted at 10:47
Since it completed the Recommendation Track process last year, little has been said or written about POWDER. However, there have been a number of unrelated events recently that I take as evidence of a long term future. As chair of the WG that created it (and an editor of all but one of the documents and general front-person for the whole thing), this makes me happy!
One of my private measures of success for it has always been that one day, someone I don't know and who doesn't know me stands up at a big conference and says "you know this POWDER thing is really cool." That happened at SemTech last month when Matt Fisher presented it in a session called RDF Friday Part 3: Practical RDF - POWDER & Object Design Patterns. The full version of what he was saying is available in an article on his company website Putting POWDER to Work. Matt and I have exchanged e-mails since then but we hadn't had any contact before.
Secondly, my friend and WG member Andrea Perego has been cooking up some code that uses POWDER to generate RDFa in a way that could make it really easy to add all those <link /> elements to documents on the fly under the control of a single, central POWDER file.
Suppose you want to add RDFa to all the pages on your Web site (not a bad thing to do!). One can imagine doing this for Creative Commons licences, DC metadata etc. Something like
<link rel="cc:license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" />
<link rel="dcterms:creator" href="http://philarcher.org/foaf.rdf#me" />
These link elements should probably be included on every page of your site. Sounds like a job for POWDER. Andrea's PHP POWDER Processor (3P) can take a POWDER file and URI as inputs and return those RDFa link elements via a RESTful API - one that could easily be called from within an authoring tool. Full documentation, including an example using the Open Graph Protocol, is available.
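
To make the idea concrete, here is a minimal Python sketch of the kind of lookup such a processor performs: given rules keyed by URI prefix, return the link elements that apply to a page. It is not 3P's API and it does not implement the POWDER data model. The RULES table, the example.org prefix and the function name links_for are all invented for illustration.

```python
from html import escape

# Hypothetical rule table standing in for a central POWDER file:
# each URI prefix maps to the (rel, href) pairs to add to matching pages.
RULES = {
    "http://example.org/": [
        ("cc:license", "http://creativecommons.org/licenses/by-nc-nd/3.0/"),
        ("dcterms:creator", "http://philarcher.org/foaf.rdf#me"),
    ],
}


def links_for(uri, rules=RULES):
    """Return the <link /> elements whose prefix rule matches the URI."""
    elements = []
    for prefix, pairs in rules.items():
        if uri.startswith(prefix):
            for rel, href in pairs:
                elements.append(
                    '<link rel="%s" href="%s" />'
                    % (escape(rel, quote=True), escape(href, quote=True))
                )
    return elements
```

An authoring tool could call links_for("http://example.org/page.html") while rendering a page and paste the result into the head element, so the metadata policy lives in one place.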
Another development is still under wraps at the moment but the signs are very positive that a combination of marketing expertise, industry contacts, business dynamism and, not unimportantly, venture capital is coming together in a POWDER-fuelled start-up.
A quick reminder of the key features of POWDER:
aboutEachPrefix issue);
If you haven't looked at POWDER before, maybe now's a good time to do so.
Posted at 08:28
Google announced today that it has acquired Metaweb, the company behind Freebase — a free, semantic database of “over 12 million people, places, and things in the world.” This is from their announcement on the Official Google blog:
“Over time we’ve improved search by deepening our understanding of queries and web pages. The web isn’t merely words — it’s information about things in the real world, and understanding the relationships between real-world entities can help us deliver relevant information more quickly. … With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”
In their announcement, Google promises to continue to maintain Freebase “as a free and open database for the world” and invites other web companies to use and contribute to it.
Freebase is a system very much in the linked open data spirit, even though RDF is not its native representation. Its content is available as RDF and there are many links that bind it to the LOD cloud. Moreover, Freebase has a very good wiki-like interface allowing people to upload, extend and edit both its schema and data.
Here’s a video on the concepts behind Metaweb which are, of course, also those underlying the Semantic Web. What’s the difference? I’d say a combination of representational details and a centralized (Metaweb) vs. distributed (Semantic Web) approach.
Posted at 19:30
Web search guru Danny Sullivan has a great response to the NYT editorial on regulating search engine algorithms: The New York Times Algorithm and Why It Needs Government Regulation. Here’s how it starts:
“The New York Times is the number one newspaper web site. Analysts reckon it ranks first in reach among US opinion leaders. When the New York Times editorial staff tweaks its supersecret algorithm behind what to cover and exactly how to cover a story — as it does hundreds of times a day — it can break a business that is pushed down in coverage or not covered at all.”
Google published its own response to the Times piece as a Financial Times op-ed and also posted it to the Google public policy blog: regulating what is “best” in search?
“Search engines use algorithms and equations to produce order and organisation online where manual effort cannot. These algorithms embody rules that decide which information is “best”, and how to measure it. Clearly defining which of any product or service is best is subjective. Yet in our view, the notion of “search neutrality” threatens innovation, competition and, fundamentally, your ability as a user to improve how you find information.”
The penultimate paragraph gives what they say is their strongest argument against mandating “search neutrality”.
“But the strongest arguments against rules for “neutral search” is that they would make the ranking of results on each search engine similar, creating a strong disincentive for each company to find new, innovative ways to seek out the best answers on an increasingly complex web. What if a better answer for your search, say, on the World Cup or “jaguar” were to appear on the web tomorrow? Also, what if a new technology were to be developed as powerful as PageRank that transforms the way search engines work? Neutrality forcing standardised results removes the potential for innovation and turns search into a commodity.”
This assumes, of course, that there is real competition among Internet search engines. Microsoft has been putting a lot of research and development into Bing with good results and it’s been gaining market share. Yahoo is doing very interesting things as well. Consumer choice among a handful of competitors would be the best way to ensure that none abuse their customers.
Posted at 05:01
In what may be a first, today’s New York Times has an editorial about an algorithm. No, they haven’t waded into the P=NP issue, but commented on Google’s algorithm for ranking search results and accusations that Google unfairly biases it for its own self interest.
“In the past few months, Google has come under investigation by antitrust regulators in Europe. Rivals have accused Google of placing the Web sites of affiliates like Google Maps or YouTube at the top of Internet searches and relegating competitors to obscurity down the list. In the United States, Google said it expects antitrust regulators to scrutinize its $700 million purchase of the flight information software firm ITA, with which it plans to enter the online travel search market occupied by Expedia, Orbitz, Bing and others.”
This issue will become more important as the companies dominating Web search (Google, Microsoft and Yahoo) continue to increase their importance and also broaden their acquisition of companies offering web services.
The NYT’s position is moderate, recommending:
Google provides an incredibly valuable service, and the government must be careful not to stifle its ability to innovate. Forcing it to publish the algorithm or the method it uses to evaluate it would allow every Web site to game the rules in order to climb up the rankings — destroying its value as a search engine. Requiring each algorithm tweak to be approved by regulators could drastically slow down its improvements. Forbidding Google to favor its own services — such as when it offers a Google Map to queries about addresses — might reduce the value of its searches. With these caveats in mind, if Google is to continue to be the main map to the information highway, it concerns us all that it leads us fairly to where we want to go.
Posted at 18:28
We’ve seen and reported on the rise of Linked Data from concept to practice, and our Open Days have been a great opportunity to explore and explain Linked Data very broadly. The broad discussions have allowed many people to imagine using semantics with their own data, as publishers, developers, information architects etc. across many different industries and applications. But one area in which we are particularly interested is health.
Biomedical science is full of structured and semi-structured information, much of which crosses the organising boundaries we’ve created for it. Every aspect of medical practice, research and policy makes use of (and in most cases creates supplementary) information, and it’s become plain that much of this data is stored, hidden and often inaccessible.
I attended some sessions on biomedical semantics at SemTech last month, and was hugely intrigued by the state of health data world-wide. There are many usable ontologies for medical science, for example, which show the relationships among biological knowledge and clinical use; but much of the data used on the front line is not part of this structure. There seems to be much that could be gained from taking a Linked approach to these data!
Mark Birbeck and Dr Michael Wilkinson, in last month’s Nodalities Magazine, introduced the idea of “A Linked Data Platform for Innovation,” a project of the National Innovation Centre for joining clinicians to linked visualisations through a widget-like, Linked Data platform:
The NIC is committed to using Semantic Web technologies as a way to significantly improve the speed and quality of decision-making in the area of health technology innovations.
So, we’ve decided to join forces with some of these minds and host an event to explain and explore biomedical data. We’ll be at No 76 Portland Place on 19th August from 10AM to 4PM. We’ve invited Dr Nigam Shah from Stanford University to talk to us about the state of global health data, and to suggest several ways in which linking can be done in the very near future. We will also cover the topic of Linked Data (what it is, and how it works), as well as taking a quick look at how it’s being used across the web already. The people behind the NIC’s clinical widget platform will also be there to introduce their project.
Places are free of charge, but limited so make sure to sign up to reserve your place.
We’d very much like to keep the spirit of an Open Day. This event is open for discussion, examination and exploration of using the Semantic Web in life sciences, so come armed with ideas, questions and problems!
Talis will be putting on lunch, and we will also have a ready supply of coffee on hand to help the discussions.
Image: “Science is Knowledge” by Zach Beauvais, is a mashup of “3D Stone Cells” by BlueRidgeKitties, and “Glass Bottles I” by Tim O’Brien via flickr. They are used under CC: BY, NC, SA licenses.
Posted at 18:24
Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.
All this is very exciting: the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)
There are two issues we are still currently trying to solve though:
graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs.
Posted at 14:46
For those of you interested in deploying RDF on the Web, I'd like to draw your attention to three new proposed standards from IETF, "Web Linking", "Defining Well-Known URIs", and "Web Host Metadata", that create new follow-your-nose tricks that could be used by semantic web clients to obtain RDF connected to a URI - RDF that presumably defines what the URI 'means' and/or describes the thing that the URI is supposed to refer to.
Most semantic web application developers are probably familiar with three ways to nose-follow from a URI:
In case 3, X refers to what I'll call a "web page" (a more technical term is used in the TAG's httpRange-14 resolution). One of the new RFCs extends case 3 to situations where the RDF can't be embedded in the content, either because the content-type doesn't provide a place to put it (e.g. text/plain) or because for administrative reasons the content can't be modified to include it (e.g. a web archive that has to deliver the original bytes faithfully). The others cover this case as well as offering improved performance in case 2.
Before getting into the new nose-following protocols, I'll amplify case 3 above by listing a few applications of RDF in which a web page occurs as a subject. I'll rather imprecisely call such RDF "metadata".
All sorts of other statements can be made about a web page, such as a type (wiki page, blog post, etc.), SKOS concepts, links to comments and reviews, duration of a recording, how to edit, who controls it administratively, etc. Anything you might want to say about a web page can be said in RDF.
Embedded metadata is easy to deploy and to access, and should be used when possible. But while embedded metadata has the advantages of traveling around with the content, a protocol that allows the server responsible for the URI to provide metadata over a separate "channel" has two advantages over embedded metadata: First, the metadata doesn't have to be put into the content; and second, it doesn't have to be parsed out of the content. And it's not either/or: There is no reason not to provide metadata through both channels when possible.
The 'Web Linking' proposed standard defines the HTTP Link: header, which provides a way to communicate links rooted at the requested resource. These links can either encode interesting information directly in the HTTP response, or provide a link to a document that packages metadata relevant to the resource.
In the former case, one might have:
Link: <http://xmlns.com/foaf/0.1/Document>;
rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
meaning that the request URI refers to something of type foaf:Document. In the latter case one might have:
Link: <http://example.com/about/foo.rdf>;
rel="describedby"; type=application/rdf+xml
meaning that metadata can be found in <http://example.com/about/foo.rdf>, and hinting that the latter resource might have a 'representation' with media type application/rdf+xml.
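
For client authors, here is a small Python sketch of pulling the target and parameters out of a Link: header value like the two above. It is deliberately limited and the function name parse_link_value is my own. It handles one link-value per header, and it ignores comma-separated lists and backslash escapes inside quoted strings, which a complete parser would need to cover.

```python
import re

# ;name=value pairs, where the value is either quoted or a bare token.
_PARAM_RE = re.compile(r';\s*([^=\s;]+)\s*=\s*(?:"([^"]*)"|([^;\s]+))')


def parse_link_value(value):
    """Parse a single Link header value into (target, params)."""
    match = re.match(r'\s*<([^>]*)>(.*)', value, re.DOTALL)
    if match is None:
        raise ValueError("not a link-value: %r" % value)
    target, rest = match.groups()
    params = {}
    for name, quoted, bare in _PARAM_RE.findall(rest):
        # Exactly one of the two alternatives matched; take whichever it was.
        params[name.lower()] = quoted if quoted else bare
    return target, params


typed = parse_link_value(
    '<http://xmlns.com/foaf/0.1/Document>;\n'
    ' rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"'
)
described = parse_link_value(
    '<http://example.com/about/foo.rdf>;'
    ' rel="describedby"; type=application/rdf+xml'
)
```

A client would then branch on params["rel"]: treat a full-URI relation as a direct statement about the requested resource, and fetch the target when the relation is describedby.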
The motivation for the "well-known URIs" RFC is to collect all "well-known URIs" (analogous to "robots.txt") in a single place, a root-level ".well-known" directory, and create a registry of them to avoid collisions. The most pressing need comes from protocols such as webfinger and OpenID; see Eran Hammer-Lahav's blog post for the whole story.
For linked data, .well-known provides an opportunity for providing metadata for web pages, as well as improving the efficiency of obtaining RDF associated with other "slash URIs", which is currently done using 303 responses.
Ever since the TAG's httpRange-14 decision in 2005, there have been concerns that it takes two round trips to collect RDF associated with a slash URI. While some might question why those complaining aren't using hash URIs, in any case the "well-known URIs" mechanism gives a way to reduce the number of round trips in many cases, eliminating many GET/303 exchanges.
The trick is to obtain, for each host, a generic rule that will transform the URI at that host that you want RDF for into the URI of a document that carries that RDF. This generic rule is stored in a file residing in the .well-known space at a path that is fixed across all hosts. That is: to find RDF for http://example.com/foo, follow these steps:
The form of the about-URI is chosen by the particular host, e.g. "http://example.com/foo,about" or "http://about.example.com/foo" or whatever works best.
Why is this fewer round trips than using 303? Because you can fetch and cache the generic rule once per site. The first use of the rule still costs an extra round trip, but subsequent URIs for a given site can be nose-followed without any extra web accesses.
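
The caching argument can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the rule is modelled as a template string with a {uri} placeholder, the class AboutUriResolver is invented, and fake_fetch_rule stands in for actually reading the rule from the host's .well-known file. The sketch only shows that the rule is fetched once per host and then reused.

```python
from urllib.parse import urlsplit


class AboutUriResolver:
    """Map a URI to its about-URI using one cached rule per host."""

    def __init__(self, fetch_rule):
        # fetch_rule(host) -> template string; called at most once per host.
        self._fetch_rule = fetch_rule
        self._rules = {}
        self.fetch_count = 0

    def about_uri(self, uri):
        host = urlsplit(uri).netloc
        if host not in self._rules:
            # The only extra round trip: first contact with this host.
            self._rules[host] = self._fetch_rule(host)
            self.fetch_count += 1
        return self._rules[host].replace("{uri}", uri)


def fake_fetch_rule(host):
    """Stand-in for reading the generic rule from the host's .well-known file."""
    return "{uri},about"


resolver = AboutUriResolver(fake_fetch_rule)
first = resolver.about_uri("http://example.com/foo")
second = resolver.about_uri("http://example.com/bar")
```

After both calls the resolver has fetched the rule only once, which is the saving over issuing a GET and receiving a 303 for every URI.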
A worked example can be found here.
As with any new protocol, figuring out exactly how to apply the new proposed standards will require coordination and consensus-building. For example, the choice of the "describedby" link relation and "host-meta" well-known URI needs to be confirmed for linked data, and agreement reached on whether multiple Link: headers are in good taste or poor taste. (Link: and .well-known put interesting content in a peculiarly obscure place and it might be a good idea to limit their use.) Consideration should be given to Larry Masinter's suggestion to use multiple relations reflecting different attitudes the server might have regarding the various metadata sources: For example the server may choose to announce that it wants the Link: metadata to override any embedded metadata, or vice versa. Agreement should be reached on the use of Link: and host-meta with redirects (302 and so on). Personally I think it would be a great thing, as you could then use a value-added forwarding service to provide metadata that the target host doesn't or can't provide.
This is not a particularly heavy coordination burden; the design odds-and-ends and implementations are all simple. The impetus might come from inside W3C (e.g. via SWIG) or bottom-up. All we really need to get this going are a bit of community discussion, a server, and a cooperating client, and if the protocols actually fill a need, they will take off.
For past TAG work on this topic, please see TAG issue 62 and the "Uniform Access to Metadata" memo.
Posted at 11:56
Frank van Harmelen’s tweet drew my attention on a
Posted at 08:39
How Does Linked Data Address This Problem? It provides critical infrastructure for the WebID Protocol that enables an innovative tweak of SSL/TLS.
What about OpenID? The WebID Protocol embraces and extends OpenID (in an open and positive way) via the WebID + OpenID Hybrid variant of the protocol. The basic effect is that OpenID calls are re-routed to the WebID aspect, which simply removes username and password authentication from the authentication challenge interaction pattern.
Posted at 03:25
Google’s Open Spot Android app lets people leaving parking spots share the information with others searching for parking nearby. Running the app shows you parking spots within 1.5 km. New parking spots are assumed to be gone after 20 minutes and are removed from the system.
People who announce open spots gain karma points, while those who report false spots, known as griefers, are on notice:
“We’re watching for behavior that looks like a griefer spoofing parking spots. We have a couple of mechanisms available to make sure someone can’t leave a bunch of fake parking spots. If we see this happening we will take steps to fix it.”
This is a simple example of a context-aware mobile app that can further benefit from also knowing that you are driving, as opposed to riding, in your car and likely to want to find a parking spot, as opposed to doing 70mph on I-95 as it goes through Baltimore. Moreover, context would also inform the app that you are probably leaving a public parking spot and mark it automatically. However, such a feature should be smart enough to avoid being tagged by Google as a griefer and finding out what punishment Google has in store for you.
Posted at 23:02
Tom Steinberg of mySociety fame joins me on this Talking with Talis podcast to discuss the approach to open and linked data in the context of the UK Government.
We talk about his role over the years; the emergence of data.gov.uk as part of the previous administration’s Making Public Data Public initiative; and the subtle change of emphasis accompanying the new administration’s name change to the Transparency Programme.
Finally we move on to the role of the newly formed Public Sector Transparency Board of which he is a member.
Posted at 11:13
Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.
This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass‘ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.
One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):
“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”
There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?
The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.
Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.
I asked,
So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”. How much of that could have a UDC coding? I guess I should ask, how would you subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008”?
Aida’s reply (posted with permission):
You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?
Not knowing more than you said I can play with the following: 331.46(735.215.2/.4)”2008”
Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)”2008”
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.
The UDC structures composed here are:
TIME “2008”
PLACE (735.215.2/.4) Counties in the Washington metropolitan area
TOPIC 1
331 Labour. Employment. Work. Labour economics. Organization of labour
331.4 Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work
331.46 Accidents at work ==> 614.8
TOPIC 2
614 Prophylaxis. Public health measures. Preventive treatment
614.8 Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069 Fatal accidents
NB – classification provides a bit more context and is more precise than words when it comes to presenting content i.e. if the content is focused on health and safety regulation and occupational health then the choice of numbers and their order would be different e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.
So when you read UDC number 331.46 you do not see only e.g. ‘accidents at work’ but ==> ’accidents at work < occupational health/safety < labour economics, labour organization < economy
and when you see UDC number 614.8 it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention
When you see (735.2….) you do not only see Washington but also United States, North America
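As a rough illustration of how the composed numbers in Aida's reply break apart, here is a minimal Python sketch. It only handles the simple shape used in this example (a main number, an optional hyphen auxiliary, an optional place in parentheses, an optional time in quotes) and assumes straight double quotes around the time facet. The names `split_udc` and `broader` are made up for this sketch and do not belong to any UDC tooling; real UDC syntax (colon relations, special auxiliaries such as the .069 in 614.8.069) is much richer than this.

```python
import re

# Hypothetical pattern covering only the shape seen in this post:
# main number, optional -auxiliary, optional (place), optional "time".
UDC_PATTERN = re.compile(
    r'^(?P<topic>\d[\d.]*)'
    r'(?:-(?P<aux>\d[\d.]*))?'
    r'(?:\((?P<place>[^)]*)\))?'
    r'(?:"(?P<time>[^"]*)")?$'
)

def split_udc(code):
    """Split a simple composed UDC number into its facets."""
    match = UDC_PATTERN.match(code)
    if match is None:
        raise ValueError("not a shape this sketch understands: " + code)
    return {k: v for k, v in match.groupdict().items() if v is not None}

def broader(number):
    """List a simple main number and its broader classes, most specific first.

    Relies on UDC's decimal notation: dropping trailing digits moves up
    the hierarchy, and the dots are only separators after every third digit.
    """
    digits = number.replace(".", "")
    chain = []
    for n in range(len(digits), 0, -1):
        prefix = digits[:n]
        groups = [prefix[i:i + 3] for i in range(0, len(prefix), 3)]
        chain.append(".".join(groups))
    return chain

facets = split_udc('331.46-053.18(735.215.2/.4)"2008"')
print(facets)
# {'topic': '331.46', 'aux': '053.18', 'place': '735.215.2/.4', 'time': '2008'}
print(broader(facets["topic"]))
# ['331.46', '331.4', '331', '33', '3']
```

The second function is the mechanical version of the point that 331.46 carries its context with it: every broader class can be read off the notation itself, and the labels for each step could then be looked up in the scheme.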
So why is this interesting? A couple of reasons…
1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.
2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …
Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group; I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.
Posted at 09:38
The secret message embedded in the USCYBERCOM logo
9ec4c12949a4f31474f299058ce2b22a
is what the md5sum function returns when applied to the string that is USCYBERCOM’s official mission statement. Here’s a demonstration of this fact done on a Mac. On Linux, use the md5sum command instead of md5.
~> echo -n "USCYBERCOM plans, coordinates, integrates, \
synchronizes and conducts activities to: direct the \
operations and defense of specified Department of \
Defense information networks and; prepare to, and when \
directed, conduct full spectrum military cyberspace \
operations in order to enable actions in all domains, \
ensure US/Allied \
freedom of action in cyberspace and \
deny the same to our adversaries." | md5
9ec4c12949a4f31474f299058ce2b22a
~>
md5sum is a standard Unix command that computes a 128-bit “fingerprint” of a string of any length. It is a well-designed hashing function with the property that it’s very unlikely that any two non-identical strings in the real world will have the same md5sum value. Such functions have many uses in cryptography.
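For readers without md5 or md5sum at hand, the same kind of fingerprint can be computed with Python's standard hashlib module. This is only a sketch: the helper name `md5_hex` is made up here, and the input is the well-known test string "abc" rather than the mission statement, whose exact bytes (spacing, punctuation) would have to match the original for the digest to agree.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    # Hash the UTF-8 bytes, with no trailing newline (like `echo -n`).
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(md5_hex("abc"))   # 900150983cd24fb0d6963f7d28e17f72
print(md5_hex("abd"))   # a one-character change gives an unrelated digest
```

Whatever the length of the input, the result is always 32 hex characters.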
Thanks to Ian Soboroff for spotting the answer on Slashdot and forwarding it.
Someone familiar with md5 would recognize that the secret string has the same length and character mix as an md5 value — 32 hexadecimal characters. Each of the possible hex characters (0123456789abcdef) represents four bits, so 32 of them is a way to represent 128 bits.
We’ll leave it as an exercise for the reader to compute the 128 bit sequence that our secret code corresponds to.
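If you would rather let a machine do the exercise, here is a minimal Python sketch of the hex-to-bits expansion described above. The function name `hex_to_bits` is just an illustrative choice; each hex character is expanded to its four-bit binary form.

```python
def hex_to_bits(hex_string):
    """Expand each hexadecimal character into its 4-bit binary form."""
    return "".join(format(int(ch, 16), "04b") for ch in hex_string)

secret = "9ec4c12949a4f31474f299058ce2b22a"
bits = hex_to_bits(secret)

print(len(secret))  # 32 hex characters
print(len(bits))    # 128 bits
print(bits[:8])     # 10011110  (9 -> 1001, e -> 1110)
```

Printing `bits` in full gives the whole 128-bit sequence.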
Posted at 01:11