Planet RDF

It's triples all the way down

May 26

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom found.

Different reasons might exist for why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to code it into the application logic.

IaaS-level. Especially in this category, file storage for app development, one would expect wide support for built-in encryption; that is not what we find.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and concerning Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft SkyDrive seem not to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like do encryption, such as eCryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.
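
(Not part of the original post: as a concrete illustration of what back-end encryption via S3 looks like from application code, here is a minimal Python sketch using boto3 and S3 server-side encryption; the bucket and key names are placeholders, and this shows the general mechanism rather than anything EMR-specific.)

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Ask S3 to encrypt the object at rest (server-side encryption with an S3-managed key).
s3.put_object(
    Bucket="example-analytics-bucket",   # placeholder bucket name
    Key="input/events.json",             # placeholder object key
    Body=b'{"event": "login"}',
    ServerSideEncryption="AES256",
)

# The encryption status is reported back in the object metadata.
head = s3.head_object(Bucket="example-analytics-bucket", Key="input/events.json")
print(head.get("ServerSideEncryption"))  # expected: "AES256"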


Posted at 07:06

May 18

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
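
  • (Not from the original post.) If you are not using ARC2, a rough Python equivalent with rdflib looks like the sketch below; the z: namespace URI and the assumption that the raw RDF/XML response is available as a string are mine:
    from rdflib import Graph, Namespace

    Z = Namespace("http://s.zemanta.com/ns#")  # assumed URI behind the z: prefix

    def extract_entities(rdfxml):
        """Run the query above over Zemanta's RDF/XML response string."""
        g = Graph()
        g.parse(data=rdfxml, format="xml")
        q = """
          SELECT DISTINCT ?id ?obj ?cnf ?name WHERE {
            ?rec a z:Recognition ;
                 z:object ?obj ;
                 z:confidence ?cnf .
            ?obj z:target ?id .
            ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
                z:title ?name .
            FILTER(?cnf >= 0.4)
          } ORDER BY ?id
        """
        # Return (entity URI, label, confidence) tuples above the threshold.
        return [(str(row.id), str(row.name), float(row.cnf))
                for row in g.query(q, initNs={"z": Z})]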
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision ), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Posted at 22:06

May 16

Benjamin Nowack: Contextual configuration - Semantic Web development for visually minded webmasters

Let's face it, building semantic web sites and apps is still far from easy. And to some extent, this is due to the configuration overhead. The RDF stack is built around declarative languages (for simplified integration at various levels), and as a consequence, configuration directives often end up in some form of declarative format, too. While fleshing out an RDF-powered website, you have to declare a ton of things. From namespace abbreviations to data sources and API endpoints, from vocabularies to identifier mappings, from queries to object templates, and what have you.

Sadly, many of these configurations are needed to style the user interface, and because of RDF's open world context, designers have to know much more about the data model and possible variations than usually necessary. Or webmasters have to deal with design work. Not ideal either. If we want to bring RDF to mainstream web developers, we have to simplify the creation of user-optimized apps. The value proposition of semantics in the context of information overload is pretty clear, and some form of data integration is becoming mandatory for any modern website. But the entry barrier caused by large and complicated configuration files (Fresnel anyone?) is still too high. How can we get from our powerful, largely generic systems to end-user-optimized apps? Or the other way round: How can we support frontend-oriented web development with our flexible tools and freely mashable data sets? (Let me quickly mention Drupal here, which is doing a great job at near-seamlessly integrating RDF. OK, back to the post.)

Enter RDF widgets. Widgets have obvious backend-related benefits like accessing, combining and re-purposing information from remote sources within a manageable code sandbox. But they can also greatly support frontend developers. They simplify page layouting and incremental site building with instant visual feedback (add a widget, test, add another one, re-arrange, etc.). And, more importantly in the RDF case, they can offer a way to iteratively configure a system with very little technical overhead. Configuration options could not only be scoped to the widget at hand, but also to the context where the widget is currently viewed. Let's say you are building an RDF browser and need resource templates for all kinds of items. With contextual configuration, you could simply browse the site and at any position in the ontology or navigation hierarchy, you would just open a configuration dialog and define a custom template, if needed. Such an approach could enable systems that worked out of the box (raw, but usable) and which could then be continually optimized, possibly even by site users.

A lot of "could" and "would" in the paragraphs above, and the idea may sound quite abstract without actually seeing it. To illustrate the point I'm trying to make I've prepared a short video (embedded below). It uses Semantic CrunchBase and Paggr Prospect (our new faceted browser builder) as an example use case for in-context configuration.

And if you are interested in using one of our solutions for your own projects, please get in touch!



Paggr Prospect (part 1)


Paggr Prospect (part 2)

Posted at 23:06

Benjamin Nowack: 2011 Resolutions and Decisions

All right, this post could easily have become another rant about the ever-growing complexity of RDF specifications, but I'll turn it into a big shout-out to the Semantic Web community instead. After announcing the end of investing further time into ARC's open-source branch, I received so many nice tweets and mails that I was reminded of why I started the project in the first place: The positive vibe in the community, and the shared vision. Thank you very much everybody for the friendly reactions, I'm definitely very moved.

Some explanations: I still share the vision of machine-readable, integration-ready web content, but I have to face the fact that the current approach is getting too expensive for web agencies like mine. Luckily, I could spot a few areas where customer demands meet the cost-efficient implementation of certain spec subsets. (Those don't include comprehensive RDF infrastructure and free services here, though. At least not yet, and I just won't make further bets). The good news: I will continue working with semantic web technologies, and I'm personally very happy to switch focus from rather frustrating spec chasing to customer-oriented solutions and products with defined purposes . The downside: I have to discontinue a couple of projects and services in order to concentrate my energy and reduce (opportunity) costs. These are:
  • The ARC website, mailing list, and other forms of free support. The code and documentation get a new home on GitHub , though. The user community is already thinking about setting up a mailing list on their own. Development of ARC is going to continue internally, based on client projects (it's not dying).
  • Trice as an open-source project (lesson learned from ARC)
  • Semantic CrunchBase. I had a number of users but no paying ones. It was also one of those projects that happily burn your marketing budget while having only negative effects on the company's image, because the funds are too small to provide a reliable service (similar to the flaky DBPedia SPARQL service which makes the underlying RDF store look like a crappy product although it is absolutely not).
  • Knowee, Smesher and similar half-implemented and unfunded ideas.
Looking forward to a more simplified and streamlined 2011. Lots of success to all of you, and thanks again for the nice mails!

Posted at 07:07

May 09

Leigh Dodds: That thing we call “open”

I’ve been involved in a few conversations recently about what “open” or “being open” means in different situations.

As I’ve noted previously, when people say “open” they often mean very different things. And while there may be clear definitions of “open”, people don’t often use the terms correctly. And some phrases like “open API” are still, well, open to interpretation.

In this post I’m going to summarise some of the ways in which I tend to think about making something “open”.

Let me know if I’m missing something so I can plug gaps in my understanding.

Openness of a “thing”

Digital objects: books, documents, images, music, software and datasets can all be open.

Making things open in this sense is the most well documented, but still the most consistently misunderstood. There are clear definitions for open content and data, open source, etc. Open in these contexts provides various freedoms to use, remix, share, etc.

People often confuse something being visible or available to them with it being open, but that’s not the same thing at all. Being able to see or read something doesn’t give you any legal permissions at all.

It’s worth noting that the definitions of open “things” in different communities are often overlapping. For example, the Creative Commons licences allow works to be licensed in ways that enable a wide variety of legal reuses. But the Open Definition only recognises a subset of those as being open, rather than shared.

Putting an open licence on something also doesn’t necessarily grant you the full freedom to reuse that thing. For example I could open source some machine learning software but it might only be practically reusable if you can train it on some data that I’ve chosen not to share.

Or I might use an open licence like the Open Government Licence that allows me to put an open licence on something whilst ignoring the existence of any third-party rights. No need to do my homework. Reuser beware.

Openness of a process

Processes can be open. It might be better to think about transparency (e.g. of how the process is running) or the ability to participate in a process in this context.

Anything that changes and evolves over time will have a process by which those changes are identified, agreed, prioritised and applied. We sometimes call that governance. The definition of an open standard includes defining both the openness of the standard (the thing) as well as the process.

Stewardship of a software project, a dataset, or a standard is also an example of where it might be useful for a process to be open. Questions we can ask of open processes are things like:

  • Can I contribute to the main codebase of a software package, rather than just fork it?
  • Can I get involved in the decision making around how a piece of software or standard evolves?
  • Can I directly fix errors in a dataset?
  • Can I see what decisions have been, or are being made that relate to how something is evolving?

When we’re talking about open data or open source, often we’re really talking about openness of the “thing”. But when we’re making things open to make them better, I think we’re often talking about being open to contributions and participation. Which needs something more than a licence on a thing.

There’s probably a broader category of openness here which relates to how open a process is socially. Words like inclusivity and diversity spring to mind.

Your standards process isn’t really open to all if all of your meetings are held face to face in Hawaii.

Openness of a product, system or platform

Products, platforms and systems can be open too. Here we can think of openness as relating to the degree to which the system

  • is built around open standards and open data (made from open things)
  • is operated using open processes
  • is available for wider access and use

We can explore this by asking questions like:

  • Is it designed to run on open infrastructure or is it tied to particular cloud infrastructure or hardware?
  • Are the interfaces to the system built around open standards?
  • Can I get access to an API? Or is it invite only?
  • How do the terms of service shape the acceptable uses of the system?
  • Can I use its outputs, e.g. the data returned by a platform or an API, under an open licence?
  • Can we observe how well the system or platform is performing, or measure its impacts in different ways (e.g. socially, economically, environmentally)?

Openness of an ecosystem

Ecosystems can be open too. In one sense an open ecosystem is “all of the above”. But there are properties of an ecosystem that might themselves indicate aspects of openness:

  • Is there a choice in providers, or is there a monopoly provider of services or data?
  • How easy is it for new organisations to engage with the ecosystem, e.g. to provide competing or new services?
  • Can we measure the impacts and operations of the ecosystem?

When we’re talking about openness of an ecosystem we’re usually talking about markets and sectors and regulation and governance.

Applying this in practice

So when thinking about whether something is “open”, the first thing I tend to do is consider which of the above categories apply. In some cases it’s actually several.

This is evident in my attempt to define “open API”.

For example we’re doing some work @ODIHQ to explore the concept of a digital twin. According to the Gemini Principles a digital twin should be open. Here we can think of an individual digital twin as an object (a piece of software or a model), or a process (e.g. as an open source project), or an operational system or platform, depending on how it’s made available.

We’re also looking at cities. Cities can be open in the sense of the openness of their processes of governance and decision making. They might also be considered as platforms for sharing data and connecting software. Or as ecosystems of the same.

Posted at 20:05

Sebastian Trueg: Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and cannot be found in the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quad store, which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that, but this view on things is sufficient for our purposes): public graphs are readable and writable by anyone who has permission to read or write in general, while private graphs are readable and writable only by administrators and by those who have been granted named-graph permissions. The latter case is what interests us today.

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

Virtuoso Sparql Endpoint

Sparql Result

This graph is now public and can be queried by anyone. Since we want to make it private, we quickly need to switch to a SQL session, as this part is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

Sparql Query
Sparql Query Result
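
The same check can also be scripted. Here is a minimal Python sketch (not from the original post) that queries the Virtuoso SPARQL endpoint anonymously; the endpoint URL assumes Virtuoso's default web port of 8890:

import requests

resp = requests.get(
    "http://localhost:8890/sparql",  # assumed default Virtuoso SPARQL endpoint
    params={
        "query": "SELECT * FROM <urn:trueg:demo> WHERE { ?s ?p ?o }",
        "format": "application/sparql-results+json",
    },
)
# While the graph is private, an anonymous caller gets no bindings back
# (or an access error, depending on how SPARQL read permissions are set up).
print(resp.json()["results"]["bindings"])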

But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <http://www.linkedin.com/in/trueg> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for agent http://www.linkedin.com/in/trueg. The only tricky part is the scope. Virtuoso has the concept of ACL scopes which group rules by their resource type. In this case the scope is private graphs; another typical scope would be DAV resources.

Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.

Finally we log in using my LinkedIn identity and are granted read access to the graph:

SPARQL Endpoint Login

We see all the original triples in the private graph. And as before with DAV resources no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc.. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
};

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .

I will explain these technical concepts in more detail in another article.

Posted at 10:06

Sebastian Trueg: Sharing Files With Whomever Is Simple

Dropbox, Google Drive, OneDrive, Box.com – they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. On first glance this might be easy enough but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?

I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.

Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:

Virtuoso DAV Browser

This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.

We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.

For this small demo I will share a file with my LinkedIn identity http://www.linkedin.com/in/trueg. (Virtuoso/VAL identifies people via URIs; thus, it has schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)

Virtuoso Share File

Now when I logout and try to access the file in question I am presented with the authentication dialog from VAL:

VAL Authentication Dialog

This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn which will result in an OAuth handshake followed by the granted read access to the file:

LinkedIn OAuth Handshake

 

Access to file granted

It is that simple. Of course these identifiers can also be used in groups, allowing you to share files and folders with a set of people instead of just one individual.

Next up: Sharing Named Graphs via VAL.

Posted at 10:06

Sebastian Trueg: Digitally Sign Emails With Your X.509 Certificate in Evolution

Digitally signing emails is always a good idea. People can verify that you actually sent the mail and they can encrypt emails in return. A while ago Kingsley showed how to sign emails in Thunderbird. I will now follow up with a short post on how to do the same in Evolution.

The process begins with actually getting an X.509 certificate including an embedded WebID. There are a few services out there that can help with this, most notably OpenLink’s own YouID and ODS. The former allows you to create a new certificate based on existing social service accounts. The latter requires you to create an ODS account and then create a new certificate via Profile edit -> Security -> Certificate Generator. In any case make sure to use the same email address for the certificate that you will be using for email sending.

The certificate will actually be created by the web browser, making sure that the private key is safe.

If you are a Google Chrome user you can skip the next step since Evolution shares its key storage with Chrome (and several other applications). If you are a user of Firefox you need to perform one extra step: go to the Firefox preferences, into the advanced section, click the “Certificates” button, choose the previously created certificate, and export it to a .p12 file.

Back in Evolution’s settings you can now import this file:

To actually sign emails with your shiny new certificate stay in the Evolution settings, choose to edit the Mail Account in question, select the certificate in the Secure MIME (S/MIME) section and check “Digitally sign outgoing messages (by default)“:

The nice thing about Evolution here is that in contrast to Thunderbird there is no need to manually import the root certificate which was used to sign your certificate (in our case the one from OpenLink). Evolution will simply ask you to trust that certificate the first time you try to send a signed email:

That’s it. Email signing in Evolution is easy.

Posted at 10:06

Davide Palmisano: SameAs4J: little drops of water make the mighty ocean

A few days ago Milan Stankovich contacted the Sindice crew, informing us that he had written a simple Java library to interact with the public Sindice HTTP APIs. We always appreciate community efforts of this kind that collaboratively make Sindice a better place on the Web. Agreeing with Milan, we decided to put some effort into his initial work to make the library the official open source tool for Java programmers.
That reminded me that, a few months ago, I did for sameas.org the same thing Milan did for us. But (ashamedly) I never informed those guys about what I did.
Sameas.org is a great and extremely useful tool on the Web that makes it concretely possible to interlink different Linked Data clouds. Simple to use (both for humans via HTML and for machines via a simple HTTP/JSON API) and extremely responsive, it allows you to get all the owl:sameAs objects for a given URI. And, moreover, it’s based on Sindice.com.
Do you want to know the identifier of http://dbpedia.org/resource/Rome in Freebase or Yago? Just ask Sameas.org.
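For the programmatically inclined, the call that SameAs4j wraps is roughly the following (a sketch in Python rather than Java; the exact endpoint path and the shape of the JSON response are my assumptions from memory, so check the sameas.org documentation):

import requests

resp = requests.get(
    "http://sameas.org/json",                            # assumed JSON endpoint
    params={"uri": "http://dbpedia.org/resource/Rome"},  # URI to look up
)
for bundle in resp.json():
    # Each bundle groups a URI with its known owl:sameAs duplicates.
    for duplicate in bundle.get("duplicates", []):
        print(duplicate)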

So, after some months I just refined a couple of things, added some javadocs, set up a Maven repository and made SameAs4j publicly available (MIT licensed) to everyone on Google Code.
It’s a simple but reliable tiny set of Java classes that allows you to interact with sameas.org programmatically in your Java Semantic Web applications.

Back to the beginning: every piece of open source software is like a little drop of water that helps make the mighty ocean, so please submit any issues or patches if you are interested.

Posted at 10:06

Davide Palmisano: RWW 2009 Top 10 Semantic Web products: one year later…


Just a few days ago the popular ReadWriteWeb published a list of the 2009 Top Ten Semantic Web products, as they did one year ago with the 2008 Top Ten.

These two milestones are a good opportunity to take stock, or just to do a quick overview of what’s changed in the “Web of Data” only one year later.

The 2008 Top Ten featured the following applications, listed in the same ReadWriteWeb order and enriched with some personal opinions.

Yahoo Search Monkey

It’s great. Search Monkey represents the first of a new generation of search engines, due to its capability to be fully customized by third-party developers. Recently, some breaking news woke up the “sem webbers” of the whole planet: Yahoo started to show structured data exposed with RDFa on the search results page. That news bounced all over the Web, and those interested in SEO started to appreciate Semantic Web technologies for their business. But, unfortunately, at the moment I’m writing, RDFa is no longer shown in search results due to a layout update that broke this functionality. Even if there are rumors of an imminent fix, the main problem is the robustness and reliability of that kind of service: investors need to be properly assured of the effectiveness of their investments.

Powerset

This neat application probably became really popular when it was acquired by Microsoft. It allows you to make simple natural-language queries like “film where Kevin Spacey acted” and, at first glance, the results seem much better than those of traditional search engines. Honestly, I don’t really know what technologies they are using to do this magic. But it would be nice to compare their results with a hypothetical service that translates such human text queries into a set of SPARQL queries over DBpedia. Anyone interested in doing that? I’ll be more than happy to be engaged in a project like that.
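
(Not part of the original post: for the curious, such a translation would boil down to something like the following DBpedia query, sketched here in Python with the requests library; the dbo:starring property and dbo:Film class come from the DBpedia ontology, and the endpoint is the public one.)

import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?film WHERE {
  ?film a dbo:Film ;
        dbo:starring <http://dbpedia.org/resource/Kevin_Spacey> .
}
"""
resp = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
# Print the URI of each film Kevin Spacey acted in.
for binding in resp.json()["results"]["bindings"]:
    print(binding["film"]["value"])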

Open Calais

With a large and massive branding operation these guys built the image of this service as if it were the only one fitting everyone’s needs when dealing with semantic enrichment of unstructured free text. Even if this is partly true (why not mention the Apache UIMA OpenCalais annotator?), there are a lot of other interesting services that are, in certain respects, more intriguing than the Reuters one. Don’t believe me? Give AlchemyAPI a try.

Dapper

I have to admit my ignorance here. I had never heard about it, but it looks very, very interesting. Certainly this service, which mainly offers some sort of semantic advertisement, is more than promising. I’ll keep an eye on it.

Hakia

Down at the moment I’m writing. 😦

Tripit

Many friends of mine are using it and this could be enough to give it popularity. Again, I don’t know if they are using some of the W3C Semantic Web technologies to model their data. RDF or not, this is a neat example of a semantic web application with good potential: is this enough for you?

BooRah

Another case of personal ignorance. This magic is, mainly, a restaurant review site. BooRah uses semantic analysis and natural language processing to aggregate reviews from food blogs. Because of this, BooRah can recognize praise and criticism in these reviews and then rates restaurants accordingly. One criticism? The underlying data are perhaps not that rich. It seems impossible to me that searching for “Pizza in Italy” returns nothing.

Blue Organizer (or GetGlue?)

It’s no secret that I consider Glue one of the most innovative and intriguing things on the Web. And when it appeared in the ReadWriteWeb Top 10 Semantic Web applications it was far from what it is now. Just one year later, GetGlue (Blue Organizer seems to be the former name) appears as a growing and lively community of people who have realized how important it is to weave the Web with the aid of a tool that acts as a content cross-recommender. Moreover, GetGlue provides a neat set of Web APIs that I’m using widely within the NoTube project.

Zemanta

A clear idea, powerful branding and a well-designed set of services accessible via Web APIs make Zemanta one of the most successful products on the stage. Do I have to say anything more? If you like Zemanta, I suggest you also keep an eye on Loomp, a nice tool presented at the European Semantic Technology Conference 2009.

UpTake.com

Mainly, a semantic search engine over a huge database containing more than 400,000 hotels in the US. Where’s the semantics here? UpTake.com crawls and semantically extracts the information implicitly hidden in those records. A good example of how innovative technologies can be applied to well-known application domains such as hotel search.

One year later…

Indubitably, 2009 has been ruled by the Linked Data Initiative, as I love to call it. Officially, Linked Data is about “using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods” and, if we look at its growth rate, it would be easy to bet on its success.

Here is the 2009 top ten, where I omitted GetGlue, Zemanta and OpenCalais since they already appeared in the 2008 edition:

Google Search Options and Rich Snippets

When this new feature of Google was announced, the whole Semantic Web community realized that something very powerful had started to move. Google Rich Snippets makes use of the RDFa contained in HTML Web pages to power the rich snippets feature.

Feedly

It’s a very, very nice feed aggregator built upon Google Reader, Twitter and FriendFeed. It’s easy to use, nice and really useful (well, at least it seems so to me) but, unfortunately, I cannot see where the semantic aspect is here.

Apture

This cool JavaScript tool allows publishers to add contextual information to links via pop-ups which display when users hover over or click on them. Looking at HTML pages built with the aid of this tool, Apture closely reminds me of the WordPress Snap-Shot plugin. But Apture seems richer than Snap-Shot since it allows publishers to directly add links and other content they want to display when the pages are rendered.

BBC Semantic Music Project

Built upon Musicbrainz.org (one of the most representative Linked Data clouds), it’s a very remarkable initiative. Personally, I’m using it within the NoTube project to disambiguate Last.fm bands. Concretely, given a certain Last.fm band identifier, I make a query to BBC /music that returns a URI. With this URI I ask the sameas.org service to give me other URIs referring to the same band. In this way I can associate with every Last.fm band a set of Linked Data URIs from which to obtain a full flavor of coherent data about them.

Freebase

It’s an open, semantically marked-up shared database powered by Metaweb.com, a great company based in San Francisco. Its popularity is growing fast, as ReadWriteWeb has already noticed. Somewhat similar to Wikipedia, Freebase provides all the mechanisms necessary to syndicate its data in a machine-readable form, mainly with RDF. Moreover, other Linked Data clouds have started to add owl:sameAs links to Freebase: do I have to add anything else?

Dbpedia

DBpedia is the nucleus of the Web of Data. The only thing I’d like to add is: it deserves to be in the ReadWriteWeb 2009 top ten more than the others.

Data.gov

It’s a remarkable US government initiative to “increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. It’s a start, and I dream of seeing something like this here in Italy too.

So what’s up in the end?

It’s my opinion that 2009 has been the year of Linked Data. New clouds are born every month, new links between the existing ones are established, and a new breed of developers is becoming aware of the potential and the threats of Linked Data-consuming applications. It seems that the Web of Data is finally taking shape, even if something strange is still in the air. First of all, if we take a closer look at the ReadWriteWeb 2009 Top Ten, I have to underline that 3 products out of 10 were also in the 2008 chart. Maybe the popular blog wanted to stress the progress that these products have made, but it sounds a bit strange to me that they forgot nice products such as FreeMix, AlchemyAPI, Sindice, OpenLink Virtuoso and the BestBuy.com usage of the GoodRelations ontology. Secondly, 3 products listed in the 2009 chart are publicly funded initiatives which, even if this is reasonable due to the nature of the products, leaves me with the impression that private investors are not in the loop yet.

What do I expect from 2010, then?

A large and massive rush to using RDFa for SEO purposes, a sustained growth of Linked Data clouds and, I really hope, the rise of a new application paradigm grounded in the consumption of such interlinked data.

Posted at 10:06

Davide Palmisano: the italian political activism and the semantic web

Beppe Grillo


A couple of years ago, during his live show, the popular Italian blogger and activist Beppe Grillo gave a quick demonstration of how the Web concretely realizes the “six degrees of separation”. The Italian blogger, today a Web enthusiast, showed that it was possible for him to get in contact with someone very famous using a couple of different websites: IMDb, Wikipedia and a few others. Starting from a movie in which he acted, he could reach the movie producer, the producer could be in contact with another actor due to previous work with the latter, and so on.

This demonstration consisted of a series of links that were opened, leading to Web pages containing the information from which to extract the relationships the showman wanted to establish.

This gig came back to my mind while I was thinking about how what I call the “Linked Data Philosophy” is impacting the traditional Web, and I imagined what Beppe Grillo could show nowadays.

Just the following, simple, trivial and short SPARQL query:

construct {
    ?actor1 foaf:knows ?actor2
}
where {
    ?movie dbpprop:starring ?actor1 .
    ?movie dbpprop:starring ?actor2 .
    ?movie a dbpedia-owl:Film .
    FILTER(?actor1 = <http://dbpedia.org/resource/Beppe_Grillo>)
}

Although Beppe is a great comedian, it may be hard even for him to make people laugh with this. But the point here is not about laughs, it’s about data: in this sense, the Web of Data is providing an outstanding and extremely powerful way to access an incredible twine of machine-readable interlinked data.

Recently, another nice and remarkable Italian initiative appeared on the Web: OpenParlamento.it. It’s mainly a service where the Italian congressmen are displayed and positioned on a chart based on the similarity of their votes on law proposals.

OK. Cool. But how could the Semantic Web improve this?

First of all, it would be very straightforward to provide a SPARQL endpoint serving some good RDF for this data, like the following example:

<rdf:RDF>
    <rdf:Description rdf:about="http://openparlamento.it/senate/Mario_Rossi">
        <rdf:type rdf:resource="http://openparlamento.it/ontology/Congressman"/>
        <foaf:name>Mario Rossi</foaf:name>
        <foaf:gender>male</foaf:gender>
        <openp:politicalGroup
            rdf:resource="http://openparlamento.it/groups/Democratic_Party"/>
        <owl:sameAs rdf:resource="http://dbpedia.org/resource/Mario_Rossi"/>
    </rdf:Description>
</rdf:RDF>

where names, descriptions, political affiliation and more are provided. Moreover, a property called openp:similarity could be used to map closer congressmen, using the same information as the chart cited above.

Secondly, all the information about congressmen is published on the official Italian chambers’ web site. By wrapping this data, OpenParlamento.it could provide an extremely exhaustive set of official information and, more importantly, links to DBpedia would be the key to getting a full set of machine-processable data from other Linked Data clouds as well.

How to benefit from all of this? Apart from employing a cutting-edge technology to syndicate data, everyone who wants to link the data provided by OpenParlamento.it on their own web pages can easily do it using RDFa: a small HTML fragment about a congressman, with a few RDFa attributes, is enough to link that page to the OpenParlamento.it cloud.

With these technologies as a basis, a new breed of applications (like web crawlers, for those interested in SEO) will access and process these data in a new, fashionable and extremely powerful way.

It’s time for those guys to embrace the Semantic Web, isn’t it?

Posted at 10:06

Libby Miller: An i2c heat sensor with a Raspberry Pi camera

I had a bit of a struggle with this so thought it was worth documenting. The problem is this – the i2c bus on the Raspberry Pi is used by the official camera to initialise it. So if you want to use an i2c device at the same time as the camera, the device will stop working after a few minutes. Here’s more on this problem.

I really wanted to use this heat sensor with mynaturewatch to see if we could exclude some of the problems with false positives (trees waving in the breeze and similar). I’ve not got it working well enough yet to look at this problem in detail. But I did get the i2c bus working alongside the camera – here’s how.


It’s pretty straightforward. You need to

  • Create a new i2c bus on some different GPIOs
  • Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one
  • Fin

1. Create a new i2c bus on some different GPIOs

This is super-easy:

sudo nano /boot/config.txt

Add the following line, preferably in the section where SPI and I2C are enabled.

dtoverlay=i2c-gpio,bus=3,i2c_gpio_delay_us=1

This line will create an additional i2c bus (bus 3) with GPIO 23 as SDA and GPIO 24 as SCL (GPIO 23 and 24 are the defaults).

2. Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one

I am using this sensor, for which I need this circuitpython library (more info), installed using:

pip3 install Adafruit_CircuitPython_AMG88xx

While the Pi is switched off, plug in the i2c device using GPIO 23 for SDA and GPIO 24 for SCL, and then boot it up and check it’s working:

 i2cdetect -y 3

Make two changes:

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/pin.py

and change the SDA and SCL pins to the new pins

#SDA = Pin(2)
#SCL = Pin(3)
SDA = Pin(23)
SCL = Pin(24)
nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/generic_linux/i2c.py

Change line 21 or thereabouts to use the i2c bus 3 rather than the default, 1:

self._i2c_bus = smbus.SMBus(3)
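
With those two edits in place, the standard CircuitPython usage should pick up the remapped bus transparently. A minimal check sketch (not from the original post; it assumes the AMG88xx sensor and the library installed above):

import time

import board
import busio
import adafruit_amg88xx

# After the pin.py edit, board.SDA/board.SCL resolve to GPIO 23/24,
# and the patched i2c.py opens bus 3 instead of the default bus 1.
i2c = busio.I2C(board.SCL, board.SDA)
amg = adafruit_amg88xx.AMG88XX(i2c)

while True:
    print(amg.pixels)  # 8x8 grid of temperatures in degrees Celsius
    time.sleep(1)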

3. Fin

Start up your camera code and your i2c peripheral. They should run happily together.


Posted at 10:06

Libby Miller: Neue podcast in a box, part 1

Ages ago I wrote a post on how to create a physical podcast player (“podcast in a box”) using Radiodan. Since then, we’ve completely rewritten the software, so those instructions can be much improved and simplified. Here’s a revised technique, which will get you as far as reading an RFID card. I might write a part 2, depending on how much time I have.

You’ll need:

  • A Pi 3B or 3B+
  • An 8GB or larger class 10 microSD card
  • A cheapo USB soundcard (e.g.)
  • A speaker with a 3.5mm jack
  • A power supply for the Pi
  • An MFRC522 RFID reader
  • A laptop and microSD card reader / writer

The idea of Radiodan is that as much as possible happens inside web pages. A server runs on the Pi. One webpage is opened headlessly on the Pi itself (internal.html) – this page will play the audio; another can be opened on another machine to act as a remote control (external.html).

They are connected using websockets, so each can access the same messages – the RFID service talks to the underlying peripheral on the Pi, making the data from the reader available.

Here’s what you need to do:

1. Set up the Pi as per these instructions (“setting up your Pi”)

You need to burn a microSD card with the latest Raspbian with Desktop to act as the Pi’s brain, and the easiest way to do this is with Etcher. Once that’s done, the easiest way to do the rest of the install is over ssh, and the quickest way to get that in place is to edit two files while the card is still in your laptop (I’m assuming a Mac):

Enable ssh by typing:

touch /Volumes/boot/ssh

Add your wifi network to boot by adding a file called

/Volumes/boot/wpa_supplicant.conf

contents: (replace AP_NAME and AP_PASSWORD with your wifi details)

country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="AP_NAME"
  psk="AP_PASSWORD"
  key_mgmt=WPA-PSK
}

Then eject the card, put the card in the Pi, attach all the peripherals except for the RFID reader and switch it on. While on the same wifi network, you should be able to ssh to it like this:

ssh pi@raspberrypi.local

password: raspberry.

Then install the Radiodan software using the provisioning script like this:

curl https://raw.githubusercontent.com/andrewn/neue-radio/master/deployment/provision | sudo bash

2. Enable SPI on the Pi

Don’t reboot yet; type:

sudo raspi-config

Under interfaces, enable SPI, then shut the Pi down

sudo halt

and unplug it.

3. Test Radiodan and configure it

If all is well and you have connected a speaker via a USB soundcard, you should hear it say “hello” as it boots.

Please note: Radiodan does not work with the default 3.5mm jack on the Pi. We’re not sure yet why. But USB soundcards are very cheap, and work well.

There’s one app available by default for Radiodan on the Pi. To use it,

  1. Navigate to http://raspberrypi.local/radio
  2. Use the buttons to play different audio clips. If you can hear things, then it’s all working

 


Shut the Pi down and unplug it from the mains.

4. Connect up the RFID reader to the Pi

like this

Then start the Pi up again by plugging it in.

5. Add the piab app

Dan has made a very fancy mechanism for using Samba to drag and drop apps to the Pi, so that you can develop on your laptop. However, because we’re using RFID (which only works on the Pi), we may as well do everything on there. So, ssh to it again:

ssh pi@raspberrypi.local
cd /opt/radiodan/rde/apps/
git clone http://github.com/libbymiller/piab

This is currently a very minimal app, which just allows you to see all websocket messages going by, and doesn’t do anything else yet.

6. Enable the RFID service and piab app in the Radiodan web interface

Go to http://raspberrypi.local:5020

Enable “piab”, clicking ‘update’ beneath it. Enable the RFID service, clicking ‘update’ beneath it. Restart the manager (red button) and then install dependencies (green button), all within the web page.



Reboot the Pi (e.g. ssh in and sudo reboot). This will enable the RFID service.

7. Test the RFID reader

Open http://raspberrypi.local:5000/piab and open developer tools for that page. Place a card on the RFID reader. You should see a json message in the console with the RFID identifier.


The rest is a matter of writing javascript / html code to:

  • Associate a podcast feed with an RFID (e.g. a web form in external.html that allows the user to add a podcast feed url)
  • Parse the podcast feed when the appropriate card id is detected by the reader (a rough sketch of this step follows the list)
  • Find the latest episode and play it using internal.html (see the radio app example for how to play audio)
  • Add more fancy options, such as remembering where you were in an episode, stopping when the card is removed etc.
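
The feed-handling step is language-agnostic, so here is a rough sketch of it in Python with feedparser, just to show the shape of the logic (the real app would do this in JavaScript inside internal.html, and the feed URL below is a placeholder):

import feedparser  # pip3 install feedparser

def latest_episode_url(feed_url):
    """Return the audio URL of the newest episode in a podcast feed."""
    feed = feedparser.parse(feed_url)
    if not feed.entries:
        return None
    # Podcast feeds normally list the newest entry first and attach
    # the audio file as an enclosure.
    for enclosure in feed.entries[0].get("enclosures", []):
        if enclosure.type.startswith("audio/"):
            return enclosure.href
    return None

print(latest_episode_url("https://example.org/podcast.rss"))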

As you develop, you can see the internal page on http://raspberrypi.local:5001 and the external page on http://raspberrypi.local:5000. You can reload the app using the blue button on http://raspberrypi.local:5020.

Many more details about the architecture of Radiodan are available; full installation instructions and instructions for running it on your laptop are here; docs are here; code is in github.

Posted at 10:06

Libby Miller: #Makevember

@chickengrylls’ #makevember manifesto / hashtag has been an excellent experience. I’ve made maybe five nice things, a lot of nonsense, and a lot of useless junk, but that’s fine – I’ve learned a lot, mostly about servos and other motors. There’s been tons of inspiration too (check out these beautiful automata, some characterful paper sculptures, Richard’s unsuitable materials, my initial inspiration’s set of themes on a tape, and loads more). A lovely aspect was all the nice people and beautiful and silly things emerging out of the swamp of Twitter.


Of my own makes, my favourites were this walking creature, with feet made of crocodile clips (I was amazed it worked); a saw-toothed vertical traveller, such a simple little thing; this fast robot (I was delighted when it actually worked); some silly stilts; and (from October) this blimp / submarine pair.

I did lots of fails too – e.g. a stencil, a raspberry blower. Also lots of partial fails that got scaled back – AutoBez 1, 2, and 3; Earth-moon; a poor-quality under-water camera. And some days I just ran out of inspiration and made something crap.

Why’s it so fun? Well there’s the part about being more observant, looking at materials around you constantly to think about what to make, though that’s faded a little. As I’ve got better I’ve had more successes and when you actually make something that works, that’s amazing. I’ve loved seeing what everyone else is making, however good or less-good, whether they spent ages or five minutes on it. It feels very purposeful too, having something you have to do every day.

Downsides: I’ve spent far too long on some of these. I was very pleased with both Croc Nest, and Morse, but both of them took ages. The house is covered in bits of electronics and things I “might need” despite spending some effort tidying, but clearly not enough (and I need to have things to hand and to eye for inspiration). Oh, and I’m addicted to Twitter again. That’s it really. Small price to pay.

Posted at 10:06

Libby Miller: Libbybot eleven – webrtc / pi3 / presence robot

The libbybot posable presence robot’s latest instructions are here. It’s a lot more detailed than previous versions and much more reliable (and includes details for construction, motors, server etc).

It’s not a work project, but I do use it at work (picture by David Man).

[Photo]

Posted at 10:06

Peter Mika: Semantic Search Challenge sponsored by Yahoo! Labs

Together with my co-chairs Marko Grobelnik, Thanh Tran Duc and Haofen Wang, we again got the opportunity of organizing the 4th Semantic Search Workshop, the premier event for research on retrieving information from structured data collections or text collections annotated with metadata. Like last year, the Workshop will take place at the WWW conference, to be held March 29, 2011, in Hyderabad, India. If you wish to submit a paper, there are still a few days left: the deadline is Feb 26, 2011. We welcome both short and long submissions.

In conjunction with the workshop, and with a number of co-organizers helping us, we are also launching  a Semantic Search Challenge (sponsored by Yahoo! Labs), which is hosted at semsearch.yahoo.com. The competition will feature two tracks. The first track (entity retrieval) is the same task we evaluated last year: retrieving resources that match a keyword query, where the query contains the name of an entity, with possibly some context (such as “starbucks barcelona”). We are adding this year a new task (list retrieval) which represents the next level of difficulty: finding resources that belong to a particular set of entities, such as “countries in africa”. These queries are more complex to answer since they don’t name a particular entity. Unlike in other similar competitions, the task is to retrieve the answers from a real (messy…) dataset crawled from the Semantic Web. There is a small prize ($500) to win in each track.

The entry period will start March 1, and run through March 15. Please consider participating in either of these tracks: it’s early days in Semantic Search, and there is so much to discover.

Posted at 10:06

Peter Mika: Microformats and RDFa deployment across the Web

I have presented on previous occasions (at Semtech 2009, SemTech 2010, and later at FIA Ghent 2010, see slides for the latter, also in ISWC 2009) some information about microformat and RDFa deployment on the Web. As such information is hard to come by, this has generated some interest from the audience. Unfortunately, Q&A time after presentations is too short to get into details, hence some additional background on how we obtained this data and what it means for the Web. This level of detail is also important to compare this with information from other sources, where things might be measured differently.

The chart below shows the deployment of certain microformats and RDFa markup on the Web, as percentage of all web pages, based on an analysis of 12 billion web pages indexed by Yahoo! Search. The same analysis has been done at three different time-points and therefore the chart also shows the evolution of deployment.

[Chart: Microformats and RDFa deployment on the Web (% of all web pages)]

The data is given below in a tabular format.

Date RDFa eRDF tag hcard adr hatom xfn geo hreview
09-2008 0.238 0.093 N/A 1.649 N/A 0.476 0.363 N/A 0.051
03-2009 0.588 0.069 2.657 2.005 0.872 0.790 0.466 0.228 0.069
10-2010 3.591 0.000 2.289 1.058 0.237 1.177 0.339 0.137 0.159

There are a couple of comments to make:

  • There are many microformats (see microformats.org) and I only include data for the ones that are most common on the Web. To my knowledge at least, all other microformats are less common than the ones listed above.
  • eRDF has been a predecessor to RDFa, and has been obsoleted by it. RDFa is more fully featured than eRDF, and has been adopted as a standard by the W3C.
  • The data for the tag, adr and geo formats is missing from the first measurement.
  • The numbers cannot be aggregated to get a total percentage of URLs with metadata. The reason is that a webpage may contain multiple microformats and/or RDFa markup. In fact, this is almost always the case with the adr and geo microformats, which are typically used as part of hcard. The hcard microformat itself can be part of hatom markup etc.
  • Not all data is equally useful, depending on what you are trying to do. The tag microformat, for example, is nothing more than a set of keywords attached to a webpage. RDFa itself covers data using many different ontologies.
  • The data doesn’t include “trivial” RDFa usage, i.e. documents that only contain triples from the xhtml namespace. Such triples are often generated by RDFa parsers even when the page author did not intend to use RDFa.
  • This data includes all valid RDFa, and not just namespaces or vocabularies supported by Yahoo! or any other company.

The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.

These results make me optimistic that the Semantic Web is here already in large ways. I don’t expect that a 100% of webpages will ever adopt microformats or RDFa markup, simply because not all web pages contain structured data. As this seems interesting to watch, I will try to publish updates to the data and include the update chart here or in future presentations.


Posted at 10:06

Michael Hausenblas: Elephant filet

End of January I participated in a panel discussion on Big Data, held during the CISCO live event in London. One of my fellow panelists, I believe it was Sean McKeown of CISCO, said something along these lines:

… ideally the cluster is at 99% utilisation, concerning CPU, I/O, and network …

This stuck in my head and I gave it some thoughts. In the following I will elaborate a bit on this in the context of where Hadoop is used in a shared setup, for example in hosted offerings or, say, within an enterprise that runs different systems such as Storm, Lucene/Solr, and Hadoop on one cluster.

In essence, we witness two competing forces: from the perspective of a single user who expects performance vs. the view of the cluster owner or operator who wants to optimise throughput and maximise utilisation. If you’re not familiar with these terms you might want to read up on Cary Millsap’s Thinking Clearly About Performance (part 1 | part 2).

Now, in such a shared setup we may experience a spectrum of loads: from compute-intensive over I/O-intensive to communication-intensive, illustrated in the following, not overly scientific figure:
[Figure: Utilisations]

Here are some observations and thoughts as potential starting points for deeper research or experiments.

Multitenancy. We see more and more deployments that require strong support for multitenancy; check out the CapacityScheduler, learn from best practices or use a distribution that natively supports the specification of topologies. Additionally, you might still want to keep an eye on Serengeti – VMware’s Hadoop virtualisation project – that seems to have gone quiet in the past months, but I still have hope for it.

Software Defined Networks (SDN). See Wikipedia’s definition for it, it’s not too bad. CISCO, for example, is very active in this area, and there was recently a special issue of IEEE Communications Magazine (February 2013) covering SDN research. I can perfectly see – and indeed this was also briefly discussed on our CISCO live panel back in January – how SDN can enable new ways to optimise throughput and performance. Imagine an SDN that is dynamically workload-aware, in the sense that it knows the difference between a node that runs a task tracker vs. a data node vs. a Solr shard – it should be possible to transparently improve the operational parameters, and everyone involved, both the users as well as the cluster owner, benefits from it.

As usual, I’m very interested in what you think about the topic and looking forward to learning about resources in this space from you.

Posted at 10:06

Michael Hausenblas: MapR, Europe and me

You might have already heard that MapR, the leading provider of enterprise-grade Hadoop and friends, is launching its European operations.

Guess what? I’m joining MapR Europe as of January 2013 in the role of Chief Data Engineer EMEA and will support our technical and sales teams throughout Europe. Pretty exciting times ahead!

As an aside: as I recently pointed out, I very much believe that Apache Drill and Hadoop offer great synergies and if you want to learn more about this come and join us at the Hadoop Summit where my Drill talk has been accepted for the Hadoop Futures session.

Posted at 10:06

Michael Hausenblas: Hosted MapReduce and Hadoop offerings

Hadoop in the cloud

Today’s question is: where are we regarding MapReduce/Hadoop in the cloud? That is, what are the offerings of Hadoop-as-a-Service or other hosted MapReduce implementations, currently?

A year ago, InfoQ ran a story, “Hadoop-as-a-Service from Amazon, Cloudera, Microsoft and IBM”, which will serve us as a baseline here. This article contains the following statement:

According to a 2011 TDWI survey, 34% of the companies use big data analytics to help them making decisions. Big data and Hadoop seem to be playing an important role in the future.

One year later, we learn from a recent MarketsAndMarkets study, Hadoop & Big Data Analytics Market – Trends, Geographical Analysis & Worldwide Market Forecasts (2012 – 2017) that …

The Hadoop market in 2012 is worth $1.5 billion and is expected to grow to about $13.9 billion by 2017, at a [Compound Annual Growth Rate] of 54.9% from 2012 to 2017.

In the past year there have also been some quite vivid discussions around the topic ‘Hadoop in the cloud’.

So, here are some current offerings and announcements I’m aware of:

… and now it’s up to you dear reader – I would appreciate it if you could point me to more offerings and/or announcements you know of, concerning MapReduce and Hadoop in the cloud!

Posted at 10:06

Michael Hausenblas: Interactive analysis of large-scale datasets

The value of large-scale datasets – stemming from IoT sensors, end-user and business transactions, social networks, search engine logs, etc. – apparently lies in the patterns buried deep inside them. Being able to identify and analyse these patterns is vital, be it for detecting fraud, determining a new customer segment or predicting a trend. As we’re moving from billions to trillions of records (or: from the terabyte to the peta- and exabyte scale), the more ‘traditional’ methods, including MapReduce, seem to have reached the end of their capabilities. The question is: what now?

But a second issue has to be addressed as well: in contrast to what current large-scale data processing solutions provide for in batch mode (arbitrarily, but in line with the state of the art, defined as any query that takes longer than 10 seconds to execute), the need for interactive analysis is increasing. Complementary visual analytics may or may not be helpful, but they come with their own set of challenges.

Recently, a proposal for a new Apache Incubator group called Drill has been made. This group aims at building a:

… distributed system for interactive analysis of large-scale datasets […] It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Drill’s design is supposed to be informed by Google’s Dremel and wants to efficiently process nested data (think: Protocol Buffers). You can learn more about requirements and design considerations from Tomer Shiran’s slide set.

In order to better understand where Drill fits in in the overall picture, have a look at the following (admittedly naïve) plot that tries to place it in relation to well-known and deployed data processing systems:

BTW, if you want to test-drive Dremel, you can do this already today; it’s an IaaS service offered in Google’s cloud computing suite, called BigQuery.

Posted at 10:06

Michael Hausenblas: Schema.org + WebIntents = Awesomeness

Imagine you search for a camera, say a Canon EOS 60D, and in addition to the usual search results you’re as well offered a choice of actions you can perform on it, for example share the result with a friend, write a review for the item or, why not directly buy it?

[Figure: Enhancing SERP with actions]

Sounds far-fetched? Not at all. In fact, all the necessary components are available and deployed. With Schema.org we have a way to describe the things we publish on our Web pages, such as books or cameras, and with WebIntents we have a technology at hand that allows us to interact with these things in a flexible way.
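
As a rough sketch of the idea: assuming a page that marks the camera up with Schema.org and a browser (or shim) that implements the WebIntents proposal from webintents.org, a “share this result” action could be wired up along these lines. The Intent constructor, the share action URI and navigator.startActivity come from that proposal, and the element selectors are made up for illustration – treat this as a sketch rather than working product code.

// Sketch only: assumes a browser or shim implementing the webintents.org proposal.
document.querySelector("#share-button").addEventListener("click", function () {
  // Grab the URL of the Schema.org-described item on the page (selector is illustrative).
  var item = document.querySelector("[itemscope] [itemprop='url']");
  var itemUrl = item ? item.href : location.href;
  // "share" is one of the intent actions described at webintents.org.
  var intent = new Intent("http://webintents.org/share", "text/uri-list", itemUrl);
  window.navigator.startActivity(intent, function () {
    console.log("shared");
  }, function (err) {
    console.log("sharing failed", err);
  });
});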

Here are some starting points in case you want to dive into WebIntents a bit:

PS: I started to develop a proof of concept for mapping Schema.org terms to WebIntents and will report on the progress, here. Stay tuned!

Posted at 10:06

Michael Hausenblas: Turning tabular data into entities

Two widely used data formats on the Web are CSV and JSON. In order to enable fine-grained access in a hypermedia-oriented fashion I’ve started to work on Tride, a mapping language that takes one or more CSV files as inputs and produces a set of (connected) JSON documents.

In the 2 min demo video I use two CSV files (people.csv and group.csv) as well as a mapping file (group-map.json) to produce a set of interconnected JSON documents.

So, the following mapping file:

{
 "input" : [
  { "name" : "people", "src" : "people.csv" },
  { "name" : "group", "src" : "group.csv" }
 ],
 "map" : {
  "people" : {
   "base" : "http://localhost:8000/people/",
   "output" : "../out/people/",
   "with" : { 
    "fname" : "people.first-name", 
    "lname" : "people.last-name",
    "member" : "link:people.group-id to:group.ID"
   }
  },
  "group" : {
   "base" : "http://localhost:8000/group/",
    "output" : "../out/group/",
    "with" : {
     "title" : "group.title",
     "homepage" : "group.homepage",
     "members" : "where:people.group-id=group.ID link:group.ID to:people.ID"
    }
   }
 }
}

… produces JSON documents representing groups.
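
As a rough illustration (the values and the exact layout of the member links are assumptions based on the mapping above, not actual Tride output), a group document might look something like this:

{
 "title": "Example Group",
 "homepage": "http://example.org/",
 "members": [
  "http://localhost:8000/people/1",
  "http://localhost:8000/people/2"
 ]
}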

Posted at 10:06

John Goodwin: Quick Play with Cayley Graph DB and Ordnance Survey Linked Data

Earlier this month Google announced the release of the open source graph database/triplestore Cayley. This weekend I thought I would have a quick look at it, and try some simple queries using the Ordnance Survey Linked Data.

Cayley is written in Go, so first I had to download and install that. I then downloaded Cayley from here. As an initial experiment I decided to use the Boundary Line Linked Data, and you can grab the data as N-Triples here. I only wanted a subset of this data – I didn’t need all of the triples storing the complex boundary geometries for my initial test, so I discarded the files of the form *-geom.nt and the files of the form county.nt, dbu.nt etc. (these are the ones with the boundaries in). Finally I put the remainder of the data into one file so it was ready to load into Cayley.

It is very easy to load data into Cayley – see the getting started section on the Cayley pages here. I decided I wanted to try the web interface, so loading the data (in a file called all.nt) was a simple case of typing:

./cayley http --dbpath=./boundaryline/all.nt

Once you’ve done this, point your web browser to http://localhost:64210/ and you should see something like:

[Screenshot]

One of the things that will first strike people used to using RDF/triplestores is that Cayley does not have a SPARQL interface, and instead uses a query language based on Gremlin. I am new to Gremlin, but it seems it has already been used to explore linked data – see this blog post from Dan Brickley from a few years ago.

The main purpose of this blog post is to give a few simple examples of queries you can perform on the Ordnance Survey data in Cayley. If you have Cayley running then you can find the query language documented here.

At the simplest level the query language seems to be an easy way to traverse the graph by starting at a node/vertex and following incoming or outgoing links. So to find all the regions that touch Southampton it is a simple case of starting at the Southampton node, following a touches outbound link and returning the results:

g.V("http://data.ordnancesurvey.co.uk/id/7000000000037256").Out("http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches").All()

Giving:

[Screenshot]

If you want to return the names and not the IDs:

g.V("http://data.ordnancesurvey.co.uk/id/7000000000037256").Out("http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches").Out("http://www.w3.org/2000/01/rdf-schema#label").All()

[Screenshot]

You can also use filters – so to just see the counties bordering Southampton:

g.V("http://data.ordnancesurvey.co.uk/id/7000000000037256").Out("http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches").Has("http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://data.ordnancesurvey.co.uk/ontology/admingeo/County").Out("http://www.w3.org/2000/01/rdf-schema#label").All()

[Screenshot]

The Ordnance Survey linked data also has spatial predicates ‘contains’, ‘within’ as well as ‘touches’. Analogous queries can be done with those. E.g. find me everything Southampton contains:

g.V("http://data.ordnancesurvey.co.uk/id/7000000000037256").Out("http://data.ordnancesurvey.co.uk/ontology/spatialrelations/contains").Out("http://www.w3.org/2000/01/rdf-schema#label").All()

So after this very quick initial experiment it seems that Cayley is very good at providing an easy way of doing very quick/simple queries. One query I wanted to do was find everything in, say, Hampshire – the full transitive closure. This is very easy to do in SPARQL, but in Cayley (at first glance) you’d have to write some extra code (not exactly rocket science, but a bit of a faff compared to SPARQL). I rarely touch Javascript these days so for me personally this will never replace a triplestore with a SPARQL endpoint, but for JS developers this tool will be a great way to get started with and explore linked data/RDF. I might well brush up on my Javascript and provide more complicated examples in a later blog post…
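
For what it’s worth, here is a rough, untested sketch of what that extra code might look like in Cayley’s JavaScript/Gremlin dialect – it just loops outward from a starting region along the ‘contains’ predicate. The ForEach callback, the id field on results, g.Emit and the placeholder Hampshire URI are assumptions about the API rather than something I have run.

// Untested sketch – ForEach, result.id and g.Emit are assumptions about Cayley's query API.
var contains = "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/contains";
var hampshire = "http://data.ordnancesurvey.co.uk/id/..."; // placeholder: the Hampshire region URI

var seen = {};
var frontier = [hampshire];
while (frontier.length > 0) {
  var next = [];
  frontier.forEach(function (region) {
    g.V(region).Out(contains).ForEach(function (result) {
      if (!seen[result.id]) {
        seen[result.id] = true;
        next.push(result.id);
      }
    });
  });
  frontier = next;
}
g.Emit(seen); // everything directly or indirectly contained in Hampshire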

 

 

 

Posted at 10:06

John Goodwin: Visualising the Location Graph – example with Gephi and Ordnance Survey linked data

This is arguably a simpler follow up to my previous blog post, and here I want to look at visualising Ordnance Survey linked data in Gephi. Now Gephi isn’t really a GIS, but it can be used to visualise the adjacency graph where regions are represented as nodes in a graph, and links represent adjacency relationships.

The approach here will be very similar to the approach in my previous blog. The main difference is that you will need to use the Ordnance Survey SPARQL endpoint and not the DBpedia one. So this time in the Gephi semantic web importer enter the following endpoint URL:

http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql

The Ordnance Survey endpoint returns Turtle by default, and Gephi does not seem to like this. I wanted to force the output as XML. I figured this could be done using a ‘REST parameter name’ (output) with value equal to xml. This did not seem to work, so instead I had to do a bit of a hack. In the ‘query tag…’ box you will need to change the value from ‘query’ to ‘output=xml&query’. You should see something like this in the Semantic Web Importer now:

[Screenshot]

Now click on the query tab. If we want to, for example, view the adjacency graph for constituencies we can enter the following query:

prefix gephi:<http://gephi.org/>
construct {
?s gephi:label ?label .
?s gephi:lat ?lat .
?s gephi:long ?long .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .}
where
{
?s a <http://data.ordnancesurvey.co.uk/ontology/admingeo/WestminsterConstituency> .
?o a <http://data.ordnancesurvey.co.uk/ontology/admingeo/WestminsterConstituency> .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .
?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
}

and click ‘run’. To visualise the output you will need to follow the exact same steps mentioned here (remember to recast the lat and long variables to decimal).

If we want to view adjacency of London Boroughs then we can do this with a similar query:

prefix gephi:<http://gephi.org/>
construct {
?s gephi:label ?label .
?s gephi:lat ?lat .
?s gephi:long ?long .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .}
where
{
?s a <http://data.ordnancesurvey.co.uk/ontology/admingeo/LondonBorough> .
?o a <http://data.ordnancesurvey.co.uk/ontology/admingeo/LondonBorough> .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .
?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
}

When visualising you might want to change the scale parameter to 10000.0. You should see something like this:

[Screenshot]

So far so good. Now imagine we want to bring in some other data – recall my previous blog post here. We can use SPARQL federation to bring in data from other endpoints. Suppose we would like to make the size of the node represent the ‘IMD rank‘ of each London Borough…we can do this by bringing in data from the Open Data Communities site:

prefix gephi:<http://gephi.org/>
construct {
?s gephi:label ?label .
?s gephi:lat ?lat .
?s gephi:long ?long .
?s gephi:imd-rank ?imdrank .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .}
where
{
?s a <http://data.ordnancesurvey.co.uk/ontology/admingeo/LondonBorough> .
?o a <http://data.ordnancesurvey.co.uk/ontology/admingeo/LondonBorough> .
?s <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/touches> ?o .
?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
SERVICE <http://opendatacommunities.org/sparql> {
?x <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?s .
?x <http://opendatacommunities.org/def/IMD#IMD-score> ?imdrank . }
}

You will need to recast the imdrank as an integer for what follows (do this using the same approach used to recast the lat/long variables). You can now use Gephi to resize the nodes according to IMD rank. We do this using the ranking tab:

[Screenshot]

You should now see your London Boroughs re-sized according to their IMD rank:

[Screenshot]

Turning the lights off and adding some labels, we get:

[Screenshot]

Posted at 10:06

John Goodwin: All roads lead to? Experiments with Gephi, Linked Data and Wikipedia

Gephi is “an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs”. Tony Hirst did a great blog post a while back showing how you could use Gephi together with DBpedia (a linked data version of Wikipedia) to map an influence network in the world of philosophy. Gephi offers a semantic web plugin which allows you to work with the web of linked data. I recommend you read Tony’s blog to get started with using that plugin with Gephi. I was interested to experiment with this plugin, and to look at what sort of geospatial visualisations could be possible.

If you want to follow all the steps in this post you will need to:

Initially I was interested to see if there were any interesting networks we might visualise between places. In order to see how Wikipedia relates one place to another, it was a simple case of going to the DBpedia SPARQL endpoint and trying the following query:

select distinct ?p
where
{
?s a <http://schema.org/Place> .
?o a <http://schema.org/Place> .
?s ?p ?o .
}

– where s and o are places, find me what ‘p’ relates them. I noticed two properties ‘http://dbpedia.org/ontology/routeStart‘ and ‘http://dbpedia.org/ontology/routeEnd‘ so I thought I would try to visualise how places round the world were linked by transport connections.  To find places connected by a transport link you want to find pairs ‘start’ and ‘end’ that are the route start and route end, respectively, of some transport link. You can do this with the following query:

select ?start ?end
where
{
?start a <http://schema.org/Place> .
?end a <http://schema.org/Place> .
?link <http://dbpedia.org/ontology/routeStart> ?start .
?link <http://dbpedia.org/ontology/routeEnd> ?end .
}

This gives a lot of data so I thought I would restrict the links to be only road links:

select ?start ?end
where
{?start a <http://schema.org/Place> .
?end a <http://schema.org/Place> .
?link <http://dbpedia.org/ontology/routeStart> ?start .
?link <http://dbpedia.org/ontology/routeEnd> ?end .
?link a <http://dbpedia.org/ontology/Road> . }

We are now ready to visualise this transport network in Gephi. Follow the steps in Tony’s blog to bring up the Semantic Web Importer. In the ‘driver’ tab make sure ‘Remote – SOAP endpoint’ is selected, and the EndPoint URL is http://dbpedia.org/sparql. In an analogous way to Tony’s blog we need to construct our graph so we can visualise it. To simply view the connections between places it would be enough to just add this query to the ‘Query’ tab:

construct {?start <http://foo.com/connectedTo> ?end}
where
{
?start a <http://schema.org/Place> .
?end a <http://schema.org/Place> .
?link <http://dbpedia.org/ontology/routeStart> ?start .
?link <http://dbpedia.org/ontology/routeEnd> ?end .
?link a <http://dbpedia.org/ontology/Road> .
}

However, as we want to visualise this in a geospatial context we need the lat and long of the start and end points so our construct query becomes a bit more complicated:

prefix gephi:<http://gephi.org/>
construct {
?start gephi:label ?labelstart .
?end gephi:label ?labelend .
?start gephi:lat ?minlat .
?start gephi:long ?minlong .
?end gephi:lat ?minlat2 .
?end gephi:long ?minlong2 .
?start <http://foo.com/connectedTo> ?end}
where
{
?start a <http://schema.org/Place> .
?end a <http://schema.org/Place> .
?link <http://dbpedia.org/ontology/routeStart> ?start .
?link <http://dbpedia.org/ontology/routeEnd> ?end .
?link a <http://dbpedia.org/ontology/Road> .
{select ?start (MIN(?lat) AS ?minlat) (MIN(?long) AS ?minlong) where {?start <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat . ?start <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .} }
{select ?end (MIN(?lat2) AS ?minlat2) (MIN(?long2) AS ?minlong2) where {?end <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat2 . ?end <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long2 .} }
?start <http://www.w3.org/2000/01/rdf-schema#label> ?labelstart .
?end <http://www.w3.org/2000/01/rdf-schema#label> ?labelend .
FILTER (lang(?labelstart) = 'en')
FILTER (lang(?labelend) = 'en')
}

Note that the query for the lat and long is a bit more complicated than it might be. This is because DBpedia data is quite messy, and many entities will have more than one lat/long pair. I used a subquery in SPARQL to pull out the minimum lat/long from all the pairs retrieved. Additionally I also retrieved the English labels for each of the start/end points.

Now copy/paste this construct query into the ‘Query’ tab on the Semantic Web Importer:

[Screenshot]

Now hit the run button and watch the data load.

To visualise the data we need to do a bit more work. In Gephi click on the ‘Data Laboratory’ and you should now see your data table. Unfortunately all of the lats and longs have been imported as strings and we need to recast them as decimals. To do this click on the ‘More actions’ pull down menu and look for ‘Recast column’ and click it. In the ‘Recast manipulator’ window go to ‘column’ and select ‘lat(Node Table)’ from the pull down menu. Under ‘Convert to’ select ‘Double’ and click recast. Do the same for ‘long’.

[Screenshot]

When you are done click ‘ok’ and return to the ‘overview’ tab in Gephi. To see this data geospatially go to the layout panel and select ‘Geo Layout’. Change the latitude and longitude to your new recast variable names, and unclick ‘center’ (my graph kept vanishing with it selected). Experiment with the scale value:

[Screenshot]

You should now see something like the following in your display panel:

[Screenshot]

Given that this is supposed to be a road network you will find some oddities. This, it seems, is down to ‘European routes’ like European route E15 that link from Scotland down to Spain.

Posted at 10:06

Leigh Dodds: Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve experienced conversations with people from a public policy background and people with a background in data management, where it’s clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. Which is slightly different to the work that organisations do in order to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece on open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool), or the role of open data in demonstrating good governance (e.g. through transparency)? I think the article touched on all of these, but they seem quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special).

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:

 

[Diagram: Governance of data as a set of overlapping contexts]

Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. This includes, for example: legislation from the EU (currently!), legislation from the UK government, legislation defined by regulators, best practices that might be defined how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, then different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, then different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisation, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?”

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may just be very loosely defined.

This terrible sketch I shared on twitter a while ago shows another way of looking at this. The laws, permissions, norms and guidelines that define the context in which we use data.

[Sketch: Data use in context]

One of the ways in which I’ve found this “overlapping contexts” perspective useful, is in thinking about how data moves into and out of different contexts. For example when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t give proper appreciation to the norms and regulations that apply to this data — the broader contexts which inform how it is governed and used, even though it’s published under an open licence. For example, people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations — between contexts — but doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licenses are not a tool for governance. They just help to support the reuse of information.

This led to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by and features individuals. So their lack of engagement with the community of contributors, in order to discuss norms and expectations, was a mistake. The lack of good tools to allow people to remove photos — NBC News created a better tool to allow Flickr users to check the contents of the dataset — is also a shortfall in their duties. It’s the combination of these that has led to the outcry.

If IBM had instead launched a similar initiative where they built this dataset collaboratively with the community, then they could have avoided this issue. This is the approach that Mozilla took with Voice. IBM, and the world, might even have had a better dataset as a result, because people might have opted in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?

 

Posted at 10:05

John Breslin: Web Archive Ontology (SIOC+CDM)

Ontology Prototype

We (John G. Breslin and Guangyuan Piao, Unit for Social Semantics, Insight Centre for Data Analytics, NUI Galway) have created a prototype ontology for web archives based on two existing ontologies: Semantically-Interlinked Online Communities (SIOC) and the Common Data Model (CDM).


Figure 1: Initial Prototype of Web Archive Ontology, Linking to SIOC and CDM

In Figure 1, we give an initial prototype for a general web archive ontology, linked to concepts in the CDM, but allowing flexibility in terms of archiving works, media, web pages, etc. through the “Item” concept. Items are versioned and linked to each other, as well as to concepts appearing in the archived items themselves.

We have not shown the full CDM for ease of display in this document, but rather some of the more commonly used concepts. We can also map to other vocabulary terms shown in the last column of Table 1 below; some mappings and reused terms are shown in Figure 1.

Essentially, the top part of the model differentiates between the archive / storage mechanism for an item in an area (Container) on a website (Site), i.e. where it originally came from, who made it, when it was created / modified, when it was archived, the content stream, etc., and on the bottom, what the item actually is (for example, in terms of CDM, the single exemplar of the manifestation of an expression of a work).

Also, the agents who make the item and the work may differ (e.g. a bot may generate a HTML copy of a PDF publication written by Ms. Smith).

Relevant Public Ontologies

In Table 1, we list some relevant public ontologies and terms of interest. Some terms can be reused, and others can be mapped to for interoperability purposes.

  • FRBR: For describing functional requirements for bibliographic records. Why relevant: to describe bibliographic records. Useful terms: Expression, Work.
  • FRBRoo: Express the conceptualisation of FRBR with an object-oriented methodology instead of the entity-relationship methodology, as an alternative. Why relevant: in general, FRBRoo “inherits” all concepts of CIDOC-CRM and harmonises with it. Useful terms: ClassicalWork, LegalWork, ScholarlyWork, Publication, Expression.
  • BIBFrame: For describing bibliographic descriptions, both on the Web and in the broader networked world. Why relevant: to represent and exchange bibliographic data. Useful terms: Work, Instance, Annotation, Authority.
  • EDM: The Europeana Data Model models data in and supports functionality for Europeana, an internet portal that acts as an interface to millions of books, paintings, films, museum objects and archival records that have been digitised throughout Europe. Why relevant: complements FRBRoo with additional properties and classes. Useful terms: incorporate, isDerivativeOf, WebResource, TimeSpan, Agent, Place, PhysicalThing.
  • CIDOC-CRM: For describing the implicit and explicit concepts and relationships used in the cultural heritage domain. Why relevant: to describe cultural heritage information. Useful terms: EndofExistence, Creation, Time-Span.
  • EAC-CPF: Encoded Archival Context for Corporate Bodies, Persons and Families is used for encoding the names of creators of archival materials and related information. Why relevant: used closely in association with EAD to provide a formal method for recording the descriptions of record creators. Useful terms: lastDateTimeVerified, Control, Identity.
  • EU PO CDM: Ontology based on the FRBR model, for describing the relationships between resource types managed by the EU Publications Office and their views, according to the FRBR model. Why relevant: to describe records. Useful terms: Expression, Work, Manifestation, Agent, Subject, Item.
  • OAI-ORE: Defines standards for the description and exchange of aggregations of Web resources. Why relevant: to describe relationships among resources (also used in EDM). Useful terms: aggregates, Aggregation, ResourceMap.
  • EAD: Standard used for hierarchical descriptions of archival records. Why relevant: terms are designed to describe archival records. Useful terms: audience, abbreviation, certainty, repositorycode, AcquisitionInformation, ArchivalDescription.
  • WGS84 Geo: For describing information about spatially located things. Why relevant: terms can be used with the Place ontology for describing place information. Useful terms: lat, long.
  • Media: For describing media resources on the Web. Why relevant: to describe media contents for web archiving. Useful terms: compression, format, MediaType.
  • Places: For describing places of geographic interest. Why relevant: to describe place information for events, etc. Useful terms: City, Country, Continent.
  • Event: For describing events. Why relevant: to describe a specific event in content; can also be used for representing events at an administrative level. Useful terms: agent, product, place, Agent, Event.
  • SKOS: A common data model for sharing and linking knowledge organisation systems. Why relevant: to capture similarities among ontologies and make the relationships explicit. Useful terms: broader, related, semanticRelation, relatedMatch, Concept, Collection.
  • SIOC: For describing social content. Why relevant: terms are general enough to be used for web archiving. Useful terms: previous_version, next_version, earlier_version, later_version, latest_version, Item, Container, Site, embed_knowledge.
  • Dublin Core: Provides a metadata vocabulary of “core” properties that is able to provide basic descriptive information about any kind of resource. Why relevant: fundamental terms used with other ontologies. Useful terms: creator, date, description, identifier, language, publisher.
  • LOC METS Profile: The Metadata Encoding and Transmission Standard (METS) is a metadata standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. The METS profile expresses the requirements that a METS document must satisfy. Why relevant: to describe and organise the components of a digital object. Useful terms: controlled_vocabularies, external_schema.
  • DCAT and DCAT-AP: A specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. Why relevant: enables the exchange of description metadata between data portals. Useful terms: downloadURL, accessURL, Distribution, Dataset, CatalogRecord.
  • Formex: A format for the exchange of data between the Publication Office and its contractors. In particular (but not only), it defines the logical markup for documents, which are published in the different series of the Official Journal of the European Union. Why relevant: useful for annotating archived items as well as for exchange purposes. Useful terms: Archived, Annotation, FT, Note.
  • ODP: Ontology describing the metadata vocabulary for the Open Data Portal of the European Union. Why relevant: to describe dataset portals. Useful terms: datasetType, datasetStatus, accrualPeriodicity, DatasetDocumentation.
  • LOC PREMIS: Used to describe preservation metadata. Why relevant: applicable to archives. Useful terms: ContentLocation, CreatingApplication, Dependency.
  • VIAF: Virtual International Authority File is an international service designed to provide convenient access to the world’s major name authority files (lists of names of people, organisations, places, etc. used by libraries); enables switching of the displayed form of names to the preferred language of a web user. Why relevant: useful for linking to name authority files and helping to serve different language communities in Europe. Useful terms: AuthorityAgency, NameAuthority, NameAuthorityCluster.

Table 1: Relevant Ontologies and Terms

Posted at 10:05

John Breslin: Tales From the SIOC-O-Sphere #10


SIOC is a Social Semantic Web project that originated at DERI, NUI Galway (funded by SFI) and which aims to interlink online communities with semantic technologies. You can read more about SIOC on the Wikipedia page for SIOC or in this paper. But in brief, SIOC provides a set of terms that describe the main concepts in social websites: posts, user accounts, thread structures, reply counts, blogs and microblogs, forums, etc. It can be used for interoperability between social websites, for augmenting search results, for data exchange, for enhanced feed readers, and more. It’s also one of the metadata formats used in the forthcoming Drupal 7 content management system, and has been deployed on hundreds of websites including Newsweek.com.

As part of our dissemination activities, I’ve tried to regularly summarise recent developments in the project so as to give an overview of what’s going on and also to help in connecting interested parties. It’s been much too long (over a year) since my last report, so this will be a long one! In reverse chronological order, here’s a list of recent applications and websites that are using SIOC:

  • SMOB Version 2. As you may have read on Y Combinator Hacker News yesterday, a re-architected and re-coded version of SMOB (Semantic Microblogging) has been created by Alex Passant. As with our original SMOB design, a user’s SMOB site stores and shares tweets and user information using SIOC and FOAF, but the new version also exposes data via RDFa and additional vocabularies (including the Online Presence Ontology, MOAT, Common Tag). The new SMOB suggests relevant URIs from DBpedia and Sindice when #hashtags are entered, and has moved from a client-server model to a set of distributed hubs. Contact @terraces.
  • on-the-wave. This script creates an enhanced browsing experience (that is SIOC-enabled) for the popular PTT bulletin board system. Contact kennyluck@csail.mit.edu.
  • Newsweek.com. American news magazine Newsweek are now publishing RDFa on their main site, including DC, CommonTag, FOAF and SIOC. Contact @markcatalano.
  • Linked Data from Picasa. OpenLink Software’s URI Burner can now provide Linked Data views of Google Picasa photo albums. See an example here. Contact @kidehen.
  • Facebook Open Graph Protocol. Facebook recently announced its Open Graph Protocol (OGP), which allows any web page to become a rich object in their social graph. While OGP defines its own set of classes and properties, the RDF schema contains direct mappings to existing concepts in FOAF, DBpedia and BIBO, and indirect mappings to concepts in Geo, vCard, SIOC and GoodRelations. OpenLink also have a data dictionary meshup of some OGP and SIOC terms (ogp:Blog is mapped to sioct:Weblog). Contact @daveman692.
  • Linked Data from Slideshare. A service to produce Linked Data from the popular Slideshare presentation sharing service has been created, and is available here. Data is represented in SIOC and DC. Contact @pgroth.
  • Fanhubz. FanHubz supports community building and discovery around BBC content items such as TV shows and radio programmes. It reuses the sioct:MicroblogPost term and also has some interesting additional annotation terms for in-show tweets (e.g. twitterSubtitles). Contact @ldodds.
  • RDFa-enhanced FusionForge. An RDFa-enhanced version of FusionForge, a software project management and collaboration system, has been created that generates metadata about projects, users and groups using SIOC, DOAP and FOAF. You can look at the Forge ontology proposal, and also view a demo site. Contact @olberger.
  • Falconer. Falconer is a Semantic Web search engine application enhanced with SIOC. It allows newly-created Social Web content to be represented in SIOC, but it also allows this content to be annotated with any semantic statements available from Falcons, and all of this data can then be indexed by the search engine to form an ecosystem of semantic data. Contact wu@seu.edu.cn.
  • Django to RDF. A script is available here to turn Django data into SIOC RDF and JSON. View the full repository of related scripts on github. Contact @niklasl.
  • SIOC Actions Module. A new SIOC module has been created to describe actions, with potential applications ranging from modelling actions in a developer community to tracing interactions in large-scale wikis. There is a SIOC Actions translator site for converting Activity Streams, Wikipedia interactions and Subversion actions into RDF. Contact @pchampin.
  • SIOC Quotes Module. Another SIOC module has been developed for representing quotes in e-mail conversations and other social media content. You can view a presentation on this topic. Contact @terraces.
  • Siocwave. Siocwave is a desktop tool for viewing and exploring SIOC data, and is based on Python, RDFLib and wxWidgets. Contact vfaronov@gmail.com.
  • RDFa in Drupal 7. Following the Drupal RDF code sprint in DERI last year, RDFa support (FOAF, SIOC, SKOS, DC) in Drupal core was committed to version 7 in October, and work has been apace on refining this module. Drupal 7 is currently on its fifth alpha version, and a full release candidate is expected later this summer. Find out more about the RDFa in Drupal initiative at semantic-drupal.com. Contact @scorlosquet.
  • Omeka Linked Data Plugin (Forthcoming). A plugin to produce Linked Data from the Omeka web publishing platform is in progress that will generate data using SIOC, FOAF, DOAP and other formats. Contact @patrickgmj.
  • Boeing inSite. inSite is an internal social media platform for Boeing employees that provides SIOC and FOAF data services as part of its architecture. Contact @adamboyet.
  • Virtuoso Sponger. Virtuoso Sponger is a middleware component of Virtuoso that generates RDF Linked Data from a variety of data sources (working as an “RDFizer”). It supports SIOC as an input format, and also uses SIOC as its data space “glue” ontology (view the slides). Contact @kidehen.
  • SuRF. SuRF is a Python library for working with RDF data in an object-oriented way, with SIOC being one of the default namespaces. Contact basca@ifi.uzh.ch.
  • Triplify phpBB 3. A Triplify configuration file for phpBB 3 has been created that allows RDF data (including SIOC) to be generated from this popular bulletin board system. Various other Triplify configurations are also available. Contact auer@informatik.uni-leipzig.de.
  • SiocLog. SiocLog is an IRC logging application that provides discussion channels and chat user profiles as Linked Data, using SIOC and FOAF respectively. You can see a deployment and view our slides. Contact @tuukkah.
  • myExperiment Ontology. myExperiment is a collaborative environment where scientists can publish their workflows and experiment plans, share them with groups and find those of others. In their model, myExperiment reuses ontologies like DC, FOAF, SIOC, CC and OAI-ORE. Contact drn@ecs.soton.ac.uk.
  • aTag. The aTag generator produces snippets of HTML enriched with SIOC RDFa and DBpedia-linked tags about highlighted items of interest on any web page, but aiming at the biomedical domain. Contact @matthiassamwald.
  • ELGG SID Module. A Semantically-Interlinked Data (SID) module for the ELGG educational social network system has been described that allows UGC and tags from ELGG platforms to become part of the Linked Data cloud. Contact @selvers.
  • Liferay Linked Data Module. The Linked Data module for Liferay, an enterprise portal solution, supports mapping of data to the SIOC, MOAT and FOAF vocabularies. Contact @bryan_.
  • ourSpaces. ourSpaces is a VRE enabling online collaboration between researchers from various disciplines. It combines FOAF and SIOC with data provenance ontologies for sharing digital artefacts. Contact r.reid@abdn.ac.uk.
  • Good Relations and SIOC. This post describes nicely how the Good Relations vocabulary for e-commerce can be combined with SIOC, e.g. to link a gr:Offering (either being offered or sought by a gr:BusinessEntity) to a natural-language discussion about that thing in a sioc:Post. Contact sdmonroe@gmail.com.
  • Debian BTS to RDF. Discussions from the Debian bug-tracking system (BTS) can be converted to SIOC and RDF and browsed or visualised in interesting ways, e.g. who replied to whom. Contact quang_vu.dang@it-sudparis.eu.
  • RDFex. For those wishing to reuse parts of popular vocabularies in their own Semantic Web vocabularies, RDFex is a mechanism for importing snippets from other namespaces without having to copy and paste them. RDFex can be used as a proxy for various ontologies including DC, FOAF and SIOC. Contact holger@knublauch.com.
  • IRC Logger with RDFa and SIOC. A fork of Dave Beckett’s IRC Logger has been created to include support for RDFa and SIOC by Toby Inkster. Contact mail@tobyinkster.co.uk.
  • mbox2rdf. A mbox2rdf script has been created that converts a mailing list in an mbox file to RDF (RSS, SIOC and DC). Contact mail@tobyinkster.co.uk.
  • Chisimba SIOC Export Module. A SIOC Export module for the Chisimba CMS/LMS platform has been created, which allows various Chisimba modules (CMS, forum, blog, Jabberblog, Twitterizer) to export SIOC data. Contact @paulscott56.
  • vBulletin SIOC Exporter. Omitted from the last report, the vBulletin SIOC plugin generates SIOC and FOAF data from vBulletin discussion forums. It includes a plugin that allows users to opt to export the SHA1 of their e-mail address (and other inverse functional properties) and their network of friends via vBulletin’s user control panel. Contact @johnbreslin.
  • Discuss SIOC on Google Wave. You can now chat about SIOC on our Google Wave.

Posted at 10:05

John Breslin: Book launch for "The Social Semantic Web"

We had the official book launch of “The Social Semantic Web” last month in the President’s Drawing Room at NUI Galway. The book was officially launched by Dr. James J. Browne, President of NUI Galway. The book was authored by myself, Dr. Alexandre Passant and Prof. Stefan Decker from the Digital Enterprise Research Institute at NUI Galway (sponsored by SFI). Here is a short blurb:

Web 2.0, a platform where people are connecting through their shared objects of interest, is encountering boundaries in the areas of information integration, portability, search, and demanding tasks like querying. The Semantic Web is an ideal platform for interlinking and performing operations on the diverse data available from Web 2.0, and has produced a variety of approaches to overcome limitations with Web 2.0. In this book, Breslin et al. describe some of the applications of Semantic Web technologies to Web 2.0. The book is intended for professionals, researchers, graduates, practitioners and developers.

Some photographs from the launch event are below.


Posted at 10:05

Copyright of the postings is owned by the original blog authors. Contact us.