Planet RDF

It's triples all the way down

March 19

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer prompted me to take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as in Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom found.

Different reasons might exist why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is: do systems support this (transparently), or are developers forced to code it into the application logic?

At the IaaS level, especially for file storage used in app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and concerning Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level paint an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive seem to not offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS and the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.
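
Where a platform only offers encryption as a building block rather than transparently, the application code has to request it explicitly. As a minimal sketch of what that can look like for back-end encryption with S3 server-side encryption, using the boto3 client (bucket and key names are purely illustrative):

import boto3

# Ask S3 to encrypt the object at rest with S3-managed keys (SSE-S3).
# Bucket and key names are placeholders.
s3 = boto3.client("s3")
with open("report.csv", "rb") as f:
    s3.put_object(
        Bucket="example-backend-bucket",
        Key="backups/report.csv",
        Body=f,
        ServerSideEncryption="AES256",
    )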

Posted at 20:12

March 17

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • OpenCalais gives you more free API calls out of the box than Zemanta (50,000 vs. 1,000 per day). You can get a free upgrade to 10,000 Zemanta calls via a simple email, though (or excessive API use; Andraž auto-upgraded my API limit when he noticed my crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated, as you have to encode certain parameters in an XML string; Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.

Extracting entities

  • In general, OpenCalais results are easier to use directly. They contain stable identifiers, and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Zemanta responses, in contrast, do not (yet, Andraž told me they are working on it) contain entity types at all. You always need an additional request to retrieve type information (unless you are doing nasty URI inspection, which is what I did with detected URIs from Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision ), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Posted at 09:09

Benjamin Nowack: Trice' Semantic Richtext Editor

In my previous post I mentioned that I'm building a Linked Data CMS. One of its components is a rich-text editor that allows the creation (and embedding) of structured markup.

An earlier version supported limited Microdata annotations, but now I've switched the mechanism and use an intermediate, but even simpler approach based on HTML5's handy data-* attributes. This lets you build almost arbitrary markup with the editor, including Microformats, Microdata, or RDFa. I don't know yet when the CMS will be publicly available (3 sites are under development right now), but as mentioned, I'd be happy about another pilot project or two. Below is a video demonstrating the editor and its easy customization options.

Posted at 09:09

March 16

AKSW Group - University of Leipzig: DBpedia Tutorial @ The Web Conference 2022

Dear all,

We are proud to announce that we will organize an online tutorial at the Web Conference on 25th of April 2022. A particular focus will be put on the DBpedia Infrastructure, i.e. DBpedia’s Databus publishing platform and the associated DBpedia services (Spotlight, Lookup and the DBpedia endpoints). In practical examples we will illustrate the potential and the benefit of using DBpedia in the context of the Web of Data.

Quick Facts

Style and Duration

We will organize a DBpedia Knowledge Graph hands-on tutorial. Although the tutorial will be shaped so that no specific prerequisites are required, participants will benefit from some background knowledge of Semantic Web concepts and technologies (RDF, OWL, SPARQL), a general overview of the Web architecture (HTTP, URI, JSON, etc.) and basic programming skills (bash, Java, JavaScript). The online tutorial will be 90 minutes long.

Tickets

Please register at the Web Conference page to be part of the masterclass. You need to buy a full access pass to join the DBpedia Tutorial. 

Organisation

  • Milan Dojchinovski, InfAI, DBpedia Association, CTU
  • Sebastian Hellmann, InfAI, DBpedia Association
  • Jan Forberg, InfAI, DBpedia Association
  • Johannes Frey, InfAI, DBpedia Association
  • Julia Holze, InfAI, DBpedia Association

We are looking forward to meeting you online!

Kind regards,

Julia

on behalf of the DBpedia Association

Posted at 13:05

March 15

John Breslin: Book launch for "The Social Semantic Web"

We had the official book launch of “The Social Semantic Web” last month in the President’s Drawing Room at NUI Galway. The book was officially launched by Dr. James J. Browne, President of NUI Galway. The book was authored by myself, Dr. Alexandre Passant and Prof. Stefan Decker from the Digital Enterprise Research Institute at NUI Galway (sponsored by SFI). Here is a short blurb:

Web 2.0, a platform where people are connecting through their shared objects of interest, is encountering boundaries in the areas of information integration, portability, search, and demanding tasks like querying. The Semantic Web is an ideal platform for interlinking and performing operations on the diverse data available from Web 2.0, and has produced a variety of approaches to overcome limitations with Web 2.0. In this book, Breslin et al. describe some of the applications of Semantic Web technologies to Web 2.0. The book is intended for professionals, researchers, graduates, practitioners and developers.

Some photographs from the launch event are below.

Posted at 02:05

John Breslin: Another successful defense by Uldis Bojars in November

Uldis Bojars submitted his PhD thesis entitled “The SIOC MEthodology for Lightweight Ontology Development” to the University in September 2009. We had a nice night out to celebrate in one of our favourite haunts, Oscars Bistro.

Jodi, John, Alex, Julie, Liga, Sheila and Smita

This was followed by a successful defense at the end of November 2009. The examiners were Chris Bizer and Stefan Decker. Uldis even wore a suit for the event, see below.

I will rule the world!

Uldis established a formal ontology design process called the SIOC MEthodology, based on an evolution of existing methodologies that have been streamlined, experience developing the SIOC ontology, and observations regarding the development of lightweight ontologies on the Web. Ontology promotion and dissemination is established as a core part of the ontology development process. To demonstrate the usage of the SIOC MEthodology, Uldis described the SIOC project case study which brings together the Social Web and the Semantic Web by providing semantic interoperability between social websites. This framework allows data to be exported, aggregated and consumed from social websites using the SIOC ontology (in the SIOC application food chain). Uldis’ research work has been published in 4 journal articles, 8 conference papers, 13 workshop papers, and 1 book chapter. The SIOC framework has also been adopted in 33 third-party applications. The Semantic Radar tool he initiated for Firefox has been downloaded 24,000 times. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Uldis all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.

Posted at 02:05

John Breslin: Haklae Kim and his successful defense in September

This is a few months late but better late than never! We said goodbye to PhD researcher Haklae Kim in May of this year when he returned to Korea and took up a position with Samsung Electronics soon afterward. We had a nice going-away lunch for Haklae with the rest of the team from the Social Software Unit (picture below).

Sheila, Uldis, John, Haklae, Julie, Alex and Smita

Haklae returned to Galway in September to defend his PhD entitled “Leveraging a Semantic Framework for Augmenting Social Tagging Practices in Heterogeneous Content Sharing Platforms”. The examiners were Stefan Decker, Tom Gruber and Philippe Laublet. Haklae successfully defended his thesis during the viva, and he will be awarded his PhD in 2010. We got a nice photo of the examiners during the viva which was conducted via Cisco Telepresence, with Stefan (in Galway) “resting” his hand on Tom’s shoulder (in San Jose)!

Philippe Laublet, Haklae Kim, Tom Gruber, Stefan Decker and John Breslin

Haklae created a formal model called SCOT (Social Semantic Cloud of Tags) that can semantically describe tagging activities. The SCOT ontology provides enhanced features for representing tagging and folksonomies. This model can be used for sharing and exchanging tagging data across different platforms. To demonstrate the usage of SCOT, Haklae developed the int.ere.st open tagging platform that combined techniques from both the Social Web and the Semantic Web. The SCOT model also provides benefits for constructing social networks. Haklae’s work allows the discovery of social relationships by analysing tagging practices in SCOT metadata. He performed these analyses using both Formal Concept Analysis and tag clustering algorithms. The SCOT model has also been adopted in six applications (OpenLink Virtuoso, SPARCool, RelaxSEO, RDFa on Rails, OpenRDF, SCAN), and the int.ere.st service has 1,200 registered members. Haklae’s research work was published in 2 journal articles, 15 conference papers, 3 workshop papers, and 2 book chapters. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Haklae all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.

Posted at 02:05

John Breslin: BlogTalk 2009 (6th International Social Software Conference) – Call for Proposals – September 1st and 2nd – Jeju, Korea

BlogTalk 2009
The 6th International Conf. on Social Software
September 1st and 2nd, 2009
Jeju Island, Korea

Overview

Following the international success of the last five BlogTalk events, the next BlogTalk – to be held in Jeju Island, Korea on September 1st and 2nd, 2009 – is continuing with its focus on social software, while remaining committed to the diverse cultures, practices and tools of our emerging networked society. The conference (which this year will be co-located with Lift Asia 09) is designed to maintain a sustainable dialog between developers, innovative academics and scholars who study social software and social media, practitioners and administrators in corporate and educational settings, and other general members of the social software and social media communities.

We invite you to submit a proposal for presentation at the BlogTalk 2009 conference. Possible areas include, but are not limited to:

  • Forms and consequences of emerging social software practices
  • Social software in enterprise and educational environments
  • The political impact of social software and social media
  • Applications, prototypes, concepts and standards

Participants and proposal categories

Due to the interdisciplinary nature of the conference, audiences will come from different fields of practice and will have different professional backgrounds. We strongly encourage proposals to bridge these cultural differences and to be understandable for all groups alike. Along those lines, we will offer three different submission categories:

  • Academic
  • Developer
  • Practitioner

For academics, BlogTalk is an ideal conference for presenting and exchanging research work from current and future social software projects at an international level. For developers, the conference is a great opportunity to fly ideas, visions and prototypes in front of a distinguished audience of peers, to discuss, to link-up and to learn (developers may choose to give a practical demonstration rather than a formal presentation if they so wish). For practitioners, this is a venue to discuss use cases for social software and social media, and to report on any results you may have with like-minded individuals.

Submitting your proposals

You must submit a one-page abstract of the work you intend to present for review purposes (not to exceed 600 words). Please upload your submission along with some personal information using the EasyChair conference area for BlogTalk 2009. You will receive a confirmation of the arrival of your submission immediately. The submission deadline is June 27th, 2009.

Following notification of acceptance, you will be invited to submit a short or long paper (four or eight pages respectively) for the conference proceedings. BlogTalk is a peer-reviewed conference.

Timeline and important dates

  • One-page abstract submission deadline: June 27th, 2009
  • Notification of acceptance or rejection: July 13th, 2009
  • Full paper submission deadline: August 27th, 2009

(Due to the tight schedule we expect that there will be no deadline extension. As with previous BlogTalk conferences, we will work hard to endow a fund for supporting travel costs. As soon as we review all of the papers we will be able to announce more details.)

Topics

Application Portability
Bookmarking
Business
Categorisation
Collaboration
Content Sharing
Data Acquisition
Data Mining
Data Portability
Digital Rights
Education
Enterprise
Ethnography
Folksonomies and Tagging
Human Computer Interaction
Identity
Microblogging
Mobile
Multimedia
Podcasting
Politics
Portals
Psychology
Recommender Systems
RSS and Syndication
Search
Semantic Web
Social Media
Social Networks
Social Software
Transparency and Openness
Trend Analysis
Trust and Reputation
Virtual Worlds
Web 2.0
Weblogs
Wikis

Posted at 02:05

March 06

Egon Willighagen: Contributions to two new papers: skin lipids and AOPs

Figure 2 of the AOP paper, showing the content of the AOP-DB.
Because people are counting, it is hard to decline an invitation to contribute to a paper. Honestly, I would rather be invited to contribute to research, but as I was told in the distant past: writing papers and grant proposals makes you think better about your research. Mmm, yes, because without those I never thought about my research. Anyway, these cross-project collaborations are really nice, and I thank the leading authors for our joint efforts. I keep thinking we could achieve the same with a properly crafted pull request, if only those were recognized and rewarded too. So, here I post some recent papers to which I, or people paid from my research money (yeah, another thing we should discuss), have contributed.

Skin lipids

The first such paper is a review of the state of studying lipids in our skin: Research Techniques Made Simple: Lipidomic Analysis in Skin Research (doi:10.1016/j.jid.2021.09.017). This paper originates from WG4 of the EpiLipidNet project (funded by COST), where we are developing molecular pathways involving lipids. Florian Gruber is chair of WG4 and our role is the pathways. We have set up lipids.wikipathways.org for this, and the article further mentions the pathway and network systems biology approaches used in our group.

Adverse Outcome Pathways

Combining Adverse Outcome Pathways (AOPs) with molecular pathways is one of the longer-running research lines in our group. This already started during the eNanoMapper project, and Marvin has been studying alternative approaches in his PhD project funded by OpenRiskNet and EU-ToxRisk. During OpenRiskNet he collaborated with the team of Holly Mortensen of the US EPA, resulting in some collaborative projects. One outcome is the recent EPA paper on an RDF version of their AOP-DB: The AOP-DB RDF: Applying FAIR Principles to the Semantic Integration of AOP Data Using the Research Description Framework (doi:10.3389/ftox.2022.803983).

Thanks again to everyone involved in these papers for these nice collaborations!

Posted at 09:52

March 03

Libby Miller: Sock-puppet – an improved, simpler presence robot

Makevember and lockdown have encouraged me to make an improved version of libbybot, which is a physical version of a person for remote participation. I’m trying to think of a better name – she’s not all about representing me, obviously, but anyone who can’t be somewhere but wants to participate. [update Jan 15: she’s now called “sock_puppet”].

This one is much, much simpler to make, thanks to the addition of a pan-tilt hat and a simpler body. It’s also more expressive thanks to these lovely little 5×5 LED matrices.

Her main feature is that – using a laptop or phone – you can see, hear and speak to people in a different physical place to you. I used to use a version of this at work to be in meetings when I was the only remote participant. That’s not much use now of course. But perhaps in the future it might make sense for some people to be remote and some present.

New recent features:

  • easy to make*
  • wears clothes**
  • googly eyes
  • expressive mouth (moves when the remote participant is speaking, can be happy, sad, etc, whatever can be expressed in 25 pixels)
  • can be “told” wifi details using QR codes
  • can move her head a bit (up / down / left / right)

* ish
**a sock

I’m still writing docs, but the repo is here.

Libbybot-lite – portrait by Damian

Posted at 19:12

February 24

Leigh Dodds: Assessing data infrastructure: the Digital Public Goods standard and registry

This is the second in a short series of posts in which I’m sharing my notes and thoughts on a variety of different approaches for assessing data infrastructure and data institutions.

The first post in the series looked at The Principles of Open Scholarly Infrastructure.

In this post I want to take a look at the Digital Public Good (DPG) registry developed by The Digital Public Goods Alliance.

What are Digital Public Goods?

The Digital Public Goods Alliance define digital public goods as:

open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals (SDGs)

Digital Public Goods Alliance

While the link to the Sustainable Development Goals narrows the field, this definition still encompasses a very diverse set of openly licensed resources.

Investing in the creation and use of DPGs was one of eight key actions in the UN Roadmap for Digital Cooperation published in 2020.

What is the Digital Public Goods Standard?

The Digital Public Goods Standard consists of 9 indicators and requirements that are used to assess whether a dataset, AI model, standard, software package or content can be considered a DPG.

To summarise, the indicators and requirements cover:

  • relevance to the Sustainable Development Goals
  • openness: open licensing, clarity over ownership of the resource and access to data (in software systems)
  • reusability: platform independence and comprehensive documentation, in addition to open licensing
  • use of standards and best practices
  • minimising harms, with the ninth “Do No Harm by Design” principle decomposed into data privacy and security, policies for handling inappropriate and illegal content, and protection from harassment

In contrast to the Principles of Open Scholarly Infrastructure, which defines principles for infrastructure services (i.e. data infrastructure and data institutions) the Digital Public Goods Standard can be viewed as focusing on the outputs of that infrastructure, e.g. the datasets that they publish or the software or standards that they produce.

But assessing a resource to determine if it is a Digital Public Good inevitably involves some consideration of the processes by which it has been produced.

A recent Rockefeller Foundation report on co-developing Digital Public Infrastructure endorsed by the Digital Public Goods Alliance, highlights that Digital Public Goods might also be used to create new digital infrastructure. E.g. by deploying open platforms in other countries or using data and AI models to build new infrastructure.

So Digital Public Goods are produced by, used by, and support the deployment of data and digital infrastructure.

How was the Standard developed?

The Digital Public Goods Standard was developed by the Digital Public Goods Alliance (DPGA), “a multi-stakeholder initiative with a mission to accelerate the attainment of the sustainable development goals in low- and middle-income countries by facilitating the discovery, development, use of, and investment in digital public goods”.

An early pilot of the standard was developed to assess Digital Public Goods focused on Early Grade Reading. The initial assessment criteria were developed by a technical group that explored cross-domain indicators and an expert group that focused on topics relevant to literacy.

This ended up covering 11 categories and 51 different indicators.

The results of that pilot were turned into the initial version of the DPG Standard, published in September 2020. In that process the 51 indicators were reduced down to just 9.

It is interesting to see what was removed, for example:

  • Utility and Impact — whether the Digital Public Good was actually in use in multiple countries
  • Product Design — whether there’s a process for prioritising and managing feature requests
  • Product Quality — accessibility statements and testing, version control, multi-lingual support
  • Community — code of conducts, community management
  • Do No Harm — security audits, data minimisation
  • Financial Sustainability — are there revenue streams that support continual development of the public good?

The process of engaging with domain experts has continued, with the DPGA developing Communities of Practice that have produced reports highlighting key digital public goods in specific domains. An example of what we called “data landscaping” at the ODI.

How are Digital Public Goods assessed?

The assessment process is as follows:

  1. The owner of an openly licensed resource uses an eligibility tool to determine whether their resource is suitable for assessment
  2. If eligible, the owner will submit a nomination. The submission process involves answering all of these questions
  3. If accepted, a nominated resource will be listed in the public registry
  4. Nominated submissions will be further reviewed by the DPGA team in order to complete the assessment, at which point the nomination is marked as a Digital Public Good

While nominations can be made by third-parties, some indicators are only assessed based on evidence provided directly by the publisher of the resource.

At the time of writing there are 651 nominees and 87 assessed public goods in the registry. The list of Digital Public Goods consists of the following (items can be in multiple categories):

Category    Count
Software       68
Content        17
Data            8
AI Model        4
Standard        4

Distribution of Digital Public Goods catalogued at https://digitalpublicgoods.net/registry/ on 24th February 2022

It’s worth noting that several of the items in the “Data” category are actually APIs and services.

The assessment of a verified Digital Public Good is publicly included in the registry. For example, here is the recently published assessment of the Mozilla Common Voice dataset. However, all of the data supporting the individual nominations can be found in this public repository.

The documentation and the submission guide explain that the benefits of becoming a Digital Public Good include:

  • increased adoption or use
  • discoverability, promotion and recognition within development agencies, the UN and governments
  • in the future — additional branding opportunities through use of icons or brand marks
  • in the future — potential to be included in recommendations to government procurers and funding bodies
  • in the future — additional support, e.g. mentoring and funding

Indirectly, by providing a standard for assessment, the DPGA will be influencing the process by which openly licensed resources might be created.

Could the Standard be used in other contexts?

Is the Standard useful as a broader assessment tool, e.g. for projects that are not directly tied to the SDGs? Or for organisations looking to improve their approach to publishing open data, open source or machine-learning models?

I think the detailed submission questions provide some useful points of reflection.

But I think the process of trying to produce a single set of assessment criteria that covers data, code, AI models and content means that useful and important detail is lost.

Even trying to produce a single set of criteria for assessing (open) datasets across domains is difficult. We tried that at the ODI with Open Data Certificates. Others are trying to do this now with the FAIR data principles. You inevitably end up with language like “using appropriate best practices and standards” which is hard to assess without knowledge of the context in which data is being collected and published.

I also think the process of winnowing down the original 51 indicators to just 9 (plus some supporting questions) has lost some important elements. Particularly around sustainability.

Again, in Open Data Certificates, we asked questions about longer-term sustainability and access to data. This also seems highly relevant in the context in which the DPGA are operating.

I think the standard might have been better having separate criteria for different types of resource and then directly referencing existing criteria (e.g. FAIR data assessment tools for data) or best practices (e.g. Model Cards for Model Reporting for AI), etc.

Posted at 14:07

February 23

Leigh Dodds: Assessing data infrastructure: the Principles of Open Scholarly Infrastructure

How do we create well-designed, trustworthy, sustainable data infrastructure and institutions?

This is a question that I remain deeply interested in. Much of the freelance work I’ve been doing since leaving the ODI has been in that area. For example, I’m currently helping with a multi-year evaluation of a grant-funded data institution.

I’m particularly interested in frameworks and methodologies for assessing infrastructure and institutions. With a view towards helping them become more open, more trustworthy and more sustainable.

This is the first in a series of blog posts looking at some of this existing work.

What are the Principles of Open Scholarly Infrastructure?

The Principles of Open Scholarly Infrastructure (POSI) consist of 16 principles grouped into three themes: governance, sustainability and insurance.

The seven Governance principles touch on how the infrastructure will be governed and managed. These highlight the need for, e.g. stakeholder led governance, transparency, and the need to plan across the entire lifecycle of the infrastructure, including its wind-down.

The five Sustainability principles highlight the need for revenue generation to align with the mission of the organisation, and emphasise generating revenue from services rather than data. They also highlight the need to generate a surplus and to find long-term sources of revenue, rather than relying on grant funding.

The four Insurance principles centre openness: open source and open data, as well as IP issues. In short, ensuring that if the infrastructure fails (or loses the trust of its community) its core assets can be reused.

How was it developed?

The principles were first presented in a 2015 blog post. They attempted to codify a set of rules and norms that were already informing the operations of CrossRef. This was prompted by a growing distrust of infrastructure services within the scholarly community.

Reliance on time-limited grant funding was impacting sustainability and reliability of services, alongside growing concerns over commercial ownership of key services.

Since then the principles have been discussed and adopted by a number of others.

What type of infrastructure and institutions does it focus on?

POSI is intended to help guide the development and operations of infrastructure that supports any kind of scholarly activities (e.g. both research and teaching) across a range of domains (e.g. both the sciences and humanities).

This includes infrastructure services that support the management and publication of research data and metadata, scholarly research archives, identifier schemes, etc.

The FAQ highlights that the principles were also intended to help support procurement and comparison of different services.

Could the principles be adopted in other contexts?

Any set of principles will reflect the priorities of the community that produced them. Care should be taken before blindly applying principles from one context to another. Some issues might be foregrounded that are less important in the new context, while other important concerns might not be adequately centred.

See my post on the FAIR data principles for more on that.

However, I think much of POSI is applicable to all types of open infrastructure. Good governance and sustainability is important in any context. Open data and open source also play a fundamental role.

However there are some elements that might not apply in all contexts, or might be presented differently. And others which might be missing. For example:

  • The governance principle “coverage across the research enterprise” is clearly domain specific. Although there may be a broader formulation that focuses on governance that respects the entire ecosystem in which the infrastructure exists
  • The principles state that the infrastructure “cannot lobby” — I think in some contexts organisations operating infrastructure services might want to, or already see themselves as involved in, driving regulatory change? E.g. to help to encourage strong data protection laws, or create greater transparency within a sector
  • The foregrounding of limited use of time-limited funds clearly reflects the origins of the principles in a community that often relies on grant funding. This may be less of an issue in other contexts
  • As an insurance policy, open source and open data allows systems to be forked or cloned. But open source and open data might underpin other areas of how an infrastructure operates, e.g. security or collaboration. So they might fulfil a different role in other communities
  • There are no principles that directly foreground issues of privacy or ethics — infrastructure services in other sectors might reasonably want clearer statements on privacy, inclusion and responsible use of data

The POSI FAQ is also worth reading as it includes a number of clarifications about the scope and intent behind some of the principles.

In short, I think it would be useful to compare POSI with approaches originating in other sectors, in order to identify common themes. Within the research space, IOI is also exploring ways to assess and compare infrastructure.

How have organisations adopted the principles?

A range of organisations have adopted the principles, most recently Europe PMC.

No organisation meets all of the principles.

But the intention isn’t that an organisation should comply with all of them before doing an assessment. An assessment is intended to prompt reflection and development of a plan for improvement.

There is an assumption that organisations will regularly reassess themselves against the principles.

All of the existing assessments have taken the same broad approach, replicating that used by CrossRef:

  • a public statement that the organisation has decided to adopt the principles
  • a high-level self-assessment against the principles, using a Red/Amber/Green (RAG) rating for each item on the list
  • for each principle, a more detailed discussion of how the organisation has implemented that principle or has plans to do so
  • links to relevant evidence, e.g. public policies or governance documents that back up the assessment

I’ve produced a public spreadsheet listing the current RAG ratings for each organisation.

To help with future assessments, I think there’s scope to:

  • produce a common openly licensed template that can be used to publish the self-assessment ratings and capture relevant links
  • provide some additional guidance about how to assess and interpret each principle
  • suggest the types of evidence that could be referenced, or published, to help make the assessments verifiable

Posted at 14:07

February 22

Leigh Dodds: Reflecting on “How to Do Nothing”

I recently finished reading “How to Do Nothing” by Jenny Odell. It’s a great, thought-provoking read.

Despite the title the book isn’t a treatise on disconnecting or a guide to mindfulness. It’s an exploration of attention: what is it? how do we direct it? can it be trained? And how is it hijacked by social media?

While the book makes a powerful case for the importance of stepping away from technology to reconnect with our local environments and communities, Odell’s recommendation isn’t that we should just disconnect from social media or technology. Her argument is that we need to redesign the ways that we connect with one another. Reframing rather than disengaging.

She illustrates the power of redirecting our attention with examples from birdwatching, art and music.

Having spent much of last summer, identifying the bees in my garden, I’m very aware of how a change in focus can suddenly bring a small corner of the world to life.

In her discussion of social media, Odell touches on context collapse. But she also highlights how social feeds themselves lack context. There is no organising principle of geography, theme or community to that tumble of posts. This leaves us endlessly scrolling, searching for meaning, our attention (and emotions) flitting from one thing to the next.

This crystallised for me my recent frustrations with Twitter: no matter how well I curate my list of followers, what I see is rarely what I’m looking for in the moment. Feeds lack structure and there’s no real way for me to assert any control over them.

It’s also why I uninstalled Tik Tok when I realised that an endless scroll of algorithmically recommended content was the only real way to engage.

Odell argues that we need new frameworks for connecting online, touching on Community Memory and decentralised systems like Mastodon. Systems that build or provide context across communities.

One of the first articles I read after finishing “How To Do Nothing” was a post called “What using RSS feeds feels like” by Giles Turnbull. It neatly describes the flexibility that using a Feed Reader provides. They can support us in focusing our attention on the things we enjoy. While not a social space, they can connect to them.

The broader themes of “How To Do Nothing” are the importance of (re)connecting to our local communities and environment so that we can deal with the big challenges ahead of us. And perhaps as an antidote to an increasingly polarised society.

This really challenged me to reflect on what it would mean to reduce my use of social media. What would fill that space both online and off? I’m not sure of the answer to that yet.

But I feel like social spaces — at least the ones I currently use, at any rate — have become less fulfilling. Prioritising connectivity over context. I’d like to find some different options.

Posted at 13:05

February 19

Leigh Dodds: How could watermarking AI help build trust?

I’ve been reading about different approaches to watermarking AI and the datasets used to train them.

This seems to be an active area of research within the machine learning community. But, of the papers I’ve looked at so far, there hasn’t been much discussion of how these techniques might be applied and what groundwork needs to be done to achieve that.

This post is intended as a brief explainer on watermarking AI and machine-learning datasets. It also includes some potential use cases for how these might be applied in real-world systems to help build trust, e.g. by supporting audits or other assurance activities.

Please leave any comments or corrections if you think I’ve misrepresented something.

What is watermarking?

There are techniques that allow watermarking of digital objects, like images, audio clips and video. It’s also been applied to data.

Sometimes the watermark is visible. Sometimes it’s not.

Watermarking is frequently used as part of rights management, e.g. to track the provenance of an image to support copyright claims.

There are sophisticated techniques to apply hidden watermarks to digital objects in ways that resist attempts to remove them.

Watermarking involves adding a message, logo, signature or some other data to something in order to determine its origin. There’s a very long history of watermarking physical objects like bank notes and postage stamps.

How can watermarking be applied to AI and machine-learning datasets?

Researchers are currently exploring ways to apply watermarking techniques to machine-learning models and the data used to produce them.

There are broadly two approaches:

  • Model watermarking – adding a watermark to a machine-learning model so that it becomes possible to detect whether that specific model has been used to generate a prediction.
  • Dataset watermarking – invisibly modifying a training dataset so that it becomes possible to detect whether a machine-learning model has been trained on that dataset

There are various ways of implementing and using these approaches.

For example, a model can be watermarked by:

  • Injecting some modified data into the training dataset, so that it changes the model in ways that can later be detected (a rough sketch of this approach follows below)
  • Adjusting the weights of the model during or after training, in ways that can later be detected
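
As a rough sketch of the first of those approaches (a backdoor-style trigger set), the snippet below stamps a fixed patch onto a few training images and relabels them to a chosen target class; mixing these into the training data embeds the watermark, and the trained model’s behaviour on the same trigger inputs later verifies it. This is only an illustration: it assumes image data in a NumPy array and a black-box prediction function, and all names and thresholds are made up for the example.

import numpy as np

def make_trigger_set(images, trigger_label, n=100, patch_value=1.0, patch_size=4, seed=0):
    """Copy n training images, stamp a small fixed patch in one corner and
    relabel them all to trigger_label. Mixing these into the training data
    is one (backdoor-style) way of embedding a watermark in the model."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=n, replace=False)
    triggers = images[idx].copy()
    triggers[:, :patch_size, :patch_size] = patch_value  # the "mark"
    labels = np.full(n, trigger_label)
    return triggers, labels

def verify_watermark(predict_fn, triggers, trigger_label, threshold=0.9):
    """Check whether a model behaves as the watermark predicts: a clean model
    should rarely map the trigger inputs to trigger_label, a watermarked one
    almost always will. Only predictions are needed, not the model weights."""
    preds = np.asarray(predict_fn(triggers))
    hit_rate = float(np.mean(preds == trigger_label))
    return hit_rate >= threshold, hit_rate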

Dataset watermarking assumes that the publisher of the dataset isn’t involved in the later training of an AI, so it relies on adjusting the training dataset only. It’s a way of finding out how a model was produced, whereas model watermarking allows the detection of a specific model when it is deployed.

Dataset watermarking requires new techniques to be developed because existing watermarking approaches don’t work in a machine-learning context.

For example, when training an image classification model, any watermarks present in the training images will just be discarded; the watermarks are not relevant to the learning process. To be useful, watermarking a machine-learning dataset involves modifying the data in ways that are consistent with the labelling, so that they induce changes in the model that can later be detected.

In this context then, dataset watermarking is a technique that is specifically intended to apply to machine-learning datasets: labelled datasets intended to be used in machine-learning applications and research. It’s not a technique you would just apply to a random dataset published to a government data portal.

Importantly, it’s possible to verify either a model or dataset watermark without having direct access to the model, e.g. to determine where a model came from or whether it was trained on a specific dataset. The watermark can be checked by inspecting the output of the model in response to specific inputs that are designed to expose it.
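
To make that concrete with the sketch above, verification only needs the model’s predictions on the saved trigger inputs, so it works even when the model sits behind an API; the model object here is a placeholder:

# `triggers` is the trigger set saved when the watermark was embedded;
# `suspect_model` stands in for whatever model is being checked.
is_marked, hit_rate = verify_watermark(lambda x: suspect_model.predict(x),
                                       triggers, trigger_label=7)
print(f"watermark detected: {is_marked} (trigger hit rate: {hit_rate:.2f})")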

If you want a more technical introduction to model and dataset watermarking, then I recommend starting with these papers:

These techniques are related to areas like “data poisoning” (modifying training data to cause defects in a model) and “membership inference” (determining whether some sample data was used to train a model, usually studied as a privacy concern).

How might watermarking be used in real systems?

In their blog introducing the concept of “Radioactive data” the Facebook research team suggest that the technique:

“…can help researchers and engineers to keep track of which dataset was used to train a model so they can better understand how various datasets affect the performance of different neural networks.”

Using ‘radioactive data’ to detect if a dataset was used for training

They later expand on this a little:

Techniques such as radioactive data can also help researchers and engineers better understand how others in the field are training their models. This can help detect potential bias in those models, for example. Radioactive data could also help protect against the misuse of particular datasets in machine learning.

Using ‘radioactive data’ to detect if a dataset was used for training

The paper on “Open Source Dataset Protection” suggests it would be a useful way to confirm that commercial AI models have not been trained on datasets that are only licensed for academic or educational use.

I’ve yet to see a more specific set of use cases, so I’d like to suggest some.

I think all of the below are potential uses that align with the capabilities of the existing techniques. Future research might open up other potential uses.

Model watermarking

  • As a government agency, I want to verify that a machine-learning model used in a product I have procured is the same model that has been separately assessed against our principles for responsible data practices, so that I can be sure that the product will operate as expected
  • As a civil society organisation, I want verification that a model making decisions that impact my community, is the same model that has been independently assessed via an audit, so that I can be more certain about its impacts
  • As a regulator, I want to verify whether a commercial organisation has deployed a specific third-party machine-learning model, so that I can warn them about its biases, certify the product, or issue a recall notice

Dataset watermarking

  • As a civil society organisation, I want to determine whether a machine-learning model has been trained on biased or incorrect data, so that I can warn consumers
  • As a trusted steward of data, I want to determine whether a machine-learning model has been trained on data that I have supplied, so that I can take action to protect the rights of those represented in the data
  • As a publisher of data, I want to determine whether a machine-learning model has been trained on an earlier version of a dataset, so that I can warn users of that model about known biases or errors
  • As a regulator, I want to determine which datasets are being used in machine-learning models, so that I can prioritise which datasets to audit for potential bias, privacy or ethical issues

There are undoubtedly a lot more of these. In general, watermarking can help us determine which model is being used in a service and which dataset(s) were used to train it.

For some of the use cases, there are likely to be other ways to achieve the same goal. Regulators could directly require companies to inform them about sources of data they are using, rather than independently checking for watermarks.

But sometimes you need multiple approaches to help build trust. And I wanted to flesh out a list of possible uses that were outside of research and which were not about IP enforcement.

What do we need to do to make this viable?

Assuming that these types of watermarking techniques are useful in the ways I’ve suggested, there are a range of things required to build a proper ecosystem around them.

All of the hard work of creating the appropriate standards, governance and supporting infrastructure needed to make them work.

This includes:

  • Further research to identify and refine the watermarking techniques so that the community can converge on common standards for different types of dataset
  • Developing common standards for integrating watermarking into the curation and publishing of training datasets. This includes introducing the watermarks into the data, production of suitable documentation and publication of data required to support verification
  • Developing common standards for integrating watermarking steps into the training and publishing of machine-learning models. This includes introducing the watermarks into the training process, production of suitable documentation and publication of data required to support verification
  • Developing a registry of watermarks and watermarking data to support independent verification.
  • Development of tools to support independent verification of watermarks by organisations carrying out audits and other assurance activities
  • ..etc, etc

Posted at 16:05

Leigh Dodds: Downloading magazines from the Internet Archive (and making gifs from their covers)

I like reading old magazines and books over at the Internet Archive.

They’ve got a great online reader that works just fine in the browser. But sometimes I want a local copy I can put on my tablet or other device. And reading locally saves them some bandwidth.

Downloading individual items is simple, but it can be tedious to grab multiple items. So here’s a quick tutorial on automatically downloading items in bulk and then doing something with them.

The IA command-line tool

The Archive provide an open API to their collections and a command-line tool that uses that API to let you access metadata, and upload and download content.

The Getting Started guide has plenty of examples and installation instructions for Unix systems. I also found some Windows instructions.

Finding the collection identifier

The Archive organises items into “collections”. The issues of a magazine will be organised into a single collection.

There are also collections of collections. The Magazine Rack collection is a great entry point into a whole range of magazine collections, so it’s a good starting point to explore if you want to see what the Archive currently holds.

To download every issue of a magazine you just need to first identify the name of its collection.

The easiest way to do that is to take the identifier from the URL. E.g. the INPUT magazine collection, has the following URL:

https://archive.org/details/inputmagazine

The identifier is the last part of the URL (“inputmagazine”).

You can also click on the “About” section of the collection and look at its metadata. The Identifier is listed there.

Metadata for the INPUT magazine collection. Note the Identifier

Downloading the collection

Assuming you have the ia tool installed, the following command-line will let you download the contents of a named collection. Just change the identifier from “inputmagazine” to the one you want.

ia download --search 'collection:inputmagazine' --glob="*.pdf"

The “glob” parameter asks the tool to only download PDFs (“files that have a .pdf extension”). If you don’t do this then you will end up downloading every format that the Archive holds for every item. You almost certainly don’t want that. It’s slow and uses up bandwidth.

If you’re downloading to put the content on an ereader or Kindle, then you could use “*.epub” or “*.mobi” instead.

When downloading the files, the ia tool will put each one into a separate folder.

And that’s it: you now have your own local copy of all the magazines. You can use the same approach to download any type of content, not just magazines.
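
The same API that sits behind the ia tool is also available as a Python library (the internetarchive package), which is handy if you want to script the download as part of something bigger. A rough sketch is below; the function and parameter names are written from memory of that library, so treat them as assumptions and check them against its documentation:

from internetarchive import download, search_items

# Fetch only the PDFs for every item in the collection, mirroring the
# ia download --search 'collection:inputmagazine' --glob="*.pdf" command.
for result in search_items("collection:inputmagazine"):
    download(result["identifier"], glob_pattern="*.pdf")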

Now to do something with them.

Extracting the covers

Magazines usually have great cover art. I like turning them into animated GIFs. Here’s one I made for Science for the People. And another for INPUT magazine.

To do that you need to do two things:

  1. Extract the first page of the magazine from the PDF, saving it as an image
  2. Compile all of those images into an animated gif

To extract the images, I use the pdftoppm tool. This is also cross-platform so should work on any system.

The following command will extract the first page of a file called example.pdf and save it into a new file called example-01.jpg.

pdftoppm -f 1 -l 1 -jpeg -r 300 example.pdf example

See the documentation page for more information on the parameters.

Having downloaded an entire collection using the ia tool, you will have a set of folders each containing a single PDF file. Here’s a quick bash script that you can run from the folder where you downloaded the content.

It will find every PDF you’ve downloaded, then use pdftoppm to extract the first page, storing the images in a separate “images” directory.

#!/bin/bash

# Extract the first page of every downloaded PDF into images/ as a JPEG.
mkdir -p images
i=0
for FILE in **/*.pdf;
do
  i=$((i + 1))
  echo "$FILE"
  # -f 1 -l 1: first page only; -jpeg: JPEG output; -r 300: render at 300 DPI
  pdftoppm -f 1 -l 1 -jpeg -r 300 "$FILE" images/issue-$i
done

Creating a GIF from the covers

Finally, to create a GIF from those JPEG files I use the ImageMagick convert tool.

If you create an animated GIF from a lot of files, especially at higher-resolutions, then you’re going to end up with a very large file size. So I resize the images when creating the GIF.

The following command will find all the images we’ve just created, resize them by 25% and turn them into a GIF that will loop forever.

convert -resize 25% -loop 0 `ls -v images/*.jpg` input.gif

The ls command gives us a listing of the image files in a natural order. So we get the issues appearing in the expected sequence.

You can add other options to tweak the delay between frames of the GIF. Or change the loop variable to only do a fixed number of loops.

If you want to post the GIF to Twitter then there is a maximum 15MB file size via the web interface. So you may need to tweak the resize command further.
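For example, something along these lines (the numbers are just a starting point, and input-small.gif is an arbitrary output name) adds roughly a one-second pause between frames, stops after a few loops and shrinks the covers further:

convert -delay 100 -resize 15% -loop 3 `ls -v images/*.jpg` input-small.gif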

Happy reading and animating.

Posted at 12:05

February 18

Leigh Dodds: The COVID is coming from inside the house

We’re two years into the COVID-19 pandemic and I still keep having moments of “Holy shit, we’re in a global pandemic“.

We’ve all been through so many emotions. And there’s more to come. But it still seems surreal at times.

I say that not to deny or dismiss what is happening. It’s just a lot to deal with at times.

Both my daughter and wife now have COVID. They’re doing fine thankfully. We’re all isolating in separate parts of the house. We’re lucky to be able to do that. Lucky that I work from home and can look after them.

But that surreal feeling has been particularly acute this week.

Across the planet, a virus mutated. It infected someone. It’s been passed from person to person across the globe until now it’s inside the house. Inside the people who live here. At some point it will be inside me.

There’s an extra visceral feeling from knowing that the virus is inside these four walls. It feels different to knowing it is out there.

That process of transmission happens all the time. It’s how I’ve caught every cold I’ve ever had. It’s a fact of life. But, like so many other facts of our lives, it’s one that doesn’t always get the attention it deserves.

We’re lucky to have vaccines for COVID-19. We’re lucky to have been triple-vaccinated. I knew that we’d catch it at some point. I’m so glad it’s now, after we’ve been vaccinated. Many other people have not been so lucky.

Although it’s privilege really, rather than luck.

I assume at least some of the denial that I hear that “things will go back to normal” stems from other people also feeling that the current situation is weird, unusual, temporary. That “it will all be over soon”. But that’s not going to happen.

Things have changed now for good. I know that. Doesn’t stop me from occasionally having these “Holy shit” moments though.

Posted at 17:05

Leigh Dodds: Remembering INPUT magazine

In 1984 a new magazine hit the racks in W. H. Smiths: INPUT.

It offered to help you to “Learn programming for fun and the future” via a weekly course in programming and computing. It ran for a total of 52 issues.

I was 12 when it came out. And I collected every issue.

INPUT gave me my first proper introduction to programming. The ZX Spectrum user guide and the BASIC programming manual were useful references and supplied some example programmes. (I played a lot of Pangolins).

But it was INPUT that taught me more than just the basics and introduced me to a whole range of new topics.

Sadly, I got rid of my copies, along with all of my other 80s computer magazines many years ago. Happily, the full collection of INPUT is available on the Internet Archive.

I had a dig through it again recently. They covered a surprising range of material. Both simple and advanced programming, as well as how the hardware worked.

All of the code was provided for a range of desktop computers. I always found it interesting to compare the different versions of BASIC. I even read the C64-only articles, which were usually showing off its more advanced features.

I never did get the hang of Assembly though.

The magazine included some longer tutorials that built up a programme or covered a topic over a number of issues.

One of my favourites started in Issue 9. A five part series on how to write text adventure games. I spent a lot of time playing at making my own games.

A pretty accurate depiction of the inside of my brain in the 80s. There were a lot of tapes involved. And graph paper

Looking back now, I can see that one of the biggest impacts it had on me came from a listing in Issues 2 and 3. These two articles introduced a simple filing system with basic record management and search functions. A little database to help people track their hobbies.

I used it to catalogue all the games I copied off my mates.

I made a database for my Dad to keep track of his pigeons. How many races they’d won. And which ones he’d been breeding. He never used it. But I had fun designing it.

Most of my career has involved working with data and databases. INPUT gave me not just an introduction to programming, but my first introduction to that whole topic.

There’s a line that starts with me working with that BASIC code on the ZX Spectrum which continues all the way to the present day.

During the second year of my biology degree I took a computing module. As part of that I wrote a Pascal programme to simulate the activity of restriction enzymes. It sounds fancy, but it was just string manipulation.

Looking at that code, it’s clearly informed by what I had learned from those INPUT articles. It had a simple menu system to access different functions. I even wrote some tools to help me create the data files that drove the application, so I could manage those records separately.

I had a lot of fun writing it.

I described that Pascal programme in the interview I had to take to get accepted for my Masters in Computing. It was a conversion course for people who didn’t have a Computing or Maths background. It definitely helped me get on that course.

I took all the modules about databases, obviously. As well as the networking ones. Because I was starting to get interested in the web.

And then I got interested in the web as a database. And data on the web.

It’s funny how things work out.

So thanks Mum for bringing that magazine home every Thursday night. And thank you Internet Archive for making them available for me to read again.

Posted at 15:05

February 17

John Breslin: Web Archive Ontology (SIOC+CDM)

Ontology Prototype

We (John G. Breslin and Guangyuan Piao, Unit for Social Semantics, Insight Centre for Data Analytics, NUI Galway) have created a prototype ontology for web archives based on two existing ontologies: Semantically-Interlinked Online Communities (SIOC) and the Common Data Model (CDM).

SIOC+CDM

Figure 1: Initial Prototype of Web Archive Ontology, Linking to SIOC and CDM

In Figure 1, we give an initial prototype for a general web archive ontology, linked to concepts in the CDM, but allowing flexibility in terms of archiving works, media, web pages, etc. through the “Item” concept. Items are versioned and linked to each other, as well as to concepts appearing in the archived items themselves.

We have not shown the full CDM for ease of display in this document, but rather some of the more commonly used concepts. We can also map to other vocabulary terms, shown under “Useful terms” in Table 1 below; some mappings and reused terms are shown in Figure 1.

Essentially, the top part of the model differentiates between the archive / storage mechanism for an item in an area (Container) on a website (Site), i.e. where it originally came from, who made it, when it was created / modified, when it was archived, the content stream, etc., and, on the bottom, what the item actually is (for example, in terms of CDM, the single exemplar of the manifestation of an expression of a work).

Also, the agents who make the item and the work may differ (e.g. a bot may generate an HTML copy of a PDF publication written by Ms. Smith).
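To make the top half of the model a little more concrete, here is a minimal hand-written Turtle sketch in the spirit of Figure 1; the archive.example.org URIs are invented and the exact terms in the final ontology may of course differ:

@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# The archive/storage side: an area (Container) on a website (Site).
<http://archive.example.org/> a sioc:Site .
<http://archive.example.org/reports/> a sioc:Container ;
    sioc:has_space <http://archive.example.org/> .

# Two versions of an archived item, linked to each other and to their maker.
<http://archive.example.org/reports/doc-v2> a sioc:Item ;
    sioc:has_container <http://archive.example.org/reports/> ;
    sioc:previous_version <http://archive.example.org/reports/doc-v1> ;
    dcterms:created "2016-02-01T12:00:00Z"^^xsd:dateTime ;
    dcterms:creator [ a foaf:Agent ; foaf:name "HTML rendering bot" ] .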

Relevant Public Ontologies

In Table 1, we list some relevant public ontologies and terms of interest. Some terms can be reused, and others can be mapped to for interoperability purposes.

For each ontology we give an overview, why it is relevant, and its useful terms:

• FRBR. Overview: For describing functional requirements for bibliographic records. Why relevant: To describe bibliographic records. Useful terms: Expression, Work.
• FRBRoo. Overview: Expresses the conceptualisation of FRBR with an object-oriented methodology instead of the entity-relationship methodology, as an alternative. Why relevant: In general, FRBRoo “inherits” all concepts of CIDOC-CRM and harmonises with it. Useful terms: ClassicalWork, LegalWork, ScholarlyWork, Publication, Expression.
• BIBFrame. Overview: For describing bibliographic descriptions, both on the Web and in the broader networked world. Why relevant: To represent and exchange bibliographic data. Useful terms: Work, Instance, Annotation, Authority.
• EDM. Overview: The Europeana Data Model models data in and supports functionality for Europeana, an internet portal that acts as an interface to millions of books, paintings, films, museum objects and archival records that have been digitised throughout Europe. Why relevant: Complements FRBRoo with additional properties and classes. Useful terms: incorporate, isDerivativeOf, WebResource, TimeSpan, Agent, Place, PhysicalThing.
• CIDOC-CRM. Overview: For describing the implicit and explicit concepts and relationships used in the cultural heritage domain. Why relevant: To describe cultural heritage information. Useful terms: EndofExistence, Creation, Time-Span.
• EAC-CPF. Overview: Encoded Archival Context for Corporate Bodies, Persons and Families is used for encoding the names of creators of archival materials and related information. Why relevant: Used closely in association with EAD to provide a formal method for recording the descriptions of record creators. Useful terms: lastDateTimeVerified, Control, Identity.
• EU PO CDM. Overview: Ontology based on the FRBR model, for describing the relationships between resource types managed by the EU Publications Office and their views, according to the FRBR model. Why relevant: To describe records. Useful terms: Expression, Work, Manifestation, Agent, Subject, Item.
• OAI-ORE. Overview: Defines standards for the description and exchange of aggregations of Web resources. Why relevant: To describe relationships among resources (also used in EDM). Useful terms: aggregates, Aggregation, ResourceMap.
• EAD. Overview: Standard used for hierarchical descriptions of archival records. Why relevant: Terms are designed to describe archival records. Useful terms: audience, abbreviation, certainty, repositorycode, AcquisitionInformation, ArchivalDescription.
• WGS84 Geo. Overview: For describing information about spatially located things. Why relevant: Terms can be used with the Place ontology for describing place information. Useful terms: lat, long.
• Media. Overview: For describing media resources on the Web. Why relevant: To describe media contents for web archiving. Useful terms: compression, format, MediaType.
• Places. Overview: For describing places of geographic interest. Why relevant: To describe place information for events, etc. Useful terms: City, Country, Continent.
• Event. Overview: For describing events. Why relevant: To describe specific events in content; can also be used for representing events at an administrative level. Useful terms: agent, product, place, Agent, Event.
• SKOS. Overview: A common data model for sharing and linking knowledge organisation systems. Why relevant: To capture similarities among ontologies and make the relationships explicit. Useful terms: broader, related, semanticRelation, relatedMatch, Concept, Collection.
• SIOC. Overview: For describing social content. Why relevant: Terms are general enough to be used for web archiving. Useful terms: previous_version, next_version, earlier_version, later_version, latest_version, Item, Container, Site, embed_knowledge.
• Dublin Core. Overview: Provides a metadata vocabulary of “core” properties able to provide basic descriptive information about any kind of resource. Why relevant: Fundamental terms used with other ontologies. Useful terms: creator, date, description, identifier, language, publisher.
• LOC METS Profile. Overview: The Metadata Encoding and Transmission Standard (METS) is a metadata standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library; the METS profile expresses the requirements that a METS document must satisfy. Why relevant: To describe and organise the components of a digital object. Useful terms: controlled_vocabularies, external_schema.
• DCAT and DCAT-AP. Overview: A specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. Why relevant: Enables the exchange of description metadata between data portals. Useful terms: downloadURL, accessURL, Distribution, Dataset, CatalogRecord.
• Formex. Overview: A format for the exchange of data between the Publication Office and its contractors. In particular (but not only), it defines the logical markup for documents which are published in the different series of the Official Journal of the European Union. Why relevant: Useful for annotating archived items as well as for exchange purposes. Useful terms: Archived, Annotation, FT, Note.
• ODP. Overview: Ontology describing the metadata vocabulary for the Open Data Portal of the European Union. Why relevant: To describe dataset portals. Useful terms: datasetType, datasetStatus, accrualPeriodicity, DatasetDocumentation.
• LOC PREMIS. Overview: Used to describe preservation metadata. Why relevant: Applicable to archives. Useful terms: ContentLocation, CreatingApplication, Dependency.
• VIAF. Overview: Virtual International Authority File is an international service designed to provide convenient access to the world’s major name authority files (lists of names of people, organisations, places, etc. used by libraries). It enables switching of the displayed form of names to the preferred language of a web user. Why relevant: Useful for linking to name authority files and helping to serve different language communities in Europe. Useful terms: AuthorityAgency, NameAuthority, NameAuthorityCluster.

Table 1: Relevant Ontologies and Terms

Posted at 15:05

John Breslin: Tales From the SIOC-O-Sphere #10

[Image: SIOC applications]

SIOC is a Social Semantic Web project that originated at DERI, NUI Galway (funded by SFI) and which aims to interlink online communities with semantic technologies. You can read more about SIOC on the Wikipedia page for SIOC or in this paper. But in brief, SIOC provides a set of terms that describe the main concepts in social websites: posts, user accounts, thread structures, reply counts, blogs and microblogs, forums, etc. It can be used for interoperability between social websites, for augmenting search results, for data exchange, for enhanced feed readers, and more. It’s also one of the metadata formats used in the forthcoming Drupal 7 content management system, and has been deployed on hundreds of websites including Newsweek.com.
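For anyone new to the vocabulary, a minimal hand-written Turtle sketch of the core idea looks roughly like this (the example.org URIs are of course made up):

@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/> a sioc:Site .
<http://example.org/forum/semweb> a sioc:Forum ;
    sioc:has_host <http://example.org/> .

# A post in that forum, written by a user account, with one reply.
<http://example.org/forum/semweb/post/1> a sioc:Post ;
    sioc:has_container <http://example.org/forum/semweb> ;
    sioc:has_creator <http://example.org/user/alice> ;
    dcterms:created "2010-06-01T09:00:00Z"^^xsd:dateTime ;
    sioc:content "Has anyone tried the new SMOB release?" ;
    sioc:has_reply <http://example.org/forum/semweb/post/2> .

<http://example.org/user/alice> a sioc:UserAccount .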

As part of our dissemination activities, I’ve tried to regularly summarise recent developments in the project so as to give an overview of what’s going on and also to help in connecting interested parties. It’s been much too long (over a year) since my last report, so this will be a long one! In reverse chronological order, here’s a list of recent applications and websites that are using SIOC:

  • SMOB Version 2. As you may have read on Y Combinator Hacker News yesterday, a re-architected and re-coded version of SMOB (Semantic Microblogging) has been created by Alex Passant. As with our original SMOB design, a user’s SMOB site stores and shares tweets and user information using SIOC and FOAF, but the new version also exposes data via RDFa and additional vocabularies (including the Online Presence Ontology, MOAT, Common Tag). The new SMOB suggests relevant URIs from DBpedia and Sindice when #hashtags are entered, and has moved from a client-server model to a set of distributed hubs. Contact @terraces.
  • on-the-wave. This script creates an enhanced browsing experience (that is SIOC-enabled) for the popular PTT bulletin board system. Contact kennyluck@csail.mit.edu.
  • Newsweek.com. American news magazine Newsweek are now publishing RDFa on their main site, including DC, CommonTag, FOAF and SIOC. Contact @markcatalano.
  • Linked Data from Picasa. OpenLink Software’s URI Burner can now provide Linked Data views of Google Picasa photo albums. See an example here. Contact @kidehen.
  • Facebook Open Graph Protocol. Facebook recently announced its Open Graph Protocol (OGP), which allows any web page to become a rich object in their social graph. While OGP defines its own set of classes and properties, the RDF schema contains direct mappings to existing concepts in FOAF, DBpedia and BIBO, and indirect mappings to concepts in Geo, vCard, SIOC and GoodRelations. OpenLink also have a data dictionary meshup of some OGP and SIOC terms (ogp:Blog is mapped to sioct:Weblog). Contact @daveman692.
  • Linked Data from Slideshare. A service to produce Linked Data from the popular Slideshare presentation sharing service has been created, and is available here. Data is represented in SIOC and DC. Contact @pgroth.
  • Fanhubz. FanHubz supports community building and discovery around BBC content items such as TV shows and radio programmes. It reuses the sioct:MicroblogPost term and also has some interesting additional annotation terms for in-show tweets (e.g. twitterSubtitles). Contact @ldodds.
  • RDFa-enhanced FusionForge. An RDFa-enhanced version of FusionForge, a software project management and collaboration system, has been created that generates metadata about projects, users and groups using SIOC, DOAP and FOAF. You can look at the Forge ontology proposal, and also view a demo site. Contact @olberger.
  • Falconer. Falconer is a Semantic Web search engine application enhanced with SIOC. It allows newly-created Social Web content to be represented in SIOC, but it also allows this content to be annotated with any semantic statements available from Falcons, and all of this data can then be indexed by the search engine to form an ecosystem of semantic data. Contact wu@seu.edu.cn.
  • Django to RDF. A script is available here to turn Django data into SIOC RDF and JSON. View the full repository of related scripts on github. Contact @niklasl.
  • SIOC Actions Module. A new SIOC module has been created to describe actions, with potential applications ranging from modelling actions in a developer community to tracing interactions in large-scale wikis. There is a SIOC Actions translator site for converting Activity Streams, Wikipedia interactions and Subversion actions into RDF. Contact @pchampin.
  • SIOC Quotes Module. Another SIOC module has been developed for representing quotes in e-mail conversations and other social media content. You can view a presentation on this topic. Contact @terraces.
  • Siocwave. Siocwave is a desktop tool for viewing and exploring SIOC data, and is based on Python, RDFLib and wxWidgets. Contact vfaronov@gmail.com.
  • RDFa in Drupal 7. Following the Drupal RDF code sprint in DERI last year, RDFa support (FOAF, SIOC, SKOS, DC) in Drupal core was committed to version 7 in October, and work has been proceeding apace on refining this module. Drupal 7 is currently on its fifth alpha version, and a full release candidate is expected later this summer. Find out more about the RDFa in Drupal initiative at semantic-drupal.com. Contact @scorlosquet.
  • Omeka Linked Data Plugin (Forthcoming). A plugin to produce Linked Data from the Omeka web publishing platform is in progress that will generate data using SIOC, FOAF, DOAP and other formats. Contact @patrickgmj.
  • Boeing inSite. inSite is an internal social media platform for Boeing employees that provides SIOC and FOAF data services as part of its architecture. Contact @adamboyet.
  • Virtuoso Sponger. Virtuoso Sponger is a middleware component of Virtuoso that generates RDF Linked Data from a variety of data sources (working as an “RDFizer”). It supports SIOC as an input format, and also uses SIOC as its data space “glue” ontology (view the slides). Contact @kidehen.
  • SuRF. SuRF is a Python library for working with RDF data in an object-oriented way, with SIOC being one of the default namespaces. Contact basca@ifi.uzh.ch.
  • Triplify phpBB 3. A Triplify configuration file for phpBB 3 has been created that allows RDF data (including SIOC) to be generated from this popular bulletin board system. Various other Triplify configurations are also available. Contact auer@informatik.uni-leipzig.de.
  • SiocLog. SiocLog is an IRC logging application that provides discussion channels and chat user profiles as Linked Data, using SIOC and FOAF respectively. You can see a deployment and view our slides. Contact @tuukkah.
  • myExperiment Ontology. myExperiment is a collaborative environment where scientists can publish their workflows and experiment plans, share them with groups and find those of others. In their model, myExperiment reuses ontologies like DC, FOAF, SIOC, CC and OAI-ORE. Contact drn@ecs.soton.ac.uk.
  • aTag. The aTag generator produces snippets of HTML enriched with SIOC RDFa and DBpedia-linked tags about highlighted items of interest on any web page, but aiming at the biomedical domain. Contact @matthiassamwald.
  • ELGG SID Module. A Semantically-Interlinked Data (SID) module for the ELGG educational social network system has been described that allows UGC and tags from ELGG platforms to become part of the Linked Data cloud. Contact @selvers.
  • Liferay Linked Data Module. The Linked Data module for Liferay, an enterprise portal solution, supports mapping of data to the SIOC, MOAT and FOAF vocabularies. Contact @bryan_.
  • ourSpaces. ourSpaces is a VRE enabling online collaboration between researchers from various disciplines. It combines FOAF and SIOC with data provenance ontologies for sharing digital artefacts. Contact r.reid@abdn.ac.uk.
  • Good Relations and SIOC. This post describes nicely how the Good Relations vocabulary for e-commerce can be combined with SIOC, e.g. to link a gr:Offering (either being offered or sought by a gr:BusinessEntity) to a natural-language discussion about that thing in a sioc:Post. Contact sdmonroe@gmail.com.
  • Debian BTS to RDF. Discussions from the Debian bug-tracking system (BTS) can be converted to SIOC and RDF and browsed or visualised in interesting ways, e.g. who replied to whom. Contact quang_vu.dang@it-sudparis.eu.
  • RDFex. For those wishing to reuse parts of popular vocabularies in their own Semantic Web vocabularies, RDFex is a mechanism for importing snippets from other namespaces without having to copy and paste them. RDFex can be used as a proxy for various ontologies including DC, FOAF and SIOC. Contact holger@knublauch.com.
  • IRC Logger with RDFa and SIOC. A fork of Dave Beckett’s IRC Logger has been created to include support for RDFa and SIOC by Toby Inkster. Contact mail@tobyinkster.co.uk.
  • mbox2rdf. A mbox2rdf script has been created that converts a mailing list in an mbox file to RDF (RSS, SIOC and DC). Contact mail@tobyinkster.co.uk.
  • Chisimba SIOC Export Module. A SIOC Export module for the Chisimba CMS/LMS platform has been created, which allows various Chisimba modules (CMS, forum, blog, Jabberblog, Twitterizer) to export SIOC data. Contact @paulscott56.
  • vBulletin SIOC Exporter. Omitted from the last report, the vBulletin SIOC plugin generates SIOC and FOAF data from vBulletin discussion forums. It includes a plugin that allows users to opt to export the SHA1 of their e-mail address (and other inverse functional properties) and their network of friends via vBulletin’s user control panel. Contact @johnbreslin.
  • Discuss SIOC on Google Wave. You can now chat about SIOC on our Google Wave.

Posted at 15:05

John Breslin: Some of my (very) preliminary opinions on Google Wave

I was interviewed by Marie Boran from Silicon Republic recently for an interesting article she was writing entitled “Will Google Wave topple the e-mail status quo and change the way we work?“. I thought that maybe my longer answers may be of interest and am pasting them below.

Disclaimer: My knowledge of Google Wave is second hand through various videos and demonstrations I’ve seen… Also, my answers were written pretty quickly!

As someone who is both behind Ireland’s biggest online community boards.ie and a researcher at DERI on the Semantic Web, are you excited about Google Wave?

Technically, I think it’s an exciting development – commercially, it obviously provides potential for others (Google included) to set up a competing service to us (!), but I think what is good is the way it has been shown that Google Wave can integrate with existing platforms. For example, there’s a nice demo showing how Google Wave plus MediaWiki (the software that powers the Wikipedia) can be used to help editors who are simultaneously editing a wiki page. If it can be done for wikis, it could aid with lots of things relevant to online communities like boards.ie. For example, moderators could see what other moderators are online at the same time, communicate on issues such as troublesome users, posts with questionable content, and then avoid stepping on each other’s toes when dealing with issues.

Does it have potential for collaborative research projects? Or is it heavyweight/serious enough?

I think it has some potential when combined with other tools that people are using already. There’s an example from SAP of Google Wave being integrated with a business process modelling application. People always seem to step back to e-mail for doing various research actions. While wikis and the like can be useful tools for quickly drafting research ideas, papers, projects, etc., there is that element of not knowing who is doing stuff at the same time as you. Just as people are using Gtalk to augment Gmail by being able to communicate with contacts in real-time when browsing e-mails, Google Wave could potentially be integrated with other platforms such as collaborative work environments, document sharing systems, etc. It may not be heavyweight enough on its own but at least it can augment what we already use.

Where does Google Wave sit in terms of the development of the Semantic Web?

I think it could be a huge source of data for the Semantic Web. What we find with various social and collaborative platforms is that people are voluntarily creating lots of useful related data about various objects (people, events, hobbies, organisations) and having a more real-time approach to creating content collaboratively will only make that source of data bigger and hopefully more interlinked. I’d hope that data from Google Wave can be made available using technologies such as SIOC from DERI, NUI Galway and the Online Presence Ontology (something we are also working on).

If we are to use Google Wave to pull in feeds from all over the Web will both RSS and widgets become sexy again?

I haven’t seen the example of Wave pulling in feeds, but in theory, what I could imagine is that real-time updating of information from various sources could allow that stream of current information to be updated, commented upon and forwarded to various other Waves in a very dynamic way. We’ve seen how Twitter has already provided some new life for RSS feeds in terms of services like Twitterfeed automatically pushing RSS updates to Twitter, and this results in some significant amounts of rebroadcasting of that content via retweets etc.

Certainly, one of the big things about Wave is its integration of various third-party widgets, and I think once it is fully launched we will see lots of cool applications building on the APIs that they provide. There’s been a few basic demonstrator gadgets shown already like polls, board games and event planning, but it’ll be the third-party ones that make good use of the real-time collaboration that will probably be the most interesting, as there’ll be many more people with ideas compared to some internal developers.

Is Wave the first serious example of a communications platform that will only be as good as the third-party developers that contribute to it?

Not really. I think that title applies to many of the communications platforms we use on the Web. Facebook was a busy service but really took off once the user-contributable applications layer was added. Drupal was obviously the work of a core group of people but again the third-party contributions outweigh those of the few that made it.

We already have e-mail and IM combined in Gmail and Google Docs covers the collaborative element so people might be thinking ‘what is so new, groundbreaking or beneficial about Wave?’ What’s your opinion on this?

Perhaps the real-time editing and updating process. Often times, it’s difficult to go back in a conversation and add to or fix something you’ve said earlier. But it’s not just a matter of rewriting the past – you can also go back and see what people said before they made an update (“rewind the Wave”).

Is Google heading towards unified communications with Wave, and is it possible that it will combine Gmail, Wave and Google Voice in the future?

I guess Wave could be one portion of a UC suite but I think the Wave idea doesn’t encompass all of the parts…

Do you think Google is looking to pull in conversations the way FriendFeed, Facebook and Twitter does? If so, will it succeed?

Yes, certainly Google have had interests in this area with their acquisition of Jaiku some time back (everyone assumed this would lead to a competitor to Twitter; most recently they made the Jaiku engine available as open source). I am not sure if Google intends to make available a single entry point to all public waves that would rival Twitter or Facebook status updates, but if so, it could be a very powerful competitor.

Is it possible that Wave will become as widely used and ubiquitous as Gmail?

It will take some critical mass to get it going, integrating it into Gmail could be a good first step.

And finally – is the game changing in your opinion?

Certainly, we’ve moved from frequently updated blogs (every few hours/days) to more frequently updated microblogs (every few minutes/seconds) to being able to not just update in real-time but go back and easily add to / update what’s been said any time in the past. People want the freshest content, and this is another step towards not just providing content that is fresh now but a way of freshening the content we’ve made in the past.

Posted at 15:05

John Breslin: Open government and Linked Data; now it's time to draft…

For the past few months, there have been a variety of calls for feedback and suggestions on how the US Government can move towards becoming more open and transparent, especially in terms of their dealings with citizens and also for disseminating information about their recent financial stimulus package.

As part of this, the National Dialogue forum was set up to solicit solutions for ways of monitoring the “expenditure and use of recovery funds”. Tim Berners-Lee wrote a proposal on how linked open data could provide semantically-rich, linkable and reusable data from Recovery.gov. I also blogged about this recently, detailing some ideas for how discussions by citizens on the various uses of expenditure (represented using SIOC and FOAF) could be linked together with financial grant information (in custom vocabularies).

More recently, the Open Government Initiative solicited ideas for a government that is “more transparent, participatory, and collaborative”, and the brainstorming and discussion phases have just ended. This process is now in its third phase, where the ideas proposed to solve various challenges are to be more formally drafted in a collaborative manner.

What is surprising about this is how few submissions and contributions have been put into this third and final phase (see graph below), especially considering that there is only one week for this to be completed. Some topics have zero submissions, e.g. “Data Transparency via Data.gov: Putting More Data Online”.

[Graph: the small number of submissions to phase three of the Open Government Initiative]

This doesn’t mean that people aren’t still thinking about this. On Monday, Tim Berners-Lee published a personal draft document entitled “Putting Government Data Online“. But we need more contributions from the Linked Data community to the drafts during phase three of the Open Government Directive if we truly believe that this solution can make a difference.

For those who want to learn more about Linked Data, click on the image below to go to Tim Berners-Lee’s TED talk on Linked Data.

(I watched it again today, and added a little speech bubble to the image below to express my delight at seeing SIOC profiles on the Linked Open Data cloud slide.)

We also have a recently-established Linked Data Research Centre at DERI in NUI Galway.

[Image: Tim Berners-Lee’s TED talk on Linked Data]

Posted at 15:05

John Breslin: Idea for Linked Open Data from the US recovery effort

Tim Berners-Lee recently posted an important request for the provision of Linked Open Data from the US recovery effort website Recovery.gov.

The National Dialogue website (set up to solicit ideas for data collection, storage, warehousing, analysis and visualisation; website design; waste, fraud and abuse detection; and other solutions for transparency and accountability) says that for Recovery.gov to be a useful portal for citizens, it “requires finding innovative ways to integrate, track, and display data from thousands of federal, state, and local entities”.

If you support the idea of Linked Open Data from Recovery.gov, you can have a read and provide some justifications on this thread.

(I’ve recently given some initial ideas about how grant feed data could be linked with user contributions in the form of associated threaded discussions on different topics, see picture below, all to be exposed as Linked Open Data using custom schemas plus SIOC and FOAF across a number of agencies / funding programs / websites.)

[Diagram: grant feed data linked to threaded discussions using SIOC and FOAF]

Posted at 15:05

John Breslin: "The Social Semantic Web": now available to pre-order from Springer and Amazon

Our forthcoming book entitled “The Social Semantic Web”, to be published by Springer in Autumn 2009, is now available to pre-order from both Springer and Amazon.

An accompanying website for the book will be at socialsemanticweb.net.

Posted at 15:05

February 15

Sebastian Trueg: Conditional Sharing – Virtuoso ACL Groups Revisited

Previously we saw how ACLs can be used in Virtuoso to protect different types of resources. Today we will look into conditional groups, which allow us to share resources or grant permissions to a dynamic group of individuals. This means that we do not maintain a list of group members but instead define a set of conditions which an individual needs to fulfill in order to be part of the group in question.

That does sound very dry. Let’s just jump to an example:

@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
[] a oplacl:ConditionalGroup ;
  foaf:name "People I know" ;
  oplacl:hasCondition [
    a oplacl:QueryCondition ;
    oplacl:hasQuery """ask where { graph <urn:my> { <urn:me> foaf:knows ^{uri}^ } }"""
  ] .

This group is based on a single condition which uses a simple SPARQL ASK query. The ask query contains a variable ^{uri}^ which the ACL engine will replace with the URI of the authenticated user. The group contains anyone who is in a foaf:knows relationship to urn:me in named graph urn:my. (Ideally the latter graph should be write-protected using ACLs as described before.)

Now we use this group in ACL rules. That means we first create it:

$ curl -X POST \
    --data-binary @group.ttl \
    -H"Content-Type: text/turtle" \
    -u dba:dba \
    http://localhost:8890/acl/groups

As a result we get a description of the newly created group which also contains its URI. Let’s imagine this URI is http://localhost:8890/acl/groups/1.

To mix things up we will use the group for sharing permission to access a service instead of files or named graphs. Like many of the Virtuoso-hosted services the URI Shortener is ACL controlled. We can restrict access to it using ACLs.

As always the URI Shortener has its own ACL scope which we need to enable for the ACL system to kick in:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope <urn:virtuoso:val:scopes:curi> .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope <urn:virtuoso:val:scopes:curi> .
};

Now we can go ahead and create our new ACL rule which allows anyone in our conditional group to shorten URLs:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
[] a acl:Authorization ;
  oplacl:hasAccessMode oplacl:Write ;
  acl:accessTo <http://localhost:8890/c> ;
  acl:agent <http://localhost:8890/acl/groups/1> ;
  oplacl:hasScope <urn:virtuoso:val:scopes:curi> ;
  oplacl:hasRealm oplacl:DefaultRealm .
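
The rule is posted to the ACL API in the same way as the group was, assuming we saved it to a file called rule.ttl:

$ curl -X POST \
    --data-binary @rule.ttl \
    -H"Content-Type: text/turtle" \
    -u dba:dba \
    http://localhost:8890/acl/rules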

Finally we add one URI to the conditional group as follows:

sparql
insert into <urn:my> {
  <urn:me> foaf:knows <http://www.facebook.com/sebastian.trug> .
};

As a result my Facebook account has access to the URL Shortener:
Virtuoso URI Shortener

The example we saw here uses a simple query to determine the members of the conditional group. These queries could get much more complex, and multiple query conditions could be combined. In addition, Virtuoso handles a set of non-query conditions (see also oplacl:GenericCondition). The most basic one is the following, which matches any authenticated person:

[] a oplacl:ConditionalGroup ;
  foaf:name "Valid Identifiers" ;
  oplacl:hasCondition [
    a oplacl:GroupCondition, oplacl:GenericCondition ;
    oplacl:hasCriteria oplacl:NetID ;
    oplacl:hasComparator oplacl:IsNotNull ;
    oplacl:hasValue 1
  ] .

This shall be enough on conditional groups for today. There will be more playing around with ACLs in the future…

Posted at 21:10

Sebastian Trueg: Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and cannot be found in the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quadruple-store which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that but this view on things is sufficient for our purposes), public graphs being readable and writable by anyone who has permission to read or write in general, private graphs only being readable and writable by administrators and those to which named graph permissions have been granted. The latter case is what interests us today.

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

Virtuoso Sparql Endpoint

Sparql Result
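
(The insert behind those screenshots would look something like this at the SPARQL endpoint; the demo triples here are made up, and only the graph name urn:trueg:demo matters for the rest of the post.)

prefix foaf: <http://xmlns.com/foaf/0.1/>
insert into <urn:trueg:demo> {
  <urn:trueg:demo:alice> a foaf:Person ;
    foaf:name "Alice" .
}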

This graph is now public and can be queried by anyone. Since we want to make it private we quickly need to change into a SQL session since this part is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

Sparql Query
Sparql Query Result

But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <http://www.linkedin.com/in/trueg> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for agent http://www.linkedin.com/in/trueg. The only tricky part is the scope. Virtuoso has the concept of ACL scopes which group rules by their resource type. In this case the scope is private graphs; another typical scope would be DAV resources.

Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.

Finally we will login using my LinkedIn identity and are granted read access to the graph:

SPARQL Endpoint Login

We see all the original triples in the private graph. And as before with DAV resources no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
};

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .

I will explain these technical concepts in more detail in another article.

Posted at 21:10

Sebastian Trueg: Sharing Files With Whomever Is Simple

Dropbox, Google Drive, OneDrive, Box.com – they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. On first glance this might be easy enough but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?

I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.

Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:

Virtuoso DAV Browser

This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.

We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.

For this small demo I will share a file with my LinkedIn identity http://www.linkedin.com/in/trueg. (Virtuoso/VAL identifies people via URIs; thus, it has schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)

Virtuoso Share File

Now when I logout and try to access the file in question I am presented with the authentication dialog from VAL:

VAL Authentication Dialog

This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn which will result in an OAuth handshake followed by the granted read access to the file:

LinkedIn OAuth Handshake

 

Access to file granted

It is that simple. Of course these identifiers can also be used in groups, allowing you to share files and folders with a set of people instead of just one individual.

Next up: Sharing Named Graphs via VAL.

Posted at 21:10

Sebastian Trueg: Digitally Sign Emails With Your X.509 Certificate in Evolution

Digitally signing emails is always a good idea. People can verify that you actually sent the mail and they can encrypt emails in return. A while ago Kingsley showed how to sign emails in Thunderbird. I will now follow up with a short post on how to do the same in Evolution.

The process begins with actually getting an X.509 certificate including an embedded WebID. There are a few services out there that can help with this, most notably OpenLink’s own YouID and ODS. The former allows you to create a new certificate based on existing social service accounts. The latter requires you to create an ODS account and then create a new certificate via Profile edit -> Security -> Certificate Generator. In any case make sure to use the same email address for the certificate that you will be using for email sending.

The certificate will actually be created by the web browser, making sure that the private key is safe.

If you are a Google Chrome user you can skip the next step since Evolution shares its key storage with Chrome (and several other applications). If you are a user of Firefox you need to perform one extra step: go to the Firefox preferences, into the advanced section, click the “Certificates” button, choose the previously created certificate, and export it to a .p12 file.

Back in Evolution’s settings you can now import this file:

To actually sign emails with your shiny new certificate stay in the Evolution settings, choose to edit the Mail Account in question, select the certificate in the Secure MIME (S/MIME) section and check “Digitally sign outgoing messages (by default)“:

The nice thing about Evolution here is that in contrast to Thunderbird there is no need to manually import the root certificate which was used to sign your certificate (in our case the one from OpenLink). Evolution will simply ask you to trust that certificate the first time you try to send a signed email:

That’s it. Email signing in Evolution is easy.

Posted at 21:10

Davide Palmisano: SameAs4J: little drops of water make the mighty ocean

A few days ago Milan Stankovich contacted the Sindice crew to let us know that he had written a simple Java library to interact with the public Sindice HTTP APIs. We always appreciate this kind of community effort to collaboratively make Sindice a better place on the Web. In agreement with Milan, we decided to put some effort into his initial work to make the library the official open source tool for Java programmers.
That reminded me that, a few months ago, I did for sameas.org the same thing Milan did for us. But (ashamedly) I never told those guys about what I had done.
Sameas.org is a great and extremely useful tool on the Web that makes it concretely possible to interlink different Linked Data clouds. Simple to use (both for humans via HTML and for machines with a simple HTTP/JSON API) and extremely responsive, it allows you to get all the owl:sameAs objects for a given URI. And, moreover, it’s based on Sindice.com.
Do you want to know the identifier of http://dbpedia.org/resource/Rome in Freebase or Yago? Just ask Sameas.org.

So, after some months I just refined a couple of things, added some javadocs, set up a Maven repository and made SameAs4j publicly available (MIT licensed) to everyone on Google Code.
It’s a simple but reliable tiny set of Java classes that allows you to interact with sameas.org programmatically in your Java Semantic Web applications.

Back to the beginning: every piece of open source software is like a little drop of water that helps make the mighty ocean, so please submit any issue or patch if you’re interested.

Posted at 21:10

Davide Palmisano: FBK, Any23 and my involvement in Sindice.com

After almost two years spent working at Asemantics, I left it to join the Fondazione Bruno Kessler (FBK), a quite large research institute based in Trento.

These last two years have been amazing: I met very skilled and enthusiastic people, working with them on a broad set of different technologies. Every day spent there was an opportunity to learn something new from them, and in the end they are now very good friends more than colleagues. Now Asemantics is part of the bigger Pro-netics Group.

Having moved from Rome, I decided to follow Giovanni Tummarello and Michele Mostarda to launch from scratch a new research unit at FBK called “Web of Data”. FBK is a well-established organization with several units working on a plethora of different research fields. Every day there is an opportunity to join workshops and other kinds of events.

Just to give you an idea of how things work here: in April 2009 David Orban gave a talk on “The Open Internet of Things”, attended by a large number of researchers and students. Aside from FBK, there is a quite active community in Trento hanging out around the Semantic Web.

“The Semantic Valley”, that’s how they call this euphoric movement around these technologies.

Back to me, the new “Web of Data” unit has joined the Sindice.com army and the last minute release of Any23 0.2 is only the first outcome of this joint effort on the Semantic Web Index between DERI and FBK.

In particular, the Any23 0.2 release has been my first task here. It’s a library, a service, an RDF distiller. It’s used on board the Sindice ingestion pipeline, it’s publicly available here, and yesterday I spent a couple of minutes writing this simple bookmarklet:

javascript:window.open('http://any23.org/best/' + window.location);

Once added to your browser, pressing it on a Web page returns a bunch of RDF triples distilled by the Any23 servlet.

So, what’s next?

The Web of Data unit has just started. More things, from the next release of Sindice.com to other projects currently in inception, will see the light. I really hope to keep on contributing to the concrete consolidation of the Semantic Web, the Web of Data or Web 3.0 or whatever we’d like to call it.

Posted at 21:10

Davide Palmisano: Cheap Linked Data identifiers

This is a (short) technical post.

Every day I face the problem of getting Linked Data URIs that uniquely identify a “thing”, starting from an ambiguous, poor and flat keyword or description. One of the first steps in developing an application that consumes Linked Data is to provide a mechanism that links our own data sets to one (or more) LoD bubbles. To get a clear idea of why identifiers matter, I suggest you read this note from Dan Brickley: starting from some needs we encountered within the NoTube project, he clearly underlined the importance of LoD identifiers. Even though the problem of uniquely identifying words and terms falls into the bigger category usually known as term disambiguation, I’d like to clarify that what I’m going to explain is a narrow restriction of the whole problem.

What I really need is a simple mechanism that allows me to convert one specific type of identifier into a set of Linked Data URIs.

For example, I need something that, given a book’s ISBN number, returns a set of URIs referring to that book. Or, given the title of a movie, I expect back some URIs (from DBpedia or LinkedMDB or whatever) identifying and describing it in a unique way.

Isn’t SPARQL enough for you to do that?

Yes, obviously the following SPARQL query may be sufficient:
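
(The original post showed the query as an image; reconstructed, and using the ISBN example that appears further down, it would look roughly like this.)

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?subject
WHERE {
  ?subject dbpedia-owl:isbn ?isbn .
  FILTER (regex(?isbn, "978-0-374-16527-7"))
}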

but what I need is something quicker that I may invoke as an HTTP GET like:

http://localhost:8080/resolver?value=978-0-374-16527-7&category=isbn

returning back to me a simple JSON:

{ "mappings": [
"http://dbpedia.org/resource/Gomorrah_%28book%29"],
"status": "ok"
}

But the real issue here is the code overhead needed whenever you want to add another kind of identifier resolution. Let’s imagine, for instance, that I have already implemented this kind of service and I want to add another resolution category. I would have to hard-code another SPARQL query, modify the code to expose it as a service, and redeploy it.

I’m sure we could do better.

If we take a closer look at the above SPARQL query, we can easily see that the problem can be highly generalised. In fact, this kind of resolution often means performing a SPARQL query asking for URIs that have a certain value for a certain property, such as dbprop:isbn in the ISBN case.

And this is what I did over the last two days: the NoTube Identity Resolver.

A simple Web service (described in the figure below) that is fully customizable by simply editing an XML configuration file.

NoTube Identity Resolver architecture

The resolvers.xml file allows you to provide a simple description of the resolution policy that will be accessible with a simple HTTP GET call.

Back to the ISBN example, the following piece of XML is enough to describe the resolver:

<resolver id="2" type="normal">
<category>isbn</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<lookup>dbpedia-owl:isbn</lookup>
<sameas>true</sameas>
<matching>LITERAL</matching>
</resolver>

Where:

  • category is the value that has to be passed as a parameter in the HTTP GET call to invoke this resolver
  • endpoint is the address of the SPARQL endpoint where the resolution is performed
  • lookup is the name of the property to be looked up, i.e. matched against the supplied value
  • type (optional) is the rdf:type of the resources to be resolved
  • sameas is a boolean value enabling (or not) a call to the SameAs.org service to gather equivalent URIs
  • matching (allowing only URI and LITERAL as values) describes the type of the value to be resolved

Moreover, the NoTube Identity Resolver also gives you the possibility to specify more complex resolution policies through a SPARQL query, as shown below:

<resolver id="3" type="custom">
<category>movie</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<sparql><![CDATA[SELECT DISTINCT ?subject
WHERE { ?subject a <http://dbpedia.org/ontology/Film>.
?subject <http://dbpedia.org/property/title> ?title.
FILTER (regex(?title, "#VALUE#")) }]]>
</sparql>
<sameas>true</sameas>
</resolver>

In other words, every resolver described in the resolvers.xml file allows you to enable one kind of resolution mechanism without writing a line of Java code.

Do you want to try?

Just download the war package, get this resolvers.xml (or write your own), export the RESOLVERS_XML_LOCATION environment variable pointing to the folder where the resolvers.xml is located, deploy the war on your Apache Tomcat application server, start the application and try it out by heading your browser to:

http://localhost:8080/notube-identity-resolver/resolver?value=978-0-374-16527-7&category=isbn
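
In shell terms the setup amounts to something like this; the paths and the war file name below are just placeholders, so adjust them to your environment:

export RESOLVERS_XML_LOCATION=/home/me/notube/config    # folder containing resolvers.xml
cp notube-identity-resolver.war $CATALINA_HOME/webapps/ # deploy the war to Tomcat
$CATALINA_HOME/bin/startup.sh                           # start Tomcat from this shell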

That’s all folks

Posted at 21:10

Copyright of the postings is owned by the original blog authors. Contact us.