Planet RDF

It's triples all the way down

January 29

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as in Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather rare.

There are different reasons why one might want to encrypt data, ranging from preserving a competitive advantage to end-user privacy concerns. Whatever the motivation, the question is whether systems support this (transparently) or whether developers are forced to implement it in the application logic.

At the IaaS level, especially in the category of file storage for app development, one would expect wide support for built-in encryption, but this turns out to be rare.

At the PaaS level things look much the same: for example, AWS Elastic Beanstalk provides no support for encryption of data (unless you count S3), and for Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive do not seem to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are a few efforts around making HDFS and the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR when using S3.

Posted at 18:06

January 25

Egon Willighagen: MetaboEU2020 in Toulouse and the ELIXIR Metabolomics Community assemblies

This week I attended the European RFMF Metabomeeting 2020, aka #MetaboEU2020, held in Toulouse. Originally, I had hoped to do this by train, but that turned out to be unfeasible. Co-located with this meeting were ELIXIR Metabolomics Community meetings. We're involved in two implementation studies which together amount to less than a month of work. But both this community and the conference are great places to talk about WikiPathways, BridgeDb (our website is still disconnected from the internet), and cheminformatics.

Toulouse was generally great. It comes with its big-city issues, like fairly expensive hotels, but also a very frequent public transport system. It also has a great food market, where we had our "gala dinner". Toulouse is home to Airbus, so it was hard to miss the Beluga:


The MetaboEU2020 conference itself had some 400 participants, of course with a lot of wet-lab metabolomics. As a chemist with a good pile of training in analytical chemistry, it's great to see the progress. From a data analysis perspective, though, the community still has a long way to go. We're still talking about known knowns, unknown knowns, and unknown unknowns. The posters were often cryptic, e.g. stating that they found 35 interesting metabolites without actually listing them. The talks were also really interesting.

Now, if you read this, there is a good chance you were not at the meeting. You can check the above-linked hashtag for coverage on Twitter, but we can do better. I loved Lanyrd, but their business model was not scalable and the service no longer exists. But Scholia (see doi:10.3897/rio.5.e35820) could fill the gap (it uses the Wikidata RDF and SPARQL queries). I followed Finn's steps, created a page for the meeting, and started associating speakers with it (I've done this in the past for other meetings too):


Finn has also created proceedings pages in the past, which I followed as well. So I asked people on Twitter to post their slide decks and posters on Figshare or Zenodo, and so far we have ended up with 10 "proceedings" (thanks to everyone who did!!!):



As you can see, there is an RSS feed which you can follow (e.g. with Feedly) to get updates if more material appears online! I wish all conferences did this!
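
If you are curious what Scholia does under the hood, below is a minimal sketch of the kind of SPARQL query one can run against the Wikidata endpoint to list the speakers recorded for an event. The Q-identifier is a placeholder, the property used is the "speaker" property (P823) as far as I recall, and Scholia's real queries are more elaborate, so treat this purely as an illustration:

  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  # Speakers associated with an event item (wd:Q00000000 is a placeholder).
  SELECT ?speaker ?speakerLabel WHERE {
    wd:Q00000000 wdt:P823 ?speaker .       # P823 = speaker
    ?speaker rdfs:label ?speakerLabel .
    FILTER (LANG(?speakerLabel) = "en")
  }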

Posted at 10:05

January 24

Leigh Dodds: Licence Friction: A Tale of Two Datasets

For years now at the Open Data Institute we’ve been working to increase access to data, to create social and economic benefits across a range of sectors. While the details change across projects, one of the more consistent aspects of our work and guidance has been to support data stewards in making data as open as possible, whilst ensuring that it is clearly licensed.

Reference data, like addresses and other geospatial data, that underpins our national and global data infrastructure needs to be available under an open licence. If it’s not, which is the ongoing situation in the UK, then other data cannot be made as open as possible. 

Other considerations aside, data can only be as open as the reference data it relies upon. Ideally, reference data would be in the public domain, e.g. using a CC0 waiver. Attribution should be a consistent norm regardless of what licence is used.

Data becomes more useful when it is linked with other data. When it comes to data, adding context adds value. It can also add risks, but more value can be created from linking data. 

When data is published using bespoke or restrictive licences then it is harder to combine different datasets together, because there are often limitations in the licensing terms that restrict how data can be used and redistributed.

This means data needs to be licensed using common, consistent licences. Licences that work with a range of different types of data, collected and used by different communities across jurisdictions. 

Incompatible licences create friction that can make it impossible to create useful products and services. 

It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved because of lack of access to data. Or are impossible because of issues with restrictive or incompatible data licences. Or are cancelled or simply needlessly expensive due to the need for legal consultations and drafting of data sharing agreements.

These are the hurdles you often need to overcome before you even get started with that messy data.

Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licences, are getting in the way of useful work.

Introducing Active Places

Active Places is a dataset stewarded by Sport England. It provides a curated database of sporting facilities across England. It includes facilities provided by a range of organisations across the public, private and third-sectors. It’s designed to help support decision making about the provision of tens of thousands of sporting sites and facilities around the UK to drive investment and policy making. 

The dataset is rich and includes a wide range of information from disabled access through to the length of ski slopes or the number of turns on a cycling track.

While Sport England are the data steward, the curation of the dataset is partly subcontracted to a data management firm and partly carried out collaboratively with the owners of those sites and facilities.

The dataset is published under a standard open licence, the Creative Commons Attribution 4.0 licence. So anyone can access, use and share the data so long as they acknowledge its source. Contributors to the dataset agree to this licence as part of registering to contribute to the site.

The dataset includes geospatial data, including the addresses and locations of individual sites. This data includes IP from Ordnance Survey and Royal Mail, which means they have a say over what happens to it. In order to release the data under an open licence, Sport England had to request an exemption from the Ordnance Survey to their default position, which is that data containing OS IP cannot be sublicensed. When granted an exemption, an organisation may publish their data under an open licence. In short, OS waive their rights over the geographic locations in the data. 

The OS can’t, however, waive any rights that Royal Mail has over the address data. In order to grant Sport England an exemption, the OS also had to seek permission from Royal Mail. The Sport England team were able to confirm this for me.

Unfortunately it’s not clear, without having checked, that this is actually the case. It’s not evident in the documentation of either Active Places or the OS exemption process. Is clarifying all third-party rights a routine part of the exemption process or not?

It would be helpful to know. As the ODI has highlighted, lack of transparency around third-party rights in open data is a problem. For many datasets the situation remains unclear. And unclear positions are fantastic generators of legal and insurance fees.

So, to recap: Sport England has invested time in convincing Ordnance Survey to allow it to openly publish a rich dataset for the public good. A dataset in which geospatial data is clearly important, but is not the main feature. The reference data is dictating how open the dataset can be and, as a result, how much value can be created from it.

In case you’re wondering, lots of other organisations have had to do the same thing. The process is standardised to try and streamline it for everyone. A 2016 FOI request shows that between 2011 and 2015 the Ordnance Survey handled more than 1,000 of these requests.

Enter OpenStreetMap

At the end of 2019, members of the OpenStreetmap community contacted Sport England to request permission to use the Active Places dataset. 

If you’re not familiar with OpenStreetmap, then you should be. It’s an openly licensed map of the world maintained by a huge community of volunteers, humanitarian organisations, public and private sector businesses around the world.

The OpenStreetmap Foundation is the official steward of the dataset, with the day-to-day curation and operations happening through its volunteer network. As a small not-for-profit, it has to be very cautious about legal issues relating to the data. It can’t afford to be sued. The community is careful to ensure that data that is imported or added into the database comes from openly licensed sources.

In March 2017, after a consultation with Creative Commons, the OpenStreetmap Licence/Legal Working Group concluded that data published under the Creative Commons Attribution licence is not compatible with the licence used by OpenStreetmap, the Open Database Licence (ODbL). They felt that some specific terms in the licence (and particularly in its 4.0 version) meant that they needed additional permission in order to include that data in OpenStreetmap.

Since then the OpenStreetmap community has been contacting data stewards to ask them to sign an additional waiver that grants the OSM community explicit permission to use the data. This is exactly what open licensing of data is intended to avoid.

CC-BY is one of the most frequently used open data licences, so this isn’t a rare occurrence. 

As an indicator of the extra effort required: in a 2018 talk in which the Bing Maps team discuss how they have been supporting the OpenStreetmap community in Australia, they called out their legal team as one of the most important assets they could provide to the local mapping community, helping to get waivers signed. At the time of writing, nearly 90 waivers have been circulated in Australia alone, not all of which have been signed.

So, to recap, due to a perceived incompatibility between two of the most frequently used open data licences, the OpenStreetmap community and its supporters are spending time negotiating access to data that is already published under an open licence.

I am not a lawyer. So these are like, just my opinions. But while I understand why the OSM Licence Working Group needs to be cautious, it feels like they are being overly cautious. Then again, I’m not the one responsible for stewarding an increasingly important part of a global data infrastructure. 

Another opinion is that perhaps the Microsoft legal team might be better deployed to solve the licence incompatibility issues. Instead they are now drafting their own new open data licences, which are compatible with CC-BY.

Active Places and OpenStreetmap

Still with me?

At the end of last year, members of the OpenStreetMap community contacted Sport England to ask them to sign a waiver so that they could use the Active Places data. Presumably to incorporate some of the data into the OSM database.

The Sport England data and legal teams then had to understand what they were being asked to do and why. And they asked for some independent advice, which is where I provided some support through our work with Sport England on the OpenActive programme. 

The discussion included:

  • questions about why an additional waiver was actually necessary
  • the differences in how CC-BY and ODbL are designed to keep data open and accessible – CC-BY limits the use of technical restrictions (a limitation allowed by the open definition), whilst ODbL adopts a principle of encouraging “parallel distribution”.
  • acceptable forms and methods of attribution
  • who, within an organisation like Sport England, might have responsibility to decide what acceptable attribution looked like
  • why the OSM community had come to its decisions
  • who actually had authority to sign-off on the proposed waiver
  • whether signing a waiver and granting a specific permission undermined Sport England’s goal to adopt standard open data practices and licences, and a consistent approach for every user
  • whether the OS exemption, which granted permission to SE to publish the dataset under an open licence, impacted any of the above

All reasonable questions from a team being asked to do something new. 

Like a number of organisations asked to sign waivers in Australia, SE have not yet signed a waiver and may choose not to do so. Like all public sector organisations, SE are being cautious about taking risks.

The discussion has spilled out onto twitter. I’m writing this to provide some context and background to the discussion in that thread. I’m not criticising anyone as I think everyone is trying to come to a reasonable outcome. 

As the twitter thread highlights, the OSM community are not just concerned about the CC-BY licence but also about the potential that additional third-party rights are lurking in the data. Clarifying that may require SE to share more details about how the address and location data in the dataset is collected, validated and normalised for the OSM community to be happy. But, as noted earlier in the blog, I’ve at least been able to determine the status of any third-party rights in the data. So perhaps this will help to move things further.

The End

So, as a final recap, we have two organisations both aiming to publish and use data for the public good. But, because of complexities around derived data and licence compatibilities, data that might otherwise be used in new, innovative ways is instead going unused.

This is a situation that needs solving. It needs the UK government and Geospatial Commission to open up more geospatial data.

It needs the open data community to invest in resolving licence incompatibilities (and less in creating new licences) so that everyone benefits. 

We also need to understand when licences are the appropriate means of governing how data is used and when norms, e.g. around attribution, can usefully shape how data is accessed, used and shared.

Until then these issues are going to continue to undermine the creation of value from open (geospatial) data.

Posted at 20:05

January 22

schema.org: Schema.org 6.0

Schema.org version 6.0 has been released. See the release notes for full details. Versions 5.0 and 4.0 also added many improvements and additions, even if we forgot to blog about them. As always, the release notes have details (and links to even more details).

We are now aiming to release updated schemas on an approximately monthly basis (with longer gaps around vacation periods). Typically, new terms are first added to our "Pending" area to give time for the definitions to benefit from implementation experience before they are added to the "core" of Schema.org. As always, many thanks to everyone who has contributed to this release of Schema.org.

--
Dan Brickley, for Schema.org.

Posted at 16:13

January 18

Leigh Dodds: [Paper Review] The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work

This blog post is a quick review and notes relating to a research paper called: The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work (PDF available here)

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

This paper explores the impact of data infrastructure, and in particular the use of identifiers and the design of databases, on the delivery of human (public) services. By reviewing the use of identifiers and data in service delivery to support homelessness and those affected by AIDS, the authors highlight a number of tensions, showing how the design of data infrastructure and the need to share data with funders and other agencies have an inevitable impact on frontline services.

For example, the need to evidence impact to funders requires the collection of additional personal, legal identifiers, even when that information is not critical to the delivery of support.

The paper also explores the interplay between the well-defined, unforgiving world of database design and the messy nature of delivering services to individuals. Along the way the authors touch on aspects of identity and identification, and explore different types of identifiers and data collection practices.

The authors draw out a number of infrastructure problems and provide some design provocations for alternate approaches. The three main problems are the immutability of identifiers in database schemas, the “hegemony of NOT NULL” (or the need for identification), and the demand for uniqueness across contexts.

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. If, like me, you’re often advocating for the use of consistent, open identifiers, then this paper provides a useful perspective on how this approach might create issues or unwanted side effects outside of the simpler world of reference data
  2. If you’re designing digital public services then the design provocations around identifiers and approaches to identification are definitely worth reading. I think there’s some useful reflections about how we capture and manage personal information
  3. If you’re a public policy person advocating for consistent use of identifiers across agencies, then there are some important considerations in this paper around the policy, privacy and personal impacts of data collection

Three things I learned

Here’s three things that I learned from reading the paper.

  1. In a section on “The Data Work of Human Services Provision“, the authors highlighted three aspects of frontline data collection which I found useful to think about:
    • data compliance work – collecting data purely to support the needs of funders, which might be at odds with the needs of both the people being supported and the service delivery staff
    • data coordination work – which stems from the need to link and aggregate data across agencies and funders to provide coordinated support
    • data confidence work – the need to build a trusted relationship with people, at the front-line, in order to capture valid, useful data
  2. Similarly, the authors tease out four reasons for capturing identifiers, each of which have different motivations, outcomes and approaches to identification:
    • counting clients – a basic need to monitor and evaluate service provision, identification here is only necessary to avoid duplicates when counting
    • developing longitudinal histories – e.g. identifying and tracking support given to a person over time can help service workers to develop understanding and improve support for individuals
    • as a means of accessing services – e.g. helping to identify eligibility for support
    • to coordinate service provision – e.g. sharing information about individuals with other agencies and services, which may also have different approaches to identification and use of identifiers
  3. The design provocations around database design were helpful to highlight some alternate approaches to capturing personal information and the needs of the service vs that of the individual

Thoughts and impressions

As someone who has not been directly involved in the design of digital systems to support human services, I found the perspectives and insight shared in this paper really useful. If you’ve been working in this space for some time, then it may be less insightful.

However I haven’t seen much discussion about good ways to design more humane digital services and, in particular, the databases behind them, so I suspect the paper could do with a wider airing. It’s useful reading alongside things like Falsehoods Programmers Believe About Names and Falsehoods Programmers Believe About Gender.

Why don’t we have a better approach to managing personal information in databases? Are there solutions out there already?

Finally, the paper makes some pointed comments about the role of funders in data ecosystems. Funders are routinely collecting and aggregating data as part of evaluation studies, but this data might also help support service delivery if it were more accessible. It’s interesting to consider the balance between minimising unnecessary collection of data simply to support evaluation versus the potential role of funders as intermediaries that can provide additional support to charities, agencies or other service delivery organisations that may lack the time, funding and capability to do more with that data.

Posted at 13:05

January 17

AKSW Group - University of Leipzig: SANSA 0.7.1 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find usage guidelines and examples at http://sansa-stack.net/user-guide.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad, TRIX format
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify and Ontop and Tensors
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • TRIX support
  • A new query engine over compressed RDF data
  • OWL/XML Support

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

Posted at 09:05

January 16

Egon Willighagen: Help! Digital Object Identifiers: Usability reduced if given at the bottom of the page

The new (for J. Cheminform.) SpringerNature article template has the Digital Object Identifier (DOI) at the bottom of the article page. So, every time I want to use the DOI I have to scroll all the way down the page. That might be fine for abstracts, but it is totally unusable for Open Access articles.

So, after our J. Cheminform. editors telcon this Monday, I started a Twitter poll:


Where do I want the DOI? At the top, with the other metadata:
Recent article in the Journal of Cheminformatics.
If you agree, please vote. With enough votes, we can engage with upper SpringerNature management to have journals choose where they want the DOI to be shown.

(Of course, having the DOI as semantic data in the HTML is also important, but that is quite well annotated in the HTML <head>. A link out to RDF about the article is still missing, I think.)

Posted at 06:59

January 14

Leigh Dodds: [Paper review] Open data for electricity modeling: Legal aspects

This blog post is a quick review and notes relating to a research paper called: Open data for electricity modeling: Legal aspects.

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

The paper reviews the legal status of publicly available energy data (and some related datasets) in Europe, with a focus on German law. The paper is intended to help identify some of the legal issues relevant to creation of analytical models to support use of energy data, e.g. for capacity planning.

As background, the paper describes the types of data relevant to building these types of model, the relevant aspects of database and copyright law in the EU and the properties of open licences. This background is used to assess some of the key data assets published in the EU and how they are licensed (or not) for reuse.

The paper concludes that the majority of uses of this data to support energy modelling in the EU, whether for research or other purposes, are likely to be infringing the rights of the database holders, meaning that users are currently carrying legal risks. The paper notes that in many cases this is likely not the intended outcome.

The paper provides a range of recommendations to address this issue, including the adoption of open licences.

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. It provides a helpful primer on the range of datasets and data types that are used to develop applications in the energy sector in the EU. Useful if you want to know more about the domain
  2. The background information on database rights and related IP law is clearly written and a good introduction to the topic
  3. The paper provides a great case study of how licensing and legal protections applies to data use in a sector. The approach taken could be reused and extended to other areas

Three things I learned

Here’s three things that I learned from reading the paper.

  1. That a database might be covered by copyright (an “original” database) in addition to database rights. But the authors note this doesn’t apply in the case of a typical energy dataset.
  2. That individual member states might have their own statutory exemptions to the Database Directive. E.g. in Germany it doesn’t apply to use of data in non-commercial teaching. So there is variation in how it applies.
  3. The discussion on how the Database Directive relates to statutory obligations to publish data was interesting, but highlights that the situation is unclear.

Thoughts and impressions

Great paper that clearly articulates the legal issues relating to publication and use of data in the energy sector in the EU. It’s easy to extrapolate from this work to other use cases in energy and by extension to other sectors.

The paper concludes with a good set of recommendations: the adoption of open licences, the need to clarify rights around data reuse and the role of data institutions in doing that, and how policy makers can push towards a more open ecosystem.

However there’s a suggestion that funders should just mandate open licences when funding academic research. While this is the general trend I see across research funding, in the context of this article it lacks a bit of nuance. The paper clearly indicates that the current status quo is that data users do not have the rights to apply open licences to the data they are publishing and generating. I think funders also need to engage with other policy makers to ensure that upstream provision of data is aligned with an open research agenda. Otherwise we risk perpetuating an unclear landscape of rights and permissions. The authors do note the need to address wider issues, but I think there’s a potential role of research funders in helping to drive change.

Finally, in their review of open licences, the authors recommend a move towards adoption of CC0 (public domain waivers and marks) and CC-BY 4.0. But they don’t address the fact that upstream licensing might limit the choice of how researchers can licence downstream data.

Specifically, the authors note the use of OpenStreetmap data to provide infrastructure data. However, depending on your use, you may need to adopt OpenStreetmap’s licence (the ODbL) when republishing data. This can be at odds with a mandate to use other licences, or with restrictive licences used by other data stewards.

Posted at 19:06

December 10

Peter Mika: Common Tag semantic tagging format released today

The Common Tag format for semantic tagging has finally been released today, after almost a year of intense work on it by a group of Web companies active in the semantic technologies area, among them Yahoo. It’s been great fun working on this and I’m proud to have been involved: while there have been vocabularies before for representing tags in RDF, this effort is different in at least two respects.

First, a significant amount of time has been spent on making sure the specification meets the needs of all partners involved. The support of these companies for the specification will ensure that developers in the future can rely on a single format for annotating with semantic tags and for interchanging tag data. The website already lists a number of applications, but I’m pretty sure that a common tagging format will open entirely new possibilities in searching, navigating and aggregating web content.

Second, the format has been developed with publishers in mind, in particular making it as easy as possible to embed semantic tags in HTML using RDFa, a syntax universally embraced by all those involved. The choice of RDF also means that, unlike in the case of the rel-tag microformat, Common Tags can be applied to any object, not just documents.

So, it’s time for a new era in tagging!

Posted at 17:10

Sandro Hawke: Simplified RDF

I propose that we designate a certain subset of the RDF model as “Simplified RDF” and standardize a method of encoding full RDF in Simplified RDF. The subset I have in mind is exactly the subset used by Facebook’s Open Graph Protocol (OGP), and my proposed encoding technique is relatively straightforward.

I’ve been mulling over this approach for a few months, and I’m fairly confident it will work, but I don’t claim to have all the details perfect yet. Comments and discussion are quite welcome, on this posting or on the semantic-web@w3.org mailing list. This discussion, I’m afraid, is going to be heavily steeped in RDF tech; simplified RDF will be useful for people who don’t know all the details of RDF, but this discussion probably won’t be.

My motivation comes from several directions, including OGP. With OGP, Facebook has motivated a huge number of Web sites to add RDFa markup to their pages. But the RDF they’ve added is quite constrained, and is not practically interoperable with the rest of the Semantic Web, because it uses simplified RDF. One could argue that Facebook made a mistake here, that they should be requiring full “normal” RDF, but my feeling is their engineering decisions were correct, that this extreme degree of simplification is necessary to get any reasonable uptake.

I also think simplified RDF will play well with JSON developers. JRON is pretty simple, but simplified RDF would allow it to be simpler still. Or, rather, it would mean folks using JRON could limit themselves to an even smaller number of “easy steps” (about three, depending on how open design issues are resolved).

Cutting Out All The Confusing Stuff

Simplified RDF makes the following radical restrictions to the RDF model and to deployment practice:

  1. The subject URIs are always web page addresses. The content-negotiation hack for “hash” URIs and the 303-see-other hack for “slash” URIs are both avoided.

    (Open issue: are html fragment URIs okay? Not in OGP, but I think it will be okay and useful.)

  2. The values of the properties (the “object” components of the RDF triples) are always strings. No datatype information is provided in the data, and object references are done by just putting the object URI into the string, instead of making it a normal URI-label node.

    (Open issue: what about language tags? I think RDFa will provide this for free in OGP, if the html has a language tag.)

    (Open issue: what about multi-valued (repeated) properties? Are they just repeated, or are the multiple values packed into the string, perhaps? OGP has multiple administrators listed as “USER_ID1,USER_ID2”. JSON lists are another factor here.)

At first inspection this reduction appears to remove so much from RDF as to make it essentially useless. Our beloved RDF has been blown into a hundred pieces and scattered to the wind. It turns out, however, that it still has enough magic to reassemble itself (with a little help from its friends, http and rdfs).

This image may give a feeling for the relationship of full RDF and simplified RDF:

Reassembling Full RDF

The basic idea is that given some metadata (mostly: the schema), we can construct a new set of triples in full RDF which convey what the simplified RDF intended. The new set will be distinguished by using different predicates, and the predicates are related by schema information available by dereferencing the predicate URI. The specific relations used, and other schema information, allows us to unambiguously perform the conversion.

For example, og:title is intended to convey the same basic notion as rdfs:label. They are not the same property, though, because og:title is applied to a page about the thing which is being labeled, rather than the thing itself. So rather than saying they are related by owl:equivalentProperty, we say:

  og:title srdf:twin rdfs:label.

This ties them together, saying they are “parallel” or “convertible”, and allows us to use other information in the schema(s) for og:title and rdfs:label to enable conversion.

The conversion goes something like this:

  1. The subject URLs should usually be taken as pages whose foaf:primaryTopic is the real subject. (Expressing the XFN microformat in RDF provides a gentle introduction to this kind of idea.) That real subject can be identified with a blank node or with a constructed URI using a “thing described by” service such as t-d-b.org. A little more work is needed on how to make such services efficient, but I think the concept is proven. I’d expect facebook to want to run such a service.

    In some cases, the subject URL really does identify the intended subject, such as when the triple is giving the license information for the web page itself. These cases can be distinguished in the schema by indicating the simplified RDF property is an IndirectProperty or MetadataProperty.

  2. The object (value) can be reconstructed by looking at the range of the full-RDF twin. For example, given that something has an og:latitude of “37.416343”, og:latitude and example:latitude are twins, and example:latitude has a range of xs:decimal, we can conclude the thing has an example:latitude of “37.416343”^^xs:decimal. (A sketch of this reconstruction as a query appears after this list.)

    Similarly, the Simplified RDF technique of putting URIs in strings for the object can be undone by knowing that the twin is an ObjectProperty, or has some non-Literal range.

    I believe language tagging could also be wrapped into the predicate (like comment_fr, comment_en, comment_jp, etc.) if that kind of thing turns out to be necessary, using OWL 2 range restrictions on the rdf:langRange facet.
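
To make that reconstruction concrete, here is a minimal sketch of it as a single SPARQL 1.1 CONSTRUCT query. It assumes the srdf:twin statements and the ranges of the full-RDF twins are loaded alongside the simplified data, it uses a made-up namespace for srdf: and an illustrative URI pattern for the “thing described by” service, and it only covers the datatype case; the ObjectProperty and metadata-property cases above would need similar rules:

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX srdf: <http://example.org/srdf#>    # placeholder namespace for srdf:twin

  # Rebuild full RDF from simplified RDF: follow srdf:twin from the simplified
  # predicate to its full-RDF twin, mint a URI for the real subject via a
  # "thing described by" service, and type the string value using the twin's
  # declared range (e.g. og:latitude "37.416343" becomes example:latitude
  # "37.416343"^^xs:decimal).
  CONSTRUCT {
    ?page  foaf:primaryTopic ?thing .
    ?thing ?fullProp ?typedValue .
  }
  WHERE {
    ?page ?simpleProp ?value .           # simplified triple: page + plain string
    ?simpleProp srdf:twin ?fullProp .    # schema says the two are "parallel"
    ?fullProp rdfs:range ?datatype .     # the twin's range gives the datatype
    # the URI pattern below is illustrative, not the real t-d-b.org interface
    BIND (IRI(CONCAT("http://t-d-b.org/?uri=", STR(?page))) AS ?thing)
    BIND (STRDT(?value, ?datatype) AS ?typedValue)
  }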

Next Steps

So, that’s a rough sketch, and I need to wrap this up. If you’re at ISWC, I’ll be giving a 2 minute lightning talk about this at lunch later today. But if you’ve read this far, the talk won’t say anything you don’t already know.

FWIW, I believe this is implementable in RIF Core, which would mean data consumers which do RIF Core processing could get this functionality automatically. But since we don’t have any data consumer libraries which do that yet, it’s probably easiest to implement this with normal code for now.

I think this is a fairly urgent topic because of the adoption curve (and energy) on OGP, and because it might possibly inform the design of a standard JSON serialization for RDF, which I’m expecting W3C to work on very soon.

Posted at 17:09

Leigh Dodds: Lets talk about plugs

This is a summary of a short talk I gave internally at the ODI to help illustrate some of the important aspects of data standards for non-technical folk. I thought I’d write it up here too, in case it’s useful for anyone else. Let me know what you think.

We benefit from standards in every aspect of our daily lives. But because we take them for granted, we don’t tend to think about them very much. At the ODI we’re frequently talking about standards for data which, if you don’t have a technical background, might be even harder to wrap your head around.

A good example can help to illustrate the value of standards. People frequently refer to telephone lines, railway tracks, etc. But there’s an example that we all have plenty of personal experience with.

Lets talk about plugs!

You can confidently plug any of your devices into a wall socket and it will just work. No thought required.

Have you ever thought about what it would be like if plugs and wall sockets were all different sizes and shapes?

You couldn’t rely on being able to consistently plug your device into any random socket, so you’d have to carry around loads of different cables. Manufacturers might not design their plugs and sockets very well, so there might be greater risks of electrocution or fires. Or maybe the company that built your new house decided to only fit a specific type of wall socket because it had agreed a deal with an electrical manufacturer, so when you moved in you’d need to buy a completely new set of devices.

We don’t live in that world thankfully. As a nation we’ve agreed that all of our plugs should be designed the same way.

That’s all a standard is. A documented, reusable agreement that everyone uses.

Notice that a single standard, “how to design a really great plug“, has multiple benefits. Safety is increased. We save time and money. Manufacturers can be confident that their equipment will work in any home or office.

That’s true of different standards too. Standards have economic, policy, technical and social impacts.

Open up a UK plug and it looks a bit like this.

Notice that there are standard colours for the different types of wire, and that fuses are expected to be the same size and shape. Those are all standards too. The wiring and voltages are standardised as well.

So the wiring, wall sockets and plugs in your house are designed according to a whole family of different standards that are designed to work with one another.

We can design more complex systems from smaller standards. It helps us make new things faster, because we are reusing existing work.

That’s a lot of time and agreement that we all benefit from. Someone somewhere has invested the time and energy into thinking all of that through. Lucky us!

When we visit other countries, we learn that their plugs and sockets are different. Oh no!

That can be a bit frustrating, and means we have to spend a bit more money and remember to pack the right adapters. It’d be nice if the whole world agreed on how to design a plug. But that seems unlikely. It would cost a lot of time and money in replacing wiring and sockets.

But maybe those different designs are intentional? Perhaps there are different local expectations around safety, for example. Or in what devices people might be using in their homes. There might be reasons why different communities choose to design and adopt slightly different standards. Because they’re meeting slightly different needs. But sometimes those differences might be unnecessary. It can be hard to tell sometimes.

The people most impacted by these differences aren’t tourists, it’s the manufacturers that have to design equipment to work in different locations. Which is why your electrical devices normally have a separate cable. So, depending on whether you travel or whether you’re a device manufacturer, you’ll have different perceptions of how much of a problem that is.

All of the above is true for data standards.

Standards for data are agreements that help us collect, access, share, use and publish data in consistent ways.  They have a range of different impacts.

There are lots of different types of standard and we combine them together to create different ways to successfully exchange data. Different communities often have their own standards for similar things, e.g. for describing metadata or accessing data via an API.

Sometimes those are simple differences that an adapter can easily fix. Sometimes those differences are because the standards are designed to meet different needs.

Unfortunately we don’t live in a world of standardised data plugs and wires and fuses. We live in that other world. The one where it’s hard to connect one thing to another. Where the stuff coming down the wires is completely unexpected. And where we get repeated shocks from accidental releases of data.

I guarantee that in every piece of user research, interview, government consultation or call for evidence, people will consistently highlight the need for more standards for data. People will often say this explicitly: “We need more standards!”. But sometimes they refer to the need in other ways: “We need to make data more discoverable!” (metadata standards) or “We need to make it easier to safely release data!” (standardised codes of practice).

Unfortunately that’s not always that helpful because when you probe a little deeper you find that people are talking about lots of different things. Some people want to standardise the wiring. Others just want to agree on a voltage. While others are still debating the definition of “fuse”. These are all useful and important things. You just need to dig a little deeper to find the most useful place to start.

It’s also not always clear whose job it is to actually create those standards. Because we take standards for granted, we’re not always clear about how they get created, or how long it takes and what process to follow to ensure they’re well designed.

The reason we published the open standards for data guidebook was to help communities get started in designing the standards they need.

Standards development needs time and investment, as someone somewhere needs to do the work of creating them. That, as ever, is the really hard part.

Standards are part of the data infrastructure that help us unlock value from data. We need to invest in creating and maintaining them like we do other parts of our infrastructure.

Don’t just listen to me, listen to some of the people who’ve been creating standards for their communities.

Posted at 17:07

John Breslin: Book launch for "The Social Semantic Web"

We had the official book launch of “The Social Semantic Web” last month in the President’s Drawing Room at NUI Galway. The book was officially launched by Dr. James J. Browne, President of NUI Galway. The book was authored by myself, Dr. Alexandre Passant and Prof. Stefan Decker from the Digital Enterprise Research Institute at NUI Galway (sponsored by SFI). Here is a short blurb:

Web 2.0, a platform where people are connecting through their shared objects of interest, is encountering boundaries in the areas of information integration, portability, search, and demanding tasks like querying. The Semantic Web is an ideal platform for interlinking and performing operations on the diverse data available from Web 2.0, and has produced a variety of approaches to overcome limitations with Web 2.0. In this book, Breslin et al. describe some of the applications of Semantic Web technologies to Web 2.0. The book is intended for professionals, researchers, graduates, practitioners and developers.

Some photographs from the launch event are below.

Posted at 17:05

John Breslin: Another successful defense by Uldis Bojars in November

Uldis Bojars submitted his PhD thesis entitled “The SIOC MEthodology for Lightweight Ontology Development” to the University in September 2009. We had a nice night out to celebrate in one of our favourite haunts, Oscars Bistro.

Jodi, John, Alex, Julie, Liga, Sheila and Smita

This was followed by a successful defense at the end of November 2009. The examiners were Chris Bizer and Stefan Decker. Uldis even wore a suit for the event, see below.

I will rule the world!

Uldis established a formal ontology design process called the SIOC MEthodology, based on an evolution of existing methodologies that have been streamlined, experience developing the SIOC ontology, and observations regarding the development of lightweight ontologies on the Web. Ontology promotion and dissemination is established as a core part of the ontology development process. To demonstrate the usage of the SIOC MEthodology, Uldis described the SIOC project case study which brings together the Social Web and the Semantic Web by providing semantic interoperability between social websites. This framework allows data to be exported, aggregated and consumed from social websites using the SIOC ontology (in the SIOC application food chain). Uldis’ research work has been published in 4 journal articles, 8 conference papers, 13 workshop papers, and 1 book chapter. The SIOC framework has also been adopted in 33 third-party applications. The Semantic Radar tool he initiated for Firefox has been downloaded 24,000 times. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Uldis all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.

Posted at 17:05

John Breslin: Haklae Kim and his successful defense in September

This is a few months late but better late than never! We said goodbye to PhD researcher Haklae Kim in May of this year when he returned to Korea and took up a position with Samsung Electronics soon afterward. We had a nice going away lunch for Haklae with the rest of the team from the Social Software Unit (picture below).

Sheila, Uldis, John, Haklae, Julie, Alex and Smita

Haklae returned to Galway in September to defend his PhD entitled “Leveraging a Semantic Framework for Augmenting Social Tagging Practices in Heterogeneous Content Sharing Platforms”. The examiners were Stefan Decker, Tom Gruber and Philippe Laublet. Haklae successfully defended his thesis during the viva, and he will be awarded his PhD in 2010. We got a nice photo of the examiners during the viva which was conducted via Cisco Telepresence, with Stefan (in Galway) “resting” his hand on Tom’s shoulder (in San Jose)!

Philippe Laublet, Haklae Kim, Tom Gruber, Stefan Decker and John Breslin

Haklae created a formal model called SCOT (Social Semantic Cloud of Tags) that can semantically describe tagging activities. The SCOT ontology provides enhanced features for representing tagging and folksonomies. This model can be used for sharing and exchanging tagging data across different platforms. To demonstrate the usage of SCOT, Haklae developed the int.ere.st open tagging platform that combined techniques from both the Social Web and the Semantic Web. The SCOT model also provides benefits for constructing social networks. Haklae’s work allows the discovery of social relationships by analysing tagging practices in SCOT metadata. He performed these analyses using both Formal Concept Analysis and tag clustering algorithms. The SCOT model has also been adopted in six applications (OpenLink Virtuoso, SPARCool, RelaxSEO, RDFa on Rails, OpenRDF, SCAN), and the int.ere.st service has 1,200 registered members. Haklae’s research work was published in 2 journal articles, 15 conference papers, 3 workshop papers, and 2 book chapters. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Haklae all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.

Posted at 17:05

John Breslin: Some of my (very) preliminary opinions on Google Wave

I was interviewed by Marie Boran from Silicon Republic recently for an interesting article she was writing entitled “Will Google Wave topple the e-mail status quo and change the way we work?“. I thought that maybe my longer answers may be of interest and am pasting them below.

Disclaimer: My knowledge of Google Wave is second hand through various videos and demonstrations I’ve seen… Also, my answers were written pretty quickly!

As someone who is both behind Ireland’s biggest online community boards.ie and a researcher at DERI on the Semantic Web, are you excited about Google Wave?

Technically, I think it’s an exciting development – commercially, it obviously provides potential for others (Google included) to set up a competing service to us (!), but I think what is good is the way it has been shown that Google Wave can integrate with existing platforms. For example, there’s a nice demo showing how Google Wave plus MediaWiki (the software that powers the Wikipedia) can be used to help editors who are simultaneously editing a wiki page. If it can be done for wikis, it could aid with lots of things relevant to online communities like boards.ie. For example, moderators could see what other moderators are online at the same time, communicate on issues such as troublesome users, posts with questionable content, and then avoid stepping on each other’s toes when dealing with issues.

Does it have potential for collaborative research projects? Or is it heavyweight/serious enough?

I think it has some potential when combined with other tools that people are using already. There’s an example from SAP of Google Wave being integrated with a business process modelling application. People always seem to step back to e-mail for doing various research actions. While wikis and the like can be useful tools for quickly drafting research ideas, papers, projects, etc., there is that element of not knowing who is doing stuff at the same time as you. Just as people are using Gtalk to augment Gmail by being able to communicate with contacts in real-time when browsing e-mails, Google Wave could potentially be integrated with other platforms such as collaborative work environments, document sharing systems, etc. It may not be heavyweight enough on its own but at least it can augment what we already use.

Where does Google Wave sit in terms of the development of the Semantic Web?

I think it could be a huge source of data for the Semantic Web. What we find with various social and collaborative platforms is that people are voluntarily creating lots of useful related data about various objects (people, events, hobbies, organisations) and having a more real-time approach to creating content collaboratively will only make that source of data bigger and hopefully more interlinked. I’d hope that data from Google Wave can be made available using technologies such as SIOC from DERI, NUI Galway and the Online Presence Ontology (something we are also working on).

If we are to use Google Wave to pull in feeds from all over the Web will both RSS and widgets become sexy again?

I haven’t seen the example of Wave pulling in feeds, but in theory, what I could imagine is that real-time updating of information from various sources could allow that stream of current information to be updated, commented upon and forwarded to various other Waves in a very dynamic way. We’ve seen how Twitter has already provided some new life for RSS feeds in terms of services like Twitterfeed automatically pushing RSS updates to Twitter, and this results in some significant amounts of rebroadcasting of that content via retweets etc.

Certainly, one of the big things about Wave is its integration of various third-party widgets, and I think once it is fully launched we will see lots of cool applications building on the APIs that they provide. There’s been a few basic demonstrator gadgets shown already like polls, board games and event planning, but it’ll be the third-party ones that make good use of the real-time collaboration that will probably be the most interesting, as there’ll be many more people with ideas compared to some internal developers.

Is Wave the first serious example of a communications platform that will only be as good as the third-party developers that contribute to it?

Not really. I think that title applies to many of the communications platforms we use on the Web. Facebook was a busy service but really took off once the user-contributable applications layer was added. Drupal was obviously the work of a core group of people but again the third-party contributions outweigh those of the few that made it.

We already have e-mail and IM combined in Gmail and Google Docs covers the collaborative element so people might be thinking ‘what is so new, groundbreaking or beneficial about Wave?’ What’s your opinion on this?

Perhaps the real-time editing and updating process. Often times, it’s difficult to go back in a conversation and add to or fix something you’ve said earlier. But it’s not just a matter of rewriting the past – you can also go back and see what people said before they made an update (“rewind the Wave”).

Is Google heading towards unified communications with Wave, and is it possible that it will combine Gmail, Wave and Google Voice in the future?

I guess Wave could be one portion of a UC suite but I think the Wave idea doesn’t encompass all of the parts…

Do you think Google is looking to pull in conversations the way FriendFeed, Facebook and Twitter does? If so, will it succeed?

Yes, certainly Google have had interests in this area with their acquisition of Jaiku some time back (everyone assumed this would lead to a competitor to Twitter; most recently they made the Jaiku engine available as open source). I am not sure if Google intends to make available a single entry point to all public waves that would rival Twitter or Facebook status updates, but if so, it could be a very powerful competitor.

Is it possible that Wave will become as widely used and ubiquitous as Gmail?

It will take some critical mass to get it going, integrating it into Gmail could be a good first step.

And finally – is the game changing in your opinion?

Certainly, we’ve moved from frequently updated blogs (every few hours/days) to more frequently updated microblogs (every few minutes/seconds) to being able to not just update in real-time but go back and easily add to / update what’s been said any time in the past. People want the freshest content, and this is another step towards not just providing content that is fresh now but a way of freshening the content we’ve made in the past.


Posted at 17:05

John Breslin: Open government and Linked Data; now it's time to draft…

For the past few months, there have been a variety of calls for feedback and suggestions on how the US Government can move towards becoming more open and transparent, especially in terms of their dealings with citizens and also for disseminating information about their recent financial stimulus package.

As part of this, the National Dialogue forum was set up to solicit solutions for ways of monitoring the “expenditure and use of recovery funds”. Tim Berners-Lee wrote a proposal on how linked open data could provide semantically-rich, linkable and reusable data from Recovery.gov. I also blogged about this recently, detailing some ideas for how discussions by citizens on the various uses of expenditure (represented using SIOC and FOAF) could be linked together with financial grant information (in custom vocabularies).

More recently, the Open Government Initiative solicited ideas for a government that is “more transparent, participatory, and collaborative”, and the brainstorming and discussion phases have just ended. This process is now in its third phase, where the ideas proposed to solve various challenges are to be more formally drafted in a collaborative manner.

What is surprising about this is how few submissions and contributions have been put into this third and final phase (see graph below), especially considering that there is only one week for this to be completed. Some topics have zero submissions, e.g. “Data Transparency via Data.gov: Putting More Data Online”.

[Graph: submissions to phase three of the Open Government Directive]

This doesn’t mean that people aren’t still thinking about this. On Monday, Tim Berners-Lee published a personal draft document entitled “Putting Government Data Online“. But we need more contributions from the Linked Data community to the drafts during phase three of the Open Government Directive if we truly believe that this solution can make a difference.

For those who want to learn more about Linked Data, click on the image below to go to Tim Berners-Lee’s TED talk on Linked Data.

(I watched it again today, and added a little speech bubble to the image below to express my delight at seeing SIOC profiles on the Linked Open Data cloud slide.)

We also have a recently-established Linked Data Research Centre at DERI in NUI Galway.

[Image: Tim Berners-Lee's TED talk on Linked Data, linked from the original post]


Posted at 17:05

John Breslin: BlogTalk 2009 (6th International Social Software Conference) – Call for Proposals – September 1st and 2nd – Jeju, Korea


BlogTalk 2009
The 6th International Conf. on Social Software
September 1st and 2nd, 2009
Jeju Island, Korea

Overview

Following the international success of the last five BlogTalk events, the next BlogTalk – to be held in Jeju Island, Korea on September 1st and 2nd, 2009 – is continuing with its focus on social software, while remaining committed to the diverse cultures, practices and tools of our emerging networked society. The conference (which this year will be co-located with Lift Asia 09) is designed to maintain a sustainable dialog between developers, innovative academics and scholars who study social software and social media, practitioners and administrators in corporate and educational settings, and other general members of the social software and social media communities.

We invite you to submit a proposal for presentation at the BlogTalk 2009 conference. Possible areas include, but are not limited to:

  • Forms and consequences of emerging social software practices
  • Social software in enterprise and educational environments
  • The political impact of social software and social media
  • Applications, prototypes, concepts and standards

Participants and proposal categories

Due to the interdisciplinary nature of the conference, audiences will come from different fields of practice and will have different professional backgrounds. We strongly encourage proposals to bridge these cultural differences and to be understandable for all groups alike. Along those lines, we will offer three different submission categories:

  • Academic
  • Developer
  • Practitioner

For academics, BlogTalk is an ideal conference for presenting and exchanging research work from current and future social software projects at an international level. For developers, the conference is a great opportunity to fly ideas, visions and prototypes in front of a distinguished audience of peers, to discuss, to link-up and to learn (developers may choose to give a practical demonstration rather than a formal presentation if they so wish). For practitioners, this is a venue to discuss use cases for social software and social media, and to report on any results you may have with like-minded individuals.

Submitting your proposals

You must submit a one-page abstract of the work you intend to present for review purposes (not to exceed 600 words). Please upload your submission along with some personal information using the EasyChair conference area for BlogTalk 2009. You will receive a confirmation of the arrival of your submission immediately. The submission deadline is June 27th, 2009.

Following notification of acceptance, you will be invited to submit a short or long paper (four or eight pages respectively) for the conference proceedings. BlogTalk is a peer-reviewed conference.

Timeline and important dates

  • One-page abstract submission deadline: June 27th, 2009
  • Notification of acceptance or rejection: July 13th, 2009
  • Full paper submission deadline: August 27th, 2009

(Due to the tight schedule we expect that there will be no deadline extension. As with previous BlogTalk conferences, we will work hard to endow a fund for supporting travel costs. As soon as we review all of the papers we will be able to announce more details.)

Topics

Application Portability
Bookmarking
Business
Categorisation
Collaboration
Content Sharing
Data Acquisition
Data Mining
Data Portability
Digital Rights
Education
Enterprise
Ethnography
Folksonomies and Tagging
Human Computer Interaction
Identity
Microblogging
Mobile
Multimedia
Podcasting
Politics
Portals
Psychology
Recommender Systems
RSS and Syndication
Search
Semantic Web
Social Media
Social Networks
Social Software
Transparency and Openness
Trend Analysis
Trust and Reputation
Virtual Worlds
Web 2.0
Weblogs
Wikis

Posted at 17:05

John Breslin: Idea for Linked Open Data from the US recovery effort

Tim Berners-Lee recently posted an important request for the provision of Linked Open Data from the US recovery effort website Recovery.gov.

The National Dialogue website (set up to solicit ideas for data collection, storage, warehousing, analysis and visualisation; website design; waste, fraud and abuse detection; and other solutions for transparency and accountability) says that for Recovery.gov to be a useful portal for citizens, it “requires finding innovative ways to integrate, track, and display data from thousands of federal, state, and local entities”.

If you support the idea of Linked Open Data from Recovery.gov, you can have a read and provide some justifications on this thread.

(I’ve recently given some initial ideas about how grant feed data could be linked with user contributions in the form of associated threaded discussions on different topics, see picture below, all to be exposed as Linked Open Data using custom schemas plus SIOC and FOAF across a number of agencies / funding programs / websites.)

[Diagram: grant feed data linked with threaded discussions using SIOC and FOAF]


Posted at 17:05

December 09

Sebastian Trueg: Conditional Sharing – Virtuoso ACL Groups Revisited

Previously we saw how ACLs can be used in Virtuoso to protect different types of resources. Today we will look into conditional groups, which allow us to share resources or grant permissions to a dynamic group of individuals. This means that we do not maintain a list of group members but instead define a set of conditions which an individual needs to fulfill in order to be part of the group in question.

That does sound very dry. Let’s just jump to an example:

@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
[] a oplacl:ConditionalGroup ;
  foaf:name "People I know" ;
  oplacl:hasCondition [
    a oplacl:QueryCondition ;
    oplacl:hasQuery """ask where { graph <urn:my> { <urn:me> foaf:knows ^{uri}^ } }"""
  ] .

This group is based on a single condition which uses a simple SPARQL ASK query. The ask query contains a variable ^{uri}^ which the ACL engine will replace with the URI of the authenticated user. The group contains anyone who is in a foaf:knows relationship to urn:me in named graph urn:my. (Ideally the latter graph should be write-protected using ACLs as described before.)

Now we use this group in ACL rules. That means we first create it:

$ curl -X POST \
    --data-binary @group.ttl \
    -H"Content-Type: text/turtle" \
    -u dba:dba \
    http://localhost:8890/acl/groups

As a result we get a description of the newly created group which also contains its URI. Let’s imagine this URI is http://localhost:8890/acl/groups/1.

To mix things up we will use the group for sharing permission to access a service instead of files or named graphs. Like many of the Virtuoso-hosted services the URI Shortener is ACL controlled. We can restrict access to it using ACLs.

As always the URI Shortener has its own ACL scope which we need to enable for the ACL system to kick in:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope <urn:virtuoso:val:scopes:curi> .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope <urn:virtuoso:val:scopes:curi> .
};

Now we can go ahead and create our new ACL rule which allows anyone in our conditional group to shorten URLs:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
[] a acl:Authorization ;
  oplacl:hasAccessMode oplacl:Write ;
  acl:accessTo <http://localhost:8890/c> ;
  acl:agent <http://localhost:8890/acl/groups/1> ;
  oplacl:hasScope <urn:virtuoso:val:scopes:curi> ;
  oplacl:hasRealm oplacl:DefaultRealm .

Finally we add one URI to the conditional group as follows:

sparql
insert into <urn:my> {
  <urn:me> foaf:knows <http://www.facebook.com/sebastian.trug> .
};

As a result my Facebook account has access to the URI Shortener:
Virtuoso URI Shortener

The example we saw here uses a simple query to determine the members of the conditional group. These queries could get much more complex, and multiple query conditions could be combined. In addition, Virtuoso handles a set of non-query conditions (see also oplacl:GenericCondition). The most basic one is the following, which matches any authenticated person:

@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
[] a oplacl:ConditionalGroup ;
  foaf:name "Valid Identifiers" ;
  oplacl:hasCondition [
    a oplacl:GroupCondition, oplacl:GenericCondition ;
    oplacl:hasCriteria oplacl:NetID ;
    oplacl:hasComparator oplacl:IsNotNull ;
    oplacl:hasValue 1
  ] .

This shall be enough on conditional groups for today. There will be more playing around with ACLs in the future…

Posted at 19:20

Peter Mika: Welcome to schema.org

Bing, Google, and Yahoo! have announced schema.org yesterday, a collaboration between the three search providers in the area of vocabularies for structured data. As the ‘schema guy’ at Yahoo!, I have been part of the very small core team that developed technical content for schema.org. It’s been an interesting process: if you doubt that achieving an agreement in the search domain is hard, consider that the last time such an agreement happened was apparently sitemaps.org in 2006.

However, over the years, the lack of agreement on schemas has become such a major pain point for publishers new to the Semantic Web that eventually cooperation became the only sensible thing to do for the future of the Semantic Web project. Consider that until yesterday any publisher that wanted to provide structured data for Bing, Google, and Yahoo needed to navigate three sets of documentation, and worse, choose between three different schemas and multiple formats (microdata, RDFa, microformats) and mark up their pages using all of them.

So how did we get here? In my personal view, one of the key problems of the Semantic Web design of the W3C has been that it considered only technical issues, and not the need for a social process that would bootstrap the system with data and schemas. We are now doing better on the data front thanks to large community efforts such as Linked Data. With regard to schemas — we used to call them ontologies, until we found it scared people away — the expectation was that they would be developed in a distributed manner and machines would do the hard job of schema matching, or somehow agreements would emerge. However, schema matching is a hard problem to automate. Agreements were slow to come due to a lack of space for schema development and discussions. We have tried a number of things in this respect; for example, some of you might know that I’ve been one of the instigators of the VoCamp movement, which peaked around 2009. The W3C itself accepted some RDF-based schemas as member submissions, but it didn’t see itself as the organization that should deal with schemas, and there was no process for dealing with these submissions either. (As an example, we learned when we started with SearchMonkey that there were actually two versions of VCard in RDF submitted by two different members of the W3C. This problem has since been resolved.) Other schemas just appeared on websites abandoned by their owners. Finding stable and mature schemas with sufficient adoption eventually became a major pain point. In the search domain, the situation improved somewhat when search providers preselected some schemas for publishers to use, and started providing specific documentation, with examples and a way to validate webpages. However, as illustrated above, the efforts remained too fragmented until yesterday.

Given the above history, I’m extremely glad that cooperation prevailed in the end and hopefully schema.org will become a central point for vocabularies for the Semantic Web for a long time to come. Note that it will almost certainly not be the only one. schema.org covers the core interests of search providers, i.e. the stuff that people search for the most (hence the somewhat awkward term ‘search vocabularies’). As the simple needs are the most common in search logs, this includes things like addresses of businesses, reviews and recipes. schema.org will hopefully evolve with extensions over time but it may never cover complex domains such as biotechnology, e-government or others where people have been using Semantic Web technology with success. Nor do I think that schema.org is ‘perfect’. Personally, I would have liked to see RDFa used as the syntax for the basic examples, because I consider it more mature, and a superior standard to microdata in many ways. You will notice that RDF(a) in particular would have offered a standard way to extend schema.org schemas and map them to other schemas on the Web. Currently, there is an example of using the schemas in RDFa, but the support for this version of the markup will depend on its adoption.

Please take a look at schema.org, and if you have comments please consider using the schema.org feedback mechanisms (we have a feedback form as well as a discussion group).


Posted at 19:17

Peter Mika: Microformats and RDFa deployment across the Web

I have presented on previous occasions (at Semtech 2009, SemTech 2010, and later at FIA Ghent 2010, see slides for the latter, also in ISWC 2009) some information about microformat and RDFa deployment on the Web. As such information is hard to come by, this has generated some interest from the audience. Unfortunately, Q&A time after presentations is too short to get into details, hence some additional background on how we obtained this data and what it means for the Web. This level of detail is also important to compare this with information from other sources, where things might be measured differently.

The chart below shows the deployment of certain microformats and RDFa markup on the Web, as percentage of all web pages, based on an analysis of 12 billion web pages indexed by Yahoo! Search. The same analysis has been done at three different time-points and therefore the chart also shows the evolution of deployment.

Microformats and RDFa deployment on the Web (% of all web pages)

The data is given below in a tabular format.

Date      RDFa    eRDF    tag     hcard   adr     hatom   xfn     geo     hreview
09-2008   0.238   0.093   N/A     1.649   N/A     0.476   0.363   N/A     0.051
03-2009   0.588   0.069   2.657   2.005   0.872   0.790   0.466   0.228   0.069
10-2010   3.591   0.000   2.289   1.058   0.237   1.177   0.339   0.137   0.159

There are a couple of comments to make:

  • There are many microformats (see microformats.org) and I only include data for the ones that are most common on the Web. To my knowledge at least, all other microformats are less common than the ones listed above.
  • eRDF has been a predecessor to RDFa, and has been obsoleted by it. RDFa is more fully featured than eRDF, and has been adopted as a standard by the W3C.
  • The data for the tag, adr and geo formats is missing from the first measurement.
  • The numbers cannot be aggregated to get a total percentage of URLs with metadata. The reason is that a webpage may contain multiple microformats and/or RDFa markup. In fact, this is almost always the case with the adr and geo microformats, which are typically used as part of hcard. The hcard microformat itself can be part of hatom markup etc.
  • Not all data is equally useful, depending on what you are trying to do. The tag microformat, for example, is nothing more than a set of keywords attached to a webpage. RDFa itself covers data using many different ontologies.
  • The data doesn’t include “trivial” RDFa usage, i.e. documents that only contain triples from the xhtml namespace. Such triples are often generated by RDFa parsers even when the page author did not intend to use RDFa.
  • This data includes all valid RDFa, and not just namespaces or vocabularies supported by Yahoo! or any other company.

The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.

These results make me optimistic that the Semantic Web is here already in large ways. I don’t expect that a 100% of webpages will ever adopt microformats or RDFa markup, simply because not all web pages contain structured data. As this seems interesting to watch, I will try to publish updates to the data and include the update chart here or in future presentations.


Posted at 19:17

December 05

AKSW Group - University of Leipzig: More Complete Resultset Retrieval from Large Heterogeneous RDF Sources

Over recent years, the Web of Data has grown significantly. Various interfaces such as LOD Stats, LOD Laundromat and SPARQL endpoints provide access to hundreds of thousands of RDF datasets, representing billions of facts. These datasets are available in different formats such as raw data dumps and HDT files, or directly accessible via SPARQL endpoints. Querying such a large amount of distributed data is particularly challenging and many of these datasets cannot be directly queried using the SPARQL query language.

In order to tackle these problems, we present WimuQ, an integrated query engine to execute SPARQL queries and retrieve results from a large number of heterogeneous RDF data sources. Presently, WimuQ is able to execute both federated and non-federated SPARQL queries over a total of 668,166 datasets from LOD Stats and LOD Laundromat, as well as 559 active SPARQL endpoints. These data sources represent a total of 221.7 billion triples from more than 5 terabytes of information from datasets retrieved using the service “Where is My URI” (WIMU). Our evaluation on state-of-the-art real-data benchmarks shows that WimuQ retrieves more complete results for the benchmark queries.
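
To make the federation side concrete, here is a minimal sketch of a SPARQL 1.1 federated query (the generic mechanism that engines such as WimuQ build on; the endpoints and patterns below are illustrative and not taken from the paper). Each SERVICE block is evaluated at a remote endpoint and the bindings are joined by the querying engine:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?city ?population ?mayor WHERE {
  # Evaluated at the DBpedia endpoint
  SERVICE <https://dbpedia.org/sparql> {
    ?city a dbo:City ;
          dbo:populationTotal ?population ;
          owl:sameAs ?wd .
    FILTER(STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
  }
  # Evaluated at the Wikidata endpoint, joined on ?wd
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd wdt:P6 ?mayor .   # P6 = head of government
  }
}
LIMIT 10

The difficult part for an engine like WimuQ is working out which of the hundreds of thousands of dumps and endpoints can actually contribute bindings for each triple pattern, especially when the query is written without any SERVICE clauses at all.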

The contributions of this work are:

  • A hybrid SPARQL query-processing engine to execute SPARQL queries over a large amount of heterogeneous RDF data.
  • Evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE).
  • We present the first federated SPARQL query-processing engine that executes SPARQL queries over a total of 221.7 billion triples.

This is ongoing work; the next step is a large-scale approach to studying the relations and similarities among the datasets. This work was supported by the Semantic Web group of HTWK Leipzig (https://www.htwk-leipzig.de/) under the advisement of Prof. Dr. rer. nat. Thomas Riechert.

Github repository: https://github.com/firmao/wimuT

Prototype/proof of concept: https://w3id.org/wimuq/

Slides: https://tinyurl.com/slidesKcap2019

Paper: https://dl.acm.org/citation.cfm?id=3364436

Conference: http://www.k-cap.org/2019/

Authors/Contact: valdestilhas@informatik.uni-leipzig.de, tsoru@informatik.uni-leipzig.de, saleem@informatik.uni-leipzig.de

Posted at 15:05

November 23

Benjamin Nowack: Want to hire me?

I have been happily working as a self-employed semantic web developer for the last seven years. With steady progress, I dare to say, but the market is still evolving a little bit too slowly for me (well, at least here in Germany) and I can't invest any longer. So I am looking for new challenges and an employer who would like to utilize my web technology experience (semantic or not). I have created a new personal online profile with detailed information about me, my skills, and my work.

My dream job would be in the social and/or data web area, I'm particularly interested in front-end development for data-centric or stream-oriented environments. I also love implementing technical specifications (probably some gene defect).

The potential show-stopper: I can't really relocate, for private reasons. I am happy to (tele)commute or travel, though. And I am looking for full-time employment (or a full-time, longer-term contract). I am already applying for jobs, mainly here in Düsseldorf so far, but I thought I'd send out this post as well. You never know :)

Posted at 01:06

November 19

Libby Miller: Tensorflow – saveModel for tflite

I want to convert an existing model to one that will run on a USB stick ‘accelerator’ called Coral. Conversion to tflite is needed for any small devices like these.

I’ve not managed this yet, but here are some notes.  I’ve figured out some of it, but come unstuck in that some operations (‘ops’) are not supported in tflite yet. But maybe this is still useful to someone, and I want to remember what I did.

I’m trying to change a tensorflow model – for which I only have .meta and .index files – to one with .pb files or variables, which seems to be called a ‘savedModel’. These have some interoperability, and appear to be a prerequisite for making a tflite model.

Here’s what I have to start with:

ls models/LJ01-1/
model_gs_933k.data-00000-of-00001.1E0cbbD3  
model_gs_933k.meta
model_gs_933k.data-00000-of-00001           
model_gs_933k.index                         
model_gs_933k.meta.289E3B1a

Conversion to SavedModel

First, create a savedModel (this code is for Tensorflow 1.3, but 2.0 is a simple conversion using a command-line tool).

import tensorflow as tf
model_path = 'LJ01-1/model_gs_933k'
output_node_names = ['Merge_1/MergeSummary']    
loaded_graph = tf.Graph()

with tf.Session(graph=loaded_graph) as sess:
    # Restore the graph
    sess.run(tf.global_variables_initializer())
    saver = tf.train.import_meta_graph(model_path+'.meta')
    # Load weights
    saver.restore(sess,model_path)
    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    builder = tf.saved_model.builder.SavedModelBuilder('new_models')
    op = sess.graph.get_operations()
    input_tensor = [m.values() for m in op][1][0]
    output_tensor = [m.values() for m in op][len(op)-1][0]

    # https://sthalles.github.io/serving_tensorflow_models/
    tensor_info_input = tf.saved_model.utils.build_tensor_info(input_tensor)
    tensor_info_output = tf.saved_model.utils.build_tensor_info(output_tensor)
    prediction_signature = (
      tf.saved_model.signature_def_utils.build_signature_def(
         inputs={'x_input': tensor_info_input},
         outputs={'y_output': tensor_info_output},
         method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
    builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
      signature_def_map={
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
          prediction_signature
      },
      )
    builder.save()

I used

output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
print(output_node_names)

to find out the names of the input and output ops.

That gives you a directory (new_models) like

new_models/variables
new_models/variables/variables.data-00000-of-00001
new_models/variables/variables.index
new_models/saved_model.pb

Conversion to tflite

Once you have that, then you can use the command-line tool tflite_convert (examples) – 

tflite_convert --saved_model_dir=new_models --output_file=model.tflite --enable_select_tf_ops

This does the conversion to tflite. And it will probably fail, e.g. mine did this:

Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ABS, ADD, CAST, CONCATENATION, CONV_2D, DIV, EXP, EXPAND_DIMS, FLOOR, GATHER, GREATER_EQUAL, LOGISTIC, MEAN, MUL, NEG, NOT_EQUAL, PAD, PADV2, RSQRT, SELECT, SHAPE, SOFTMAX, SPLIT, SQUARED_DIFFERENCE, SQUEEZE, STRIDED_SLICE, SUB, SUM, TRANSPOSE, ZEROS_LIKE. Here is a list of operators for which you will need custom implementations: BatchMatMul, FIFOQueueV2, ImageSummary, Log1p, MergeSummary, PaddingFIFOQueueV2, QueueDequeueV2, QueueSizeV2, RandomUniform, ScalarSummary.

You can add --allow_custom_ops to that, which will let everything through – but it still won’t work if there are ops that are not tflite supported – you have to write custom operators for the ones that don’t yet work (I’ve not tried this).

But it’s still useful to use --allow_custom_ops, i.e.

tflite_convert --saved_model_dir=new_models --output_file=model.tflite --enable_select_tf_ops --allow_custom_ops

because you can visualise the graph once you have a tflite file, using netron. Which is quite interesting, although I suspect it doesn’t work for the bits which it passed through but doesn’t support.

>>> import netron 
>>> netron.start('model.tflite')
Serving 'model.tflite' at http://localhost:8080

Update – I forgot a link:

“This document outlines how to use TensorFlow Lite with select TensorFlow ops. Note that this feature is experimental and is under active development. “
However…

Posted at 10:06

November 14

Ebiquity research group UMBC: Why does Google think Raymond Chandler starred in Double Indemnity?

In my knowledge graph class yesterday we talked about the SPARQL query language and I illustrated it with DBpedia queries, including an example getting data about the movie Double Indemnity. I had brought a Google Assistant device and used it to compare its answers to those from DBpedia. When I asked the Google Assistant “Who starred in the film Double Indemnity”, the first person it mentioned was Raymond Chandler. I knew this was wrong, since he was one of its screenwriters, not an actor, and shared an Academy Award nomination for the screenplay. DBpedia’s data was correct and did not list Chandler as one of the actors.
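
For readers who want to try this themselves, a query along these lines (a minimal sketch, not necessarily the exact query used in class, and assuming the film's resource is dbr:Double_Indemnity) lists the actors DBpedia records for the movie when run against the public endpoint at https://dbpedia.org/sparql:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?actor WHERE {
  # dbo:starring links a film to the actors credited in it
  dbr:Double_Indemnity dbo:starring ?actor .
}

As noted above, Raymond Chandler does not appear among the dbo:starring values.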

I did not feel too bad about this — we shouldn’t expect perfect accuracy in these huge, general purpose knowledge graphs and at least Chandler played an important role in making the film.

After class I looked at the Wikidata page for Double Indemnity (Q478209) and saw that it did list Chandler as an actor. I take this as evidence that Google’s knowledge Graph got this incorrect fact from Wikidata, or perhaps from a precursor, Freebase.

The good news 🙂 is that Wikidata had flagged the fact that Chandler (Q180377) was a cast member in Double Indemnity with a “potential issue”. Clicking on this revealed that the issue was that Chandler was not known to have an occupation that the “cast member” property (P161) expects, which includes twelve types, such as actor, opera singer, comedian, and ballet dancer. Wikidata lists Chandler’s occupations as screenwriter, novelist, writer and poet.
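
A quick way to see both the cast list and the occupations the constraint inspects is a query against the Wikidata Query Service (https://query.wikidata.org/); this is just an illustrative sketch using the identifiers mentioned above:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?castMember ?occupation WHERE {
  wd:Q478209 wdt:P161 ?castMember .                  # cast members of Double Indemnity
  OPTIONAL { ?castMember wdt:P106 ?occupation . }    # occupation, the property the constraint checks
}

Chandler (Q180377) appears in the P161 results without any of the occupations the constraint expects, which is exactly what triggers the flag.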

More good news 😀 is that the Wikidata fact had provenance information in the form of a reference stating that it came from CSFD (Q3561957), a “Czech and Slovak web project providing a movie database”. Following the link Wikidata provided led me eventually to the resource, which allowed me to search for and find its Double Indemnity entry. Indeed, it lists Raymond Chandler as one of the movie’s Hrají. All that was left to do was to ask for a translation, which confirmed that Hrají means “starring”.

Case closed? Well, not quite. What remains is fixing the problem.

The final good news 🙂 is that it’s easy to edit or delete an incorrect fact in Wikidata. I plan to delete the incorrect fact in class next Monday. Over the weekend I’ll look into possible options for adding an annotation in some way to ignore the incorrect CSFD source for Chandler being a cast member.

Some possible bad news 🙁 is that public knowledge graphs like Wikidata might be exploited by unscrupulous groups or individuals in the future to promote false or biased information. Wikipedia is reasonably resilient to this, but the problem may be harder to manage for public knowledge graphs, which get much of their data from other sources that could be manipulated.

The post Why does Google think Raymond Chandler starred in Double Indemnity? appeared first on UMBC ebiquity.

Posted at 20:05

Ebiquity research group UMBC: paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition

The CCS Dashboard’s sections provide information on sources and targets of network events, file operations monitored and sub-events that are part of the APT kill chain. An alert is generated when a likely complete APT is detected after reasoning over events.


Early Detection of Cybersecurity Threats Using Collaborative Cognition

Sandeep Narayanan, Ashwinkumar Ganesan, Karuna Joshi, Tim Oates, Anupam Joshi and Tim Finin, Early Detection of Cybersecurity Threats Using Collaborative Cognition, 4th IEEE International Conference on Collaboration and Internet Computing, Philadelphia, October 2018.

 

The early detection of cybersecurity events such as attacks is challenging given the constantly evolving threat landscape. Even with advanced monitoring, sophisticated attackers can spend more than 100 days in a system before being detected. This paper describes a novel, collaborative framework that assists a security analyst by exploiting the power of semantically rich knowledge representation and reasoning integrated with different machine learning techniques. Our Cognitive Cybersecurity System ingests information from various textual sources and stores them in a common knowledge graph using terms from an extended version of the Unified Cybersecurity Ontology. The system then reasons over the knowledge graph that combines a variety of collaborative agents representing host and network-based sensors to derive improved actionable intelligence for security administrators, decreasing their cognitive load and increasing their confidence in the result. We describe a proof of concept framework for our approach and demonstrate its capabilities by testing it against a custom-built ransomware similar to WannaCry.

The post paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition appeared first on UMBC ebiquity.

Posted at 20:05

November 03

Libby Miller: Real_libby – a GPT-2 based slackbot

In the latest of my continuing attempts to automate myself, I retrained a GPT-2 model with my iMessages, and made a slackbot so people could talk to it. Since Barney (an expert on these matters) felt it was unethical that it vanished whenever I shut my laptop, it’s now living happily(?) if a little more slowly in a Raspberry Pi 4.

It was surprisingly easy to do, with a few hints from Barney. I’ve sketched out what I did below. If you make one, remember that it can leak out private information – names in particular – and can also be pretty sweary, though mine’s not said anything outright offensive (yet).

fuck, mitzhelaists!

This work is inspired by the many brilliant Twitter bot-makers  and machine-learning people out there such as Barney, (who has many bots, including inspire_ration and notYourBot, and knows much more about machine learning and bots than I do), Shardcore (who made Algohiggs, which is probably where I got the idea for using GPT-2),  and Janelle Shane, (whose ML-generated names for e.g. cats are always an inspiration).

First, get your data

The first step was to get at my iMessages. A lot of iPhone data is backed up as sqlite, so if you decrypt your backups and have a dig round, you can use something like baskup. I had to make a few changes but found my data in

/Users/[me]/Library/Application\ Support/MobileSync/Backup/[long number]/3d/3d0d7e5fb2ce288813306e4d4636395e047a3d28

This number – 3d0d7e5fb2ce288813306e4d4636395e047a3d28 – seems always to indicate the iMessage database – though it moves round depending on what version of iOS you have. I made a script to write the output from baskup into a flat text file for GPT-2 to slurp up. I had about 5K lines.

Retrain GPT-2

I used this code.

python3 ./download_model.py 117M

PYTHONPATH=src ./train.py --dataset /Users/[me]/gpt-2/scripts/data/

I left it overnight on my laptop and by morning loss and avg were oscillating so I figured it was done – 3600 epochs. The output from training was fun, e.g..

([2899 | 33552.87] loss=0.10 avg=0.07)

my pigeons get dandruff
treehouse actually get little pellets
little pellets of the same stuff as well, which I can stuff pigeons with
*little
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets?
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets
little pellets

Test it

I copied the checkpoint directory into the models directory

cp -r checkpoint/run1 models/libby
cp models/117M/{encoder.json,hparams.json,vocab.bpe} models/libby/

At which point I could test it using the code provided:

python3 src/interactive_conditional_samples.py --model_name libby

This worked but spewed out a lot of text, very slowly. Adding --length 20 sped it up:

python3 src/interactive_conditional_samples.py --model_name libby --length 20

[Screenshot]

That was the bulk of it done! I turned interactive_conditional_samples.py into a server and then whipped up a slackbot – it responds to direct questions and occasionally to a random message.

Putting it on a Raspberry Pi 4 was very very easy. Startlingly so.

[Screenshot]

It’s been an interesting exercise, and mostly very funny. These bots have the capacity to surprise you and come up with the occasional apt response (I’m cherrypicking)

[Screenshot: example responses from the bot]

We’ve been talking a lot at work about personal data and what we would do with our own, particularly messages with friends and the pleasure of scrolling back and finding old jokes and funny messages. My messages were mostly of the “could you get some milk?” “here’s a funny picture of the cat” type, but it covered a long period and there were also two very sad events in there. Parsing the data and coming across those again was a vivid reminder that this kind of personal data can be an emotional minefield and not something to be trivially messed with by idiots like me.

Also: while GPT-2 means there’s plausible deniability about any utterance, a bot like this can leak personal information of various kinds, such as names and regurgitated fragments of real messages. Unsurprisingly it’s not the kind of thing I’d be happy making public as is, and I’m not sure if it ever could be.

 

 

Posted at 18:06

October 24

Sebastian Trueg: TPAC 2012 – Who Am I (On the Web)?

Last week I attended my first TPAC ever – in Lyon, France. Coming from the open-source world and events like Fosdem or the ever-brilliant Akademy, I was not sure what to expect. Should I pack a suit? On arrival all my fears were blown away by an incredibly well organized event with a lot of nice people. I felt very welcome as a newbie; there was even a breakfast for the first-timers with some short presentations to get an overview of the W3C‘s work in general and the existing working groups. So before getting into any details: I would love this to become a regular thing (not sure it will though, seeing that next year the TPAC will be in China).

My main reason for going to the TPAC was identity on the Web, or short WebID. OpenLink Software is a strong supporter of the WebID identification and authentication system. Thus, it was important to be present for the meeting of the WebID community group.

The meeting with roughly 15 people spawned some interesting discussions. The most heatedly debated topic was that of splitting the WebID protocol into two parts: 1. identification and 2. authentication. The reason for this is not at all technical but more political. The WebID protocol which uses public keys embedded in RDF profiles and X.509 certificates which contain a personal profile URL has always had trouble being accepted by several working groups and people. So in order to lower the barrier for acceptance and to level the playing field the idea was to split the part which is indisputable (at least in the semantic web world) from the part that people really have a problem with (TLS).

This led to a very simple definition of a WebID which I will repeat in my own words since it is not written in stone yet (or rather “written in spec”):

A WebID is a dereferenceable URI which denotes an agent (person, organization, or software). It resolves to an RDF profile document uniquely identifying the agent.

Here “uniquely identify” simply means that the profile contains some relation of the WebID to another identifier. This identifier can be an email address (foaf:mbox), it can be a Twitter account, an OpenID, or, to restore the connection to the WebID protocol, a public key.
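
To make that concrete, a minimal WebID profile might look like the following Turtle sketch (the URI, name, mailbox and key values are invented for illustration; the cert vocabulary shown is the one commonly used for the X.509 public-key variant discussed in this post):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix cert: <http://www.w3.org/ns/auth/cert#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The WebID itself: a dereferenceable URI denoting a person.
<https://example.org/people/alice#me>
    a foaf:Person ;
    foaf:name "Alice" ;
    # One or more uniquely identifying relations, e.g. a mailbox ...
    foaf:mbox <mailto:alice@example.org> ;
    # ... or, for certificate-based authentication, a public key.
    cert:key [
        a cert:RSAPublicKey ;
        cert:exponent 65537 ;
        cert:modulus "cafebabe0123456789abcdef"^^xsd:hexBinary  # shortened for the example
    ] .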

The nice thing about this separation of identity and authentication is that the WebID is now compatible with any of the authentication systems out there. It can be used with WebID-Auth (this is how I call the X.509 certificate + public key in agent profile system formally known as WebID), but also with OpenID or even with OAuth. Imagine a service provider like Google returning a WebID as part of the OAuth authentication result. In case of an OpenID the OpenID itself could be the WebID or another WebID would be returned after successful authentication. Then the client could dereference it to get additional information.

This is especially interesting when it comes to WebACLs. Now we could imagine defining WebACLs on WebIDs from any source. Using mutual owl:sameAs relations these WebIDs could be made to denote the same person, which the authorizing service could then use to build a list of identifiers that map to the one used in the ACL rule.

In any case this is a definition that should pose no problems to such working groups as the Linked Data Protocol. Even the OpenID or OAuth community should see the benefits of identifying people via URIs. In the end the Web is a Web of URIs…

Posted at 21:06

October 06

Andrew Matthews: Knowledge Graphs 101

This is the first in a short series introducing Knowledge Graphs. It covers just the basics, showing how to write, store, query and work with graph data using RDF (short for Resource Description Format). I will keep it free of theory and interesting but unnecessary digressions. Let me know in the comments if you find […]

Posted at 23:06

Andrew Matthews: Preparing a Project Gutenberg ebook for use on a 6″ ereader

For a while I’ve been trying to find a nice way to convert project Gutenberg books to look pleasant on a BeBook One. I’ve finally hit on the perfect combination of tools, that produces documents ideally suited to 6″ eInk ebook readers like my BeBook. The tool chain involves using GutenMark to convert the file […]

Posted at 23:06

Copyright of the postings is owned by the original blog authors. Contact us.