Planet RDF

It's triples all the way down

June 18

Egon Willighagen: Minting RDF from CSV files with Bioclipse #MIILS2013

The slides below are part of the introduction to the hands-on session of the #MIILS2013 course this afternoon. I had the participants create RDF the hard way: with a Bioclipse script (using this Bioclipse-OpenTox version). They also had to follow the Open PHACTS RDF Guidelines, the VoID specification, etc.

#MIILS2013 Practical: Bioclipse for RDF minting from Egon Willighagen

Posted at 15:31

June 15

Leigh Dodds: Building the new Ordnance Survey Linked Data platform

Disclaimer: the following is my own perspective on the build & design of the

Posted at 11:31

June 14

AKSW Group - University of Leipzig: Special Issue on Web Data Quality in IJSWIS

Call for papers
Special Issue on Web Data Quality
International Journal on Semantic Web and Information Systems

Scope:

The standardization and adoption of Semantic Web technologies has resulted in an unprecedented volume of data being published as Linked Open Data (LOD). The integration across this Web of Data, however, is hampered by the ‘publish first, refine later’ philosophy. This leads to various quality problems arising in the underlying data such as incompleteness, inconsistency and incomprehensibility. These problems affect every application domain, be it scientific (e.g., life science, environment), governmental or industrial applications.

This Special Issue is addressed to members of the community interested in providing novel methodologies or frameworks for assessing, monitoring, maintaining and improving the quality of the Web of Data, and in introducing tools and user interfaces which can effectively assist in the assessment. Such methodologies will not only help in detecting inherent data quality problems currently plaguing the Web of Data, but also provide the means to fix these problems and maintain the quality in the long run. Additionally, we also seek articles that help identify the current impediments in building real-world LOD applications.

Topics:

  • Web data and LOD quality concepts
  • Data quality dimensions and metrics for Web data and LOD quality
  • Web data and LOD quality methodologies
  • Data quality assessment frameworks
  • Evaluation of quality and trustworthiness in the web of data
  • (Semi-)automatic assessment in the web of data
  • Large-scale quality assessment of structured datasets
  • Validation of currently existing data quality assessment methodologies
  • Use-case driven quality assessment
  • Quality assessment leveraging background knowledge
  • Co-reference detection and dataset reconciliation
  • Data quality methodologies for linked open data
  • Evaluating quality of ontologies
  • Web data and LOD quality tools
  • Design and implementation of data quality monitoring, assessment and improvement tools
  • Quality exploration and analysis interfaces
  • Scalability and performance of tools
  • Monitoring tools
  • Case studies on Web data and LOD quality assessment and improvement
  • Web data and LOD quality benchmarks
  • Issues in LOD
  • Methods to acquire most relevant LOD datasets
  • Generating meaningful associations across LOD datasets

Posted at 13:55

June 11

Frederick Giasson: structFieldStorage: A New Field Storage System for Drupal 7

Structured Dynamics has been working with Drupal for quite some time. This week marks our third anniversary of posting code to the contributed conStruct modules in Drupal. But, what I’m able to share today is our most exciting interaction with Drupal to date. In essence, we now can run Drupal directly from an RDF triplestore and take full advantage of our broader Open Semantic Framework (OSF) stack. Massively cool!

On a vanilla Drupal 7 instance, everything ends up being saved into Drupal’s default storage system. This blog post introduces a new way to save (local) Content Type entities: the structfieldstorage field storage system. It lets Drupal administrators choose to save specific (or all) fields and their values into a remote structWSF instance. This option replaces Drupal’s default storage system (often MySQL) for the chosen content types and their fields.

With this new field storage system, all of the local Drupal 7 content can be queried via any of structWSF’s web service endpoints (which include a SPARQL endpoint). In other words, all Drupal 7 content that uses this storage system gets converted to RDF and indexed in a semantic web service framework.

Fields and Bundles

There are multiple core concepts in Drupal, two of which are Bundles and Fields. A Field is basically an attribute/value tuple that describes an entity. A Bundle is a set (an aggregation) of fields. The main topic of this blog post is a special feature of the field: their storage system.

In Drupal, each field instance has its own field storage system associated with it. A field storage system manages the field/value tuples of each entity that has been defined in a Drupal instance. The default storage system of any field is field_sql_storage, which is normally backed by a MySQL server or database.

The field storage system allows a bundle to have multiple field instances, each of which may have a different field storage target. This means that the data that describes an entity can be saved in multiple different data stores. It may appear odd at first that such flexibility has merit, but we will see that this design is quite clever, and probably essential.

A few other field storage systems have already been developed for Drupal 7. The most cited one is probably the MongoDB module, and there is also Riak. What I am discussing in this blog post is a new field storage system for Drupal 7 which uses structWSF as the data store. This new module is called the structFieldStorage module and it is part of conStruct.

Flexibility of the Field Storage API design

The design of having one field storage system per field is really flexible and probably essential. By default, all of the field widgets and all the modules have been created using the field_sql_storage system. This means that a few things here and there have been coded against the specifics of that field storage system. The result is that even though the Field Storage API has been designed so that new field storage systems can be created, in reality multiple existing field widgets and modules can break once they run on a new field storage system.

What the field storage system developer has to do is to test all the existing (core) field widgets and modules and make sure to handle all the specifics of these widgets and modules within the field storage system. If it cannot handle a specific widget or module, it should restrict their usage and warn the user.

However, there are situations where someone may require the use of a specific field widget that won’t work with that new field storage system. Because of the flexibility of the design, we can always substitute the field_sql_storage system for the given field dependent on that special widget. Under this circumstance, the values of that field widget would be saved in the field_sql_storage system (MySQL) while the other fields would save their value in a structWSF instance. Other circumstances may also warrant this flexibility.
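
As a rough illustration of the per-field design described above, the following Python sketch models a bundle whose fields point to different storage systems and groups one entity's values by storage system before saving. It is not Drupal code (the Field Storage API itself is PHP); the Field class, the group_by_storage helper and the field names are hypothetical, and only the two storage identifiers come from this post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A field instance and the storage system assigned to it (hypothetical model)."""
    name: str
    storage: str  # "field_sql_storage" or "structfieldstorage"


def group_by_storage(bundle, values):
    """Split one entity's field values by the storage system of each field."""
    grouped = defaultdict(dict)
    for field in bundle:
        if field.name in values:
            grouped[field.storage][field.name] = values[field.name]
    return dict(grouped)


# Hypothetical bundle: "tags" stays on SQL because its widget depends on it.
article_bundle = [
    Field("title", "structfieldstorage"),
    Field("body", "structfieldstorage"),
    Field("tags", "field_sql_storage"),
]

entity_values = {"title": "Hello", "body": "Some text", "tags": ["rdf", "drupal"]}
writes = group_by_storage(article_bundle, entity_values)
for storage, fields in writes.items():
    print(storage, "->", sorted(fields))
```

Each group would then be handed to its own storage system, which is why one entity can end up spread across MySQL and a structWSF instance.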

structFieldStorage Architecture

Here is the general architecture of the structFieldStorage module. The following schema shows how the Drupal Field Storage API works, the flexibility that resides in the fields, and how multiple fields, all part of the same bundle, can use different storage systems:

[Figure: bundles_fields_field_storage_api_outline]

By default, on a vanilla Drupal instance, all the fields use the field_sql_storage field storage system:

[Figure: default_field_storage_system_interaction]

Here is what that same bundle looks like when all fields use the structfieldstorage field storage system:

[Figure: ccr_field_storage_system_interaction]

Finally here is another schema that shows the interaction between Drupal’s core API, structFieldStorage and the structWSF web service endpoints:

[Figure: structFieldStorage]

Synchronization

Similar to the default MySQL field_sql_storage system, we have to take into account a few synchronization use cases when dealing with the structfieldstorage storage system for the Drupal content types.

Synchronization with structFieldStorage occurs when fields and field instances that use the structfieldstorage storage system get deleted from a bundle or when an RDF mapping changes. These situations don’t appear often once a portal is developed and properly configured. However, since things evolve all the time, the synchronization mechanism is always available to handle deleted content or changed schema.

The synchronization workflow answers the following questions:

  • What happens when a field gets deleted from a content type?
  • What happens when a field’s RDF mapping changes to a new property?
  • What happens when a bundle’s type RDF mapping changes to a new one?

Additionally, if new field instances are being created in a bundle, no synchronization of any kind is required. Since this is a new field, there is necessarily no data for this field in the OSF, so we just wait until people start using this new field to commit new data in the OSF.

The current synchronization heuristics work as follows:

  1. Read the structfieldstorage_pending_opts_fields table and get all the un-executed synchronization change operations
    1. For each un-executed change:
      1. Get 20 records within the local content dataset from the Search endpoint. Filter the results to get all the entities that would be affected by the current change
        1. Do until the Search query returns 0 results
          1. For each record within that list
            1. Apply the current change to the entities
            2. Save the modified entities into the OSF using the CRUD: Update web service endpoint
      2. When the Search query returns 0 results, it means that this change has been fully applied to the OSF. The state of this change record is then marked as executed.
  2. Read the structfieldstorage_pending_opts_bundles table and get all the un-executed synchronization change operations
    1. For each un-executed change:
      1. Get 20 records within the local content dataset from the Search endpoint. Filter the results to get only the ones that would be affected by the current change
        1. Do until the Search query returns 0 results
          1. For each record within that list
            1. Apply the current change to the entities
            2. Save that changed record into the OSF using the CRUD: Update web service endpoint
      2. When the Search query returns 0 results, it means that this change has been fully applied to the OSF. The state of this change record is then marked as executed.

The synchronization process is triggered by a Drupal cron job. Eventually this may be changed to have a setting option that would let you use cron synchronization or to trigger it by hand using some kind of button.
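
The steps listed above can be sketched roughly as follows in Python. This is only an in-memory illustration under stated assumptions, not the module's code: search_affected and update_record are hypothetical stand-ins for the Search and CRUD: Update endpoints, the change record models an RDF mapping change from one property to another, and the property names are made up. Only the batch size of 20 and the "repeat until the Search returns 0 results" loop come from the description.

```python
BATCH_SIZE = 20  # the post fetches 20 records per Search query


def search_affected(dataset, change, limit=BATCH_SIZE):
    """Stand-in for the Search endpoint: records still using the old property."""
    hits = [r for r in dataset.values() if change["old_property"] in r["fields"]]
    return hits[:limit]


def apply_change(record, change):
    """Return a copy of the record with the mapping change applied."""
    fields = dict(record["fields"])
    fields[change["new_property"]] = fields.pop(change["old_property"])
    return {"id": record["id"], "fields": fields}


def update_record(dataset, record):
    """Stand-in for the CRUD: Update endpoint."""
    dataset[record["id"]] = record


def synchronize(dataset, pending_changes):
    """Apply every un-executed change, one batch of records at a time."""
    for change in pending_changes:
        if change["executed"]:
            continue
        while True:
            batch = search_affected(dataset, change)
            if not batch:  # 0 results: the change is fully applied
                break
            for record in batch:
                update_record(dataset, apply_change(record, change))
        change["executed"] = True


# Hypothetical data: 45 records, so the loop needs three batches (20, 20, 5).
dataset = {i: {"id": i, "fields": {"dc:title": f"Record {i}"}} for i in range(45)}
pending = [{"old_property": "dc:title", "new_property": "rdfs:label", "executed": False}]
synchronize(dataset, pending)
print(sum("rdfs:label" in r["fields"] for r in dataset.values()), "records updated")
```

The loop terminates because an updated record no longer matches the search filter, which is the same property the described workflow relies on.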

Compatibility

The structFieldStorage module is already compatible with multiple field widgets and external contributed Drupal 7 modules. Because of Drupal’s nature, other field widgets and contributed modules that are not listed in this section may also work with this new field storage system, but the Drupal system administrator will need to test them.

Field Widgets

Here is a list of all the core Field Widgets that are normally used by Drupal users. This list tells you which field widget is fully operational or disabled with the structfieldstorage field storage system.

Note that if a field widget is marked as disabled, it only means that it is not currently implemented for this new field storage system. It may be re-enabled in the future if it becomes required.

Field Type | Field Widget | Operational?
Text | Text Field | Fully operational
Text | Autocomplete for predefined suggestions | Fully operational
Text | Struct Lookup | Fully operational
Text | Struct Lookup with suggestion | Fully operational
Text | Autocomplete for existing field data | Disabled
Text | Autocomplete for existing field data and some node titles | Disabled
Term Reference | Autocomplete term widget (tagging) | Disabled
Term Reference | Select list | Disabled
Term Reference | Check boxes/radio buttons | Disabled
Long text and summary | Text area with a summary | Fully operational
Long text | Text area (multiple rows) | Fully operational
List (text) | Select list | Fully operational
List (text) | Check boxes/radio buttons | Fully operational
List (text) | Autocomplete for allowed values list | Disabled
List (integer) | Select list | Fully operational
List (integer) | Check boxes/radio buttons | Fully operational
List (integer) | Autocomplete for allowed values list | Disabled
List (float) | Select list | Fully operational
List (float) | Check boxes/radio buttons | Fully operational
List (float) | Autocomplete for allowed values list | Disabled
Link | Link | Fully operational
Integer | Text field | Fully operational
Float | Text field | Fully operational
Image | Image | Fully operational
File | File | Fully operational
Entity Reference | Select list | Fully operational
Entity Reference | Check boxes/radio buttons | Fully operational
Entity Reference | Autocomplete | Fully operational
Entity Reference | Autocomplete (Tags style) | Fully operational
Decimal | Text field | Fully operational
Date (Unix timestamp) | Text field | Fully operational
Date (Unix timestamp) | Select list | Fully operational
Date (Unix timestamp) | Pop-up calendar | Fully operational
Date (ISO format) | Text field | Fully operational
Date (ISO format) | Select list | Fully operational
Date (ISO format) | Pop-up calendar | Fully operational
Date | Text field | Fully operational
Date | Select list | Fully operational
Date | Pop-up calendar | Fully operational
Boolean | Check boxes/radio buttons | Fully operational
Boolean | Single on/off checkbox | Fully operational

Core & Popular Modules

Revisioning

The Revisioning module is fully operational with the structfieldstorage field storage system. All the operations exposed in the UI have been handled and implemented in the hook_revisionapi() hook.

Diff

The Diff module is fully operational. Since it compares entity class instances, there is no additional Diff API implementation to do. Each time revisions get compared, structfieldstorage_field_storage_load() gets called to load the specific entity instances, and the comparison is then done on these entity descriptions.

Taxonomy

The Taxonomy module is not currently supported by the structfieldstorage field storage system. The reason is that the Taxonomy module relies on the design of the field_sql_storage field storage system; it has been tailored to that specific field storage system. In some places it can be used, such as with the entity reference field widget, but its core functionality, the term reference field widget, is currently disabled.

Views

structViews is a Views query plugin for querying an OSF backend. It interfaces with the Views 3 UI and generates OSF Search queries for searching and filtering all the content it contains. Views 3 is intimately tied to the field_sql_storage field storage system, which means that Views 3 itself cannot use the structfieldstorage storage system off the shelf. However, Views 3 has been designed so that a new Views querying engine can be implemented and used with the Views 3 user interface, much like the Field Storage API allows new storage systems. This is exactly what structViews is, and this is how we can use Views on all the fields that use the structfieldstorage field storage system.

This is no different from what is required for the mongodb Drupal module. The mongodb Field Storage API implementation does not work with the default Views 3 functionality either, as shown by this old, and very minimal, mongodb Views 3 integration module.

structViews is already working because all of the information defined in fields that use the structfieldstorage storage system is indexed into the OSF. What structViews does is just to expose this OSF information via the Views 3 user interface. All the fields that define the local content can be added to a structViews view, all the fields can participate into filter criteria, etc.

What our design means is that the structFieldStorage module doesn’t break the Views 3 module, because structViews takes care of exposing that entity storage system to Views 3 via the re-implemented API.

efq_views

efq_views is another contributed module that exposes the EntityFieldQuery API to Views 3. What that means is that all of the Field Storage Systems that implement the EntityFieldQuery API should be able to interface with Views 3 via this efq_views Views 3 querying engine.

Right now, the structFieldStorage module does not implement the EntityFieldQuery API. However, it could do so by implementing the hook_field_storage_query() hook. (This was not required by our current client.)

A Better Revisioning System

There is a problem with the core functionality of Drupal’s current revisioning system. The problem is that if a field or a field instance gets deleted from a bundle, then all of the values of those fields, within all of the revisions of the entities that use this bundle, get deleted at the same time.

This means that there is no way to delete a field without deleting the values of that field in existing entity revisions. This is a big issue since there is no way to keep that information, at least for archiving purposes. It probably works that way because core Drupal developers didn’t want to break the feature that enables people to revert an entity to one of its past revisions. This would have meant that data for fields that no longer existed would have to be re-created (creating its own set of issues).

However, for all the fields that use the structfieldstorage field storage system, this issue does not exist. Even if fields or field instances are deleted, all the past information about these fields remains in the revisions of the entities.

Conclusion

This blog post exposes the internal mechanism of this new structfieldstorage backend to Drupal. The next blog post will focus on the user interface of this new module. It will explain how it can be configured and used. And it will explain the different Drupal backend user interface changes that are needed to expose the new functionality related to this new module.

Posted at 03:41

June 07

Dave Beckett: Leaving Zite

Today is my last day working at Zite in San Francisco. The team of engineers is dedicated and I’m sure they’ll continue to innovate and improve – I wish them well.

For myself, although it has been good working in the online news world at Yahoo! – aggregated and created; Digg – social (not the current non-social flavour) and Zite – personalized, it’s time for a new direction for me.

I am taking a couple of weeks off before I head to my next role which is quite Open and Cloudy. That’s a hint.

Posted at 15:54

Semantic Web Company (Austria): The LOD cloud is dead, long live the trusted LOD cloud

The ongoing debate around the question of whether ‘there is money in linked data or not’ has recently been formulated more pointedly by Prateek Jain (one of the authors of the original article). He is asking, ‘why linked open data hasn’t been used that much so far besides for research projects?’

I believe there are two reasons (amongst others) for the low uptake of LOD in non-academic settings which haven’t been discussed in detail until today:

1. The LOD cloud covers mainly ‘general knowledge’ in contrast to ‘domain knowledge’

Since most organizations live on their internal knowledge which they combine intelligently with very specific (and most often publicly available) knowledge (and data), they would benefit from LOD only if certain domains were covered. A frequently quoted ‘best practice’ for LOD is that portion of data sets which is available at Bio2RDF. This part of the LOD cloud has been used again and again by the life sciences industry due to its specific information and its highly active maintainers.

We need more ‘micro LOD clouds’ like this.

Other examples are the German Library Linked Open Data Cloud (thanks to Adrian Pohl for this pointer!) and the Clean Energy Linked Open Data Cloud:

[Figure: reegle-lod-cloud]

I believe that the first generation of the LOD cloud has done a great job. It has visualised the general principles of linked data and was able to communicate the idea behind it. It even helped, at least in its very first versions, to identify possibly interesting data sets. And most of all: it showed how fast the cloud was growing and attracted a lot of attention.

But now it’s time to clean up:

A first step should be to make a clear distinction between the section of the LOD cloud which is open and which is not. Datasets without licenses should be marked explicitly, because those are the ones which are most problematic for commercial use, not the ones which are not open.

A second improvement could be made by making some quality criteria clearly visible. I believe that the most important one is about maintenance and authorship: Who takes responsibility for the quality and trustworthiness of the data? Who exactly is the maintainer?

This brings me to the second and most important reason for the low uptake of LOD in commercial applications:

2. Most datasets of the LOD cloud are maintained by a single person or by nobody at all (at least as stated on datahub.io)

Would you integrate a web service which is provided by a single, maybe private person into a (core-)application of your company? Wouldn’t you prefer to work with data and services provided by a legal entity which has a high reputation at least in its own knowledge domain? We all know: data has very little value if it’s not maintained in a professional manner. An example of a ‘good practice’ is the integrated authority file provided by the German National Library. I think this is a trustworthy source, isn’t it? And we can expect that it will be maintained in the future.

It’s not only the data which is linked in a LOD cloud; most of all it’s the people and organizations ‘behind the datasets’ that will be linked and will co-operate and communicate based on their datasets. On top of their joint data infrastructure they will create efficient collaboration platforms, like the one in the area of clean energy, the ‘Trusted Clean Energy LOD Cloud’:

[Figure: reegle.info trusted links]

REEEP and its reegle-LD platform have become a central hub in the clean energy community, not only data-wise but also as an important cooperation partner in a network of NGOs and other types of stakeholders which promote clean energy globally.

Linked Data has become the basis for more effective communication in that sector.

To sum up: To publish LOD which is interesting for usage beyond research projects, datasets should be specific and trustworthy (another example is the German labor law thesaurus by Wolters Kluwer). I am not saying that datasets like DBpedia are dispensable. They serve as important hubs in the LOD cloud, but for non-academic projects based on LOD we need an additional layer of linked open datasets, the Trusted LOD cloud.

Posted at 12:46

June 06

Frederick Giasson: conStruct for Drupal 7

For more than a year we have been developing a completely new version of conStruct for Drupal 7 for one of our clients.

conStruct for Drupal 6 was really decoupled from Drupal and all the other contributed modules; in a word, it did not play nice with Drupal. The goal of this new version has been to change that situation. The focus of this completely new conStruct module has been to create a series of connector modules that bridge most of Drupal’s core functionalities with remote structWSF instances.

We wanted to make sure that Drupal developers could manipulate content, within Drupal, that is hosted in structWSF instance(s). The best way to start aiming for that goal was to make sure that all of the core Drupal APIs commonly used by Drupal developers could be used to manipulate structWSF data as if it were native to Drupal. This is what these connectors are about.

The development of conStruct for Drupal 7 is not finished, but it is available in the Git repository. Refactoring and improvements are still required, mainly to make it easier to use and understand, but all of the code works properly and is already used on production sites.

conStruct As a Large Scale Drupal Implementation

Those who follow the evolution of conStruct know that conStruct’s main goal is to use Drupal as a user interface for structWSF for administrative purposes, or for creating complete portals like the NOW portal. However, in our initial versions, Structured Dynamics’ purpose was to not tightly integrate with Drupal. Over time, though, we have seen broad acceptance for the Drupal front end and Drupal itself is evolving in ways compatible with semantic technologies.

What is changing with conStruct for Drupal 7, with all these connectors, is that we are now using conStruct to bridge Drupal with structWSF server instances. We supercharge Drupal 7’s capabilities with structWSF. Our evolution to a tighter Drupal coupling means the ability to manage, query, search and data mine millions of entities; to have vocabularies of tens of thousands of concepts; and to enable the querying of all of these entities and their content from any kind of device or system via a family of web service endpoints.

This is the initial version of what is (or should be) Drupal LSD for Structured Dynamics: A semantic web service framework backend system for Drupal.

conStruct’s Drupal Connectors

Here is the initial list of the connectors that exist:

  • structFieldStorage: this module creates a new structfieldstorage field storage system that can be used by Drupal fields to save the fields’ data into a remote structWSF instance. This is used to enable the Content Type entities to be saved into a structWSF instance. It is an extension of the Drupal field storage system
  • structEntities: this module creates a new Entity Type called the Resource Type that is used to see all the structWSF indexed records as native Entities in Drupal. This means that the Entity API can be used to manipulate any content in structWSF
  • structViews: this module creates a new data source for Views 3. This means that the Views 3 user interface is used to generate structWSF Search endpoint queries instead of SQL queries
  • structSearchAPI: this module exposes new search indexes to the Search API. This means that the Search API can be used to query a structWSF instance.

I will write about each of these connectors in upcoming blog posts, covering their design, architecture and usage.

Posted at 19:43

June 05

John Goodwin: How are you using Ordnance Survey Linked Data?

I might have mentioned (a few times) that the new look

Posted at 09:05

June 04

Orri Erling: ESWC 2013 Panel - Semantic Technologies for Big Data Analytics: Opportunities and Challenges

I was invited to the ESWC 2013 "Semantic Technologies for Big Data Analytics: Opportunities and Challenges" panel on 29th May 2013 in Montpellier, France. The panel was moderated by Marko Grobelnik (JSI), with panelists Enrico Motta (KMi), Manfred Hauswirth (NUIG), David Karger (MIT), John Davies (British Telecom), José Manuel Gómez Pérez (ISOCO) and Orri Erling (myself).

Marko opened the panel by looking at the Google Trends search statistics for big data, semantics, business intelligence, data mining, and other such terms. Big data keeps climbing its hype-cycle hill, now above semantics and most of the other terms. But what do these in fact mean? In the leading books about big data, the word semantics does not occur.

I will first recap my 5 minute intro, and then summarize some questions and answers. This is from memory and is in no sense a full transcript.

Presentation

Over the years we have maintained that what the RDF community most needs is a good database. Indeed, RDF is relational in essence and, while it requires some new datatypes and other adaptations, there is nothing in it that is fundamentally foreign to RDBMS technology.

This spring, we came through on the promise, delivering Virtuoso 7, packed full of all the state-of-the-art tricks in analytics-oriented databasing, column-wise compressed storage, vectored execution, great parallelism, and flexible scale-out.

At this same ESWC, Benedikt Kaempgen and Andreas Harth presented a paper (No Size Fits All -- Running the Star Schema Benchmark with SPARQL and RDF Aggregate Views) comparing Virtuoso and MySQL on the star schema benchmark at 1G scale. We redid their experiments with Virtuoso 7 at 30x and at 300x the scale.

At present, when running the star schema benchmark in SQL, we outperform column-store pioneer MonetDB by a factor of 2. When running the same benchmark in SPARQL against triples as opposed to tables, we see a slowdown of 5x. When scaling from 30G to 300G and from one to two machines, throughput increases linearly: the run takes 5x longer for 10x more data on twice the hardware.

Coming back to MySQL, the run with 1G takes about 60 seconds. Virtuoso SPARQL does the same on 30x the data in 45 seconds. Well, you could say that we should pick on somebody in our own league, and that MySQL is not relevant for this. Comparing with MonetDB and other analytics column stores is of course more relevant.
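
To make the numbers quoted above easier to compare, here is a small back-of-the-envelope calculation in Python. It uses only the figures stated in this post and treats the benchmark scale as the amount of data processed, ignoring any hardware differences between the runs, so it is a rough reading aid rather than a benchmark result.

```python
# Figures quoted in the post.
mysql_scale, mysql_seconds = 1, 60          # MySQL: 1G scale in about 60 s
virtuoso_scale, virtuoso_seconds = 30, 45   # Virtuoso SPARQL: 30x the data in 45 s

mysql_throughput = mysql_scale / mysql_seconds
virtuoso_throughput = virtuoso_scale / virtuoso_seconds
speedup = virtuoso_throughput / mysql_throughput
print(f"Virtuoso SPARQL handles about {speedup:.0f}x more data per second")

# Scale-out: 10x the data on 2x the machines takes 5x longer.
data_factor, machine_factor, time_factor = 10, 2, 5
per_machine = (data_factor / time_factor) / machine_factor
print(f"Per-machine throughput relative to the smaller run: {per_machine:.1f}")
```

A per-machine ratio of 1.0 is what "linear" means here: each machine in the larger run does as much work per second as the single machine in the smaller run.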

For cluster scaling, one could say that star schema benchmark is easy, and so it is, but even with harder ones, which do joins across partitions all the time, like the BSBM BI workload, we get scaling that is close to linear.

So, for analytics, you can use SPARQL in Virtuoso, and run circles around some common SQL databases.

The difference between SQL and SPARQL comes from having no schema. Instead of scanning aligned columns in a table, you do an index lookup for each column. This is not too slow if there is locality, as there is, but it is still a lot more work than scanning a multicolumn column-compressed table. With more execution tricks, we can maybe cut this to 3x.
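
The contrast between scanning aligned columns and doing one index lookup per attribute can be illustrated with a toy Python sketch. This says nothing about how Virtuoso is implemented; the data, the predicate names and the dictionary used as a (predicate, subject) index are all made up for illustration.

```python
# Relational layout: one aligned array per column.
table = {
    "order_id": [1, 2, 3],
    "quantity": [5, 2, 7],
    "price": [10.0, 4.0, 3.0],
}


def revenue_from_columns(table):
    """Scan two aligned columns in lockstep; position i is the same row."""
    return sum(q * p for q, p in zip(table["quantity"], table["price"]))


# Triple layout: no schema, so every attribute value is a separate triple
# that has to be found through an index keyed by (predicate, subject).
triples = [
    (f"order/{oid}", pred, table[pred][i])
    for i, oid in enumerate(table["order_id"])
    for pred in ("quantity", "price")
]
pso_index = {(p, s): o for s, p, o in triples}


def revenue_from_triples(index, subjects):
    """Same aggregate, but with one index lookup per attribute per subject."""
    total, lookups = 0.0, 0
    for s in subjects:
        quantity = index[("quantity", s)]
        price = index[("price", s)]
        lookups += 2
        total += quantity * price
    return total, lookups


subjects = [f"order/{oid}" for oid in table["order_id"]]
print(revenue_from_columns(table))
print(revenue_from_triples(pso_index, subjects))
```

Both functions return the same total, but the triple version pays for two lookups per order where the column version simply walks two arrays.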

The beach-head of workable RDF-based analytics on schema-less data has been attained. Medium-scale data, to the single-digit terabytes, is OK on small clusters.

What about the future?

First, Big Data means more than querying. Before meaningful analytics can be done, the data must generally be prepared and massaged. This means fast bulk load and fast database-resident transformation. We have that via flexible, expressive, parallelizable stored procedures and run time hosting. One can do everything one does in MapReduce right inside the database.

Some analytics cannot be expressed in a query language. For example, graph algorithms like clustering generate large intermediate states and run in many passes. For this, bulk synchronous processing frameworks like Giraph are becoming popular. We can again do this right inside the DBMS, on RDF or SQL tables. There is great platform utilization and more flexibility than in strict BSP, while being able to do any BSP algorithm.
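
For readers unfamiliar with bulk synchronous processing, here is a minimal single-process Python sketch of the idea, using connected components by label propagation as the multi-pass graph algorithm. It is neither Giraph's API nor Virtuoso's in-database mechanism; the function name and the example graph are hypothetical. Each pass of the loop is one superstep: vertices only read the messages sent in the previous superstep, and swapping the inbox plays the role of the barrier.

```python
def bsp_connected_components(edges, vertices):
    """Label every vertex with the smallest vertex id in its component."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbours[v]:
            inbox[n].append(labels[v])

    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v, messages in inbox.items():
            # A vertex only acts if it received a smaller label.
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbours[v]:
                    outbox[n].append(labels[v])
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return labels, supersteps


labels, supersteps = bsp_connected_components(
    edges=[(1, 2), (2, 3), (4, 5)], vertices=[1, 2, 3, 4, 5, 6]
)
print(labels, "after", supersteps, "supersteps")
```

The per-superstep inbox and outbox are the "large intermediate states" mentioned above; on a real graph they can be as large as the edge set.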

The history of technology is one of divergence followed by reintegration. New trends, like Column stores, RDF databases, key value stores, or MapReduce, start as one-off special-purpose products, and the technologies then find their way back into platforms addressing a broader functionality.

The whole semantic experiment might be seen as a break-away from the web, and also a little from databases, for the specific purpose of exploring schemaless-ness, universal referenceability of data, self-describing data, and some inference.

With RDF, we see lasting value in globally consistent identifiers. The URI "superkey" is the ultimate silo-breaker. The future is in integrating more and more varied data and a schema-first approach is cost-prohibitive. If data is to be preserved over extended lengths of time, self-description is essential; the applications and people that produced the data might not be around. Same for publishing data for outside reuse.

In fact, many of these things are right now being pursued in mainstream IT. Everybody is reinventing the triple, whether by using non-first normal form key-value pairs in an RDB, tagging each row of a table with the name of the table, using XML documents, etc. The RDF model provides all these desirable features, but most applications that need these things do not run on RDF infrastructure.
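The "reinvented triple" can be seen in a few lines of Python. This is a hedged sketch: the function name, the `http://example.org/` base and the way column names become predicates are all assumptions for illustration, not a standard mapping. Tagging each value with its table, key and column name is all it takes to turn rows into subject-predicate-object statements.

```python
def rows_to_triples(table_name, rows, key, base="http://example.org/"):
    """Turn relational rows into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        # The primary key plus the table name gives a globally unique subject.
        subject = f"{base}{table_name}/{row[key]}"
        # "Tagging each row with the name of the table" becomes a type triple.
        triples.append((subject, "rdf:type", f"{base}{table_name}"))
        for column, value in row.items():
            if column == key or value is None:
                continue  # NULLs simply produce no triple
            triples.append((subject, f"{base}{table_name}#{column}", value))
    return triples
```

For example, `rows_to_triples("customer", [{"id": 7, "name": "Ann", "phone": None}], "id")` yields one type triple and one `name` triple; the missing phone number needs no schema change and no placeholder.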

Anyway, by revolutionizing RDF store performance, we make this technology a cost-effective alternative in places where it was not such before.

To get much further in performance, physical storage needs to adapt to the data. Thus, in the long term, we see RDF as a lingua franca of data interchange and publishing, supported by highly scalable and adaptive databases that exploit the structure implicit in the data to deliver performance equal to the best in SQL data warehousing. When we get the schema from the data, we have schema-last flexibility and schema-first performance. The genie is back in the bottle, and data models are unified.

Questions and Answers

Q: Is the web big data?

David Karger: No, the shallow web (i.e., static web pages for purposes of search) is not big data. One can put it in a box and search. But for purposes of more complex processing, like analytics on the structure of the whole web, this is still big data.

Q: I bet you still can't do analytics on a fast stream.

Orri Erling: I am not sure about that, because when you have a stream -- whether this is network management and denial of service detection, or managing traffic in a city -- you know ahead of time what peak volume you are looking at, so you can size the system accordingly. And streams have a schema. So you can play all the database tricks. Vectored execution will work there just as it does for query processing, for example.

Q: I did not mean storage, I meant analysis.

Orri Erling: Here we mean sliding windows and constant queries. The triple vs. row issue also seems the same. There will be some overhead from schema-lastness, but for streams, I would say each has a regular structure.

John Davies: For example, we gather gigabytes a minute of traffic data from sensors in the road network and all this data is very regular, with a fixed schema.

Manfred Hauswirth: Or is this always so? The internet of things has potentially huge diversity in schema, with everything producing a stream. The user of the stream has no control whatever on the schema.

Marko Grobelnik: Yes, we have had streams for a long time -- on Wall Street, for example, where these make a lot of money. But high frequency trading is a very specific application. There is a stream, some analytics, not very complicated, just fast. This is one specific solution, with fixed schema and very specific scope, no explicit semantics.

Q: What is big data, in fact?

David Karger: Computer science has always been about big data; it is just the definition of big that changes. Big data is something one cannot conveniently process on a computer system. Not without unusual tricks, where something trivial, like shortest path, becomes difficult just because of volume. So it is that big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific. The semantic world on the other hand is after something very general and about complex and expressive schema. When data gets big, the schema is vanishingly small in comparison with the data, and the schema work gets done by hand; the schema is not the problem there. Big data is not very internetty either, because the 40 TB produced by the telescope are centrally stored and you do not download them or otherwise transport them very much.

Q: Now, what does each of you understand by semantics?

Manfred Hauswirth: The essential aspect is that data is machine interpretable, with sufficient machine readable context.

David Karger: Semantics has to do with complexity or heterogeneity in the schema. Big data has to do with large volume. Maybe semantic big data would be all the databases in the world with a million different schemas. But today we do not see such applications. If the volume is high, the schema is usually not very large.

Manfred Hauswirth: This is not so far as that, for example a telco has over a hundred distinct network management systems and each has a different schema.

Orri Erling: From the data angle, we have come to associate semantic with

  • schema-lastness
  • globally-resolvable identifiers
  • self-description

When people use RDF as a storage model, they mostly do so because of schema flexibility, not because of expressive schemas or inference. Some use a little inference, but inference or logics or considerations of knowledge representation do not in our experience drive the choice.

Conclusion

In conclusion, the event was rather peaceful, with a good deal of agreement between the panelists and audience and no heated controversy. I hoped to get some reaction when I said that semantics was schema flexibility, but apparently this has become a politically acceptable stance. In the golden days of AI this would not have been so. But then Marko Grobelnik did point out that the whole landscape has become data driven. Even in fields like natural language, one looks more at statistics than deep structure: For example, if a phrase is often found on Google, it is proper usage.

Posted at 14:04

Semantic Web Company (Austria): There’s Money in Linked Data

I believe that the ongoing debate whether there ‘is money in linked (open) data or not’ is a bit misleading. ‘Linked (open) data’ is not only the data itself. It’s much more, even more than yet another technology stack. Linked data is most of all a set of principles for organizing information in agile organizations that are embedded in fast moving and dynamic environments. And from this perspective there is a huge amount of money in it – but let me refine that a bit later.

Crying out loud in 2013 that ‘there is no money in linked data’ is an important step in the right direction because it points out that data publishers should be more precise with data licensing. Although quite flexible licensing models already exist, it’s the people (and probably other legal entities) who forget to publish their data together with statements about the ‘openness’ of it. As a result, the data remains closed for commercial users. This wasn’t properly noticed in the early days of the linked open data cloud since commercial users weren’t around at all (in contrast to academic institutions which considered the LOD cloud to be a wonderful playground). It’s the same thing with linked data as a technology and linked data as a set of standards: the standards and the technology stack are mature now (just think about Virtuoso’s brilliant SPARQL performance, for example), but most people from IT still wouldn’t have things like URIs, RDF and SPARQL off the top of their head when they seek solutions for powerful data integration methodologies.

Why is that?

I believe that so far ‘linked data’ has always been perceived by people from outside the linked data core-community only as a new way to organize data on the web, thus technologies are still not mature for enterprises.

But the truth is that linked data has at least a threefold nature. Linked data is

  1. a method to organize information in general, not only on the web but also in enterprises
  2. a set of standards which is flexible and expressive enough to link data across boundaries (organizational, political, philosophical), cultures and languages
  3. a way of using IT and information in a quite intuitive way, very close to the patterns by which human beings tend to create realities, and thus comprehensible also for non-techies.

I think that technologists have done a brilliant job so far in creating the linked data technology stack, its underlying standards, triple-stores and quad-stores, reasoners etc., and for specialists it’s absolutely clear why these kinds of technologies will outperform traditional databases, BI-tools, search engines etc. by far.

But: the crucial point now is that enterprises have to adopt linked data technologies inside their corporate boundaries (and not only for SEO purposes or the like). The key question is not whether there is enough LOD out there for app-makers or not. High-quality LOD will be produced very quickly as soon as there are commercial consumers like large enterprises. I am not talking about use cases for linked data in the fields of data publishing or SEO.

The main driver for the further Linked Data development will be enterprises which embrace LD technologies for their internal information management.

It’s true that there are already some large companies (like Daimler - meet them at this year’s I-SEMANTICS in Graz!) dealing with that question but to be honest: there is not the same hype around ‘linked data’ as we can see with ‘big data’. IBM, Microsoft & Co. are not that interested in linked data of course because it is a platform by itself and doesn’t foresee any kind of lucrative lock-in effects. Internet companies like Google and Facebook make use of linked data quite hesitantly. Although Facebook’s Graph Search or Google’s Knowledge Graph contain large portions of this kind of technology, Google would never say ‘oh, we are a semantic web company now, we make heavy use of linked data, and of course we will also contribute to the LOD cloud.’

Why is that? Simply put, because through the lens of Google, Facebook & Co. the internet is a huge machine which produces data for them. Not the other way around.

But shouldn’t the enterprise customers themselves be interested in a cost-effective way of information management? They are, but as stated before, they haven’t perceived linked data as such, although it clearly is.

To develop technologies, we need critical questions, and of course the most critical ones always come from the inside of a community or movement. But the time has come to spread the good news to the ‘outside’.

  • Yes, databases which rely on linked data standards have become mature and performant enough that, for many query types, they outperform even ‘traditional’ relational databases
  • Yes, issues which are critical for enterprise usage, like privacy and security, have also been solved by most linked data technology vendors
  • Yes, there is a critical mass of available LOD sources (for example UK Ordnance Survey) and also of high-quality thesauri and ontologies (for example Wolters Kluwer’s working law thesaurus) to be reused in corporate settings
  • Yes, there is a volume of developers and consultants on the labor market (in the U.S. as well as in the E.U.) which is big enough to execute large linked data projects
  • Yes, there are tons of business cases that can benefit from linked data. Linked data and semantic web technologies should be considered as core technologies for any information architecture, at least in larger corporations
  • Yes, SPARQL Query Language is not only a second SQL but comes with some brilliant features like transitive queries which help to save a lot of time when developing applications like business intelligence reporting and analysis
  • Yes, Linked Data has the potential to become the basis for a large variety of tools which help decision-makers (not only in enterprises but also in politics) to become true ‘digerati’ instead of being degraded to masters of the ‘bullshit bingo’.

Yes, this list can be further extended and it is a core element for the further expansion of the LOD cloud. It’s the enterprises that will drive the next level of maturity of the linked data landscape. Because at the end of the day it’s only them who will pay or have already paid the bill for open (government) data.
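The transitive queries mentioned in the list above correspond to SPARQL 1.1 property paths, where a pattern such as `:alice :reportsTo+ ?boss` follows a predicate one or more times. The Python sketch below shows what such a pattern has to compute over an in-memory list of triples; the data, the predicate name and the function are hypothetical, and a real application would hand the query to a SPARQL engine rather than write this by hand.

```python
def transitive_closure(triples, predicate, start):
    """All nodes reachable from `start` via one or more `predicate` edges.

    Roughly what a SPARQL 1.1 property path does for
        SELECT ?boss WHERE { :alice :reportsTo+ ?boss }
    """
    # Index the edges of the chosen predicate by subject.
    edges = {}
    for s, p, o in triples:
        if p == predicate:
            edges.setdefault(s, set()).add(o)

    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:  # guards against cycles
                seen.add(nxt)
                frontier.append(nxt)
    return seen

org_chart = [
    ("alice", "reportsTo", "bob"),
    ("bob", "reportsTo", "carol"),
    ("carol", "reportsTo", "dave"),
    ("erin", "worksWith", "alice"),
]
```

Here `transitive_closure(org_chart, "reportsTo", "alice")` returns bob, carol and dave. In plain SQL the same question needs a recursive query or one self-join per level of the hierarchy, which is where the time saving in reporting applications comes from.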

Posted at 09:58

schema.org: Schema.org and JSON-LD

We'd like to take a minute to share our enthusiasm for some recent work at W3C: JSON-LD.

Schema.org is all about shared vocabulary - it helps integrate data across applications, Web sites and data formats. We are adding JSON-LD to the list of formats we recommend for use with schema.org, alongside Microdata and RDFa - each has strengths and weaknesses for different usage scenarios.

In HTML, schema.org descriptions can be written using markup attributes in HTML (i.e. RDFa and Microdata). However there are often cases when data is exchanged in pure JSON or as JSON within HTML. W3C's work on JSON-LD provides mechanisms for interpreting structured data in JSON that promotes interoperability with other data formats. We believe it provides value for developers and publishers, and improves the flow of information between JSON and other environments.

There are some technical details to work through on how exactly schema.org terms are defined for JSON-LD usage, but it is already clear that JSON-LD is a useful contribution to structured data sharing in the Web. Many thanks to the hardworking W3C community for creating the specification.
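For readers who have not seen the format, here is a small Python sketch of a schema.org description expressed as JSON-LD and wrapped for embedding in an HTML page. The person, the URL and the helper name `to_jsonld_script` are invented for illustration, and using `http://schema.org` as the `@context` value is an assumption, given that the exact definition of schema.org terms for JSON-LD is still being worked out.

```python
import json

def to_jsonld_script(data):
    """Serialize a dict as JSON-LD inside an HTML script element."""
    payload = json.dumps(data, indent=2, sort_keys=True)
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'

person = {
    "@context": "http://schema.org",  # assumed context URL
    "@type": "Person",
    "name": "Jane Doe",               # hypothetical example data
    "jobTitle": "Data Engineer",
    "url": "http://example.org/jane",
}

snippet = to_jsonld_script(person)
```

The same dictionary can be sent as-is in a pure JSON API response; only the `@context` and `@type` keys distinguish it from ordinary JSON.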

Posted at 05:41

June 03

John Goodwin: New Ordnance Survey Linked Data Site not just for Data Geeks

Ordnance Survey’s new

Posted at 16:43

W3C Blog Semantic Web News: Four W3C projects into Google Summer of Code 2013

Google has just announced the accepted projects for the Summer of Code 2013. W3C was granted four slots.

First of all, I'd like to thank all the students who participated as well as the mentors, particularly Manu Sporny who helped a lot with the proposals. The selection process was tougher than we had expected: we have had about 50 proposals and we decided to focus only on 8 of them. We think that 4 slots is good for us, as it's a good balance between the 8 strong candidates/proposals and the fact that this is our first participation, so we can learn from it. We feel sorry for those who were not accepted and we hope that they'll stay engaged in the W3C community.

Here are the students who were accepted and what they will be working on:

  • Gábor Kövesdán will refactor the Validator.nu project so that the validator will be a standalone reusable Java component
  • Joseph J Short will adapt the Internationalization (i18n) checker so that it is a modularised, easy to deploy API, which can be run by itself or integrated into other (Java) applications
  • Vikash Agrawal will work on several applications around JSON-LD to help developers integrate with Schema.org and the LinkedIn API, among other things
  • Tao Lin will extend the RDFa play tool for helping the web developers mark up their pages with RDFa for better UI experience, with dozens of predefined examples and an example submission module

Those are really exciting projects that the Web community is waiting for. I personally can't wait to see our students getting started :-)

Posted at 14:53

AKSW Group - University of Leipzig: AKSW wins best paper award at ESWC 2013

Our paper “When to Reach for the Cloud” was awarded the best paper award at ESWC. The idea behind the paper was to provide implementations of the HR3 algorithm (the first reduction-ratio-optimal algorithm for link discovery) for parallel hardware and to devise suggestions for when to use which hardware when computing links between very large data sets. With this work, we aim to make link discovery amenable to Big Data. Check out the paper here ;)

Link on,

Axel on behalf of AKSW

Posted at 10:57

Bob DuCharme: Coming soon: new, expanded edition of "Learning SPARQL"

55% more pages! 23% fewer mentions of the semantic web!

Posted at 00:44

May 31

W3C Read Write Web Community Group: Read Write Web — Monthly Open Thread — (May 2013)

Summary

WWW 2013 took place this year in Rio de Janeiro, Brazil. There was a packed program, including an interesting workshop entitled “Linked Data on the Web”, four papers of which were dedicated to the Read Write Web.

The big news in linked data is that Gmail has started to add JSON-LD to their popular email service. This allows developers to embed structured data into an email, in the form of Reviews, RSVPs, Interactive actions and Flight cards. Response has been generally positive to this move, with perhaps the possibility for a couple of minor tweaks to the markup.

The following papers were presented at the Read Write Web session in Rio : R&Wbase: git for triples, OSLC Resource Shape: A language for defining constraints on Linked Data, Hydra: A Vocabulary for Hypermedia-Driven Web APIs, Reasoning over SPARQL.  The website w3id.org was also released, which promises to be a permanent home for COOL URIs.

 

Communications and Outreach

The RWW group welcomes new members. In particular, we had a great introduction from read write web veteran, Henri Bergius. Henri has been working on read write topics for a number of years. Notably Midgard in the 1990s, and more recently, the impressive create.js. If you’re unfamiliar with Henri’s work you may enjoy this video that goes through many core concepts.

 

Community Group

There has been some discussion on the mailing list, but also with the semantic web group and some IETF folks, as to the best way to use HTTP to identify a user to a server. This would enable a user to identify itself to a server without having to rely on the subjectAltName field in a client side TLS certificate, or other methods. Thought had been to reuse the “From” header, however this seems tightly bound to email. Current thinking is that we draft text for a new header, then find a name for it.

 

Applications

Our co-chair, Andrei Sambra, met the developers of the Cozy Cloud project in Paris.  There’s hope that this system can be combined with the my-profile project to become a kind of read write web example of a social dashboard.  Cozy Cloud comes with a dozen or so cloud enabled apps, and has also been short listed for the LeWeb London best startup competition, so wishing them best of luck!

 

Last but not least…

Activity Streams, the popular social network data exchange format, has been dipping its toes into Linked Data with Activity Streams 2.0, a JSON-LD powered activity stream. This currently does not have official standing but the reception has been good, and there is talk of pushing it through the IETF. Hopefully this can finally lead to a united and interoperable social web for all!

Posted at 14:47

Norm Walsh: The joy of timezones

Timezones are annoying and inconvenient. And that's before legislatures get involved and start mucking about with them. Nevertheless, in the real world, sometimes you just gotta deal.

Posted at 03:44

May 30

AKSW Group - University of Leipzig: Google’s spiritus rector Eric Schmidt visited AKSW

Today Google’s spiritus rector Eric Schmidt visited AKSW to learn about the newest Linked Data technology and to figure out how to replace Google’s proprietary knowledge graph with open DBpedia.

Joke aside: Together with his co-author Jared Cohen, he visited the University of Leipzig to discuss their new book “The New Digital Age: Reshaping the Future of People, Nations and Business” with students and researchers.

Eric and Jared spent more than an hour answering questions, talking and joking. The major topics were the Internet, freedom of expression, privacy, copyright, and driving on the German Autobahn. One of their key ideas seems to be that technology and the Internet can help to make the world better by spreading values such as freedom of speech and ultimately democracy. Generally an agreeable opinion, but as we now have a virtual reality on the Internet, we also sometimes seem to have a virtual democracy; how else could George W. and friends succeed in taking over and raiding their country, lying to the world public to start a useless war (in Iraq) costing tens of thousands of lives on all sides? Especially in light of the latter, the Internet censorship of the Chinese government (which was also a topic) appears like a rather minor shortcoming.

Posted at 20:10

May 29

Dublin Core Metadata Initiative: UW iSchool joins DCMI as inaugural Institutional Member

2013-05-29, DCMI is pleased to announce that the Information School of the University of Washington in Seattle, USA, has joined DCMI as the inaugural Institutional Member in the Initiative's revised membership programs. As a leading member of the iSchool movement, the University of Washington Information School is a model for other information schools around the globe. Assistant Professor Joseph T. Tennis will represent the Information School on the DCMI Oversight Committee. Regional, Institutional and Supporting members of DCMI are pivotal to guaranteeing the continuing contributions of DCMI to the metadata community. Information about the revised membership programs is available at http://dublincore.org/about/membershipPrograms/.

Posted at 23:59

Dublin Core Metadata Initiative: DCMI-AsiaPac regional workshop in Singapore: "RDA, DC and Linked Data"

2013-05-29, DCMI-AsiaPac will hold a regional workshop in Singapore on 15 August 2013 as part of the DCMI Regional Meetings Series. The theme for the one-day workshop will be "RDA, DC and LOD" and it will comprise two half-day seminars. The Workshop will be held the day before the IFLA IT Section's conference on "User interaction based on library linked data" on 16 August. IFLA WLIC itself will run from 17-23 Aug 2013. Through the Workshop, the organizers intend to raise awareness among librarians in the Asian region of the implementation of RDA and of how library metadata (specifically DC) can be exposed as linked data to improve visibility and enhance collection usage. The objective is also to build confidence among Asian librarians to work well in the digital arena and be comfortable enough to adopt new technologies that will help improve their libraries' services. A secondary objective is to build a community for the DCMI Asia Task Group where regular discussion on metadata matters can be established. More information about the workshop is available at http://dcevents.dublincore.org/BibData/ap2013.

Posted at 23:59

W3C Blog Semantic Web News: Interview: Oracle on Data on the Web - Part 1 with Reza B'Far

This is part 1 of a 2-part interview with Oracle about data on the Web. In part 1, the focus is on the consumption of data by applications, such as those that enterprises provide to their employees. In part 2, the focus is on back-end data management.

For this part of the interview I spoke with Reza B'Far, Vice President of Development.

IJ: How does the Oracle apps team use Web standards for data?

Reza: Oracle uses a number of W3C standards, but one of my focus areas is the application of Semantic Web technologies. OWL and PROV are the two standards we've used in our Fusion applications. Fusion applications bring together and integrate Oracle acquisitions from the past decade related to enterprise resource planning (ERP), human resources, supply chain management, financials, customer relations, and so on.

IJ: What are some examples of Fusion applications?

Reza: For example, enterprise customers use Fusion GRC to ensure they comply with various government rules and regulations. They also use the tools to detect over-payment or fraud. In the team that I run, the problems of discovering things like overpayment, SOD violations, fraud, and others are best solved by using an artificial intelligence (AI) approach. We have found that OWL provides an optimal way to capture the knowledge required by the AI engine, for example, for intelligent searches.

IJ: What are intelligent searches?

Reza: These are heuristic-based searches. Take the example of trying to detect fraud in an enterprise environment where a lot of systems interact. Suppose Jack reports to Joe and they collude in some way on one transaction out of 100,000. How do we detect this? One might try to look at all possible permutations of the transaction in the system, but there's no known solution if you take this sort of brute force approach where you simply look at every single possible permutation.

Reza: On the other hand, if you use heuristics based on domain expertise, you can make your search engine smarter and reduce the problem space. The challenge is how to capture the domain knowledge. There are a variety of ways to do this, even several approaches using Semantic Web technology. However, we found OWL worked best for us. OWL lets us represent all the entities in the system as well as statements like "the probability of fraud due to duplicate payment or overpayment is high." OWL is very versatile because it does not require you to use a single grand schema to represent your world. And, beyond heuristic reasoning, OWL gives us the secondary benefit of data aggregation.

IJ: So you have OWL statements and RDF data. Then what happens?

Reza: We have a reasoning engine --the Planner and Reasoning Engine-- which uses the heuristics and walks through the data to verify compliance, detect fraud, etc.

IJ: What were you using before OWL?

Reza: Though we did capture some data using a variety of formats, there really was nothing before OWL. We started using OWL to scale this product line by allowing our partners to add their own rules starting roughly 5 years ago. As an example, a company like Deloitte might use their own rules expressed in OWL, customer data, and Oracle's tools.

IJ: What is the reception to using OWL?

Reza: Fairly positive. The biggest barrier to OWL adoption has been that people are unfamiliar with it. So we have invested in educating our partners and customers, and this investment has paid off. Within Oracle, we've gone from "OWL is weird" to "OWL is a possibility." But we need more champions with specific applications that generate revenue.

IJ: How are you using PROV?

Reza: PROV is at least as important to us as OWL. Until PROV, one of the biggest problems we faced was maintaining transaction audit trails in a heterogeneous environment in a standard and compatible way. Audit trails are described with literally millions of different formats in different organizations. This used to mean it was impossible to create a single audit time line. PROV solves this problem. We now provide (and consume) a PROV feed that unifies the audit trails generated by transactions across heterogeneous systems.

IJ: What's an example?

Reza: Suppose I own a retail store and I contract with someone to help out during the holiday season. Months later that person becomes an employee. PROV lets me track changes over time for metadata from heterogeneous systems. It provides a standardized temporal structure for metadata, allowing me to aggregate temporal data from different systems. This lets me do things like look at payment data and changes to employee status and detect fraud.
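The retail example can be sketched in a few lines of Python. This is only an illustration of the idea of a unified audit timeline: the record layouts of the two source systems, the field names and the helper functions are all hypothetical, and the common record shape is loosely modeled on PROV's entity, activity and agent notions rather than on the actual PROV serializations or Oracle's feed.

```python
from datetime import datetime

def from_payroll(rec):
    """Normalize a record from a hypothetical payroll system (day/month/year dates)."""
    return {
        "entity": f"payment/{rec['txn']}",
        "activity": "payment",
        "agent": rec["emp"],
        "time": datetime.strptime(rec["paid_on"], "%d/%m/%Y"),
    }

def from_hr(rec):
    """Normalize a record from a hypothetical HR system (ISO 8601 timestamps)."""
    return {
        "entity": f"status/{rec['person_id']}",
        "activity": rec["change"],
        "agent": rec["person_id"],
        "time": datetime.fromisoformat(rec["at"]),
    }

def unified_timeline(payroll_records, hr_records):
    """Merge both audit trails into one list ordered by time."""
    events = [from_payroll(r) for r in payroll_records]
    events += [from_hr(r) for r in hr_records]
    return sorted(events, key=lambda e: e["time"])

payroll = [{"txn": 1, "emp": "p42", "paid_on": "20/12/2012"}]
hr = [
    {"person_id": "p42", "change": "hired_as_contractor", "at": "2012-11-15T09:00:00"},
    {"person_id": "p42", "change": "became_employee", "at": "2013-03-01T09:00:00"},
]
```

Once every system emits the same shape of record, `unified_timeline(payroll, hr)` places the December payment between the contractor start and the change to employee status, which is the kind of cross-system view needed to spot a payment that does not match a person's status at the time.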

IJ: Are there other Semantic Web technologies you are thinking of adopting?

Reza: We are actively looking at the opportunity of using Linked Data Platform (LDP) specifications.

IJ: Any comments about vocabulary management?

Reza: I think there's a dissonance in vocabulary creation, particularly related to Dublin Core. There's no standard mechanism to rationalize OWL implementations with Dublin Core. Dublin Core defines a bunch of canonical domain objects. Dublin Core should be mashed into OWL. Or there could be guidelines on using OWL for consistency with Dublin Core. There is a risk of stumbling when using both unless you use them with consistency.

IJ: Thanks for your time!

Posted at 15:11

W3C Blog Semantic Web News: Interview: Oracle on Data on the Web - Part 2 with Xavier Lopez

This is part 2 of a 2-part interview with Oracle about data on the Web. In part 1, the focus is on the consumption of data by applications, such as those that enterprises provide to their employees. In part 2, the focus is on back-end data management.

For this part of the interview I spoke with Xavier Lopez, Director Spatial & Semantic Technologies.

IJ: Oracle is known for relational databases. But today I'd like to talk about Oracle support for the Semantic Web graph model. Tell me about support for the graph model on the back end.

Xavier: As the W3C made progress with the Semantic Web specifications (RDF and SPARQL) we had more customers in Life Science, Health Science, Public Safety, and Publishing looking to adopt triple or RDF graph stores. Oracle responded by releasing an RDF graph feature in the Oracle Spatial and Graph option. Our RDF features were focused on providing customers with a highly scalable, secure and high-performance data management solution.

Xavier: Customers like these are looking for a relationship-centric or linked data navigation model to support a variety of data integration, analytics, and discovery solutions. The Semantic Web approach is designed just for this. However, since the volumes of data for these social and entity graphs can be quite large and the queries complex, it is essential to build these solutions on a high-performance and highly scalable software infrastructure.

Xavier: Our objective is to make RDF graph a mainstream capability in the IT infrastructure. Most IT environments use databases, and we want to make sure they can benefit from the RDF data model. The advantage we offer is integration with the rest of our services. We also have optimized adapters for Jena and Sesame, tools used by about 90% of Semantic Web developers. The adapters are optimized to work with Oracle. We also have:

  • In-database OWL inferencing engine.
  • SPARQL query support through the Jena adapter
  • a feature unique to Oracle allowing people to query the graph through SQL by embedding SPARQL queries. So you can get at any data in the Oracle environment through SQL, which provides expressivity and lets you access more heterogeneous data.
  • Plug-in for Cytoscape visualization tool

Xavier: Customers also expect to use the standard manageability utilities and services that Oracle database traditionally offers (partitioning, parallelism, compression, high availability, etc.). Hence our approach is to ensure customers can leverage the Oracle Database to build a wide range of graph-based applications.

IJ: You mentioned some of the industries looking at these solutions. For these customers, what are examples of problems that the graph model can address?

Xavier: There are a large variety of uses for Semantic Web technology. However, in the enterprise space, three application areas stand out:

  1. Semantic Metadata Layer: A standard, graph oriented unified content metadata for federated resources (database, files, big data, online services). This layer can be enhanced with rules to validate semantic and structural consistency across disparate federated resources.
  2. Text Mining and Entity Analytics: Here, we are primarily dealing with unstructured text. After running content through entity extraction engines, and placing the resulting labeled text in the database as RDF, it is possible to use SPARQL query patterns to find related content & relations by navigating connected entities. It is also possible to apply reasoning rules on the RDF entities to discover implicit data and relationships that were not previously evident.
  3. Social Media Management and Analysis: The data model underlying most social media sites is a graph that represents relationships and properties of the entities – people, products, events, locations. The underlying graph model structure is ideal for supporting ever-evolving schemas without impacting performance.

Xavier: For each of these broad categories, RDF and OWL are ideal as a canonical data model for integration, navigation, and pattern query across diverse sources of data.

IJ: Tell me more about, for example, the semantic metadata approach.

Xavier: This is an IT issue that never goes away: people want to use data from different sources, (whether relational, graph, or whatever). Applications using that data want it to look like a single database. To achieve that, you represent schemas of the different databases, and link them across common terms (with some mapping, typically).

Xavier: This technique is the highest value we see for this technology. First, it lets you add new data sources by extending your graph representation. When you extend your graph you begin to get network effects, letting you reuse definitions, for example. Second, this approach does not require underlying applications to align with any one schema. People don't have to change their own models. This is important for managing mergers and acquisitions, or data syndication for example. Third, this approach lets you adapt easily as people's schemas change over time.
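
The mapping idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the two sources, their column names and the `common:` terms are hypothetical, and a real deployment would hold the mappings as RDF rather than as a dictionary. The point is that each source keeps its own schema and only the mapping layer knows about the shared terms.

```python
# Two hypothetical sources, each with its own schema.
CRM_ROWS = [{"cust_name": "Ada Lovelace", "cust_mail": "ada@example.org"}]
BILLING_ROWS = [{"fullName": "Alan Turing", "email": "alan@example.org"}]

# Mapping layer: every source field is linked to a shared term.
# Adding a new source means adding one entry here; existing sources
# do not have to change their own models.
MAPPINGS = {
    "crm": {"cust_name": "common:name", "cust_mail": "common:email"},
    "billing": {"fullName": "common:name", "email": "common:email"},
}


def to_common(source, rows):
    """Re-express the rows of one source using the shared terms."""
    mapping = MAPPINGS[source]
    return [
        {mapping[field]: value for field, value in row.items() if field in mapping}
        for row in rows
    ]


def unified_view(sources):
    """Present all sources as if they were a single collection."""
    view = []
    for name, rows in sources.items():
        view.extend(to_common(name, rows))
    return view


view = unified_view({"crm": CRM_ROWS, "billing": BILLING_ROWS})
names = [record["common:name"] for record in view]
```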

IJ: Tell me more about performance, which you mentioned a moment ago.

Xavier: We had a graph database prior to RDF, and used it to represent all the roads in Western Europe. But the graph size was limited by memory, and for some applications in pharma or government, customers asked for persistent graph stores. So we developed a second graph model for massive graphs of hundreds of billions of triples. These RDF stores are bigger than most typical database stores. We provide all the benefits of a traditional Oracle database to graphs.

Xavier: However using "vanilla" Oracle database tables for RDF would be inefficient -- traditional relational models are not ideal for graphs. That's why graph databases emerged. We let people represent information as a graph but we optimize it in a traditional relational database, taking advantage of partitioning, parallelism, and query loading, for example. Thus, we offer for graphs the same performance enhancing and management features that people expect for relational databases.

Xavier: RDF graphs pose another performance challenge when you do inferencing, which customers often do in drug discovery or intelligence applications. Inferencing is a powerful feature: you apply rules to triples and generate more triples, potentially a lot more (in some cases up to 2/3 more data). Things fall apart with in-memory solutions. We also do some pre-inferencing to speed up performance.
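
As a small illustration of "apply rules to triples and generate more triples", the sketch below runs two RDFS-style rules to a fixed point over an in-memory set. It is a toy, not the inference engine discussed in the interview; the `ex:` identifiers and the `infer` function are made up for the example.

```python
def infer(triples):
    """Apply two RDFS-style rules repeatedly until nothing new appears.

    Rule 1: rdfs:subClassOf is transitive.
    Rule 2: an instance of a class is also an instance of its superclasses.
    """
    closed = set(triples)
    while True:
        new = set()
        sub = [(s, o) for s, p, o in closed if p == "rdfs:subClassOf"]
        # Rule 1: (a subClassOf b) and (b subClassOf d) => (a subClassOf d)
        for a, b in sub:
            for c, d in sub:
                if b == c:
                    new.add((a, "rdfs:subClassOf", d))
        # Rule 2: (s type a) and (a subClassOf b) => (s type b)
        for s, p, o in closed:
            if p == "rdf:type":
                for a, b in sub:
                    if o == a:
                        new.add((s, "rdf:type", b))
        if new <= closed:  # fixed point reached
            return closed
        closed |= new


# Hypothetical asserted data.
ASSERTED = {
    ("ex:Aspirin", "rdf:type", "ex:NSAID"),
    ("ex:NSAID", "rdfs:subClassOf", "ex:Analgesic"),
    ("ex:Analgesic", "rdfs:subClassOf", "ex:Drug"),
}
closure = infer(ASSERTED)
inferred = closure - ASSERTED
```

Even in this tiny case the three asserted triples yield three inferred ones, which hints at why materialized inferences can strain memory-bound stores.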

IJ: Which customers look for these capabilities?

Xavier: We see more and more adoption by large customers. They have heard that graph solutions work and whatever they are currently doing does not. There's a lot of literature showing this is possible, even if the approach has not yet gone mainstream. People see similar companies solving similar issues with the graph approach.

Xavier: There is, however, a learning curve. Though this has been slow, I have seen a lot more acceptance and availability in the past year in terms of people and tools. Overall, there has been considerable industry development of RDF graph triple stores in the last decade. In many ways, this layer of the technology stack is fairly mature. Challenges still remain in software tooling to help mainstream IT and Web developers to build out solutions without having to learn the intricacies of RDF and SPARQL. A related challenge is the availability of skilled resources familiar with semantic web concepts, building out enterprise solutions using a combination of commercial and open source technologies.

IJ: What would encourage adoption by customers that are not as large?

Xavier: People want to interact with JSON objects, for example. They need tools and high-level APIs they can work with. We're not there yet but we are working with partners on this sort of project since it is clear that that's what the customers want.

IJ: What about connecting with data not in RDF?

Xavier: Oracle experts played an active role in the RDB2RDF standard, which exposes relational data to the Semantic Web. We did so recognizing the importance of making existing relational data sources available to SPARQL-based applications. In short, it helps achieve the “mainstreaming” of the semantic web.

IJ: And CSV?

Xavier: Tools for converting data to and from RDF are important and available. Fortunately, some of the more widely used semantic frameworks, such as Jena, can perform these file transformation operations. However, there is still a need to embed such RDF transformation utilities into mainstream ETL and data integration tools. This will go along with treating RDF as a native type when converting to/from files, databases and Big Data sources.
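
For readers who have not seen such a conversion, here is a minimal sketch of turning CSV rows into N-Triples lines with the Python standard library. It does not use Jena or any tool named above; the `http://example.org/` namespace, the `csv_to_ntriples` helper and the sample data are assumptions, and the sketch presumes that column names and identifiers are already safe to place inside an IRI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(value):
    """Escape the characters that N-Triples string literals require."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(text, id_column):
    """Emit one N-Triples line per non-empty cell.

    Assumes column names and id values are IRI-safe (no spaces etc.).
    """
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}record/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id becomes the subject; empty cells are skipped
            predicate = f"<{BASE}prop/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


SAMPLE = "id,name,city\n1,Ada,London\n2,Alan,\n"
ntriples = csv_to_ntriples(SAMPLE, "id")
```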

IJ: Xavier, thank you for taking the time to chat!

Posted at 14:07

May 23

Semantic Web Company (Austria): Free Webinar: Linked Data for the Environmental Sector – Use Cases and Opportunities

Organizations working in the environmental sector most often act as intermediaries between politics, the economy and citizens. They are growing out of their role as plain content providers. To meet the demands of their stakeholders they also have to act as data and tool providers for their respective communities.

On June 13 this webinar introduces several good-practice examples of achieving data governance using linked open data paradigms. Together with a basic overview of the possibilities of linked open data, you get an appealing picture of the new opportunities that these principles and technologies provide, also for your organisation!

Register Now!

Learn more about three organizations and their linked data projects

Global Buildings Performance Network (GBPN)

GBPN established the “Policy comparative tool on building stock data” together with a domain specific thesaurus used for a domain specific news aggregator.

Renewable Energy and Efficiency Partnership (REEEP)

As one of the pioneers in the sector, REEEP has an extensive focus on the use of linked data for renewable energy and energy efficiency, facilitating that in various services, like an automatic annotation service, aggregated country data presented as fact sheets, a domain specific search engine, etc.

Austrian Geological Survey (GBA)

The main driving factor for institutions like the GBA to invest in thesaurus and taxonomy projects is the increasing need for a uniform description of their data. The idea is that this enhances the value and re-usability of their products for their stakeholders. Especially in the geo-spatial sector, the INSPIRE directive of the European Parliament and Council gave a push in that direction. As a public authority, the GBA was legally called to implement the directive for its domain.

Presenters in this Webinar

  • Martin Kaltenböck (SWC)
    CFO and Project Lead at Semantic Web Company for Data Portal Solutions
  • Florian Bauer (REEEP)
    Operations and IT Director of REEEP as well as the clean energy information portal www.reegle.info
  • Andreas Blumauer (SWC)
    CEO and Evangelist for Linked Data and SKOS based Thesaurus Management

Free Register

 

Posted at 15:03

Ebiquity research group UMBC: Google Top Charts uses the Knowledge Graph for entity recognition and disambiguation

Top Charts is a new feature for Google Trends that identifies the popular searches within a category, i.e., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.

Here’s how it’s explained in the Trends Top Charts FAQ.

“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”

One thing to note is that the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, includes more than the traditional named entities (e.g., people, places, things). For example, there is a top chart for Animals that shows that dogs are the most popular animal in Google searches followed by cats (no surprises here) with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class as opposed to an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer to both a class of instances (a class that includes Lassie) and also to an instance (e.g., an instance of the class animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.

Of course, when doing this kind of mapping of terms to objects, we only want to consider concepts that commonly have words or short phrases used to denote them. Not all concepts do, such as animals that from a long way off look like flies.

A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, we have to use information in the query, its context and what we know about the possible referents (e.g., those nine Michael Jordans) to decide (1) if this is likely to be a reference to any of the objects in our knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
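
The two decisions (is this a reference to a known object at all, and if so, to which one) can be illustrated with a deliberately naive sketch. Nothing here reflects how Google or DBpedia tooling actually works: the candidate identifiers, their context words and the word-overlap scoring are all invented for illustration, and real systems use far richer signals.

```python
# Hypothetical candidates for the name "Michael Jordan", each with a
# made-up bag of context words. Identifiers are illustrative only.
CANDIDATES = {
    "ex:Jordan_A": {"nba", "bulls", "chicago", "basketball"},
    "ex:Jordan_B": {"nba", "penn", "basketball"},
    "ex:Jordan_C": {"professor", "statistics", "berkeley"},
}


def disambiguate(query, candidates, min_score=1):
    """Pick the candidate whose context overlaps the query the most.

    Returns None when no candidate reaches `min_score`, which stands in
    for deciding that the query does not refer to any known object.
    """
    tokens = set(query.lower().replace("?", "").split())
    scored = sorted(
        ((len(tokens & context), name) for name, context in candidates.items()),
        reverse=True,
    )
    best_score, best_name = scored[0]
    return best_name if best_score >= min_score else None


match = disambiguate("When did Michael Jordan play for Penn", CANDIDATES)
no_match = disambiguate("Michael Jordan recipe", CANDIDATES)
```

In the first query only one candidate shares a context word with the query, so it wins; in the second no candidate scores at all and the function declines to answer.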

Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.

Posted at 13:07

May 22

W3C Semantic Web News: W3C’s RDF Validation Workshop – Practical Assurances for Quality RDF Data

W3C announced today an RDF Validation Workshop – Practical Assurances for Quality RDF Data, 10-11 September 2013, in Cambridge, USA. The Semantic Web has demonstrated considerable value for collaborative contributions to data. Adoption in many mission-critical environments requires data to conform to specified patterns. Validation in a banking context shares many requirements with quality assurance of linked clinical data. Systems like Linked Open Data, which don’t have formal interface specifications, share these validation needs. Most data representation languages used in conventional settings offer some sort of input validation, ranging from parsing grammars for domain-specific languages to XML Schema or RelaxNG for XML structures. While the distributed nature of RDF affects the notions of “validity”, tool chains need to be established to ensure data integrity. The goal of this workshop is to discuss use cases for data validation on the Semantic Web along with the development of technologies to enable those use cases. W3C membership is not required to participate. The event is open to all. All participants are required to submit a position paper by 30 June 2013.
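
To give a sense of what "conforming to specified patterns" could mean in practice, here is a toy check written in plain Python. It is not a workshop outcome or any standard's method: the `REQUIRED` table, the `ex:` terms and the `validate` function are hypothetical, and the check treats the data as closed, which is exactly the assumption that RDF's distributed nature makes debatable.

```python
# Hypothetical pattern: properties every instance of a class must have.
REQUIRED = {"ex:Patient": {"ex:name", "ex:birthDate"}}


def validate(triples, required):
    """Return sorted (subject, missing_property) pairs.

    Closed-world assumption: a property absent from `triples` is
    reported as missing, even though it might exist elsewhere.
    """
    properties = {}
    types = {}
    for s, p, o in triples:
        properties.setdefault(s, set()).add(p)
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)

    problems = []
    for subject, classes in types.items():
        for cls in classes:
            for missing in required.get(cls, set()) - properties[subject]:
                problems.append((subject, missing))
    return sorted(problems)


DATA = [
    ("ex:p1", "rdf:type", "ex:Patient"),
    ("ex:p1", "ex:name", "Ada"),
    ("ex:p1", "ex:birthDate", "1815-12-10"),
    ("ex:p2", "rdf:type", "ex:Patient"),
    ("ex:p2", "ex:name", "Alan"),
]
problems = validate(DATA, REQUIRED)
```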

Posted at 15:34

May 21

Frederick Giasson: Neighbourhoods of Winnipeg: A Community Semantic Portal

NOW Portal Introduction from City of Winnipeg

Introduction

I am proud to announce the new NOW (Neighbourhoods Of Winnipeg) semantic web portal! This new and innovative semantic web portal was publicly announced by the Mayor of Winnipeg City last week.

The NOW (Neighbourhoods of Winnipeg) portal is “a new Web portal (the “Portal”) produced by the City of Winnipeg to provide broad, dynamic and interactive access to local and neighbourhood information. Designed for easy access and use by all citizens, businesses, community organizations and Governments, the information on the site includes municipal data, census and demographic information, economic development information, historical data, much spatial and mapping information, and facilities for including and sharing data by external groups and constituencies.”

I suggest reading Mike Bergman’s blog post about this new semantic web portal for the proper background on this initiative by the City of Winnipeg and on how it uses the OSF (Open Semantic Framework) as its foundational technology stack.

This project has been the springboard that led to the Open Semantic Framework version 1.1. Multiple pieces of the framework have been developed in relation to this project, and more particularly pieces like the sWebMap semantic component and several improvements to the structWSF web services endpoints and conStruct modules for Drupal 6.

Development of the Portal

The development plan of this portal is composed of four major areas:

  1. Development of the data structure of the municipal domain by creating a series of ontologies
  2. Conversion of existing data assets using this new data structure
  3. Creation of the web portal by creating its design and by developing all the display templates
  4. Creation of new tools to let users interact with the data available on the portal

Structured Dynamics has been involved in #1, #2 and #4 by providing design and development resources, technology transfer sessions and material and supporting internal teams to create, maintain and deploy their 57 publicly available datasets.

The Data Structure

This technology stack does not have any meaning without the proper data and data structures (ontologies) in place. This gold mine of information is what drives the functionality of the portal.

The portal is driven by 12 ontologies: 2 internal and 10 external. The content of the 57 publicly available datasets is defined by the classes and properties defined in one of these ontologies.

The two internal ontologies have been created jointly by Structured Dynamics and the City of Winnipeg, but they are extended and maintained by the city only.

These ontologies are maintained using two different kinds of tools:

  1. Protege
  2. structOntology

Protege is used for large development tasks, such as creating a large number of classes and properties or reorganizing the class structure.

structOntology is used for quick ontological changes that have an immediate impact on the behavior of the portal, such as label changes or SCO ontology property assignments that change the behavior of some of the tools that exist in the portal.

structOntology can also be used by portal users to understand the underlying data structure used to define the data available on the portal. All users have access to the reading mode of the tool, which lets them browse, search and export the ontologies loaded on the portal.

The Data

Except for rare exceptions such as the historical photos, no new data has been created by the City of Winnipeg to populate this NOW portal. Most of its content comes from existing internal sources of data such as:

  • Conventional relational databases
  • GIS (Geographic Information System) on-top of relational databases
  • Spreadsheets

All of the conventional relational databases and legacy data from the GIS systems have been converted into RDF using the FME Workbench ETL system. All of the FME Workbench templates map the relational data into RDF using the ontologies loaded into the portal. All of the geolocated records that exist in the portal come from this ETL process and have been converted using FME.
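
To show the general shape of a relational-to-RDF mapping, here is a small sketch using Python's built-in `sqlite3` module. It is not an FME Workbench template and does not use the portal's ontologies: the `facility` table, the `http://example.org/` namespace and the column-to-property mapping are hypothetical. Only the W3C Basic Geo `lat` and `long` properties and `rdf:type` are real vocabulary terms.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Hypothetical mapping from table columns to ontology properties.
COLUMN_TO_PROPERTY = {
    "name": BASE + "ontology#name",
    "lat": GEO + "lat",
    "lon": GEO + "long",
}


def table_to_triples(conn):
    """Map every row of the `facility` table to a typed RDF resource."""
    conn.row_factory = sqlite3.Row
    triples = []
    query = "SELECT id, name, lat, lon FROM facility ORDER BY id"
    for row in conn.execute(query):
        subject = f"{BASE}facility/{row['id']}"
        triples.append((subject, RDF_TYPE, BASE + "ontology#Facility"))
        for column, prop in COLUMN_TO_PROPERTY.items():
            if row[column] is not None:  # NULL cells produce no triple
                triples.append((subject, prop, row[column]))
    return triples


# In-memory stand-in for an existing relational source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facility (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL)"
)
conn.execute("INSERT INTO facility VALUES (1, 'Library', 49.89, -97.14)")
triples = table_to_triples(conn)
```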

Some smaller datasets come from internal spreadsheets that got modified to comply with the commON spreadsheet format that is used to convert spreadsheet (CSV/TSV) data files into RDF.

All of the dataset creation and maintenance is managed internally by the City of Winnipeg using one of these two data conversion and importation processes.

Here are some internal statistics of the content that is currently accessible on the NOW portal.

General Portal

These are statistics related to different functionalities of the portal.

  • Number of neighbourhoods: 236
  • Number of community areas: 14
  • Number of wards: 15
  • Number of neighbourhood clusters: 23
  • Number of major site sections: 7
  • Total number of site pages: 428,019
    • Static pages: 2,245
    • Record-oriented pages: 425,874
    • Dynamic (search-based) pages: infinite
  • Number of documents: 1,017
  • Number of images: 2,683
  • Number of search facets: 1,392
  • Number of display templates: 54
  • Number of links: 1,067
    • External links: 784
    • Internal links: 283

Site Data

These statistics show what is available via the portal: the types of things, their properties, and the quantity of data that is searchable, manipulable and exportable from the portal.

  • Number of datasets: 57
  • Number of records: 425,874
    • Number of geolocational records: 418,869
      • Point of interest (POI) records: 193,272
      • Polygon records: 218,602
      • Path (route) records: 6,995
  • Number of classes (types): 84
  • Number of properties: 1,308
  • Number of triple assertions: 8,683,103
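
A few derived figures help put these counts in perspective. The short script below only re-uses numbers listed above; the variable names are arbitrary.

```python
# Figures copied from the statistics listed above.
geo = {"poi": 193_272, "polygon": 218_602, "path": 6_995}
geolocational_records = 418_869
records = 425_874
triples = 8_683_103
links = {"external": 784, "internal": 283}

# The sub-counts add up to their stated totals.
assert sum(geo.values()) == geolocational_records
assert sum(links.values()) == 1_067

geo_share = geolocational_records / records  # fraction of records with a location
triples_per_record = triples / records       # average size of a record description

print(f"{geo_share:.1%} of records are geolocated")
print(f"{triples_per_record:.1f} triple assertions per record on average")
```

In other words, nearly all records carry a location, and each record is described by roughly twenty triple assertions.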

Sharing Content

An important aspect of this portal is that all of the content is contextually available, in different formats, to all of the users of the portal. Whether you are browsing content within datasets, searching for specific pieces of content, or looking at a specific record page, you always have the possibility to get your hands on the content that is being displayed to you, the user, with a choice of five different data formats:

Export Page Content

All content pages can be exported in one of the formats outlined above. In the bottom right corner of these pages you will see an Export button that you can click to get the content of that page in one of these formats.

Export Search Content

Every time you do a search on the portal, you can export the results of that search in one of the formats outlined above. You can do that by selecting the Export tab, and by selecting one of the formats you want to use for exporting the data.

Export Datasets

You can export any publicly available dataset from the portal. These datasets have to be exported in slices if they are too big to fit in a single slice. The datasets can be exported in one of the formats mentioned above.
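
Exporting in slices amounts to paging through the records in fixed-size chunks. The sketch below shows the idea only; it is not the portal's export code, and the `slices` helper, the slice size and the sample records are hypothetical.

```python
def slices(records, slice_size):
    """Yield consecutive slices of at most `slice_size` records."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    for start in range(0, len(records), slice_size):
        yield records[start:start + slice_size]


# Ten stand-in records exported four at a time.
dataset = [f"record-{i}" for i in range(10)]
parts = list(slices(dataset, 4))
```

A dataset that fits within one slice simply produces a single part, so the same code path can serve small and large datasets.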

Export Census

Users also have the possibility to export census data, from the census section of the portal, in spreadsheets. They only have to select the Tables tab, and then to click the Export Spreadsheet button.

Export Ontologies

The export functionality would not be complete without the ability to consult and export the ontologies that are used to describe the content exposed by the portal. These ontologies can be read from the ontologies reader user interface, or can be exported from the portal to be read by external ontologies management tools such as Protege.

Portal Design

The portal uses Drupal 6 as its CMS (Content Management System). The Drupal 6 instance communicates with structWSF using the conStruct module, which acts as a bridge between a Drupal portal and a structWSF web service network.

Here are the main design phases that have been required to create the portal:

  1. Creation of the portal’s design, and the Drupal 6 theme that implements it
  2. Creation of the Search and Browse results templates
  3. Creation of the individual records’ page design and templates based on their type
  4. Creation of the sWebMap search results templates.

The portal’s design has been created internally by the City of Winnipeg and by Tactica based on the Citizen DAN demo. Tactica also worked on another Citizen DAN like portal called MyPeg.ca.

Semantic Components

The NOW Web portal uses a series of tools called the Semantic Components. These are a set of Flash and JavaScript tools that can be embedded within any web page and that can easily communicate with structWSF instance(s). They display information in all kinds of charts, they can display document reading widgets, they can create dashboards of structured data, etc. The initial set of Semantic Components was developed for the MyPeg.ca project back in November 2010. This was before Steve Jobs announced that Apple would not support Adobe Flash, and long before Google announced that it would drop support for it as well.

Since the NOW portal team wanted to re-use as much as possible to lower the development cost of the portal, they chose to use the complete OSF stack, which includes these Semantic Components.

However, when we participated in developing this new NOW portal, we extended the set of Semantic Components by creating the most complex one: the sWebMap. Because of the two announcements mentioned above, we chose to move forward and create the sWebMap Semantic Component using JavaScript instead of Flash. The other Semantic Component tools that have been developed in Flash have not yet been ported to JavaScript.

Conclusion

The new NOW semantic web portal’s main asset is its data: how it can be searched (with traditional search engines or using a semantic component to search, browse, filter and localize results), displayed and exported. This portal has been developed using a completely free and open source semantic platform that has been developed from previous projects that open sourced their code.

I consider this portal a pioneer in the way municipal organizations will provide new online services to their citizens and to commercial enterprises, based on the quality of the data that will be exposed via such Web portals.

Posted at 19:45

May 20

Seevl: seevl Attends: MusicTechFest, Google I/O and The Music Technology Showcase

These are busy times for us! In addition to the launch of the mobile version of seevl for Deezer, we’re also involved in three exciting events: MusicTechFest, Google I/O and The Music Technology Showcase! This weekend, Pete traveled to MusicTechFest in London to present seevl to a warm [...]

The post seevl Attends: MusicTechFest, Google I/O and The Music Technology Showcase appeared first on seevl.net.

Posted at 14:05

Copyright of the postings is owned by the original blog authors. Contact us.