Planet RDF

It's triples all the way down

November 24

AKSW Group - University of Leipzig: Highlights of the 1st Meetup on Question Answering Systems – Leipzig, November 21st

On November 21st, the AKSW group hosted the 1st meetup on “Question Answering” (QA) systems. In this meeting, researchers from AKSW/University of Leipzig, CITEC/University of Bielefeld, Fraunhofer IAIS/University of Bonn, DERI/National University of Ireland and the University of Passau presented recent results of their work on QA systems. The following themes were discussed during the meeting:

  • Ontology-driven QA on the Semantic Web. Christina Unger presented the Pythia system for ontology-based QA. Slides are available here.
  • Distributed semantic models for achieving scalability and consistency in QA. André Freitas presented TREO and EasyESA, which employ a vector-based approach for semantic approximation.
  • Template-based QA. Jens Lehmann presented TBSL for template-based question answering over RDF data.
  • Keyword-based QA. Saeedeh Shekarpour presented the SINA approach for semantic interpretation of user queries for QA on interlinked data.
  • Hybrid QA over Linked Data. Ricardo Usbeck presented HAWK for hybrid question answering using Linked Data and full-text indexes.
  • Semantic parsing with Combinatory Categorial Grammars (CCG). Sherzod Hakimov presented this approach. Slides are available here.
  • QA on statistical Linked Data. Konrad Höffner presented LinkedSpending and the RDF Data Cube vocabulary for applying QA to statistical Linked Data.
  • WDAqua (Web Data and Question Answering) project. Christoph Lange presented the WDAqua project, which is part of the EU’s Marie Skłodowska-Curie Action Innovative Training Networks. WDAqua focuses on different aspects of the question, “How can we answer complex questions with web data?”
  • OKBQA (Open Knowledge Base & Question Answering). Axel-C. Ngonga Ngomo presented OKBQA, which aims to bring together leading experts in knowledge base construction and application to create an extensible architecture for QA systems with no restriction on programming languages.
  • Open QA. Edgard Marx presented an open-source question answering framework that unifies QA approaches from several domain experts.

The group decided to meet biannually to join efforts. All agreed to investigate existing architectures for question answering systems, in order to offer a promising, collaborative architecture for future endeavours. Join us next time! For more information, contact Ricardo Usbeck.

Ali and Ricardo on behalf of the QA meetup

Posted at 09:09

November 21

Libby Miller: Product Space and Workshops

I ran a workshop this week for a different bit of the organisation. It’s a bit like holding a party. People expect to enjoy themselves (and this is an important part of the process). But workshops also have to have outcomes and goals and the rest of it. And there’s no booze to help things along.

I always come out of them feeling a bit deflated. Even if others found them enjoyable and useful, the general stress of organising and the responsibility of it all mean that I don’t, plus I have to co-opt colleagues into quite complicated and full-on roles as facilitators, so they can’t really enjoy the process either.

This time we were trying to think more creatively about future work. There are various things niggling me about it, and I want to think about how to improve things next time, while it’s still fresh in my mind.

One of the goals was – in the terms I’ve been thinking of – to explore more thoroughly the space in which products could exist. Excellent articles by

Posted at 11:12

November 20

AKSW Group - University of Leipzig: Announcing GERBIL: General Entity Annotator Benchmark Framework

Dear all,

We are happy to announce GERBIL – a General Entity Annotation Benchmark Framework (a demo is available online)! With GERBIL, we aim to establish a highly available, easily citable and reliable focal point for Named Entity Recognition and Named Entity Disambiguation (Entity Linking) evaluations:

  • GERBIL provides persistent URLs for experimental settings. By these means, GERBIL also addresses the problem of archiving experimental results.
  • The results of GERBIL are published in a human-readable as well as a machine-readable format. By these means, we also tackle the problem of reproducibility.
  • GERBIL provides 11 different datasets and 9 different entity annotators. Please talk to us if you want to add yours.

To ensure that the GERBIL framework is useful to both end users and tool developers, its architecture and interface were designed with the following principles in mind:

  • Easy integration of annotators: We provide a web-based interface that allows annotators to be evaluated via their NIF-based REST interface. We provide a small NIF library for an easy implementation of the interface.
  • Easy integration of datasets: We also provide means to gather datasets for evaluation directly from data services such as DataHub.
  • Extensibility: GERBIL is provided as an open-source platform that can be extended by members of the community both to new tasks and different purposes.
  • Diagnostics: The interface of the tool was designed to provide developers with means to easily detect aspects in which their tool(s) need(s) to be improved.
  • Portability of results: We generate human- and machine-readable results to ensure maximum usefulness and portability of the results generated by our framework.
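
Wrapping an annotator for this kind of NIF-based evaluation mostly means accepting and returning NIF documents. The sketch below builds a minimal NIF context in Turtle using the standard NIF Core namespace; the document URI, example text, and the commented-out endpoint call are illustrative assumptions, not GERBIL's actual values.

```python
# Sketch of the document an annotator's NIF-based REST interface would receive.
# The doc URI and text are made-up examples; only the NIF vocabulary is real.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"

def nif_context(doc_uri: str, text: str) -> str:
    """Serialize a single NIF context (the text to annotate) as Turtle."""
    end = len(text)
    return f"""@prefix nif: <{NIF}> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<{doc_uri}#char=0,{end}>
    a nif:Context, nif:RFC5147String ;
    nif:isString "{text}"^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "{end}"^^xsd:nonNegativeInteger .
"""

payload = nif_context("http://example.org/doc1", "Leipzig is a city in Germany.")
# A wrapped annotator would accept this via HTTP POST (e.g. with
# urllib.request, Content-Type: application/x-turtle) and return the same
# document enriched with entity annotations.
```

The benchmark then compares the returned annotations against the gold standard of the chosen dataset.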

We are looking for your feedback!

Best regards,

Ricardo Usbeck for The GERBIL Team

Posted at 09:08

November 17

Jeen Broekstra: Insight Visualised

An InsightNG thought canvas (click to open)

The past few years I have been heavily involved in a new concept for e-learning and knowledge discovery, called InsightNG. Recently, we released the first public beta of our platform. It is free to sign up for while we are in beta, so by all means give it a try; we would love to hear what you think of it. Or, if you want a quick look at InsightNG to get an idea of what it’s about, visit the public canvas on the Semantic Web that I created.

So what is InsightNG? One way to say what we’re doing is that we are visualising insight. An InsightNG ‘Thought Canvas’ is an interactive map of its creator’s learning process for a particular complex problem or topic. Where a simple search presents you with results without context, we constantly relate everything to the wider scope of your thinking, associating things and building relations between websites, articles, pictures and videos (any sort of web resource), but also more abstract/’real world’ things such as concepts, tasks, events, people, etc. Insight is gained, not by getting individual search results, but by seeing all the little puzzle pieces in this broader context. When it ‘clicks’, when you go “Aha, I get it!”, you have gained insight.

The philosophy behind building these canvases is that knowledge is, at a fundamental level, a social thing, and a never-ending journey. We do not expect you to just sit there and dream a canvas up from scratch – instead we immediately engage in a dialog: while you add new elements and relations to your canvas, our platform constantly analyses what is being added and how its elements relate to each other, and it uses this to discover suggestions from various external sources. We scour the Web in general, Wikipedia, dedicated sources like the Springer Open Access journal library, and more. Our suggestion discovery engine uses a combination of KR/Semantic Web technologies and text mining/NLP strategies to find results that are not just keyword matches but contextually relevant suggestions: we weigh and measure everything, trying to determine how well it fits in with the broader scope of your canvas, and then rate it and feed it back to you. In short: we’re trying to be smart about what we recommend.

Thought Canvases are a great way to explore any topic where you wish to learn more, increase your understanding or awareness, or even just record your thoughts – a great mental exercise for achieving clarity. You can share canvases with friends or coworkers to show your findings and get their feedback. The fact that our discovery engine continues to look for additional related content for you means that revisiting an older canvas can often be very rewarding, as we will have found new interesting stuff for you to look at.

InsightNG is a very broadly applicable visual learning/discovery tool, and we hope you’ll give it a try and tell us what you think of our first beta. There’s a brief interactive tutorial available in the tool itself that automatically starts when you first sign up, and of course we have some more in-depth information as well, such as a Best Practice guide to creating a Canvas, and a more theoretical explanation of our ICE (Inquire, Connect, Enlighten) Methodology.

Posted at 23:29

AKSW Group - University of Leipzig: @BioASQ challenge gaining momentum

BioASQ is a series of challenges aiming to bring us closer to the vision of machines that can answer questions of biomedical professionals and researchers. The second BioASQ challenge started in February 2013. It comprised two different tasks: Large-scale biomedical semantic indexing (Task 2a), and biomedical semantic question answering (Task 2b).

In total 216 users and 142 systems registered to the automated evaluation system of BioASQ in order to participate in the challenge; 28 teams (with 95 systems) finally submitted their suggested solutions and answers. The final results were presented at the BioASQ workshop in the Cross Language Evaluation Forum (CLEF), which took place between September 23 and 26 in Sheffield, U.K.

The Awards Went To The Following Teams

Task 2a (Large-scale biomedical semantic indexing):

  • Fudan University (China)
  • NCBI (USA)
  • Aristotle University of Thessaloniki (Greece) and (USA)

Task 2b (Biomedical semantic question answering):

  • Fudan University (China)
  • NCBI (USA)
  • University of Alberta (Canada)
  • Seoul National University (South Korea)
  • Toyota Technological Institute (Japan)
  • Aristotle University of Thessaloniki (Greece) and (USA)

Best Overall Contribution:

  • NCBI (USA)

The second BioASQ challenge continued the impressive achievements of the first one, pushing the research frontiers in biomedical indexing and question answering. The systems that participated in both tasks of the challenge achieved a notable increase in accuracy over the first year. Among the highlights is the fact that the best systems in task 2a again outperformed the very strong baseline MTI system provided by NLM. This is despite the fact that the MTI system itself has been improved by incorporating ideas proposed by last year’s winning systems. The end of the second challenge also marks the end of the European Commission’s financial support for BioASQ. We would like to take this opportunity to thank the EC for supporting our vision. The main project results (incl. frameworks, datasets and publications) can be found on the project showcase page.

Nevertheless, the BioASQ challenge will continue with its third round, BioASQ3, which will start in February 2015. Stay tuned!

About BioASQ

The BioASQ team combines researchers with complementary expertise from 6 organisations in 3 countries: the Greek National Center for Scientific Research “Demokritos” (coordinator), participating with its Institutes of ‘Informatics & Telecommunications’ and ‘Biosciences & Applications’; the German IT company Transinsight GmbH; the French University Joseph Fourier; the German research group for Agile Knowledge Engineering and Semantic Web (AKSW) at the University of Leipzig; the French University Pierre et Marie Curie‐Paris 6; and the Department of Informatics of the Athens University of Economics and Business in Greece (visit the BioASQ project partners page). Moreover, biomedical experts from several countries assist in the creation of the evaluation data, and a number of key players in industry and academia from around the world participate in the advisory board of the project.
BioASQ started in October 2012 and was funded for two years by the European Commission as a support action (FP7/2007-2013: Intelligent Information Management, Targeted Competition Framework; grant agreement n° 318652). More information can be found on the project website.
Project Coordinator: George Paliouras.

Posted at 14:29

November 16

Libby Miller: Bitmap to SVG

Creating an SVG from a bitmap is pretty easy in Inkscape. Copy and paste your bitmap into a new Inkscape file, select it, do

Path -> Trace Bitmap

Posted at 15:13

November 13

Orri Erling: LDBC: Making Semantic Publishing Execution Rules

LDBC SPB (Semantic Publishing Benchmark) is based on the BBC Linked Data use case. Thus the data modeling and transaction mix reflect the BBC's actual utilization of RDF. But a benchmark is not only a condensation of current best practice. The BBC Linked Data is deployed on Ontotext GraphDB (formerly known as OWLIM).

So, in SPB we wanted to address substantially more complex queries than the lookups that the BBC linked data deployment primarily serves. Diverse dataset summaries, timelines, and faceted search qualified by keywords and/or geography are examples of the online user experience that SPB needs to cover.

SPB is not an analytical workload, per se, but we still find that the queries fall broadly in two categories:

  • Some queries are centered on a particular search or entity. The data touched by the query does not grow at the same rate as the dataset.
  • Some queries cover whole cross sections of the dataset, e.g., find the most popular tags across the whole database.

These different classes of queries need to be separated in a metric; otherwise, the short lookups dominate at small scales, and the large queries at large scales.
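
To make the two classes concrete, here is what a cross-section query of the second kind might look like, together with an entity-scoped variant of the first kind. The `cwork:` vocabulary below approximates the BBC creative-work ontology that SPB builds on; the exact property names in the benchmark may differ.

```python
# Illustrative SPARQL for the "most popular tags" style of analytical query.
# The cwork: names approximate the BBC creative-work ontology; they are not
# guaranteed to match SPB's exact schema.
POPULAR_TAGS = """
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>

SELECT ?tag (COUNT(?work) AS ?uses)
WHERE {
    ?work a cwork:CreativeWork ;
          cwork:tag ?tag .
}
GROUP BY ?tag
ORDER BY DESC(?uses)
LIMIT 10
"""

def scoped_to_entity(entity_iri: str) -> str:
    """A lookup-class variant: the same aggregation restricted to one entity,
    so the data touched stays roughly constant as the dataset grows."""
    return POPULAR_TAGS.replace(
        "cwork:tag ?tag .",
        f"cwork:tag ?tag ; cwork:about <{entity_iri}> ."
    )
```

The unrestricted query scans a cross section of all creative works (linear in data size), while the scoped variant touches only the works about one entity.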

Another guiding factor of SPB was the BBC's and others' express wish to cover operational aspects such as online backups, replication, and fail-over in a benchmark. True, most online installations have to deal with these, yet these things are as good as absent from present benchmark practice. We will look at these aspects in a different article; for now, I will just discuss the matter of workload mix and metric.

Normally, the lookup and analytics workloads are divided into different benchmarks. Here, we will try something different. There are three things the benchmark does:

  • Updates - These sometimes insert a graph, sometimes delete and re-insert the same graph, sometimes just delete a graph. These are logarithmic to data size.

  • Short queries - These are lookups that most often touch on recent data and can drive page impressions. These are roughly logarithmic to data scale.

  • Analytics - These cover a large fraction of the dataset and are roughly linear to data size.

A test sponsor can decide on the query mix within certain bounds. A qualifying run must sustain a minimum, scale-dependent update throughput and must execute a scale-dependent number of analytical query mixes, or run for a scale-dependent duration. The minimum update rate, the minimum number of analytics mixes and the minimum duration all grow logarithmically to data size.

Within these limits, the test sponsor can decide how to mix the workloads. Publishing several results emphasizing different aspects is also possible. A given system may be especially good at one aspect, leading the test sponsor to accentuate this.

The benchmark has been developed and tested at small scales, between 50 and 150 million triples. Next we need to see how it actually scales; there we expect to see the two query sets behave differently. One effect that we see right away when loading data is that creating the full text index on the literals is in fact the longest-running part. For an SF 32 (1.6 billion triples) SPB database, we have the following space consumption figures:

  • 46,886 MB of RDF literal text
  • 23,924 MB of full text index for RDF literals
  • 23,598 MB of URI strings
  • 21,981 MB of quads, stored column-wise with default index scheme

Clearly, applying column-wise compression to the strings is the best move for increasing scalability. The literals are individually short, so per-literal compression will do little or nothing; but applying compression by the column is known to yield a 2x size reduction with Google Snappy.

The full text index does not get much from column store techniques, as it already consists of words followed by space efficient lists of word positions. The above numbers are measured with Virtuoso column store, with quads column-wise and the rest row-wise. Each number includes the table(s) and any extra indices associated to them.
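
The column-wise effect is easy to demonstrate. The sketch below uses zlib as a stand-in for Snappy (Snappy is not in the Python standard library) and synthetic strings standing in for short RDF literals; the point is only the shape of the result, not the exact ratio.

```python
# Compressing short, similar strings one by one barely helps (per-item
# overhead, no shared dictionary); compressing them as one column does.
import zlib

# Synthetic stand-ins for short RDF literals sharing common structure.
literals = [f"Event {i} reported in Leipzig on 2014-11-{i % 28 + 1:02d}".encode()
            for i in range(1000)]

per_literal = sum(len(zlib.compress(lit)) for lit in literals)
by_column = len(zlib.compress(b"\n".join(literals)))

print(per_literal, by_column)  # the column-wise figure is several times smaller
assert by_column < per_literal / 2
```

With real Snappy the per-block ratio is lower than zlib's, but the same asymmetry between per-value and per-column compression holds.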

Let's now look at a full run at unit scale, i.e., 50M triples.

The run rules stipulate a minimum of 7 updates per second. The updates are comparatively fast, so we set the update rate to 70 updates per second. This is seen not to take too much CPU. We run 2 threads of updates, 20 of short queries, and 2 of long queries. The minimum run time for the unit scale is 10 minutes, so we do 10 analytical mixes, as this is expected to take a little over 10 minutes. The run stops by itself when the last of the analytical mixes finishes.

The interactive driver reports:

Seconds run : 2,144
        2 agents

        68,164 inserts (avg :   46  ms, min :    5  ms, max :   3002  ms)
         8,440 updates (avg :   72  ms, min :   15  ms, max :   2471  ms)
         8,539 deletes (avg :   37  ms, min :    4  ms, max :   2531  ms)

        85,143 operations (68,164 CW Inserts   (98 errors), 
                            8,440 CW Updates   ( 0 errors), 
                            8,539 CW Deletions ( 0 errors))
        39.7122 average operations per second

        20 agents

        4120  Q1   queries (avg :    789  ms, min :   197  ms, max :   6,767   ms, 0 errors)
        4121  Q2   queries (avg :     85  ms, min :    26  ms, max :   3,058   ms, 0 errors)
        4124  Q3   queries (avg :     67  ms, min :     5  ms, max :   3,031   ms, 0 errors)
        4118  Q5   queries (avg :    354  ms, min :     3  ms, max :   8,172   ms, 0 errors)
        4117  Q8   queries (avg :    975  ms, min :    25  ms, max :   7,368   ms, 0 errors)
        4119  Q11  queries (avg :    221  ms, min :    75  ms, max :   3,129   ms, 0 errors)
        4122  Q12  queries (avg :    131  ms, min :    45  ms, max :   1,130   ms, 0 errors)
        4115  Q17  queries (avg :  5,321  ms, min :    35  ms, max :  13,144   ms, 0 errors)
        4119  Q18  queries (avg :    987  ms, min :   138  ms, max :   6,738   ms, 0 errors)
        4121  Q24  queries (avg :    917  ms, min :    33  ms, max :   3,653   ms, 0 errors)
        4122  Q25  queries (avg :    451  ms, min :    70  ms, max :   3,695   ms, 0 errors)

        22.5239 average queries per second. 
        Pool 0, queries [ Q1 Q2 Q3 Q5 Q8 Q11 Q12 Q17 Q18 Q24 Q25 ]

        45,318 total retrieval queries (0 timed-out)
        22.5239 average queries per second

The analytical driver reports:

        2 agents

        14    Q4   queries (avg :   9,984  ms, min :   4,832  ms, max :   17,957  ms, 0 errors)
        12    Q6   queries (avg :   4,173  ms, min :      46  ms, max :    7,843  ms, 0 errors)
        13    Q7   queries (avg :   1,855  ms, min :   1,295  ms, max :    2,415  ms, 0 errors)
        13    Q9   queries (avg :     561  ms, min :     446  ms, max :      662  ms, 0 errors)
        14    Q10  queries (avg :   2,641  ms, min :   1,652  ms, max :    4,238  ms, 0 errors)
        12    Q13  queries (avg :     595  ms, min :     373  ms, max :    1,167  ms, 0 errors)
        12    Q14  queries (avg :  65,362  ms, min :   6,127  ms, max :  136,346  ms, 2 errors)
        13    Q15  queries (avg :  45,737  ms, min :  12,698  ms, max :   59,935  ms, 0 errors)
        13    Q16  queries (avg :  30,939  ms, min :  10,224  ms, max :   38,161  ms, 0 errors)
        13    Q19  queries (avg :     310  ms, min :      26  ms, max :    1,733  ms, 0 errors)
        12    Q20  queries (avg :  13,821  ms, min :  11,092  ms, max :   15,435  ms, 0 errors)
        13    Q21  queries (avg :  36,611  ms, min :  14,164  ms, max :   70,954  ms, 0 errors)
        13    Q22  queries (avg :  42,048  ms, min :   7,106  ms, max :   74,296  ms, 0 errors)
        13    Q23  queries (avg :  48,474  ms, min :  18,574  ms, max :   93,656  ms, 0 errors)
        0.0862 average queries per second. 
        Pool 0, queries [ Q4 Q6 Q7 Q9 Q10 Q13 Q14 Q15 Q16 Q19 Q20 Q21 Q22 Q23 ]

        180 total retrieval queries (2 timed-out)
        0.0862 average queries per second

The metric would be 22.52 qi/s, 310 qa/h, 39.7 u/s @ 50 Mt (SF 1).
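
The components of that metric follow directly from the driver reports above; a quick sketch of the arithmetic, using the figures printed in this run:

```python
# Deriving the metric components from the driver-reported totals above.
ops_total = 68_164 + 8_440 + 8_539      # inserts + updates + deletes
seconds_run = 2_144

u_per_s = ops_total / seconds_run        # update throughput, ~39.7 u/s
qa_per_h = 0.0862 * 3600                 # analytical rate, reported per second
qi_per_s = 22.5239                       # interactive rate, as reported

print(f"{qi_per_s:.2f} qi/s, {qa_per_h:.0f} qa/h, {u_per_s:.1f} u/s @ 50 Mt (SF 1)")
```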

The SUT is dual Xeon E5-2630, all in memory. The platform utilization is steadily above 2000% CPU (over 20/24 hardware threads busy on the DBMS). The DBMS is Virtuoso Open Source (v7fasttrack at, feature/analytics branch).

The minimum update rate of 7/s was sustained, but fell short of the target of 70/s. In this run, most demand was put on the interactive queries. Different thread allocations would give different ratios of the metric components. The analytics mix, for example, is about 3x faster without other concurrent activity.

Is this good or bad? I would say that this is possible but better can certainly be accomplished.

The initial observation is that Q17 is the worst of the interactive lot. 3x better is easily accomplished by avoiding a basic stupidity. The query does the evil deed of checking for a substring in a URI. This is done in the wrong place and accounts for most of the time. The query is meant to test geo retrieval but ends up doing something quite different. Optimizing this right would by itself almost double the interactive score. There are some timeouts in the analytical run, which as such disqualifies the run. This is not a fully compliant result, but is close enough to give an idea of the dynamics. So we see that the experiment is definitely feasible, is reasonably defined, and that the dynamics seen make sense.

As an initial comment of the workload mix, I'd say that interactive should have a few more very short point-lookups, to stress compilation times and give a higher absolute score of queries per second.

Adjustments to the mix will depend on what we find out about scaling. As with SNB, it is likely that the workload will shift a little so this result might not be comparable with future ones.

In the next SPB article, we will look closer at performance dynamics and choke points and will have an initial impression on scaling the workload.

Posted at 21:19

Orri Erling: LDBC: Creating a Metric for SNB

In the Making It Interactive post on the LDBC blog, we were talking about composing an interactive Social Network Benchmark (SNB) metric. Now we will look at what this looks like in practice.

A benchmark is known by its primary metric. An actual benchmark implementation may deal with endless complexity but the whole point of the exercise is to reduce this all to an extremely compact form, optimally a number or two.

For SNB, we suggest clicks per second Interactive at scale (cpsI@ so many GB) as the primary metric. To each scale of the dataset corresponds a rate of update in the dataset's timeline (simulation time). When running the benchmark, the events in simulation time are transposed to a timeline in real time.

Another way of expressing the metric is therefore acceleration factor at scale. In this example, we run a 300 GB database at an acceleration of 1.64; i.e., in the present example, we did 97 minutes of simulation time in 58 minutes of real time.

Another key component of a benchmark is the full disclosure report (FDR). This is expected to enable any interested party to reproduce the experiment.

The system under test (SUT) is Virtuoso running an SQL implementation of the workload at 300 GB (SF = 300). This run gives an idea of what an official report will look like but is not one yet. The implementation differs from the present specification in the following:

  • The SNB test driver is not used. Instead, the workload is read from the file system by stored procedures on the SUT. This is done to circumvent latencies in update scheduling in the test driver which would result in the SUT not reaching full platform utilization.

  • The workload is extended by 2 short lookups, i.e., person profile view and post detail view. These are very short and serve to give the test more of an online flavor.

  • The short queries appear in the report as multiple entries. This should not be the case. This inflates the clicks per second number but does not significantly affect the acceleration factor.

As a caveat, this metric will not be comparable with future ones.

Aside from the composition of the report, the interesting point is that with the present workload, a 300 GB database keeps up with the simulation timeline on a commodity server, also when running updates. The query frequencies and run times are in the full report. We also produced a graphic showing the evolution of the throughput over a run of one hour.

We see steady throughput except for some slower minutes which correspond to database checkpoints. (A checkpoint, sometimes called a log checkpoint, is the operation which makes a database state durable outside of the transaction log.) If we run updates only, at full platform utilization, we get an acceleration of about 300x in memory for 20 minutes, then 10 minutes of nothing happening while the database is being checkpointed. This is measured with six 2 TB magnetic disks. Such behavior is incompatible with an interactive workload. But with a checkpoint every 10 minutes and updates mixed with queries, checkpointing the database does not lead to impossible latencies. Thus, we do not get the TPC-C syndrome, which requires tens of disks or several SSDs per core to run.

This is a good thing for the benchmark, as we do not want to require unusual I/O systems for competition. Such a requirement would simply encourage people to ignore that part of the specification and would limit the number of qualifying results.

The full report contains the details. This is also a template for later "real" FDRs. The supporting files are divided into test implementation and system configuration. With these materials plus the data generator, one should be able to repeat the results using a Virtuoso Open Source cut from v7fasttrack at, feature/analytics branch.

In later posts we will analyze the results a bit more and see how much improvement potential we find. The next SNB article will be about the business intelligence and graph analytics areas of SNB.

Posted at 21:09

November 11

Tetherless World Constellation group RPI: Data Management – Serendipity in Academic Career

A few days ago I began to think about a topic for a blog post, and the first thing that came to mind was ‘data management’, followed by a sentence from a Chinese poem: ‘无心插柳柳成荫’. I went to Google for an English translation, and the result was ‘serendipitously’. Interesting; I had never seen that word before, and I had to use a dictionary to find that ‘serendipity’ means an unintentional positive outcome, which expresses the meaning of the Chinese sentence quite well. So I regard data management as serendipity in my academic career: I was trained as a geoinformatics researcher through my education in China and the Netherlands, so how has it come about that most of my current time is spent on data management?

One clue I can see is that I have been working on ontologies, vocabularies and conceptual models for geoscience data services, which is relevant to data management. Another, more direct clue is a symposium, ‘Data Management in Research: A Challenging Issue’, organized on the University of Twente campus in spring 2011. Dr. David Rossiter, Ms. Marga Koelen, I and a few other ITC colleagues attended the event. That symposium highlighted both the technical and the social/cultural issues faced by 3TU.Datacentrum, a data repository for the three technological universities in the Netherlands. It is very interesting to see that several topics of my current work had already been discussed in that symposium, though I paid almost no attention at the time because I was completely focused on my vocabulary work. Now that I am working on data management, I would like to introduce a few concepts relevant to it, as well as the current social and technical trends.

Data management, in simple words, means what you do with your datasets during and after a research project. Conventionally, we treat the paper as the ‘first class’ product of research, and many scientists pay little attention to data management. This can lower the efficiency of research activities and hinder communication among research groups in different institutions. There is even a rumor that 80% of a scientist’s time is spent on data discovery, retrieval and assimilation, and only 20% on data analysis and scientific discovery. An ideal situation would reverse that allocation of time, but that requires effort on both a technical infrastructure for data publication and a set of appropriate incentives for data authors.

After coming to the United States, the first data repository that caught my attention was the California Digital Library (CDL), which is similar to the services offered by 3TU.Datacentrum. I like the way CDL works not only because it provides a place for depositing datasets but also, and more importantly, because it provides a series of tools and services that allow users to draft data management plans to address funding agency requirements, to mint unique and persistent identifiers for published datasets, and to improve the visibility of the published datasets. The term data publication is derived from paper publication. By documenting metadata, minting unique identifiers (e.g., Digital Object Identifiers (DOIs)), and archiving copies of datasets in a repository, we can make a published dataset similar to a published paper. The identifier and metadata make the dataset citable, just like a published paper. A global initiative, DataCite, has been working on standards for metadata schemas and identifiers for datasets, and is increasingly endorsed by data repositories across the world, including both CDL and 3TU.Datacentrum. A technological infrastructure for data publication is emerging, and people are now beginning to talk about the cultural change needed to treat data as a ‘first class’ product of research.

Though funding agencies already require data management plans in funding proposals, such as the requirements of the National Science Foundation in the US and Horizon 2020 in the EU (a Google search with the keyword ‘data management’ and the name of the funding agency will help find the agency’s guidelines), the science community still has a long way to go to give data publication the same attention as paper publication. Various community efforts have been made to promote data publication and citation. FORCE11 published the Joint Declaration of Data Citation Principles in 2013 to promote good research practice in citing datasets. Earlier, in 2012, the Federation of Earth Science Information Partners published Data Citation Guidelines for Data Providers and Archives, which offers more practical details on how a published dataset should be cited. In 2013, the Research Data Alliance was launched to build the social and technical bridges that enable open sharing of data, which enhances existing efforts, such as CODATA, to promote data management and sharing.


To promote data citation, a number of publishers have launched so-called data journals in recent years, such as Scientific Data from Nature Publishing Group, the Geoscience Data Journal from Wiley, and Data in Brief from Elsevier. Such a data journal often has a number of affiliated and certified data repositories. A data paper allows the authors to describe a dataset published in a repository. A data paper is itself a journal paper, so it is citable; and the dataset is also citable, because there are associated metadata and an identifier in the data repository. This makes data citation flexible (and perhaps confusing): you can cite a dataset by citing the identifier of the associated data paper, or the identifier of the dataset itself, or both. More interestingly, a paper can cite a dataset, a dataset can cite a dataset, and a dataset can also cite a paper (e.g., because the dataset may be derived from tables in a paper). The Data Citation Index launched by Thomson Reuters provides services to index the world’s leading data repositories, connect datasets to related literature indexed in the Web of Science database, and search and access data across subjects and regions.
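
In practice, the metadata and identifier are what make a dataset citable. The sketch below assembles a citation string in the spirit of the ESIP/DataCite guidelines mentioned above; the field order is a common pattern rather than a verbatim reproduction of either specification, and the author, dataset, and DOI are made-up examples.

```python
# Illustrative dataset-citation formatter. The field order approximates the
# ESIP/DataCite style (Creator (Year) Title. Version. Publisher. Identifier);
# the example values below are hypothetical.
def cite_dataset(creators, year, title, publisher, doi, version=None):
    parts = ["; ".join(creators), f"({year})", f"{title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.extend([f"{publisher}.", f"https://doi.org/{doi}"])
    return " ".join(parts)

print(cite_dataset(["Ma, X."], 2014, "Example geoscience vocabulary dataset",
                   "3TU.Datacentrum", "10.1234/example", version="2.0"))
```

Because the DOI resolves to the repository's landing page, the same string works whether the reader wants the data itself or its metadata record.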

Although there has been huge progress on data publication and citation, we are not yet at the point of fully treating data as ‘first class’ products of research. One recent piece of good news is that, in 2013, the National Science Foundation renamed the Publications section in the biographical sketch of funding applicants to Products, allowing datasets and software to be listed there ( However, this is still just a small step. We hope more incentives like this appear in academia. For instance, even with the Data Citation Index, are we ready to mix data citations and paper citations when computing the H-index of a scientist? And even if there were such an H-index, are we ready to use it in research assessment?

Data management involves many social as well as technological issues, which makes it quite different from the purely technical questions of geoinformatics research. It is enjoyable work, and as a next step I may spend more time on data analysis, for which I may introduce a few ideas in another blog post.

Posted at 23:32

November 09

Bob DuCharme: Querying aggregated Walmart and BestBuy data with SPARQL

From structured data in their web pages!

Posted at 14:35

November 07

W3C Data Activity: Encouraging Commercial Use of Open Data

I went to Paris this week to give a talk at SemWeb.Pro, an event that, like SemTechBiz in Silicon Valley or SEMANTiCS in Germany/Austria, is firmly established in the annual calendar. These are events where businesses and other ‘real world’ … Continue reading

Posted at 12:21

November 05

Libby Miller: Working out which wire does what for USB

Occasionally I want to solder the power wires for something 5V to USB (A) so I can power it from that (most recently a lovely little

Posted at 10:44

November 03

Frederick Giasson: clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle

Some of my recent work has led me to use Clojure heavily to develop all kinds of new capabilities for Structured Dynamics. Those who know us know that everything we do is related to RDF and OWL ontologies. All this work with Clojure is no exception.

Recently, while developing a Domain Specific Language (DSL) for using the Open Semantic Framework (OSF) web service endpoints, I did some research to find a simple Clojure DSL that I could use to generate RDF data (in any well-known serialization). After some time, I figured out that no such thing existed in the Clojure ecosystem, so I chose to create my own simple DSL for creating RDF data.

The primary goal of this new project was to have a DSL that users could use to create RDF data that could be fed to OSF web service endpoints such as the CRUD: Create or CRUD: Update endpoints.

What I chose to do was create a new project called clj-turtle that generates RDF/Turtle code from Clojure code. The Turtle code produced by this DSL is currently quite verbose: all the URIs are written out in full, triple quotes are used, and every triple is fully spelled out.

This new DSL is meant to be a really simple and easy way to create RDF data. It could even be used by non-Clojure coders to create RDF/Turtle-compatible data. New services could easily be created that take the DSL code as input and output the RDF/Turtle code. That way, no Clojure environment would be required to use the DSL for generating RDF data.


For people used to Clojure and Leiningen, you can easily install clj-turtle using Leiningen. The only thing you have to do is add [clj-turtle "0.1.3"] as a dependency to your project.clj.

Then make sure that you downloaded this dependency by running the lein deps command.


The whole DSL is composed of just six operators:

  • rdf/turtle
    • Used to generate RDF/Turtle serialized data from a set of triples defined by clj-turtle.
  • defns
    • Used to create/instantiate a new namespace that can be used to create the clj-turtle triples
  • rei
    • Used to reify a clj-turtle triple
  • iri
    • Used to refer to a URI where you provide the full URI as an input string
  • literal
    • Used to refer to a literal value
  • a
    • Used to specify the rdf:type of an entity being described


Working with namespaces

The core of this DSL is the defns operator. It lets you create the namespaces you want to use to describe your data. Every time you use a namespace, it generates a URI reference in the triple(s) that will be serialized in Turtle.

However, it is not necessary to create a new namespace every time you want to serialize Turtle data. In some cases, you may not even know what the namespace is, since you already have the full URI in hand. That is why there is the iri function, which lets you serialize a full URI without having to use a namespace.

Namespaces are just shorthand versions of full URIs, meant to make your code cleaner and easier to read and maintain.
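To make the idea concrete, here is a hedged sketch in Python (not clj-turtle's actual implementation; the prefix URI http://example.org/foo# is an assumed placeholder): a namespace is essentially a function from a local name to a full URI reference.

```python
# Hypothetical sketch of the idea behind defns: a namespace is a
# function that expands a local name into a full URI reference.
def defns(prefix_uri):
    def ns(local):
        # keywords, strings, numbers, etc. are all serialized as strings
        return "<%s%s>" % (prefix_uri, str(local).lstrip(":"))
    return ns

foo = defns("http://example.org/foo#")
print(foo("bar"))    # -> <http://example.org/foo#bar>
print(foo(":bar"))   # -> <http://example.org/foo#bar>
```

This is why keywords and strings are interchangeable in the DSL: both end up as the local-name part of the expanded URI.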

Syntactic rules

Here are the general syntactic rules that you have to follow when creating triples in a (rdf) or (turtle) statement:

  1. Wrap all the code using the (rdf) or the (turtle) operator
  2. Every triple needs to be explicit. This means that every time you want to create a new triple, you have to state the subject, the predicate, and the object of the triple
  3. A fourth “reification” element can be added using the rei operator
  4. The first parameter of any function can be any kind of value: keywords, strings, integers, doubles, etc. They will be properly serialized as strings in Turtle.

Strings and keywords

As specified in the syntactic rules, at any time you can use a string, an integer, a double, a keyword, or any other kind of value as input to the defined namespaces or the other API calls. Just use whichever form is more convenient or cleanest for your taste.

More about reification

Note: RDF reification is quite a different concept from Clojure’s reify macro, so read this section carefully to understand what the concept means in this context.

In RDF, reifying a triple means that we want to add additional information about a specific triple. Let’s take this example:

  (rdf
    (foo :bar) (iron :prefLabel) (literal "Bar"))

In this example, we have a triple that specifies the preferred label of the :bar entity. Now, let’s say that we want to add “meta” information about that specific triple, such as the date when the triple was added to the system.

That additional information is considered the “fourth” element of a triple. It is defined like this:

    (rdf
      (foo :bar) (iron :prefLabel) (literal "Bar") (rei
                                                     (foo :date) (literal "2014-10-25" :type :xsd:dateTime)))

What we do here is specify additional information about the triple itself; in this case, the date when the triple was added to our system.

So, reification statements are really “meta information” about triples. Also note that reification statements don’t change the semantics of the description of the entities.
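The reified Turtle output in the examples further below suggests what such a statement expands to: the base triple plus a hash-named rdf:Statement node. A hedged sketch of that expansion (Python, with a hypothetical reify helper; not clj-turtle's actual code, and the hash choice is an assumption):

```python
import hashlib

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(s, p, o, meta):
    """Emit the base triple plus a hash-named rdf:Statement node that
    points back at the subject, predicate, and object, then attach the
    extra meta predicate/object pairs to that node."""
    triples = [(s, p, o)]
    node = "<rei:%s>" % hashlib.md5((s + p + o).encode("utf-8")).hexdigest()
    triples += [
        (node, "<%stype>" % RDF, "<%sStatement>" % RDF),
        (node, "<%ssubject>" % RDF, s),
        (node, "<%spredicate>" % RDF, p),
        (node, "<%sobject>" % RDF, o),
    ]
    for mp, mo in meta:
        triples.append((node, mp, mo))
    return triples
```

The key point survives the sketch: the original triple is left untouched, and all the meta information hangs off the separate statement node.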


Here is a list of examples of how this DSL can be used to generate RDF/Turtle data:

Create a new namespace

The first thing we have to do is define the namespaces we will want to use in our code.

(defns iron "")
(defns foo "")
(defns owl "")
(defns rdf "")
(defns xsd "")

Create a simple triple

The simplest example is to create a single triple. What this triple does is define the preferred label of a :bar entity:

  (rdf
    (foo :bar) (iron :prefLabel) (literal "Bar"))


<> <> """Bar""" .

Create a series of triples

This example shows how we can describe more than one attribute for our bar entity:

  (rdf
    (foo :bar) (a) (owl :Thing)
    (foo :bar) (iron :prefLabel) (literal "Bar")
    (foo :bar) (iron :altLabel) (literal "Foo"))


<> <> <> .
<> <> """Bar""" .
<> <> """Foo""" .

Note: we prefer having one triple per line. However, it is possible to put all the triples on one line, though this produces less readable code.

Just use keywords

It is possible to use keywords everywhere, even in (literal) calls:

  (rdf
    (foo :bar) (a) (owl :Thing)
    (foo :bar) (iron :prefLabel) (literal :Bar)
    (foo :bar) (iron :altLabel) (literal :Foo))


<> <> <> .
<> <> """Bar""" .
<> <> """Foo""" .

Just use strings

It is possible to use strings everywhere, even in namespaces:

  (rdf
    (foo "bar") (a) (owl "Thing")
    (foo "bar") (iron :prefLabel) (literal "Bar")
    (foo "bar") (iron :altLabel) (literal "Foo"))


<> <> <> .
<> <> """Bar""" .
<> <> """Foo""" .

Specifying a datatype in a literal

It is possible to specify a datatype for every (literal) you define. You only have to use the :type option and specify an XSD datatype as its value:

  (rdf
    (foo "bar") (foo :probability) (literal 0.03 :type :xsd:double))

Equivalent forms are:

  (rdf
    (foo "bar") (foo :probability) (literal 0.03 :type (xsd :double)))
  (rdf
    (foo "bar") (foo :probability) (literal 0.03 :type (iri "")))


<> <> """0.03"""^^xsd:double .

Specifying a language for a literal

It is possible to specify a language for a string using the :lang option. The language tag should be a valid ISO 639-1 language tag.

  (rdf
    (foo "bar") (iron :prefLabel) (literal "Robert" :lang :fr))


<> <> """Robert"""@fr .

Defining a type using the a operator

It is possible to use the (a) predicate as a shortcut to define the rdf:type of an entity:

  (rdf
    (foo "bar") (a) (owl "Thing"))


<> <> <> .

This is a shortcut for:

  (rdf
    (foo "bar") (rdf :type) (owl "Thing"))

Specifying a full URI using the iri operator

It is possible to define a subject, a predicate, or an object using the (iri) operator if you want to define them using the full URI of the entity:

  (rdf
    (iri "") (iri "") (iri ""))


<> <> <> .

Simple reification

It is possible to reify any triple using the (rei) operator as the fourth element of a triple:

  (rdf
    (foo :bar) (iron :prefLabel) (literal "Bar") (rei
                                                   (foo :date) (literal "2014-10-25" :type :xsd:dateTime)))


<> <> """Bar""" .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> """Bar""" .
<rei:6930a1f93513367e174886cb7f7f74b7> <> """2014-10-25"""^^xsd:dateTime .

Reify multiple properties

It is possible to add multiple reification statements:

(rdf
  (foo :bar) (iron :prefLabel) (literal "Bar") (rei
                                                 (foo :date) (literal "2014-10-25" :type :xsd:dateTime)
                                                 (foo :bar) (literal 0.37)))


<> <> """Bar""" .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> <> .
<rei:6930a1f93513367e174886cb7f7f74b7> <> """Bar""" .
<rei:6930a1f93513367e174886cb7f7f74b7> <> """2014-10-25"""^^xsd:dateTime .
<rei:6930a1f93513367e174886cb7f7f74b7> <> """0.37""" .

Using clj-turtle with clj-osf

clj-turtle is meant to be used in Clojure code to simplify the creation of RDF data. Here is an example of how clj-turtle can be used to generate RDF data to feed to the OSF CRUD: Create web service endpoint via the clj-osf DSL:

(require '[clj-osf.crud :as crud])

  (crud/dataset "")
      (iri link) (a) (bibo :Article)
      (iri link) (iron :prefLabel) (literal "Some article")))

Using the turtle alias operator

Depending on your taste, it is possible to use the (turtle) operator instead of the (rdf) one to generate the RDF/Turtle code:

  (turtle
    (foo "bar") (iron :prefLabel) (literal "Robert" :lang :fr))


<> <> """Robert"""@fr .

Merging clj-turtle

Depending on the work you have to do in your Clojure application, you may have to generate the Turtle data using a more complex flow of operations. However, this is not an issue for clj-turtle, since the only thing you have to do is concatenate the triples you are creating. You can do so with a simple call to the str function, or you can build more complex processing using loops, maps, etc., that ends with an (apply str) to generate the final Turtle string.

  (str
    (rdf
      (foo "bar") (a) (owl "Thing"))
    (rdf
      (foo "bar") (iron :prefLabel) (literal "Bar")
      (foo "bar") (iron :altLabel) (literal "Foo")))


<> <> <> .
<> <> """Bar""" .
<> <> """Foo""" .


As you can see, this is a really simple DSL for generating RDF/Turtle code, and I find it quite effective precisely because of its simplicity. Even though it has a minimal number of operators, it is flexible enough to describe any kind of RDF data. Also, thanks to Clojure, it is possible to write all kinds of code that generate DSL code, which is then executed to generate the RDF data. For example, we can easily create code that iterates over a collection to produce one triple per item, like this:

(->> ["a" "b" "c"]
     (map (fn [label]
            (rdf
              (foo :bar) (iron :prefLabel) (literal label))))
     (apply str))

That code generates 3 triples (or more if the input collection is bigger). Starting from this simple example, we can see how much more complex processes can leverage clj-turtle for generating RDF data.

A future enhancement to this DSL would be a syntactic rule that lets the user specify the subject of a triple only the first time it is introduced, mimicking the semicolon of the Turtle syntax.
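A minimal sketch of what such a shorthand could look like (Python, with a hypothetical about helper and example URIs; clj-turtle itself does not provide this yet):

```python
def about(subject, *predicate_object_pairs):
    """State the subject once and fan it out over several
    predicate/object pairs, mimicking Turtle's semicolon syntax."""
    return "".join("%s %s %s .\n" % (subject, p, o)
                   for p, o in predicate_object_pairs)

print(about("<http://example.org/foo#bar>",
            ("<http://example.org/iron#prefLabel>", '"""Bar"""'),
            ("<http://example.org/iron#altLabel>", '"""Foo"""')))
```

Each pair becomes a full triple in the output, so the verbose Turtle form stays the same while the DSL input avoids repeating the subject.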

Posted at 17:58

October 29

Gregory Williams: ISWC 2014

Last week I attended ISWC 2014 in Riva del Garda and had a very nice time.

Riva del Garda

I presented a paper at the Developers Workshop about designing the new PerlRDF and SPARQLKit systems using trait-based programming. Covering more PerlRDF work, Kjetil presented his work on RDF::LinkedData.

Answering SPARQL Queries over Databases under OWL 2 QL Entailment Regime by Kontchakov, et al. caught my attention during the main conference. The authors show a technique to map query answering under OWL QL entailment to pure SPARQL 1.1 queries under simple entailment, allowing OWL QL reasoning to be used with SPARQL 1.1 systems that do not otherwise support OWL. An important challenge highlighted by the work (and echoed by the audience after the presentation) was the limited performance of existing property path implementations. It seems as if there might be a subset of SPARQL 1.1 property paths (e.g. with limited nesting of path operators) that could be implemented more efficiently while providing enough expressivity to allow things like OWL reasoning.

There seemed to be a lot of interest in more flexible solutions to SPARQL query answering. Ruben’s Linked Data Fragments work addresses one part of this issue (though I remain somewhat skeptical about it as a general-purpose solution). That work was received very well at the conference, winning the Best Demo award. In conversations I had, there was also a lot of interest in the development of SPARQL systems that could handle capacity and scalability issues more gracefully (being smart enough to use appropriate HTTP error codes when a query cannot be answered due to system load or the complexity of the query, allowing a fallback to linked data fragments, for example).

Posted at 02:44

October 28

AKSW Group - University of Leipzig: AKSW successful at #ISWC2014

Dear followers, 9 members of AKSW have been participating at the 13th International Semantic Web Conference (ISWC) at Riva del Garda, Italy. Next to listening to interesting talks, giving presentations or discussing with fellow Semantic Web researchers, AKSW won 4 significant prizes:

We do work on way more projects, which you can find at Cheers, Ricardo on behalf of the AKSW group
Best Paper Award

Posted at 16:03

October 24

Dublin Core Metadata Initiative: DCMI's first Governing Board officer transition

2014-10-24, In the closing ceremony of DC-2014, DCMI exercised the first Governing Board officer transition under the Initiative's new governance structure. Michael Crandall stepped into the role of Immediate Past Chair as Eric Childress assumed the roles of Chair of DCMI and the Governing Board. Joseph Tennis became the new Chair Elect of the Board and will succeed as Chair at DC-2015 in São Paulo, Brazil. Information about the new DCMI governance structure can be found in the DCMI Handbook at

Posted at 23:59

Dublin Core Metadata Initiative: DCMI/ASIS&T; Webinar - The Learning Resource Metadata Initiative, describing learning resources with, and more?

2014-10-24, The Learning Resource Metadata Initiative (LRMI) is a collaborative initiative that aims to make it easier for teachers and learners to find educational materials through major search engines and specialized resource discovery services. The approach taken by LRMI is to extend the ontology so that educationally significant characteristics and relationships can be expressed. In this webinar, Phil Barker and Lorna M. Campbell of Cetis will introduce and present the background to LRMI, its aims and objectives, and who is involved in achieving them. The webinar will outline the technical aspects of the LRMI specification, describe some example implementations and demonstrate how the discoverability of learning resources may be enhanced. Phil and Lorna will present the latest developments in LRMI implementation, drawing on an analysis of its use by a range of open educational resource repositories and aggregators, and will report on the potential of LRMI to enhance education search and discovery services. Whereas the development of LRMI has been inspired by, the webinar will also include discussion of whether LRMI has applications beyond those of Registration at The webinar is free to DCMI Individual & Organizational Members, to ASIS&T Members and at modest fee to non-members.

Posted at 23:59

Dublin Core Metadata Initiative: Tutorials, presentations and proceedings of DC-2014 published

2014-10-24, Over 236 people from 17 countries attended DC-2014 in Austin, Texas from 8 through 11 October 2014. Pre- and post-Conference tutorials and workshops were presented in the AT&T Executive Education and Conference Center and at the Harry Ransom Center with 90 people attending in each of the two venues on the University of Texas at Austin campus. Over 190 people attended the 2-day conference. Presentation slides of the keynote by Eric Miller of Zepheira LLC as well as presentations from the special sessions and the tutorials/workshops are available online at the conference website at The full text of the peer reviewed papers, project reports, extended poster abstracts and poster images are also available. Additional assets from the conference will be added to the online proceedings as they become available over the next few weeks.

Posted at 23:59

Dublin Core Metadata Initiative: Stewardship of LRMI specification transferred to DCMI

2014-10-24, After lengthy deliberations, the leadership of the Learning Resource Metadata Initiative (LRMI) has determined that the stewardship of the LRMI 1.1 specification, as well management of future LRMI development, will be passed to DCMI and its long-standing Education Community. The LRMI specification on which educational properties and classes are based was created by the Association of Educational Publishers (AEP) and Creative Commons (CC) with support from the Bill & Melinda Gates Foundation. The three-phased development cycle of the LRMI 1.1 specification included closing processes for the orderly passing of stewardship to a recognized organization in the metadata sector with commitments to transparency and community involvement. Stewardship of the LRMI specification within DCMI will be a function of a DCMI/LRMI Task Group. The LRMI specification as well as links to the Task Group and community communications channels can be found on the DCMI website at For the full announcement of the transfer, see the AEP press release at

Posted at 23:59

October 23

AKSW Group - University of Leipzig: AKSW internal group meeting @ Dessau

Recently AKSW members were at the city of Dessau for an internal group meeting.

The meeting took place between the 8th and 10th of October at the Bauhaus, the famous modern school of architecture, where we also stayed. The Bauhaus is located in the city of Dessau, about one hour from Leipzig. It operated from 1919 to 1933 and was famous for an approach to design that combined crafts and the fine arts. At that time the German term Bauhaus – literally “house of construction” – was understood as meaning “School of Building”. It seemed to be a perfect getaway and an awesome location for AKSWers to meet and “build” together the future steps of the group.

Wednesday was spent mostly in smaller group discussions on various ongoing projects. Over the next two days, the AKSW PhD students presented the achievements, current status, and future plans of their PhD projects. During the meeting, we had the pleasure of hosting several AKSW leaders and project managers, such as Prof. Dr. Sören Auer, Dr. Jens Lehmann, Prof. Dr. Thomas Riechert, and Dr. Michael Martin. The heads of AKSW gave their input and suggestions to help the students improve, continue, and/or complete their PhDs. In addition, the current projects were discussed to find possible synergies between them and further improvements and ideas.

However, we did find some time to enjoy the beautiful city of Dessau and learn a little more about the history of this wonderful city.

Overall, it was a productive and recreational trip, not only keeping track of each student's progress but also helping them improve their work. We are all thankful to Prof. Dr. Thomas Riechert, who was mainly responsible for organizing this amazing meeting.


Posted at 00:07

October 22

Sebastian Trueg: Conditional Sharing – Virtuoso ACL Groups Revisited

Posted at 20:25

Orri Erling: On Universality and Core Competence

I will here develop some ideas on the platform of Peter Boncz's inaugural lecture mentioned in the previous post. This is a high-level look at where the leading edge of analytics will be, now that the column store is mainstream.

Peter's description of his domain was roughly as follows, summarized from memory:

The new chair is for data analysis and engines for this purpose. The data analysis engine includes the analytical DBMS but is a broader category. For example, the diverse parts of the big data chain (including preprocessing, noise elimination, feature extraction, natural language extraction, graph analytics, and so forth) fall under this category, and most of these things are usually not done in a DBMS. For anything that is big, the main challenge remains one of performance and time to solution. These things are being done, and will increasingly be done, on platforms with heterogeneous features, e.g., CPU/GPU clusters, possibly custom hardware like FPGAs, etc. This is driven by factors of cost and energy efficiency. Different processing stages will sometimes be distributed over a wide area, as for example in instrument networks and any network infrastructure, which is wide area by definition.

The design space of databases and everything around them is huge, and any exhaustive exploration is impossible. Development times are long, and a platform might take ten years to mature. This is ill-suited to academic funding cycles. However, we should not leave all the research in this to industry, as industry maximizes profit, not innovation or absolute performance. Architecting data systems has aspects of an art. Consider the parallel with the architecture of buildings: there are considerations of function, compatibility with the environment, cost, restrictions arising from the materials at hand, and so forth. How a specific design will work cannot be known without experiment. The experiments themselves must be designed to make sense. This is not an exact science with clear-cut procedures and exact metrics of success.

This is the gist of Peter's description of our art. Peter's successes, best exemplified by MonetDB and Vectorwise, arise from focus over a special problem area and from developing and systematically applying specific insights to a specific problem. This process led to the emergence of the column store, which is now a mainstream thing. The DBMS that does not do columns is by now behind the times.

Needless to say, I am a great believer in core competence. Not every core competence is exactly the same. But a core competence needs to be broad enough so that its integral mastery and consistent application can produce a unit of value valuable in itself. What and how broad this is varies a great deal. Typically such a unit of value is something that is behind a "natural interface." This defies exhaustive definition but the examples below may give a hint. Looking at value chains and all diverse things in them that have a price tag may be another guideline.

There is a sort of Hegelian dialectic to technology trends: At the start, it was generally believed that a DBMS would be universal like the operating system itself, with a few products with very similar functionality covering the whole field. The antithesis came with Michael Stonebraker declaring that one size no longer fit all. Since then the transactional (OLTP) and analytical (OLAP) sides are clearly divided. The eventual synthesis may be in the air, with pioneering work like HyPer led by Thomas Neumann of TU München. Peter, following his Humboldt Prize, has spent a couple of days a week in Thomas's group, and I have joined him there a few times. The key to eventually bridging the gap would be compilation and adaptivity. If the workload is compiled on demand, then the right data structures could always be at hand.

This might be the start of a shift similar to the column store turning the DBMS on its side, so to say.

In the mainstream of software engineering, objects, abstractions and interfaces are held to be a value almost in and of themselves. Our science, that of performance, stands in apparent opposition to at least any naive application of the paradigm of objects and interfaces. Interfaces have a cost, and boxes limit transparency into performance. So inlining and merging distinct (in principle) processing phases is necessary for performance. Vectoring is one take on this: An interface that is crossed just a few times is much less harmful than one crossed a billion times. Using compilation, or at least type-and-data-structure-specific variants of operators and switching their application based on run-time observed behaviors, is another aspect of this.
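A toy illustration of the vectoring point (Python, with an assumed toy Scan operator, not any real engine's code): the same data flows through the operator interface either way, but the vectorized form crosses it orders of magnitude fewer times.

```python
class Scan:
    """Toy producer operator exposing both a tuple-at-a-time and a
    vectorized interface, counting how often each is crossed."""
    def __init__(self, data, batch=1024):
        self.data, self.batch, self.pos, self.calls = data, batch, 0, 0

    def next_tuple(self):          # interface crossed once per value
        self.calls += 1
        if self.pos >= len(self.data):
            return None
        value = self.data[self.pos]
        self.pos += 1
        return value

    def next_vector(self):         # interface crossed once per batch
        self.calls += 1
        if self.pos >= len(self.data):
            return None
        chunk = self.data[self.pos:self.pos + self.batch]
        self.pos += len(chunk)
        return chunk

data = list(range(100000))

s = Scan(data)
while s.next_tuple() is not None:
    pass
print(s.calls)   # 100001 crossings: one per value plus the final None

v = Scan(data)
while v.next_vector() is not None:
    pass
print(v.calls)   # 99 crossings: 98 batches of up to 1024 plus the final None
```

The per-call bookkeeping is identical in both variants; only the granularity changes, which is why a vectorized interface can remain an interface without dominating the run time.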

Information systems thus take on more attributes of nature, i.e., more interconnectedness and adaptive behaviors.

Something quite universal might emerge from the highly problem-specific technology of the column store. The big scan, selective hash join plus aggregation, has been explored in slightly different ways by all of HyPer, Vectorwise, and Virtuoso.

Interfaces are not good or bad, in and of themselves. Well-intentioned naïveté in their use is bad. As in nature, there are natural borders in the "technosphere"; declarative query languages, processor instruction sets, and network protocols are good examples. Behind a relatively narrow interface lies a world of complexity of which the unsuspecting have no idea. In biology, the cell membrane might be an analogy, but this is in all likelihood more permeable and diverse in function than the techno examples mentioned.

With the experience of Vectorwise and later Virtuoso, it turns out that vectorization without compilation is good enough for TPC-H. Indeed, I see a few percent of gain at best from further breaking of interfaces and "biology-style" merging of operators and adding inter-stage communication and self-balancing. But TPC-H is not the end of all things, even though it is a sort of rite of passage: Jazz players will do their take on Green Dolphin Street and Summertime.

Science is drawn towards a grand unification of all which is. Nature, on the other hand, discloses more and more diversity and special cases, the closer one looks. This may be true of physical things, but also of abstractions such as software systems or mathematics.

So, let us look at the generalized DBMS, or the data analysis engine, as Peter put it. The use of DBMS technology is hampered by its interface, i.e., declarative query language. The well known counter-reactions to this are the NoSQL, MapReduce, and graph DB memes, which expose lower level interfaces. But then the interface gets put in the whole wrong place, denying most of the things that make the analytics DBMS extremely good at what it does.

We need better and smarter building blocks and interfaces at zero cost. We continue to need blocks of some sort, since algorithms would stop being understandable without any data/procedural abstraction. At run time, the blocks must overlap and interpenetrate: Scan plus hash plus reduction in one loop, for example. Inter-thread, inter-process status sharing for things like top k for faster convergence, for another. Vectorized execution of the same algorithm on many data for things like graph traversals. There are very good single blocks, like GPU graph algorithms, but interface and composability are ever the problem.

So, we must unravel the package that encapsulates the wonders of the analytical DBMS. These consist of scan, hash/index lookup, partitioning, aggregation, expression evaluation, scheduling, message passing and related flow control for scale-out systems, just to mention a few. The complete list would be under 30 long, with blocks parameterized by data payload and specific computation.

By putting these together in a few new ways, we will cover much more of the big data pipeline. Just-in-time compilation may well be the way to deliver these components in an application/environment tailored composition. Yes, keep talking about block diagrams, but never once believe that this represents how things work or ought to work. The algorithms are expressed as distinct things, but at the level of the physical manifestation, things are parallel and interleaved.

The core skill for architecting the future of data analytics is correct discernment of abstraction and interface. What is generic enough to be broadly applicable yet concise enough to be usable? When should the computation move, and when should the data move? What are easy ways of talking about data location? How can the application developer be protected from various inevitable stupidities?

No mistake about it, there are at present very few people with the background for formulating the blueprint for the generalized data pipeline. These will be mostly drawn from architects of DBMS. The prospective user is any present-day user of analytics DBMS, Hadoop, or the like. By and large, SQL has worked well within its area of applicability. If there had never been an anti-SQL rebel faction, SQL would not have been successful. Now that a broader workload definition calls for redefinition of interfaces, so as to use the best where it fits, there is a need for re-evaluation of the imperative vs. declarative question.

T. S. Eliot once wrote that humankind cannot bear very much reality. It seems that we in reality can deconstruct the DBMS and redeploy the state of the art to serve novel purposes across a broader set of problems. This is a cross-over that slightly readjusts the mental frame of the DBMS expert but leaves the core precepts intact. In other words, this is a straightforward extension of core competence with no slide into the dilettantism of doing a little bit of everything.

People like MapReduce and stand-alone graph programming frameworks, because these do one specific thing and are readily understood. By and large, these are orders of magnitude simpler than the DBMS. Even when the DBMS provides in-process Java or CLR, these are rarely used. The single-purpose framework is a much narrower core competence, and thus less exclusive, than the high art of the DBMS, plus it has a faster platform development cycle.

In the short term, we will look at opening the SQL internal toolbox for graph analytics applications. I was discussing this idea with Thomas Neumann at Peter Boncz's party. He asked who would be the user. I answered that doing good parallel algorithms, even with powerful shorthands, was an expert task, so the people doing new types of analytics would be mostly on the system vendor side. However, modifying such algorithms for input selection and statistics gathering would be no harder than doing the same with ready-made SQL reports.

There is significant potential for generalizing the leading edge of database technology. How will this fare against single-model frameworks? We hope to shed some light on this in the final phase of LDBC and beyond.

Posted at 17:23

Orri Erling: Inaugural Lecture of Prof. Boncz at VU Amsterdam

Last Friday, I attended the inaugural lecture of Professor Peter Boncz at the VU University Amsterdam. As the reader is likely to know, Peter is one of the database luminaries of the 21st century, known among other things for architecting MonetDB and Actian Vector (Vectorwise) and publishing a stellar succession of core database papers.

The lecture touched on the fact of the data economy and the possibilities of E-science. Peter proceeded to address issues of ethics of cyberspace and the fact of legal and regulatory practice trailing far behind the factual dynamics of cyberspace. In conclusion, Peter gave some pointers to his research agenda; for example, use of just-in-time compilation for fusing problem-specific logic with infrastructure software like databases for both performance and architecture adaptivity.

There was later a party in Amsterdam with many of the local database people as well as some from further away, e.g., Thomas Neumann of Munich, and Marcin Zukowski, Vectorwise founder and initial CEO.

I should have had the presence of mind to prepare a speech for Peter. Stefan Manegold of CWI did give a short address at the party, while presenting the gifts from Peter's CWI colleagues. To this I will add my belated part here, as follows:

If I were to describe Prof. Boncz, our friend, co-worker, and mentor, in one word, it would be "man of knowledge." If physicists define energy as that which can do work, then knowledge would be that which can do meaningful work. A schematic in itself does nothing; knowledge is needed to bring it to life. Yet this is more than outstanding specialist skill, as it implies discerning the right means in the right context and includes the will and ability to follow through. As Peter now takes on the mantle of professor, the best students will, I am sure, not fail to recognize excellence and be accordingly inspired to strive for the sort of industry-changing accomplishments we have come to associate with Peter's career so far. This is what our world needs. A big cheer for Prof. Boncz!

I did talk to many at the party, especially Pham Minh Duc, who is doing schema-aware RDF in MonetDB, and many others among the excellent team at CWI. Stefan Manegold told me about Rethink Big, an FP7 project for big data policy recommendations. I was meant to be an advisor and still hope to go to one of their meetings for some networking about policy. On the other hand, the EU agenda and priorities, as discussed with, for example, Stefano Bertolo, are, as far as I am concerned, on the right track: The science of performance must meet with the real, or at least realistic, data. Peter did not fail to mention this same truth in his lecture: Spinoffs play a key part in research, and exposure to the world out there gives research both focus and credibility. As René Char put it in his poem L'Allumette (The Matchstick), "La tête seule a pouvoir de prendre feu au contact d'une réalité dure." ("The head alone has power to catch fire at the touch of hard reality.") Great deeds need great challenges, and there is nothing like reality to exceed man's imagination.

For my part, I was advertising the imminent advances in the Virtuoso RDF and graph functionality. Now that the SQL part, which is anyway the necessary foundation for all this, is really very competent, it is time to deploy these same things in slightly new ways. This will produce graph analytics and structure-aware RDF to match relational performance while keeping schema-last-ness. Anyway, the claim has been made; we will see how it is delivered during the final phase of LDBC and Geoknow.

Posted at 17:21

Libby Miller: hostapd debugging

sudo hostapd -dd /etc/hostapd/hostapd.conf

tells you if your config file is broken. Which helps.

Sample hostapd.conf for wpa-password protected access point:

# makes the SSID visible and broadcast
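A fuller version of such a config might look like the sketch below; the interface name, SSID, channel, and passphrase are placeholders to adjust for your own setup:

```
interface=wlan0
driver=nl80211
ssid=MyAccessPoint
hw_mode=g
channel=6
# makes the SSID visible and broadcast (0 = broadcast, 1 = hidden)
ignore_broadcast_ssid=0
# open system authentication, then WPA2-PSK with CCMP (AES)
auth_algs=1
wpa=2
wpa_key_mgmt=WPA-PSK
wpa_passphrase=ChangeThisPassword
wpa_pairwise=CCMP
rsn_pairwise=CCMP
```

Running the `hostapd -dd` command above against this file will report any option it cannot parse.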

Posted at 16:50

October 21

W3C Data Activity: The Importance of Use Cases Documents

The Data on the Web Best Practices WG is among those who will be meeting at this year’s TPAC in Santa Clara. As well as a chance for working group members to meet and make good progress, it’s a great … Continue reading

Posted at 11:22

Copyright of the postings is owned by the original blog authors. Contact us.