Planet RDF

It's triples all the way down

August 24

AKSW Group - University of Leipzig: AKSW Colloquium, 24 August, 3pm, Lucene, SOLR, FEASIBLE

In this colloquium, first, Kay Müller and Nilesh Chakraborty will give an overview of Lucene and SOLR and how they can be used in information retrieval scenarios, along with a short peek into ElasticSearch, introducing a few recent features, their respective use-cases and how they compare with SOLR.

Then, Muhammad Saleem will present his paper “FEASIBLE: A Featured-Based SPARQL Benchmark Generation Framework”.

Abstract: Benchmarking is indispensable when aiming to assess technologies with respect to their suitability for given tasks. While several benchmarks and benchmark generation frameworks have been developed to evaluate triple stores, they mostly provide a one-fits-all solution to the benchmarking problem. This approach to benchmarking is however unsuitable to evaluate the performance of a triple store for a given application with particular requirements. We address this drawback by presenting FEASIBLE, an automatic approach for the generation of benchmarks out of the query history of applications, i.e., query logs. The generation is achieved by selecting prototypical queries of a user-defined size from the input set of queries. We evaluate our approach on two query logs and show that the benchmarks it generates are accurate approximations of the input query logs. Moreover, we compare four different triple stores with benchmarks generated using our approach and show that they behave differently based on the data they contain and the types of queries posed. Our results suggest that FEASIBLE generates better sample queries than the state of the art. In addition, the better query selection and the larger set of query types used lead to triple store rankings which partly differ from the rankings generated by previous works.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:40

Leigh Dodds: Data and information in the city

For a while now I’ve been in the habit of looking for data as I travel to work or around Bath. You can’t really work with data and information systems for any length of time without becoming a little bit obsessive about numbers or becoming tuned into interesting little dashboards:

Posted at 10:49

August 22

Bob DuCharme: Querying machine learning movie ratings data with SPARQL

Well, movie ratings data popular with machine learning people.

Posted at 15:10

August 19

Dublin Core Metadata Initiative: DCMI Governing Board announces 2015 revisions to Bylaws

2015-08-19, The DCMI Governing Board announces its GB2015-2 decision to revise the DCMI Bylaws. The major focus of the revisions is on the refactoring of roles of the Advisory Board and Directorate with regard to DCMI conferences, meetings, educational programming and Initiative outreach. The revisions are part of the ongoing fine-tuning of the Bylaws following the major restructuring of DCMI governance in 2014. The revised Bylaws can be found at

Posted at 23:59

Dublin Core Metadata Initiative: DCMI RDF Application Profile Task Group completes deliverables

2015-08-19, The DCMI RDF-AP aims at defining best practices for documenting application profiles, requests for handling RDF application profiles, and for RDF constraints specification and validation. The first deliverable, Report on Use Cases, reports on the case studies collected in the Task Force, their use cases and their validation requirements. The second deliverable, Report on Validation Requirements, supplements the Report on Use Cases from which the requirements were derived. The full descriptions of case studies and use cases can be found in the task group wiki. Case studies and the corresponding use cases are collected in the DCMI RDF-AP database (see DCMI RDF Application Profiles database on case studies, use cases, requirements, and solutions).

Posted at 23:59

August 18

AKSW Group - University of Leipzig: AKSW has 10 accepted publications in Semantics’2015 conference

The Semantics‘2015 conference will take place in Vienna, Austria on September 15-17, 2015. AKSW is happy to announce 10 accepted publications: 6 for research track and 4 for demos/posters track. Semantics’2015 has received 93 submissions for research track, out of which 24 were accepted (an acceptance rate of 26%).

For the research track our accepted papers are:

For the poster/demo track:

See you at the conference in September!

Posted at 11:37

August 17

AKSW Group - University of Leipzig: AKSW Colloquium, 17-08-2015, Relational Machine Learning for Knowledge Graphs, ZooKeeper, Apache Mesos

Relational Machine Learning for Knowledge Graphs

In this colloquium Patrick Westphal will present the paper “A Review of Relational Machine Learning for Knowledge Graphs” by Nickel, Murphy, Tresp and Gabrilovich.

Abstract: Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.

ZooKeeper, Apache Mesos

A short introduction to the coordination system ‘ZooKeeper’ and cluster manager ‘Apache Mesos’ will be presented by Hajira.
Mesos provides efficient resource isolation and sharing across distributed applications and frameworks. It can run a multitude of applications on a dynamically shared pool of resources coordinated by ZooKeeper.

Posted at 08:58

August 15

Libby Miller: AWS Janus Gateway strangeness

Can anyone think of a reason why this might happen? (janus gateway is a webRTC gateway –

Posted at 12:36

August 12

AKSW Group - University of Leipzig: 18 AKSW Papers at ISWC 2015

We are very pleased to announce that AKSW will be presenting 18 papers at the upcoming International Semantic Web Conference (ISWC) 2015. Following are the details:

Research track:

  1. TITLE: FEASIBLE: A Featured-Based SPARQL Benchmarks Generation Framework
    AUTHORS: Muhammad Saleem, Qaiser Mehmood, Axel-Cyrille Ngonga Ngomo
  2. TITLE: LANCE: Piercing to the Heart of Instance Matching Tools
    AUTHORS: Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, Axel-Cyrille Ngonga Ngomo

  In-Use and Software track

  1. TITLE: ASSESS —Automatic Self-Assessment Using Linked Data
    AUTHORS: Lorenz Bühmann, Ricardo Usbeck,  Axel-Cyrille Ngonga Ngomo 
  2. TITLE: Assessing and Refining Mappings to RDF to Improve Data Quality
    AUTHORS: Anastasia Dimou, Dimitris Kontokostas, Markus Freudenberg, Ruben Verborgh, Jens Lehmann, Erik Mannens, Sebastian Hellmann, Rik Van de Walle

  Data Sets and Ontologies

  1. TITLE: LSQ: Linked SPARQL Queries Dataset
    AUTHORS: Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, Axel-Cyrille Ngonga Ngomo
  2. TITLE: DBpedia Commons: Structured Multimedia Metadata for Wikimedia Commons
    AUTHORS: Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann, Sebastian Hellmann

  Posters and Demos

  1. TITLE: Enhancing Dataset Quality Using Keys          [demo]
    AUTHORS: Tommaso Soru, Edgard Marx, Axel-Cyrille Ngonga Ngomo
  2. TITLE: The DBpedia Events Dataset          [demo]
    AUTHORS: Magnus Knuth, Jens Lehmann, Dimitris Kontokostas, Thomas Steiner, Harald Sack
  3. TITLE: Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality
    AUTHORS: Anastasia Dimou, Dimitris Kontokostas, Markus Freudenberg, Ruben Verborgh, Jens Lehmann, Erik Mannens, Sebastian Hellmann, Rik Van de Walle
  4. TITLE: Interoperable Machine Learning Metadata using MEX          [demo]
    AUTHORS: Diego Esteves, Diego Moussallem, Jens Lehmann, Maria Claudia, J.Cesar Duarte
  5. TITLE: Automatic SPARQL Benchmark Generation Using FEASIBLE          [demo]
    AUTHORS: Muhammad Saleem, Qaiser Mehmood, Axel-Cyrille Ngonga Ngomo
  6. TITLE: The LSQ Dataset: Querying for Queries          [demo]
    AUTHORS: Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, Axel-Cyrille Ngonga Ngomo
  7. TITLE: openQA: An Open-Source Question Answering Framework          [demo]
    AUTHORS: Edgard Marx, Axel-Cyrille Ngonga Ngomo
  8. TITLE: LANCE: A Generic Benchmark Generator for Linked Data
    AUTHORS: Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, Axel-Cyrille Ngonga Ngomo
  9. TITLE: SPARQL Query Formulation and Execution using FedViz          [demo]
    AUTHORS: Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, Stefan Decker


  1. TITLE: FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
    AUTHORS: Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, Stefan Decker
  2. TITLE: Answering Boolean Hybrid Questions with HAWK
    AUTHORS: Ricardo Usbeck, Erik Körner,  Axel-Cyrille Ngonga Ngomo


TITLE: Federated Query Processing over Linked Data
AUTHORS: Muhammad Saleem, Muhammad Intizar Ali, Ruben Verborgh, Axel-Cyrille Ngonga Ngomo


Come over to ISWC 2015 and enjoy the talks.

Best Regards,
Saleem on behalf of AKSW

Posted at 01:16

August 11

Orri Erling: DBpedia Usage Report, August 2015

We recently published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.10 (sometimes called "DBpedia 2014"; released September, 2014).

The new report has usage data through July 31, 2015, and brought a few surprises to our eyes. What do you think?

Posted at 16:58

August 06

AKSW Group - University of Leipzig: Michael Röder at AKSW Colloquium, Monday, 10th August 2015, 3pm

Crawling the Linked Open Data Cloud by Michael Röder

In recent years the Linked Open Data Cloud has matured to support various applications, e.g., question answering, knowledge inference or database enrichment. Such applications have to be supported by recent data, which is impossible to retrieve without constant crawling, due to the dynamic nature of the World Wide Web and the cloud as a part of it. In this colloquium Michael is going to talk about the state of the art for Linked Data Crawling as well as the future work that will be done by Ivan Ermilov, Axel-Cyrille Ngonga Ngomo and himself.

Posted at 12:17

August 05

Libby Miller: HackspaceHat part 2: Streaming to a remote server

We’ve made more progress on the HackspaceHat (HackspaceHat is a telepresence hat for exploring Hackspaces).

Posted at 19:51

August 03

AKSW Group - University of Leipzig: Hajira Jabeen and Ricardo Usbeck at AKSW Colloquium, Monday, 3rd August 2015, 3pm

Hybrid Question Answering at QALD 5 challenge by Ricardo Usbeck

The plethora of datasets on the web, both structured and unstructured, enables answering complex questions such as “Which anti-apartheid activist was born in Mvezo?” Some of those hybrid (source) question answering systems have been benchmarked at the QALD 5 challenge at the CLEF conference. Ricardo is going to present some of the results and give future research directions.


BDE, Hadoop MapR and HDFS by Hajira Jabeen

Hajira will present a brief introduction to the BigData Europe project (BDE), followed by Hadoop HDFS and MapReduce for distributed processing of large data sets on compute clusters of commodity hardware. Hadoop is one of the many Big Data components being used in the BDE project.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:13

Semantic Web Company (Austria): How the PoolParty Semantic Suite is learning to speak 40+ languages

Business is becoming more and more globalised, and enterprises and organisations are acting in several different regions and thus facing more challenges of different cultural aspects as well as respective language barriers. Looking at the European market, we even see 24 working languages in EU28, which make cross-border services considerably complicated. As a result, powerful language technology is needed, and intense efforts have already been taken in the EU to deal with this situation and enable the vision of a multilingual digital single market (a priority area of the European Commission this year, see:


Here at the Semantic Web Company we also witness fast-growing demands for language-independent and/or specific-language and cross-language solutions to enable business cases like cross-lingual search or multilingual data management approaches. To provide such solutions, a multilingual metadata and data management approach is needed, and this is where PoolParty Semantic Suite comes into the game: as PoolParty follows W3C semantic web standards like SKOS, we have language-independent-based technologies in place and our customers already benefit from them. However, as regards text analysis and text extraction, the ability to process multilingual information and data is key for success – which means that the systems need to speak as many languages as possible.

Our new cooperation with K Dictionaries (KD) is enabling the PoolParty Semantic Suite to continuously “learn to speak” more and more languages, by making use of KD’s rich monolingual, bilingual and multilingual content and its long-time experience in lexicography as a base for improved multi-language text analysis and processing.

KD is a technology-oriented content and data creator that is based in Tel Aviv and cooperates with publishing partners, ICT firms, the academe and professional associations worldwide. It deals with nearly 50 languages, offering quality monolingual, bilingual and multilingual lexical datasets, morphological word forms, phonetic transcription, etc.

As a result of this cooperation, PoolParty now provides language bundles in the following languages, which can be licensed together with all types of PoolParty servers:

  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Russian
  • Slovak
  • Spanish

Additional language bundles are in preparation and will be in place soon!

Furthermore, SWC and KD are partners in a brand new EUREKA project that is supported by a bilateral technology/innovation program between Austria and Israel. The project is called LDL4HELTA (Linked Data Lexicography for High-End Language Technology Application) and combines lexicography and Language Technology with Semantic Web and Linked (Open) Data mechanisms and technologies to improve existing and develop new products and services. It integrates the products of both partners to better serve existing customers and new ones, as well as to enter together new markets in the field of Linked Data lexicography-based Language Technology solutions. This project has been successfully kicked off in early July and has a duration of 24 months, with the first concrete results due early in 2016.

The LDL4HELTA project is supported by a research partner (Austrian Academy of Sciences) and an expert Advisory Board including  Prof Christian Chiarcos (Goethe University, Frankfurt), Mr Orri Erling (OpenLink Software), Dr Sebastian Hellmann (Leipzig University), Prof Alon Itai (Technion, Haifa), and Ms Eveline Wandl-Wogt (Austrian Academy of Sciences).

So stay tuned and we will inform you about news and activities of this cooperation here in the blog continuously!

Posted at 09:49

July 29

Dublin Core Metadata Initiative: Final Program announced for DC-2015 in São Paulo, Brazil

2015-07-29, São Paulo State University (UNESP) and the Conference Committee of DC-2015 in São Paulo, Brazil on 1-4 September have published the final program of the DCMI International Conference at Join us in São Paulo for an exciting agenda including papers, project reports and best practice posters and presentations. Parallel with the peer reviewed program is an array of special sessions of panels and discussions on key metadata issues, challenges and new opportunities. Pre- and post-conference Professional Program workshops round out the program by providing full-day instruction. Every year the DCMI community gathers for both its Annual Meeting and its International Conference on Dublin Core & Metadata Applications. The work agenda of the DCMI community is broad and inclusive of all aspects of innovation in metadata design, implementation and best practices. While the work of the Initiative progresses throughout the year, the Annual Meeting and Conference provide the opportunity for DCMI "citizens" as well as students and early career professionals studying and practicing the dark arts of metadata to gather face-to-face to share experiences. In addition, the gathering provides public- and private-sector initiatives beyond DCMI engaged in significant metadata work to come together to compare notes and cast a broader light into their particular metadata domain silos. Through such a gathering of the metadata "clans", DCMI advances its "first goal" of promoting metadata interoperability and harmonization. Visit the DC-2015 conference website at for additional information and to register.

Posted at 23:59

Dublin Core Metadata Initiative: Japanese translation of "Guidelines for Dublin Core Application Profiles" published

2015-07-29, DCMI is pleased to announce that the National Diet Library of Japan has translated "Guidelines for Dublin Core Application Profiles", a DCMI Recommended Resource. The link to the new Japanese translation is available on the DCMI Documents Translation page at

Posted at 23:59

July 28

Libby Miller: HackspaceHat part 1: WebRTC, Janus and Gstreamer

Posted at 08:09

July 27

Tetherless World Constellation group RPI: Data and Semantics — Topics of Interest at ESIP 2015 Summer Meeting

The ESIP 2015 Summer Meeting was held at Pacific Grove, CA in the week of July 14-17. Pacific Grove is such a beautiful place, with its coastline, sandy beach and sunset. What excited me more were the science and technical topics covered in the meeting sessions, as well as the opportunity to catch up with friends in the ESIP community. Excellent topics + a scenic place + friends = a wonderful meeting. Thanks a lot to the meeting organizers!

The theme of this summer meeting is “The Federation of Earth Science Information Partners & Community Resilience: Coming Together.” Though my focus was on sessions relevant to the Semantic Web and data stewardship, I was able to see the topic ‘resilience’ in various presented works. It was nice to see that the ESIP community has an ontology portal. It implements the Bio Portal infrastructure and focuses on collecting ontologies and vocabularies in the field of Earth sciences. With more submissions from the community in the future, the portal has great potential for geo-semantics research, similar to what the Bio Portal does for bioinformatics. An important topic was reviewing progress and discussing directions for the future. Prof. Peter Fox from RPI offered a short overview. The ESIP Semantic Web cluster is nine years old, and it is nice to see that the cluster has helped improve the visibility of semantic web methods and technologies in the grand field of geoinformatics. A key feature supporting the success of the Semantic Web is that it is an open world and it evolves and updates.

There were several topics or projects of interest that I recorded during the meeting:

(1) It recently released version 2.0 and introduced a new mechanism for extension. There are now two types of extensions: reviewed/hosted extensions and external extensions. The former (e1) gets its own chunk of namespace: All items in that extension are created and maintained by their own creators. The latter allows a third party to create extensions specific to an application. Extensions to location and time might be a topic for the Earth science community in the near future.

(2) GCIS Ontology: GCIS is such a nice project; it has incorporated several state-of-the-art Semantic Web methods and technologies. The provenance representation in GCIS means it is not just a static knowledge representation. It is more about what the facts are, what people believe, and why. In the ontology engineering for GCIS we also see the collaboration between geoscientists and computer scientists. That is, the conceptual model came first, as a product that geoscientists can understand, before it was bound to logic and ontology encoding grammar. The process can be seen as within the scope of semiology. We can do a good job with syntax and semantics, but very often we struggle with the pragmatics.

(3) PROV-ES: Provenance of scientific findings is receiving increasing attention. The Earth science community has taken a lead in capturing provenance. The World Wide Web Consortium (W3C) PROV standard provides a platform for the Earth science community to adopt and extend. The Provenance – Earth Science (PROV-ES) Working Group was initiated in 2013; it primarily focused on extending the PROV standard and tested the outputs with sample projects. In the PROV-ES hackathon at the summer meeting, Hook Hua and Gerald Manipon showed more technical details of PROV-ES, especially about its encodings, discovery, and visualization.

(4) Entity linking: Jin Guang Zheng and I had a poster about our ESIP 2014 Test bed project. The topic is linking entity mentions in documents and datasets to entities in the Web of Data. Entity recognition and linking is valuable when working with datasets collected from multiple sources. Detecting and linking entity mentions in datasets can be facilitated by using knowledge bases on the Web, such as ontologies and vocabularies. In this work we built a web-based entity linking and wikification service for datasets. Our current demo system uses DBpedia as the knowledge base, and we have been collecting geoscience ontologies and vocabularies. A potential future collaboration is to use the ESIP ontology portal as the knowledge base. Discussion with colleagues during the poster session showed that this work may also be beneficial to work on dark data, such as pattern recognition and knowledge discovery from legacy literature.

(5) Big Earth Data Initiative: This is an inter-agency coordination effort for geo-data interoperability in the US. I will copy and paste a part of the original session description to show the detailed relationships between a few entities and organizations that were mentioned: ‘The US Group on Earth Observations (USGEO) Data Management Working Group (DMWG) is an inter-agency body established under the auspices of the White House National Science and Technology Council (NSTC). DMWG members have been drafting an “Earth Observations Common Framework” (EOCF) with recommended approaches for supporting and improving discoverability, accessibility, and usability for federally held earth observation data. The recommendations will guide work done under the Big Earth Data Initiative (BEDI), which provided funding to some agencies for improving those data attributes.’ It will be nice to see more outputs from this effort and compare the work with similar efforts in Europe such as INSPIRE, as well as the global initiative GEOSS.

Posted at 17:13

July 22

AKSW Group - University of Leipzig: DL-Learner 1.1 (Supervised Structured Machine Learning Framework) Released

Dear all,

we are happy to announce DL-Learner 1.1.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

GitHub page:

DL-Learner is used for data analysis in other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern based and evolutionary techniques for learning on structured data. For a practical example, see It also offers a plugin for Protege, which can give suggestions for axioms to add. DL-Learner is part of the Linked Data Stack – a repository for Linked Data management tools.

In the current release, we improved the support for SPARQL endpoints as knowledge sources. You can now directly use a SPARQL endpoint for learning without an OWL reasoner on top of it. Moreover, we extended DL-Learner to also consider dates and inverse properties for learning. Further efforts were made to improve our Query Tree Learning algorithms (those are used to learn SPARQL queries rather than OWL class expressions).

We want to thank everyone who helped to create this release, in particular Robert Höhndorf and Giuseppe Rizzo. We also acknowledge support by the recently started SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the GeoKnow and Big Data Europe projects where it is part of the respective platforms.

Kind regards,

Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin

Posted at 14:14

July 16

AKSW Group - University of Leipzig: AKSW Colloquium, 20-07-2015, Enterprise Linked Data Networks

Enterprise Linked Data Networks (PhD progress report) by Marvin Frommhold

The topic of the thesis is the scientific utilization of the LUCID research project, in particular the LUCID Endpoint Prototype. In LUCID we research and develop Linked Data technologies in order to allow partners in supply chains to describe their work, their companies and their products for other participants. This allows for building distributed networks of supply chain partners on the Web without a centralized infrastructure.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 09:20

July 15

Dublin Core Metadata Initiative: Ex Libris and Elsevier join as DC-2015 sponsors

2015-07-15, Conference host UNESP and DCMI are pleased to announce that Ex Libris and Elsevier are now among the sponsors of DC-2015 in São Paulo, Brazil, 1-4 September 2015. Elsevier is a world-leading provider of scientific, technical and medical information products and services and a world-leading provider of information solutions that enhance the performance of science, health, and technology professionals, empowering them to make better decisions. Ex Libris is a leading provider of library automation solutions, offering the only comprehensive product suite for the discovery, management, and distribution of all materials--print, electronic, and digital. For information about how your organization can become a DC-2015 sponsor, see

Posted at 23:59

Dublin Core Metadata Initiative: DC-2015 Early Bird registration closes 31 July 2015!

2015-07-15, Early Bird registration for DC-2015 in São Paulo, Brazil closes on 31 July 2015. In addition to Keynote Speakers Paul Walk of EDINA and Ana Alice Baptista of the University of Minho, there is a full Technical Program of peer-reviewed papers, project reports and posters, as well as a Professional Program of full-day Workshops and Conference Special Sessions. For more about the conference, visit the conference website at

Posted at 23:59

Orri Erling: Big Data, Part 2: Virtuoso Meets Impala

In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8 EC2 instances. We get a single user win for Virtuoso by a factor of 136, and a five user win by a factor of 55. The details and analysis follow.

The load setup is the same as ever, with copying from CSV files attached as external tables into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers. We suppose the authors know what works best.

The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the same plan knows the table cardinality full well. The settings for scanner threads and cores to use (in impala-shell) can be changed, but the behavior does not seem to change.

Following are the run times for one query stream.

Query   Virtuoso    Impala      Notes
        332 s       841 s       Data Load
Q1      1.098 s     164.61 s
Q2      0.187 s     24.19 s
Q3      0.761 s     105.70 s
Q4      0.205 s     179.67 s
Q5      0.808 s     84.51 s
Q6      2.403 s     4.43 s
Q7      0.59 s      270.88 s
Q8      0.775 s     51.89 s
Q9      1.836 s     177.72 s
Q10     3.165 s     39.85 s
Q11     1.37 s      22.56 s
Q12     0.356 s     17.03 s
Q13     2.233 s     103.67 s
Q14     0.488 s     10.86 s
Q15     0.72 s      11.49 s
Q16     0.814 s     23.93 s
Q17     0.681 s     276.06 s
Q18     1.324 s     267.13 s
Q19     0.417 s     368.80 s
Q20     0.792 s     60.45 s
Q21     0.720 s     418.09 s
Q22     0.155 s     40.59 s
Total   20 s        2724 s
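
As a rough illustration of how uneven the gap is across queries, the per-query ratios can be computed from the single-stream figures above. This is only a sketch: the dictionary and variable names are made up for this example, and the numbers are copied from the table.

```python
# Single-stream run times in seconds, copied from the table above.
# Each value is (Virtuoso, Impala). All names here are illustrative only.
single_stream = {
    "Q1": (1.098, 164.61), "Q2": (0.187, 24.19), "Q3": (0.761, 105.70),
    "Q4": (0.205, 179.67), "Q5": (0.808, 84.51), "Q6": (2.403, 4.43),
    "Q7": (0.59, 270.88), "Q8": (0.775, 51.89), "Q9": (1.836, 177.72),
    "Q10": (3.165, 39.85), "Q11": (1.37, 22.56), "Q12": (0.356, 17.03),
    "Q13": (2.233, 103.67), "Q14": (0.488, 10.86), "Q15": (0.72, 11.49),
    "Q16": (0.814, 23.93), "Q17": (0.681, 276.06), "Q18": (1.324, 267.13),
    "Q19": (0.417, 368.80), "Q20": (0.792, 60.45), "Q21": (0.720, 418.09),
    "Q22": (0.155, 40.59),
}

# Ratio of Impala time to Virtuoso time for each query.
speedup = {q: impala / virtuoso for q, (virtuoso, impala) in single_stream.items()}

# Queries with the largest and smallest gap between the two systems.
largest_gap = max(speedup, key=speedup.get)
smallest_gap = min(speedup, key=speedup.get)

for query, factor in speedup.items():
    print(f"{query}: {factor:.1f}x")
print("largest gap:", largest_gap, "| smallest gap:", smallest_gap)
```

With these figures the ratio ranges from under 2x on Q6 to several hundred on Q19 and Q4, so the overall factor hides a wide spread.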

Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. We show the average execution time for each query. We then compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests in 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel. According to Impala documentation, there is no random delete operation, so the refreshes cannot be implemented.

Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem. This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top, the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.
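
The throughput gain from concurrency in this baseline can be put in numbers. A minimal sketch, using only the timings quoted above (variable names are illustrative):

```python
# Figures from the text: one COUNT(*) takes 20 s when run alone; with five
# parallel sessions the slowest one finishes after 69 s.
single_run_s = 20
sessions = 5
slowest_parallel_s = 69

# Running the five scans back to back would take 5 * 20 = 100 s.
sequential_s = single_run_s * sessions

# Gain in throughput from running them concurrently instead.
throughput_gain = sequential_s / slowest_parallel_s
print(f"sequential: {sequential_s} s, parallel: {slowest_parallel_s} s, "
      f"gain: {throughput_gain:.2f}x")
```

So roughly 5x the CPU buys less than 1.5x the throughput on this scan.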

Following are the average times for each query in the 5 stream experiment.

Query                     Virtuoso    Impala      Notes
Q1                        1.95 s      191.81 s
Q2                        0.70 s      40.40 s
Q3                        2.01 s      95.67 s
Q4                        0.71 s      345.11 s
Q5                        2.93 s      112.29 s
Q6                        4.76 s      14.41 s
Q7                        2.08 s      329.25 s
Q8                        3.00 s      98.91 s
Q9                        5.58 s      250.88 s
Q10                       8.23 s      55.23 s
Q11                       4.26 s      27.84 s
Q12                       1.74 s      37.66 s
Q13                       6.07 s      147.69 s
Q14                       1.73 s      23.91 s
Q15                       2.27 s      23.79 s
Q16                       2.41 s      34.76 s
Q17                       3.92 s      362.43 s
Q18                       3.02 s      348.08 s
Q19                       2.27 s      443.94 s
Q20                       3.05 s      92.50 s
Q21                       2.00 s      623.69 s
Q22                       0.37 s      61.36 s
Total for slowest stream  67 s        3740 s
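
The two headline factors quoted at the start of the article follow from the reported totals of the two tables. A small sketch of the arithmetic (variable names are illustrative; the totals are the rounded ones printed in the tables):

```python
# Reported totals in seconds, taken from the two tables.
virtuoso_single_s, impala_single_s = 20, 2724   # one query stream
virtuoso_five_s, impala_five_s = 67, 3740       # slowest of five streams

single_user_factor = impala_single_s / virtuoso_single_s
five_user_factor = impala_five_s / virtuoso_five_s

print(f"single user: {single_user_factor:.1f}x")   # about 136
print(f"five users:  {five_user_factor:.1f}x")     # about 55.8
```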

There are 4 queries in Impala that terminated with an error (memory limit exceeded). These were two Q21s, one Q19, one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, do a hash build side of half of lineitem, which explains running out of memory. Virtuoso does Q21 mostly by index.

Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I/O, so we seem to be running from memory, as intended.

We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.

However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.

Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query for the single stream run. contains the evidence for the single-stream run; holds the 5-stream run.

To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.
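
The 8-hour estimate for Virtuoso is a straight multiplication of the factors named above. A back-of-the-envelope sketch, with illustrative variable names:

```python
# Figures from the text: the 100G experiment takes about 8 minutes in total;
# 3T is 30x the data, and we allow another 2x for I/O.
baseline_minutes = 8
scale_factor = 30
io_factor = 2

estimated_minutes = baseline_minutes * scale_factor * io_factor
estimated_hours = estimated_minutes / 60
print(f"estimated run time: {estimated_hours:.0f} hours")
```

This assumes run time grows linearly with data size, which is the same simplification the estimate in the text makes.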

In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.

Posted at 20:12

Copyright of the postings is owned by the original blog authors. Contact us.