Planet RDF

It's triples all the way down

August 03

AKSW Group - University of Leipzig: Hajira Jabeen and Ricardo Usbeck at AKSW Colloquium, Monday, 3rd August 2015, 3pm

Hybrid Question Answering at QALD 5 challenge by Ricardo Usbeck

The plethora of datasets on the web, both structured and unstructured, enables answering complex questions such as “Which anti-apartheid activist was born in Mvezo?” Some of these hybrid (source) question answering systems have been benchmarked at the QALD 5 challenge at the CLEF conference. Ricardo is going to present some of the results and give future research directions.

Slides: https://docs.google.com/presentation/d/1dccMwbPMIeOpzvV1PNCKKxg96xZBSdK2Gav9JynAJjo/edit?usp=sharing

BDE, Hadoop MapR and HDFS by Hajira Jabeen

Hajira will present a brief introduction to the BigData Europe project (BDE), followed by an overview of Hadoop HDFS and MapReduce for distributed processing of large data sets on clusters of commodity hardware. Hadoop is one of the many Big Data components being used in the BDE project.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:13

Semantic Web Company (Austria): How the PoolParty Semantic Suite is learning to speak 40+ languages

Business is becoming more and more globalised, and enterprises and organisations are acting in several different regions, thus facing challenges from different cultural contexts as well as the respective language barriers. Looking at the European market, we see as many as 24 working languages in the EU28, which makes cross-border services considerably more complicated. As a result, powerful language technology is needed, and intense efforts have already been made in the EU to deal with this situation and enable the vision of a multilingual digital single market (a priority area of the European Commission this year, see: http://ec.europa.eu/priorities/digital-single-market/).


Here at the Semantic Web Company we also see fast-growing demand for language-independent, language-specific and cross-language solutions to enable business cases like cross-lingual search or multilingual data management. To provide such solutions, a multilingual metadata and data management approach is needed, and this is where the PoolParty Semantic Suite comes into play: because PoolParty follows W3C Semantic Web standards like SKOS, we have language-independent technologies in place and our customers already benefit from them. However, as regards text analysis and text extraction, the ability to process multilingual information and data is key to success – which means that the systems need to speak as many languages as possible.

Our new cooperation with K Dictionaries (KD) is enabling the PoolParty Semantic Suite to continuously “learn to speak” more and more languages, by making use of KD’s rich monolingual, bilingual and multilingual content and its long-time experience in lexicography as a base for improved multi-language text analysis and processing.

KD (http://kdictionaries.com/, http://kdictionaries-online.com/) is a technology-oriented content and data creator based in Tel Aviv that cooperates with publishing partners, ICT firms, academia and professional associations worldwide. It deals with nearly 50 languages, offering quality monolingual, bilingual and multilingual lexical datasets, morphological word forms, phonetic transcription, etc.

As a result of this cooperation, PoolParty now provides language bundles in the following languages, which can be licensed together with all types of PoolParty servers:

  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Russian
  • Slovak
  • Spanish

Additional language bundles are in preparation and will be in place soon!

Furthermore, SWC and KD are partners in a brand new EUREKA project that is supported by a bilateral technology/innovation program between Austria and Israel. The project is called LDL4HELTA (Linked Data Lexicography for High-End Language Technology Application) and combines lexicography and Language Technology with Semantic Web and Linked (Open) Data mechanisms and technologies to improve existing products and services and to develop new ones. It integrates the products of both partners to better serve existing and new customers, as well as to jointly enter new markets in the field of Linked Data lexicography-based Language Technology solutions. The project was successfully kicked off in early July and has a duration of 24 months, with the first concrete results due early in 2016.

The LDL4HELTA project is supported by a research partner (Austrian Academy of Sciences) and an expert Advisory Board including  Prof Christian Chiarcos (Goethe University, Frankfurt), Mr Orri Erling (OpenLink Software), Dr Sebastian Hellmann (Leipzig University), Prof Alon Itai (Technion, Haifa), and Ms Eveline Wandl-Wogt (Austrian Academy of Sciences).

So stay tuned – we will keep you informed here on the blog about news and activities from this cooperation!

Posted at 09:49

July 29

Dublin Core Metadata Initiative: Final Program announced for DC-2015 in São Paulo, Brazil

2015-07-29, São Paulo State University (UNESP) and the Conference Committee of DC-2015 in São Paulo, Brazil on 1-4 September have published the final program of the DCMI International Conference at http://dcevents.dublincore.org/IntConf/index/pages/view/schedule-15. Join us in São Paulo for an exciting agenda including papers, project reports and best practice posters and presentations. Parallel with the peer reviewed program is an array of special sessions of panels and discussions on key metadata issues, challenges and new opportunities. Pre- and post-conference Professional Program workshops round out the program by providing full-day instruction. Every year the DCMI community gathers for both its Annual Meeting and its International Conference on Dublin Core & Metadata Applications. The work agenda of the DCMI community is broad and inclusive of all aspects of innovation in metadata design, implementation and best practices. While the work of the Initiative progresses throughout the year, the Annual Meeting and Conference provide the opportunity for DCMI "citizens" as well as students and early career professionals studying and practicing the dark arts of metadata to gather face-to-face to share experiences. In addition, the gathering provides public- and private-sector initiatives beyond DCMI engaged in significant metadata work to come together to compare notes and cast a broader light into their particular metadata domain silos. Through such a gathering of the metadata "clans", DCMI advances its "first goal" of promoting metadata interoperability and harmonization. Visit the DC-2015 conference website at http://purl.org/dcevents/dc-2015 for additional information and to register.

Posted at 23:59

Dublin Core Metadata Initiative: Japanese translation of "Guidelines for Dublin Core Application Profiles" published

2015-07-29, DCMI is pleased to announce that the National Diet Library of Japan has translated "Guidelines for Dublin Core Application Profiles", a DCMI Recommended Resource. The link to the new Japanese translation is available on the DCMI Documents Translation page at http://dublincore.org/resources/translations/index.shtml.

Posted at 23:59

July 28

Libby Miller: HackspaceHat part 1: WebRTC, Janus and Gstreamer

Posted at 08:09

July 27

Tetherless World Constellation group RPI: Data and Semantics — Topics of Interest at ESIP 2015 Summer Meeting

The ESIP 2015 Summer Meeting was held at Pacific Grove, CA in the week of July 14-17. Pacific Grove is a beautiful place, with its coastline, sandy beaches and sunsets. What excited me even more were the scientific and technical topics covered in the meeting sessions, as well as the opportunity to catch up with friends in the ESIP community. Excellent topics + a scenic place + friends = a wonderful meeting. Thanks a lot to the meeting organizers!

The theme of this summer meeting was “The Federation of Earth Science Information Partners & Community Resilience: Coming Together.” Though my focus was on sessions relevant to the Semantic Web and data stewardship, I could see the topic of ‘resilience’ in many of the presented works. It was nice to see that the ESIP community has an ontology portal. It is built on the BioPortal infrastructure and focuses on collecting ontologies and vocabularies in the field of Earth sciences. With more submissions from the community in the future, the portal has great potential for geo-semantics research, similar to what BioPortal does for bioinformatics. An important topic was reviewing progress and discussing directions for the future; Prof. Peter Fox from RPI offered a short overview. The ESIP Semantic Web cluster is nine years old, and it is nice to see that the cluster has helped improve the visibility of Semantic Web methods and technologies in the broader field of geoinformatics. A key feature supporting the success of the Semantic Web is that it is an open world that evolves and updates.

There were several topics or projects of interest that I recorded during the meeting:

(1) schema.org: It recently released version 2.0 and introduced a new mechanism for extension. There are now two types of extensions: reviewed/hosted extensions and external extensions. A reviewed/hosted extension (e.g., e1) gets its own chunk of the schema.org namespace, e1.schema.org, and all items in that extension are created and maintained by their own creators. An external extension is created by a third party and is specific to an application. Extensions for location and time might be a topic for the Earth science community in the near future.

(2) GCIS Ontology: GCIS is a nice project in that it incorporates several state-of-the-art Semantic Web methods and technologies. The provenance representation in GCIS means it is not just a static knowledge representation; it is more about what the facts are, what people believe, and why. In the ontology engineering for GCIS we also see collaboration between geoscientists and computer scientists. That is, the conceptual model came first, as a product that geoscientists can understand, before it was bound to logic and an ontology encoding grammar. The process can be seen as within the scope of semiology: we can do a good job with syntax and semantics, but very often we struggle with the pragmatics.

(3) PROV-ES: Provenance of scientific findings is receiving increasing attention. The Earth science community has taken a lead in work on capturing provenance. The World Wide Web Consortium (W3C) PROV standard provides a platform for the Earth science community to adopt and extend. The Provenance – Earth Science (PROV-ES) Working Group was initiated in 2013; it primarily focused on extending the PROV standard and tested the outputs with sample projects. In the PROV-ES hackathon at the summer meeting, Hook Hua and Gerald Manipon showed more technical details of working with PROV-ES, especially its encodings, discovery, and visualization.

(4) Entity linking: Jin Guang Zheng and I had a poster about our ESIP 2014 Testbed project. The topic is linking entity mentions in documents and datasets to entities in the Web of Data. Entity recognition and linking is valuable when working with datasets collected from multiple sources. Detecting and linking entity mentions in datasets can be facilitated by using knowledge bases on the Web, such as ontologies and vocabularies. In this work we built a web-based entity linking and wikification service for datasets. Our current demo system uses DBpedia as the knowledge base, and we have been collecting geoscience ontologies and vocabularies. A potential future collaboration is to use the ESIP ontology portal as the knowledge base. Discussion with colleagues during the poster session showed that this work may also benefit work on dark data, such as pattern recognition and knowledge discovery from legacy literature.

(5) Big Earth Data Initiative: This is an inter-agency coordination effort for geo-data interoperability in the US. I will copy a part of the original session description here to show the detailed relationships among a few entities and organizations that were mentioned: ‘The US Group on Earth Observations (USGEO) Data Management Working Group (DMWG) is an inter-agency body established under the auspices of the White House National Science and Technology Council (NSTC). DMWG members have been drafting an “Earth Observations Common Framework” (EOCF) with recommended approaches for supporting and improving discoverability, accessibility, and usability for federally held earth observation data. The recommendations will guide work done under the Big Earth Data Initiative (BEDI), which provided funding to some agencies for improving those data attributes.’ It will be nice to see more outputs from this effort and compare the work with similar efforts in Europe such as INSPIRE, as well as the global initiative GEOSS.

Posted at 17:13

July 22

AKSW Group - University of Leipzig: DL-Learner 1.1 (Supervised Structured Machine Learning Framework) Released

Dear all,

we are happy to announce DL-Learner 1.1.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

Website: http://dl-learner.org
GitHub page: https://github.com/AKSW/DL-Learner
Download: https://github.com/AKSW/DL-Learner/releases
ChangeLog: http://dl-learner.org/development/changelog/

DL-Learner is used for data analysis in other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern based and evolutionary techniques for learning on structured data. For a practical example, see http://dl-learner.org/community/carcinogenesis/. It also offers a plugin for Protege, which can give suggestions for axioms to add. DL-Learner is part of the Linked Data Stack – a repository for Linked Data management tools.

In the current release, we improved the support for SPARQL endpoints as knowledge sources. You can now directly use a SPARQL endpoint for learning without an OWL reasoner on top of it. Moreover, we extended DL-Learner to also consider dates and inverse properties for learning. Further efforts were made to improve our Query Tree Learning algorithms (those are used to learn SPARQL queries rather than OWL class expressions).

We want to thank everyone who helped to create this release, in particular Robert Höhndorf and Giuseppe Rizzo. We also acknowledge support by the recently started SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the GeoKnow and Big Data Europe projects where it is part of the respective platforms.

Kind regards,

Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin

Posted at 14:14

July 16

AKSW Group - University of Leipzig: AKSW Colloquium, 20-07-2015, Enterprise Linked Data Networks

Enterprise Linked Data Networks (PhD progress report) by Marvin Frommhold

The topic of the thesis is the scientific utilization of the LUCID research project, in particular the LUCID Endpoint Prototype. In LUCID we research and develop Linked Data technologies that allow partners in supply chains to describe their work, their companies and their products for other participants. This allows for building distributed networks of supply chain partners on the Web without a centralized infrastructure.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 09:20

July 15

Dublin Core Metadata Initiative: Ex Libris and Elsevier join as DC-2015 sponsors

2015-07-15, Conference host UNESP and DCMI are pleased to announce that Ex Libris and Elsevier are now among the sponsors of DC-2015 in São Paulo, Brazil, 1-4 September 2015. Elsevier is a world-leading provider of scientific, technical and medical information products and services, and of information solutions that enhance the performance of science, health, and technology professionals, empowering them to make better decisions. Ex Libris is a leading provider of library automation solutions, offering the only comprehensive product suite for the discovery, management, and distribution of all materials--print, electronic, and digital. For information about how your organization can become a DC-2015 sponsor, see http://bit.ly/DC2015-Sponsors.

Posted at 23:59

Dublin Core Metadata Initiative: DC-2015 Early Bird registration closes 31 July 2015!

2015-07-15, Early Bird registration for DC-2015 in São Paulo, Brazil closes on 31 July 2015. In addition to Keynote Speakers Paul Walk of EDINA and Ana Alice Baptista of the University of Minho, there is a full Technical Program of peer-reviewed papers, project reports and posters, as well as a Professional Program of full-day Workshops and Conference Special Sessions. For more about the conference, visit the conference website at http://purl.org/dcevents/dc-2015.

Posted at 23:59

Orri Erling: Big Data, Part 2: Virtuoso Meets Impala

In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8 EC2 instances. We get a single user win for Virtuoso by a factor of 136, and a five user win by a factor of 55. The details and analysis follow.

The load setup is the same as ever: the data is copied from CSV files, attached as external tables, into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers. We suppose the authors know what works best.
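
For orientation, the load pattern is roughly as follows. This is a hypothetical sketch, not the exact scripts used; the real schema comes from the Impala github checkout, and the table names, column types, and HDFS path below are placeholders.

-- CSV files attached as an external table (placeholder path and types)
CREATE EXTERNAL TABLE lineitem_csv (
  l_orderkey      BIGINT,
  l_partkey       BIGINT,
  l_suppkey       BIGINT,
  l_linenumber    INT,
  l_quantity      DECIMAL(15,2),
  l_extendedprice DECIMAL(15,2),
  l_discount      DECIMAL(15,2),
  l_tax           DECIMAL(15,2),
  l_returnflag    STRING,
  l_linestatus    STRING,
  l_shipdate      STRING,
  l_commitdate    STRING,
  l_receiptdate   STRING,
  l_shipinstruct  STRING,
  l_shipmode      STRING,
  l_comment       STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tpch/lineitem';

-- Copy into Parquet; the resulting Parquet files are what the 88-way
-- split of lineitem refers to.
CREATE TABLE lineitem STORED AS PARQUET
AS SELECT * FROM lineitem_csv;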

The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the same plan knows the table cardinality full well. The settings for scanner threads and cores to use (in impala-shell) can be changed, but the behavior does not seem to change.

Following are the run times for one query stream.

Query Virtuoso Impala Notes
332     s 841     s Data Load
Q1 1.098 s 164.61  s
Q2 0.187 s 24.19  s
Q3 0.761 s 105.70  s
Q4 0.205 s 179.67  s
Q5 0.808 s 84.51  s
Q6 2.403 s 4.43  s
Q7 0.59  s 270.88  s
Q8 0.775 s 51.89  s
Q9 1.836 s 177.72  s
Q10 3.165 s 39.85  s
Q11 1.37  s 22.56  s
Q12 0.356 s 17.03  s
Q13 2.233 s 103.67  s
Q14 0.488 s 10.86  s
Q15 0.72  s 11.49  s
Q16 0.814 s 23.93  s
Q17 0.681 s 276.06  s
Q18 1.324 s 267.13  s
Q19 0.417 s 368.80  s
Q20 0.792 s 60.45  s
Q21 0.720 s 418.09  s
Q22 0.155 s 40.59  s
Total 20     s 2724     s

Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. We show the average execution time for each query. We then compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests in 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel. According to Impala documentation, there is no random delete operation, so the refreshes cannot be implemented.

Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem. This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top, the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.

Following are the average times for each query in the 5 stream experiment.

Query Virtuoso Impala Notes
Q1 1.95 s 191.81 s
Q2 0.70 s 40.40 s
Q3 2.01 s 95.67 s
Q4 0.71 s 345.11 s
Q5 2.93 s 112.29 s
Q6 4.76 s 14.41 s
Q7 2.08 s 329.25 s
Q8 3.00 s 98.91 s
Q9 5.58 s 250.88 s
Q10 8.23 s 55.23 s
Q11 4.26 s 27.84 s
Q12 1.74 s 37.66 s
Q13 6.07 s 147.69 s
Q14 1.73 s 23.91 s
Q15 2.27 s 23.79 s
Q16 2.41 s 34.76 s
Q17 3.92 s 362.43 s
Q18 3.02 s 348.08 s
Q19 2.27 s 443.94 s
Q20 3.05 s 92.50 s
Q21 2.00 s 623.69 s
Q22 0.37 s 61.36 s
Total for Slowest Stream 67    s 3740    s

There are 4 queries in Impala that terminated with an error (memory limit exceeded). These were two Q21s, one Q19, one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, do a hash build side of half of lineitem, which explains running out of memory. Virtuoso does Q21 mostly by index.

Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I/O, so we seem to be running from memory, as intended.

We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
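
To make the point concrete, consider a join of the following shape. This is an illustrative query, not one from the benchmark set; the columns are standard TPC-H.

-- Without co-ordering of orders and lineitem on orderkey, there is no
-- merge join option; one of the two large inputs must be built into a
-- hash table before probing, which is the pattern behind the long Q9
-- and Q21 times.
SELECT COUNT (*)
  FROM orders, lineitem
 WHERE o_orderkey    = l_orderkey
   AND o_orderstatus = 'F'
   AND l_receiptdate > l_commitdate;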

However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.

Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query for the single stream run. impala_stream0.zip contains the evidence for the single-stream run; impala-stream1-5.zip holds the 5-stream run.

To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.

In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.

Posted at 20:12

Egon Willighagen: PubChemRDF: semantic web access to PubChem data

Gang Fu and Evan Bolton have blogged about it previously, but their PubChemRDF paper is out now (doi:10.1186/s13321-015-0084-4). It very likely defines the largest collection of RDF triples using the CHEMINF ontology, and I congratulate the authors on an increasingly powerful PubChem database.

With this major provider of Linked Open Data for chemistry now published, I should soon see where my Isbjørn stands. The release of this publication is also very timely with respect to the CHEMINF ontology, as last week I finished a transition from Google to GitHub by moving the important wiki pages, including the one about "Where is the CHEMINF ontology used?". I have already added Gang's paper. A big thanks and congratulations to the PubChem team, and my sincere thanks for having been able to contribute to this paper.

Posted at 17:21

Bob DuCharme: Visualizing DBpedia geographic data

With some help from SPARQL.

Posted at 13:34

July 14

Semantic Web Company (Austria): Semantic Web Company with LOD2 project top listed at the first EC Innovation Radar

The Innovation Radar is a DG Connect support initiative which focuses on the identification of high-potential innovations and the key innovators behind them in FP7, CIP and H2020 projects. The Radar supports the innovators by suggesting a range of targeted actions that can assist them in fulfilling their potential in the market place. The first Innovation Radar Report reviews the innovation potential of ICT projects funded under the 7th Framework Programme and the Competitiveness and Innovation Framework Programme. Between May 2014 and January 2015, the Commission reviewed 279 ICT projects, which had resulted in a total of 517 innovations, delivered by 544 organisations in 291 European cities.

The core of the analysis is the Innovation Capacity Indicator (ICI), which measures both the ability of the innovating company and the quality of the environment in which it operates. Among these results, SWC has received two top rankings: one for the recently concluded LOD2 project (LOD2 – Creating Knowledge out of Interlinked Data), and another as one of the key organisations, and thereby innovating SMEs, within these projects. Also listed are our partners OpenLink Software and Wolters Kluwer Germany.


Ranking of the top 10 innovations and key organisations behind them (Innovation Radar 2015)

We are happy and proud that the report identifies Semantic Web Company as one of those players (10%) where commercial exploitation of innovations is already ongoing. That strengthens and confirms our approach of interconnecting FP7 and H2020 research and innovation activities with real-world business use cases coming from our customers and partners. Thereby our core product, the PoolParty Semantic Suite, can be taken as a best practice example of embedding collaborative research into an innovation-driven commercial product. For Semantic Web Company, the report is particularly encouraging because of its emphasis on the positive role of SMEs, from which the report sees 41% of high-potential innovations coming.

 

Blogpost by Martin Kaltenböck and Thomas Thurner

Posted at 10:15

July 13

Orri Erling: Vectored Execution in Column/Row Stores

This article discusses the relationship between vectored execution and column- and row-wise data representations. Column stores are traditionally considered to be good for big scans but poor at indexed access. This is not necessarily so, though. We take TPC-H Q9 as a starting point, working with different row- and column-wise data representations and index choices. The goal of the article is to provide a primer on the performance implications of different physical designs.

All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.

We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts. However, these predominantly do index-based access as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically-decomposed graph database like Sparksee.

So, in these experiments, we store the relevant data four times over, as follows:

  • 100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey

  • The same in row-wise data representation

  • Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name. These all have the original table's primary key, e.g., l_orderkey, l_linenumber for the l_-prefixed tables

  • The same with row-wise tables

The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.

As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation which is described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join, followed by ordered index access of orders, then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.

The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices, rsload.sql copies the data from the TPC-H tables.
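
To give an idea of the shape of these structures, here is a sketch of one single-column projection and of the two secondary indices on lineitem used below. This is illustrative only: the definitive declarations are in rs.sql, the row-wise variants simply omit the COLUMN keyword, and the index names follow the space tables at the end of the article.

-- Single-column projection of lineitem for l_partkey, keyed like lineitem
CREATE TABLE l_partkey (
  l_orderkey    INT NOT NULL,
  l_linenumber  INT NOT NULL,
  l_partkey     INT NOT NULL,
  PRIMARY KEY (l_orderkey, l_linenumber) COLUMN);

-- Secondary indices on lineitem for the index-based plans
CREATE INDEX lpk_pk ON lineitem (l_partkey);
CREATE INDEX l_pksk ON lineitem (l_partkey, l_suppkey);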

The business question is to calculate the profit from sale of selected parts grouped by year and country of the supplier. This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.

SELECT
                                                                         n_name  AS  nation, 
                                                 EXTRACT(year FROM o_orderdate)  AS  o_year,
          SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity)  AS  sum_profit
    FROM  lineitem, part, partsupp, orders, supplier, nation
   WHERE    s_suppkey = l_suppkey
     AND   ps_suppkey = l_suppkey
     AND   ps_partkey = l_partkey
     AND    p_partkey = l_partkey
     AND   o_orderkey = l_orderkey
     AND  s_nationkey = n_nationkey
     AND  p_name LIKE '%green%'
GROUP BY  nation, o_year
ORDER BY  nation, o_year DESC

Query Variants

The query variants discussed here are:

  1. Hash based, the best plan -- 9h.sql

  2. Index based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql

  3. Index based with multicolumn rows, lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql

  4. Index based with one table per dependent column, index on l_partkey -- 9p.sql

  5. Index based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_linenumber -- 9pp.sql, 9ppr.sql

These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.

The 10K vector size corresponds to running with a fixed vector size. The Vector 1 setting sets the vector size to 1, effectively running a tuple at a time, which corresponds to a non-vectorized engine.

We note that lineitem and its single column projections contain 600M rows. So, a vector of 10K values will hit, on the average, every 60,000th row. A vector of 1,000,000 will thus hit every 600th. This is when doing random lookups that are in no specific order, e.g., getting lineitems by a secondary index on l_partkey.

1 — Hash-based plan

Vector Dynamic 10k 1
Column-wise 4.1 s 4.1 s 145   s
Row-wise 25.6 s 25.9 s 45.4 s

Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because of less memory access (just scan the l_partkey column, and filter this with a Bloom filter; and then hash table lookup to pick only items with the desired part). The other columns are accessed only for the matching rows. The hash lookup is vectored since there are hundreds of compressed l_partkey values available at each time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.

Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.

2 — Index-based, lineitem indexed on l_partkey

Vector Dynamic 10k 1
Column-wise 30.4 s 62.3 s 321   s
Row-wise 31.8 s 27.7 s 122   s

Here the plan scans part, then partsupp, which shares ordering with part; both are ordered on partkey. Then lineitem is fetched by a secondary index on l_partkey. This produces l_orderkey, l_lineitem, which are used to get the l_suppkey. We then check if the l_suppkey matches the ps_suppkey from partsupp, which drops 3/4 of the rows. The next join is on orders, which shares ordering with lineitem; both are ordered on orderkey.

There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.

3 — Index-based, lineitem indexed on L_partkey, l_suppkey

Vector Dynamic 10k 1
Column-wise 16.9 s 47.2 s 151   s
Row-wise 22.4 s 20.7 s 89   s

This is similar to the previous, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.

4 — Decomposed, index on l_partkey

Vector Dynamic 10k 1
Column-wise 35.7 s 170   s 601   s
Row-wise 44.5 s 56.2 s 130   s

Now, each of the l_extendedprice, l_discount, l_quantity and l_suppkey is a separate index lookup. The times are slightly higher but the dynamic is the same.

The non-vectored columns case is hit the hardest.

5 — Decomposed, index on l_partkey, l_suppkey

Vector Dynamic 10k 1
Column-wise 19.6 s 111   s 257   s
Row-wise 32.0 s 37   s 74.9 s

Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at long vector sizes because of overall better index performance in the presence of locality.

Space Utilization

The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.

The row-wise table also contains entries for column-wise structures (DB.*) since these have a row-wise sparse index. The size of this is however negligible, under 1% of the column-wise structures.

Row-Wise
MB structure
73515 R.DBA.LINEITEM
14768 R.DBA.ORDERS
11728 R.DBA.PARTSUPP
10161 r_lpk_pk
10003 r_l_pksk
9908 R.DBA.l_partkey
8761 R.DBA.l_extendedprice
8745 R.DBA.l_discount
8738 r_l_pk
8713 R.DBA.l_suppkey
6267 R.DBA.l_quantity
2223 R.DBA.CUSTOMER
2180 R.DBA.o_orderdate
2041 r_O_CK
1911 R.DBA.PART
1281 R.DBA.ps_supplycost
811 R.DBA.p_name
127 R.DBA.SUPPLIER
88 DB.DBA.LINEITEM
24 DB.DBA.ORDERS
11 DB.DBA.PARTSUPP
9 R.DBA.s_nationkey
5 l_pksk
4 DB.DBA.l_partkey
4 lpk_pk
4 DB.DBA.l_extendedprice
3 l_pk
3 DB.DBA.l_suppkey
2 DB.DBA.CUSTOMER
2 DB.DBA.l_quantity
1 DB.DBA.PART
1 O_CK
1 DB.DBA.l_discount
  
Column-Wise
MB structure
36482 DB.DBA.LINEITEM
13087 DB.DBA.ORDERS
11587 DB.DBA.PARTSUPP
5181 DB.DBA.l_extendedprice
4431 l_pksk
3072 DB.DBA.l_partkey
2958 lpk_pk
2918 l_pk
2835 DB.DBA.l_suppkey
2067 DB.DBA.CUSTOMER
1618 DB.DBA.PART
1156 DB.DBA.l_quantity
961 DB.DBA.ps_supplycost
814 O_CK
798 DB.DBA.l_discount
724 DB.DBA.p_name
436 DB.DBA.o_orderdate
126 DB.DBA.SUPPLIER
1 DB.DBA.s_nationkey

In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.

We note that the single-column projections are smaller column-wise. The l_extendedprice column is not very compressible, hence its column-wise structure takes much more space than that of l_quantity; the row-wise difference is smaller. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.

The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber) in column- and row-wise representations.

Note that STRING columns (e.g., l_comment) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.

Conclusions

Column stores and vectorization inextricably belong together. Column-wise compression yields great gains also for indices, since sorted data is easy to compress. Also for non-sorted data, adaptive use of dictionaries, run lengths, etc., produce great space savings. Columns also win with indexed access if there is locality.

Row stores have less dependence on locality, but they also will win by a factor of 3 from dropping interpretation overhead and exploiting join locality.

For point lookups, columns lose by 2+x but considering their better space efficiency, they will still win if space savings prevent going to secondary storage. For bulk random access, like in graph analytics, columns will win because of being able to operate on a large vector of keys to fetch.

For many workloads, from TPC-H to LDBC social network, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such and are therefore at a disadvantage. Self-joining, like in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends however on the density of access.

For analytical workloads, where the dominant join pattern is the scan with selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this and the row store even with well implemented vectorization loses by a factor of 5.

For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.

Posted at 17:46

Orri Erling: Virtuoso at SIGMOD 2015

Two papers presented at SIGMOD 2015 have been added to the Virtuoso Science Library.

  • Orri Erling (OpenLink Software); Alex Averbuch (Neo Technology); Josep Larriba-Pey (Sparsity Technologies); Hassan Chafi (Oracle Labs); Andrey Gubichev (TU Munich); Arnau Prat-Pérez (Universitat Politècnica de Catalunya); Minh-Duc Pham (VU University Amsterdam); Peter Boncz (CWI): The LDBC Social Network Benchmark: Interactive Workload. Proceedings of SIGMOD 2015, Melbourne.

    This paper is an overview of the challenges posed in the LDBC social network benchmark, from data generation to the interactive workload.

  • Mihai Capotă (Delft University of Technology), Tim Hegeman (Delft University of Technology), Alexandru Iosup (Delft University of Technology), Arnau Prat-Pérez (Universitat Politècnica de Catalunya), Orri Erling (OpenLink Software), Peter Boncz (CWI): Graphalytics: A Big Data Benchmark for Graph-Processing Platforms. Sigmod GRADES 2015.

    This paper discusses the future evolution of the LDBC Social Network Benchmark and gives a preview of Virtuoso graph traversal performance.

Posted at 16:52

Orri Erling: Big Data, Part 1: Virtuoso Meets Hive

In this series, we will look at Virtuoso and some of the big data technologies out there. SQL on Hadoop is of interest, as well as NoSQL technologies.

We begin at the beginning, with Hive, the grand-daddy of SQL on Hadoop.

The test platform is two Amazon R3.8 AMI instances. We compared Hive with the Virtuoso 100G TPC-H experiment on the same platform, published earlier on this blog. The runs follow a bulk load in both cases, with all data served from memory. The platform has 2x244GB RAM with only 40GB or so of working set.

The Virtuoso version and settings are as in the Virtuoso Cluster test AMI.

The Hive version is 0.14 from the Hortonworks HDP 2.2 distribution. The Hive schema and query formulations are the ones from hive-testbench on GitHub. The Hive configuration parameters are as set by Ambari 2.0.1. These are different from the ones in hive-testbench, but the Ambari choices offer higher performance on the platform. We did run statistics with Hive and did not specify any settings not in hive-testbench; thus we suppose the query plans were as good as Hive will make them. Platform utilization was even across both machines, and varied between 30% and 100% of the 2 x 32 hardware threads.

Load time with Hive was 742 seconds against 232 seconds with Virtuoso. In both cases, this was a copy from 32 CSV files into native database format; for Hive, this is ORC (Optimized Row Columnar). In Virtuoso, there is one index, (o_custkey); in Hive, there are no indices.
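
For reference, the ORC copy and the statistics step look roughly like this. This is a sketch; hive-testbench generates the actual DDL, so the name of the CSV-backed staging table and the choice of columns below are placeholders.

-- Copy the CSV-backed staging table into ORC format
CREATE TABLE lineitem STORED AS ORC
AS SELECT * FROM lineitem_text;

-- Table and column statistics, as mentioned above
ANALYZE TABLE lineitem COMPUTE STATISTICS;
ANALYZE TABLE lineitem COMPUTE STATISTICS FOR COLUMNS l_orderkey, l_partkey, l_suppkey;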

Query Virtuoso Hive Notes
332     s 742     s Data Load
Q1 1.098 s 296.636 s
Q2 0.187 s >3600     s Hive Timeout
Q3 0.761 s 98.652 s
Q4 0.205 s 147.867 s
Q5 0.808 s 114.782 s
Q6 2.403 s 71.789 s
Q7 0.59  s 394.201 s
Q8 0.775 s >3600     s Hive Timeout
Q9 1.836 s >3600     s Hive Timeout
Q10 3.165 s 179.646 s
Q11 1.37  s 43.094 s
Q12 0.356 s 101.193 s
Q13 2.233 s 208.476 s
Q14 0.488 s 89.047 s
Q15 0.72 s 136.431 s
Q16 0.814 s 105.652 s
Q17 0.681 s 255.848 s
Q18 1.324 s 337.921 s
Q19 0.417 s >3600     s Hive Timeout
Q20 0.792 s 193.965 s
Q21 0.720 s 670.718 s
Q22 0.155 s 68.462 s

Hive does relatively best on bulk load. This is understandable since this is a sequential read of many files in parallel with just compression to do.

Hive's query times are obviously affected by not having a persistent memory image of the data, as this is always streamed from the storage files into other files as MapReduce intermediate results. This seems to be an operator-at-a-time business as opposed to Virtuoso's vectorized streaming.

The queries that would do partitioned hash joins (e.g., Q9) did not finish under an hour in Hive, so we do not have a good metric of a cross-partition hash join.

One could argue that one should benchmark Hive only in disk-bound circumstances. We may yet get to this.

Our next stop will probably be Impala, which ought to do much better than Hive, as it does not have the MapReduce overheads.

If you are a Hive expert and believe that Hive should have done much better, please let us know how to improve the Hive scores, and we will retry.

Posted at 16:16

AKSW Group - University of Leipzig: AKSW Colloquium, 13-07-2015

Philipp Frischmuth will give a brief presentation regarding the current state of his PhD thesis and Lukas Eipert will present the topic of his upcoming internship:

As part of an internship at eccenca a configurable graphical RDF editor will be developed. Graphical components such as shapes and arrows will be translated to triples depending on the configuration of the editor. This talk outlines the idea and motivation of this task.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

 

Posted at 10:06

July 01

Dublin Core Metadata Initiative: DC-2015 Preliminary Program published

2015-07-01, São Paulo State University (UNESP) and the Conference Committee of DC-2015 have published the preliminary program of the DCMI International Conference at http://dcevents.dublincore.org/IntConf/index/pages/view/schedule-15. The conference days--Wednesday and Thursday, 2-3 September--feature keynote speakers, Paul Walk and Ana Alice Baptista, paper sessions, project reports, posters (including best practice posters and demonstrations), and an array of special sessions. Tuesday and Friday are pre- and post-conference, full-day workshop events: "Development of Metadata Application Profiles", "Training the Trainers for Linked Data", and "Elaboration of Controlled Vocabularies Using SKOS". Special Sessions include "Schema.org Structured Data on the Web--An Extending Influence" sponsored by OCLC, "Current Developments in Metadata for Research Data" sponsored by the DCMI Science and Metadata Community, and "Cultural Heritage Linked Data". The titles and abstracts of the Technical Program are available at http://dcevents.dublincore.org/IntConf/index/pages/view/abstracts-15. Registration is open: http://dcevents.dublincore.org/IntConf/index/pages/view/reg15. Day registrations are available.

Posted at 23:59

W3C Read Write Web Community Group: Read Write Web — Q2 Summary — 2015

Summary

Q2 was relatively quiet, yet saw quite a bit of progress.  Some work is being done on the EU INSPIRE directive, and ESWC took place in Slovenia with some interesting demos.  One that caught the eye was QueryVOWL, a visual query language for linked data.

For those that enjoy such things, there was some interesting work and discussion on deterministic naming of blank nodes.  Also a neat new framework called Linked Data Reactor, which can be used for developing component based applications.  The web annotation group has also published an Editor’s draft.

Much of the work that has been done in this group has come together in a new spec, SoLiD (Social Linked Data).  As an early adopter of this technology I have been extraordinarily impressed, and would encourage trying it out.  There has also been a proposed charter for the next version of the Linked Data Platform.

Communications and Outreach

A few members of this group met with the Social Web Working Group in Paris.  Over two days we got to demo read write technologies in action, and also to see the work from members of the indieweb community and those working with the Activity Streams specification.

Community Group

Relatively quiet this quarter on the mailing list, with about 40 posts.  I get the impression that more focus has shifted to implementations and applications, where I think there is starting to be an uptick in progress.  Some ontologies have been worked on, one for SoLiD apps, and another for microblogging.


Applications

The first release of a contacts manager on the SoLiD platform came out this month, which allows you to set up and store your own personal address book in your own storage.  An interesting feature of this app is that it includes logic for managing workspaces and preferences.  Import and export are currently targeted at vCard, but more formats will be added – or simply fork the app and add your own!

Lots of work has been done on the linkeddata github area.  General improvements and some preliminary work on a keychain app.  One feature that I have found useful was the implementation of HTTP PATCH sending notifications to containers when something has changed.  This helped me create a quick demo for webid.im to show how it’s possible to cycle through a set of images and have them propagate through the network as things change.


Last but not Least…

The Triple Pattern Fragments client was released and is able to query multiple APIs for data at the same time.  This is a 100% client-side app and supports federated SPARQL queries.  It is another great open source app; you can read the specification or dive into the source code here.

Posted at 16:36

June 30

Redlink: Redlink API, moving to 1.0

Over the last months we have been working very hard to provide a reliable and valuable service with the Redlink Platform. Today we can finally announce that we are moving out of public beta to 1.0.

That means you will need to move to the new endpoint. Don't worry: if you are using a Redlink SDK or one of our plugins, we are updating them all to make the transition easy. This does not affect currently running applications, since 1.0-BETA is deprecated but still available until the end of 2015. If you need further support with the transition, please contact us.


In the following weeks we’ll contact all our users to discuss further interest in our services and see how we can help you do awesome things with your data.

Posted at 15:32

June 29

Leigh Dodds: “The scribe and the djinn’s agreement”, an open data parable

In a time long past, in a land far away, there was once a great city. It was the greatest city in the land, and the vast marketplace at its centre was the busiest, liveliest marketplace in the world. People of all nations could be found there buying and selling their wares. Indeed, the marketplace was so large that people would spend days, even weeks, exploring its length and breadth and would still discover new stalls selling a myriad of items.

A frequent visitor to the marketplace was a woman known only as the Scribe. While the Scribe was often found roaming the marketplace even she did not know of all of the merchants to be found within its confines. Yet she spent many a day helping others to find their way to the stalls they were seeking, and was happy to do so.

One day, in return for providing useful guidance, a mysterious stranger gave the Scribe a gift: a small magical lamp. Upon rubbing the lamp, a djinn appeared before the surprised Scribe and offered her a single wish.

“Oh venerable djinn” cried the Scribe, “grant me the power to help anyone that comes to this marketplace. I wish to help anyone who needs it to find their way to whatever they desire”.

With a sneer the djinn replied: “I will grant your wish. But know this: your new found power shall come with limits. For I am a capricious spirit who resents his confinement in this lamp”. And with a flash and a roll of thunder, the magic was completed. And in the hands of the Scribe appeared the Book.

The Book contained the name and location of every merchant in the marketplace. From that day forward, by reading from the Book, the Scribe was able to help anyone who needed assistance to find whatever they needed.

After several weeks of wandering the market, happily helping those in need, the Scribe was alarmed to discover that she was confronted by a long, long line of people.

“What is happening?” she asked of the person at the head of the queue.

“It is now widely known that no-one should come to the Market without consulting the Scribe” said the man, bowing. “Could you direct me to the nearest merchant selling the finest silks and tapestries?”

And from that point forward the Scribe was faced with a never-ending stream of people asking for help. Tired and worn and no longer able to enjoy wandering the marketplace as had been her whim, she was now confined to its gates. Directing all who entered, night and day.

After some time, a young man took pity on the Scribe, pushing his way to the front of the queue. “Tell me where all of the spice merchants are to be found in the market, and then I shall share this with others!”

But no sooner had he said this than the djinn appeared in a puff of smoke: “NO! I forbid it!”. With a wave of its arm the Scribe was struck dumb until the young man departed. With a smirk the djinn disappeared.

Several days passed and a group of people arrived at the head of the queue of petitioners.

“We too are scribes.” they said. “We come from a neighbouring town having heard of your plight. Our plan is to copy out your Book so that we might share your burden and help these people”.

But whilst a spark of hope was still flaring in the heart of the Scribe, the djinn appeared once again. “NO! I forbid this too! Begone!” And with a scream and a flash of light the scribes vanished. Looking smug, the djinn disappeared.

Some time passed before a troupe of performers approached the Scribe. As a chorus they cried: “Look yonder at our stage, and the many people gathered before it. By taking turns reading from the book in front of a wide audience, we can easily share your burden”.

But shaking her head the Scribe could only turn away whilst the djinn visited ruin upon the troupe. “No more” she whispered sadly.

And so, for many years the Scribe remained as she had been, imprisoned within the subtle trap of the djinn of the lamp. Until, one day a traveller appeared in the market. Upon reaching the head of the endless line of penitents, the man asked of the Scribe:

“Where should you go to rid yourself of the evil djinn?”.

Surprised, and with sudden hope, the Scribe turned the pages of her Book…


Posted at 20:50

Orri Erling: Rethink Big and Europe's Position in Big Data

I will here take a break from core database and talk a bit about EU policies for research funding.

I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.

RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.

So, we went through a regular questionnaire about how we see the landscape. I will summarize this below, as this is anyway informative.

Core competence

My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.

User challenges

The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.

Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.

There are also public data resources that are published in RDF serializations as vendor-neutral, self-describing format. Having everything as triples, without a priori schema, makes things easier to integrate and in some cases easier to describe and query.

So, the challenge is in the labor intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see on the field is primarily a data diversity game. The scenario is data integration; the technology we provide is database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and such core data-integration functions are performed using diverse, user-specific means.

Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.

What technology would you like to have? Network or power efficiency?

OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single threaded interprocess communication within one box. The operative word is latency. Making large systems always involves a network or something very much like one in large scale-up scenarios.

On the software side, next to nobody understands latency and contention, yet these are the core factors in any pursuit of scalability. Because of this, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular: they take the communication out of the program flow, so the programmer cannot muck this up, as would otherwise happen with the inevitability of destiny. Of course, our beloved SQL, or declarative query in general, does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the work of Hellerstein.
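
As a toy illustration of that point, here is a single-process sketch in Python, not tied to Hadoop or any other framework: the programmer writes only a per-record map function and a per-key reduce function, and the shuffle between them, i.e. all the communication, is left entirely to the runtime.

# A word count in the MapReduce style; a single-process stand-in for the
# distributed case, meant only to show what the programmer does and does not write.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum the partial counts for one key.
    return word, sum(counts)

def run(lines):
    # This sorted/groupby step stands in for the shuffle phase; in a real
    # framework it is the runtime, not the programmer, that moves the data.
    pairs = sorted((kv for line in lines for kv in map_fn(line)), key=itemgetter(0))
    return [reduce_fn(word, [c for _, c in group])
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run(["to be or not to be"]))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]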

There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.

If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high-powered machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in larger numbers of smaller units, which makes life harder again. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of the data more freely if the node size is larger.

One very specific item where research might help, without having to rebuild the hardware stack, would be better, lower-latency exposure of networks to software: lightweight threads, user-space access, bypassing slow protocol stacks, and so on. MPI has some of this, but maybe more could be done.

So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.

Are there pressures to adapt business models based on big data?

The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time.
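
A rough back-of-the-envelope check makes the arithmetic explicit. This is just a sketch: the 4-month break-even figure comes from the paragraph above, while the 36-month hardware lifetime is my own assumption.

# Cloud vs. on-premise break-even, as a rough sketch.
RENT_MONTHS_TO_MATCH_PURCHASE = 4   # months of 24x7 rent that equal the purchase price (from the text)
HARDWARE_LIFETIME_MONTHS = 36       # assumed useful life of owned hardware

# Renting is cheaper than buying only if, over the hardware's lifetime, the
# rented hours add up to less than four months' worth of full-time use.
break_even = RENT_MONTHS_TO_MATCH_PURCHASE / HARDWARE_LIFETIME_MONTHS
print(f"Break-even average utilization: {break_even:.0%}")  # roughly 11%

With a longer assumed hardware life the break-even point drops further, which is roughly where the 10% rule of thumb comes from.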

So, database software should be rented by the hour. A 100-150% markup on the $2.80 per hour that a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.

There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.

The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.

Well, the question is raised: what should Europe do that would have a tangible impact in the next 5 years?

This is a harder question. There is some European business in wide-area and mobile infrastructures; competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., an integrated Knights Corner with a dual IB network and 16G of fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.

What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.

I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.

What would deliver this would be a venture-capital-type activity, with well-developed networks and active participation in developing a business. The EC is not set up to perform this role at present, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also, there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better incentives exist. The grants are competitive enough and the calls are not bad; they follow the times.

In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.

The recent emphasis on benchmarking, spearheaded by Stefano Bertolo, is good, as exemplified by the LDBC FP7 project. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: once a benchmark gets adopted, far more work goes into solving the problem than went into stating it in the first place.

The aims and the calls are good. The execution by projects is variable. For 1% of excellence there apparently must be 99% of so-and-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse: there is not a single outcome that gets all the effort. In them, the level of engagement of participants is lower and the focus much more scattered than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Most people will agree that core competence is nice, but the projects I have seen do not drive for it hard enough.

It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.

Posted at 19:36

June 28

Semantic Web Company (Austria): Improved Customer Experience by use of Semantic Web and Linked Data technologies

With the rise of Linked Data technologies, several new approaches come into play for improving customer experience across all digital channels of a company. These methodologies can be subsumed under the term “the connected customer”.

These are interesting not only for retailers operating a web shop, but also for enterprises seeking new ways to develop tailor-made customer services and to increase customer retention.

Linked Data methodologies can help to improve several metrics along a typical customer experience lifecycle:

  1. Personalized access to information, e.g. to technical documentation
  2. Cross-selling through a better contextualization of product information
  3. Semantically enhanced help desk, user forums and self service platforms
  4. Better ways to understand and interpret a customer intention by use of enterprise vocabularies
  5. More dynamic management of complex multi-channel websites through better cost-effectiveness
  6. More precise methods for data analytics, e.g. to allow marketers to better target campaigns and content to the user’s preferences
  7. Enhanced search experience at aggregators like Google through the use of microdata and schema.org

At the center of this approach, knowledge graphs work like a ‘linking machine’: based on standards-based semantic models, business entities are linked in a highly dynamic way. Such graphs go beyond the power of social graphs. While social graphs focus on people only, knowledge graphs connect all kinds of relevant business objects to each other.

When customers and their behaviours are represented in a knowledge model, Linked Data technologies try to preserve as much semantics as possible. By these means they can complement other approaches to big data analytics, which tend to flatten out the data model behind business entities.

Posted at 09:08

June 26

Semantic Web Company (Austria): Using SPARQL clause VALUES in PoolParty

Since PoolParty fully supports SPARQL 1.1, you can use clauses like VALUES. The VALUES clause can be used to provide an unordered solution sequence that is joined with the results of the query evaluation. From my perspective, it is a convenient way to filter variables and it increases the readability of queries.

For example, when you want to know which cocktails you can create with Gin and a highball glass, you can go to http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails and fire this query:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
  FILTER (?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en )
}

When you want to add further pairs of ingredients and drinkware to filter on in combination, the query gets quite clumsy. Wrongly placed braces can break the syntax, and when writing complicated queries you can easily introduce errors, e.g. by mixing up boolean operators, which leads to wrong results:

...
FILTER ((?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en ) ||
     (?ingredientLabel = "Vodka"@en && ?drinkwareLabel ="Old Fashioned glass"@en ))
}

Using VALUES can help in this situation. For example, this query shows how to filter both pairs, Gin + Highball glass and Vodka + Old Fashioned glass, in a neat way:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
}
VALUES ( ?ingredientLabel ?drinkwareLabel )
{
  ("Gin"@en "Highball glass"@en)
  ("Vodka"@en "Old Fashioned glass"@en)
}

Especially when you create SPARQL code automatically, e.g. generated by a form, this clause can be very useful.
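
As a sketch of that idea, the snippet below assembles the VALUES block from a list of (ingredient, drinkware) pairs, e.g. as they might come from a form, and sends the query to the public endpoint. This is my own illustration, not a PoolParty feature, and it assumes the Python SPARQLWrapper library is available.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails"

def build_query(pairs):
    # Build one VALUES row per (ingredient, drinkware) pair.
    values_rows = "\n".join(
        f'  ("{ingredient}"@en "{drinkware}"@en)' for ingredient, drinkware in pairs
    )
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {{
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
}}
VALUES ( ?ingredientLabel ?drinkwareLabel )
{{
{values_rows}
}}
"""

# Example: the two pairs from the query above, as they might be submitted by a form.
pairs = [("Gin", "Highball glass"), ("Vodka", "Old Fashioned glass")]
sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(build_query(pairs))
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["cocktailLabel"]["value"])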

 

Posted at 13:15

June 22

Dublin Core Metadata Initiative: OpenAIRE Guidelines: Promoting Repositories Interoperability and Supporting Open Access Funder Mandates

2015-06-01, The OpenAIRE Guidelines for Data Source Managers provide recommendations and best practices for encoding bibliographic information in OAI metadata. Presenters Pedro Príncipe, University of Minho, Portugal, and Jochen Schirrwagen, Bielefeld University Library, Germany, will provide an overview of the Guidelines, implementation support in major platforms, and tools for validation. The Guidelines have adopted established standards for different classes of content providers: (1) Dublin Core for textual publications in institutional and thematic repositories; (2) the DataCite Metadata Kernel for research data repositories; and (3) CERIF-XML for Current Research Information Systems. The aim of these guidelines is to improve the interoperability of bibliographic information exchange between repositories, e-journals, CRIS and research infrastructures. They are a means to help content providers comply with funders' Open Access policies, e.g. the European Commission's Open Access mandate in Horizon 2020, and to standardize the syntax and semantics of funder/project information, open access status, and links between publications and datasets. Webinar Date: Wednesday, 1 July 2015, 10:00am-11:15am EDT (UTC 14:00 - World Clock: http://bit.ly/pprincipe). For additional information and to register, visit http://dublincore.org/resources/training/#2015principe.

Posted at 23:59

AKSW Group - University of Leipzig: AKSW Colloquium, 22-06-2015, Concept Expansion Using Web Tables, Mining entities from the Web, Linked Data Stack

Concept Expansion Using Web Tables by Chi Wang, Kaushik Chakrabarti, Yeye He, Kris Ganjam, Zhimin Chen, Philip A. Bernstein (WWW’2015), presented by Ivan Ermilov:

Abstract. We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the “exclusive” tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.

Mining entities from the Web by Anna Lisa Gentile

This talk explores the task of mining entities and their describing attributes from the Web. The focus is on entity-centric websites, i.e. domain-specific websites containing a description page for each entity. The task of extracting information from this kind of website is usually referred to as Wrapper Induction. We propose a simple knowledge-based method which is (i) highly flexible with respect to different domains and (ii) does not require any training material, but exploits Linked Data as a background knowledge source to build essential learning resources. Linked Data – an imprecise, redundant and large-scale knowledge resource – proved useful to support this Information Extraction task: for domains that are covered, Linked Data serve as a powerful knowledge resource for gathering learning seeds. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple approach based on distant supervision can achieve competitive results against more complex state-of-the-art approaches that depend on training data.

Linked Data Stack by Martin Röbert

Martin will present the packaging infrastructure developed for the Linked Data Stack project, which will be followed by a discussion about the future of the project.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 10:11

June 20

Bob DuCharme: Artificial Intelligence, then (1960) and now

Especially machine learning.

Posted at 15:50

Copyright of the postings is owned by the original blog authors. Contact us.