Planet RDF

It's triples all the way down

April 23

Frederick Giasson: Exporting Entities using OSF for Drupal (Screencast)

This screencast will introduce you to the OSF for Drupal features that let you export Drupal Entities in one of the following supported serializations:

  • RDF+XML (RDF in XML)
  • RDF+N3 (RDF in N3)
  • structJSON (Internal OSF RDF serialization in JSON)
  • structXML (Internal OSF RDF serialization in XML)
  • ironJSON (irON serialization in JSON)
  • commON (CSV serialization to be used in spreadsheet applications)

I will show you how you can use OSF for Drupal to export entire datasets of Entities, or how to export Entities individually. You will see how you can configure Drupal such that different user roles get access to these functionalities.

I will also briefly discuss how you can create new converters to support more data formats.

Finally, I will show you how Drupal can be used as a linked data platform with a feature that makes every Drupal Entity dereferenceable on the Web [1]. You will see how you can use cURL to export an Entity's description via its URI in any of the 6 supported serialization formats.
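
As a rough illustration of the cURL part, HTTP content negotiation on an Entity's URI is one natural way to pick a serialization. The URI and media types below are hypothetical placeholders, not necessarily the exact values used in the screencast:

  # Get the RDF+XML description of a Drupal Entity (hypothetical URI)
  curl -H "Accept: application/rdf+xml" http://example.org/drupal/entity/123

  # The same Entity serialized as N3
  curl -H "Accept: text/n3" http://example.org/drupal/entity/123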


  1. OSF for Drupal follows the W3C Interest Group Note Cool URIs for the Semantic Web

Posted at 12:41

April 22

Orri Erling: In Hoc Signo Vinces (part 12 of n): TPC-H: Result Preview

In this article, we look at the 100 GB single-server results for the whole workload. We will call this Virt-H instead of TPC-H in order to comply with the TPC rules: Use of the TPC-H label requires an audit.

The test consists of a bulk load followed by two runs. Each run consists of a single user power test and a multi-user throughput test. The number of users in the throughput test is up to the test sponsor but must be at least 5 for the 100 GB scale. The reported score is the lower of the two scores.

Result Summary

Scale Factor 100 GB
dbgen version 2.15
Load time 0:15:02
Composite qph 241,482.3
System Availability Date 2014-04-22

The price/performance is left open. The hardware costs about 5,000 euros and the software is open source, so the price per performance would be a minimum of about 0.02 euros per qph at 100 GB (roughly 5,000 euros / 241,482 qph). This is not compliant with the TPC pricing rules, though, which require 3-year maintenance contracts for all parts.

The software configuration did not use RAID; otherwise, the setup would, to the best of my knowledge, be auditable. The hardware would have to be equivalent equipment from Dell, HP, or another large brand to satisfy the TPC pricing rule.

Executive Summaries of Each Run

Run 1

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 239,785.1
Virt-H Throughput 243,191.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 241,482.3
Measurement Interval in Throughput Test (Ts) 162.935000 seconds
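
As a sanity check, the composite figure is the geometric mean of the power and throughput numbers, assuming Virt-H keeps the standard TPC-H composite definition:

  Composite Qph@100GB = sqrt( Power × Throughput )
                      = sqrt( 239,785.1 × 243,191.4 )
                      ≈ 241,482.3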

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:17:46 2014-04-21 21:18:33 0:00:47
Stream 1 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 2 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 3 2014-04-21 21:18:33 2014-04-21 21:21:06 0:02:33
Stream 4 2014-04-21 21:18:33 2014-04-21 21:21:10 0:02:37
Stream 5 2014-04-21 21:18:33 2014-04-21 21:21:16 0:02:43
Refresh 0 2014-04-21 21:17:46 2014-04-21 21:17:49 0:00:03
  2014-04-21 21:17:50 2014-04-21 21:17:51 0:00:01
Refresh 1 2014-04-21 21:19:25 2014-04-21 21:19:38 0:00:13
Refresh 2 2014-04-21 21:18:33 2014-04-21 21:18:48 0:00:15
Refresh 3 2014-04-21 21:18:49 2014-04-21 21:19:01 0:00:12
Refresh 4 2014-04-21 21:19:01 2014-04-21 21:19:13 0:00:12
Refresh 5 2014-04-21 21:19:13 2014-04-21 21:19:25 0:00:12

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.311882 0.383459 1.143286 0.439926 1.594027 0.736482 1.440826 1.198925
Stream 1 5.192341 0.952574 6.184940 1.194804 6.998207 5.122059 5.962717 6.773401
Stream 2 7.354001 1.191604 4.238262 1.770639 5.782669 1.357578 4.034697 6.354747
Stream 3 6.489788 1.585291 4.645022 3.358926 7.904636 3.220767 5.694622 7.431067
Stream 4 5.609555 1.066582 6.740518 2.503038 9.439980 3.424101 4.404849 4.256317
Stream 5 10.346825 1.787459 4.391000 3.151059 4.974037 2.932079 6.191782 3.619255
Min Qi 5.192341 0.952574 4.238262 1.194804 4.974037 1.357578 4.034697 3.619255
Max Qi 10.346825 1.787459 6.740518 3.358926 9.439980 5.122059 6.191782 7.431067
Avg Qi 6.998502 1.316702 5.239948 2.395693 7.019906 3.211317 5.257733 5.686957
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.476940 2.004782 2.070967 1.015134 7.995799 2.142581 1.989357 1.581758
Stream 1 11.351299 6.657059 7.719765 5.157236 25.156379 8.566067 7.028898 8.146883
Stream 2 13.954105 8.341359 10.265949 3.289724 25.249435 6.370577 11.262650 7.684574
Stream 3 13.597277 5.783821 5.944240 5.214661 24.253991 8.742896 7.701709 5.801641
Stream 4 15.612070 6.126494 4.533748 5.733828 23.021583 6.423207 8.358223 6.866477
Stream 5 8.421209 9.040726 7.799425 3.908758 23.342975 9.934672 11.455598 8.258504
Min Qi 8.421209 5.783821 4.533748 3.289724 23.021583 6.370577 7.028898 5.801641
Max Qi 15.612070 9.040726 10.265949 5.733828 25.249435 9.934672 11.455598 8.258504
Avg Qi 12.587192 7.189892 7.252625 4.660841 24.204873 8.007484 9.161416 7.351616
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.258070 0.981896 1.161602 1.933124 2.203497 1.042949 3.349407 1.296630
Stream 1 8.213340 4.070175 5.662723 12.260503 7.792825 3.323136 9.296430 3.939927
Stream 2 16.754827 3.895688 4.413773 7.529466 6.288539 2.717479 11.222082 4.135510
Stream 3 8.486809 2.615640 7.426936 7.274289 6.706145 3.402654 8.278881 4.260483
Stream 4 12.604905 7.735042 5.627039 6.343302 7.242370 3.492640 6.503095 3.698821
Stream 5 8.221733 2.670036 5.866626 13.108081 9.428098 4.282014 8.213320 4.088321
Min Qi 8.213340 2.615640 4.413773 6.343302 6.288539 2.717479 6.503095 3.698821
Max Qi 16.754827 7.735042 7.426936 13.108081 9.428098 4.282014 11.222082 4.260483
Avg Qi 10.856323 4.197316 5.799419 9.303128 7.491595 3.443585 8.702762 4.024612

Run 2

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 257,944.7
Virt-H Throughput 240,998.0
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 249,327.4
Measurement Interval in Throughput Test (Ts) 164.417000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:21:20 2014-04-21 21:22:01 0:00:41
Stream 1 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 2 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 3 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 4 2014-04-21 21:22:02 2014-04-21 21:24:44 0:02:42
Stream 5 2014-04-21 21:22:02 2014-04-21 21:24:46 0:02:44
Refresh 0 2014-04-21 21:21:20 2014-04-21 21:21:22 0:00:02
  2014-04-21 21:21:22 2014-04-21 21:21:23 0:00:01
Refresh 1 2014-04-21 21:22:49 2014-04-21 21:23:04 0:00:15
Refresh 2 2014-04-21 21:22:01 2014-04-21 21:22:14 0:00:13
Refresh 3 2014-04-21 21:22:14 2014-04-21 21:22:27 0:00:13
Refresh 4 2014-04-21 21:22:26 2014-04-21 21:22:39 0:00:13
Refresh 5 2014-04-21 21:22:39 2014-04-21 21:22:49 0:00:10

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.437262 0.227516 1.172620 0.541201 1.542084 0.743255 1.459368 1.183166
Stream 1 5.205225 0.499833 4.854558 4.818087 5.920773 3.347414 5.446411 3.723247
Stream 2 5.833803 0.659051 6.023266 3.123523 4.358200 3.371315 6.772453 4.978415
Stream 3 6.308935 0.662744 7.573807 5.000859 5.282467 4.391930 5.280472 7.852718
Stream 4 5.791856 0.421592 5.953592 4.688037 9.949038 3.098282 4.153124 4.824209
Stream 5 13.537098 1.760386 3.308982 2.299178 4.882695 2.652497 5.383128 10.178447
Min Qi 5.205225 0.421592 3.308982 2.299178 4.358200 2.652497 4.153124 3.723247
Max Qi 13.537098 1.760386 7.573807 5.000859 9.949038 4.391930 6.772453 10.178447
Avg Qi 7.335383 0.800721 5.542841 3.985937 6.078635 3.372288 5.407118 6.311407
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.441940 1.948770 2.154384 1.148494 6.014453 1.647725 1.437587 1.585284
Stream 1 14.127674 7.824844 7.100679 3.586457 28.216115 7.587547 9.859152 5.829869
Stream 2 16.102880 7.676986 5.887327 2.796729 24.847035 7.146757 11.408922 7.641239
Stream 3 15.678701 5.786427 9.221883 2.692321 28.434916 6.657457 8.219745 7.706585
Stream 4 11.985421 10.182807 5.667618 6.875264 27.547492 7.438075 9.065924 8.895070
Stream 5 6.913707 7.662703 8.657333 3.282895 24.126612 10.963691 12.138564 7.962654
Min Qi 6.913707 5.786427 5.667618 2.692321 24.126612 6.657457 8.219745 5.829869
Max Qi 16.102880 10.182807 9.221883 6.875264 28.434916 10.963691 12.138564 8.895070
Avg Qi 12.961677 7.826753 7.306968 3.846733 26.634434 7.958705 10.138461 7.607083
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.275267 1.139390 1.165591 2.073658 2.261869 0.703055 2.327755 1.146501
Stream 1 13.720792 4.428528 3.651645 9.841610 6.710473 2.595879 9.783844 3.800103
Stream 2 12.532257 2.312755 6.182661 8.666967 9.383983 1.414853 7.570509 4.539598
Stream 3 7.578779 3.342352 8.155356 4.925493 6.590047 2.612912 8.497542 4.638512
Stream 4 10.967178 2.173935 6.382803 5.082562 8.744671 3.074768 7.577794 4.435140
Stream 5 9.438581 2.551124 8.375607 8.339441 8.201650 1.982935 7.334306 3.404017
Min Qi 7.578779 2.173935 3.651645 4.925493 6.590047 1.414853 7.334306 3.404017
Max Qi 13.720792 4.428528 8.375607 9.841610 9.383983 3.074768 9.783844 4.638512
Avg Qi 10.847517 2.961739 6.549614 7.371215 7.926165 2.336269 8.152799 4.163474

Details of System Under Test (SUT)

Hardware

Chassis Supermicro 2U
Motherboard Supermicro X9DR3-LN4F+
CPU 2 x Intel Xeon E5-2630 @ 2.3 GHz
(6 cores, 12 threads each;
total 12 cores, 24 threads)
RAM 192 GB DDR3 (24 x 8 GB, 1066MHz)
Storage 2 x Crucial 512 GB SSD
 

Software

DBMS Virtuoso Open Source 7.11.3209
(feature/analytics on v7fasttrack on GitHub)
OS CentOS 6.2

Conclusions

This experiment places Virtuoso in the same ballpark as Actian Vector (formerly branded VectorWise), which has dominated the TPC-H scoreboard in recent years. The published Vector results are on more cores and/or faster clocks; one would have to run on the exact same platform to make precise comparisons.

Virtuoso ups the ante by providing this level of performance in open source. For a comparison with EXASolution and Actian Matrix (formerly ParAccel), we will have to go to the Virtuoso scale-out configuration, to follow shortly.

The next articles will provide a detailed analysis of performance and instructions for reproducing the results. The run outputs and scripts are available for download.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

Posted at 15:30

Norm Walsh: Another side of summer

A cocktail recipe from the Pacific northwest. Imitation is the sincerest form of flattery.

Posted at 00:47

April 21

Bob DuCharme: RDF lists and SPARQL

Not great, but not terrible, and a bit better with SPARQL 1.1

Posted at 13:35

April 16

Dublin Core Metadata Initiative: Inaugural chairs named to DCMI Advisory Board standing committees

2014-04-16, DCMI is pleased to announce the inaugural chairs of the Advisory Board's new standing committees. Ana Alice Baptista, University of Minho, Portugal, and Wei (Keven) Liu, Shanghai Library, will serve as co-chairs of the Education & Outreach Committee. Muriel Foulonneau, Public Research Centre Henri Tudor, Luxembourg, and Emma Tonkin, King's College London, will co-chair the Conferences & Meetings Committee. The inaugural chairs will shepherd their respective committees through refinement of the committee charges and formulation of working processes. For the full announcement of the new appointments, see Advisory Board Chair Marcia Zeng's message to the DCMI Community at http://dublincore.org/news/chairMessage/20140416/.

Posted at 23:59

Dublin Core Metadata Initiative: Extension of DC-2014 "Call for Participation" to 17 May 2014

2014-04-16, The deadline for submission of full papers, project reports and posters to the peer reviewed program for DC-2014 in Austin, Texas has been extended by two weeks to 17 May 2014. In addition to submissions related to the conference theme -- Metadata Intersections: Bridging the Archipelago of Cultural Memory -- submissions are welcome on any topic addressing metadata models, technologies and applications. Submissions describing innovative best practices in metadata are welcome from practitioners as well as researchers and application developers. The "Call for Submissions" can be found at http://purl.org/dcevents/dc-2014/cfp.

Posted at 23:59

Dublin Core Metadata Initiative: Become a DC-2014 sponsor!

2014-04-16, The Texas Digital Library and the Conference Committee for DC-2014 in Austin, Texas in October are pleased to offer a limited number of sponsors the opportunity to present themselves directly to the conference participants and to a global audience beyond the conference venue. The sponsorship categories for DC-2014 range from providing fellowship funding, which supports access to the conference and to pre- and post-conference activities for recipients otherwise unable to attend, to providing flash drives containing the open proceedings. Additional information regarding the categories of sponsorship and how to become a DC-2014 Sponsor can be found at http://dcevents.dublincore.org/IntConf/index/pages/view/2014-cfs.

Posted at 23:59

schema.org: Announcing Schema.org Actions



When we launched schema.org almost 3 years ago, our main focus was on providing vocabularies for describing entities --- people, places, movies, restaurants, ... But the Web is not just about static descriptions of entities. It is about taking action on these entities --- from making a reservation to watching a movie to commenting on a post.

Today, we are excited to start the next chapter of schema.org and structured data on the Web by introducing vocabulary that enables websites to describe the actions they enable and how these actions can be invoked.

The new actions vocabulary is the result of over two years of intense collaboration and debate amongst the schema.org partners and the larger Web community. Many thanks to all those who participated in these discussions, in particular to members of the Web Schemas and Hydra groups at W3C. We are hopeful that these additions to schema.org will help unleash new categories of applications.

Jason Douglas,
Sam Goto (Google)

Steve Macbeth, 
Jason Johnson (Microsoft)

Alexander Shubin (Yandex)

Peter Mika (Yahoo)

To learn more, see the overview document or Action, potentialAction and EntryPoint on schema.org.

Posted at 11:07

April 14

Frederick Giasson: The Open Semantic Framework Academy

The Open Semantic Framework Academy YouTube channel has just been released this morning.

The Open Semantic Framework Academy is a dedicated channel for instructional screencasts on OSF. Via its growing library of videos, the OSF Academy is your one-stop resource for how to deploy, manage and use the Open Semantic Framework. OSF is a complete, turnkey stack of semantic technologies and methods for enterprises of all sizes.

All the aspects and features of the Open Semantic Framework will be covered in this series of screencasts. Dozens of such screencasts will be published in the coming month or two. They are a supplement to the OSF Wiki documentation, but they are not meant to be a replacement.

Intro to the Open Semantic Framework (OSF)

This kick-off video to the OSF Academy gives an overview of the Open Semantic Framework platform and describes it in terms of the 5 Ws (welcome, why, what, when, where) and the 1 H (how).


Installing Core OSF (Open Semantic Framework)

This screencast introduces you to the Open Semantic Framework. Then it shows you how to install OSF using the OSF Installer script on an Ubuntu server. Finally, it introduces the system integration tests that use the OSF Tests Suites.


Open Semantic Framework Web Resources

This screencast will show you all the websites that exist to help you learn about the Open Semantic Framework. These include the OSF main site, the OSF Wiki, Mike Bergman's and Fred Giasson's blogs, and demo portals such as Citizen DAN, NOW, MyPeg, HealthDirect and Pregnancy Birth and Babies.


Posted at 14:17

April 07

Orri Erling: In Hoc Signo Vinces (part 11 of n): TPC-H Q2, Q10 - Late Projection

Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (e.g., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).

In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.

We look at TPC-H Q2 and Q10.

Q2:

  SELECT  TOP 100
                   s_acctbal,
                   s_name,
                   n_name,
                   p_partkey,
                   p_mfgr,
                   s_address,
                   s_phone,
                   s_comment
    FROM  part,
          supplier,
          partsupp,
          nation,
          region
   WHERE  p_partkey = ps_partkey
     AND  s_suppkey = ps_suppkey
     AND  p_size = 15
     AND  p_type LIKE '%BRASS'
     AND  s_nationkey = n_nationkey
     AND  n_regionkey = r_regionkey
     AND  r_name = 'EUROPE'
     AND  ps_supplycost = 
            ( SELECT  MIN(ps_supplycost)
                FROM  partsupp,
                      supplier,
                      nation,
                      region
               WHERE  p_partkey = ps_partkey
                 AND  s_suppkey = ps_suppkey
                 AND  s_nationkey = n_nationkey
                 AND  n_regionkey = r_regionkey
                 AND  r_name = 'EUROPE'
            )
ORDER BY  s_acctbal DESC,
          n_name,
          s_name,
          p_partkey

The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.

Q10:

  SELECT  TOP 20
                                                     c_custkey,
                                                     c_name,
          SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                                                     c_acctbal, 
                                                     n_name,
                                                     c_address,
                                                     c_phone,
                                                     c_comment
    FROM  customer,
          orders,
          lineitem,
          nation
   WHERE  c_custkey = o_custkey
     AND  l_orderkey = o_orderkey
     AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
     AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
     AND  l_returnflag = 'R'
     AND  c_nationkey = n_nationkey
GROUP BY  c_custkey,
          c_name,
          c_acctbal,
          c_phone,
          n_name,
          c_address,
          c_comment
ORDER BY  revenue DESC

The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.

We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.

The profile for Q2 follows:

{ 
time   6.1e-05% fanout         1 input         1 rows
time       1.1% fanout         1 input         1 rows
{ hash filler
Subquery 27 
{ 
time    0.0012% fanout         1 input         1 rows
REGION         1 rows(t10.R_REGIONKEY)
 R_NAME = <c EUROPE>
time   0.00045% fanout         5 input         1 rows
NATION         5 rows(t9.N_NATIONKEY)
 N_REGIONKEY = t10.R_REGIONKEY
time       1.6% fanout     40107 input         5 rows
SUPPLIER   4.2e+04 rows(t8.S_SUPPKEY)
 S_NATIONKEY = t9.N_NATIONKEY
 
After code:
      0: t8.S_SUPPKEY :=  := artm t8.S_SUPPKEY
      4: BReturn 0
time       0.1% fanout         0 input    200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time    0.0004% fanout         1 input         1 rows
{ fork
time        21% fanout     79591 input         1 rows
PART     8e+04 rows(.P_PARTKEY)
 P_TYPE LIKE <c %BRASS> LIKE <c > ,  P_SIZE =  15 
time        44% fanout  0.591889 input     79591 rows
 
Precode:
      0: { 
time     0.083% fanout         1 input     79591 rows
time      0.13% fanout         1 input     79591 rows
{ fork
time        24% fanout  0.801912 input     79591 rows
PARTSUPP       3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
 inlined  PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card       0.2 -> ()
time       1.3% fanout         0 input     63825 rows
Hash source 49 merged into ts not partitionable       0.2 rows(.PS_SUPPKEY) -> ()
 
After code:
      0:  min min.PS_SUPPLYCOSTset no set_ctr
      5: BReturn 0
}
 
After code:
      0: aggregate :=  := artm min
      4: BReturn 0
time      0.19% fanout         0 input     79591 rows
Subquery Select(aggregate)
}
 
      8: BReturn 0
PARTSUPP     5e-08 rows(.PS_SUPPKEY)
 inlined  PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time       5.9% fanout  0.247023 input     47109 rows
SUPPLIER unq       0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
 inlined  S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time     0.077% fanout         1 input     11637 rows
NATION unq         1 rows (.N_REGIONKEY, .N_NAME)
 inlined  N_NATIONKEY = .S_NATIONKEY
time     0.051% fanout         1 input     11637 rows
REGION unq       0.2 rows ()
 inlined  R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time      0.42% fanout         0 input     11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
 
}
time    0.0016% fanout       100 input         1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time      0.02% fanout         1 input       100 rows
PART unq      0.95 rows (.P_MFGR)
 inlined  P_PARTKEY = .P_PARTKEY
time     0.054% fanout         1 input       100 rows
SUPPLIER unq         1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
 inlined  S_SUPPKEY = k_.S_SUPPKEY
time   6.7e-05% fanout         0 input       100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}


 128 msec 1007% cpu,    196992 rnd 2.53367e+07 seq   50.4135% same seg   45.3574% same pg 

The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum price partsupp has already been seen. The gain would not be very large, though.

Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as there are 100 suppliers found, one can add a restriction on the s_acctbal for subsequent ones.

At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows whereas the previous operations are done on tens-of-thousands of rows. The TOP k restriction produces some benefit, but it is relatively late in the plan, and not many operations follow it.

The plan is easily good enough, with only small space for improvement. Q2 is one of the fastest queries of the set.

Let us now consider the execution of Q10:

{ 
time   1.1e-06% fanout         1 input         1 rows
time   4.4e-05% fanout         1 input         1 rows
{ hash filler
time   1.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, .N_NAME)
 
time   6.7e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
 
}
time   1.5e-06% fanout         1 input         1 rows
{ fork
time   2.4e-06% fanout         1 input         1 rows
{ fork
time        13% fanout 5.73038e+06 input         1 rows
ORDERS   5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
 O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time       4.8% fanout   2.00042 input 5.73038e+06 rows
LINEITEM       1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
 inlined  L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time        25% fanout         1 input 1.14632e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
CUSTOMER unq         1 rows (.C_NATIONKEY, .C_CUSTKEY)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card         1 -> (.N_NAME)
time    0.0023% fanout         1 input 1.14632e+07 rows
Hash source 35 merged into ts          1 rows(.C_NATIONKEY) -> (.N_NAME)
time       2.3% fanout         1 input 1.14632e+07 rows
Stage 2
time       3.6% fanout         0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
 
}
time       0.6% fanout 3.88422e+06 input         1 rows
group by read node  
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time      0.57% fanout         0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
 
}
time   6.9e-06% fanout        20 input         1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time   0.00036% fanout         1 input        20 rows
CUSTOMER unq         1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
 inlined  C_CUSTKEY = .C_CUSTKEY
time   1.1e-06% fanout         0 input        20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}


 2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq   76.3221% same seg   21.1204% same pg 

The plan is by index, except for the lookup of the nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit. See the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only the c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, the c_custkey is not even needed, since it is always equal to the o_custkey, but this makes little difference.

One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.

The c_nationkey is next translated into an n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many different c_custkey values (15M at 100 GB scale).

The most important trick is fetching all the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is only executed on 20 rows.
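
To make the trick concrete, here is a hand-written SQL sketch of what late projection amounts to for Q10. This is only an illustration of the idea, not the rewrite the engine actually performs internally:

  SELECT  c.c_custkey,
          c.c_name,
          t.revenue,
          c.c_acctbal,
          t.n_name,
          c.c_address,
          c.c_phone,
          c.c_comment
    FROM  ( SELECT  TOP 20
                    c_custkey,
                    n_name,
                    SUM(l_extendedprice * (1 - l_discount)) AS revenue
              FROM  customer, orders, lineitem, nation
             WHERE  c_custkey = o_custkey
               AND  l_orderkey = o_orderkey
               AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
               AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
               AND  l_returnflag = 'R'
               AND  c_nationkey = n_nationkey
          GROUP BY  c_custkey, n_name
          ORDER BY  revenue DESC
          ) t,
          customer c        -- the wide columns are fetched for only 20 rows here
   WHERE  c.c_custkey = t.c_custkey
ORDER BY  t.revenue DESC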

Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.

There is another approach to the challenge of this query: If foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
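
For reference, declaring the relevant foreign keys would look roughly as follows. As noted above, this implementation does not declare them, so the DDL is purely illustrative:

  -- illustrative only; not used in the reported runs
  ALTER TABLE orders   ADD FOREIGN KEY (o_custkey)   REFERENCES customer (c_custkey);
  ALTER TABLE customer ADD FOREIGN KEY (c_nationkey) REFERENCES nation (n_nationkey);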

The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the other ones. A row store can also benefit from this in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index rather than a scan.

*           *           *           *           *

In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.

So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

Posted at 16:28

April 02

Semantic Web Company (Austria): American Physical Society Taxonomy – Case Study


Joseph A Busch

Taxonomy Strategies has been working with the American Physical Society (APS) to develop a new faceted classification scheme.

The proposed scheme includes several discrete sets of categories called facets whose values can be combined to express concepts such as existing Physics and Astronomy Classification Scheme (PACS) codes, as well as new concepts that have not yet emerged, or have been difficult to express with the existing PACS.

PACS codes formed a single-hierarchy classification scheme, designed to assign the “one best” category that an item will be classified under. Classification schemes come from the need to physically locate objects in one dimension, for example in a library where a book will be shelved in one and only one location, among an ordered set of other books. Traditional journal tables of contents similarly place each article in a given issue in a specific location among an ordered set of other articles, certainly a necessary constraint with paper journals and still useful online as a comfortable and familiar context for readers.

However, the real world of concepts is multi-dimensional. In collapsing to one dimension, a classification scheme makes essentially arbitrary choices that have the effect of placing some related items close together while leaving other related items in very distant bins. It also has the effect of repeating the terms associated with the last dimension in many different contexts, leading to an appearance of significant redundancy and complexity in locating terms.

A faceted taxonomy attempts to identify each stand-alone concept through the term or terms commonly associated with it, and to have it mean the same thing wherever it is used. Hierarchy in a taxonomy is useful to group related terms together; however, the intention is not to identify an item such as an article or book by a single concept, but rather to assign multiple concepts that together represent its meaning. In that way, related items can be closely associated along multiple dimensions corresponding to each assigned concept. Where previously a single PACS code was used to indicate the research area, now two, three, or more of the new concepts may be needed (although often a single new concept will be sufficient). This requires a mindset and approach different from the way APS has been accustomed to working with PACS; however, it also enables significant new capabilities for publishing and working with all types of content, including articles, papers and websites.

To build and maintain the faceted taxonomy, APS has acquired the PoolParty taxonomy management tool. PoolParty will enable APS editorial staff to create, retrieve, update and delete taxonomy term records. The tool will support the various thesaurus, knowledge organization system and ontology standards for concepts, relationships, alternate terms etc. It will also provide methods for:

  • Associating taxonomy terms with content items, and storing that association in a content index record.
  • Automated indexing to suggest taxonomy terms that should be associated with content items, and text mining to suggest terms to potentially be added to the taxonomy.
  • Integrating taxonomy term look-up, browse and navigation in a selection user interface that, for example, authors and the general public could use.
  • Implementing a feedback user interface allowing authors and the general public to suggest terms, record the source of the suggestion, and inform the user on the disposition of their suggestion.

Arthur Smith, project manager for the new APS taxonomy, notes: “PoolParty allows our subject matter experts to immediately visualize the layout of the taxonomy, to add new concepts, suggest alternatives, and to map out the relationships and mappings to other concept schemes that we need. While our project is still in an early stage, the software tool is already proving very useful.”

About

Taxonomy Strategies (www.taxonomystrategies.com) is an information management consultancy that specializes in applying taxonomies, metadata, automatic classification, and other information retrieval technologies to the needs of business and other organizations.

The American Physical Society (www.aps.org) is a non-profit membership organization working to advance and diffuse the knowledge of physics through its outstanding research journals, scientific meetings, and education, outreach, advocacy and international activities. APS represents over 50,000 members, including physicists in academia, national laboratories and industry in the United States and throughout the world. Society offices are located in College Park, MD (Headquarters), Ridge, NY, and Washington, DC.


Posted at 11:37

Redlink: Build your first Semantic Application with Redlink

Our motivation is to bring the power of content analysis and linked data technologies to a wider audience of IT integrators, and hence end-users. To deliver on this promise, we are keenly aware that the complexity of such technologies needs to be kept under the hood, keeping you focused on what is important, i.e. how to add value to your data.

In this screencast we show how easy it is for non-experts to setup their first Redlink App that analyses content based on a customised dataset (user dictionary). With this simple App you can automatically analyse all your legacy data according to the facts that are most important to your business needs. This is our first version of the Redlink dashboard and we are hard at work on the new version that promises to put more powerful content analysis features at your fingertips.

If you’re interested in participating in our private beta, register to get involved

Lead the innovation of Content Analysis and Open Source

Since our inception last year, we have been focused on leading the delivery of semantic technologies as a full data platform completely based on open standards, thanks to the contributions of so many to Apache Stanbol and Apache Marmotta. These contributions address critical requirements, and here at Redlink our goal is to make these technologies accessible and enterprise-ready.

Extend and enable the ecosystem. A modern approach to NLP

We are also focused on enabling a broader ecosystem of NLP vendors and we have deep engineering and go-to-market partnerships with research institutions and startups active in the content analysis market to extend Redlink content analysis capabilities even further.

Register to our Private Beta

 

Posted at 11:01

March 31

W3C Read Write Web Community Group: Read Write Web — Q1 Summary — 2014

Summary

This month the Web celebrated its 25th birthday. Celebrating web25, Tim Berners-Lee posed 3 important questions. 1. How do we connect the nearly two-thirds of the planet who can’t yet access the Web? 2. Who has the right to collect and use our personal data, for what purpose and under what rules? 3. How do we create a high-performance open architecture that will run on any device, rather than fall back into proprietary alternatives? Join the discussion at the Web We Want campaign and perhaps we can help make the web more interactive and more read/write.

Two important technologies became W3C Recommendations this month: the long-awaited JSON-LD and RDF 1.1. Well done to everyone involved in reaching these milestones.

This community group is now 2.5 years old. Congrats to our co-chair Andrei Sambra, who has moved over to work with Tim, Joe and team at MIT. Some work on identity, applications and libraries has moved forward this quarter; more below!

Communications and Outreach

In Paris this month there was a well-attended workshop on Web Payments. The group hopes to take a set of specs to REC status, including some on identity credentials and access-controlled reading and writing of user profiles.

Andrei had a productive talk with Frank Karlitschek, the creator of the popular personal data store OwnCloud. Hopefully it will be possible to mutually benefit by reusing some of the ideas created in this group.

Community Group

There’s been some updated software for our W3C CG blogging platform, so anyone who wishes to make a post related to read/write web standards can ping Andrei or myself, or just dive in!

There have been some useful contributions to rdflib.js, and a new library proposes the ‘pointed graph’ idea. I’ve also taken the feedback on the User Header discussions we have had and put it in a wiki page. Additionally, the WebID specs now have a permanent home.

There have also been discussions on deeplinking / bookmarkability for single page apps.  I’ve also been working on an ontology for crypto currencies which I am hoping to integrate into the RWW via a tipping robot next quarter.


Applications

Great work from the guys over at MIT on a new decentralized blogging platform, Cimba. Cimba is a 100% client-side app; it can run on a host, on GitHub, or even on your local file system. Feel free to sign up and start some channels, or just take a look at the screencast.

In addition to Cimba, Webizen has been launched to help you search for connections more easily.  Search for friends on the decentralized social web, or add your own public URI.


Last but not least…

For those of you who enjoy SPARQL, Linked Data Fragments presents new ways to query linked data via a web client. This tool is designed to provide a ‘fragment’ of a whole data set with high reliability, so happy SPARQLing!

Posted at 18:35

Orri Erling: OpenPHACTS in Vienna

Hugh Williams and I (Orri Erling) went to the Open PHACTS Steering Committee meeting in Vienna last week. I am a great fan of Open PHACTS; the meetings are fun, with a great team spirit, and there is always something new to learn.

Paul Groth gave a talk about the stellar success of the initial term of Open PHACTS.

  • Three releases of platform and data
  • 18 applications
  • Open PHACTS Foundation for sustainable exploitation and further development of the platform
  • Superb culture of collaboration
    • great team spirit
    • great output from distributed organization
    • lots of face-to-face time
    • example to every other big collaborative project

"The reincarnation of Steve Jobs," commented someone from the audience. "Except I am a nice guy," retorted Paul.

Commented one attendee, "The semantic web… I just was in Boston at a semantic web meeting – so nerdy, something to make you walk out of the room… so it is a definite victory for Open PHACTS and why not also semantic web, that something based on these principles actually works."

It is a win anyhow, so I did not say anything at the meeting. Instead I will say something here, where I have more space, as the message bears repeating.

We share part of the perception, so we hardly ever say "semantic web." The word is "linked data," and it means flexible schema and global identifiers. Flexible schema means that everything does not have to be modeled upfront. Global identifiers means that data, when transferred out of its silo of origin, remains interpretable and self-describing, so you can mix it with other data without things getting confused. "Desiloization" is a wonderful new word for describing this.

This ties right into FAIRport and FAIR data: Findable, Accessible, Interoperable, Reusable. Barend Mons talked a lot about this: open just means downloadable; fair means something you can do science with. Barend’s take is that RDF with a URI for everything is the super wire format for exchanging data. When you process it, you will diversely cook it, so an RDF store is one destination but not the only possibility. It has been said before: there is a range of choices between storing triples verbatim, and making application specific extractions, including ones with a schema, whether graph DB or relational.

Nanopublications are also moving ahead. Christine Chichester told me about pending publications involving Open PHACTS nanopublications about post-translational modification of proteins and their expression in different tissues. So there are nanopublications out there and they can be joined, just as intended. A victory for e-science and data integration.

The Open PHACTS project is now officially extended for another two-year term, bringing the total duration to five years. The Open PHACTS Foundation exists as a legal entity and has its first members. This is meant to be a non-profit industry association for sharing of pre-competitive data and services around these between players in the pharma space, in industry as well as academia. There are press releases to follow in due time.

I am looking forward to more Open PHACTS. From the OpenLink and Virtuoso side, there are directly relevant developments that will enter production in the next few months, including query caching discussed earlier on this blog, as well as running on the TPC-H tuned analytics branch for overall better query optimization. Adaptive schema is something of evident value to Open PHACTS, as much of the integrated data comes from relational sources, so is regular enough. Therefore taking advantage of this for storage cannot hurt. We will see this still within the scope of the project extension.

Otherwise, more cooperation in formulating the queries for the business questions will also help.

All in all, Open PHACTS is the celebrated beauty queen of all the Innovative Medicine Initiative, it would seem. Superbly connected, unparalleled logo cloud, actually working and useful data integration, delivering on time on all in fact very complex business questions.

Posted at 15:49

Semantic Web Company (Austria): Why SKOS should be a focal point of your linked data strategy


The Simple Knowledge Organization System (SKOS) has become one of the ‘sweet spots’ in the linked data ecosystem in recent years. Especially when semantic web technologies are being adapted for the requirements of enterprises or public administration, SKOS has played a central role in creating knowledge graphs.

In this webinar, key people from the Semantic Web Company will describe why controlled vocabularies based on SKOS play a central role in a linked data strategy, and how SKOS can be enriched by ontologies and linked data to further improve semantic information management.

SKOS unfolds its potential at the intersection of three disciplines and their methods:

  • library sciences: taxonomy and thesaurus management
  • information sciences: knowledge engineering and ontology management
  • computational linguistics: text mining and entity extraction

Linked Data based IT-architectures cover all three aspects and provide means for agile data, information, and knowledge management.

In this webinar, you will learn about the following questions and topics:

  • How does SKOS build the foundation of enterprise knowledge graphs, and how can it be enriched by additional vocabularies and ontologies?
  • How can knowledge graphs be used to build the backbone of metadata services in organisations?
  • How can text mining be used to create high-quality taxonomies and thesauri?
  • How can knowledge graphs be used for enterprise information integration?

Based on the PoolParty Semantic Suite, you will see several live demos of end-user applications built on linked data, as well as of PoolParty’s latest release, which provides outstanding facilities for professional linked data management, including taxonomy, thesaurus and ontology management.

Register here: https://www4.gotomeeting.com/register/404918583

 

Posted at 11:04

March 30

Egon Willighagen: Linked Open Drug Data: three years on

Almost three years ago I collaborated with others in the W3C Health Care and Life Sciences interest group. One of the results of that was a paper in the special issue around the semantic web conference at one of the biannual, national ACS meetings (look at this nice RDFa-rich meeting page!). My contribution was around the ChEMBL-RDF, which I recently finally published, though it was already described earlier in an HCLS note.

Anyway, when this paper reached the most-viewed paper position in the JChemInf journal, and I tweeted that event, I was asked for an update of the linked data graph (the darker nodes are the twelve data sets the LODD task force worked on). A good question indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STITCH are, but not all; CAS is not at all, and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network, which hosted many data sets (Open or not), is down. Chem2Bio2RDF seems down too.

Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and has integrated a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data, which exposes quite a few data sets, some originating from the Berlin node.

Of course, DBpedia is still around too. Also important is that more and more databases themselves provide RDF, like UniProt (which has a SPARQL endpoint in beta), WikiPathways, PubChem, and ChEMBL at the EBI. And more will come, /me thinks.


I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc), crystal structures (I think COD is working on an RDF version), etc, etc. And, what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?

Posted at 15:00

March 28

John Goodwin: Visualising the Location Graph – example with Gephi and Ordnance Survey linked data

This is arguably a simpler follow-up to my previous post.

Posted at 11:56

March 27

Ebiquity research group UMBC: Do not be a Gl***hole, use Face-Block.me!

If you are a Google Glass user, you might have been greeted with concerned looks or raised eyebrows in public places. There has been a lot of chatter on the “interweb” regarding the loss of privacy that results from people taking your picture with Glass without notice. Google Glass has simplified photography, but as happens with revolutionary technology, people are worried about potential misuse.

FaceBlock helps to protect the privacy of people around you by allowing them to specify whether or not to be included in your pictures. This new application, developed through a collaboration between researchers from the Ebiquity Research Group at the University of Maryland, Baltimore County and the Distributed Information Systems (DIS) group at the University of Zaragoza (Spain), selectively obscures the faces of people in pictures taken with Google Glass.

Comfort at the cost of Privacy?

As the saying goes, “The best camera is the one that’s with you”. Google Glass suits this description, as it is always available and can take a picture with a simple voice command (“Okay Glass, take a picture”). This allows users to capture spontaneous life moments effortlessly. On the flip side, it raises significant privacy concerns, as pictures can be taken without one’s consent. If one does not use this device responsibly, one risks being labelled a “Glasshole”. Quite recently, a Google Glass user was assaulted by patrons who objected to her wearing the device inside a bar. The list of establishments that have banned Google Glass within their premises is growing day by day. The dos and don’ts for Glass users released by Google are a good first step, but they don’t solve the problem of privacy violation.


Privacy-Aware pictures to the rescue

FaceBlock takes regular pictures taken by your smartphone or Google Glass as input and converts them into privacy-aware pictures. This output is generated using a combination of face detection and face recognition algorithms. Using FaceBlock, a user can take a picture of herself and specify her policy/rule regarding pictures taken by others (in this case, ‘obscure my face in pictures from strangers’). The application automatically generates a face identifier for this picture; the identifier is a mathematical representation of the image. To learn more about how FaceBlock works, you should watch the following video.

Using Bluetooth, FaceBlock can automatically detect and share this policy with Glass users nearby. After receiving this face identifier from a nearby user, the following post-processing steps happen on Glass, as shown in the images.


What promises does it hold?

FaceBlock is a proof-of-concept implementation of a system that can create privacy-aware pictures using smart devices. The pervasiveness of privacy-aware pictures could be a step in the right direction towards balancing privacy needs with the comfort afforded by technology. Thus, we can get the best out of wearable technology without being oblivious to the privacy of those around us.

FaceBlock is part of the efforts of Ebiquity and SID in building systems for preserving user privacy on mobile devices. For more details, visit http://face-block.me

Posted at 18:13

: Simplified SPARQL Aggregates

We have observed that SPARQL 1.1’s syntax for aggregates tends to surprise many, if not most, people. According to the standard SPARQL 1.1 grammar, to write a COUNT() query you have to be somewhat verbose:

SELECT (COUNT(*) AS ?count) WHERE {?s ?p ?o}

Note the required explicit variable name, ?count, as well as the required parentheses surrounding the projected expression. With a strictly standards-compliant SPARQL query processor, omitting either the parentheses or the variable binding will result in a syntax error. This was also the case with Dydra until recently.

As of today, however, we also support a simplified and abbreviated convenience syntax for aggregates, one that is more in line with the previous experience and initial expectations of users coming from, say, an SQL background:

SELECT COUNT(*) WHERE {?s ?p ?o}

The same convenience syntax is available for all the familiar COUNT, SUM, MIN, MAX, and AVG aggregates.
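
For example, the abbreviated form works the same way with the other aggregates; the prefix and predicate here are made up purely for illustration:

PREFIX ex: <http://example.org/>
SELECT AVG(?price) WHERE { ?product ex:price ?price }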

To write maximally portable SPARQL queries, you should probably continue to use the standard syntax in any non-trivial software you release. However, when initially exploring your dataset and formulating your queries using the Dydra query editor, we’re sure you’ll quickly find the abbreviated syntax to be a very welcome convenience indeed.

If you have similar pet peeves and pain points with other aspects of the SPARQL standard, let us know. Making your use of SPARQL as easy and pleasant as possible is what we’re here for.

Posted at 03:05

: DAWG Test Suite for SPOCQ

ARQ proposed a standard textual notation for abstract SPARQL algebra expressions, which it calls ‘symbolic SPARQL expressions’ (SSEs). The documentation describes expressions which are compatible with Lisp’s S-expressions and, given appropriate reader and printer support, can serve as the externalization for a model of abstract SPARQL algebra expressions represented as simple lists.
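
To give a flavour of the notation, a basic graph pattern wrapped in a filter comes out roughly as the following S-expression (the exact rendering produced by ARQ's tools may differ in minor details):

  (filter (> ?o 10)
    (bgp (triple ?s ?p ?o)))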

This approach has proven invaluable to the development of Dydra’s SPOCQ query engine. In the early phases, expositions in the literature on SPARQL 1.0 semantics made it relatively easy to implement the query processor in terms of a small collection of Lisp macros which constituted an operational semantics for SPARQL.

As the work progressed towards support for SPARQL 1.1, this abstract algebra model and the SSE notation made it easier to comprehend the intent of the W3C specification and to verify the query processor’s performance against the Data Access Working Group’s evolving test suite.

In order to document this aspect of our implementation, we have translated the query expressions from the original SPARQL form into SSEs and present them here for reference. We also record and publish our latest DAWG conformance test results here:

Posted at 03:05

: SPARQL 1.1 Now Supported

The top feature request we’ve heard from our users and customers lately has been support for SPARQL 1.1. We’ve listened, and are pleased to announce that Dydra now implements SPARQL 1.1 Query and Update in full.

SPARQL 1.1 Query

We now fully support SPARQL 1.1 Query, including aggregates, subqueries, negation and filtering, SELECT expressions, property paths, assignments and bindings, syntactic sugar for CONSTRUCT, and many additional operators and functions.
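
As a small taste of the new query features, a property path query such as the following (with a hypothetical starting resource) finds everyone reachable over one or more foaf:knows links:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person
WHERE { <http://example.org/alice#me> foaf:knows+ ?person }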

Signing up for a Dydra account is the very easiest way to get to grips with the new features in SPARQL 1.1. Our easy-to-use browser-based query editor lets you interactively write and test queries on your dataset without any hoops to jump through to install complex software or to configure and serve a SPARQL endpoint yourself.

SPARQL 1.1 Update

We also now fully support SPARQL 1.1 Update. Starting today, you can use all of the update operators on your repository data: INSERT/DELETE, INSERT DATA, DELETE DATA, LOAD, CLEAR, CREATE, DROP, COPY, MOVE, and ADD.
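
For instance, a single update request can combine several of these operations; the vocabulary and data below are hypothetical:

PREFIX ex: <http://example.org/>
INSERT DATA { ex:alice ex:knows ex:bob } ;
DELETE { ?s ex:status ?old }
INSERT { ?s ex:status "archived" }
WHERE  { ?s ex:status ?old }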

We will in the near future be updating our documentation with further details regarding the transactional semantics of Dydra, since the official SPARQL 1.1 Update specification leaves them up to each implementation.

Faster Queries

We’ve improved our query engine’s performance quite a bit. It was already pretty fast before, and it’s faster still now. In particular, we’ve improved the throughput for “worst-case” queries that perform a lot of full edge scans of large repositories. These kinds of queries can now run up to a hundred times faster.

This is, indeed, one of the benefits of hosting your SPARQL endpoint with Dydra: as we simultaneously both roll out ever more hardware capacity and work to further improve our query engine, you reap the benefits without having to lift a finger.

Faster Imports

Last but not least, we’ve overhauled how we do data imports. We now make use of the very latest release of the Raptor RDF Syntax Library, the high-performance, open-source library for parsing and serializing most RDF serialization formats as well as RSS/Atom feeds and various microformats.

Anything that Raptor can parse, Dydra can now import into your repository. The new import functionality is also considerably faster than what we offered before.

We intend to keep our internal Raptor build closely in sync with the upstream open-source release, which means that the best way to ensure Dydra can import some new format or another is to contribute to the Raptor project on GitHub.

Posted at 03:05

Welcoming Zachary Voase

This Monday marked Zachary Voase’s first hands-on day at Datagraph. Zack moved from London to Berlin a week ago and will be working daily with James Anderson and me at our premises here in Friedrichshain, with his primary role being developer evangelism for the Dydra platform.

Zack is an experienced polyglot programmer, prolific open-source hacker, and long-time Semantic Web enthusiast. His work for the last several years has been centered around Django and Rails projects written in Python and Ruby, respectively, and he’ll be taking point on our efforts in making Dydra the easiest available SPARQL solution for those developer communities. If you were in Amsterdam for DjangoCon Europe last month, you may have sat in on Zack’s talk Django on Rails: Getting Resourceful.

We’re all stoked to have Zack working on Dydra, and none more so than I: Zack was actually there at the very inception of what eventually became Dydra, back in sunny Marbella a summer or two ago, so welcoming him on board feels like at long last uniting the original team. We don’t really bother with oddball interview questions, but if we did, Zack’s off-the-cuff formulation of a SPARQL query to compute the Jaccard coefficient for movie recommendations would probably have had us convinced.

Zack blogs at blog.zacharyvoase.com and tweets and hacks as @zacharyvoase. Some of his previous Semantic Web-related blog posts include Modern Accounting on the Open Web, The Semantic Blog, and Bioinformatics and the Semantic Web. His voluminous open-source output at GitHub includes a number of widely-used Python libraries and Django apps, and soon tutorial applications for Dydra.

Welcome to the team, Zack, and we wish you the best of luck in your ongoing quest for the perfect espresso in Berlin!

Posted at 03:05

SP2B Benchmark Results

SP2Bench (hereafter referred to as SP2B) is a challenging and comprehensive SPARQL performance benchmark suite designed to cover the most important SPARQL constructs and operator constellations using a broad range of RDF data access patterns. While there exist other benchmark suites for RDF data access—BSBM and LUBM are two other well-known benchmarks—we’ve found SP2B to overall be the most helpful metric by which to track and evaluate performance gains in Dydra’s query processing. Not coincidentally, SP2B’s main author also wrote one of the definitive works on SPARQL query optimization—exactly the kind of light reading we like to enjoy in our copious free time…

Results

We obtained the following results (note the log scale) on a standalone deployment of Dydra’s proprietary query engine, affectionately known as SPOCQ, to a machine of roughly comparable hardware specifications as those used for previously-published SP2B results:

SP2B results overview

We benchmarked four input dataset sizes ranging from 10,000 to 1,000,000 triples (indicated as 10K, 50K, 250K, and 1M). Following the methodology of the comprehensive SP2B analysis published by Revelytix last year, we executed each query up to 25 times for a given dataset size, and averaged the response times for each query/dataset combination after discarding possible outliers (the several lowest and highest values) from the sample. The query timeout was defined as 1,800 seconds.

The hardware used in these benchmarks was a dedicated Linux server with a dual-core AMD Athlon 64 X2 6000+ processor, 8 GB of DDR2-667 RAM, and 750 GB of SATA disk storage in a software-based RAID-1 configuration. This was a relatively underpowered server by present-day standards, given that e.g. my MacBook Pro laptop outperforms it on most tasks (including these benchmarks); but it does have the benefit of being roughly comparable both to the Amazon EC2 instance size used in the Revelytix analysis and to the hardware used in the original SP2B papers.

Comments

Several of the SP2B queries—in particular Q4, Q5a, Q6, and Q7—are tough on any SPARQL engine, and in published results it has been typical to see many implementations fail some of these already at dataset sizes of 50,000 to 250,000 triples. So far as we know, SPOCQ is the only native SPARQL implementation that correctly completes all SP2B queries on the 250,000 triple dataset within the specified timeout (1,800 seconds) and without returning bad data or experiencing other failures. Likewise, we correctly complete everything but Q5a (on which see a comment further below) on the 1,000,000 triple dataset as well.

Q1, Q3b, Q3c, Q10, and Q12

As depicted above, SPOCQ’s execution time was constant on a number of queries—specifically Q1, Q3b, Q3c, Q10, and Q12c—regardless of dataset size. The execution times for these queries all measured in the 20-40 millisecond range, depending on the exact query. Of the SP2B queries, these are the most similar to the types of day-to-day queries we actually observe being executed on the Dydra platform, and they showcase the very efficient indexing we do in SPOCQ’s storage substrate.

A Detailed Look at Q7

The SP2B query Q7 is designed to test a SPARQL engine’s ability to handle nested closed-world negation (CWN). Previously-published benchmark results indicate that along with Q4, this query has proved the most difficult for the majority of SPARQL implementations, with very few managing to complete it on datasets larger than 10,000 to 50,000 triples. We’re happy to report that our SPARQL implementation is among those select few:

SP2B Q7 comparison

The above chart combines our results with Revelytix’s comprehensive SP2B benchmark results. The depicted 1,800+ second bars here indicate either a timeout or a failure to return correct results (see pp. 38-39 of the Revelytix analysis for more details).

None of the implementations Revelytix benchmarked were able to complete SP2B Q7 on 1,000,000 triples within a one-hour timeout. SPOCQ completes the task in 80 seconds.

While we benchmarked on more or less comparable hardware and with comparable methodology, we do not claim that the comparison in the preceding chart is valid as such; take it with a grain of salt. It is indicative, however, of the amount of work we have put, and are putting, into preparing Dydra’s query engine for the demands we expect it to face as we exit our beta stage.

A Detailed Look at Q4

The SP2B query Q4 deals with long graph chains and produces a very large solution sequence, quadratic in the input dataset size. It is probably the most difficult of the SP2B queries, with few SPARQL implementations managing to finish it on input datasets larger than 50,000 triples.

SP2B Q4 comparison

As with Q7, this chart draws on data from the aforementioned Revelytix analysis (see pp. 32-33 of their report for details), and the same caveats certainly apply to this comparison. Nonetheless, of the implementations Revelytix benchmarked, only Oracle completed SP2B Q4 on 1,000,000 triples within a one-hour timeout. They reported a time of 522 seconds for Oracle. SPOCQ completes the task in 134 seconds.

A Special Note on Q5a

No existing SPARQL implementation does well on Q5a for larger datasets. We believe this is due to an oversight in the SP2B specification, where Q5a is defined as using a plain equality comparison in its FILTER condition, yet it is suggested that this makes for an implicit join that can be identified and optimized for. However, since joins in SPARQL are in fact defined in terms of a sameTerm comparison, such an optimization cannot be safely performed in the general case.

We have therefore also included results for an amended version of Q5a, named Q5a′:

# Prefix declarations as used throughout the SP2B query set
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bench: <http://localhost/vocabulary/bench/>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?person ?name
WHERE { ?article rdf:type bench:Article.
        ?article dc:creator ?person.
        ?inproc rdf:type bench:Inproceedings.
        ?inproc dc:creator ?person2.
        ?person foaf:name ?name.
        ?person2 foaf:name ?name2
        FILTER(sameTerm(?name,?name2)) }

Q5a′ simply substitutes sameTerm(?name, ?name2) in place of the ?name = ?name2 comparison, allowing the join to be optimized for. Q5a′ runs in comparable time to Q5b, as intended by the authors of SP2B. We suggest that others benchmarking SP2B note execution times for Q5a′ as well.

Caveats

While our production cluster has considerably more aggregate horsepower than the benchmark machine used for the above, it wouldn’t make for a very meaningful comparison given that all previously-published SP2B results have been single-machine deployments. So, the figures given here should be considered first and foremost merely a baseline capability demonstration of the technology that the Dydra platform is built on.

Further, during our ongoing beta period we are enforcing a maximum query execution time of 30 seconds, which of course would tend to preclude executing long-running analytic queries of the SP2B kind. If you have special evaluation needs you’d like to discuss with us, please contact Ben Lavender (ben@dydra.com).

Credits

In closing, we would like to express our thanks to the Freiburg University Database Group, the authors of the SP2B performance benchmark suite. SP2B has provided us with an invaluable yardstick by which to mark our weekly improvements to Dydra’s query processing throughput. Anyone developing a SPARQL engine today without exposing it to the non-trivial and tough queries of SP2B is doing themselves a serious disservice—as attested to by the difficulty most SPARQL implementations have with the more strenuous SP2B queries. SP2B is truly the gold standard of SPARQL benchmarks, ignored at one’s own peril.

Posted at 03:05

Tuning SPARQL Queries

Here follows an excerpt from our upcoming Dydra Developer Guide, from a section that provides some simple tips on how to tune your queries for better application performance.

SPARQL is a powerful query language, and as such it is easy to write complex queries that require a great deal of computing power to execute. As both query execution time and billing directly depend on how much processing a query requires, it is useful to understand some of Dydra’s key performance characteristics. With larger datasets, simple changes to a query can result in a significant performance improvement.

This post describes several factors that strongly influence the execution time and cost of queries, and explains a number of tips and tricks that will help you tune your queries for optimal application performance and a reduced monthly bill.

Note that the following may contain too much detail if you are casually using Dydra for typical and straightforward use cases. You probably won’t need these tips until you are dealing with large datasets or complex queries. Nonetheless, you may still find it interesting to at least glance over this material.

SELECT Queries

A general tip for SELECT queries is to avoid unnecessarily projecting variables you won’t actually use. That is, if your query’s WHERE clause binds the variables ?a, ?b, and ?c, but you actually only ever use ?b when iterating over the solution sequence in your application, then you might want to avoid specifying the query in either of the following two forms:

SELECT * WHERE { ... }
SELECT ?a ?b ?c WHERE { ... }

Rather, it is better to be explicit and project just the variables you actually intend to use:

SELECT ?b WHERE { ... }

The above has two benefits. Firstly, Dydra’s query processing will apply more aggressive optimizations knowing that the values of the variables ?a and ?c will not actually be returned in the solution sequence. Secondly, the size of the solution sequence itself, and hence the network use necessary for your application to retrieve it, is reduced by not including superfluous values. The combination of these two factors can make a big performance difference for complex queries returning large solution sequences.

If you remember just one thing from this subsection, remember this: SELECT * is a useful shorthand when manually executing queries, but not something you should rely on in a production application dealing with complex queries on non-trivial amounts of data.

Remember, also, that SPARQL provides an ASK query form. If all you need to know is whether a query matches something or not, use an ASK query instead of a SELECT query. This enables the query to be optimized more aggressively, and instead of a solution sequence you will get back a simple boolean value indicating whether the query matched or not, minimizing the data transferred in response to your query.
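
For example, to find out whether a hypothetical ex:alice resource has any foaf:knows links at all, the following ASK query is sufficient:

PREFIX ex:   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

ASK WHERE { ex:alice foaf:knows ?someone }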

The ORDER BY Clause

The ORDER BY clause can be very useful when you want your solution sequence to be sorted. It is important to realize, though, that ORDER BY is a relatively heavy operation, as it requires the query processing to materialize and sort a full intermediate solution sequence, which prevents Dydra from returning initial results to you until all results are available.

This does not mean that you should avoid using ORDER BY when it serves a purpose. If you need your query results sorted by particular criteria, it is best to let Dydra do that for you rather than manually sorting the data in your application. After all, that is why ORDER BY is there. However, if the solution sequence is large, and if the latency to obtain the initial solutions is important (sometimes known as the “time-to-first-solution” factor), you may wish to consider whether you in fact need an ORDER BY clause or not.

The OFFSET Clause

Dydra’s query processing guarantees that a query solution sequence has a consistent and deterministic ordering even in the absence of an ORDER BY clause. This has an important and useful consequence: the results of an OFFSET clause are always repeatable, whether or not the query has an ORDER BY clause.

Concretely, this means that if you have a query containing an OFFSET clause, and you execute that query multiple times in succession, you will get the same solution sequence in the same order each time. This is not a universal property of SPARQL implementations, but you can rely on it with Dydra.

This feature facilitates, for example, paging through a large solution sequence using an OFFSET and LIMIT clause combination, without needing ORDER BY. So, again, don’t use an ORDER BY clause unnecessarily if you merely want to page through the solution sequence (say) a hundred solutions at a time.
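
A minimal paging sketch along these lines, fetching the third page of one hundred solutions without any ORDER BY clause, looks as follows:

SELECT ?s ?p ?o
WHERE  { ?s ?p ?o }
OFFSET 200
LIMIT  100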

The LIMIT Clause

Always ensure that your queries include a LIMIT clause whenever possible. If your application only needs the first 100 query solutions, specify a LIMIT 100. This puts an explicit upper bound on the amount of work to be performed in answering your query.

Note, however, that if your query contains both ORDER BY and LIMIT clauses, query processing must always construct and examine the full solution sequence in order to sort it. Therefore the amount of processing needed is not actually reduced by a LIMIT clause in this case. Still, limiting the size of the ordered solution sequence with an explicit LIMIT improves performance by reducing network use.
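
For instance, the following query (with foaf:name as an illustrative predicate) must still sort the complete solution sequence internally, but only the first hundred solutions are returned in the response, keeping the transfer small:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name
WHERE { ?person foaf:name ?name }
ORDER BY ?name
LIMIT 100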

Posted at 03:05

March 26

Redlink: Adding Semantic Search to Apache Solr™

To add semantic capabilities to Apache Solr, developers need to install the Redlink Solr Plugin, which relies on the Analysis API. The Plugin enhances documents by extracting named entities and facts during Solr updates. Adding the Plugin to an existing Solr repository can be done within minutes, avoiding any major integration effort.

The global enterprise search market reached revenues of more than US$1.47bn in 2012, and Frost & Sullivan forecasts that this figure will reach US$4.68bn by 2019.

As search vendors look beyond metadata tagging towards deep content-analysis algorithms and pattern detection, search users are becoming more aware of the effectiveness of conceptual and semantic search.

Redlink Apache Solr Plugin

Here at Redlink we see this demand coming from various verticals, ranging from the public sector, which wants to provide better access to documents and public knowledge bases, to eCommerce platform providers looking to increase the effectiveness of their platforms in terms of sales volume and revenue.

We designed the Solr Plugin with these requirements in mind to add Semantic Search to Apache Solr, the popular, fast, and open-source enterprise search platform from the Apache Lucene project.

To add semantic capabilities to Apache Solr, as described by Thomas Kurz in this screencast, the Redlink Solr Plugin uses our Analysis API. The Plugin enhances documents by extracting named entities and facts during Solr updates. Adding the Plugin to an existing Solr repository can be done within minutes, avoiding any major integration effort.

To display cutting-edge results using faceted search and autosuggest, you can simply add Redlink Semantic Technology with a five-minute installation and a valid API Key: register to get one and download the plugin.

Thomas Kurz is a co-founder of and search expert at Redlink, and he will be attending Enterprise Search Europe 2014 on the 29th and 30th of April 2014 in London.

Posted at 16:43

Copyright of the postings is owned by the original blog authors. Contact us.