Planet RDF

It's triples all the way down

July 01

Dublin Core Metadata Initiative: DC-2015 Preliminary Program published

2015-07-01, São Paulo State University (UNESP) and the Conference Committee of DC-2015 have published the preliminary program of the DCMI International Conference at http://dcevents.dublincore.org/IntConf/index/pages/view/schedule-15. The conference days--Wednesday and Thursday, 2-3 September--feature keynote speakers Paul Walk and Ana Alice Baptista, paper sessions, project reports, posters (including best practice posters and demonstrations), and an array of special sessions. Tuesday and Friday are pre- and post-conference, full-day workshop events: "Development of Metadata Application Profiles", "Training the Trainers for Linked Data", and "Elaboration of Controlled Vocabularies Using SKOS". Special Sessions include "Schema.org Structured Data on the Web--An Extending Influence" sponsored by OCLC, "Current Developments in Metadata for Research Data" sponsored by the DCMI Science & Metadata Community, and "Cultural Heritage Linked Data". The titles and abstracts of the Technical Program are available at http://dcevents.dublincore.org/IntConf/index/pages/view/abstracts-15. Registration is open: http://dcevents.dublincore.org/IntConf/index/pages/view/reg15. Day registrations are available.

Posted at 23:59

W3C Read Write Web Community Group: Read Write Web — Q2 Summary — 2015

Summary

Q2 was relatively quiet, yet has seen quite a bit of progress. Some work is being done on the EU INSPIRE directive, and ESWC took place in Slovenia with some interesting demos. One that caught the eye was QueryVOWL, a visual query language for linked data.

For those who enjoy such things, there was some interesting work and discussion on deterministic naming of blank nodes. There is also a neat new framework called the Linked Data Reactor, which can be used for developing component-based applications. The web annotation group has also published an Editor’s draft.

Much of the work that has been done in this group has come together in a new spec, SoLiD (Social Linked Data).  As an early adopter of this technology I have been extraordinarily impressed, and would encourage trying it out.  There has also been a proposed charter for the next version of the Linked Data Platform.

Communications and Outreach

A few of this group met with the Social Web Working Group in Paris.  Over two days we got to demo read write technologies in action, and also to see the work from members of the indieweb community and those working with the Activity Streams specification.

Community Group

Relatively quiet this quarter on the mailing list, with about 40 posts. I get the impression that more focus has shifted to implementations and applications, where I think there is starting to be an uptick in progress. Some ontologies have been worked on, one for SoLiD apps, and another for microblogging.


Applications

The first release of a contacts manager on the SoLiD platform came out this month. It allows you to set up and store your own personal address book, in your own storage. An interesting feature of this app is that it includes logic for managing workspaces and preferences. Import and export are currently targeted at vCard, but more formats will be added, or you can simply fork the app and add your own!

Lots of work has been done on the linkeddata github area.  General improvements and some preliminary work on a keychain app.  One feature that I have found useful was the implementation of HTTP PATCH sending notifications to containers when something has changed.  This helped me create a quick demo for webid.im to show how it’s possible to cycle through a set of images and have them propagate through the network as things change.


Last but not Least…

The Triple Pattern Fragments client was released and is able to query multiple APIs for data at the same time. This is a 100% client-side app and supports federated SPARQL queries. It is another great open source app; you can read the specification or dive into the source code here.

Posted at 16:36

June 30

Redlink: Redlink API, moving to 1.0

Over the last months we’ve been working very hard to make the Redlink Platform a reliable and valuable service. Today we can finally announce that we are moving out of public beta to 1.0.

That means you will need to move to the new endpoint. Don’t worry: if you are using a Redlink SDK or one of our plugins, we are updating them all to make the transition easy. This does not affect currently running applications, since 1.0-BETA is deprecated but still available until the end of 2015. If you need further support with the transition, please contact us.


In the following weeks we’ll contact all our users to discuss further interest in our services and see how we can help you do awesome things with your data.

Posted at 15:32

June 29

Leigh Dodds: “The scribe and the djinn’s agreement”, an open data parable

In a time long past, in a land far away, there was once a great city. It was the greatest city in the land, and the vast marketplace at its centre was the busiest, liveliest marketplace in the world. People of all nations could be found there buying and selling their wares. Indeed, the marketplace was so large that people would spend days, even weeks, exploring its length and breadth and would still discover new stalls selling a myriad of items.

A frequent visitor to the marketplace was a woman known only as the Scribe. While the Scribe was often found roaming the marketplace even she did not know of all of the merchants to be found within its confines. Yet she spent many a day helping others to find their way to the stalls they were seeking, and was happy to do so.

One day, in return for providing useful guidance, a mysterious stranger gave the Scribe a gift: a small magical lamp. Upon rubbing the lamp, a djinn appeared before the surprised Scribe and offered her a single wish.

“Oh venerable djinn” cried the Scribe, “grant me the power to help anyone that comes to this marketplace. I wish to help anyone who needs it to find their way to whatever they desire”.

With a sneer the djinn replied: “I will grant your wish. But know this: your new found power shall come with limits. For I am a capricious spirit who resents his confinement in this lamp”. And with a flash and a roll of thunder, the magic was completed. And in the hands of the Scribe appeared the Book.

The Book contained the name and location of every merchant in the marketplace. From that day forward, by reading from the Book, the Scribe was able to help anyone who needed assistance to find whatever they needed.

After several weeks of wandering the market, happily helping those in need, the Scribe was alarmed to discover that she was confronted by a long, long line of people.

“What is happening?” she asked of the person at the head of the queue.

“It is now widely known that no-one should come to the Market without consulting the Scribe” said the man, bowing. “Could you direct me to the nearest merchant selling the finest silks and tapestries?”

And from that point forward the Scribe was faced with a never-ending stream of people asking for help. Tired and worn and no longer able to enjoy wandering the marketplace as had been her whim, she was now confined to its gates. Directing all who entered, night and day.

After some time, a young man took pity on the Scribe, pushing his way to the front of the queue. “Tell me where all of the spice merchants are to be found in the market, and then I shall share this with others!”

But no sooner had he said this than the djinn appeared in a puff of smoke: “NO! I forbid it!”. With a wave of its arm the Scribe was struck dumb until the young man departed. With a smirk the djinn disappeared.

Several days passed and a group of people arrived at the head of queue of petitioners.

“We too are scribes.” they said. “We come from a neighbouring town having heard of your plight. Our plan is to copy out your Book so that we might share your burden and help these people”.

But whilst a spark of hope was still flaring in the heart of the scribe, the djinn appeared once again. “NO! I forbid this too! Begone!” And with scream and a flash of light the scribes vanished. Looking smug the djinn disappeared.

Some time passed before a troupe of performers approached the Scribe. As a chorus they cried: “Look yonder at our stage, and the many people gathered before it. By taking turns reading from the book in front of a wide audience, we can easily share your burden”.

But shaking her head the Scribe could only turn away whilst the djinn visited ruin upon the troupe. “No more” she whispered sadly.

And so, for many years the Scribe remained as she had been, imprisoned within the subtle trap of the djinn of the lamp. Until, one day, a traveller appeared in the market. Upon reaching the head of the endless line of petitioners, the man asked of the Scribe:

“Where should you go to rid yourself of the evil djinn?”.

Surprised, and with sudden hope, the Scribe turned the pages of her Book…


Posted at 20:50

Orri Erling: Rethink Big and Europe’s Position in Big Data

I will here take a break from core database and talk a bit about EU policies for research funding.

I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.

RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.

So, we went through a regular questionnaire about how we see the landscape. I will summarize this below, as this is anyway informative.

Core competence

My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.

User challenges

The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.

Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.

There are also public data resources that are published in RDF serializations as a vendor-neutral, self-describing format. Having everything as triples, without an a priori schema, makes things easier to integrate and in some cases easier to describe and query.

So, the challenge is in the labor-intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see in the field is primarily a data diversity game. The scenario is data integration; the technology we provide is the database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and such core data-integration functions are performed using diverse, user-specific means.

Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.

What technology would you like to have? Network or power efficiency?

OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single threaded interprocess communication within one box. The operative word is latency. Making large systems always involves a network or something very much like one in large scale-up scenarios.

On the software side, next to nobody understands latency and contention; yet these are the core factors in any pursuit of scalability. Because of this situation, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular because these take the communication out of the program flow, so the programmer cannot muck this up, as otherwise would happen with the inevitability of destiny. Of course, our beloved SQL or declarative query in general does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the work of Hellerstein.

There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.

If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high power machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in larger number of smaller units, which makes life again harder. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of data more freely if the node size is larger.

One very specific item where research might help without having to rebuild the hardware stack would be better, lower-latency exposure of networks to software. Lightweight threads and user-space access, bypassing slow protocol stacks, etc. MPI has some of this, but maybe more could be done.

So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.

Are there pressures to adapt business models based on big data?

The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time.
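
As a rough back-of-the-envelope check of that utilization figure, here is a minimal Python sketch; the four-month rent equivalence is the figure quoted above, while the three-year amortization period is my assumption:

# Back-of-the-envelope check of the rent-vs-buy argument above.
# Assumption (mine): physical hardware is amortized over 3 years.
HOURS_PER_MONTH = 24 * 365 / 12               # ~730 hours

purchase_equiv_hours = 4 * HOURS_PER_MONTH    # purchase price == 4 months of 24x7 rent
amortization_hours = 36 * HOURS_PER_MONTH     # assumed 3-year hardware lifetime

break_even = purchase_equiv_hours / amortization_hours
print(f"Renting is cheaper below ~{break_even:.0%} average utilization")  # ~11%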

So, database software should be rented by the hour. A 100-150% markup for the $2.80 a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.

There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.

The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.

Well, the question is raised, what should Europe do that would have tangible impact in the next 5 years?

This is a harder question. There is some European business in wide area and mobile infrastructures. Competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., integrated Knight’s Corner with dual IB network and 16G fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.

What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.

I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.

What would deliver this would be a venture capital type activity, with well-developed networks and active participation in developing a business. The EC is not now set up to perform this role, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better exists. The grants are competitive enough and the calls are not bad; they follow the times.

In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.

The recent emphasis on benchmarking, spearheaded by Stefano Bertolo, is good, as exemplified by the LDBC FP7. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: once a benchmark gets adopted, infinitely more work goes into solving the problem than into stating it in the first place.

The aims and calls are good. The execution by projects is variable. For 1% of excellence, there apparently must be 99% of so-and-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse. There is not a single outcome that gets all the effort. In this, the level of engagement of participants is less and focus is much more scattered than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Well, most people will agree that core competence is nice. But the projects I have seen do not drive for it hard enough.

It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.

Posted at 19:36

June 28

Semantic Web Company (Austria): Improved Customer Experience by use of Semantic Web and Linked Data technologies

With the rise of Linked Data technologies, several new approaches come into play for improving the customer experience across all digital channels of a company. All of these methodologies can be subsumed under the term “the connected customer”.

These are interesting not only for retailers operating a web shop, but also for enterprises seeking new ways to develop tailor-made customer services and to increase customer retention.

Linked Data methodologies can help to improve several metrics along a typical customer experience lifecycle.

  1. Personalized access to information, e.g. to technical documentation
  2. Cross-selling through a better contextualization of product information
  3. Semantically enhanced help desk, user forums and self service platforms
  4. Better ways to understand and interpret a customer's intention by use of enterprise vocabularies
  5. More dynamic management of complex multi-channel websites with better cost-effectiveness
  6. More precise methods for data analytics, e.g. to allow marketers to better target campaigns and content to the user’s preferences
  7. Enhanced search experience at aggregators like Google through the use of microdata and schema.org

In the center of this approach, knowledge graphs work like a ‘linking machine’. Based on standards-based semantic models, business entities get linked in a highly dynamic way. Those graphs go beyond the power of social graphs: while social graphs focus on people only, knowledge graphs connect all kinds of relevant business objects to each other.

When customers and their behaviours are represented in a knowledge model, Linked data technologies try to preserve as much semantics as possible. By these means they are able to complement other approaches for big data analytics, which rather tend to flatten out the data model behind business entities.

Posted at 09:08

June 26

Semantic Web Company (Austria): Using SPARQL clause VALUES in PoolParty

Since PoolParty fully supports SPARQL 1.1 functionality, you can use clauses like VALUES. The VALUES clause can be used to provide an unordered solution sequence that is joined with the results of the query evaluation. From my perspective it is a convenient way of filtering variables and an improvement in the readability of queries.

E.g. when you want to know which cocktails you can create with Gin and a highball glass you can go to http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails and fire this query:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
  FILTER (?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en )
}

When you want to add additional pairs of ingredients and drinkware to filter in combination, the query gets quite clumsy. Wrongly placed braces can break the syntax. In addition, when writing complicated queries you can easily introduce errors, e.g. by mixing up boolean operators, which leads to incorrect results…

...
FILTER ((?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en ) ||
     (?ingredientLabel = "Vodka"@en && ?drinkwareLabel ="Old Fashioned glass"@en ))
}

Using VALUES can help in this situation. For example this query shows you how to filter both pairs Gin+Highball glass and Vodka+Old Fashioned glass in a neat way:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
}
VALUES ( ?ingredientLabel ?drinkwareLabel )
{
  ("Gin"@en "Highball glass"@en)
  ("Vodka"@en "Old Fashioned glass"@en)
}

Especially when you create SPARQL code automatically, e.g. generated by a form, this clause can be very useful.
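
For illustration, such a generator could look like the following sketch; the helper function and the pair list are hypothetical, not part of PoolParty, and real code should also escape quotes in the labels:

# Hypothetical sketch: build a SPARQL VALUES block from (ingredient, drinkware)
# pairs collected from a form, instead of concatenating nested FILTER expressions.
def values_block(pairs):
    rows = "\n".join(
        f'  ("{ingredient}"@en "{drinkware}"@en)' for ingredient, drinkware in pairs
    )
    return f"VALUES ( ?ingredientLabel ?drinkwareLabel )\n{{\n{rows}\n}}"

pairs = [("Gin", "Highball glass"), ("Vodka", "Old Fashioned glass")]
print(values_block(pairs))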

 

Posted at 13:15

June 22

Dublin Core Metadata Initiative: OpenAIRE Guidelines: Promoting Repositories Interoperability and Supporting Open Access Funder Mandates

2015-06-01, The OpenAIRE Guidelines for Data Source Managers provide recommendations and best practices for encoding of bibliographic information in OAI metadata. Presenters Pedro Príncipe, University of Minho, Portugal, and Jochen Schirrwagen, Bielefeld University Library, Germany, will provide an overview of the Guidelines, implementation support in major platforms and tools for validation. The Guidelines have adopted established standards for different classes of content providers: (1) Dublin Core for textual publications in institutional and thematic repositories; (2) DataCite Metadata Kernel for research data repositories; and (3) CERIF-XML for Current Research Information Systems. The principle of these guidelines is to improve interoperability of bibliographic information exchange between repositories, e-journals, CRIS and research infrastructures. They are a means to help content providers to comply with funders Open Access policies, e.g. the European Commission Open Access mandate in Horizon2020, and to standardize the syntax and semantics of funder/project information, open access status, links between publications and datasets. Webinar Date: Wednesday, 1 July 2015, 10:00am-11:15am EDT (UTC 14:00 - World Clock: http://bit.ly/pprincipe). For additional information and to register, visit http://dublincore.org/resources/training/#2015principe.

Posted at 23:59

AKSW Group - University of Leipzig: AKSW Colloquium, 22-06-2015, Concept Expansion Using Web Tables, Mining entities from the Web, Linked Data Stack

Concept Expansion Using Web Tables by Chi Wang, Kaushik Chakrabarti, Yeye He, Kris Ganjam, Zhimin Chen, Philip A. Bernstein (WWW’2015), presented by Ivan Ermilov:

Abstract. We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the “exclusive” tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.

Mining entities from the Web by Anna Lisa Gentile

This talk explores the task of mining entities and their describing attributes from the Web. The focus is on entity-centric websites, i.e. domain-specific websites containing a description page for each entity. The task of extracting information from this kind of website is usually referred to as Wrapper Induction. We propose a simple knowledge-based method which is (i) highly flexible with respect to different domains and (ii) does not require any training material, but exploits Linked Data as a background knowledge source to build essential learning resources. Linked Data – an imprecise, redundant and large-scale knowledge resource – proved useful to support this Information Extraction task: for domains that are covered, Linked Data serve as a powerful knowledge resource for gathering learning seeds. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple approach based on distant supervision can achieve competitive results against some complex state-of-the-art systems that always depend on training data.

Linked Data Stack by Martin Röbert

Martin will present the packaging infrastructure developed for the Linked Data Stack project, which will be followed by a discussion about the future of the project.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 10:11

June 20

Bob DuCharme: Artificial Intelligence, then (1960) and now

Especially machine learning.

Posted at 15:50

June 16

Orri Erling: Virtuoso Elastic Cluster Benchmarks AMI on Amazon EC2

We have another new Amazon machine image, this time for deploying your own Virtuoso Elastic Cluster on the cloud. The previous post gave a summary of running TPC-H on this image. This post is about what the AMI consists of and how to set it up.

Note: This AMI is running a pre-release build of Virtuoso 7.5, Commercial Edition. Features are subject to change, and this build is not licensed for any use other than the AMI-based benchmarking described herein.

There are two preconfigured cluster setups; one is for two (2) machines/instances and one is for four (4). Generation and loading of TPC-H data, as well as the benchmark run itself, is preconfigured, so you can do it by entering just a few commands. The whole sequence of doing a terabyte (1000G) scale TPC-H takes under two hours, with 30 minutes to generate the data, 35 minutes to load, and 35 minutes to do three benchmark runs. The 100G scale is several times faster still.

To experiment with this AMI, you will need a set of license files, one per machine/instance, which our Sales Team can provide.

Detailed instructions are on the AMI, in /home/ec2-user/cluster_instructions.txt, but the basic steps to get up and running are as follows:

  1. Instantiate machine image ami-811becea (AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-cl) with two or four (2 or 4) R3.8xlarge instances within one virtual private cluster and placement group. Make sure the VPC security is set to allow all connections.

  2. Log in to the first, and fill in the configuration file with the internal IP addresses of all machines instantiated in step 1.

  3. Distribute the license files to the instances, and start the OpenLink License Manager on each machine.

  4. Run 3 shell commands to set up the file systems and the Virtuoso configuration files.

  5. If you do not plan to run one of these benchmarks, you can simply start and work with the Virtuoso cluster now. It is ready for use with an empty database.

  6. Before running one of these benchmarks, generate the appropriate dataset with the dbgen.sh command.

  7. Bulk load the data with load.sh.

  8. Run the benchmark with run.sh.

Right now the cluster benchmarks are limited to TPC-H but cluster versions of the LDBC Social Network and Semantic Publishing benchmarks will follow soon.

Posted at 21:53

June 12

AKSW Group - University of Leipzig: AKSW Colloquium, 15-06-2015, Caching for Link Discovery

Using Caching for Local Link Discovery on Large Data Sets [PDF]
by Mofeed Hassan

Engineering the Data Web in the Big Data era demands the development of time- and space-efficient solutions for covering the lifecycle of Linked Data. As shown in previous works, using pure in-memory solutions is doomed to failure as the size of datasets grows continuously with time. In this work, presented by Mofeed Hassan, a study is performed on caching solutions for one of the central tasks on the Data Web, i.e., the discovery of links between resources. To this end, 6 different caching approaches were evaluated on real data using different settings. Our results show that while existing caching approaches already allow performing Link Discovery on large datasets from local resources, the achieved cache hits are still poor. Hence, we suggest the need for dedicated solutions to this problem for tackling the upcoming challenges pertaining to the edification of a semantic Web.
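
To illustrate where a cache sits in the link discovery loop, here is a minimal Python sketch; the fetch function, the LRU policy and the similarity test are assumptions for illustration only, not the caching approaches evaluated in the paper:

# Illustration only: an LRU cache in front of resource lookups during link discovery.
from functools import lru_cache

@lru_cache(maxsize=100_000)
def describe(resource_uri: str) -> frozenset:
    # Placeholder: a real system would dereference the URI or query a SPARQL
    # endpoint and return the property values used for link comparison.
    return frozenset({("rdfs:label", resource_uri.rsplit("/", 1)[-1])})

def candidate_links(source_uris, target_uris):
    # Compare every source against every target; repeated lookups hit the cache.
    for s in source_uris:
        for t in target_uris:
            if describe(s) & describe(t):          # crude similarity test
                yield (s, "owl:sameAs", t)

print(list(candidate_links(
    ["http://example.org/a/Leipzig"],
    ["http://example.org/b/Leipzig", "http://example.org/b/Dresden"])))
print(describe.cache_info())                        # shows cache hits vs. misses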

Posted at 21:33

Libby Miller: AWS new instance ssh timing out

In case this is any use to anyone else –

I’ve had AWS instances running for a few years. Today I went to create another one for something and infuriatingly, couldn’t connect to it over ssh at all: ssh just kept timing out.

I found a few links to do with groups, but the default group created for me in the (much improved) wizard seemed to be fine for incoming ssh connections. I then found a bunch of

Posted at 16:19

June 10

Orri Erling: In Hoc Signo Vinces (part 21 of n): Running TPC-H on Virtuoso Elastic Cluster on Amazon EC2

We have made an Amazon EC2 deployment of Virtuoso 7 Commercial Edition, configured to use the Elastic Cluster Module with TPC-H preconfigured, similar to the recently published OpenLink Virtuoso Benchmark AMI running the Open Source Edition. The details of the new Elastic Cluster AMI and steps to use it will be published in a forthcoming post. Here we will simply look at results of running TPC-H 100G scale on two machines, and 1000G scale on four machines. This shows how Virtuoso provides great performance on a cloud platform. The extremely fast bulk load — 33 minutes for a terabyte! — means that you can get straight to work even with on-demand infrastructure.

In the following, the Amazon instance type is R3.8xlarge, each with dual Xeon E5-2670 v2, 244G RAM, and 2 x 300G SSD. The image is made from the Amazon Linux with built-in network optimization. We first tried a RedHat image without network optimization and had considerable trouble with the interconnect. Using network-optimized Amazon Linux images inside a virtual private cloud has resolved all these problems.

The network-optimized 10GE interconnect at Amazon offers throughput close to the QDR InfiniBand running TCP/IP; thus the Amazon platform is suitable for running cluster databases. The execution that we have seen is not seriously network bound.

100G on 2 machines, with a total of 32 cores, 64 threads, 488 GB RAM, 4 x 300 GB SSD

Load time: 3m 52s
Run   Power       Throughput   Composite
 1    523,554.3   590,692.6    556,111.2
 2    565,353.3   642,503.0    602,694.9

1000G on 4 machines, with a total of 64 cores, 128 threads, 976 GB RAM, 8 x 300 GB SSD

Load time: 32m 47s
Run   Power       Throughput   Composite
 1    592,013.9   754,107.6    668,163.3
 2    896,564.1   828,265.4    861,738.4
 3    883,736.9   829,609.0    856,245.3

For the larger scale we did 3 sets of power + throughput tests to measure consistency of performance. By the TPC-H rules, the worst (first) score should be reported. Even after bulk load, this is markedly less than the next power score due to working set effects. This is seen to a lesser degree with the first throughput score also.

The numerical quantities summaries are available in a report.zip file, or individually --

Subsequent posts will explain how to deploy Virtuoso Elastic Clusters on AWS.

In Hoc Signo Vinces (TPC-H) Series

Posted at 16:03

June 09

Orri Erling: Introducing the OpenLink Virtuoso Benchmarks AMI on Amazon EC2

The OpenLink Virtuoso Benchmarks AMI is an Amazon EC2 machine image with the latest Virtuoso open source technology preconfigured to run —

  • TPC-H, the classic of SQL data warehousing

  • LDBC SNB, the new Social Network Benchmark from the Linked Data Benchmark Council

  • LDBC SPB, the RDF/SPARQL Semantic Publishing Benchmark from LDBC

This package is ideal for technology evaluators and developers interested in getting the most performance out of Virtuoso. This is also an all-in-one solution to any questions about reproducing claimed benchmark results. All necessary tools for building and running are included; thus any developer can use this model installation as a starting point. The benchmark drivers are preconfigured with appropriate settings, and benchmark qualification tests can be run with a single command.

The Benchmarks AMI includes a precompiled, preconfigured checkout of the v7fasttrack github repository, checkouts of the github repositories of the benchmarks, and a number of running directories with all configuration files preset and optimized. The image is intended to be instantiated on a R3.8xlarge Amazon instance with 244G RAM, dual Xeon E5-2670 v2, and 600G SSD.

Benchmark datasets and preloaded database files can be downloaded from S3 when large, and generated as needed on the instance when small. As an alternative, the instance is also set up to do all phases of data generation and database bulk load.

The following benchmark setups are included:

  • TPC-H 100G
  • TPC-H 300G
  • LDBC SNB Validation
  • LDBC SNB Interactive 100G
  • LDBC SNB Interactive 300G (SF3)
  • LDBC SPB Validation
  • LDBC SPB Basic 256 Mtriples (SF5)
  • LDBC SPB Basic 1 Gtriple

The AMI will be expanded as new benchmarks are introduced, for example, the LDBC Social Network Business Intelligence or Graph Analytics.

To get started:

  1. Instantiate machine image ami-eb789280 (AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-6) with a R3.8xlarge instance.

  2. Connect via ssh.

  3. See the README (also found in the ec2-user's home directory) for full instructions on getting up and running.

Posted at 15:51

Orri Erling: SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso

In this post we will look at running the LDBC SNB on Virtuoso.

First, let's recap what the benchmark is about:

  1. fairly frequent short updates, with no update contention worth mentioning
  2. short random lookups
  3. medium complex queries centered around a person's social environment

The updates exist so as to invalidate strategies that rely too heavily on precomputation. The short lookups exist for the sake of realism; after all, an online social application does lookups for the most part. The medium complex queries are to challenge the DBMS.

The DBMS challenges have to do firstly with query optimization, and secondly with execution with a lot of non-local random access patterns. Query optimization is not a requirement, per se, since imperative implementations are allowed, but we will see that these are no more free of the laws of nature than the declarative ones.

The workload is arbitrarily parallel, so intra-query parallelization is not particularly useful, if also not harmful. There are latency constraints on operations which strongly encourage implementations to stay within a predictable time envelope regardless of specific query parameters. The parameters are a combination of person and date range, and sometimes tags or countries. The hardest queries have the potential to access all content created by people within 2 steps of a central person, so possibly thousands of people, times 2000 posts per person, times up to 4 tags per post. We are talking in the millions of key lookups, aiming for sub-second single-threaded execution.
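
To make the order of magnitude concrete, here is a quick calculation using the figures above; 2,000 is an assumed stand-in for "thousands of people":

# Rough scale of the hardest SNB queries, using the figures quoted above.
people_within_2_steps = 2_000      # assumed value for "thousands of people"
posts_per_person = 2_000
tags_per_post = 4

key_lookups = people_within_2_steps * posts_per_person * tags_per_post
print(f"{key_lookups:,} key lookups")   # 16,000,000
# Sub-second single-threaded execution therefore implies a lookup rate on the
# order of tens of millions of keys per second.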

The test system is the same as used in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The software is the feature/analytics branch of v7fasttrack, available from www.github.com.

The dataset is the SNB 300G set, with:

1,136,127 persons
125,249,604 knows edges
847,886,644 posts, including replies
1,145,893,841 tags of posts or replies
1,140,226,235 likes of posts or replies

As an initial step, we run the benchmark as fast as it will go. We use 32 threads on the driver side for 24 hardware threads.

Below are the numerical quantities for a 400K operation run after 150K operations worth of warmup.

Duration: 10:41.251
Throughput: 623.71 (op/s)

The statistics that matter are detailed below, with operations ranked in order of descending client-side wait-time. All times are in milliseconds.

% of total  total_wait  name                           count   mean      min    max
20%         4,231,130   LdbcQuery5                        656   6,449.89    245  10,311
11%         2,272,954   LdbcQuery8                     18,354     123.84     14   2,240
10%         2,200,718   LdbcQuery3                        388   5,671.95    468  17,368
7.3%        1,561,382   LdbcQuery14                     1,124   1,389.13      4   5,724
6.7%        1,441,575   LdbcQuery12                     1,252   1,151.42     15   3,273
6.5%        1,396,932   LdbcQuery10                     1,252   1,115.76     13   4,743
5%          1,064,457   LdbcShortQuery3PersonFriends   46,285    22.9979      0   2,287
4.9%        1,047,536   LdbcShortQuery2PersonPosts     46,285    22.6323      0   2,156
4.1%          885,102   LdbcQuery6                      1,721    514.295      8   5,227
3.3%          707,901   LdbcQuery1                      2,117    334.389     28   3,467
2.4%          521,738   LdbcQuery4                      1,530    341.005     49   2,774
2.1%          440,197   LdbcShortQuery4MessageContent  46,302    9.50708      0   2,015
1.9%          407,450   LdbcUpdate5AddForumMembership  14,338    28.4175      0   2,008
1.9%          405,243   LdbcShortQuery7MessageReplies  46,302    8.75217      0   2,112
1.9%          404,002   LdbcShortQuery6MessageForum    46,302    8.72537      0   1,968
1.8%          387,044   LdbcUpdate3AddCommentLike      12,659    30.5746      0   2,060
1.7%          361,290   LdbcShortQuery1PersonProfile   46,285    7.80577      0   2,015
1.6%          334,409   LdbcShortQuery5MessageCreator  46,302    7.22234      0   2,055
1%            220,740   LdbcQuery2                      1,488    148.347      2   2,504
0.96%         205,910   LdbcQuery7                      1,721    119.646     11   2,295
0.93%         198,971   LdbcUpdate2AddPostLike          5,974    33.3062      0   1,987
0.88%         189,871   LdbcQuery11                     2,294    82.7685      4   2,219
0.85%         182,964   LdbcQuery13                     2,898    63.1346      1   2,201
0.74%         158,188   LdbcQuery9                         78   2,028.05  1,108   4,183
0.67%         143,457   LdbcUpdate7AddComment           3,986    35.9902      1   1,912
0.26%          54,947   LdbcUpdate8AddFriendship          571    96.2294      1     988
0.2%           43,451   LdbcUpdate6AddPost              1,386    31.3499      1   2,060
0.0086%         1,848   LdbcUpdate4AddForum               103    17.9417      1      65
0.0002%            44   LdbcUpdate1AddPerson                2    22           10      34

At this point we have in-depth knowledge of the choke points the benchmark stresses, and we can give a first assessment of whether the design meets its objectives for setting an agenda for the coming years of graph database development.

The implementation is well optimized in general but still has maybe 30% room for improvement. We note that this is based on a compressed column store. One could think that alternative data representations, like in-memory graphs of structs and pointers between them, are better for the task. This is not necessarily so; at the least, a compressed column store is much more space efficient. Space efficiency is the root of cost efficiency, since as soon as the working set is not in memory, a random access workload is badly hit.

The set of choke points (technical challenges) actually revealed by the benchmark is so far as follows:

  • Cardinality estimation under heavy data skew — Many queries take a tag or a country as a parameter. The cardinalities associated with tags vary from 29M posts for the most common to 1 for the least common. Q6 has a common tag (in top few hundred) half the time and a random, most often very infrequent, one the rest of the time. A declarative implementation must recognize the cardinality implications from the literal and plan accordingly. An imperative one would have to count. Missing this makes Q6 take about 40% of the time instead of 4.1% when adapting.

  • Covering indices — Being able to make multi-column indices that duplicate some columns from the table often saves an entire table lookup. For example, an index on post by author can also contain the post's creation date.

  • Multi-hop graph traversal — Most queries access a two-hop environment starting at a person. Two queries look for shortest paths of unbounded length. For the two-hop case, it makes almost no difference whether this is done as a union or a special graph traversal operator. For shortest paths, this simply must be built into the engine; doing this client-side incurs prohibitive overheads. A bidirectional shortest path operation is a requirement for the benchmark; a small sketch of the bidirectional idea follows this list.

  • Top K — Most queries returning posts order results by descending date. Once there are at least k results, anything older than the kth can be dropped, adding a date selection as early as possible in the query. This interacts with vectored execution, so that starting with a short vector size more rapidly produces an initial top k.

  • Late projection — Many queries access several columns and touch millions of rows but only return a few. The columns that are not used in sorting or selection can be retrieved only for the rows that are actually returned. This is especially useful with a column store, as this removes many large columns (e.g., text of a post) from the working set.

  • Materialization — Q14 accesses an expensive-to-compute edge weight, the number of post-reply pairs between two people. Keeping this precomputed drops Q14 from the top place. Other materialization would be possible, for example Q2 (top 20 posts by friends), but since Q2 is just 1% of the load, there is no need. One could of course argue that this should be 20x more frequent, in which case there could be a point to this.

  • Concurrency control — Read-write contention is rare, as updates are randomly spread over the database. However, some pages get read very frequently, e.g., some middle level index pages in the post table. Keeping a count of reading threads requires a mutex, and there is significant contention on this. Since the hot set can be one page, adding more mutexes does not always help. However, hash partitioning the index into many independent trees (as in the case of a cluster) helps for this. There is also contention on a mutex for assigning threads to client requests, as there are large numbers of short operations.
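
As promised in the multi-hop bullet above, here is a minimal, engine-agnostic sketch of a bidirectional breadth-first search; the adjacency map stands in for an index on the knows edges, and nothing here is Virtuoso's actual operator:

# Minimal bidirectional BFS over an in-memory "knows" adjacency map.
# Each round expands one level from each end; the frontiers meet after roughly
# half the path length, which is why a built-in operator beats hop-by-hop
# querying from the client.
from collections import deque

def shortest_path_length(knows, src, dst):
    if src == dst:
        return 0
    dists = ({src: 0}, {dst: 0})
    frontiers = (deque([src]), deque([dst]))
    while frontiers[0] and frontiers[1]:
        for side in (0, 1):                     # strictly alternate the two ends
            dist, other = dists[side], dists[1 - side]
            frontier = frontiers[side]
            for _ in range(len(frontier)):      # expand exactly one level
                person = frontier.popleft()
                for friend in knows.get(person, ()):
                    if friend in other:         # the two searches have met
                        return dist[person] + 1 + other[friend]
                    if friend not in dist:
                        dist[friend] = dist[person] + 1
                        frontier.append(friend)
    return None                                 # no connecting path

knows = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(shortest_path_length(knows, "A", "D"))    # 3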

In subsequent posts, we will look at specific queries, what they in fact do, and what their theoretical performance limits would be. In this way we will have a precise understanding of which way SNB can steer the graph DB community.

SNB Interactive Series

Posted at 15:24

June 08

W3C Data Activity: INSPIRE in RDF

As many people who work in the field will know, the 2007 INSPIRE Directive tasks European Union Member States with harmonizing their spatial and environmental data. The relevant department of the European Commission, the JRC, has led the definition of … Continue reading

Posted at 09:46

Tetherless World Constellation group RPI: GYA, CODATA-ECDP and Open Science

During May 25-29, 2015, the Global Young Academy (GYA) held the 5th International Conference for Young Scientists and its Annual General Meeting at Montebello, Quebec, Canada. I attended the public day of the conference on May 27, as a delegate of the CODATA Early Career Data Professionals Working Group (ECDP).

The GYA was founded in 2010 and its objective is to be the voice of young scientists around the world. Members are chosen for their demonstrated excellence in scientific achievement and commitment to service. Currently there are 200 members from 58 countries, representing all major world regions. Most GYA members attended the conference at Montebello, together with about 40 guests from other institutions, including Prof. Gordon McBean, president of the International Council for Science, and Prof. Howard Alper, former co-chair of IAP: the Global Network of Science Academies.

GYA issued a position statement on Open Science in 2012, which calls for scientific results and data to be made freely available for scientists around the world, and advocates ways forward that will transform scientific research into a truly global endeavor. Dr. Sabina Leonelli from the University of Exeter, UK is one of the lead authors of the position statement, and also a lead of the GYA Open Science Working Group. A major objective of my attendance to the GYA conference is to discuss the future opportunities on collaborations between CODATA-ECDP and GYA. Besides Sabina, I also met Dr. Abdullah Tariq, another lead of the GYA Open Science WG, and several other members of the GYA executive committee.
The discussion was successful. We mentioned the possibility of an interest group in Global Open Science within CODATA, to have a few members join both organizations, to propose sessions on the diversity of conditions under which open data work around the world, perhaps for the next CODATA/RDA meeting in Paris or later meetings of the type, to collaborate around business models for data centers, and to reach out to other organizations and working groups of open data and/or open science, etc.

GYA is a very active group, both formed and organized by young people, and I was happy to see that Open Science is one of the four core activities that GYA is currently promoting. I would encourage ECDP and CODATA members to look at GYA in more detail on its website and to propose future collaborations to promote topics of common interest in open data and open science.

Posted at 03:11

June 07

Ebiquity research group UMBC: UMBC Schema Free Query system on ESWC Schema-agnostic Queries over Linked Data

This year’s ESWC Semantic Web Evaluation Challenge track had a task on Schema-agnostic Queries over Linked Data: SAQ-2015. The idea is to support a SPARQL-like query language that does not require knowing the underlying graph schema nor the URIs to use for terms and individuals, as in the following examples.

 SELECT ?y {BillClinton hasDaughter ?x. ?x marriedTo ?y.}

 SELECT ?x {?x isA book. ?x by William_Goldman.
            ?x has_pages ?p. FILTER (?p > 300)}

We adapted our Schema Free Querying system to the task as described in the following paper.


Zareen Syed, Lushan Han, Muhammad Mahbubur Rahman, Tim Finin, James Kukla and Jeehye Yun, UMBC_Ebiquity-SFQ: Schema Free Querying System, ESWC Semantic Web Evaluation Challenge, Extended Semantic Web Conference, June 2015.

Users need better ways to explore large complex linked data resources. Using SPARQL requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology and URIs for entities of interest. Natural language question answering systems solve the problem, but these are still subjects of research. The schema-agnostic SPARQL queries task defined in the SAQ-2015 challenge consists of schema-agnostic queries following the syntax of the SPARQL standard, where the syntax and semantics of operators are maintained, while users are free to choose words, phrases and entity names irrespective of the underlying schema or ontology. This combination of query skeleton with keywords helps to remove some of the ambiguity. We describe our framework for handling schema-agnostic or schema-free queries and discuss enhancements to handle the SAQ-2015 challenge queries. The key contributions are the robust methods that combine statistical association and semantic similarity to map user terms to the most appropriate classes and properties used in the underlying ontology, and type inference for user input concepts based on concept linking.
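
The flavour of that term-to-property mapping can be shown with a toy sketch; the candidates, the scores and the simple weighted combination below are all assumptions for illustration, not the actual models of the UMBC system:

# Toy illustration: map a user term to an ontology property by combining a
# semantic-similarity score with a statistical-association score.
def best_mapping(user_term, candidates, similarity, association, alpha=0.6):
    # The linear weighting is an assumption, not the paper's method.
    def score(candidate):
        return (alpha * similarity(user_term, candidate)
                + (1 - alpha) * association(user_term, candidate))
    return max(candidates, key=score)

# Hypothetical lookup tables standing in for trained models.
sim = {("by", "dbo:author"): 0.82, ("by", "dbo:publisher"): 0.55}
assoc = {("by", "dbo:author"): 0.74, ("by", "dbo:publisher"): 0.31}

print(best_mapping(
    "by",                                   # the user's term from "?x by William_Goldman"
    ["dbo:author", "dbo:publisher"],        # candidate ontology properties (assumed)
    similarity=lambda t, c: sim.get((t, c), 0.0),
    association=lambda t, c: assoc.get((t, c), 0.0),
))                                          # -> dbo:author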

Posted at 13:58

June 06

Ebiquity research group UMBC: Querying RDF Data with Text Annotated Graphs

New paper: Lushan Han, Tim Finin, Anupam Joshi and Doreen Cheng, Querying RDF Data with Text Annotated Graphs, 27th International Conference on Scientific and Statistical Database Management, San Diego, June 2015.

Scientists and casual users need better ways to query RDF databases or Linked Open Data. Using the SPARQL query language requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology used, and URIs for entities of interest. Natural language query systems are a powerful approach, but current techniques are brittle in addressing the ambiguity and complexity of natural language and require expensive labor to supply the extensive domain knowledge they need. We introduce a compromise in which users give a graphical “skeleton” for a query and annotate it with freely chosen words, phrases and entity names. We describe a framework for interpreting these “schema-agnostic queries” over open domain RDF data that automatically translates them to SPARQL queries. The framework uses semantic textual similarity to find mapping candidates and uses statistical approaches to learn domain knowledge for disambiguation, thus avoiding expensive human efforts required by natural language interface systems. We demonstrate the feasibility of the approach with an implementation that performs well in an evaluation on DBpedia data.

Posted at 14:26

June 05

Ebiquity research group UMBC: Discovering and Querying Hybrid Linked Data


New paper: Zareen Syed, Tim Finin, Muhammad Rahman, James Kukla and Jeehye Yun, Discovering and Querying Hybrid Linked Data, Third Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, held in conjunction with the 12th Extended Semantic Web Conference, Portoroz Slovenia, June 2015.

In this paper, we present a unified framework for discovering and querying hybrid linked data. We describe our approach to developing a natural language query interface for a hybrid knowledge base Wikitology, and present that as a case study for accessing hybrid information sources with structured and unstructured data through natural language queries. We evaluate our system on a publicly available dataset and demonstrate improvements over a baseline system. We describe limitations of our approach and also discuss cases where our system can complement other structured data querying systems by retrieving additional answers not available in structured sources.

Posted at 14:00

AKSW Group - University of Leipzig: AKSW Colloquium, 08-06-2015, DBpediaSameAs, Dynamic-LOD

DBpediaSameAs: An approach to tackling heterogeneity in DBpedia identifiers by Andre Valdestilhas

This work provides an approach to tackling heterogeneity in DBpedia identifiers, motivated by the many transient, redundant owl:sameAs occurrences that were observed while finding co-references between different data sets.

Thus, this work makes three contributions towards solving this problem: (1) a DBpedia Unique Identifier, which normalizes owl:sameAs occurrences by providing a single DBpedia identifier instead of several transient, redundant owl:sameAs links; (2) rating and suggesting links, in order to improve their quality and to provide statistics about the links; and (3) a performance gain: the physical size of the triples decreased from 16.2 GB to 6 GB, and normalization and index creation are now possible.
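
One generic way to realize such a normalization is to cluster the owl:sameAs pairs and elect a canonical DBpedia identifier per cluster, as in the following sketch; this is an illustration only, not necessarily the DBpediaSameAs implementation, and the example URIs are just examples:

# Illustration only: collapse owl:sameAs pairs into clusters with union-find and
# pick one canonical identifier per cluster.
def normalize(same_as_pairs):
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]       # path halving
            u = parent[u]
        return u

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)               # union the two clusters

    clusters = {}
    for uri in parent:
        clusters.setdefault(find(uri), set()).add(uri)
    # Prefer a DBpedia URI as the canonical member of each cluster.
    return {
        uri: min((u for u in members if "dbpedia.org" in u), default=min(members))
        for members in clusters.values()
        for uri in members
    }

pairs = [
    ("http://dbpedia.org/resource/Leipzig", "http://sws.geonames.org/2879139/"),
    ("http://sws.geonames.org/2879139/", "http://de.dbpedia.org/resource/Leipzig"),
]
canonical = normalize(pairs)
print(canonical["http://sws.geonames.org/2879139/"])   # -> http://dbpedia.org/resource/Leipzig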

The usability of the interface was evaluated using a standard usability questionnaire. The positive results from all of our interviewed participants showed that the DBpediaSameAs property is easy to use and can thus lead to novel insights.

As a proof of concept, an implementation is provided as a web system, including a service on the web and a graphical user interface.

Dynamic-LOD: An approach to count links using Bloom filters by Ciro Baron

The Web of Linked Data is growing and it is becoming increasingly necessary to discover the relationships between different datasets.

Ciro Baron will present an approach to accurate link counting which uses Bloom filters (BF) to compare and approximately count links between datasets, solving the problem of a lack of up-to-date metadata about linksets. The paper, which compares performance with classical approaches such as binary search trees (BST) and hash tables (HT), can be found at http://svn.aksw.org/papers/2015/ISWC_DynLOD/public.pdf, and the results show that the Bloom filter is 12x more efficient in terms of memory usage, with adequate query speed performance.
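
For readers unfamiliar with the data structure, here is a minimal Bloom-filter sketch; it illustrates the idea only, not Dynamic-LOD's code, and the sizes, hash scheme and example URIs are arbitrary:

# Minimal Bloom filter: store one dataset's subjects in a bit array, then
# approximately count how many link targets from another dataset fall into it.
# No false negatives; a tunable rate of false positives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

target_subjects = BloomFilter()
for uri in ("http://example.org/ds2/a", "http://example.org/ds2/b"):
    target_subjects.add(uri)

links_from_ds1 = ["http://example.org/ds2/a", "http://example.org/ds2/c"]
print(sum(1 for uri in links_from_ds1 if uri in target_subjects))   # approximately 1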

In addition, Ciro will show a small cloud generated for all English DBpedia datasets and vocabularies available in Linked Open Vocabularies (LOV).

We evaluated Dynamic-LOD in three different aspects: first, by analyzing data structure performance, comparing BF with HT and BST; second, a quantitative evaluation regarding false positives and the speed of counting links in a dense scenario like DBpedia; and third, on a large scale based on lod-cloud distributions. In fact, all three evaluations indicate that BF is a good choice for what our work proposes.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 00:00

June 03

Orri Erling: The Virtuoso Science Library

There is a lot of scientific material on Virtuoso, but it has not been presented all together in any one place. So I am making here a compilation of the best resources with a paragraph of introduction on each. Some of these are project deliverables from projects under the EU FP7 programme; some are peer-reviewed publications.

For the future, an updated version of this list may be found on the main Virtuoso site.

European Project Deliverables

  • GeoKnow D 2.6.1: Graph Analytics in the DBMS (2015-01-05)

    This introduces the idea of unbundling basic cluster DBMS functionality like cross partition joins and partitioned group by to form a graph processing framework collocated with the data.

  • GeoKnow D2.4.1: Geospatial Clustering and Characteristic Sets (2015-01-06)

    This presents experimental results of structure-aware RDF applied to geospatial data. The regularly structured part of the data goes in tables; the rest is triples/quads. Furthermore, for the first time in the RDF space, physical storage location is correlated to properties of entities, in this case geo location, so that geospatially adjacent items are also likely adjacent in the physical data representation.

  • LOD2 D2.1.5: 500 billion triple BSBM (2014-08-18)

    This presents experimental results on lookup and BI workloads on Virtuoso cluster with 12 nodes, for a total of 3T RAM and 192 cores. This also discusses bulk load, at up to 6M triples/s and specifics of query optimization in scale-out settings.

  • LOD2 D2.6: Parallel Programming in SQL (2012-08-12)

    This discusses ways of making SQL procedures partitioning-aware, so that one can, map-reduce style, send parallel chunks of computation to each partition of the data.

Publications

2015

  • Minh-Duc Pham, Linnea Passing, Orri Erling, Peter A. Boncz: "Deriving an Emergent Relational Schema from RDF Data," WWW, 2015.

    This paper shows how RDF is in fact structured and how this structure can be reconstructed. This reconstruction then serves to create a physical schema, reintroducing all the benefits of physical design to the schema-last world. Experiments with Virtuoso show marked gains in query speed and data compactness.

2014

2013

2012

  • Orri Erling: Virtuoso, a Hybrid RDBMS/Graph Column Store. IEEE Data Eng. Bull. (DEBU) 35(1):3-8 (2012)

    This paper introduces the Virtuoso column store architecture and design choices. One design is made to serve both random updates and lookups as well as the big scans where column stores traditionally excel. Examples are given from both TPC-H and the schema-less RDF world.

  • Minh-Duc Pham, Peter A. Boncz, Orri Erling: S3G2: A Scalable Structure-Correlated Social Graph Generator. TPCTC 2012:156-172

    This paper presents the basis of the social network benchmarking technology later used in the LDBC benchmarks.

2011

2009

  • Orri Erling, Ivan Mikhailov: Faceted Views over Large-Scale Linked Data. LDOW 2009

    This paper introduces anytime query answering as an enabling technology for open-ended querying of large data on public service end points. While not every query can be run to completion, partial results can most often be returned within a constrained time window.

  • Orri Erling, Ivan Mikhailov: Virtuoso: RDF Support in a Native RDBMS. Semantic Web Information Management 2009:501-519

    This is a general presentation of how a SQL engine needs to be adapted to serve a run-time typed and schema-less workload.

2008

2007

  • Orri Erling, Ivan Mikhailov: RDF Support in the Virtuoso DBMS. CSSW 2007:59-68

    This is an initial discussion of RDF support in Virtuoso. Most specifics are by now different but this can give a historical perspective.

Posted at 16:51

Redlink: 12th Extended Semantic Web Conference


For the twelfth consecutive year, the Extended Semantic Web Conference (formerly the European Semantic Web Conference) took place in early June, this year in Portoroz, Slovenia. ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic web technologies. For us as an innovation-driven startup it’s very important to keep talking with the community in our continuous effort to improve our products.

The developers workshop brought together the most practical face of the current research. With a very interesting program, including a talk about WordLift, the day finished with lively discussions and conclusions.

I was invited to present the keynote at the SALAD2015 (Services and Applications over Linked APIs and Data) workshop, looking back at the last ten years of semantics in service-oriented architectures.

In the main track, the trend was clearly to reach out to other relevant research areas in which web semantics plays an important role, such as Machine Learning or Big Data, applying the results both to established scenarios, as we are doing in TourPack for the tourism sector, and to new scenarios, such as the now-trending Internet of Things.

See the full program of the conference if you are interested in more details about the different papers presented.

Posted at 13:26

Copyright of the postings is owned by the original blog authors. Contact us.