Planet RDF

It's triples all the way down

December 13

AKSW Group - University of Leipzig: SANSA 0.5 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.5 – the fifth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML and N-Quads formats
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify and Ontop
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • A data lake concept for querying heterogeneous data sources has been integrated into SANSA
  • New clustering algorithms have been added and the interface for clustering has been unified
  • Ontop RDB2RDF engine support has been added
  • RDF data quality assessment methods have been substantially improved
  • Dataset statistics calculation has been substantially improved
  • Improved unit test coverage

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects HOBBIT, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin and Simple-ML.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

 

Posted at 08:25

December 01

Libby Miller: Cat detector with Tensorflow on a Raspberry Pi 3B+

Like this
Download Raspbian Stretch with Desktop

Burn a card with Etcher.

(Assuming a Mac) Enable ssh

touch /Volumes/boot/ssh

Put a wifi password in

nano /Volumes/boot/wpa_supplicant.conf
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="foo"
  psk="bar"
}

Connect the Pi camera, attach a dial to GPIO pin 12 and ground, boot up the Pi, ssh in, then

sudo apt-get update
sudo apt-get upgrade
sudo raspi-config # and enable camera; reboot

install tensorflow

sudo apt install python3-dev python3-pip
sudo apt install libatlas-base-dev
pip3 install --user --upgrade tensorflow

Test it

python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

get imagenet

git clone https://github.com/tensorflow/models.git
cd ~/models/tutorials/image/imagenet
python3 classify_image.py

install OpenCV

pip3 install opencv-python
sudo apt-get install libjasper-dev
sudo apt-get install libqtgui4
sudo apt install libqt4-test
python3 -c 'import cv2; print(cv2.__version__)'

install the pieces for talking to the camera

cd ~/models/tutorials/image/imagenet
pip3 install imutils picamera
mkdir results

download the edited version of classify_image

curl -O https://gist.githubusercontent.com/libbymiller/d542d596566774a35752d134f80b1332/raw/471f066e4dc498501bab7731a07fa0c1926c1575/classify_image_dial.py

Run it, and point at a cat

python3 classify_image_dial.py
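
For reference, the general shape of the script is: grab a frame from the Pi camera, run it through the ImageNet classifier, and show the result on the dial. The sketch below is a rough, hypothetical simplification, not the actual classify_image_dial.py: classify() is a placeholder for the TensorFlow inference code in classify_image.py, and the dial is assumed to be an analogue meter driven by PWM on pin 12.

# Hypothetical sketch, not the actual classify_image_dial.py.
import time
import RPi.GPIO as GPIO
from picamera import PiCamera

GPIO.setmode(GPIO.BOARD)
GPIO.setup(12, GPIO.OUT)
dial = GPIO.PWM(12, 100)        # 100 Hz PWM on physical pin 12
dial.start(0)

camera = PiCamera()

def classify(path):
    """Placeholder for the ImageNet inference in classify_image.py;
    should return a list of (label, score) pairs."""
    raise NotImplementedError

try:
    while True:
        camera.capture('results/frame.jpg')
        scores = classify('results/frame.jpg')
        cat_score = max((s for label, s in scores if 'cat' in label), default=0.0)
        dial.ChangeDutyCycle(cat_score * 100)   # swing the dial with the cat-ness
        time.sleep(1)
finally:
    dial.stop()
    GPIO.cleanup()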

Posted at 20:48

November 30

Dublin Core Metadata Initiative: DCMI 2018 Conference Proceedings Published

We are pleased to announce that the full conference proceedings for the DCMI Annual Conference 2018 have been published. Thanks to all who contributed to the programme!

Posted at 00:00

Dublin Core Metadata Initiative: DCMI to Maintain the Bibliographic Ontology (BIBO)

The Dublin Core Metadata Initiative is pleased to announce that it has accepted responsibility for maintaining the Bibliographic Ontology (BIBO). BIBO is a long-established ontology, is very stable and is widely used in bibliographic linked-data. We are also very pleased to announce that Bruce D'Arcus has joined the DCMI Usage Board, and will act as the point of contact for issues relating to BIBO. Bruce had this to say about the decision:

Posted at 00:00

November 29

Gregory Williams: Thoughts on HDT

I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of an RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

  • The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):

    Here are some of the issues I found during implementation:

    • DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose

    • FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)

    • The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits (see the decoding sketch after this list)

    • “Log64” referenced in BitmapTriples does not seem to be defined anywhere

    • There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup

  • The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding

  • There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible

  • The secondary index format seems to be entirely undocumented
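
For illustration, here is a minimal sketch of VByte decoding as the implementations appear to do it, with the payload in the seven least significant bits of each byte; the assumption that the most significant bit marks the final byte of a value is my reading of the existing code, not something the documentation states.

# Minimal VByte decoding sketch (an assumption based on reading existing HDT
# implementations, not on the official documentation): each byte carries seven
# payload bits in its least significant bits, and the most significant bit is
# set on the last byte of a value.
def decode_vbyte(data, offset=0):
    """Return (value, next_offset) for the value starting at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift   # low seven bits are payload
        shift += 7
        if byte & 0x80:                   # high bit flags the final byte
            return value, offset

# 0x01 (continue, payload 1), 0x81 (final, payload 1) -> 1 + (1 << 7) = 129
assert decode_vbyte(bytes([0x01, 0x81])) == (129, 2)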

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

  • The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing

  • Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)

  • The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit

  • A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)

  • The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TriG files.

Posted at 16:58

November 26

Dublin Core Metadata Initiative: Webinar: SKOS - visión general y modelado de vocabularios controlados

This webinar is scheduled for Wednesday, November 28, 2018, 15:00 UTC (convert this time to your local timezone here) and is free for DCMI members. SKOS (Simple Knowledge Organization Systems) is the W3C recommendation for representing and publishing datasets of classifications, thesauri, subject headings, glossaries and other types of controlled vocabularies and knowledge organization systems. The first part of the webinar includes an overview of semantic web technologies and shows in detail the different elements of the SKOS model.

Posted at 00:00

November 25

Leigh Dodds: UnINSPIREd: problems accessing local government geospatial data

This weekend I started a side project which I plan to spend some time on this winter. The goal is to create a web interface that will let people explore geospatial datasets published by the three local authorities that make up the West of England Combined Authority: Bristol City Council, South Gloucestershire Council and Bath & North East Somerset Council.

Through Bath: Hacked we’ve already worked with the council to publish a lot of geospatial data. We’ve also run community mapping events and created online tools to explore geospatial datasets. But we don’t have a single web interface that makes it easy for anyone to explore that data and perhaps mix it with new data that they have collected.

Rather than build something new, which would be fun but time consuming, I’ve decided to try out TerriaJS. It’s an open source, web-based mapping tool that is already being used to publish the Australian National Map. It should handle the West of England quite comfortably. It’s got a great set of features and can connect to existing data catalogues and endpoints. It seems to be perfect for my needs.

I decided to start by configuring the datasets that are already in the Bath: Hacked Datastore, the Bristol Open Data portal, and data.gov.uk. Every council also has to publish some data via standard APIs as part of the INSPIRE regulations, so I hoped to be able to quickly pull together a list of existing datasets without having to download and manage them myself.

Unfortunately this hasn’t proved as easy as I’d hoped. Based on what we’ve learned so far about the state of geospatial data infrastructure in our project at the ODI I had reasonably low expectations. But there’s nothing like some practical experience to really drive things home.

Here’s a few of the challenges and issues I’ve encountered so far.

  • The three councils are publishing different sets of data. Why is that?
  • The dataset licensing isn’t open and looks to be inconsistent across the three councils. When is something covered by INSPIRE rather than the PSMA end user agreement?
  • The new data.gov.uk “filter by publisher” option doesn’t return all datasets for the specified publisher. I’ve reported this as a bug; in the meantime I’ve fallen back on searching by name
  • The metadata for the datasets is pretty poor, and there is little supporting documentation. I’m not sure what some of the datasets are intended to represent. What are “core strategy areas”?
  • The INSPIRE service endpoints do include metadata that isn’t exposed via data.gov.uk. For example, this South Gloucestershire dataset includes contact details, data on geospatial extents, and format information which isn’t otherwise available. It would be nice to be able to see this without having to read the XML
  • None of the metadata appears to tell me when the dataset was last updated. The last modified date on data.gov.uk is (I think) the date the catalogue entry was last updated. Are the Section 106 agreements listed in this dataset from 2010, or are they regularly updated? How can I tell?
  • Bath is using GetMapping to host its INSPIRE datasets. Working through them on data.gov.uk I found that 46 out of the 48 datasets I reviewed have broken endpoints. I’m reasonably certain these used to work. I’ve reported the issue to the council.
  • The two datasets that do work in Bath cannot be used in TerriaJS. I managed to work around the fact that they require a username and password to access, but have hit a wall because the GetMapping APIs only seem to support EPSG:27700 (British National Grid) and not EPSG:3857 as used by online mapping tools. So the APIs refuse to serve the data in a way that can be used by the framework (reprojecting client-side is one workaround; see the sketch after this list). The Bristol and South Gloucestershire endpoints handle this fine. I assume this is either a limitation of the GetMapping service or a misconfiguration. I’ve asked for help.
  • A single Web Mapping Service can expose multiple datasets as individual layers. But apart from Bristol, both Bath and South Gloucestershire are publishing each dataset through its own API endpoint. I hope the services they’re using aren’t charging per endpoint, as the extra endpoints are probably unnecessary. Bristol has chosen to publish a couple of APIs that bring together several datasets, but these are also available individually through separate APIs.
  • The same datasets are repeated across data catalogues and endpoints. Bristol has its data listed as individual datasets in its own platform, listed as individual datasets in data.gov.uk and also exposed via two different collections which bundle some (or all?) of them together. I’m unclear on the overlap or whether there are differences between them in terms of scope, timeliness, etc. The licensing is also different. Exploring the three different datasets that describe allotments in Bristol, only one actually displayed any data in TerriaJS. I don’t know why.
  • The South Gloucestershire web mapping services all worked seamlessly, but I noticed that if I wanted to download the data, then I would need to jump through hoops to register to access it. Obviously not ideal if I do want to work with the data locally. This isn’t required by the other councils. I assume this is a feature of MisoPortal.
  • The South Gloucestershire datasets don’t seem to include any useful attributes for the features represented in the data. When you click on the points, lines and polygons in TerriaJS no additional information is displayed. I don’t know yet whether this data just isn’t included in the dataset, or if it’s a bug in the API or in how TerriaJS is requesting it. I’d need to download or explore the data in some other way to find out. However, the data that is available from Bath and Bristol also has inconsistencies in how it’s described, so I suspect there aren’t any agreed standards
  • Neither the GetMapping nor the MisoPortal APIs support CORS. This means you can’t access the data from JavaScript running directly in the browser, which is what TerriaJS does by default. I’ve had to configure those to be accessed via a proxy. “Web mapping services” should work on the web.
  • While TerriaJS doesn’t have a plugin for OpenDataSoft (which powers the Bristol Open Data platform), I found that OpenDataSoft does provide a Web Feature Service interface, so I was able to configure TerriaJS to use that instead. Unfortunately I then found that either there’s a bug in the platform or a problem with the data, because most of the points were in the Indian Ocean
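
If an endpoint will only serve British National Grid coordinates, one workaround is to reproject to web mercator client-side (or in a proxy). A minimal sketch, assuming pyproj is installed; the coordinates are made up for illustration:

# Reproject a point from British National Grid (EPSG:27700) to the web
# mercator projection (EPSG:3857) used by most online mapping tools.
# A minimal sketch assuming pyproj >= 2; the coordinates are illustrative.
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:27700", "EPSG:3857", always_xy=True)
easting, northing = 375000.0, 164000.0   # roughly central Bath
x, y = transformer.transform(easting, northing)
print(x, y)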

The goal of the INSPIRE legislation was to provide a common geospatial data infrastructure across Europe. What I’m trying to do here should be relatively quick and easy to do. Looking at this graph of INSPIRE conformance for the UK, everything looks rosy.

But, based on an admittedly small sample of only three local authorities, the reality seems to be that:

  • services are inconsistently implemented and have not been designed to be used as part of native web applications and mapping frameworks
  • metadata quality is poor
  • there is inconsistent detail about features which makes it hard to aggregate, use and compare data across different areas
  • it’s hard to tell the provenance of data because of duplicated copies of data across catalogues and endpoints. Without modification dates or provenance information, it’s unclear whether data is up to date
  • licensing is unclear
  • links to service endpoints are broken. At best, this leads to wasted time from data users. At worst, there’s public money being spent on publishing services that no-one can access

It’s important that we find ways to resolve these problems. As this recent survey by the ODI highlights, SMEs, startups and local community groups all need to be able to use this data. Local government needs more support to help strengthen our geospatial data infrastructure.

Posted at 18:07

November 21

Dublin Core Metadata Initiative: Webinar: Lightweight Methodology and Tools for Developing Ontology Networks

This webinar is scheduled for Thursday, November 29, 2018, 15:00 UTC (convert this time to your local timezone here) and is free for DCMI members. The increasing uptake of semantic technologies is giving ontologies the opportunity to be treated as first-class citizens within software development projects. With the deserved visibility and attention that ontologies are getting comes the responsibility for ontology development teams to combine their activities with software development practices as seamlessly as possible.

Posted at 00:00

November 18

Bob DuCharme: Extracting RDF data models from Wikidata

That's "models", plural.

Posted at 14:41

November 17

Egon Willighagen: New paper: "Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform"

Figure from the article showing the interactive Open PHACTS documentation to access interactions.
Ryan, a PhD candidate in our group, is studying how to represent and use interaction information in pathway databases, and WikiPathways specifically. His paper Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform (doi:10.12688/f1000research.13197.2) was recently accepted in F1000Research; it extends work started by, among others, Andra (see doi:10.1371/journal.pcbi.1004989).

The paper describes the application programming interface (API) methods of the Open PHACTS REST API for accessing interaction information, e.g. to learn which genes are upstream or downstream in a pathway. This information can be used in pharmacological research. The paper discusses example queries and demonstrates how the API methods can be called from HTML+JavaScript and Python.

Posted at 10:30

November 03

Egon Willighagen: Fwd: "We challenge you to reuse Additional Files (a.k.a. Supplementary Information)"

Download statistics of J. Cheminform. Additional Files, showing clear growth.
Posted on the BMC (formerly BioMedCentral) Research in progress blog is our challenge to you to reuse additional files:
Since our open-access portfolio in BMC and SpringerOpen started collaborating with Figshare, Additional Files and Supplementary Information have been deposited in journal-specific Figshare repositories, and files available for the Journal of Cheminformatics alone have been viewed more than ten thousand times. Yet what is the best way to make the most of this data and reuse the files? Journal of Cheminformatics challenges you to think about just that with their new upcoming special issue.
We already know you are downloading the data frequently and more every year, so let us know what you're doing with that data!

For example, I would love to see more data from these additional files end up in databases, such as Wikidata, but any reuse in RDF form would interest me.

Posted at 09:45

October 28

Bob DuCharme: SPARQL full-text Wikipedia searching and Wikidata subclass inferencing

Wikipedia querying techniques inspired by a recent paper.

Posted at 17:37

October 24

Dublin Core Metadata Initiative: Webinar: SKOS - visión general y modelado de vocabularios controlados

IMPORTANT UPDATE: Due to unforeseen circumstances, we have had to postpone this webinar. Please look out for a further announcement giving the new date and time. SKOS (Simple Knowledge Organization Systems) is the W3C recommendation for representing and publishing datasets of classifications, thesauri, subject headings, glossaries and other types of controlled vocabularies and knowledge organization systems. The first part of the webinar includes an overview of semantic web technologies and shows in detail the different elements of the SKOS model.

Posted at 00:00

October 22

AKSW Group - University of Leipzig: AKSW at web.br in São Paulo

From October 1st until 6th, a delegation from the AKSW Group, Leipzig University of Applied Sciences (HTWK), eccenca GmbH, and the Max Planck Institute for Human Cognitive and Brain Sciences went to São Paulo, Brazil to meet people from the Web Technologies Study Center (ceweb.br) and to evaluate future collaboration.
To get to know each other's research interests, we held the Workshop on Linked Data Management.


The Workshop on Linked Data Management (Workshop sobre Gestão de Dados Abertos) was co-located with the annual conference of the Brazilian World Wide Web Consortium (Conferencia web.br 2018) in São Paulo.
During the workshop, 11 talks were given by researchers from the Brazilian hosts and the German delegation.
By presenting our research areas, open questions, and visions to each other, we could identify overlapping research interests and complementary areas of expertise.
A recurring hypothesis was that Open Data is a very powerful method to foster participation, accessibility, and collaboration across areas.
The presentations made visible the potential in the areas of research data in the digital humanities, the accessibility of educational resources and the organization of educational infrastructures, and participation in public administration and government.
A recurring topic in the presentations was the need for collaboration among actors and stakeholders, which in turn raises the need for methodologies and systems to support that collaboration.
An asset for potential future cooperation in this research area between the Brazilian and German sides is the mutually complementary interests and experience of the groups.
The Brazilian side has existing involvement with public administration, government, and education, particularly with the special needs of a developing country perspective.
The German side has a strong background in the creation and operation of data management systems and infrastructures, as well as in data integration.
We are currently in the process of establishing useful communication channels and collaboration platforms which allow efficient joint work across time zones, languages, and continents, to foster the cooperation between the two groups.
To build a common understanding of our interests and skills, the first subject of collaboration is a common, extended documentation of the initial workshop. Following this documentation, a requirements engineering process will be started to identify the concrete needs and potentials on both sides for a joint project in the future.
The second workshop, planned for June 2019, will focus on the results of this discussion.

After the workshop, we also visited the DFG Office in Latin America to discuss possible research collaborations between German institutions and institutions in São Paulo.

The Open Data Management Workshop and the visit of the German delegation are funded by the German Research Foundation (DFG) in cooperation with the São Paulo Research Foundation (FAPESP) under grant agreement number 388784229.

Also read about our trip at the HTWK news portal (German).

Posted at 07:37

October 05

Libby Miller: Etching on a laser cutter

I’ve been struggling with this for ages, but yesterday at Hackspace – thanks to Barney (and I now realise, Tiff said this too and I got distracted and never followed it up) – I got it to work.

The issue was this: I’d been assuming that everything you lasercut had to be a vector DXF, so I was tracing bitmaps using Inkscape in order to make a suitable SVG, converting to DXF, loading it into the lasercut software at hackspace, downloading it and – boom – “the polyline must be closed” for etching: no workie. No matter what I did to the export in Inkscape or how I edited it, it just didn’t work.

The solution is simply to use a black and white png, with a non-transparent background. This loads directly into lasercut (which comes with JustAddSharks lasers) and…just…works.

As a bonus and for my own reference – I got good results with 300 speed / 30 power (below 30 didn’t seem to work) for etching (3mm acrylic).

 

Posted at 12:45

October 01

Ebiquity research group UMBC: paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition

The CCS Dashboard’s sections provide information on sources and targets of network events, file operations monitored and sub-events that are part of the APT kill chain. An alert is generated when a likely complete APT is detected after reasoning over events.

Early Detection of Cybersecurity Threats Using Collaborative Cognition

Sandeep Narayanan, Ashwinkumar Ganesan, Karuna Joshi, Tim Oates, Anupam Joshi and Tim Finin, Early Detection of Cybersecurity Threats Using Collaborative Cognition, 4th IEEE International Conference on Collaboration and Internet Computing, Philadelphia, October 2018.

 

The early detection of cybersecurity events such as attacks is challenging given the constantly evolving threat landscape. Even with advanced monitoring, sophisticated attackers can spend more than 100 days in a system before being detected. This paper describes a novel, collaborative framework that assists a security analyst by exploiting the power of semantically rich knowledge representation and reasoning integrated with different machine learning techniques. Our Cognitive Cybersecurity System ingests information from various textual sources and stores them in a common knowledge graph using terms from an extended version of the Unified Cybersecurity Ontology. The system then reasons over the knowledge graph that combines a variety of collaborative agents representing host and network-based sensors to derive improved actionable intelligence for security administrators, decreasing their cognitive load and increasing their confidence in the result. We describe a proof of concept framework for our approach and demonstrate its capabilities by testing it against a custom-built ransomware similar to WannaCry.

Posted at 13:20

September 28

Gregory Williams: Property Path use in Wikidata Queries

I recently began taking a look at the Wikidata query logs that were published a couple of months ago and wanted to look into how some features of SPARQL were being used on Wikidata. The first thing I’ve looked at is the use of property paths: how often paths are used, what path operators are used, and with what frequency.

Using the “interval 3” logs (2017-08-07–2017-09-03, representing ~78M successful queries [1]), I found that ~25% of queries used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:

Pct. Count Number of Paths
74.3048% 58161337 0 paths used in query
24.7023% 19335490 1 paths used in query
0.6729% 526673 2 paths used in query
0.2787% 218186 4 paths used in query
0.0255% 19965 3 paths used in query
0.0056% 4387 7 paths used in query
0.0037% 2865 8 paths used in query
0.0030% 2327 9 paths used in query
0.0011% 865 6 paths used in query
0.0008% 604 11 paths used in query
0.0006% 434 5 paths used in query
0.0005% 398 10 paths used in query
0.0002% 156 12 paths used in query
0.0001% 110 15 paths used in query
0.0001% 101 19 paths used in query
0.0001% 56 13 paths used in query
0.0000% 12 14 paths used in query

I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths. The type of path operators used skews heavily towards * (ZeroOrMore) as well as sequence and inverse paths that can be rewritten as simple BGPs. Here are the structures representing at least 0.1% of the paths in the dataset:

Pct. Count Path Structure
49.3632% 10573772 ?s <iri1> * ?o .
39.8349% 8532772 ?s <iri1> / <iri2> ?o .
4.6857% 1003694 ?s <iri1> / ( <iri2> * ) ?o .
1.8983% 406616 ?s ( <iri1> + ) / ( <iri2> * ) ?o .
1.4626% 313290 ?s ( <iri1> * ) / <iri2> ?o .
1.1970% 256401 ?s ( ^ <iri1> ) / ( <iri2> * ) ?o .
0.7339% 157212 ?s <iri1> + ?o .
0.1919% 41110 ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o .
0.1658% 35525 ?s <iri1> / <iri2> / <iri3> ?o .
0.1496% 32035 ?s <iri1> / ( <iri1> * ) ?o .
0.1124% 11889 ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o .

There are also some rare but interesting uses of property paths in these logs:

Pct. Count Path Structure
0.0499% 5274 ?s ( ( <iri1> / ( <iri2> * ) ) / ( <iri3> / ( <iri2> * ) ) ) / ( <iri4> / ( <iri2> * ) ) ?o .
0.0015% 157 ?s ( <iri1> / <iri2> / <iri3> / <iri4> / <iri5> / <iri6> / <iri7> / <iri8> / <iri9> ) * ?o .
0.0003% 28 ?s ( ( ( ( <iri1> / <iri2> / <iri3> ) ? ) / ( <iri4> ? ) ) / ( <iri5> * ) ) / ( <iri6> / ( <iri7> ? ) ) ?o .

Without further investigation it’s hard to say if these represent meaningful queries or are just someone playing with SPARQL and/or Wikidata, but I found them curious.

  [1] These numbers don’t align exactly with the Wikidata query dumps, as there were some that I couldn’t parse with my tools.

Posted at 17:06

September 23

Ebiquity research group UMBC: talk: Design and Implementation of an Attribute Based Access Controller using OpenStack Services

Design and Implementation of an Attribute Based Access Controller using OpenStack Services

Sharad Dixit, Graduate Student, UMBC
10:30am Monday, 24 September 2018, ITE346

With the advent of cloud computing, industries began a paradigm shift from the traditional way of computing towards cloud computing, as it fulfilled organizations' present requirements such as on-demand resource allocation, lower capital expenditure, scalability and flexibility, but with that it brought a variety of security and user data breach issues. To solve the issues of user data and security breaches, organizations have started to implement hybrid clouds, where the underlying cloud infrastructure is set up by the organization and is accessible from anywhere around the world because of the distinguishable security edges provided by it. However, most cloud platforms provide a Role Based Access Controller, which is not adequate for complex organizational structures. A novel mechanism is proposed using OpenStack services and semantic web technologies to develop a module which evaluates a user's and project's multi-varied attributes and runs them against access policy rules defined by an organization before granting access to the user. Henceforth, an organization can deploy our module to obtain robust and trustworthy access control based on multiple attributes of a user and the project the user has requested in a hybrid cloud platform like OpenStack.

Posted at 19:44

Bob DuCharme: Panic over "superhuman" AI

Robot overlords not on the way.

Posted at 16:27

September 22

Libby Miller: Simulating crap networks on a Raspberry Pi

I’ve been having trouble with libbybot (my Raspberry Pi / lamp based presence robot) in some locations. I suspect this is because the Raspberry Pi 3’s inbuilt wifi antenna isn’t as strong as that in, say a laptop, so wifi problems that go unnoticed most of the time are much more obvious.

The symptoms are these:

  • Happily listening / watching remotely
  • Stream dies
  • I get a re-notification that libbybot is online, but can’t connect to it properly

My hypothesis is that the Raspberry Pi is briefly losing wifi connectivity, Chromium auto-reconnects, but the webRTC stream doesn’t re-initiate.

Anyway, the first step to mitigating the problem was to try and emulate it. There were a couple of ways I could have gone about this. One was to use network shaping tools on my laptop to try and emulate the problems by messing with the receiving end. A more realistic way would be to shape the traffic on the Pi itself, as that’s where the problem is occurring.

Searching for network shaping tools – and specifically dropped packets and network latency – led me to dummynet, the FreeBSD traffic shaper configured via ipfw. However, this is tightly coupled to the kernel and doesn’t seem suitable for the Raspberry Pi.

On the laptop, there is a tool for network traffic shaping on Mac OS – it used to be ipfw, but since 10.10 (details) it’s been an app called Network Link Conditioner, available as part of the Mac OS X developer tools.

Before going through the Xcode palaver for something that wasn’t really what I wanted, I had one last dig for an easier way, and indeed there is one: wondershaper led me to using tc to limit the bandwidth, which in turn led to iptables for dropped packets.

But. None of these led to the behaviour that I wanted; in fact libbybot (which uses RTCMultiConnection for webRTC) worked perfectly under most conditions I could simulate. The same when using tc with Netem, which can emulate network-wide delays – all fine.

Finally I twigged that the problem was probably a several-second network outage, and for that you can use iptables again – in this case, using it to stop the web page (which runs on port 8443) being accessed from the Pi. Using this I managed to emulate the symptoms I’d been seeing.

Here are a few of the commands I used, for future reference.

The final, useful command: emulate a dropped network on a specific port for 20 seconds, using iptables on the OUTPUT chain:

#!/bin/bash
echo "stopping external to 8443"
iptables -A OUTPUT -p tcp --dport 8443 -j DROP
sleep 20
echo "restarting external to 8443"
iptables -D OUTPUT -p tcp --dport 8443 -j DROP

Other things I tried: drop 30% of (input or output) packets randomly, using iptables’ statistic module

sudo iptables -A INPUT -m statistic --mode random --probability 0.30 -j DROP

sudo iptables -A OUTPUT -m statistic --mode random --probability 0.30 -j DROP

list current iptables rules

iptables -L

clear all (flush)

iptables -F

Delay all packets by 100ms using tc and netem

sudo tc qdisc add dev wlan0 root netem delay 100ms

change that to 2000ms

sudo tc qdisc change dev wlan0 root netem delay 2000ms 10ms 25%

All the tc rules go away when you reboot.

Context and links:

tc and netem: openWRT: Netem (Network emulator)

iptables: Using iptables to simulate service interruptions by Matt Parsons, and The Beginner’s guide to iptables, the Linux firewall

 

Posted at 12:40

September 17

Dublin Core Metadata Initiative: A Successful DCMI 2018 Conference

The DCMI Annual Conference was held last week, hosted by the Faculty of Engineering of the University of Porto, Portugal. The conference was co-located with TPDL which meant that while many people arrived as part of one community, all left with the experience and appreciation of two! The full conference proceedings are now available, with copies of presentation slides where appropriate. Some photographs of the conference can be found on Flickr, tagged with 'dcmi18'.

Posted at 00:00

September 15

Egon Willighagen: Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers in Wikidata. Let's say I want to list all Greek letters, with the lower case letter in one column and the upper case letter in the other. This is what our data looks like:


So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}

Ah, that gets us somewhere:



But you see that the upper and lower case are still in separate rows, rather than columns. To fix that, we need access to those qualifiers. It's all in there in the Wikidata RDF, but the model is giving people a headache (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to that than just keeping notebooks, tho).

Qualifiers
So, a lot of important information is put in qualifiers, and not just the statements. Let's first get all statements for a Greek letter. We would do that with:

?letter ?pprop ?statement .

One thing we want to know about the property we're looking at is the entity linked to it. We do that by adding this bit:

?property wikibase:claim ?propp .

Of course, the property we are interested in is the Unicode character, so we can put that in directly:

wd:P487 wikibase:claim ?propp .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want any qualifier but the applies to part, we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in lower case and upper case, and we can put that in as well (for upper case):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, combining these pieces (here filtering on the lower case letter, Q8185162), we get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see in the above example that we still get the Unicode character separately from the statement, via the plain wdt: triple. This needs to be integrated, and we need wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .  
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q98912: upper case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the ?letter and ?unicode variables. But I left the wdt:P361 wd:Q8216 (='part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .  
  ?qualifierProperty wikibase:qualifier ?qualifier .
}

So, there is our recipe, for everyone to copy/paste.
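
If you want to run the recipe outside the query service web interface, the same template can be posted to the public endpoint at https://query.wikidata.org/sparql. A minimal sketch using Python and requests (only standard SPARQL-over-HTTP with JSON results is assumed):

# Run the template query against the Wikidata Query Service and print the
# results; a minimal sketch using requests.
import requests

query = """
SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }   # lower case
  VALUES ?qualifierProperty { wd:P518 }     # applies to part
  VALUES ?statementProperty { wd:P487 }     # Unicode character

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?propp ;
                     wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .
  ?qualifierProperty wikibase:qualifier ?qualifier .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "greek-alphabet-recipe/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["entityOfInterest"]["value"], row["statementDataValue"]["value"])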

Completing the Greek alphabet example
OK, now since I actually started with the upper and lower case Unicode characters for Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix. Some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed, and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, the one thing still missing is a human readable name. The frequently used SERVICE wikibase:label does a pretty decent job, and we end up with this table:


Posted at 09:12

September 13

AKSW Group - University of Leipzig: AskNow 0.1 Released

Dear all,

we are very happy to announce AskNow 0.1 – the initial release of Question Answering Components and Tools over RDF Knowledge Graphs.

Website: http://asknow.sda.tech/
Demo: http://asknowdemo.sda.tech
GitHub: https://github.com/AskNowQA

The following components with corresponding features are currently supported by AskNow:

  • AskNow UI 0.1: The UI works as a platform for users to pose their questions to the AskNow QA system. It displays the answers based on whether the answer is an entity or a list of entities, a boolean, or a literal. For entities it shows the abstracts from DBpedia.
    Github: https://github.com/AskNowQA/AskNowUI

We want to thank everyone who helped to create this release, in particular the projects HOBBIT, SOLIDE, WDAqua, BigDataEurope.

View this announcement on Twitter: https://twitter.com/AskNowQA/status/1040205350853599233

Kind regards,
The AskNow Development Team
(http://asknow.sda.tech/people/)

Posted at 13:35

Dublin Core Metadata Initiative: Webinar: SKOS - Overview and Modeling of Controlled Vocabularies

This webinar is scheduled for Thursday, October 11, 2018, 14:00 UTC (convert this time to your local timezone here) and is free for DCMI members. SKOS (Simple Knowledge Organization Systems) is the recommendation of the W3C to represent and publish datasets for classifications, thesauri, subject headings, glossaries and other types of controlled vocabularies and knowledge organization systems in general. The first part of the webinar includes an overview of the technologies of the semantic web and shows in detail the different elements of the SKOS model.

Posted at 00:00


September 11

W3C Blog Semantic Web News: JSON-LD Guiding Principles, First Public Working Draft

Coming to consensus is difficult in any working group, and doubly so when the working group spans a broad cross-section of the web community. Everyone brings their own unique set of experiences, skills and desires for the work, but at the end of the process, there can be only one specification.  In order to provide a framework in which to manage the expectations of both participants and other stakeholders, the JSON-LD WG started out by establishing a set of guiding principles.  These principles do not constrain decisions, but provide a set of core aims and established consensus to reference during difficult discussions.  The principles are lights to lead us back out of the darkness of never-ending debate towards a consistent and appropriately scoped set of specifications. A set of specifications that have just been published as First Public Working Drafts.

These principles start with the uncontroversial “Stay on target!”, meaning to stay focused on the overall mission of the group to ensure the ease of creation and consumption of linked data using the JSON format by the widest possible set of developers. We note that the target audience is software developers generally, not necessarily browser-based applications.

To keep the work grounded, we also decided on the important principle of requiring use cases with actual data that have support from at least two organizations (W3C members or otherwise). The use cases are intended to be supporting evidence for the practicality and likely adoption of a proposed feature, not a heavyweight requirements analysis process.

Adoption of specifications is always a concern, and to maximize the likelihood of uptake, we have adopted several principles around simplicity, usability and preferring phased or incremental solutions. To encourage experimentation and to try and reduce the chances of needing a breaking change in the future, we have adopted a principle of defining only what is conforming to the specification, and leaving all other functionality open. Extensions are welcome; they are just not official or interoperable.

Finally, and somewhat controversially, we adopted the principle that new features should be compatible with the RDF Data Model. While there are existing features that cannot be expressed in RDF that are possible in JSON-LD, we do not intend to increase this separation between the specifications and hope to close it as much as possible.

Using these guidelines, the working group has gotten off to a very productive start and came to very quick consensus around whether or not many features suggested to the JSON-LD Community Group were in scope for the work or not, including approving the much requested lists-of-lists functionality. This will allow JSON arrays to directly include JSON arrays as items in JSON-LD 1.1, enabling a complete semantic mapping for JSON structures such as GeoJSON, and full round-tripping through RDF. The publication of the FPWD documents is a testimony to the efforts of the Working Group, and especially those of Gregg Kellogg as editor.
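
As a rough illustration of what lists-of-lists buys you (my own sketch, not an example from the drafts): once a term is declared with an @list container, a JSON-LD 1.1 processor can interpret nested JSON arrays, such as GeoJSON coordinates, as RDF lists whose members are themselves lists. Shown here as a Python literal; the geojson vocabulary IRI is illustrative.

# My own sketch of the lists-of-lists feature, not an example from the drafts:
# "coordinates" is declared as an @list container, so a JSON-LD 1.1 processor
# can map the nested JSON arrays to nested RDF lists.
doc = {
    "@context": {
        "geojson": "https://purl.org/geojson/vocab#",
        "coordinates": {"@id": "geojson:coordinates", "@container": "@list"},
    },
    "@type": "geojson:LineString",
    "coordinates": [[102.0, 0.0], [103.0, 1.0]],  # a list whose items are lists
}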

Posted at 22:19

September 08

Egon Willighagen: Also new this week: "Google Dataset Search"

There was a lot of Open Science news this week. The announcement of the Google Dataset Search was one of them:


Of course, I first tried searching for "RDF chemistry", which shows some of my data sets (and a lot more):


It picks up data from many sources, such as Figshare in this image. That means it also works (well, sort of, as Noel O'Boyle noticed) for supplementary information from the Journal of Cheminformatics.

It picks up metadata in several ways, among which schema.org annotations. So, next week we'll see if we can get eNanoMapper extended to spit out compatible JSON-LD for its data sets, called "bundles".

Integrated with Google Scholar?
While the URL for the search engine does not suggest the service is more than a 20% project, we can hope it will stay around like Google Scholar has. But I do hope they will further integrate it with Scholar. For example, in the above figure, while it did pick up that I am the author of that data set (well, repurposed from an effort of Rich Apodaca), it did not figure out that I am also on Scholar.

So, these data sets do not show up in your Google Scholar profile yet, but they must. Time will tell where this data search engine is going. There are many interesting features, and given the amount of online attention, they won't stop development just yet, and I expect to discover more and better features in the next months. Give it a spin!

Posted at 09:13

August 27

Bob DuCharme: Pipelining SPARQL queries in memory with the rdflib Python library

Using retrieved data to make more queries.

Posted at 13:55

August 18

Egon Willighagen: Compound (class) identifiers in Wikidata

Bar chart showing the number of compounds with a particular chemical identifier.
I think Wikidata is a groundbreaking project, which will have a major impact on science. Among the reasons are the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase, including a cool SPARQL endpoint, is easily done with Docker.

Wikidata has many sub projects, such as WikiCite, which captures the collective primary literature. Another one is the WikiProject Chemistry. The two nicely match up, I think, making a public database linking chemicals to literature (though very much still needs to be done here); see my recent ICCS 2018 poster (doi:10.6084/m9.figshare.6356027.v1, paper pending).

But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for our metabolism pathway research. The mappings, as you may know, are used in the latter via BridgeDb, and we have been using Wikidata as one of three sources for some time now (the others being HMDB and ChEBI). WikiProject Chemistry has a related ChemID effort, and while the wiki page does not show much recent activity, there is actually a lot of ongoing effort (see plot). And I've been adding my bits.

Limitations of the links
But not every identifier in Wikidata has the same meaning. While they are all classified as 'external-id', the actual link may have a different meaning. This, of course, is the essence of scientific lenses; see this post and the papers cited therein. One reason here is the difference in what entries in the various databases mean.

Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes. And these are modeled differently. Furthermore, it has a model that formalizes that things with a different InChI are different, but even allows things with the same InChI to be different, if the need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such, it is a powerful system to handle identifier mappings, because databases are not always clear, and chemical and biological data even less so: we measure experimentally a characterization of chemicals, but what we put in databases and give names are specific models (often chemical graphs).

That model differs from what other (chemical) databases use, or seem to use, because databases do not always indicate what they actually have in a record. But I think this is a fair guess.

ChEBI
ChEBI (and the matching ChEBI ID) has entries for chemical classes (e.g. fatty acid) and specific compounds (e.g. acetate).

PubChem, ChemSpider, UniChem
These three resources use the InChI as their central asset. While they do not really have the concept of compound classes (though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each one has its own way to link to other databases, which normally involves tons of structure normalization (see e.g. doi:10.1186/s13321-018-0293-8 and doi:10.1186/s13321-015-0072-8).

HMDB
HMDB (and the matching P2057) has a biological perspective; the entries reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral forms. This makes linking to/from other databases where the compound is chemically not neutral less precise.

CAS registry numbers
CAS (and the matching P231) is pretty unique itself: it has identifiers for substances (see Q79529), much more than for chemical compounds, and comes with its own set of unique features. For example, solutions of some compound, by design, have the same identifier. Previously, formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.

Limitations of the links #2
Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise as possible. Of course, that may mean we need more steps, but we can always simplify at will, whereas we can never have a computer make the links more complex (well, not without making assumptions, etc).

And that is why Wikidata is so suitable to link all these chemical databases: it can distinguish differences when needed, and make that explicit. It makes mappings between the databases more FAIR.


Posted at 12:46


Copyright of the postings is owned by the original blog authors. Contact us.