Planet RDF

It's triples all the way down

March 22

AKSW Group - University of Leipzig: LDK conference @ University of Leipzig

With the advent of digital technologies, an ever-increasing amount of language data is now available across various application areas and industry sectors, thus making language data more and more valuable. In that context, we are happy to invite you to join the 2nd Language, Data and Knowledge (LDK) conference which will be held in Leipzig from May 20th till 22nd, 2019.

This new biennial conference series aims at bringing together researchers from across disciplines concerned with language data in data science and knowledge-based applications.

In that context, the acquisition, provenance, representation, maintenance, usability and quality of language data, as well as its legal, organizational and infrastructure aspects, are at the centre of current research on language data and thus constitute the focus of the conference.

To register and be part of the LDK conference and its associated events, please go to http://2019.ldk-conf.org/registration/.

Keynote Speakers

  • Keynote #1: Christian Bizer, Mannheim University
  • Keynote #2: Christiane Fellbaum, Princeton University
  • Keynote #3: Eduard Werner, Leipzig University

Associated Events

The following events are co-located with LDK 2019:

Workshops on the 20th May 2019

DBpedia Community Meeting on the 23rd May 2019

Looking forward to meeting you at the conference!

Posted at 08:21

February 24

Bob DuCharme: curling SPARQL

A quick reference.
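For readers who haven't tried it, a minimal sketch of the idea, querying a public SPARQL endpoint with curl (the endpoint and query below are only illustrative, not taken from Bob's post):

# Ask the DBpedia endpoint for ten bands and their English labels.
# The Accept header requests the SPARQL JSON results format;
# --data-urlencode takes care of escaping the query.
curl -G "https://dbpedia.org/sparql" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?band ?label WHERE {
      ?band a dbo:Band ;
            rdfs:label ?label .
      FILTER(lang(?label) = 'en')
    } LIMIT 10"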

Posted at 15:45

February 22

AKSW Group - University of Leipzig: 13th DBpedia community meeting in Leipzig

We are happy to invite you to join the 13th edition of the DBpedia Community Meeting, which will be held in Leipzig. Following the LDK conference, May 20-22, the DBpedia Community will get together on May 23rd, 2019 at Mediencampus Villa Ida. Once again the meeting will be accompanied by a varied program of exciting lectures and showcases.

Highlights/ Sessions

  • Keynote #1: Making Linked Data Fun with DBpedia by Peter Haase, metaphacts
  • Keynote #2: From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph by Heiko Paulheim, Universität Mannheim
  • NLP and DBpedia Session
  • DBpedia Association Hour
  • DBpedia Showcase Session

Call for Contribution

What cool things do you do with DBpedia? Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.

Tickets

Attending the DBpedia Community Meeting costs 40 €. You need to buy a ticket via eshop.sachsen.de. DBpedia members get free admission; please contact your nearest DBpedia chapter or the DBpedia Association for a promotion code.

If you would like to attend the LDK conference, please register here.

We are looking forward to meeting you in Leipzig!

Posted at 11:22

February 15

Dublin Core Metadata Initiative: DCMI 2019: Call for Participation

The Dublin Core Metadata Initiative (DCMI) annual international conference in 2019 will be hosted by the National Library of Korea, in Seoul, South Korea, 23rd - 26th September, 2019. The Organising Committee has just published a call for participation. Please consider submitting a proposal!

Posted at 00:00

February 09

Egon Willighagen: Comparing Research Journals Quality #1: FAIRness of journal articles

What a traditional research article looks like: nice layout, hard to reuse the knowledge from. Image: CC BY-SA 4.0.
After Plan S was proposed, there finally was a community-wide discussion on the future of publishing. Not everyone is clearly saying whether they want open access or not, but it's a start. Plan S aims to reform the current model. (Interestingly, the argument that not a lot of journals are currently "compliant" is sort of the point of the Plan.) One thing it does not want to reform is the quality of the good journals (at least, I have not seen that as one of the principles). There are many aspects to the quality of a research journal. There are also many things that disguise themselves as aspects of quality but are not. This series discusses the quality of research journals. We skip the trivial aspects, like peer review, for now, because I honestly do not believe that the cOAlition S funders want worse peer review.

We start with FAIRness (doi:10.1038/sdata.2016.18). This falls, if you like, under the category of added value. FAIRness does not change the validity of the conclusions of an article; it just improves the rigor of the knowledge dissemination. To me, a quality journal is one that takes knowledge dissemination seriously. All journals have a heritage of being printed on paper, and most journals have been very slow in adopting innovative approaches. So, let's put down some requirements for the journal of 2020.

First, about the article itself:

About findable

  • uses identifiers (DOI) at least at article level, but possibly also for figures and supplementary information
  • provides data of an article (including citations)
  • data is actively distributed (PubMed, Scopus, OpenCitations, etc)
  • maximizes findability by supporting probably more than one open standard
About accessible
  • data can be accessed using open standards (HTTP, etc)
  • data is archived (possibly replicated by others, like libraries)
About interoperable
  • data is using open standards (RDF, XML, etc)
  • data uses open ontologies (many open standards exist, see this preprint)
  • uses linked data approaches (e.g. for citations)
About reusable
  • data is as complete as possible
  • data is available under an Open Science compliant license
  • data uses modern and widely used community standards
Pretty straightforward. For author, title, journal name, year, etc., most journals apply this. Of course, bigger publishers that invested in these aspects many moons ago can comply much more easily, because they already were compliant.
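As a concrete illustration of the article-level points: for articles whose DOIs are registered with Crossref or DataCite, machine-readable metadata is a single HTTP request away via content negotiation on the DOI. A minimal sketch, using the FAIR paper cited above (the exact set of supported formats depends on the registration agency):

# Resolve a DOI to machine-readable metadata instead of the landing page.
curl -sL -H "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.1038/sdata.2016.18

# The same record as RDF (Turtle), for linked-data workflows.
curl -sL -H "Accept: text/turtle" https://doi.org/10.1038/sdata.2016.18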

Second, what about the content of the article? There we start seeing huge differences.

About findable
  • important concepts in the article are easily identified (e.g. with markup)
  • important concepts use (compact) identifiers
Here, the important concepts are entities like cities, genes, metabolites, species, etc, etc. But also reference data sets, software, cited articles, etc. Some journals only use keywords, some journals have policies about use of identifiers for genes and proteins. Using identifiers for data and software is rare, sadly.

About accessible
  • articles can be retrieved by concept identifiers (via open, free standards)
  • article-concept identifier links are archived
  • table and figure data is annotated with concept identifiers
  • table and figure data can be accessed in an automated way
Here we see a clear problem. Publishers have been actively fighting this for years, even to today. Text miners and projects like Europe PMC are stepping in, but they are severely hampered by copyright law and by publishers not wishing to make exceptions.

About interoperable
  • concepts are described using common standards (many available)
  • table and figure data is available as something like CSV, RDF
Currently, the only serious standards used by the majority of (STM?) journals are MeSH terms for keywords and perhaps CrossRef XML for citations. Tables and figures are more than just graphical representations. Some journals are experimenting with this.

About reusable
  • the content of the article has a clear licence, Open Science compliant
  • the content is available using the community standards of today
This is hard. These community standards are a moving target. For example, how we name concepts changes over time. Identifiers themselves also change over time. But a journal can be specific and accurate, which ensures that even 50 years from now, the context of the content can be determined. Of course, with proper Open Science approaches, translation to the community standards of that time is simplified.

There are tons of references I can give here. If you really like these ideas, I recommend:
  1. continue reading my blog with many, many pointers
  2. read (and maybe sign) our Open Science Feedback to the Guidance on the Implementation of Plan S (doi:10.5281/zenodo.2560200), which incorporates many of these ideas


Posted at 16:11

February 05

Egon Willighagen: Plan S: Less publications, but more quality, more reusable? Yes, please.

If you look at opinions published in scholarly journals (RSS feed, if you like to keep up), then Plan S is all 'bout the money (as Meja already tried to warn us):


No one wants puppies to die. Similarly, no one wants journals to die. But maybe we should. Well, the journals, not the puppies. I don't know, but it does make sense to me (at this very moment):

The past few decades have seen a significant growth in the number of journals. And before hybrid journals were introduced, publishers tended to start new journals rather than make existing journals Open Access. At the same time, the number of articles has also gone up significantly. In fact, the flood of literature is drowning researchers, and this problem has been discussed for years. But if we have too much literature, should we not aim for less literature? And do it better instead?

Over the past 13 years I have blogged on many occasions about how we can make journals more reusable. And many open scientists can quote you Linus: "given enough eyeballs, all bugs are shallow". In fact, worded just a bit differently, any researcher will tell you exactly the same, which is why we do peer review.
But the problem here is the first two words: given enough.

What if we just started publishing half of what we do now? If we have an APC business model, we have immediately halved(!) the publishing cost. We also save ourselves a lot of peer-review work and the reading of marginal articles.

And what if we used the time we freed up to actually make knowledge dissemination better? Make journal articles actually machine readable, put some RDF in them? What if we could reuse supplementary information? What if we could ask our smartphone to compare the claims of one article with those of another, just like we compare two smartphones. Oh, they have more data, but theirs has a smaller error margin. Oh, they tried it at that temperature, which seems to work better than in that other paper.

I have blogged about this topic for more than a decade now. I don't want to wait another 15 years for journal publications to evolve. I want some serious activity. I want Open Science in our Open Access.

This is one of my personal motives for our Open Science Feedback to cOAlition S, and I am happy that 40 people from 12 countries joined in the past 36 hours. Please have a read, and please share it with others. Let your social network know why the current publishing system needs serious improvement and that Open Science has had the answer for years now.

Help our push and show your support to cOAlition S to trigger exactly this push for better scholarly publishing: https://docs.google.com/document/d/14GycQnHwjIQBQrtt6pyN-ZnRlX1n8chAtV72f0dLauU/edit?usp=sharing

Posted at 07:01

January 30

Tetherless World Constellation group RPI: AGU Fall Meeting 2018

The American Geophysical Union (AGU) Fall Meeting 2018 was the first time I attended a conference of such magnitude in all aspects – attendees, arrangements, content and information. It is an overwhelming experience for a first-timer but totally worth it. The amount of knowledge and information that one can pick up at this event is the biggest takeaway; how much depends on each person's abilities, but trying to get the most out of it is what one will always aim for.

There were 5 to 6 types of events that were held throughout the day for all 5 days. The ones that stood out for me were the poster sessions, e-lightning talks, oral sessions and the centennial plenary sessions.
The poster sessions helped me see at a glance the research that is going on in the various fields all over the world. No matter how much I tried, I found it hard to cover all the sections that piqued my interest in the poster hall. The e-lightning talks were a good way to strike up a conversation on the topic of the talks and get a discussion going among all the attendees. Being a group-discussion format, I felt that there was more interaction compared to the other venues. The oral sessions were a great place to get to know how people are exploring their areas of interest and the various methods and approaches that they are using. However, I felt that it is hard for the presenter to cover everything that is important and relevant in the given time span. The time constraints are there for a very valid reason, but they might lead to someone losing out on leads if the audience doesn’t fully get the concept. Not all presenters were up to the mark. I could feel a stark difference between the TWC presenters (who knew how to get all the right points across) and the rest. The centennial plenary sessions were special this year as AGU is celebrating its centennial year. These sessions highlighted the best of research practices, innovations, achievements and studies. The time slots for these sessions were very small but the work spoke for itself.

The Exhibit Hall had all the companies and organisations that are in the field or related to it. Google, NASA and AGU had sessions, talks and events being conducted here as well. While Google and NASA were focussing on showcasing the ‘Geo-‘ aspect of their work, AGU was also focussing on the data aspect, which was refreshing. They had sessions going on about data from the domain scientists’ point of view. This comes across as fundamental or elementary knowledge to us at TWC, but the way they are trying to enable domain scientists to communicate better with data scientists is commendable. AGU is also working on an initiative called “Make data ‘FAIR’ (Findable, Accessible, Interoperable, Reusable) again”, which is once again trying to spread awareness amongst the domain scientists. The exhibit hall is also a nice place to interact with industry, universities and organisations that have research programs for doctoral students and postdocs.

In retrospect, I think planning REALLY ahead of time is a good idea so that you know what to ditch and what not to miss. A list of ‘must attend’ sessions could have helped with the decision-making process. A group discussion at one of our meetings where everyone shares what they find important, before AGU, could be a good idea. Being just an audience member is great and one gets to learn a lot, but contributing to this event would be even better. This event was amazing and has given me a good idea of how to be prepared the next time I am attending it.

Posted at 14:49

January 29

Tetherless World Constellation group RPI: AGU Conference: Know Before You Go

If this is your first American Geophysical Union (AGU) conference, be ready! Below are a few pointers for future first-timers.

The conference I attended was hosted in Washington, D.C. at the Walter E. Washington Convention Center during the week of December 10th, 2018. It brought together over 25,000 people. Until this conference, I had not experienced the pleasure and the power of so many like-minds in one space. The experience, while exhausting, was exhilarating!

One of the top universal concerns at the AGU Conference is scheduling. You should know that I was not naïve to the opportunities and scheduling difficulties prior to 2018, my first year of attendance. I had spent the last several months organizing an application development team that successfully created a faceted browsing app with calendaring for this particular conference using live data. Believe me when I say, “Schedule before you go”. Engage domain scientists and past participants about sessions, presentations, events, and posters that are a must-see. There is so much to learn at the conference. Do not miss the important stuff. The possibilities are endless, and you will need the expertise of those prior attendees. Plan breaks for yourself. Use those breaks to wander the poster hall, exhibit hall, or the vendor displays.

Key Elements in Scheduling Your Week

  • Do not front load your week. You need time to explore.
    • Be prepared to alter your existing schedule, as a result.
  • Plan on being exhausted.
  • Eat to fuel your body and your mind.
    • Relax, but not too much.
  • Plan on networking. To do that, you need to be sharp!
    • The opportunities to network will exceed your wildest expectations.
  • Take business cards – your own, and from people you meet.

Finally, take some time to see the city that holds the conference. There are many experiences to be had that will add to your education.

The Sessions

So. Many. Sessions!

There are e-lightning talks. There are oral sessions.  There are poster sessions. There are town hall sessions. There are scientific workshops. There are tutorial talks. There are keynotes. Wow!

The e-lightning talks are exciting. There are lots of opportunities to interact in this presentation mode. The e-lightning talks are held in the Poster Hall. A small section provides chairs for about 15 – 20 attendees, with plenty of standing-room-only space. This informal session leads to great discussion amongst attendees. Be sure to put one of these in your schedule!

Oral sessions are what you would expect; people working in the topic, sitting in chairs at the front of the room, each giving a brief talk, then, time permitting, a Q&A session at the end. Remember these panels are filled with knowledge. For the oral sessions that you schedule to attend, read the papers prior to attending. More importantly, have some questions prepared.

//Steps onto soapbox//

  1. If you are female, know the facts! (Nature International Journal of Science, 2018)
  2. Females are less likely to ask a question if a male asked a prior question.
  3. Get up there!
  4. Grab the mic!
  5. Ask the question anyway.
  6. Do NOT wait to speak with the presenters until afterwards. They are feeling just as overwhelmed as you are by all of the opportunities available to them at this conference.
  7. Please read the referenced article in bullet #1. The link is provided at the end of this post.

//Steps down from soapbox//

The poster sessions are a great way to unwind by getting in some walking. There are e-posters which are presented on screens provided by AGU or the venue. There are the usual posters as well. The highlights of attending a poster session, besides the opportunity to stretch your legs, include the opportunity to practice meeting new people, asking in-depth questions on topics of interest, talking to people doing the research, and checking out the data being used for the research. You will want to have a notepad with you for the poster sessions. Don’t just take notes; take business cards! Remember, what makes poster sessions special is that they are an example of the latest research that has not, yet, become a published paper. The person doing the research is quite likely the presenter of the poster.

All those special sessions – the town halls, the scientific workshops, the tutorial talks, and keynotes – these are the ones that you ask prior attendees, past participants, and experts on which ones are the must-see. Get them in your schedule. Pay attention. Take notes. Read the papers behind the sessions; if not the papers, the abstracts as a minimum. Have your questions ready before you go!

Timing

This is really important. Do NOT arrive without your time at this conference well planned. To do that, you are going to need to spend several weeks preparing: reading papers, studying schedules, writing questions, and more. In order to have a really successful, time-well-spent type of experience, you are going to need to begin preparing for this immense conference by November 1st.

Oh, how I wish I had listened to all the people that told me this!

Put an hour per day in your calendar, from November 1st until AGU Conference Week, to study and prepare for this conference. I promise you will not regret the time you spent preparing.

The biggest thing to remember and the one thing that all attendees must do is:

Have a great time!

Works Cited

Nature International Journal of Science. (2018, October 17). Why fewer women than men ask questions at conferences. Retrieved from Nature International Journal of Science Career Brief: https://www.nature.com/articles/d41586-018-07049-x

Posted at 18:47

January 26

Leigh Dodds: Impressions from pidapalooza 19

This week I was at the third pidapalooza conference in Dublin. It’s a conference dedicated to open identifiers: how to create them, steward them, drive adoption and promote their benefits.

Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.

I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.

Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.

Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.

But I think I was one of only a few (perhaps the only?) attendee from what I’ll call the “broader” open data community. That wasn’t a complete surprise but I think the conference as a whole could benefit from a wider audience and set of participants.

If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.

Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people that were past debating whether data is infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There are a lot of lessons there that are more generally applicable.

For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc). There are lessons here for other sectors around benefits of wider sharing of data.

I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.

The need for transparent governance and long-term sustainability were also frequent topics. As was the recognition that data infrastructure takes time to build. The technology is easy; it’s growing a community and building consensus around an approach that takes time.

(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)

Secondly, the pidapalooza community seems to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or supported by a range of APIs and services.

Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of use of identifiers in research datasets. Again, this focus follows from the community around the conference.

But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy, and ethical considerations around identification of individuals, for example, that will need to be addressed.

I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.

Posted at 12:33

Leigh Dodds: Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome your thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But it’s frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get your car taxed. We’ve explored different models for stewarding registers and ways to build trust around how they are created and maintained.

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, there are often resolvers that help to integrate them with the web.

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, an increasing number of data users and communities are likely to encounter these identifiers.

I’m sure everyone in this room knows what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did.) TOIDs are a national identifier scheme: there’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities, it’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.

Finally identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave your comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.

Before jumping into bike-shedding around formats, there are a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further, and also the best forum for doing that. For example, should there be a W3C community group or similar that we could use to discuss and publish some best practice?

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.

Posted at 11:41

Leigh Dodds: Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information is evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that describe a model for tabular data on the web, a metadata vocabulary for describing that data, and standard mappings from CSV into RDF and JSON.

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics is also moving to adopt it.)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t a great many tools that support CSV on the web currently.

It might also be that actually there’s some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations. E.g. so that we can clearly describe what type of statistics are being reported in a dataset?

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
  • open data portals could also readily adopt CSV on the Web metadata, most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.

Posted at 10:55

January 25

Dublin Core Metadata Initiative: W3C Data Exchange Working Group - Request for Feedback

The W3C Data Exchange Working Group has issued first drafts of two potential standards relating to profiles. The first is a small ontology for profile resources. This ontology is based on the case where an application profile is made up of one or more documents or resources, such as both human-readable instructions and a SHACL validation document. The ontology links those into a single graph called a "profile" and states the role of each resource.

Posted at 00:00

January 22

Tetherless World Constellation group RPI: TWC at AGU FM 2018

In 2018, AGU celebrated its centennial year. TWC had a good showing at this AGU, with 8 members attending and presenting on a number of projects.

We arrived in DC on Saturday night to attend the DCO Virtual Reality workshop organized by Louis Kellogg and the DCO Engagement Team, where researchers from the greater DCO community came together to present, discuss and understand how the use of VR can facilitate and improve both research and teaching. Oliver Kreylos and Louis Kellogg spent various sessions presenting the results of the DCO VR project, which involved recreating some of the visualizations used commonly at TWC, i.e. the mineral networks. For a preview of the VR environment, check out these three tweets. Visualizing mineral networks in a VR environment has yielded some promising results; we observed interesting patterns in the networks which need to be explored and validated in the near future.

With a successful pre-AGU workshop behind us, we geared up for the main event. First thing Monday morning was the “Predictive Analytics” poster session, which Shaunna Morrison, Fang Huang, and Marshall Ma helped me convene. The session, while low on abstracts submitted, was full of very interesting applications of analytics methods in various earth and space science domains.

Fang Huang also co-convened a VGP session on Tuesday, titled “Data Science and Geochemistry“. It was a very popular session, with 38 abstracts. It is very encouraging to see divisions other than ESSI have Data Science sessions. This session also highlighted the work of many of TWC’s collaborators from the DTDI project. Kathy Fontaine convened an e-lightning session on data policy. This new format was very successful in drawing a large crowd to the event and enabled a great discussion on the topic. The day ended with Fang’s talk, presenting our findings from the network analysis of samples from the Cerro Negro volcano.

Over the next two days, many of TWC’s collaborators presented, but no one from TWC presented until Friday. Friday, though, was the busiest day for all of us from TWC. Starting with Peter Fox’s talk in the morning, Mark Parsons, Ahmed Eleish, Kathy Fontaine and Brenda Thomson all presented their work during the day. Oh yeah…and I presented too! My poster on the creation of the “Global Earth Mineral Inventory” got good feedback. Last, but definitely not least, Peter represented the ESSI division during the AGU centennial plenary, where he talked about the future of Big Data and Artificial Intelligence in the Earth Sciences. The video of the entire plenary can be found here.

Overall, AGU18 was great. Besides the talks mentioned above, multiple productive meetings and potential collaborations emerged from meeting various scientists and talking to them about their work. It was an incredible learning experience for me and the other students (for whom this was their first AGU).

As for other posters and talks I found interesting: I tweeted a lot about them during AGU, and fortunately I did make a list of some of them.

Posted at 17:41

January 20

Bob DuCharme: Querying machine learning distributional semantics with SPARQL

Bringing together my two favorite kinds of semantics.

Posted at 14:57

January 11

Libby Miller: Balena’s wifi-connect – easy wifi for Raspberry Pis

When you move a Raspberry Pi between wifi networks and you want it to behave like an appliance, one way to set the wifi network easily as a user rather than a developer is to have it create an access point itself that you can connect to with a phone or laptop, enter the wifi information in a browser, and then reconnect to the proper network. Balena have a video explaining the idea.

Andrew Nicolaou has written things to do this periodically as part of Radiodan. His most recent suggestion was to try Resin (now Balena)’s wifi-connect. Since Andrew last tried, there’s a bash script from Balena to install it as well as a Docker file, so it’s super easy with just a few tiny pieces missing. This is what I did to get it working:

Provision an SD card with Stretch e.g. using Etcher or manually

Enable ssh e.g. by

touch /Volumes/boot/ssh

Share your network with the pi via ethernet, ssh in and enable wifi by setting your country:

sudo raspi-config

then Localisation Options -> Set wifi country.

Install wifi-connect

bash <(curl -L https://github.com/balena-io/wifi-connect/raw/master/scripts/raspbian-install.sh)

Add a slightly-edited version of their bash script

curl https://gist.githubusercontent.com/libbymiller/e8fe6821e122e0a0ac921c8e557320a9/raw/46138fb4d28b494728e66515e46bd7d736b19132/start.sh > /home/pi/start-wifi-connect.sh

Add a systemd script to start it on boot.

sudo nano /lib/systemd/system/wifi-connect-start.service

-> contents:

[Unit]
Description=Balena wifi connect service
After=NetworkManager.service

[Service]
Type=idle
ExecStart=/home/pi/start-wifi-connect.sh
Restart=on-failure
StandardOutput=syslog
SyslogIdentifier=wifi-connect
User=root

[Install]
WantedBy=multi-user.target

Enable the systemd service

sudo systemctl enable wifi-connect-start.service

Reboot the pi

sudo reboot

A wifi network should come up called “Wifi Connect”. Connect to it, add in your details into the captive portal, and wait. The portal will go away and then you should be able to ping your pi over the wifi:

ping raspberrypi.local

(You might need to disconnect your ethernet from the Pi before connecting to the Wifi Connect network if you were sharing network that way).

Posted at 17:21

December 27

Egon Willighagen: Creating nanopublications with Groovy

Compound found in Taphrorychus bicolor (doi:10.1002/JLAC.199619961005). Published in Liebigs Annalen; see this post about the history of that journal.
Yesterday I struggled some with creating nanopublications with Groovy. My first attempt was an utter failure, but then I discovered Thomas Kuhn's NanopubCreator and it was downhill from there.

There are two good things about this. First, I now have a code base that I can easily repurpose to make trusty nanopublications (doi:10.1007/978-3-319-07443-6_63) about anything structured as a table (so can you).

Second, I now have almost 1200 CCZero nanopublications that tell you in which species a certain metabolite has been found. Sourced from Wikidata, using their SPARQL endpoint. This collection is a bit boring at this moment, and most of them are human metabolites, where the source is either Recon 2.2 or WikiPathways. But I expect (hope) to see more DOIs show up. Think We challenge you to reuse Additional Files.
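A query along these lines can be run directly against the Wikidata SPARQL endpoint. A minimal sketch, assuming the "found in taxon" property (P703); this is not necessarily the exact query behind the nanopublications:

# Ten compounds with a "found in taxon" statement, plus the taxon, from Wikidata.
# The wdt:, wikibase: and bd: prefixes are predefined by the endpoint.
curl -G "https://query.wikidata.org/sparql" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=
    SELECT ?compound ?compoundLabel ?taxon ?taxonLabel WHERE {
      ?compound wdt:P703 ?taxon .
      SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' . }
    } LIMIT 10"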

Finally, you are probably interested in learning what one of the created nanopublications looks like, so I put a Gist online:


Posted at 06:59

December 23

Bob DuCharme: Playing with wdtaxonomy

Those queries from my last blog entry? Never mind!
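For readers who haven't come across it: wdtaxonomy is the command-line tool from the wikidata-taxonomy npm package, and the basic usage is roughly this (the Q identifier below is just an illustrative example):

# Install the tool, then print the subclass taxonomy below a Wikidata item
# (Q11173, "chemical compound") as an indented tree.
npm install -g wikidata-taxonomy
wdtaxonomy Q11173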

Posted at 14:51

December 17

Egon Willighagen: From the "Annalen der Pharmacie" to the "European Journal of Organic Chemistry"

2D structure of caffeine, also known as theine.
One of my hobbies is the history of chemistry. It has a practical use for my current research, as a lot of knowledge about human metabolites is actually quite ancient. One thing I have trouble understanding is that, in a time where Facebook knows you better than your spouse does, we have trouble finding relevant literature without expensive, expert databases that are not generally available.

Hell, even the article that established that some metabolite is actually a human metabolite cannot be found within a reasonable time (less than a minute).

This is one of the reasons I started working on Scholia, and the chemistry corner of it specifically. See this ICCS conference poster. The poster outlines some of the reasons why I like it, but one is this link between chemical structures and literature, here for caffeine:


You can see the problem with our chemical knowledge here (in Wikidata): before 1950 it's pretty blank. Hence my question on Twitter about which journal to look at. A few suggestions came back, and I decided to focus on the journal that is now called the European Journal of Organic Chemistry but that started in 1832 as the Annalen der Pharmacie. I remember the EurJOC being launched by the KNCV and many other European chemistry societies.

BTW, note here that all these chemistry societies decided it was better to team up with a commercial publisher than to continue publishing it themselves. #Plan_S

Anyway, the full history is not complete, but the route from Annalen to EurJOC now is (each journal name has a different color):


That took me an hour or two, because CrossRef gives the EurJOC journal name for all articles. Technically perhaps correct, but metadata-wise the above is much better. Thanks to whoever actually created Wikidata items for each journal and linked them with follows and followed by.

In doing so, you quickly run into many more metadata issues. The best one I found was a paper by Crasts and Friedel, known for the Friedel-Crafts reaction :) Other gems are researcher names like Erlenmeyer-Heidelberg and Demselben and Von Demselben.

Back to caffeine, an active chemical in coffee, a chemical many of us must have in the morning: it is actually the same as theine. Tea drinkers also get their dose of caffeine. We all know that. What I did not know, but discovered while doing this work, is that it had already been established that :caffeine owl:sameAs :theine (doi:10.1002/jlac.18380250106). Cool!

Posted at 08:15

December 13

AKSW Group - University of Leipzig: SANSA 0.5 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.5 – the fifth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad format
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify and Ontop
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • A data lake concept for querying heterogeneous data sources has been integrated into SANSA
  • New clustering algorithms have been added and the interface for clustering has been unified
  • Ontop RDB2RDF engine support has been added
  • RDF data quality assessment methods have been substantially improved
  • Dataset statistics calculation has been substantially improved
  • Improved unit test coverage

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects HOBBIT, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin and Simple-ML.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

Posted at 08:25

December 01

Libby Miller: Cat detector with Tensorflow on a Raspberry Pi 3B+

Download Raspbian Stretch with Desktop

Burn a card with Etcher.

(Assuming a Mac) Enable ssh

touch /Volumes/boot/ssh

Put a wifi password in

nano /Volumes/boot/wpa_supplicant.conf
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="foo"
  psk="bar"
}

Connect the Pi camera, attach a dial to GPIO pin 12 and ground, boot up the Pi, ssh in, then

sudo apt-get update
sudo apt-get upgrade
sudo raspi-config # and enable camera; reboot

install tensorflow

sudo apt install python3-dev python3-pip
sudo apt install libatlas-base-dev
pip3 install --user --upgrade tensorflow

Test it

python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

get imagenet

git clone https://github.com/tensorflow/models.git
cd ~/models/tutorials/image/imagenet
python3 classify_image.py

install openCV

pip3 install opencv-python
sudo apt-get install libjasper-dev
sudo apt-get install libqtgui4
sudo apt install libqt4-test
python3 -c 'import cv2; print(cv2.__version__)'

install the pieces for talking to the camera

cd ~/models/tutorials/image/imagenet
pip3 install imutils picamera
mkdir results

download an edited version of classify_image

curl -O https://gist.githubusercontent.com/libbymiller/d542d596566774a35752d134f80b1332/raw/471f066e4dc498501bab7731a07fa0c1926c1575/classify_image_dial.py

Run it, and point at a cat

python3 classify_image_dial.py

Posted at 20:48

November 30

Dublin Core Metadata Initiative: DCMI 2018 Conference Proceedings Published

We are pleased to announce that the full conference proceedings for the DCMI Annual Conference 2018 have been published. Thanks to all who contributed to the programme!

Posted at 00:00

Dublin Core Metadata Initiative: DCMI to Maintain the Bibliographic Ontology (BIBO)

The Dublin Core Metadata Initiative is pleased to announce that it has accepted responsibility for maintaining the Bibliographic Ontology (BIBO). BIBO is a long-established ontology, is very stable and is widely used in bibliographic linked-data. We are also very pleased to announce that Bruce D'Arcus has joined the DCMI Usage Board, and will act as the point of contact for issues relating to BIBO. Bruce had this to say about the decision:

Posted at 00:00

Dublin Core Metadata Initiative: DC-2019 Announced - Hold the date!

We are delighted to announce that the DCMI Annual Conference in 2019 will be hosted by the Korean National Library in Seoul. The conference dates are confirmed: 23rd - 26th September, 2019 (23rd - 25th September, 2019 for the conference-proper, with an extra day (26th) reserved for workshops or similar activities). We are also pleased to announce the following appointments to the conference planning team: Conference Chair: Dr. Sam Oh Co-chairs of the Content and Presentations planning committee: Dr Marcia Zeng and Dr.

Posted at 00:00

November 29

Gregory Williams: Thoughts on HDT

I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of an RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.
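For context, the typical workflow with the existing tooling looks roughly like this (a sketch based on the command-line tools shipped with the hdt-cpp and hdt-java distributions; exact tool names and options vary between versions):

# Compress an N-Triples dump into a single HDT file...
rdf2hdt dataset.nt dataset.hdt

# ...then query it interactively with triple patterns ("? ? ?" matches everything).
hdtSearch dataset.hdt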

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

  • The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):

    Here are some of the issues I found during implementation:

    • DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose

    • FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)

    • The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits (see the decoding sketch after this list)

    • “Log64” referenced in BitmapTriples does not seem to be defined anywhere

    • There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup

  • The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding

  • There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible

  • The secondary index format seems to be entirely undocumented
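
To make the VByte point above concrete, here is a generic sketch of a decoder that reads value data from the seven least significant bits of each byte. The terminator convention used here (high bit set on the final byte) is an assumption for illustration and may not match HDT’s exact encoding.

# Generic VByte decoder sketch: data lives in the seven LEAST significant bits
# of each byte. The terminator convention (high bit set on the final byte) is
# assumed for illustration, not confirmed HDT behaviour.
def vbyte_decode(buf, offset=0):
    value, shift = 0, 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift   # seven low-order bits carry data
        shift += 7
        if byte & 0x80:                   # assumed end-of-value flag
            return value, offset

print(vbyte_decode(bytes([0x01, 0x81])))  # -> (129, 2) under this convention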

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

  • The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing

  • Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object; see the sketch after this list)

  • The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit

  • A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)

  • The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging
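
To illustrate the point about distinct ID spaces above, here is a minimal sketch (not HDT’s actual API or data layout, just an illustration) of why lookup needs the triple position as well as the numeric ID.

# Minimal illustration of position-dependent ID spaces (not HDT's real layout)
PRED, SO = "predicate", "subject/object"

class PositionalDictionary:
    def __init__(self, so_terms, pred_terms):
        # two independent, 1-based ID spaces rather than one global dictionary
        self.sections = {SO: so_terms, PRED: pred_terms}

    def lookup(self, term_id, position):
        # dict[id, pos] -> term rather than dict[id] -> term
        return self.sections[position][term_id - 1]

terms = ["http://example.org/knows"]
d = PositionalDictionary(so_terms=terms, pred_terms=terms)
# the same term is reachable under two different (id, position) pairs
assert d.lookup(1, SO) == d.lookup(1, PRED)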

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TriG files.

Posted at 16:58

November 26

Dublin Core Metadata Initiative: Webinar: SKOS - visión general y modelado de vocabularios controlados

This webinar is scheduled for Wednesday, November 28, 2018, 15:00 UTC (convert this time to your local timezone here) and is free for DCMI members. SKOS (Simple Knowledge Organization Systems) is the W3C recommendation for representing and publishing datasets of classifications, thesauri, subject headings, glossaries and other kinds of controlled vocabularies and knowledge organization systems. The first part of the webinar gives an overview of semantic web technologies and walks through the different elements of the SKOS model in detail.

Posted at 00:00

November 25

Leigh Dodds: UnINSPIREd: problems accessing local government geospatial data

This weekend I started a side project which I plan to spend some time on this winter. The goal is to create a web interface that will let people explore geospatial datasets published by the three local authorities that make up the West of England Combined Authority: Bristol City Council, South Gloucestershire Council and Bath & North East Somerset Council.

Through Bath: Hacked we’ve already worked with the council to publish a lot of geospatial data. We’ve also run community mapping events and created online tools to explore geospatial datasets. But we don’t have a single web interface that makes it easy for anyone to explore that data and perhaps mix it with new data that they have collected.

Rather than build something new, which would be fun but time consuming, I’ve decided to try out TerriaJS. It’s an open source, web-based mapping tool that is already being used to publish the Australian National Map, so it should handle the West of England quite comfortably. It’s got a great set of features and can connect to existing data catalogues and endpoints. It seems to be perfect for my needs.

I decided to start by configuring the datasets that are already in the Bath: Hacked Datastore, the Bristol Open Data portal, and data.gov.uk. Every council also has to publish some data via standard APIs as part of the INSPIRE regulations, so I hoped to be able to quickly bring together a list of existing datasets without having to download and manage them myself.
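
For context, wiring an existing WMS endpoint into TerriaJS mostly means adding a catalog entry to an init file. The sketch below generates one such entry with Python; the catalog field names ("type": "wms", "url", "layers"), the endpoint URL, the layer name and the file name are all placeholders or assumptions for illustration, not a working configuration.

# Sketch of generating a TerriaJS-style init file entry (field names assumed)
import json

init = {
    "catalog": [
        {
            "name": "Example council INSPIRE layer",   # display name in the catalogue
            "type": "wms",                              # a Web Map Service catalog item
            "url": "https://example-council.example.org/inspire/wms",  # placeholder endpoint
            "layers": "example_layer_name",             # placeholder WMS layer name
        }
    ]
}

with open("wofe-datasets-init.json", "w") as f:         # hypothetical init file name
    json.dump(init, f, indent=2)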

Unfortunately this hasn’t proved as easy as I’d hoped. Based on what we’ve learned so far about the state of geospatial data infrastructure in our project at the ODI, I had reasonably low expectations. But there’s nothing like some practical experience to really drive things home.

Here’s a few of the challenges and issues I’ve encountered so far.

  • The three councils are publishing different sets of data. Why is that?
  • The dataset licensing isn’t open and looks to be inconsistent across the three councils. When is something covered by INSPIRE rather than the PSMA end user agreement?
  • The new data.gov.uk “filter by publisher” option doesn’t return all datasets for the specified publisher. I’ve reported this as a bug; in the meantime I’ve fallen back on searching by name
  • The metadata for the datasets is pretty poor, and there is little supporting documentation. I’m not sure what some of the datasets are intended to represent. What are “core strategy areas“?
  • The INSPIRE service endpoints do include metadata that isn’t exposed via data.gov.uk. For example, this South Gloucestershire dataset includes contact details, data on geospatial extents, and format information which isn’t otherwise available. It would be nice to be able to see this and not have to read the XML
  • None of the metadata appears to tell me when the dataset was last updated. The last modified date on data.gov.uk is (I think) the date the catalogue entry was last updated. Are the Section 106 agreements listed in this dataset from 2010, or are they regularly updated? How can I tell?
  • Bath is using GetMapping to host its INSPIRE datasets. Working through them on data.gov.uk I found that 46 out of the 48 datasets I reviewed have broken endpoints. I’m reasonably certain these used to work. I’ve reported the issue to the council.
  • The two datasets that do work in Bath cannot be used in TerriaJS. I managed to work around the fact that they require a username and password to access but have hit a wall because the GetMapping APIs only seem to support EPSG:27700 (British National Grid) and not EPSG:3857 as used by online mapping tools. So the APIs refuse to serve the data in a way that can be used by the framework. The Bristol and South Gloucestershire endpoints handle this fine. I assume this is either a limitation of the GetMapping service or a misconfiguration. I’ve asked for help.
  • A single Web Mapping Service can expose multiple datasets as individual layers. But apart from Bristol, both Bath and South Gloucestershire are publishing each dataset through its own API endpoint. I hope the services they’re using aren’t charging per endpoint, as the extra endpoints are probably unnecessary. Bristol has chosen to publish a couple of APIs that bring together several datasets, but these are also available individually through separate APIs.
  • The same datasets are repeated across data catalogues and endpoints. Bristol has its data listed as individual datasets in its own platform, listed as individual datasets in data.gov.uk, and also exposed via two different collections which bundle some (or all?) of them together. I’m unclear on the overlap or whether there are differences between them in terms of scope, timeliness, etc. The licensing is also different. When I explored the three different datasets that describe allotments in Bristol, only one actually displayed any data in TerriaJS, and I don’t know why
  • The South Gloucestershire web mapping services all worked seamlessly, but I noticed that if I wanted to download the data, then I would need to jump through hoops to register to access it. Obviously not ideal if I do want to work with the data locally. This isn’t required by the other councils. I assume this is a feature of MisoPortal
  • The South Gloucestershire datasets don’t seem to include any useful attributes for the features represented in the data. When you click on the points, lines and polygons in TerriaJS, no additional information is displayed. I don’t know yet whether this data just isn’t included in the dataset, or if it’s a bug in the API or in how TerriaJS is requesting it. I’d need to download or explore the data in some other way to find out. However, the data that is available from Bath and Bristol also has inconsistencies in how it’s described, so I suspect there aren’t any agreed standards
  • Neither the GetMapping nor the MisoPortal APIs support CORS. This means you can’t access the data from Javascript running directly in the browser, which is what TerriaJS does by default. I’ve had to configure those to be accessed via a proxy (a minimal example follows this list). “Web mapping services” should work on the web.
  • While TerriaJS doesn’t have a plugin for OpenDataSoft (which powers the Bristol Open Data platform), I found that OpenDataSoft do provide a Web Feature Service interface, so I was able to configure TerriaJS to access that instead. Unfortunately I then found that there’s either a bug in the platform or a problem with the data, because most of the points were in the Indian Ocean
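
As mentioned in the CORS point above, the missing Access-Control-Allow-Origin header can be worked around by routing requests through a small proxy. This is a minimal, hypothetical sketch using Flask and requests, with a placeholder upstream URL; it is not the actual proxy configuration used here, just one way the technique can work.

# Minimal CORS proxy sketch (hypothetical; not the actual setup).
# It forwards requests to an upstream WMS endpoint and adds the CORS header
# the upstream service is missing, so a browser-based client can call it.
from flask import Flask, request, Response
import requests

app = Flask(__name__)
UPSTREAM = "https://example-wms.example.org"   # placeholder upstream endpoint

@app.route("/proxy/<path:path>")
def proxy(path):
    upstream = requests.get(f"{UPSTREAM}/{path}", params=request.args, timeout=30)
    resp = Response(upstream.content, status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type"))
    resp.headers["Access-Control-Allow-Origin"] = "*"   # the header CORS requires
    return resp

if __name__ == "__main__":
    app.run(port=8080)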

The goal of the INSPIRE legislation was to provide a common geospatial data infrastructure across Europe. What I’m trying to do here should be relatively quick and easy. Looking at this graph of INSPIRE conformance for the UK, everything looks rosy.

But, based on an admittedly small sample of only three local authorities, the reality seems to be that:

  • services are inconsistently implemented and have not been designed to be used as part of native web applications and mapping frameworks
  • metadata quality is poor
  • there is inconsistent detail about features which makes it hard to aggregate, use and compare data across different areas
  • it’s hard to tell the provenance of data because of duplicated copies across catalogues and endpoints. Without modification or provenance information, it’s unclear whether data is up to date
  • licensing is unclear
  • links to service endpoints are broken. At best, this leads to wasted time from data users. At worst, there’s public money being spent on publishing services that no-one can access

It’s important that we find ways to resolve these problems. As this recent survey by the ODI highlights, SMEs, startups and local community groups all need to be able to use this data. Local government needs more support to help strengthen our geospatial data infrastructure.

Posted at 18:07

November 21

Dublin Core Metadata Initiative: Webinar: Lightweight Methodology and Tools for Developing Ontology Networks

This webinar is scheduled for Thursday, November 29, 2018, 15:00 UTC (convert this time to your local timezone here) and is free for DCMI members. The increasing uptake of semantic technologies is giving ontologies the opportunity to be treated as first-class citizens within software development projects. Together with the deserved visibility and attention that ontologies are getting comes the responsibility for ontology development teams to combine their activities with software development practices as seamlessly as possible.

Posted at 00:00

Copyright of the postings is owned by the original blog authors. Contact us.