Planet RDF

It's triples all the way down

April 19

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack thereof.

A recent discussion with a customer prompted me to take a closer look at the support for encryption in XaaS cloud service offerings as well as in Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather hard to find.

There are different reasons why one might want to encrypt data, ranging from preserving a competitive advantage to end-user privacy issues. No matter the motivation, the question is: do systems support this (transparently), or are developers forced to code it in the application logic?

At the IaaS level, especially for file storage used in app development, one would expect wide support for built-in encryption, but this is largely not the case.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encrypting the data (unless you count S3), and for Google's App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level paint an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive seem to not offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support, it also explains it nicely.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like support encryption, such as eCryptfs or Gazzang's offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS's EMR when using S3.
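
For the cases where neither the platform nor the storage service encrypts data transparently, the fallback is the application-logic route mentioned above: encrypt on the client, store only ciphertext. A minimal sketch of that pattern in Python, assuming the cryptography and boto3 libraries and a hypothetical bucket name (neither is tied to any of the offerings discussed above):

# Client-side encryption before upload: a sketch, not a hardened solution.
# Key management (rotation, storage in a KMS/HSM) is deliberately left out.
from cryptography.fernet import Fernet
import boto3

key = Fernet.generate_key()   # keep this key outside the cloud provider
fernet = Fernet(key)

with open("report.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

s3 = boto3.client("s3")
s3.put_object(Bucket="my-encrypted-bucket", Key="report.csv.enc", Body=ciphertext)

# Later: fetch the object and decrypt it locally.
obj = s3.get_object(Bucket="my-encrypted-bucket", Key="report.csv.enc")
plaintext = fernet.decrypt(obj["Body"].read())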


Posted at 15:06

Benjamin Nowack: Microdata, semantic markup for both RDFers and non-RDFers

There's been a whole lot of discussion around Microdata, a new approach for embedding machine-readable information into the forthcoming HTML5. What I find most attractive about Microdata is the fact that it was designed by HTMLers, not RDFers. It's refreshingly pragmatic, free of other RDF spec legacy, but still capable of expressing most of RDF.

Unfortunately, RDFa lobbyists on the HTML WG mailing list forced the spec out of HTML5 core for the time being. This manoeuvre was understandable (a lot of energy went into RDFa, after all), but in my opinion very short-sighted. How many uphill battles did we have, trying to get RDF to the broader developer community? And how many were successful? Atom, microformats, OpenID, Portable Contacts, XRDS, Activity Streams (well, not really): these are examples where RDFers tried, but failed, to promote some of their infrastructure into the respective solutions. Now: HTML5, where the initial RDF lobbying actually had an effect and led to a native mechanism for RDF-in-HTML. Yes, native, not in some separate spec. This would have become part of every HTML5 book; any HTML developer on this planet would have learned about it. Finally a battle won. And what a great one. HTML.

But no, Microdata wasn't developed by an RDF group, so they voted it out again. Now, the really sad thing is that there could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers. The RDFa group recently realized that RDFa needs to be revised anyway; there is going to be an RDFa 1.1, which will require new parsers. If they'd swallowed their pride, they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata.

Here is a short overview of RDF features supported by Microdata:
  • Explicit resource containers, via @itemscope (in RDFa, the boundaries of a resource are often implicitly defined by @rel or @typeof)
  • Subject declaration, via @itemid (RDFa uses @about)
  • Main subject typing, via @itemtype (RDFa uses @typeof)
  • Predicate declaration, via @itemprop (RDFa uses @property, @rel, and @rev)
  • Literal objects, via node values (RDFa also allows hidden values via @content)
  • Non-literal objects, via @href, @src, etc. (RDFa also allows hidden values via @resource)
  • Object language, via @lang
  • Blank nodes
I won't go into details why hiding semantics in RDFa will be penalized by search engines as soon as spammers discover the possibilities, why reusing RDF/XML's attribute names was probably not a smart move with regard to attracting non-RDFers, why the new @vocab idea is impractical, or why namespace prefixes, as handy as they are in other RDF formats, are not too helpful in an HTML context. Let's simply state that there is a trade-off between extended features (RDFa) and simplicity (Microdata). So, what are the core features that an RDFer would really need beyond Microdata:
  • the possibility to preserve markup, but probably not necessarily as an explicit rdf:XMLLiteral
  • datatypes for literal objects (I personally never used them in practice in the last 6 years that I've been developing RDF apps, but I can see some use cases)
Markup preservation is currently turned on by default in RDFa and can be disabled through @datatype, so an RDFer-satisfying RDFa 1.1 spec could probably just be Microdata + @datatype + a few extended parsing rules to end up with the intended RDF. My experience with watching RDF spec creation tells me that the RDFa group won't pick this route (there simply is no "Kill a Feature" mentality in the RDF community), but hey, hope dies last.

I've been using Microdata in two of my recent RDF apps and the CMS module of (ahem, still not documented) Trice, and it's been a great experience. ARC is going to get a "microRDF" extractor that supports the RDF-in-Microdata markup below (Note: this output still requires a 2nd extraction process, as the current Microdata draft's RDF mechanism only produces intermediate RDF triples, which then still have to be post-processed. I hope my related suggestion will become official, but I seem to be the only pro-Microdata RDFer on the HTML list right now, so it may just stay as a convention):

Microdata:
<div itemscope itemtype="http://xmlns.com/foaf/0.1/Person">

  <!-- plain props are mapped to the itemtype's context -->
  <img itemprop="img" src="mypic.jpg" alt="a pic of me" />
  My name is <span itemprop="name"><span itemprop="nick">Alec</span> Tronnick</span>
  and I blog at <a itemprop="weblog" href="http://alec-tronni.ck/">alec-tronni.ck</a>.

  <!-- other RDF vocabs can be used via full itemprop URIs -->
  <span itemprop="http://purl.org/vocab/bio/0.1/olb">
    I'm a crash test dummy for semantic HTML.
  </span>
</div>
Extracted RDF:
@base <http://host/path/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
_:bn1 a foaf:Person ;
      foaf:img <mypic.jpg> ;
      foaf:name "Alec Tronnick" ;
      foaf:nick "Alec" ;
      foaf:weblog <http://alec-tronni.ck/> ;
      bio:olb "I'm a crash test dummy for semantic HTML." .
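
As a small illustration of why that second extraction step is worth the trouble: once the intermediate output has been post-processed into the Turtle above, any RDF toolkit can work with it directly. A sketch in Python with rdflib (chosen purely for illustration; ARC plays the same role in PHP):

# Parse the extracted Turtle and query it like any other RDF data.
from rdflib import Graph

turtle = """
@base <http://host/path/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
_:bn1 a foaf:Person ;
      foaf:img <mypic.jpg> ;
      foaf:name "Alec Tronnick" ;
      foaf:nick "Alec" ;
      foaf:weblog <http://alec-tronni.ck/> ;
      bio:olb "I'm a crash test dummy for semantic HTML." .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?blog WHERE {
  ?person a foaf:Person ;
          foaf:name ?name ;
          foaf:weblog ?blog .
}
"""
for name, blog in g.query(query):
    print(name, blog)   # Alec Tronnick http://alec-tronni.ck/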

Posted at 15:06

Benjamin Nowack: Could having two RDF-in-HTMLs actually be handy?

Apart from grumpy rants about the complexity of W3C's RDF specs and semantic richtext editing excitement, I haven't blogged or tweeted a lot recently. That's partly because there finally is increased demand for the stuff I'm doing at semsol (agency-style SemWeb development), but also because I've been working hard on getting my tools into a state where they feel more like typical Web frameworks and apps. Talis' FanHubz is an example where (I think) we found a good balance between powerful RDF capabilities (data re-purposing, remote models, data augmentation, a crazy army of inference bots) and a non-technical UI (simplistic visual browser, Twitter-based annotation interfaces).

Another example is something I've been working on during the last few months: I somehow managed to combine essential parts of Paggr (a drag&drop portal system based on RDF- and SPARQL-based widgets) with an RDF CMS (I'm currently looking for pilot projects). And although I decided to switch entirely to Microdata for semantic markup after exploring it during the FanHubz project, I wonder if there might be room for having two separate semantic layers in this sort of widget-based website. Here is why:

As mentioned, I've taken a widget-like approach for the CMS. Each page section is a resource on its own that can be defined and extended by the web developer, styled by themers, and re-arranged and configured by the webmaster. In the RDF CMS context, widgets can easily integrate remote data, and when the integrated information is exposed as machine-readable data in the front-end, we can get beyond the "just-visual" integration of current widget pages and bring truly connectable and reusable information to the user interface.

Ideally, both the widgets' structural data and the content can be re-purposed by other apps. Just like in the early days of the Web, we could re-introduce a copy & paste culture of things for people to include in their own sites. With the difference that RDF simplifies copy-by-reference and source attribution. And both developers and end-users could be part of the game this time.

Anyway, one technical issue I encountered is when you have a page that contains multiple page items, but describes a single resource. With a single markup layer (say Microdata), you get a single tree where the context of the hierarchy is constantly switching between structural elements and content items (page structure -> main content -> page layout -> widget structure -> widget content). If you want to describe a single resource, you have to repeatedly re-introduce the triple subject ("this is about the page structure", "this is about the main page topic"). The first screenshot below shows the different (grey) widget areas in the editing view of the CMS. In the second screenshot, you can see that the displayed information (the marked calendar date, the flyer image, and the description) in the main area and the sidebar is about a single resource (an event).

Trice CMS editing view

Trice CMS page view with inline widgets describing one resource

If I used two separate semantic layers, e.g. RDFa for the content (the event description) and Microdata for the structural elements (column widths, widget template URIs, widget instance URIs), I could describe the resource and the structure without repeating the event subject in each page item.

To be honest, I'm not sure yet if this is really a problem, but I thought writing it down could kick off some thought processes (which now tend towards "No"). Keeping triples as stand-alone-ish as possible may actually be an advantage (even if subject URIs have to be repeated). No semantic markup solution so far provides full containment for reliable copy & paste, but explicit subjects (or "itemid"s in Microdata-speak) could bring us a little closer.

Conclusions? Err.., none yet. But hey, did you see the cool CMS screenshots?

Posted at 08:06

April 17

Benjamin Nowack: Moving forward back to Self-Employment

My time at Talis Systems officially ended last week. I joined the team during painful times, but I'm glad (and proud) to have been a Talisian at least for one year. I have had a few freelance gigs with Talis before, but being part of the team was a whole different thing. And I could frequently travel to the UK, immprooff my inklish, and discover the nice city of Birmingham. There's a reason why they have that G in GB.

Work-wise, I probably learned more in the last 12 months than during the previous 5 years combined - hat tip to Julian, Leigh and all the other (Ex-)Talis folks. And much of that goes beyond just technical skills. I don't want to bore you, but you can definitely learn a lot about your path through life when you get the opportunity to look at it from a different perspective. Apparently, I first had to become an employee working in a foreign city to see the bigger picture around why I boarded that Semantic Web roller coaster in the first place and where it overlaps with my own ideas and interests.

So I am going back to self-employment. And I am also going to stay in the emerging Data Web market. But I'll approach some things differently this time.

First, change of attitude. To contribute in a personally more healthy way again. I won't argue about technical details and specifications any more. That just turns me into a grumpy person (belated apologies). I doubt that promoting products by advertising their underlying technologies is the best way for establishing and growing a market anyway. That's like trying to heat a room by just burning a lot of matches. Promising, with renewed anticipation after each match, but useless without some larger fire in the end. I would like to help spark off these larger fires. Without constantly burning my fingers (OK, enough fire imagery ;-).

The second change is related, and it is about focus. While I still see many people using the ARC2 toolkit, I have had more encouraging feedback and signs of demand recently around my work for end users (including app developers, in a sense). So my new mission is to improve "information interaction" on the Web, and I'll be offering services in that area.

And it looks like I'm off to a good start. I am already fully booked for the next months.

Posted at 00:06

Benjamin Nowack: Dynamic Semantic Publishing for any Blog (Part 1)

"Dynamic Semantic Publishing" is a new technical term which was introduced by the BBC's online team a few weeks ago. It describes the idea of utilizing Linked Data technology to automate the aggregation and publication of interrelated content objects. The BBC's World Cup website was the first large mainstream website to use this method. It provides hundreds of automatically generated, topically composed pages for individual football entities (players, teams, groups) and related articles.

Now, the added value of such linked "entity hubs" would clearly be very interesting for other websites and blogs as well. They are multi-dimensional entry points to a site and provide a much better and more user-engaging way to explore content than the usual flat archive pages, which normally don't have dimensions beyond date, tag, and author. Additionally, HTML aggregations with embedded Linked Data identifiers can improve search engine rankings and enable semantic ad placement, both attractive by-products.

Entity hub examples

The architecture used by the BBC is optimized for their internal publishing workflow and thus not necessarily suited for small and medium-scale media outlets. So I've started thinking about a lightweight version of the BBC infrastructure, one that would integrate more easily with typical web server environments and widespread blog engines.

What could a generalized approach to dynamic semantic publishing look like?

We should assume setups where direct access to a blog's database tables is not available. Working with already published posts requires a template detector and custom parsers, but it lowers the entry barrier for blog owners significantly. And content importers can be reused to a large extent when sites are based on standard blog engines such as WordPress or Movable Type.

The graphic below (large version) illustrates a possible, generalized approach to dynamic semantic publishing.
Dynamic Semantic Publishing

Process explanation:
  • Step 1 : A blog-specific crawling agent indexes articles linked from central archives pages. The index is stored as RDF, which enables the easy expansion of post URLs to richly annotated content objects.
  • Step 2 : Not-yet-imported posts from the generated blog index are parsed into core structural elements such as title, author, date of publication, main content, comments, Tweet counters, Facebook Likes, and so on. The semi-structured post information is added to the triple store for later processing by other agents and scripts. Again, we need site (or blog engine)-specific code to extract the various possible structures. This step could be accelerated by using an interactive extractor builder, though.
  • Step 3 : Post contents are passed to APIs like OpenCalais or Zemanta in order to extract stable and re-usable entity identifiers. The resulting data is added to the RDF Store.
  • After the initial semantification in step 3, a generic RDF data browser can be used to explore the extracted information. This simplifies general consistency checks and the identification of the site-specific ontology (concepts and how they are related). Alternatively, this could be done (in a less comfortable way) via the RDF store's SPARQL API.
  • Step 4 : Once we have a general idea of the target schema (entity types and their relations), custom SPARQL agents process the data and populate the ontology. They can optionally access and utilize public data.
  • After step 4, the rich resulting graph data allows the creation of context-aware widgets. These widgets ("Related articles", "Authors for this topic", "Product experts", "Top commenters", "Related technologies", etc.) can now be used to build user-facing applications and tools.
  • Use case 1 : Entity hubs for things like authors, products, people, organizations, commenters, or other domain-specific concepts.
  • Use case 2 : Improving the source blog. The typical "Related articles" sections in standard blog engines, for example, don't take social data such as Facebook Likes or re-tweets into account. Often, they are just based on explicitly defined tags. With the enhanced blog data, we can generate aggregations driven by rich semantic criteria (see the query sketch after this list).
  • Use case 3 : Authoring extensions: After all, the automated entity extraction APIs are not perfect. With the site-wide ontology in place, we could provide content creators with convenient annotation tools to manually highlight some text and then associate the selection with a typed entity from the RDF store. Or they could add their own concepts to the ontology and share it with other authors. The manual annotations help increase the quality of the entity hubs and blog widgets.
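
To make use case 2 a little more concrete, here is a sketch of the kind of query such a "Related articles" widget could run against the enriched blog data. The vocabulary (ex:mentions, ex:likeCount), the post URI and the file name are hypothetical placeholders for whatever the site-specific ontology from step 4 ends up defining; rdflib is used just to have something runnable:

# Rank related posts by shared extracted entities, boosted by social signals.
# ex: is a placeholder namespace standing in for the site-specific ontology.
from rdflib import Graph

g = Graph()
g.parse("blog-index.ttl", format="turtle")   # enriched post data from steps 1-4

query = """
PREFIX ex: <http://example.org/blogdb#>
SELECT ?other (COUNT(?entity) AS ?shared) ?likes WHERE {
  <http://example.org/posts/123> ex:mentions ?entity .
  ?other ex:mentions ?entity ;
         ex:likeCount ?likes .
  FILTER(?other != <http://example.org/posts/123>)
}
GROUP BY ?other ?likes
ORDER BY DESC(?shared) DESC(?likes)
LIMIT 5
"""
for other, shared, likes in g.query(query):
    print(other, shared, likes)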

Does it work?

I explored this approach to dynamic semantic publishing with nearly nine thousand articles from ReadWriteWeb. In the next post, I'll describe a "Linked RWW" demo which combines Trice bots, ARC, Prospect, and the handy semantic APIs provided by OpenCalais and Zemanta.

Posted at 00:06

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags (a query sketch follows this list).
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.
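
A sketch of the SocialTags extraction mentioned above, again as an rdflib query over the stored response graph. The c: namespace is the one already used in the paramsXML earlier in this post; I'm matching the SocialTag type by name rather than hard-coding its full URI, and the c:name predicate is written from memory of the OpenCalais schema, so verify both against an actual response:

# Pull plain-text tags out of an OpenCalais RDF response.
from rdflib import Graph

g = Graph()
g.parse("calais-response.rdf", format="xml")   # RDF/XML returned by the API call above

query = """
PREFIX c: <http://s.opencalais.com/1/pred/>
SELECT DISTINCT ?label WHERE {
  ?tag a ?type ;
       c:name ?label .
  FILTER(REGEX(STR(?type), "SocialTag"))
}
"""
tags = sorted(str(row[0]) for row in g.query(query))
print(tags)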

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision ), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public; I'm not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.
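
The consolidation itself can start out very simply, for example as a hand-maintained mapping from the extractors' type names to a small internal ontology. Everything below is illustrative: the local names, the ex: terms and the example URI (which just follows the pattern of OpenCalais type URIs as I recall it):

# Map extractor-specific entity types onto a small internal ontology.
TYPE_MAP = {
    "Company": "ex:Organization",
    "Organization": "ex:Organization",
    "Person": "ex:Person",
    "Product": "ex:Product",
    "City": "ex:Place",
    "Country": "ex:Place",
}

def internal_type(type_uri):
    """Reduce a full type URI to its local name and look it up."""
    local = type_uri.rstrip("/#").split("/")[-1].split("#")[-1]
    return TYPE_MAP.get(local, "ex:Thing")   # fall back to a generic type

print(internal_type("http://s.opencalais.com/1/type/em/e/Company"))   # ex:Organization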

RWW posts via BlogDB

Posted at 00:06

April 16

Benjamin Nowack: I'm joining Talis!

KASABI data marketplace

I received a number of very interesting job offers when I began searching for something new last month, but there was one company that stood out, and that is Talis. Not only do I know many people there already, I also find Talis' new strategic focus and products very promising. In addition, they know and use some of my tools already, and I've successfully worked on Talis projects with Leigh and Keith before. The job interview almost felt like coming home (and the new office is just great).

So I'm very happy to say that I'm going to become part of the Kasabi data marketplace team in September where I'll help create and drupalise data management and data market tools.

BeeNode

I will have to get up to speed with a lot of new things, and the legal and travel costs overhead for Talis is significant, so I hope I can turn this into a smart investment for them as quickly as possible. I'll even rename my blog if necessary... ;-) For those wondering about the future of my other projects, I'll write about them in a separate post soon.

Can't wait to start!

Posted at 02:44

April 02

schema.org: Schema.org 3.5: Simpler extension model, projects, grants and funding schemas, and new terms for describing educational and occupational credentials

Schema.org version 3.5 has been released. This release moves a number of terms from the experimental "Pending" area into the Schema.org core. It also simplifies and clarifies the Schema.org extension model, reducing our emphasis on using named subdomains for topical groups of schemas. New terms introduced in the Pending area include improvements for describing projects, grants and funding agencies; for describing open-ended date ranges (e.g. for datasets); and a substantial vocabulary for Educational and Occupational Credentials. Many thanks to all who contributed!
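
As a rough illustration of the new project/grant terms, a funded research project might be marked up along these lines in JSON-LD (written here as a Python dict for consistency with the other snippets on this page). The term names (Grant, funder, fundedItem, ResearchProject) are given from memory of the release notes, so double-check them against the published schemas:

# JSON-LD sketch using the new projects/grants vocabulary (term names from memory; verify).
import json

grant = {
    "@context": "https://schema.org",
    "@type": "Grant",
    "name": "Example research grant",
    "funder": {"@type": "Organization", "name": "Example Funding Agency"},
    "fundedItem": {"@type": "ResearchProject", "name": "Example project"},
}

print(json.dumps(grant, indent=2))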

Posted at 16:19

March 27

Libby Miller: An i2c heat sensor with a Raspberry Pi camera

I had a bit of a struggle with this so thought it was worth documenting. The problem is this – the i2c bus on the Raspberry Pi is used by the official camera to initialise it. So if you want to use an i2c device at the same time as the camera, the device will stop working after a few minutes. Here’s more on this problem.

I really wanted to use this heat sensor with mynaturewatch to see if we could exclude some of the problems with false positives (trees waving in the breeze and similar). I’ve not got it working well enough yet to look at this problem in detail. But, I did get it working on the i2c bus alongside the camera – here’s how.


It’s pretty straightforward. You need to

  • Create a new i2c bus on some different GPIOs
  • Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one
  • Fin

1. Create a new i2c bus on some different GPIOs

This is super-easy:

sudo nano /boot/config.txt

Add the following line, preferably in the section where SPI and I2C are enabled.

dtoverlay=i2c-gpio,bus=3,i2c_gpio_delay_us=1

This line will create an additional i2c bus (bus 3) on GPIO 23 as SDA and GPIO 24 as SCL (GPIO 23 and 24 are the overlay's defaults).

2. Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one

I am using this sensor, for which I need this circuitpython library (more info), installed using:

pip3 install Adafruit_CircuitPython_AMG88xx

While the Pi is switched off, plug in the i2c device using GPIO 23 for SDA and GPIO 24 for SCL, then boot it up and check it’s working:

 i2cdetect -y 3

Make two changes:

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/pin.py

and change the SDA and SCL pins to the new pins

#SDA = Pin(2)
#SCL = Pin(3)
SDA = Pin(23)
SCL = Pin(24)
nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/generic_linux/i2c.py

Change line 21 or thereabouts to use the i2c bus 3 rather than the default, 1:

self._i2c_bus = smbus.SMBus(3)

3. Fin

Start up your camera code and your i2c peripheral. They should run happily together.
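
For reference, a minimal sketch of that combination: the standard picamera and Adafruit AMG88xx APIs, with the patched Blinka files from step 2 quietly routing the sensor traffic onto bus 3. File paths and timings are just placeholders:

# Camera and i2c heat sensor running side by side.
# Assumes the pin.py/i2c.py patches above, so board.SDA/SCL map to GPIO 23/24
# and the sensor uses the new bus 3 instead of the camera-owned bus.
import time
import board
import busio
import adafruit_amg88xx
import picamera

i2c = busio.I2C(board.SCL, board.SDA)
sensor = adafruit_amg88xx.AMG88XX(i2c)

with picamera.PiCamera() as camera:
    for n in range(10):
        camera.capture('/home/pi/frame-%02d.jpg' % n)
        # 8x8 grid of temperatures in degrees Celsius
        print([round(t, 1) for row in sensor.pixels for t in row])
        time.sleep(2)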


Posted at 20:37

March 24

Bob DuCharme: Changing my blog's domain name and platform

New look, new domain name.

Posted at 14:40

March 22

AKSW Group - University of Leipzig: LDK conference @ University of Leipzig

With the advent of digital technologies, an ever-increasing amount of language data is now available across various application areas and industry sectors, thus making language data more and more valuable. In that context, we are happy to invite you to join the 2nd Language, Data and Knowledge (LDK) conference which will be held in Leipzig from May 20th till 22nd, 2019.

This new biennial conference series aims at bringing together researchers from across disciplines concerned with language data in data science and knowledge-based applications.

The acquisition, provenance, representation, maintenance, usability and quality of language data, as well as its legal, organizational and infrastructure aspects, are at the centre of research revolving around language data and thus constitute the focus of the conference.

To register and be part of the LDK conference and its associated events, please go to http://2019.ldk-conf.org/registration/.

Keynote Speakers

  • Keynote #1: Christian Bizer, Mannheim University
  • Keynote #2: Christiane Fellbaum, Princeton University
  • Keynote #3: Eduard Werner, Leipzig University

Associated Events

The following events are co-located with LDK 2019:

Workshops on the 20th May 2019

DBpedia Community Meeting on the 23rd May 2019

Looking forward to meeting you at the conference!

Posted at 08:21

March 16

Leigh Dodds: Thinking about the governance of data

I find “governance” to be a tricky word. Particularly when we’re talking about the governance of data.

For example, I’ve experienced conversations with people from a public policy background and people with a background in data management, where it’s clear that there are different perspectives. From a policy perspective, governance of data could be described as the work that governments do to enforce, encourage or enable an environment where data works for everyone. That is slightly different from the work that organisations do in order to ensure that data is treated as an asset, which is how I tend to think about organisational data governance.

These aren’t mutually exclusive perspectives. But they operate at different scales with a different emphasis, which I think can sometimes lead to crossed wires or missed opportunities.

As another example, reading this interesting piece on open data governance recently, I found myself wondering about that phrase: “open data governance”. Does it refer to the governance of open data? Being open about how data is governed? The use of open data in governance (e.g. as a public policy tool), or the role of open data in demonstrating good governance (e.g. through transparency)? I think the article touched on all of these, but they seem like quite different things. (Personally I’m not sure there is anything special about the governance of open data as opposed to data in general: open data isn’t special).

Now, all of the above might be completely clear to everyone else and I’m just falling into my usual trap of getting caught up on words and meanings. But picking away at definitions is often useful, so here we are.

The way I’ve rationalised the different data management and public policy perspectives is in thinking about the governance of data as a set of (partly) overlapping contexts. Like this:

 

Governance of data as a set of overlapping contexts

 

Whenever we are managing and using data we are doing so within a nested set of rules, processes, legislation and norms.

In the UK our use of data is bounded by a number of contexts. This includes, for example: legislation from the EU (currently!), legislation from the UK government, legislation defined by regulators, best practices that might define how a sector operates, our norms as a society and community, and then the governance processes that apply within our specific organisations, departments and even teams.

Depending on what you’re doing with the data, and the type of data you’re working with, then different contexts might apply. The obvious one being the use of personal data. As data moves between organisations and countries, then different contexts will apply, but we can’t necessarily ignore the broader contexts in which it already sits.

The narrowest contexts, e.g. those within an organisation, will focus on questions like: “how are we managing dataset XYZ to ensure it is protected and managed to a high quality?” The broadest contexts are likely to focus on questions like: “how do we safely manage personal data?”

Narrow contexts define the governance and stewardship of individual datasets. Wider contexts guide the stewardship of data more broadly.

What the above diagram hopefully shows is that data, and our use of data, is never free from governance. It’s just that the terms under which it is governed may be very loosely defined.

This terrible sketch I shared on twitter a while ago shows another way of looking at this: the laws, permissions, norms and guidelines that define the context in which we use data.

Data use in context

One of the ways in which I’ve found this “overlapping contexts” perspective useful, is in thinking about how data moves into and out of different contexts. For example when it is published or shared between organisations and communities. Here’s an example from this week.

IBM have been under fire because they recently released (or re-released) a dataset intended to support facial recognition research. The dataset was constructed by linking to public and openly licensed images already published on the web, e.g. on Flickr. The photographers, and in some cases the people featured in those images, are unhappy about the photographs being used in this new way. In this new context.

In my view, the IBM researchers producing this dataset made two mistakes. Firstly, they didn’t give proper appreciation to the norms and regulations that apply to this data: the broader contexts which inform how it is governed and used, even though it’s published under an open licence. For example, people’s expectations about how photographs of them will be used.

An open licence helps data move between organisations — between contexts — but doesn’t absolve anyone from complying with all of the other rules, regulations, norms, etc that will still apply to how it is accessed, used and shared. The statement from Creative Commons helps to clarify that their licenses are not a tool for governance. They just help to support the reuse of information.

This led to IBM’s second mistake. By creating a new dataset they took on responsibility as its data steward. And being a data steward means having a well-defined set of data governance processes that are informed and guided by all of the applicable contexts of governance. But they missed some things.

The dataset included content that was created by, and features, individuals. So their lack of engagement with the community of contributors, in order to discuss norms and expectations, was a mistake. The lack of good tools to allow people to remove photos (NBC News created a better tool to allow Flickr users to check the contents of the dataset) is also a shortfall in their duties. It’s the combination of these that has led to the outcry.

If IBM had instead launched an initiative where they built this dataset collaboratively with the community, then they could have avoided this issue. This is the approach that Mozilla took with Voice. IBM, and the world, might even have had a better dataset as a result, because people might have opted in to including more photos. This is important because, as John Wilbanks has pointed out, the market isn’t creating these fairer, more inclusive datasets. We need them to create an open, trustworthy data ecosystem.

Anyway, that’s one example of how I’ve found thinking about the different contexts of governing data helpful in understanding how to build stronger data infrastructure. What do you think? Am I thinking about this all wrong? What else should I be reading?

 

Posted at 14:48

February 24

Bob DuCharme: curling SPARQL

A quick reference.

Posted at 15:45

February 22

AKSW Group - University of Leipzig: 13th DBpedia community meeting in Leipzig

We are happy to invite you to join the 13th edition of the DBpedia Community Meeting, which will be held in Leipzig. Following the LDK conference, May 20-22, the DBpedia Community will get together on May 23rd, 2019 at Mediencampus Villa Ida. Once again the meeting will be accompanied by a varied program of exciting lectures and showcases.

Highlights/ Sessions

  • Keynote #1: Making Linked Data Fun with DBpedia by Peter Haase, metaphacts
  • Keynote #2: From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph by Heiko Paulheim, Universität Mannheim
  • NLP and DBpedia Session
  • DBpedia Association Hour
  • DBpedia Showcase Session

Call for Contribution

What cool things do you do with DBpedia? Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.

Tickets

Attending the DBpedia Community meeting costs 40 €. You need to buy a ticket via eshop.sachsen.de. DBpedia members get free admission. Please contact your nearest DBpedia chapter for a promotion code, or please contact the DBpedia Association.

If you would like to attend the LDK conference, please register here.

We are looking forward to meeting you in Leipzig!

Posted at 11:22

February 15

Dublin Core Metadata Initiative: DCMI 2019: Call for Participation

The Dublin Core Metadata Initiative (DCMI) annual international conference in 2019 will be hosted by the National Library of Korea, in Seoul, South Korea, 23rd - 26th September, 2019. The Organising Committee has just published a call for participation. Please consider submitting a proposal!

Posted at 00:00

February 09

Egon Willighagen: Comparing Research Journals Quality #1: FAIRness of journal articles

What a traditional research article looks like. Nice layout, hard to reuse the knowledge from. Image: CC BY-SA 4.0.
After Plan S was proposed, there finally was a community-wide discussion on the future of publishing. Not everyone is clearly speaking out on whether they want open access or not, but there's a start for more. Plan S aims to reform the current model. (Interestingly, the argument that not a lot of journals are currently "compliant" is sort of the point of the Plan.) One thing it does not want to reform is the quality of the good journals (at least, I have not seen that as one of the principles). There are many aspects to the quality of a research journal. There are also many things that disguise themselves as aspects of quality but are not. This series discusses the quality of a journal. We skip the trivial ones, like peer review, for now, because I honestly do not believe that the cOAlition S funders want worse peer review.

We start with FAIRness (doi:10.1038/sdata.2016.18). This falls, if you like, under the category of added value. FAIRness does not change the validity of the conclusions of an article; it just improves the rigor of the knowledge dissemination. To me, a quality journal is one that takes knowledge dissemination seriously. All journals have a heritage of being printed on paper, and most journals have been very slow in adopting innovative approaches. So, let's put down some requirements for the journal of 2020.

First, about the article itself:

About findable

  • uses identifiers (DOI) at least at article level, but possibly also for figures and supplementary information
  • provides data of an article (including citations)
  • data is actively distributed (PubMed, Scopus, OpenCitations, etc)
  • maximizes findability by supporting probably more than one open standard
About accessible
  • data can be accessed using open standards (HTTP, etc)
  • data is archived (possibly replicated by others, like libraries)
About interoperable
  • data is using open standards (RDF, XML, etc)
  • data uses open ontologies (many open standards exist, see this preprint)
  • uses linked data approaches (e.g. for citations)
About reusable
  • data is as complete as possible
  • data is available under an Open Science compliant license
  • data uses modern and widely used community standards
Pretty straightforward. For author, title, journal name, year, etc., most journals apply this. Of course, bigger publishers that invested in these aspects many moons ago can become compliant much more easily, because they already were.

Second, what about the content of the article? There we start seeing huge differences.

About findable
  • important concepts in the article are easily identified (e.g. with markup)
  • important concepts use (compact) identifiers
Here, the important concepts are entities like cities, genes, metabolites, species, etc, etc. But also reference data sets, software, cited articles, etc. Some journals only use keywords, some journals have policies about use of identifiers for genes and proteins. Using identifiers for data and software is rare, sadly.
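
As a tiny example of what "important concepts use (compact) identifiers" can mean in practice: a compact identifier of the form prefix:accession can be expanded into a resolvable identifiers.org URL and attached to the marked-up concept. The annotations below are illustrative, not taken from any particular article:

# Expand compact identifiers (prefix:accession) into resolvable identifiers.org URLs.
annotations = {
    "glucose": "CHEBI:17234",
    "Homo sapiens": "taxonomy:9606",
    "BRCA1": "hgnc:1100",
}

def resolve(compact_id):
    # identifiers.org resolves compact identifiers of the form PREFIX:ACCESSION
    return "https://identifiers.org/" + compact_id

for concept, compact_id in annotations.items():
    print(concept, "->", resolve(compact_id))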

About accessible
  • articles can be retrieved by concept identifiers (via open, free standards)
  • article-concept identifier links are archived
  • table and figure data is annotated with concept identifiers
  • table and figure data can be accessed in an automated way
Here we see a clear problem. Publishers have been actively fighting this for years, even up to today. Text miners and projects like Europe PMC are stepping in, but they are severely hampered by copyright law and by publishers not wishing to make exceptions.

About interoperable
  • concepts are described using common standards (many available)
  • table and figure data is available as something like CSV, RDF
Currently, the only serious standards used by the majority of (STM?) journals are MeSH terms for keywords and perhaps CrossRef XML for citations. Tables and figures are more than just graphical representations. Some journals are experimenting with this.

About reusable
  • the content of the article has a clear licence, Open Science compliant
  • the content is available using the relevant standards of today
This is hard. These community standards are a moving target. For example, how we name concepts changes over time. Identifiers themselves also change over time. But a journal can be specific and accurate, which ensures that even 50 years from now, the context of the content can be determined. Of course, with proper Open Science approaches, translation to the then-current community standards is simplified.

There are tons of references I can give here. If you really like these ideas, I recommend:
  1. continue reading my blog with many, many pointers
  2. read (and maybe sign) our Open Science Feedback to the Guidance on the Implementation of Plan S (doi:10.5281/zenodo.2560200), which incorporates many of these ideas


Posted at 16:11

February 05

Egon Willighagen: Plan S: Less publications, but more quality, more reusable? Yes, please.

If you look at opinions published in scholarly journals (RSS feed, if you like to keep up), then Plan S is all 'bout the money (as Meja already tried to warn us).


No one wants puppies to die. Similarly, no one wants journals to die. But maybe we should. Well, the journals, not the puppies. I don't know, but it does make sense to me (at this very moment):

The past few decades have seen a significant growth in the number of journals. And before hybrid journals were introduced, publishers tended to start new journals rather than make existing journals Open Access. At the same time, the number of articles has gone up significantly too. In fact, the flood of literature is drowning researchers, and this problem has been discussed for years. But if we have too much literature, should we not aim for less literature, and do it better instead?

Over the past 13 years I have blogged on many occasions about how we can make journals more reusable. And many open scientists can quote you Linus: "given enough eyeballs, all bugs are shallow". In fact, just worded differently, any researcher will tell you exactly the same, which is why we do peer review.
But the problem here is the first two words: given enough.

What if we just started publishing half of what we do now? If we have an APC business model, we have immediately halved(!) the publishing cost. We would also save ourselves a lot of peer-review work and the reading of marginal articles.

And what if we used the time we freed up to actually make knowledge dissemination better? Make journal articles actually machine readable, put some RDF in them? What if we could reuse supplementary information? What if we could ask our smartphone to compare the claims of one article with those of another, just like we compare two smartphones? Oh, they have more data, but theirs has a smaller error margin. Oh, they tried it at that temperature, which seems to work better than in that other paper.

I have blogged about this topic for more than a decade now. I don't want to wait another 15 years for journal publications to evolve. I want some serious activity. I want Open Science in our Open Access.

This is one of my personal motives behind our Open Science Feedback to cOAlition S, and I am happy that 40 people have joined in the past 36 hours, from 12 countries. Please have a read, and please share it with others. Let your social network know why the current publishing system needs serious improvement and that Open Science has had the answer for years now.

Help our push and show your support to cOAlition S to trigger exactly this push for better scholarly publishing: https://docs.google.com/document/d/14GycQnHwjIQBQrtt6pyN-ZnRlX1n8chAtV72f0dLauU/edit?usp=sharing

Posted at 07:01

January 30

Tetherless World Constellation group RPI: AGU Fall Meeting 2018

The American Geophysical Union (AGU) Fall Meeting 2018 was the first time I attended a conference that was of such magnitude in all aspects – attendees, arrangements, content and information. It is an overwhelming experience for a first timer but totally worth it. The amount of knowledge and information that one can learn at this event is the biggest takeaway; how much you absorb depends on each person’s abilities, but trying to get the most out of it is what one will always aim for.

There were 5 to 6 types of events that were held throughout the day for all 5 days. The ones that stood out for me were the poster sessions, e-lightning talks, oral sessions and the centennial plenary sessions.
The poster sessions helped to see at a glance the research that is going on in the various fields all over the world. No matter how much I tried, I found it hard to cover all the sections that piqued my interest in the poster hall. The e-lightning talks were a good way to strike up a conversation on the topic of the talks and get a discussion going among all the attendees. Being a group discussion format, I felt that there was more interaction compared to the other venues. The oral sessions were a great place to get to know how people are exploring their areas of interest and the various methods and approaches that they are using for the same. However, I felt that it is hard for the presenter to cover everything that is important and relevant in the given time span. The time constraints are there for a very valid reason but that might lead to someone losing out on leads if the audience doesn’t fully get the concept. Not all presenters were up to the mark. I could feel a stark difference between the TWC presenters (who knew how to get all the right points across) and the rest of the presenters. The centennial plenary sessions were special this year as AGU is celebrating the centennial year. These sessions highlighted the best of research practices, innovations, achievements and studies. The time slots for these sessions were very short but the work spoke for itself.

The Exhibit Hall had all the companies and organisations that are in the field or related to it. Google, NASA and AGU had sessions, talks and events being conducted here as well. While Google and NASA were focussing on showcasing the ‘Geo-’ aspect of their work, AGU was focussing on the data aspect too, which was refreshing. They had sessions going on about data from the domain scientists’ point of view. This comes across as fundamental or elementary knowledge to us at TWC, but the way they are trying to enable domain scientists to communicate better with data scientists is commendable. AGU is also working on an initiative called “Make data ‘FAIR’ (Findable Accessible Interoperable Reusable) again”, which is once again trying to spread awareness amongst the domain scientists. The exhibit hall is also a nice place to interact with industry, universities and organisations who have research programs for doctorate students and postdocs.

In retrospect, I think planning REALLY ahead of time is a good idea so that you know what to ditch and what not to miss. A list of ‘must attend’ sessions could have helped with the decision-making process. A group discussion at one of our meetings where everyone shares what they find important, before AGU, could be a good idea. Being just an audience member is great and one gets to learn a lot, but contributing to this event would be even better. This event was amazing and has given me a good idea as to how to be prepared the next time I am attending it.

 

Posted at 14:49

January 29

Tetherless World Constellation group RPI: AGU Conference: Know Before You Go

If this is your first American Geophysical Union (AGU) conference, be ready! Below are a few pointers for future first-timers.

The conference I attended was hosted in Washington, D.C. at the Walter E. Washington Convention Center during the week of December 10th, 2018. It brought together over 25,000 people. Until this conference, I had not experienced the pleasure and the power of so many like-minds in one space. The experience, while exhausting, was exhilarating!

One of the top universal concerns at the AGU Conference is scheduling. You should know that I was not naïve to the opportunities and scheduling difficulties prior to 2018, my first year of attendance. I had spent the last several months organizing an application development team that successfully created a faceted browsing app with calendaring for this particular conference using live data. Believe me when I say, “Schedule before you go”. Engage domain scientists and past participants about sessions, presentations, events, and posters that are a must-see. There is so much to learn at the conference. Do not miss the important stuff. The possibilities are endless, and you will need the expertise of those prior attendees. Plan breaks for yourself. Use those breaks to wander the poster hall, exhibit hall, or the vendor displays.

Key Elements in Scheduling Your Week

  • Do not front load your week. You need time to explore.
    • Be prepared to alter your existing schedule, as a result.
  • Plan on being exhausted.
  • Eat to fuel your body and your mind.
    • Relax, but not too much.
  • Plan on networking. To do that, you need to be sharp!
    • The opportunities to network will exceed your wildest expectations.
  • Take business cards – your own, and from people you meet.

Finally, take some time to see the city that holds the conference. There are many experiences to be had that will add to your education.

The Sessions

So. Many. Sessions!

There are e-lightning talks. There are oral sessions.  There are poster sessions. There are town hall sessions. There are scientific workshops. There are tutorial talks. There are keynotes. Wow!

The e-lightning talks are exciting. There are lots of opportunities to interact in this presentation mode. The e-lightning talks are held in the Poster Hall. A small section provides chairs for about 15 – 20 attendees, with plenty of standing-room-only space. This informal session leads to great discussion amongst attendees. Be sure to put one of these in your schedule!

Oral sessions are what you would expect; people working in the topic, sitting in chairs at the front of the room, each giving a brief talk, then, time permitting, a Q&A session at the end. Remember these panels are filled with knowledge. For the oral sessions that you schedule to attend, read the papers prior to attending. More importantly, have some questions prepared.

//Steps onto soapbox//

  1. If you are female, know the facts! (Nature International Journal of Science, 2018)
  2. Females are less likely to ask a question if a male asked a prior question.
  3. Get up there!
  4. Grab the mic!
  5. Ask the question anyway.
  6. Do NOT wait until afterwards to speak with the presenters. They are feeling just as overwhelmed as you are by all of the opportunities available to them at this conference.
  7. Please read the referenced article in bullet #1. The link is provided at the end of this post.

//Steps down from soapbox//

The poster sessions are a great way to unwind by getting in some walking. There are e-posters which are presented on screens provided by AGU or the venue. There are the usual posters as well. The highlights of attending a poster session, besides the opportunity to stretch your legs, include the opportunity to practice meeting new people, asking in-depth questions on topics of interest, talking to people doing the research, and checking out the data being used for the research. You will want to have a notepad with you for the poster sessions. Don’t just take notes; take business cards! Remember, what makes poster sessions special is that they are an example of the latest research that has not, yet, become a published paper. The person doing the research is quite likely the presenter of the poster.

All those special sessions – the town halls, the scientific workshops, the tutorial talks, and the keynotes – are the ones to ask prior attendees, past participants, and experts about: which are the must-sees? Get them in your schedule. Pay attention. Take notes. Read the papers behind the sessions; if not the papers, then at least the abstracts. Have your questions ready before you go!

Timing

This is really important. Do NOT arrive without your time at this conference well planned. To do that you are going to need to spend several weeks preparing: reading papers, studying schedules, writing questions, and more. In order to have a really successful, time-well-spent experience, you are going to need to begin preparing for this immense conference by November 1st.

Oh, how I wish I had listened to all the people that told me this!

Put an hour per day in your calendar, from November 1st until AGU Conference Week, to study and prepare for this conference. I promise you will not regret the time you spent preparing.

The biggest thing to remember and the one thing that all attendees must do is:

Have a great time!

Works Cited

Nature International Journal of Science. (2018, October 17). Why fewer women than men ask questions at conferences. Retrieved from Nature International Journal of Science Career Brief: https://www.nature.com/articles/d41586-018-07049-x

Posted at 18:47

January 26

Leigh Dodds: Impressions from pidapalooza 19

This week I was at the third pidapalooza conference in Dublin. It’s a conference dedicated to open identifiers: how to create them, steward them, drive adoption and promote their benefits.

Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.

I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.

Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.

Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.

But I think I was one of only a few attendees (perhaps the only one?) from what I’ll call the “broader” open data community. That wasn’t a complete surprise, but I think the conference as a whole could benefit from a wider audience and set of participants.

If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.

Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people who were past the stage of arguing that data is infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There are a lot of lessons there that are more generally applicable.

For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc). There are lessons here for other sectors around benefits of wider sharing of data.

I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.

The need for transparent governance and long-term sustainability was also a frequent topic. As was the recognition that data infrastructure takes time to build. Technology is easy; it’s growing a community and building consensus around an approach that takes time.

(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)

Secondly, the pidapalooza community seems to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or supported by a range of APIs and services.

Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of use of identifiers in research datasets. Again, this focus follows from the community around the conference.

But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy, and ethical considerations around identification of individuals, for example, that will need to be addressed.

I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.

Posted at 12:33

Leigh Dodds: Talk: Documenting Identifiers for Humans and Machines

This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing. I’d also really welcome your thoughts and feedback on this discussion document.

At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.

Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.

A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.

We’ve also explored ways to manage and publish registers. “Register” isn’t a word that I’ve encountered much in this community. But it’s frequently used to describe a whole set of government data assets.

Registers are reference datasets that provide both unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. As are lists of doctors, countries and locations where you can get your car taxed. We’ve explored different models for stewarding registers and ways to build trust around how they are created and maintained.

In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.

The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.

There’s also been great work and progress around giving identifiers affordance. Making them actionable.

Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, there are often resolvers that help to integrate them with the web.

Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.

But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.

Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, an increasing number of data users and communities are likely to encounter these identifiers.

I’m sure everyone in this room knows what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did). TOIDs are a national identifier scheme. There’s a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.

As identifiers become shared between communities, it’s important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.

Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.

To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.

Finally identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.

Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.

So what documentation should we publish around identifier schemes?

I’ve created a discussion document to gather and present some thoughts around this. Please have a read and leave your comments and suggestions on that document. For this presentation I’ll just talk through some of the key categories of information.

I think these are:

  • Descriptive information that provides the background to a scheme, such as what it’s for, when it was created, examples of it being used, etc
  • Governance information that describes how the scheme is managed, who operates it and how access is managed
  • Technical notes that describe the syntax and validation rules for the scheme
  • Operational information that helps developers understand how many identifiers there are, when and how new identifiers are assigned
  • Service pointers that signpost to resolvers and other APIs and services that help people use or adopt the identifiers

I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.
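
Purely as an illustration (and not a proposal for any particular format or vocabulary), scheme-level metadata of the kind listed above might look something like this in Turtle; the scheme, names and URLs below are all made up:

@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/id/> .

# a hypothetical identifier scheme, described as a standard
ex:widget-id-scheme
    a dct:Standard ;
    dct:title "Widget Identifier (WID)" ;
    dct:description "Persistent identifiers for widgets, assigned by the registry operator." ;
    dct:publisher <http://example.org/registry-operator> ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dcat:landingPage <http://example.org/id/docs> .

Even something this small covers parts of the descriptive, governance and service categories; the technical and operational notes could hang off the same resource.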

Before jumping into bike-shedding around formats, there are a few immediate questions to consider:

  • how do we make this metadata discoverable, e.g. from datasets and individual identifiers?
  • are there different use cases that might encourage us to separate out some of this information into separate formats and/or types of documentation?
  • what services might we build off the metadata?
  • …etc

I’m interested to know whether others think this would be a useful exercise to take further, and also the best forum for doing that. For example, should there be a W3C community group or similar that we could use to discuss and publish some best practice?

Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.

Thanks for listening.

Posted at 11:41

Leigh Dodds: Talk: Tabular data on the web

This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.

At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.

An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.

We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.

Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.

The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.

Technology is also changing. The ways in which we find and consume information are evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.

New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.

So how do we integrate data with the web? To ensure that statistics are well described and easy to find?

We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.

What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
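
For anyone who hasn’t seen it, the markup involved is small. A minimal, purely illustrative schema.org description of a dataset, embedded in a page as JSON-LD, looks something like this (the names and URLs are invented):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example regional observations",
  "description": "Annual observations for regions, published as CSV.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/observations.csv"
  }
}
</script>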

And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that describe how to annotate tabular data on the web: how to attach metadata and schemas to CSV files, and how to transform them into JSON and RDF.

The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.

One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.
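
As a rough illustration of the mechanics (the file name, columns and note below are invented, and this only scratches the surface of what the specifications allow), a CSV file can be paired with a small JSON metadata document:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "observations.csv",
  "dc:title": "Example observations",
  "notes": [ "Figures for the most recent period are provisional." ],
  "tableSchema": {
    "columns": [
      { "name": "area",   "titles": "Area",   "datatype": "string" },
      { "name": "period", "titles": "Period", "datatype": "integer" },
      { "name": "value",  "titles": "Value",  "datatype": "decimal" }
    ]
  }
}

The metadata travels alongside the CSV, so tools that understand CSV on the Web can validate the data and carry the annotations through into JSON or RDF outputs.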

Unfortunately, CSV on the Web is still not that widely adopted. Even though it’s relatively simple to implement.

(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics is also moving to adopt it)

This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.

There also aren’t a great many tools that support CSV on the web currently.

It might also be that there are some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations, e.g. so that we can clearly describe what type of statistics are being reported in a dataset?

But adoption could be driven from multiple angles. For example:

  • open data tools, portals and data publishers could start to generate best practice CSVs. That would be easy to implement
  • open data portals could also readily adopt CSV on the Web metadata, most already support DCAT
  • standards developers could adopt CSV on the Web as their primary means of defining schemas for tabular formats

Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.

Thanks for listening.

Posted at 10:55

January 25

Dublin Core Metadata Initiative: W3C Data Exchange Working Group - Request for Feedback

The W3C Data Exchange Working Group has issued first drafts of two potential standards relating to profiles. The first is a small ontology for profile resources. This ontology is based on the case where an application profile is made up of one or more documents or resources, such as both human-readable instructions and a SHACL validation document. The ontology links those into a single graph called a "profile" and states the role of each resource.
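
A rough Turtle sketch of that idea, based on the draft profiles vocabulary (the exact class, property and role names may differ in the published drafts, so treat this purely as an illustration):

@prefix prof: <http://www.w3.org/ns/dx/prof/> .
@prefix role: <http://www.w3.org/ns/dx/prof/role/> .
@prefix ex:   <http://example.org/profiles/> .

# a hypothetical application profile bundling two resources
ex:my-profile
    a prof:Profile ;
    prof:isProfileOf <http://purl.org/dc/terms/> ;
    prof:hasResource [
        a prof:ResourceDescriptor ;
        prof:hasRole role:guidance ;
        prof:hasArtifact <http://example.org/profiles/guide.html>
    ] , [
        a prof:ResourceDescriptor ;
        prof:hasRole role:validation ;
        prof:hasArtifact <http://example.org/profiles/shapes.ttl>
    ] .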

Posted at 00:00

January 22

Tetherless World Constellation group RPI: TWC at AGU FM 2018

In 2018, AGU celebrated its centennial year. TWC had a good showing at this AGU, with 8 members attending and presenting on a number of projects.

We arrived in DC on Saturday night to attend the DCO Virtual Reality workshop organized by Louis Kellogg and the DCO Engagement Team, where researchers from the greater DCO community came together to present, discuss and understand how the use of VR can facilitate and improve both research and teaching. Oliver Kreylos and Louis Kellogg spent various sessions presenting the results of the DCO VR project, which involved recreating some of the visualizations used commonly at TWC, i.e. the mineral networks. For a preview of using the VR environment, check out these three tweets. Visualizing mineral networks in a VR environment has yielded some promising results; we observed interesting patterns in the networks which need to be explored and validated in the near future.

With a successful pre-AGU workshop behind us, we geared up for the main event. First thing Monday morning was the “Predictive Analytics” poster session, which Shaunna Morrison, Fang Huang, and Marshall Ma helped me convene. The session, while low on abstracts submitted, was full of very interesting applications of analytics methods in various earth and space science domains.

Fang Huang also co-convened a VGP session on Tuesday, titled “Data Science and Geochemistry“. It was a very popular session, with 38 abstracts. It is very encouraging to see divisions other than ESSI have Data Science sessions. This session also highlighted the work of many of TWC’s collaborators from the DTDI project. Kathy Fontaine convened an e-lightning session on data policy. This new format was very successful in drawing a large crowd to the event and enabled a great discussion on the topic. The day ended with Fang’s talk, presenting our findings about the network analysis of samples from the Cerro Negro volcano.

Over the next two days, many of TWC’s collaborators presented, but no one from TWC presented until Friday. Friday, though, was the busiest day for all of us from TWC. Starting with Peter Fox’s talk in the morning, Mark Parsons, Ahmed Eleish, Kathy Fontaine and Brenda Thomson all presented their work during the day. Oh yeah…and I presented too! My poster on the creation of the “Global Earth Mineral Inventory” got good feedback. Last, but definitely not least, Peter represented the ESSI division during the AGU centennial plenary, where he talked about the future of Big Data and Artificial Intelligence in the Earth Sciences. The video of the entire plenary can be found here.

Overall, AGU18 was great. Beyond the talks and posters mentioned above, multiple productive meetings and potential collaborations emerged from meeting various scientists and talking to them about their work. It was an incredible learning experience for me and the other students (for whom this was their first AGU).

As for other posters and talks I found interesting: I tweeted a lot about them during AGU, and fortunately I did make a list of some interesting posters.

Posted at 17:41

January 20

Bob DuCharme: Querying machine learning distributional semantics with SPARQL

Bringing together my two favorite kinds of semantics.

Posted at 14:57

January 11

Libby Miller: Balena’s wifi-connect – easy wifi for Raspberry Pis

When you move a Raspberry Pi between wifi networks and you want it to behave like an appliance, one way to let a user (rather than a developer) set the wifi network easily is to have the Pi create an access point itself. You connect to that access point with a phone or laptop, enter the wifi information in a browser, and the Pi then reconnects to the proper network. Balena have a video explaining the idea.

Andrew Nicolaou has written things to do this periodically as part of Radiodan. His most recent suggestion was to try Resin (now Balena)’s wifi-connect. Since Andrew last tried, there’s a bash script from Balena to install it as well as a Docker file, so it’s super easy with just a few tiny pieces missing. This is what I did to get it working:

Provision an SD card with Stretch e.g. using Etcher or manually

Enable ssh e.g. by

touch /Volumes/boot/ssh

Share your network with the pi via ethernet, ssh in and enable wifi by setting your country:

sudo raspi-config

then Localisation Options -> Set wifi country.

Install wifi-connect

bash <(curl -L https://github.com/balena-io/wifi-connect/raw/master/scripts/raspbian-install.sh)

Add a slightly-edited version of their bash script

curl https://gist.githubusercontent.com/libbymiller/e8fe6821e122e0a0ac921c8e557320a9/raw/46138fb4d28b494728e66515e46bd7d736b19132/start.sh > /home/pi/start-wifi-connect.sh

Add a systemd script to start it on boot.

sudo nano /lib/systemd/system/wifi-connect-start.service

-> contents:

[Unit]
Description=Balena wifi connect service
After=NetworkManager.service

[Service]
Type=simple
ExecStart=/home/pi/start-wifi-connect.sh
Restart=on-failure
StandardOutput=syslog
SyslogIdentifier=wifi-connect
Type=idle
User=root

[Install]
WantedBy=multi-user.target

Enable the systemd service

sudo systemctl enable wifi-connect-start.service

Reboot the pi

sudo reboot

A wifi network should come up called “Wifi Connect”. Connect to it, add in your details into the captive portal, and wait. The portal will go away and then you should be able to ping your pi over the wifi:

ping raspberrypi.local

(You might need to disconnect your ethernet cable from the Pi before connecting to the Wifi Connect network if you were sharing your network that way.)

Posted at 17:21

December 27

Egon Willighagen: Creating nanopublications with Groovy

[Image: Compound found in Taphrorychus bicolor (doi:10.1002/JLAC.199619961005). Published in Liebigs Annalen, see this post about the history of that journal.]

Yesterday I struggled some with creating nanopublications with Groovy. My first attempt was an utter failure, but then I discovered Thomas Kuhn's NanopubCreator and it was downhill from there.

There are two good things about this. First, I now have a code base that I can easily repurpose to make trusty nanopublications (doi:10.1007/978-3-319-07443-6_63) about anything structured as a table (so can you).

Second, I now have almost 1200 CCZero nanopublications that tell you in which species a certain metabolite has been found. Sourced from Wikidata, using their SPARQL endpoint. This collection is a bit boring at this moment, as most of them are human metabolites, where the source is either Recon 2.2 or WikiPathways. But I expect (hope) to see more DOIs show up. Think “We challenge you to reuse Additional Files”.
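
Each nanopublication follows the standard layout of four named graphs: a head that ties the other three together, the assertion itself, its provenance, and the publication info. A rough TriG sketch of that shape (using placeholder names rather than the actual generated terms) looks like this:

@prefix np:   <http://www.nanopub.org/nschema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/np/1#> .

ex:head {
  ex:pub a np:Nanopublication ;
    np:hasAssertion ex:assertion ;
    np:hasProvenance ex:provenance ;
    np:hasPublicationInfo ex:pubinfo .
}

ex:assertion {
  # the claim itself, e.g. "this metabolite has been found in this species"
  ex:someMetabolite ex:foundInTaxon ex:someSpecies .
}

ex:provenance {
  ex:assertion prov:wasDerivedFrom <https://query.wikidata.org/> .
}

ex:pubinfo {
  ex:pub dct:license <https://creativecommons.org/publicdomain/zero/1.0/> ;
    prov:wasAttributedTo <http://example.org/author> .
}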

Finally, you are probably interested in learning what one of the created nanopublications looks like, so I put a Gist online:


Posted at 06:59

December 23

Bob DuCharme: Playing with wdtaxonomy

Those queries from my last blog entry? Never mind!

Posted at 14:51

Copyright of the postings is owned by the original blog authors. Contact us.