Planet RDF

It's triples all the way down

June 25

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings as well as Hadoop. In general, this can be broken down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom found.

Different reasons might exist why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to code it in the application logic.

On the IaaS level, especially in this category (file storage for app development), one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3) and concerning Google’s App Engine, good practices for data encryption only seem to emerge.

Offerings on the SaaS level paint an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft SkyDrive do not seem to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS and the like do encryption, such as eCryptfs or Gazzang's offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS's EMR by using S3.


Posted at 16:09

June 21

Benjamin Nowack: Dynamic Semantic Publishing for any Blog (Part 2: Linked ReadWriteWeb)

The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", and I wondered whether it could be applied to basically any weblog.

Over the last few days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have a spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.



In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Trice Bot Console

Here is a quick re-cap of the proposed dynamic semantic publishing process, followed by a detailed description of the individual components:
  • Index and monitor the archives pages, build a registry of post URLs.
  • Load and parse posts into raw structures (title, author, content, ...).
  • Extract named entities from each post's main content section.
  • Build a site-optimized schema (an "ontology") from the data structures generated so far.
  • Align the extracted data structures with the target ontology.
  • Re-purpose the final dataset (widgets, entity hubs, semantic ads, authoring tools).

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

Archives triples via SPARQL
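
For illustration, a minimal sketch of such an indexer bot could look like the code below. This is not the actual Trice bot, just a rough ARC2-based approximation; the store settings, the archives URL, and the vocabulary/graph URIs are placeholders.

/* Hypothetical sketch of the archives indexer bot (not the actual Trice code). */
$store = ARC2::getStore(array(
  'db_name' => 'linked_rww',
  'db_user' => 'user',
  'db_pwd' => 'pwd',
  'store_name' => 'rww',   /* placeholder store config, $store->setUp() is needed once */
));
$html = file_get_contents('http://www.readwriteweb.com/archives/');  /* assumed URL */
/* extract all link URLs matching the "YYYY/MM" pattern */
preg_match_all('/href="(http[^"]+\/20\d\d\/\d\d\/?[^"]*)"/', $html, $m);
$triples = '';
foreach (array_unique($m[1]) as $url) {
  $triples .= '<' . $url . '> a <http://example.org/terms#ArchivesPage> . ';
}
/* save the page URLs in the ARC store via its SPARQL+ INSERT extension */
$store->query('INSERT INTO <http://example.org/graphs/archives> { ' . $triples . ' }');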

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
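
To give an idea of the DOM-path approach, here is a much simplified sketch. The XPath expressions are made up for illustration; the real paths depend on the blog template, and the site-specific extensions mentioned above (multi-page posts, share counts, site-wide identifiers) are omitted.

/* Simplified sketch of DOM-path-based post parsing (placeholder XPath expressions). */
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($post_url));   /* $post_url comes from the URL registry */
$xpath = new DOMXPath($doc);
$post = array(
  'url'       => $post_url,
  'title'     => trim($xpath->evaluate('string(//h1[@class="post-title"])')),
  'author'    => trim($xpath->evaluate('string(//span[@class="author"]/a)')),
  'published' => trim($xpath->evaluate('string(//abbr[@class="published"]/@title)')),
  'tags'      => array(),
);
foreach ($xpath->query('//a[@rel="tag"]') as $node) {
  $post['tags'][] = trim($node->textContent);
}
/* the structure is then converted to RDF triples and saved in the ARC store */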

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Raw post structures
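
As an illustration of such a widget, a "related entries (via shared tags)" box could be driven by a query along the following lines. This is a sketch only, assuming $store and $post_uri are already set up; the blog: vocabulary stands in for whatever terms the parser bot actually writes.

/* Sketch of a "related entries via shared tags" widget (placeholder vocabulary). */
$q = '
  PREFIX blog: <http://example.org/terms#>
  SELECT DISTINCT ?related ?title WHERE {
    <' . $post_uri . '> blog:tag ?tag .
    ?related blog:tag ?tag ;
             blog:title ?title .
    FILTER(?related != <' . $post_uri . '>)
  }
  LIMIT 5
';
foreach ($store->query($q, 'rows') as $row) {   /* ARC2 returns plain result rows */
  echo '<li><a href="' . htmlspecialchars($row['related']) . '">'
     . htmlspecialchars($row['title']) . '</a></li>';
}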

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

RWW through Paggr Prospect

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can do the job, too:

RWW entity types
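
For reference, a minimal version of such a query, wrapped in PHP against the ARC2 store, simply counts the instances per type (this uses ARC2's SPARQL+ aggregate syntax; a SPARQL 1.1 engine would want the COUNT expression in parentheses):

/* Sketch: list the entity types in the enhanced dataset with their instance counts. */
$q = '
  SELECT ?type COUNT(?s) AS ?count WHERE {
    ?s a ?type .
  }
  GROUP BY ?type
';
foreach ($store->query($q, 'rows') as $row) {
  echo $row['type'] . ': ' . $row['count'] . "\n";
}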

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

RWW ontology

Aligning the data with the target ontology

In this step, we again use a software agent and break things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data (a sketch of the first one follows the list):
  • Consolidate author aliases ("richard-macmanus-1 = richard-macmanus-2" etc.).
  • Normalize author tags, Zemanta tags, OpenCalais tags, and OpenCalais "industry terms" to a single "tag" field.
  • Consolidate the various type identifiers into canonical ones.
  • For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.
  • Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
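
As a concrete sketch of the first of these operations, an author-alias consolidation expressed as a SPARQL 1.1 Update could look roughly like the following; the graph, property, and author URIs are placeholders rather than the identifiers the parser bot actually generated.

/* Sketch: merge the alias "richard-macmanus-2" into "richard-macmanus-1". */
$update = '
  PREFIX blog: <http://example.org/terms#>
  WITH <http://example.org/graphs/posts>
  DELETE { ?post blog:author <http://example.org/authors/richard-macmanus-2> . }
  INSERT { ?post blog:author <http://example.org/authors/richard-macmanus-1> . }
  WHERE  { ?post blog:author <http://example.org/authors/richard-macmanus-2> . }
';
/* $update would then be sent to the store's SPARQL 1.1 Update endpoint via HTTP POST */
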
Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful. Each optimization step adds to the network effects, and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is just meant to be a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Dynamic RWW Entity Hub

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the ones described in this post, please get in touch.

Posted at 11:06

June 20

Benjamin Nowack: I'm joining Talis!

KASABI data marketplace

I received a number of very interesting job offers when I began searching for something new last month, but there was one company that stood out, and that is Talis. Not only do I know many people there already, I also find Talis' new strategic focus and products very promising. In addition, they know and use some of my tools already, and I've successfully worked on Talis projects with Leigh and Keith before. The job interview almost felt like coming home (and the new office is just great).

So I'm very happy to say that I'm going to become part of the Kasabi data marketplace team in September where I'll help create and drupalise data management and data market tools.

BeeNode

I will have to get up to speed with a lot of new things, and the legal and travel cost overhead for Talis is significant, so I hope I can turn this into a smart investment for them as quickly as possible. I'll even rename my blog if necessary... ;-) For those wondering about the future of my other projects, I'll write about them in a separate post soon.

Can't wait to start!

Posted at 21:06

Benjamin Nowack: Want to hire me?

I have been happily working as a self-employed semantic web developer for the last seven years. With steady progress, I dare say, but the market is still evolving a little bit too slowly for me (well, at least here in Germany) and I can't invest any longer. So I am looking for new challenges and an employer who would like to utilize my web technology experience (semantic or not). I have created a new personal online profile with detailed information about me, my skills, and my work.

My dream job would be in the social and/or data web area, I'm particularly interested in front-end development for data-centric or stream-oriented environments. I also love implementing technical specifications (probably some gene defect).

The potential show-stopper: I can't really relocate, for private reasons. I am happy to (tele)commute or travel, though. And I am looking for full-time employment (or a full-time, longer-term contract). I am already applying for jobs, mainly here in Düsseldorf so far, but I thought I'd send out this post as well. You never know :)

Posted at 21:06

June 11

Benjamin Nowack: Trice' Semantic Richtext Editor

In my previous post I mentioned that I'm building a Linked Data CMS. One of its components is a rich-text editor that allows the creation (and embedding) of structured markup.

An earlier version supported limited Microdata annotations, but now I've switched the mechanism and use an intermediate, but even simpler approach based on HTML5's handy data-* attributes. This lets you build almost arbitrary markup with the editor, including Microformats, Microdata, or RDFa. I don't know yet when the CMS will be publicly available (3 sites are under development right now), but as mentioned, I'd be happy about another pilot project or two. Below is a video demonstrating the editor and its easy customization options.

Posted at 22:07

June 10

Benjamin Nowack: Semantic WYSIWYG in-place editing with Swipe

Several months ago (ugh, time flies) I posted a screencast demo'ing a semantic HTML editor. Back then I used a combination of client-side and server-side components, which I have to admit led to quite a number of unnecessary server round-trips.

In the meantime, others have shown that powerful client-side editors can be implemented on top of HTML5, and so I've now rewritten the whole thing and turned it into a pure JavaScript tool as well. It now supports inline WYSIWYG editing and HTML5 Microdata annotations.

The code is still at beta stage, but today I put up an early demo website which I'll use as a sandbox. The editor is called Swipe (like the dance move, but it's an acronym, too). What makes Swipe special is its ability to detect the caret coordinates even when the cursor is inside a text node, which is usually not possible with W3C range objects. This little difference enables several new possibilities, like precise in-place annotations or "linked-data-as-you-type" functionality for user-friendly entity suggestions. More to come soon...

Swipe - Semantic WYSIWYG in-place editor

Posted at 07:06

May 31

Leigh Dodds: The words we use for data

I’ve been on leave this week so, amongst the gardening and relaxing I’ve had a bit of head space to think.  One of the things I’ve been thinking about is the words we choose to use when talking about data. It was Dan‘s recent blog post that originally triggered it. But I was reminded of it this week after seeing more people talking past each other and reading about how the Guardian has changed the language it uses when talking about the environment: Climate crisis not climate change.

As Dan pointed out we often need a broader vocabulary when talking about data.  Talking about “data” in general can be helpful when we want to focus on commonalities. But for experts we need more distinctions. And for non-experts we arguably need something more tangible. “Data”, “algorithm” and “glitch” are default words we use but there are often better ones.

It can be difficult to choose good words for data because everything can be treated as data these days. Whether it’s numbers, text, images or video everything can be computed on, reported and analysed. Which makes the idea of data even more nebulous for many people.

In Metaphors We Live By, George Lakoff and Mark Johnson discuss how the range of metaphors we use in language, whether consciously or unconsciously, impacts how we think about the world. They highlight that careful choice of metaphors can help to highlight or obscure important aspects of the things we are discussing.

The example that stuck with me was how we describe debates. We often do so in terms of things to be won, or battles to be fought ("the war of words"). What if we thought of debates as dances instead? Would that help us focus on compromise and collaboration?

This is why I think that data as infrastructure is such a strong metaphor. It helps to highlight some of the most important characteristics of data: that it is collected and used by communities, needs to be supported by guidance, policies and technologies and, most importantly, needs to be invested in and maintained to support a broad variety of uses. We’ve all used roads and engaged with the systems that let us make use of them. Focusing on data as information, as zeros and ones, brings nothing to the wider debate.

If our choice of metaphors and words can help to highlight or hide important aspects of a discussion, then what words can we use to help focus some of our discussions around data?

It turns out there’s quite a few.

For example there are “samples” and “sampling”. These are words used in statistics but their broader usage has the same meaning. When we talk about sampling something, whether it's food or drink, music or perfume, it's clear that we're not taking the whole thing. Talking about sampling might help us to be clearer that often when we're collecting data we don't have the whole picture. We just have a tester, a taste. Hopefully one which is representative of the whole. We can make choices about when, where and how often we take samples. We might only be allowed to take a few.

“Polls” and “polling” are similar words. We sample people's opinions in a poll. While we often use these words in more specific ways, they helpfully come with some understanding that this type of data collection and analysis is imperfect. We're all very familiar at this point with the limitations of polls.

Or how about “observations” and “observing“?  Unlike “sensing” which is a passive word, “observing” is more active and purposeful. It implies that someone or something is watching. When we want to highlight that data is being collected about people or the environment “taking observations” might help us think about who is doing the observing, and why. Instead of “citizen sensing” which is a passive way of describing participatory data collection, “citizen observers” might place a bit more focus on the work and effort that is being contributed.

“Catalogues” and “cataloguing” are words that, for me at least, imply maintenance and value-added effort. I think of librarians cataloguing books and artefacts. “Stewards” and “curators” are other important roles.

AI and Machine Learning are often being used to make predictions. For example, of products we might want to buy, or whether we're going to commit a crime. Or how likely it is that we might have a car accident based on where we live. These predictions are imperfect. But we talk about algorithms as “knowing”, “spotting”, “telling” or “helping”. They don't really do any of those things.

What they are doing is making a “forecast“. We’re all familiar with weather forecasts and their limits. So why not use the same words for the same activity? It might help to highlight the uncertainty around the uses of the data and technology, and reinforce the need to use these forecasts as context.

In other contexts we talk about using data to build models of the world. Or to build “digital twins“. Perhaps we should just talk more about “simulations“? There are enough people playing games these days that I suspect there’s a broader understanding of what a simulation is: a cartoon sketch of some aspect of the real world that might be helpful but which has its limits.

Other words we might use are “ratings” and “reviews” to help describe data and systems that create rankings and automated assessments. Many of us have encountered ratings and reviews and understand that they are often highly subjective and need interpretation.

Or how about simply “measuring” as a tangible example of collecting data? We’ve all used a ruler or measuring tape and know that sometimes we need to be careful about taking measurements: “Measure twice, cut once”.

I’m sure there are lots of others. I’m also well aware that not all of these terms will be familiar to everyone. And not everyone will associate them with things in the same way as I do. The real proof will be testing words with different audiences to see how they respond.

I think I'm going to try to deliberately use a broad range of language in my talks and writing and see how it fares.

What terms do you find most useful when talking about data?

Posted at 18:05

May 30

Leigh Dodds: How can we describe different types of dataset? Ten dataset archetypes

As a community, when we are discussing recommendations and best practices for how data should be published and governed, there is a natural tendency for people to focus on the types of data they are most used to working with.

This leads to suggestions that every dataset should have an API, for example. Or that every dataset should be available in bulk. While good general guidance, those approaches aren’t practical in every case. That’s because we also need to take into account a variety of other issues, including:

  • the characteristics of the dataset
  • the capabilities of the publishing organisation and the funding they have available
  • the purpose behind publishing the data
  • and the ethical, legal and social contexts in which it will be used

I’m not going to cover all of that in this blog post.

But it occurred to me that it might be useful to describe a set of dataset archetypes, that would function a bit like user personas. They might help us better answer some of the basic questions people have around data, discuss recommendations around best practices, inform workshop exercises or just test our assumptions.

To test this idea I've briefly described ten archetypes. For each one I've tried to describe some of its features, identify some specific examples, and briefly outline some of the challenges that might apply in providing sustainable access to it.

Like any characterisation, detail is lost. This is not an exhaustive list. I haven't attempted to list every possible variation based on size, format, timeliness, category, etc. But I've tried to capture a range that hopefully illustrates some different characteristics. The archetypes reflect my own experiences; you will have different thoughts and ideas. I'd love to read them.

The Study

The Study is a dataset that was collected to support a research project. The research group collected a variety of new data as part of conducting their study. The dataset is small, focused on a specific use case, and there are no plans to maintain or update it further as the research group does not have any ongoing funding to collect or maintain the dataset. The data is provided as is for others to reuse, e.g. to confirm the original analysis of the data or to use it in other studies. To help others, and as part of writing some academic papers that reference the dataset, the research group has documented their methodology for collecting the data. The dataset is likely published in an academic data portal or alongside the academic papers that reference it.

Examples: water quality samples, field sightings of animals, laboratory experiment results, bibliographic data from a literature review, photos showing evidence of plant diseases, consumer research survey results

The Sensor Feed

The Sensor Feed is a stream of sensor readings that are produced by a collection of sensors that have been installed across a city. New readings are added to the stream at regular intervals. The feed is provided to allow a variety of applications to tap into the raw sensor readings. The data points are as directly reported by the individual sensors and are not quality controlled. The individual sensors may have been updated, re-calibrated or replaced over time. The readings are part of the operational infrastructure of the city so can be expected to be available over at least the medium term. This means the dataset is effectively unbounded: new observations will continue to be reported until the infrastructure is decommissioned.

Examples: air quality readings, car park occupancy, footfall measurements, rain gauges, traffic light queuing counts, real-time bus locations

The Statistical Index

The Statistical Index is intended to provide insights into the performance of specific social or economic policies by measuring some aspect of a local community or economy. For example a sales or well-being index. The index draws on a variety of primary datasets, e.g. on commercial activities, which are then processed according to a documented methodology to generate the index. The Index is stewarded by an organisation and is expected to be available over the long term. The dataset is relatively small and is reported against specific geographic areas (e.g. from The Register) to support comparisons. The Index is updated on a regular basis, e.g. monthly or annually. Use of the data typically involves comparing across time and location at different levels of aggregation.

Examples: street safety survey, consumer price indices, happiness index, various national statistical indexes

The Register

The Register is a set of reference data that is useful for adding context to other datasets. It consists of a list of specific things, e.g. locations, cars, or services, with a unique identifier and some basic descriptive metadata for each of the entries on the list. The Register is relatively small, but may grow over time. It is stewarded by an organisation tasked with making the data available for others. The steward, or custodian, provides some guarantees around the quality of the data. It is commonly used as a means to link, validate and enrich other datasets and is rarely used in isolation other than in reporting on changes to the size and composition of the register.

Examples: licensed pubs, registered doctors, lists of MOT stations, registered companies, a taxonomy of business types, a statistical geography, addresses

The Database

The Database is a copy or extract of the data that underpins a specific application or service. The database contains information about a variety of different types of things, e.g. musicians and their albums and songs. It is a relatively large dataset that can be used to perform a variety of different types of query and to support a variety of uses. As it is used in a live service it is regularly updated, undergoes a variety of quality checks, and is growing over time in both volume and scope. Some aspects of The Database may reference one or more Registers or could be considered as Registers in themselves.

Examples: geographic datasets that include a variety of different types of features (e.g. OpenStreetMap, MasterMap), databases of music (e.g. MusicBrainz) and books (e.g. OpenLibrary), company product and customer databases, Wikidata

The Description

The Description is a collection of a few data points relating to a single entity. Embedded into a single web page, it provides some basic information about an event, or place, or company. Individually it may be useful in context, e.g. to support a social interaction or application share. The owner of the website provides some data about the things that are discussed or featured on the website, but does not have access to a full dataset. The individual item descriptions are provided by website contributors using a CMS to add content to the website. If available in aggregate, the individual descriptions might make a useful Database or Register.

Examples: descriptions of jobs, events, stores, video content, articles

The Personal Records

The Personal Records are a history of the interactions of a single person with a product or service. The data provides insight into the individual person's activities. The data is a slice of a larger dataset that contains data for a larger number of people. As the information contains personal information it has to be kept secure, and the individual has various rights over the collection and use of the data as granted by GDPR (or similar local regulation). The dataset is relatively small, is focused on a specific set of interactions, but is growing over time. Analysing the data might provide useful insight to the individual that may help them change their behaviour, improve their health, etc.

Examples: bank transactions, home energy usage, fitness or sleep tracker, order history with an online service, location tracker, health records

The Social Graph

The Social Graph is a dataset that describes the relationships between a group of individuals. It is typically built up by a small number of contributions made by individuals that provide information about their relationships and connections to others. They may also provide information about those other people, e.g. names, contact numbers, service ratings, etc. When published or exported it is typically focused on a single individual, but might be available in aggregate. It is different to Personal Records as it is specifically about multiple people, rather than a history of information about an individual (although Personal Records may reference or include data about others). The graph as a whole is maintained by an organisation that is operating a social network (or a service that has social features).

Examples: social networks data, collaboration graphs, reviews and trip histories from ride sharing services, etc

The Observatory

The Observatory is a very large dataset produced by a coordinated large-scale data collection exercise, for example by a range of earth observation satellites. The data collection is intentionally designed to support a variety of down-stream uses, which informs the scale and type of data collected. The scale and type of data can make it difficult to use because of the need for specific tools or expertise. But there are a wide range of ways in which the raw data can be processed to create other types of data products, to drive a variety of analyses, or to power a variety of services. It is refreshed and re-released as required by the needs and financial constraints of the organisations collaborating on collecting and using the dataset.

Examples: earth observation data, LIDAR point clouds, data from astronomical surveys or Large Hadron Collider experiments

The Forecast

The Forecast is used to predict the outcome of specific real-world events, e.g. a weather or climate forecast. It draws on a variety of primary datasets which are then processed and analysed to produce the output dataset. The process by which the predictions are made is well documented to provide insight into the quality of the output. As the predictions are time-based the dataset has a relatively short “shelf-life”, which means that users need to quickly access the most recent data for a specific location or area of interest. Depending on the scale and granularity, Forecast datasets can be very large, making them difficult to distribute in a timely manner.

Example: weather forecasts

Let me know what you think of these. Do they provide any useful perspective? How would you use or improve them?

Posted at 13:05

May 18

Benjamin Nowack: Linked Data Entity Extraction with Zemanta and OpenCalais

I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy (a small look-up sketch follows this list). But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.
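
As a small illustration of such a look-up, here is a rough ARC2-based sketch; the mapping of the returned DBPedia or Freebase types to the internal type hierarchy is of course application-specific and omitted.

/* Sketch: dereference an entity URI and collect its rdf:type values with ARC2. */
function getEntityTypes($uri) {
  $parser = ARC2::getRDFParser();
  $parser->parse($uri);   /* fetches and parses the Linked Data description */
  $types = array();
  foreach ($parser->getTriples() as $t) {
    if ($t['s'] == $uri && $t['p'] == 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type') {
      $types[] = $t['o'];
    }
  }
  return array_unique($types);
}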

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision ), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Posted at 22:06

May 16

Benjamin Nowack: Contextual configuration - Semantic Web development for visually minded webmasters

Let's face it, building semantic web sites and apps is still far from easy. And to some extent, this is due to the configuration overhead. The RDF stack is built around declarative languages (for simplified integration at various levels), and as a consequence, configuration directives often end up in some form of declarative format, too. While fleshing out an RDF-powered website, you have to declare a ton of things. From namespace abbreviations to data sources and API endpoints, from vocabularies to identifier mappings, from queries to object templates, and what have you.

Sadly, many of these configurations are needed to style the user interface, and because of RDF's open world context, designers have to know much more about the data model and possible variations than usually necessary. Or webmasters have to deal with design work. Not ideal either. If we want to bring RDF to mainstream web developers, we have to simplify the creation of user-optimized apps. The value proposition of semantics in the context of information overload is pretty clear, and some form of data integration is becoming mandatory for any modern website. But the entry barrier caused by large and complicated configuration files (Fresnel anyone?) is still too high. How can we get from our powerful, largely generic systems to end-user-optimized apps? Or the other way round: How can we support frontend-oriented web development with our flexible tools and freely mashable data sets? (Let me quickly mention Drupal here, which is doing a great job at near-seamlessly integrating RDF. OK, back to the post.)

Enter RDF widgets. Widgets have obvious backend-related benefits like accessing, combining and re-purposing information from remote sources within a manageable code sandbox. But they can also greatly support frontend developers. They simplify page layouting and incremental site building with instant visual feedback (add a widget, test, add another one, re-arrange, etc.). And, more importantly in the RDF case, they can offer a way to iteratively configure a system with very little technical overhead. Configuration options could not only be scoped to the widget at hand, but also to the context where the widget is currently viewed. Let's say you are building an RDF browser and need resource templates for all kinds of items. With contextual configuration, you could simply browse the site and at any position in the ontology or navigation hierarchy, you would just open a configuration dialog and define a custom template, if needed. Such an approach could enable systems that worked out of the box (raw, but usable) and which could then be continually optimized, possibly even by site users.

A lot of "could" and "would" in the paragraphs above, and the idea may sound quite abstract without actually seeing it. To illustrate the point I'm trying to make I've prepared a short video (embedded below). It uses Semantic CrunchBase and Paggr Prospect (our new faceted browser builder) as an example use case for in-context configuration.

And if you are interested in using one of our solutions for your own projects, please get in touch !



Paggr Prospect (part 1)


Paggr Prospect (part 2)

Posted at 23:06

Benjamin Nowack: 2011 Resolutions and Decisions

All right, this post could easily have become another rant about the ever-growing complexity of RDF specifications, but I'll turn it into a big shout-out to the Semantic Web community instead. After announcing the end of investing further time into ARC's open-source branch, I received so many nice tweets and mails that I was reminded of why I started the project in the first place: The positive vibe in the community, and the shared vision. Thank you very much everybody for the friendly reactions, I'm definitely very moved.

Some explanations: I still share the vision of machine-readable, integration-ready web content, but I have to face the fact that the current approach is getting too expensive for web agencies like mine. Luckily, I could spot a few areas where customer demands meet the cost-efficient implementation of certain spec subsets. (Those don't include comprehensive RDF infrastructure and free services here, though. At least not yet, and I just won't make further bets). The good news: I will continue working with semantic web technologies, and I'm personally very happy to switch focus from rather frustrating spec chasing to customer-oriented solutions and products with defined purposes. The downside: I have to discontinue a couple of projects and services in order to concentrate my energy and reduce (opportunity) costs. These are:
  • The ARC website, mailing list, and other forms of free support. The code and documentation get a new home on GitHub, though. The user community is already thinking about setting up a mailing list on their own. Development of ARC is going to continue internally, based on client projects (it's not dying).
  • Trice as an open-source project (lesson learned from ARC)
  • Semantic CrunchBase. I had a number of users but no paying ones. It was also one of those projects that happily burn your marketing budget while at the same time having only negative effects on the company's image because the funds are too small to provide a reliable service (similar to the flaky DBPedia SPARQL service, which makes the underlying RDF store look like a crappy product although it is absolutely not).
  • Knowee, Smesher and similar half-implemented and unfunded ideas.
Looking forward to a more simplified and streamlined 2011. Lots of success to all of you, and thanks again for the nice mails!

Posted at 07:07

May 09

Leigh Dodds: That thing we call “open”

I’ve been involved in a few conversations recently about what “open” or “being open” means in different situations.

As I've noted previously, when people say “open” they often mean very different things. And while there may be clear definitions of “open”, people don't often use the terms correctly. And some phrases like “open API” are still, well, open to interpretation.

In this post I’m going to summarise some of the ways in which I tend to think about making something “open”.

Let me know if I’m missing something so I can plug gaps in my understanding.

Openness of a “thing”

Digital objects: books, documents, images, music, software and datasets can all be open.

Making things open in this sense is the most well documented, but still the most consistently misunderstood. There are clear definitions for open content and data, open source, etc. Open in these contexts provides various freedoms to use, remix, share, etc.

People often confuse something being visible or available to them as being open, but that’s not the same thing at all. Being able to see or read something doesn’t give you any legal permissions at all.

It’s worth noting that the definitions of open “things” in different communities are often overlapping. For example, the Creative Commons licences allow works to be licensed in ways that enable a wide variety of legal reuses. But the Open Definition only recognises a subset of those as being open, rather than shared.

Putting an open licence on something also doesn’t necessarily grant you the full freedom to reuse that thing. For example I could open source some machine learning software but it might only be practically reusable if you can train it on some data that I’ve chosen not to share.

Or I might use an open licence like the Open Government Licence that allows me to put an open licence on something whilst ignoring the existence of any third-party rights. No need to do my homework. Reuser beware.

Openness of a process

Processes can be open. It might be better to think about transparency (e.g. of how the process is running) or the ability to participate in a process in this context.

Anything that changes and evolves over time will have a process by which those changes are identified, agreed, prioritised and applied. We sometimes call that governance. The definition of an open standard includes defining both the openness of the standard (the thing) as well as the process.

Stewardship of a software project, a dataset, or a standard is another example of where it might be useful for a process to be open. Questions we can ask of open processes are things like:

  • Can I contribute to the main codebase of a software package, rather than just fork it?
  • Can I get involved in the decision making around how a piece of software or standard evolves?
  • Can I directly fix errors in a dataset?
  • Can I see what decisions have been, or are being made that relate to how something is evolving?

When we’re talking about open data or open source, often we’re really talking about openness of the “thing”. But when we’re making things open to make them
better, I think we’re often talking about being open to contributions and participation. Which needs something more than a licence on a thing.

There’s probably a broader category of openness here which relates to how open a process is socially. Words like inclusivity and diversity spring to mind.

Your standards process isn’t really open to all if all of your meetings are held face to face in Hawaii.

Openness of a product, system or platform

Products, platforms and systems can be open too. Here we can think of openness as relating to the degree to which the system

  • is built around open standards and open data (made from open things)
  • is operated using open processes
  • is available for wider access and use

We can explore this by asking questions like:

  • Is it designed to run on open infrastructure or is it tied to particular cloud infrastructure or hardware?
  • Are the interfaces to the system built around open standards?
  • Can I get access to an API? Or is it invite only?
  • How do the terms of service shape the acceptable uses of the system?
  • Can I use its outputs, e.g. the data returned by a platform or an API, under an open licence?
  • Can we observe how well the system or platform is performing, or measure its impacts in different ways (e.g. socially, economically, environmentally)?

Openness of an ecosystem

Ecosystems can be open too. In one sense an open ecosystem is “all of the above”. But there are properties of an ecosystem that might itself indicate aspects of openness:

  • Is there a choice in providers, or is there a monopoly provider of services or data?
  • How easy is it for new organisations to engage with the ecosystem, e.g. to provide competing or new services?
  • Can we measure the impacts and operations of the ecosystem?

When we’re talking about openness of an ecosystem we’re usually talking about markets and sectors and regulation and governance.

Applying this in practice

So when thinking about whether something is “open” the first thing I tend to do is consider which of the above categories apply. In some cases it's actually several.

This is evident in my attempt to define “open API“.

For example we're doing some work @ODIHQ to explore the concept of a digital twin. According to the Gemini Principles a digital twin should be open. Here we can think of an individual digital twin as an object (a piece of software or a model), or a process (e.g. as an open source project), or an operational system or platform, depending on how it's made available.

We're also looking at cities. Cities can be open in the sense of the openness of their processes of governance and decision making. They might also be considered as platforms for sharing data and connecting software. Or as ecosystems of the same.

Posted at 20:05

Sebastian Trueg: Protecting And Sharing Linked Data With Virtuoso

Disclaimer: Many of the features presented here are rather new and cannot be found in the open-source version of Virtuoso.

Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.

Virtuoso is actually a quadruple-store, which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that, but this view on things is sufficient for our purposes), public graphs being readable and writable by anyone who has permission to read or write in general, private graphs only being readable and writable by administrators and those to whom named graph permissions have been granted. The latter case is what interests us today.

We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:

Virtuoso Sparql Endpoint

Sparql Result

This graph is now public and can be queried by anyone. Since we want to make it private we quickly need to change into a SQL session since this part is typically performed by an application rather than manually:

$ isql-v localhost:1112 dba dba
Connected to OpenLink Virtuoso
Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo');

Done. -- 2 msec.

Now our new named graph urn:trueg:demo is private and its contents cannot be seen by anyone. We can easily test this by logging out and trying to query the graph:

Sparql Query
Sparql Query Result

But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Share Demo Graph with trueg's LinkedIn account" ;
  acl:agent <http://www.linkedin.com/in/trueg> ;
  acl:accessTo <urn:trueg:demo> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:PrivateGraphs .

Virtuoso makes use of the ACL ontology proposed by the W3C and extends it with several custom classes and properties in the OpenLink ACL Ontology. Most of this little Turtle snippet should be obvious: we create an Authorization resource which grants Read access to urn:trueg:demo for agent http://www.linkedin.com/in/trueg. The only tricky part is the scope. Virtuoso has the concept of ACL scopes, which group rules by their resource type. In this case the scope is private graphs; another typical scope would be DAV resources.

Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:

$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules

As a result we get the full rule resource including additional properties added by the API.

Finally we will login using my LinkedIn identity and are granted read access to the graph:

SPARQL Endpoint Login

We see all the original triples in the private graph. And as before with DAV resources no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc. But those are topics for another day.

Technical Footnote

Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:

sparql
prefix oplacl: <http://www.openlinksw.com/ontology/acl#>
with <urn:virtuoso:val:config>
delete {
  oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs .
}
insert {
  oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs .
};

This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> .
<#rule> a acl:Authorization ;
  rdfs:label "Allow Anyone to SPARQL Read" ;
  acl:agentClass foaf:Agent ;
  acl:accessTo <urn:virtuoso:access:sparql> ;
  oplacl:hasAccessMode oplacl:Read ;
  oplacl:hasScope oplacl:Query .

I will explain these technical concepts in more detail in another article.

Posted at 10:06

Sebastian Trueg: Sharing Files With Whomever Is Simple

Dropbox, Google Drive, OneDrive, Box.com – they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. At first glance this might be easy enough, but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?

I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.

Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:

Virtuoso DAV Browser

This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.

We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.

For this small demo I will share a file with my LinkedIn identity http://www.linkedin.com/in/trueg. (Virtuoso/VAL identifies people via URIs and thus has URI schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)

Virtuoso Share File

Now when I logout and try to access the file in question I am presented with the authentication dialog from VAL:

VAL Authentication Dialog

This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn which will result in an OAuth handshake followed by the granted read access to the file:

LinkedIn OAuth Handshake

 

Access to file granted

It is that simple. Of course these identifiers can also be used in groups, allowing you to share files and folders with a set of people instead of just one individual.

Next up: Sharing Named Graphs via VAL.

Posted at 10:06

Sebastian Trueg: Digitally Sign Emails With Your X.509 Certificate in Evolution

Digitally signing emails is always a good idea. People can verify that you actually sent the mail and they can encrypt emails in return. A while ago Kingsley showed how to sign emails in Thunderbird. I will now follow up with a short post on how to do the same in Evolution.

The process begins with actually getting an X.509 certificate including an embedded WebID. There are a few services out there that can help with this, most notably OpenLink’s own YouID and ODS. The former allows you to create a new certificate based on existing social service accounts. The latter requires you to create an ODS account and then create a new certificate via Profile edit -> Security -> Certificate Generator. In any case make sure to use the same email address for the certificate that you will be using for email sending.

The certificate will actually be created by the web browser, making sure that the private key is safe.

If you are a Google Chrome user you can skip the next step since Evolution shares its key storage with Chrome (and several other applications). If you are a user of Firefox you need to perform one extra step: go to the Firefox preferences, into the advanced section, click the “Certificates” button, choose the previously created certificate, and export it to a .p12 file.

Back in Evolution’s settings you can now import this file:

To actually sign emails with your shiny new certificate stay in the Evolution settings, choose to edit the Mail Account in question, select the certificate in the Secure MIME (S/MIME) section and check “Digitally sign outgoing messages (by default)”:

The nice thing about Evolution here is that in contrast to Thunderbird there is no need to manually import the root certificate which was used to sign your certificate (in our case the one from OpenLink). Evolution will simply ask you to trust that certificate the first time you try to send a signed email:

That’s it. Email signing in Evolution is easy.

Posted at 10:06

Davide Palmisano: SameAs4J: little drops of water make the mighty ocean

A few days ago Milan Stankovich contacted the Sindice crew to let us know that he had written a simple Java library to interact with the public Sindice HTTP APIs. We always appreciate community efforts like this that help make Sindice a better place on the Web. Agreeing with Milan, we decided to put some effort into his initial work to make the library the official open source tool for Java programmers.
That reminded me that, a few months ago, I did for sameas.org the same thing Milan did for us. But (ashamed) I never informed those guys about what I did.
Sameas.org is a great and extremely useful tool on the Web that makes it concretely possible to interlink different Linked Data clouds. Simple to use (both for humans via HTML and for machines with a simple HTTP/JSON API) and extremely reactive, it allows you to get all the owl:sameAs objects for a given URI. And, moreover, it’s based on Sindice.com.
Do you want to know the identifier of http://dbpedia.org/resource/Rome in Freebase or Yago? Just ask Sameas.org.
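
As a quick illustration, here is a small Python sketch that asks sameas.org for the equivalents of the DBpedia URI for Rome. (The /json endpoint and the response shape with a "duplicates" list are assumptions about the public HTTP/JSON API, not something SameAs4J defines.)

import requests

uri = "http://dbpedia.org/resource/Rome"
# sameas.org lookup, assumed to live at /json?uri=<URI>
resp = requests.get("http://sameas.org/json", params={"uri": uri})
resp.raise_for_status()

# The response is assumed to be a list of equivalence bundles,
# each carrying the owl:sameAs URIs in a "duplicates" field.
for bundle in resp.json():
    for equivalent in bundle.get("duplicates", []):
        print(equivalent)  # e.g. Freebase or Yago identifiers for Rome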

So, after some months I just refined a couple of things, added some javadocs, set up a Maven repository and made SameAs4j publicly available (MIT licensed) to everyone on Google Code.
It’s a simple but reliable tiny set of Java classes that allows you to interact with sameas.org programmatically in your Java Semantic Web applications.

Back to the beginning: every piece of open source software is like a little drop of water which makes the mighty ocean, so please submit any issue or patch if interested.

Posted at 10:06

Davide Palmisano: RWW 2009 Top 10 Semantic Web products: one year later…


Just a few days ago the popular ReadWriteWeb published a list of the 2009 Top Ten Semantic Web products, as it did one year ago with the 2008 Top Ten.

These two milestones are a good opportunity to take stock, or simply to do a quick overview of what has changed in the “Web of Data” only one year later.

The 2008 Top Ten featured the following applications, listed in the same order as ReadWriteWeb and enriched with some personal opinions.

Yahoo Search Monkey

It’s great. Search Monkey represents the first of a new generation of search engines thanks to its capability to be fully customized by third-party developers. Recently, breaking news woke up the “sem webbers” of the whole planet: Yahoo started to show structured data exposed with RDFa in the search results page. That news bounced all over the Web, and those interested in SEO started to appreciate Semantic Web technologies for their business. Unfortunately, at the moment I’m writing, RDFa is no longer shown in search results due to a layout update that broke this functionality. Even if there are rumors of an imminent fix, the main problem is the robustness and reliability of that kind of service: investors need to be properly assured of the effectiveness of their investments.

Powerset

This neat application probably became really popular when it was acquired by Microsoft. It lets you make simple natural-language queries like “film where Kevin Spacey acted” and, at first glance, the results seem much better than those of traditional search engines. Honestly, I don’t really know which technologies they use to do this magic. But it would be nice to compare their results with a hypothetical service that translates such human text queries into a set of SPARQL queries over DBpedia, as sketched below. Anyone interested in doing that? I’d be more than happy to be engaged in a project like that.
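
To make that comparison concrete, here is a rough Python sketch that runs a hand-written SPARQL query for the Kevin Spacey example against the public DBpedia endpoint. The query is only a stand-in for what such a text-to-SPARQL translator would have to produce automatically:

import requests

# Hand-written stand-in for the output of a hypothetical
# natural-language-to-SPARQL translator for "film where Kevin Spacey acted".
query = """
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?film WHERE {
  ?film a dbpedia-owl:Film ;
        dbpprop:starring <http://dbpedia.org/resource/Kevin_Spacey> .
}
"""

resp = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
resp.raise_for_status()

for binding in resp.json()["results"]["bindings"]:
    print(binding["film"]["value"])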

Open Calais

With a large and massive branding operation these guys built the image of this service as the only one fitting everyone’s needs when dealing with semantic enrichment of unstructured free text. Even if this is partly true (why not mention the Apache UIMA Open Calais annotator?), there are a lot of other interesting services that are, in certain respects, more intriguing than the Reuters one. Don’t believe me? Give AlchemyAPI a try.

Dapper

I have to admit my ignorance here: I had never heard about it, but it looks very interesting. This service, which mainly offers some sort of semantic advertising, is certainly more than promising. I’ll keep an eye on it.

Hakia

Down at the moment I’m writing. 😦

Tripit

Many friends of mine are using it, and this alone could be enough to give it popularity. Again, I don’t know whether they use any of the W3C Semantic Web technologies to model their data. RDF or not, this is a neat example of a semantic web application with good potential: is that enough for you?

BooRah

Another case of personal ignorance. This magic is, mainly, a restaurant review site. BooRah uses semantic analysis and natural language processing to aggregate reviews from food blogs. Because of this, BooRah can recognize praise and criticism in these reviews and then rate restaurants accordingly. One criticism? The underlying data is perhaps not that rich: it seems impossible to me that searching for “Pizza in Italy” returns nothing.

Blue Organizer (or GetGlue?)

It’s no secret that I consider Glue one of the most innovative and intriguing things on the Web. When it appeared in the ReadWriteWeb Top 10 Semantic Web applications it was far from what it is now. Just one year later, GetGlue (Blue Organizer seems to be the former name) appears as a growing and lively community of people who have realized how important it is to weave the Web with the aid of a tool that acts as a content cross-recommender. Moreover, GetGlue provides a neat set of Web APIs that I’m using widely within the NoTube project.

Zemanta

A clear idea, powerful branding and a well-designed set of services accessible via Web APIs make Zemanta one of the most successful products on the stage. Do I have to say anything more? If you like Zemanta I suggest you also keep an eye on Loomp, a nice piece of work presented at the European Semantic Technology Conference 2009.

UpTake.com

Mainly, a semantic search engine over a huge database containing more than 400,000 hotels in the US. Where is the semantics there? Uptake.com crawls and semantically extracts the information implicitly hidden in those records. A good example of how innovative technologies can be applied to well-known application domains such as hotel search.

One year later…

Indubitably, 2009 has been ruled by the Linked Data Initiative, as I love to call it. Officially, Linked Data is about “using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods” and, looking at its growth rate, it seems a simple bet that it will succeed.

Here is the 2009 top ten, where I omitted GetGlue, Zemanta and OpenCalais since they already appeared in the 2008 edition:

Google Search Options and Rich Snippets

When this new Google feature was announced, the whole Semantic Web community realized that something very powerful had started to move. Google Rich Snippets makes use of the RDFa contained in HTML Web pages to power the rich snippets feature.

Feedly

It’s a very nice feed aggregator built upon Google Reader, Twitter and FriendFeed. It’s easy to use, nice and really useful (well, at least it seems so to me) but, unfortunately, I cannot see where the semantic aspect is here.

Apture

This cool piece of JavaScript allows publishers to add contextual information to links via pop-ups which display when users hover over or click on them. Looking at HTML pages built with the aid of this tool, Apture closely reminds me of the WordPress Snap-Shot plugin. But Apture seems richer than Snap-Shot since it allows publishers to directly add links and other content they want to display when the pages are rendered.

BBC Semantic Music Project

Built upon Musicbrainz.org (one of the most representative Linked Data clouds), it’s a very remarkable initiative. Personally, I’m using it within the NoTube project to disambiguate Last.fm bands. Concretely, given a certain Last.fm band identifier, I query BBC /music, which returns a URI. With this URI I ask the sameas.org service to give me other URIs referring to the same band. In this way I can associate with every Last.fm band a set of Linked Data URIs from which to obtain a full flavor of coherent data about it.

Freebase

It’s an open, semantically marked-up shared database powered by Metaweb.com, a great company based in San Francisco. Its popularity is growing fast, as ReadWriteWeb has already noticed. Somewhat similar to Wikipedia, Freebase provides all the mechanisms necessary to syndicate its data in machine-readable form, mainly with RDF. Moreover, other Linked Data clouds have started to add owl:sameAs links to Freebase: do I have to add anything else?

Dbpedia

DBpedia is the nucleus of the Web of Data. The only thing I’d like to add is: it deserves to be on the ReadWriteWeb 2009 top-ten more than the others.

Data.gov

It’s a remarkable US government initiative to “increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. It’s a start, and I dream of seeing something like this here in Italy too.

So what’s up in the end?

In my opinion, 2009 has been the year of Linked Data. New clouds are born every month, new links between the existing ones are established, and a new breed of developers is becoming aware of the potential and the threats of Linked Data consuming applications. It seems that the Web of Data is finally taking shape, even if something strange is still in the air. First of all, taking a closer look at the ReadWriteWeb 2009 Top Ten, I have to underline that 3 products out of 10 were already in the 2008 chart. Maybe the popular blog wanted to stress the progress these products have made, but it sounds a bit strange to me that they forgot nice products such as FreeMix, AlchemyAPI, Sindice, OpenLink Virtuoso and BestBuy.com’s usage of the GoodRelations ontology. Secondly, 3 products listed in the 2009 chart are publicly funded initiatives which, even if this is reasonable given the nature of the products, leaves me with the impression that private investors are not in the loop yet.

What do I expect from 2010, then?

A large and massive rush to using RDFa for SEO purposes, a sustained growth of Linked Data clouds and, I really hope, the rise of a new application paradigm grounded in the consumption of such interlinked data.

Posted at 10:06

Davide Palmisano: the italian political activism and the semantic web

Beppe Grillo

A couple of years ago, during his live show, the popular Italian blogger and activist Beppe Grillo gave a quick demonstration of how the Web concretely realizes the “six degrees of separation”. The Italian blogger, today a Web enthusiast, showed that it was possible for him to get in contact with someone very famous using a couple of different websites: IMDb, Wikipedia and a few others. Starting from a movie in which he acted, he could reach the movie’s producer, the producer could be in contact with another actor due to previous work together, and so on.

The demonstration consisted of a series of links that were opened, leading to Web pages containing the information from which to extract the relationships the showman wanted to establish.

This gig came back to my mind while I was thinking about how what I call the “Linked Data Philosophy” is impacting the traditional Web, and I imagined what Beppe Grillo could show nowadays.

Just the following, simple, trivial and short SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

CONSTRUCT {
    ?actor1 foaf:knows ?actor2
}
WHERE {
    ?movie dbpprop:starring ?actor1 .
    ?movie dbpprop:starring ?actor2 .
    ?movie a dbpedia-owl:Film .
    FILTER(?actor1 = <http://dbpedia.org/resource/Beppe_Grillo>)
    FILTER(?actor1 != ?actor2)
}

Although Beppe is a great comedian, it may be hard even for him to make people laugh with this. But the point here is not about laughs; it is about data: in this sense, the Web of Data provides an outstanding and extremely powerful way to access an incredible twine of machine-readable, interlinked data.

Recently, another nice and remarkable Italian initiative appeared on the Web: OpenParlamento.it. It is, mainly, a service where the Italian congressmen are displayed and positioned on a chart based on the similarity of their votes on law proposals.

OK. Cool. But how could the Semantic Web improve this?

First of all, it would be very straightforward to provide a SPARQL endpoint serving some good RDF for this data, like the following example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:openp="http://openparlamento.it/ontology/">
    <rdf:Description rdf:about="http://openparlamento.it/senate/Mario_Rossi">
        <rdf:type rdf:resource="http://openparlamento.it/ontology/Congressman"/>
        <foaf:name>Mario Rossi</foaf:name>
        <foaf:gender>male</foaf:gender>
        <openp:politicalGroup
            rdf:resource="http://openparlamento.it/groups/Democratic_Party"/>
        <owl:sameAs rdf:resource="http://dbpedia.org/resource/Mario_Rossi"/>
    </rdf:Description>
</rdf:RDF>

where names, descriptions, political affiliation and more are provided. Moreover, a property called openp:similarity could be used to relate similar congressmen, using the same information as the already mentioned chart.

Secondly, all the information about congressmen is published on the official Italian chambers’ web sites. By wrapping this data, OpenParlamento.it could provide an extremely exhaustive set of official information and, more importantly, links to DBpedia would be the key to getting a full set of machine-processable data from other Linked Data clouds as well.

How to benefit from all of this? Apart from employing a cutting-edge technology to syndicate data, everyone who wants to link to the data provided by OpenParlamento.it on their own web pages can easily do so using RDFa.

With these technologies as a basis, a new breed of applications (like web crawlers, for those interested in SEO) will access and process these data in a new, fashionable and extremely powerful way.

It is time for those guys to embrace the Semantic Web, isn’t it?

Posted at 10:06

Libby Miller: An i2c heat sensor with a Raspberry Pi camera

I had a bit of a struggle with this so thought it was worth documenting. The problem is this – the i2c bus on the Raspberry Pi is used by the official camera to initialise it. So if you want to use an i2c device at the same time as the camera, the device will stop working after a few minutes. Here’s more on this problem.

I really wanted to use this heat sensor with mynaturewatch to see if we could exclude some of the problems with false positives (trees waving in the breeze and similar). I’ve not got it working well enough yet to look at this problem in detail. But I did get the i2c bus working alongside the camera – here’s how.

Screen Shot 2019-03-22 at 12.31.04

It’s pretty straightforward. You need to

  • Create a new i2c bus on some different GPIOs
  • Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one
  • Fin

1. Create a new i2c bus on some different GPIOs

This is super-easy:

sudo nano /boot/config.txt

Add the following line, preferably in the section where SPI and i2c are enabled.

dtoverlay=i2c-gpio,bus=3,i2c_gpio_delay_us=1

This line will create an additional i2c bus (bus 3) with GPIO 23 as SDA and GPIO 24 as SCL (GPIO 23 and 24 are the defaults).

2. Tell the library you are using for the non-camera i2c peripheral to use these instead of the default one

I am using this sensor, for which I need this circuitpython library (more info), installed using:

pip3 install Adafruit_CircuitPython_AMG88xx

While the Pi is switched off, plug in the i2c device using GPIO 23 for SDA and GPIO 24 for SCL, then boot it up and check it’s working:

 i2cdetect -y 3

Make two changes:

nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/pin.py

and change the SDA and SCL pins to the new pins

#SDA = Pin(2)
#SCL = Pin(3)
SDA = Pin(23)
SCL = Pin(24)
nano /home/pi/.local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/generic_linux/i2c.py

Change line 21 or thereabouts to use the i2c bus 3 rather than the default, 1:

self._i2c_bus = smbus.SMBus(3)

3. Fin

Start up your camera code and your i2c peripheral. They should run happily together.
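
For reference, here is a minimal Python sketch of running both together, assuming the Blinka patches from step 2 are in place (so board.SCL/board.SDA resolve to the new bus 3 pins). The AMG88XX class name follows the Adafruit CircuitPython library installed above, and the camera side uses picamera; adjust to whatever your own camera code does:

import time
import board
import busio
import adafruit_amg88xx
from picamera import PiCamera

# With the patched Blinka files this talks to i2c bus 3 (GPIO 23/24).
i2c = busio.I2C(board.SCL, board.SDA)
sensor = adafruit_amg88xx.AMG88XX(i2c)

camera = PiCamera()
camera.start_preview()

try:
    while True:
        # sensor.pixels is an 8x8 grid of temperatures in degrees C
        print(max(max(row) for row in sensor.pixels))
        time.sleep(1)
finally:
    camera.stop_preview()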

Screen Shot 2019-03-25 at 19.12.21

Posted at 10:06

Libby Miller: Neue podcast in a box, part 1

Ages ago I wrote a post on how to create a physical podcast player (“podcast in a box”) using Radiodan. Since then, we’ve completely rewritten the software, so those instructions can be much improved and simplified. Here’s a revised technique, which will get you as far as reading an RFID card. I might write a part 2, depending on how much time I have.

You’ll need:

  • A Pi 3B or 3B+
  • An 8GB or larger class 10 microSD card
  • A cheapo USB soundcard (e.g.)
  • A speaker with a 3.5mm jack
  • A power supply for the Pi
  • An MFRC522 RFID reader
  • A laptop and microSD card reader / writer

The idea of Radiodan is that as much as possible happens inside web pages. A server runs on the Pi. One webpage is opened headlessly on the Pi itself (internal.html) – this page will play the audio; another can be opened on another machine to act as a remote control (external.html).

They are connected using websockets, so each can access the same messages – the RFID service talks to the underlying peripheral on the Pi, making the data from the reader available.

Here’s what you need to do:

1. Set up the Pi as per these instructions (“setting up your Pi”)

You need to burn a microSD card with the latest Raspbian with Desktop to act as the Pi’s brain, and the easiest way to do this is with Etcher. Once that’s done, the easiest way to do the rest of the install is over ssh, and the quickest way to get that in place is to edit two files while the card is still in your laptop (I’m assuming a Mac):

Enable ssh by typing:

touch /Volumes/boot/ssh

Add your wifi network to boot by adding a file called

/Volumes/boot/wpa_supplicant.conf

contents: (replace AP_NAME and AP_PASSWORD with your wifi details)

country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="AP_NAME"
  psk="AP_PASSWORD"
  key_mgmt=WPA-PSK
}

Then eject the card, put the card in the Pi, attach all the peripherals except for the RFID reader and switch it on. While on the same wifi network, you should be able to ssh to it like this:

ssh pi@raspberrypi.local

password: raspberry.

Then install the Radiodan software using the provisioning script like this:

curl https://raw.githubusercontent.com/andrewn/neue-radio/master/deployment/provision | sudo bash

2. Enable SPI on the Pi

Don’t reboot yet; type:

sudo raspi-config

Under interfaces, enable SPI, then shut the Pi down

sudo halt

and unplug it.

3. Test Radiodan and configure it

If all is well and you have connected a speaker via a USB soundcard, you should hear it say “hello” as it boots.

Please note: Radiodan does not work with the default 3.5mm jack on the Pi. We’re not sure yet why. But USB soundcards are very cheap, and work well.

There’s one app available by default for Radiodan on the Pi. To use it,

  1. Navigate to http://raspberrypi.local/radio
  2. Use the buttons to play different audio clips. If you can hear things, then it’s all working

 

radiodan_screenshot1

shut the Pi down and unplug it from the mains.

4. Connect up the RFID reader to the Pi

like this

Then start the Pi up again by plugging it in.

5. Add the piab app

Dan has made a very fancy mechanism for using Samba to drag and drop apps to the Pi, so that you can develop on your laptop. However, because we’re using RFID (which only works on the Pi), we may as well do everything on there. So, ssh to it again:

ssh pi@raspberrypi.local
cd /opt/radiodan/rde/apps/
git clone http://github.com/libbymiller/piab

This is currently a very minimal app, which just allows you to see all websocket messages going by, and doesn’t do anything else yet.

6. Enable the RFID service and piab app in the Radiodan web interface

Go to http://raspberrypi.local:5020

Enable “piab”, clicking ‘update’ beneath it. Enable the RFID service, clicking ‘update’ beneath it. Restart the manager (red button) and then install dependencies (green button), all within the web page.

radiodan_screenshot2

radiodan_screenshot4

Reboot the Pi (e.g. ssh in and sudo reboot). This will enable the RFID service.

7. Test the RFID reader

Open http://raspberrypi.local:5000/piab and open developer tools for that page. Place a card on the RFID reader. You should see a json message in the console with the RFID identifier.

radiodan_screenshot5

The rest is a matter of writing javascript / html code to:

  • Associate a podcast feed with an RFID (e.g. a web form in external.html that allows the user to add a podcast feed url)
  • Parse the podcast feed when the appropriate card id is detected by the reader
  • Find the latest episode and play it using internal.html (see the radio app example for how to play audio)
  • Add more fancy options, such as remembering where you were in an episode, stopping when the card is removed etc.

As you develop, you can see the internal page on http://raspberrypi.local:5001 and the external page on http://raspberrypi.local:5000. You can reload the app using the blue button on http://raspberrypi.local:5020.

Many more details about the architecture of Radiodan are available; full installation instructions and instructions for running it on your laptop are here; docs are here; code is in github.

Posted at 10:06

Libby Miller: #Makevember

@chickengrylls’ #makevember manifesto / hashtag has been an excellent experience. I’ve made maybe five nice things, a lot of nonsense and a lot of useless junk, but that’s fine – I’ve learned a lot, mostly about servos and other motors. There’s been tons of inspiration too (check out these beautiful automata, some characterful paper sculptures, Richard’s unsuitable materials, my initial inspiration’s set of themes on a tape, and loads more). A lovely aspect was all the nice people and beautiful and silly things emerging out of the swamp of Twitter.

Screen Shot 2017-12-01 at 16.31.14

Of my own makes, my favourites were this walking creature, with feet made of crocodile clips (I was amazed it worked); a saw-toothed vertical traveller, such a simple little thing; this fast robot (I was delighted when it actually worked); some silly stilts; and (from October) this blimp / submarine pair.

I did lots of fails too – e.g. a stencil, a raspberry blower. Also lots of partial fails that got scaled back – AutoBez 1, 2, and 3; Earth-moon; a poor-quality under-water camera. And some days I just ran out of inspiration and made something crap.

Why’s it so fun? Well there’s the part about being more observant, looking at materials around you constantly to think about what to make, though that’s faded a little. As I’ve got better I’ve had more successes and when you actually make something that works, that’s amazing. I’ve loved seeing what everyone else is making, however good or less-good, whether they spent ages or five minutes on it. It feels very purposeful too, having something you have to do every day.

Downsides: I’ve spent far too long on some of these. I was very pleased with both Croc Nest, and Morse, but both of them took ages. The house is covered in bits of electronics and things I “might need” despite spending some effort tidying, but clearly not enough (and I need to have things to hand and to eye for inspiration). Oh, and I’m addicted to Twitter again. That’s it really. Small price to pay.

Posted at 10:06

Libby Miller: Libbybot eleven – webrtc / pi3 / presence robot

The latest instructions for the libbybot posable presence robot are here. They are a lot more detailed than previous versions, and the build is much more reliable (they include details for construction, motors, the server, etc.).

It’s not a work project, but I do use it at work (picture by David Man).

Image_uploaded_from_iOS

Posted at 10:06

Peter Mika: Semantic Search Challenge sponsored by Yahoo! Labs

Together with my co-chairs Marko Grobelnik, Thanh Tran Duc and Haofen Wang, I again had the opportunity to organize the 4th Semantic Search Workshop, the premier event for research on retrieving information from structured data collections or text collections annotated with metadata. Like last year, the workshop will take place at the WWW conference, to be held March 29, 2011, in Hyderabad, India. If you wish to submit a paper, there are still a few days left: the deadline is Feb 26, 2011. We welcome both short and long submissions.

In conjunction with the workshop, and with a number of co-organizers helping us, we are also launching  a Semantic Search Challenge (sponsored by Yahoo! Labs), which is hosted at semsearch.yahoo.com. The competition will feature two tracks. The first track (entity retrieval) is the same task we evaluated last year: retrieving resources that match a keyword query, where the query contains the name of an entity, with possibly some context (such as “starbucks barcelona”). We are adding this year a new task (list retrieval) which represents the next level of difficulty: finding resources that belong to a particular set of entities, such as “countries in africa”. These queries are more complex to answer since they don’t name a particular entity. Unlike in other similar competitions, the task is to retrieve the answers from a real (messy…) dataset crawled from the Semantic Web. There is a small prize ($500) to win in each track.

The entry period will start March 1, and run through March 15. Please consider participating in either of these tracks: it’s early days in Semantic Search, and there is so much to discover.

Posted at 10:06

Peter Mika: Microformats and RDFa deployment across the Web

I have presented on previous occasions (at Semtech 2009, SemTech 2010, and later at FIA Ghent 2010, see slides for the latter, also in ISWC 2009) some information about microformat and RDFa deployment on the Web. As such information is hard to come by, this has generated some interest from the audience. Unfortunately, Q&A time after presentations is too short to get into details, hence some additional background on how we obtained this data and what it means for the Web. This level of detail is also important to compare this with information from other sources, where things might be measured differently.

The chart below shows the deployment of certain microformats and RDFa markup on the Web, as percentage of all web pages, based on an analysis of 12 billion web pages indexed by Yahoo! Search. The same analysis has been done at three different time-points and therefore the chart also shows the evolution of deployment.

Microformats and RDFa deployment on the Web (% of all web pages)

The data is given below in a tabular format.

Date RDFa eRDF tag hcard adr hatom xfn geo hreview
09-2008 0.238 0.093 N/A 1.649 N/A 0.476 0.363 N/A 0.051
03-2009 0.588 0.069 2.657 2.005 0.872 0.790 0.466 0.228 0.069
10-2010 3.591 0.000 2.289 1.058 0.237 1.177 0.339 0.137 0.159

There are a couple of comments to make:

  • There are many microformats (see microformats.org) and I only include data for the ones that are most common on the Web. To my knowledge at least, all other microformats are less common than the ones listed above.
  • eRDF has been a predecessor to RDFa, and has been obsoleted by it. RDFa is more fully featured than eRDF, and has been adopted as a standard by the W3C.
  • The data for the tag, adr and geo formats is missing from the first measurement.
  • The numbers cannot be aggregated to get a total percentage of URLs with metadata. The reason is that a webpage may contain multiple microformats and/or RDFa markup. In fact, this is almost always the case with the adr and geo microformats, which are typically used as part of hcard. The hcard microformat itself can be part of hatom markup etc.
  • Not all data is equally useful, depending on what you are trying to do. The tag microformat, for example, is nothing more than a set of keywords attached to a webpage. RDFa itself covers data using many different ontologies.
  • The data doesn’t include “trivial” RDFa usage, i.e. documents that only contain triples from the xhtml namespace. Such triples are often generated by RDFa parsers even when the page author did not intend to use RDFa.
  • This data includes all valid RDFa, and not just namespaces or vocabularies supported by Yahoo! or any other company.

The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.

These results make me optimistic that the Semantic Web is here already in large ways. I don’t expect that a 100% of webpages will ever adopt microformats or RDFa markup, simply because not all web pages contain structured data. As this seems interesting to watch, I will try to publish updates to the data and include the update chart here or in future presentations.


Posted at 10:06

Michael Hausenblas: Elephant filet

At the end of January I participated in a panel discussion on Big Data, held during the CISCO live event in London. One of my fellow panelists – I believe it was Sean McKeown of CISCO – said something along the lines of:

… ideally the cluster is at 99% utilisation, concerning CPU, I/O, and network …

This stuck in my head and I gave it some thought. In the following I will elaborate a bit on this in the context of Hadoop used in a shared setup, for example in hosted offerings or, say, within an enterprise that runs different systems such as Storm, Lucene/Solr, and Hadoop on one cluster.

In essence, we witness two competing forces: the perspective of a single user who expects performance vs. the view of the cluster owner or operator who wants to optimise throughput and maximise utilisation. If you’re not familiar with these terms you might want to read up on Cary Millsap’s Thinking Clearly About Performance (part 1 | part 2).

Now, in such a shared setup we may experience a spectrum of loads, from compute-intensive over I/O-intensive to communication-intensive, illustrated in the following, not overly scientific figure:
Utilisations

Here are some observations and thoughts as potential starting points for deeper research or experiments.

Multitenancy. We see more and more deployments that require strong support for multitenancy; check out the CapacityScheduler, learn from best practices or use a distribution that natively supports the specification of topologies. Additionally, you might still want to keep an eye on Serengeti – VMware’s Hadoop virtualisation project – that seems to have gone quiet in the past months, but I still have hope for it.

Software Defined Networks (SDN). See Wikipedia’s definition for it; it’s not too bad. CISCO, for example, is very active in this area, and only recently there was a special issue of the IEEE Communications Magazine (February 2013) covering SDN research. I can perfectly well see – and indeed this was also briefly discussed on our CISCO live panel back in January – how SDN can enable new ways to optimise throughput and performance. Imagine an SDN that is dynamically workload-aware, in the sense that it knows the difference between a node that runs a task tracker vs. a data node vs. a Solr shard – it should be possible to transparently improve the operational parameters so that everyone involved, both the users and the cluster owner, benefits.

As usual, I’m very interested in what you think about the topic and look forward to learning about resources in this space from you.

Posted at 10:06

Michael Hausenblas: MapR, Europe and me

MapR

You might have already heard that MapR, the leading provider of enterprise-grade Hadoop and friends, is launching its European operations.

Guess what? I’m joining MapR Europe as of January 2013 in the role of Chief Data Engineer EMEA and will support our technical and sales teams throughout Europe. Pretty exciting times ahead!

As an aside: as I recently pointed out, I very much believe that Apache Drill and Hadoop offer great synergies and if you want to learn more about this come and join us at the Hadoop Summit where my Drill talk has been accepted for the Hadoop Futures session.

Posted at 10:06

Michael Hausenblas: Hosted MapReduce and Hadoop offerings

Hadoop in the cloud

Today’s question is: where are we regarding MapReduce/Hadoop in the cloud? That is, what are the current offerings of Hadoop-as-a-Service and other hosted MapReduce implementations?

A year ago, InfoQ ran a story Hadoop-as-a-Service from Amazon, Cloudera, Microsoft and IBM which will serve us as a baseline here. This article contains the following statement:

According to a 2011 TDWI survey, 34% of the companies use big data analytics to help them making decisions. Big data and Hadoop seem to be playing an important role in the future.

One year later, we learn from a recent MarketsAndMarkets study, Hadoop & Big Data Analytics Market – Trends, Geographical Analysis & Worldwide Market Forecasts (2012 – 2017) that …

The Hadoop market in 2012 is worth $1.5 billion and is expected to grow to about $13.9 billion by 2017, at a [Compound Annual Growth Rate] of 54.9% from 2012 to 2017.

In the past year there have also been some quite vivid discussions around the topic ‘Hadoop in the cloud’.

So, here are some current offerings and announcements I’m aware of:

… and now it’s up to you dear reader – I would appreciate it if you could point me to more offerings and/or announcements you know of, concerning MapReduce and Hadoop in the cloud!

Posted at 10:06

Michael Hausenblas: Interactive analysis of large-scale datasets

The value of large-scale datasets – stemming from IoT sensors, end-user and business transactions, social networks, search engine logs, etc. – apparently lies in the patterns buried deep inside them. Being able to identify and analyze these patterns is vital, be it for detecting fraud, determining a new customer segment or predicting a trend. As we move from billions to trillions of records (or: from the terabyte to the peta- and exabyte scale), the more ‘traditional’ methods, including MapReduce, seem to have reached the end of their capabilities. The question is: what now?

But a second issue has to be addressed as well: in contrast to the batch mode that current large-scale data processing solutions provide (arbitrarily, but in line with the state of the art, defined as any query that takes longer than 10 seconds to execute), the need for interactive analysis is increasing. Complementarily, visual analytics may or may not be helpful, but comes with its own set of challenges.

Recently, a proposal for a new Apache Incubator project called Drill has been made. It aims at building a:

… distributed system for interactive analysis of large-scale datasets […] It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Drill’s design is supposed to be informed by Google’s Dremel and aims to efficiently process nested data (think: Protocol Buffers). You can learn more about the requirements and design considerations from Tomer Shiran’s slide set.

In order to better understand where Drill fits into the overall picture, have a look at the following (admittedly naïve) plot that tries to place it in relation to well-known and deployed data processing systems:

BTW, if you want to test-drive Dremel, you can already do so today; it is available as a service in Google’s cloud computing suite, called BigQuery.

Posted at 10:06

Michael Hausenblas: Schema.org + WebIntents = Awesomeness

Imagine you search for a camera, say a Canon EOS 60D, and in addition to the usual search results you’re also offered a choice of actions you can perform on it, for example sharing the result with a friend, writing a review for the item or, why not, buying it directly?

Enhancing SERP with actions

Sounds far-fetched? Not at all. In fact, all the necessary components are available and deployed. With Schema.org we have a way to describe the things we publish on our Web pages, such as books or cameras, and with WebIntents we have a technology at hand that allows us to interact with these things in a flexible way.

Here are some starting points in case you want to dive into WebIntents a bit:

PS: I started to develop a proof of concept for mapping Schema.org terms to WebIntents and will report on the progress, here. Stay tuned!

Posted at 10:06

Michael Hausenblas: Turning tabular data into entities

Two widely used data formats on the Web are CSV and JSON. In order to enable fine-grained access in a hypermedia-oriented fashion, I’ve started to work on Tride, a mapping language that takes one or more CSV files as input and produces a set of (connected) JSON documents.

In the 2-minute demo video I use two CSV files (people.csv and group.csv) as well as a mapping file (group-map.json) to produce a set of interconnected JSON documents.

So, the following mapping file:

{
 "input" : [
  { "name" : "people", "src" : "people.csv" },
  { "name" : "group", "src" : "group.csv" }
 ],
 "map" : {
  "people" : {
   "base" : "http://localhost:8000/people/",
   "output" : "../out/people/",
   "with" : { 
    "fname" : "people.first-name", 
    "lname" : "people.last-name",
    "member" : "link:people.group-id to:group.ID"
   }
  },
  "group" : {
   "base" : "http://localhost:8000/group/",
    "output" : "../out/group/",
    "with" : {
     "title" : "group.title",
     "homepage" : "group.homepage",
     "members" : "where:people.group-id=group.ID link:group.ID to:people.ID"
    }
   }
 }
}

… produces JSON documents representing groups. One concrete example output is shown below:
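
A document for one group would look roughly like this – an illustrative sketch based on the mapping above, not Tride’s verbatim output; in particular the link representation (the "href" fields) is an assumption:

{
  "title" : "Example Group",
  "homepage" : "http://example.org/",
  "members" : [
    { "href" : "http://localhost:8000/people/1" },
    { "href" : "http://localhost:8000/people/2" }
  ]
}

The title and homepage come from the matching row in group.csv, while the members links are generated from the where/link clause that joins people.group-id to group.ID.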

Posted at 10:06

Copyright of the postings is owned by the original blog authors. Contact us.