Planet RDF

It's triples all the way down

January 06

Shelley Powers: Bb's Semantic Feed: Semantic Markup: Oh, look. It's not just us Semantic Web Dweebs who noticed.

A List Apart has a new article out on the Semantics in HTML5. John Allsopp writes:

We’ll start by posing the question: “why are we inventing these new elements?” A reasonable answer would be: “because HTML lacks semantic richness, and by adding these elements, we increase the semantic richness of HTML—that can’t be bad, can it?”

By adding these elements, we are addressing the need for greater semantic capability in HTML, but only within a narrow scope. No matter how many elements we bolt on, we will always think of more semantic goodness to add to HTML. And so, having added as many new elements as we like, we still won’t have solved the problem. We don’t need to add specific terms to the vocabulary of HTML, we need to add a mechanism that allows semantic richness to be added to a document as required. In technical terms, we need to make HTML extensible. HTML 5 proposes no mechanism for extensibility.

On reading of which, I hurt my head by banging it, suddenly and with force, against my desk.

Several times.

Posted at 18:36

Alexandre Passant: Website update

This website is now powered by Drupal.
As I felt the need to edit more than blog posts (e.g., publication pages), WordPress was not enough and so I switched to Drupal 6 thanks to the WP2Drupal6 plug-in.

Posted at 14:04

January 05

Dublin Core Metadata Initiative: DCMI Usage Board publishes criteria for review of Application Profiles

2009-01-05. The DCMI Usage Board has published a structured set of criteria for reviewing application profiles based on the DCMI Abstract Model.

Posted at 23:59

Dublin Core Metadata Initiative: New repository of DCMI Conference Papers with proceedings 2001-2008 conferences

2009-01-05, A completely new version of the DCMI Conference Paper repository has been installed at the National Library of Korea. The new repository includes the proceedings of all conferences from Tokyo 2001 until Berlin 2008.

Posted at 23:59

Dublin Core Metadata Initiative: DCMI incorporated in Singapore

2009-01-05, The Dublin Core Metadata Initiative (DCMI) has completed the legal steps for incorporation as a public, not-for-profit Company limited by Guarantee in Singapore. The founding members of the new legal entity are the National Library Board Singapore and the National Library of Finland. The other DCMI Affiliates, the Joint Information Systems Committee (JISC) in the UK, the National Library, National Archives and the State Services Commission of New Zealand and the National Library of Korea, will become Members in the weeks ahead.

Posted at 23:59

Clark and Parsia: Pellet 2.0 RC4

We’re welcoming 2009 by making a new release candidate for Pellet 2.0 available for download. Pellet 2.0 RC4 resolves several issues present in previous release candidates, which are documented more fully in a Pellet Trac report.

The resolved issue that was most likely to have frustrated users was broken import behavior when using the Pellet command line with the Jena loader. In addition, our effort to support the built-ins listed in the SWRL submission is closer to complete now that the team has added implementation of the built-ins for URIs.

Special thanks to the users who reported issues, since most of the changes in this release were made in response to user-identified problems. Keep up the good work by sending your bug reports to the Pellet users mailing list.

Posted at 18:16

January 03

Kingsley Idehen: Linked Data Web Collaborators: Introducing Structured Dynamics

As indicated in posts from Fred Giasson and Mike Bergman, the Zitgist incubation effort that contributed to the delivery of vital Linked Data Web infrastructure components such as TalkDigger (discourse discovery and participation), PingTheSemanticWeb (ground-zero data source for most Semantic Web search engines), UMBEL (binding layer for Upper and Lower Ontologies amongst other things), Music Ontology (enabling meaningful description of Music), and Bibliographic Ontology (enabling meaningful description of Bibliographic content), is now ready to continue its business development and technology growth as a going concern known as Structured Dynamics.

With great joy and pride, I wish Structured Dynamics all the success they deserve. Naturally, the collaborations and close relationship between OpenLink Software and its latest technology partner will continue -- especially as we collectively work towards a more comprehendible and pragmatic Web of Linked Data for developers (across Web 1.0, 2.0, 3.0, and beyond), end-users (information- and knowledge-workers), and entrepreneurs (driven by quality and tangible value contribution).

Posted at 04:03

January 02

Kingsley Idehen: My Hopes for Linked Data in 2009 (Update #2)

Happy New Year!

In 2009 I hope the following happens re. "Linked Data":

  1. We realize it's a Meme
  2. We collectively connect the Meme to the concept of granular hyperlinks between data entities/objects (datum to datum linkage aka. Hyperdata Linking)
  3. We generally connect the Meme to technology ancestry such as the Entity-Attribute-Value with Classes & Relationships (EAV/CR) data model (then broader commonality with erstwhile unrelated realms will be unveiled e.g., Entity Frameworks from Microsoft, Core Data from Apple, SimpleDB from Amazon, and the Freebase Graph Model DB amongst others)
  4. We instinctively connect the Meme to the concept of Entity Oriented Data Access and Management (RDF based Linked Data is basically EAV/CR scheme that uses HTTP based Pointers for Entity, Attribute, and Relationship Identifiers)
  5. We naturally connect the Meme with the notion that an identifier for a unit of data (aka. Datum) should be the conduit to a negotiable representation of said Datum's description (i.e., its attribute and relationship properties in HTML, XHTML, RDFa, Turtle, N3, RDF/XML etc., for example)
  6. We ultimately connect the Meme with a conceptual-level approach to data integration across disparate data sources (also known as Master Data Management (MDM) ).
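
Point 4 is easier to see with a tiny example. The sketch below is only an illustration, not code from the post: the example.org URIs are made up, while the RDF and FOAF URIs are real vocabulary terms. It stores entity-attribute-value rows whose identifiers are HTTP URIs, then follows a relationship from one entity to the description of another.

```python
# A minimal sketch of "EAV/CR with HTTP-based identifiers".
# The example.org URIs are hypothetical; helper names are illustrative only.
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Each row is (entity, attribute, value). Entities, attributes, and
# relationship targets are all named with HTTP URIs.
triples = [
    ("http://example.org/people/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/people/alice", FOAF + "name", "Alice"),
    ("http://example.org/people/alice", FOAF + "knows", "http://example.org/people/bob"),
    ("http://example.org/people/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/people/bob", FOAF + "name", "Bob"),
]


def describe(entity, rows):
    """Collect every attribute/value pair recorded for one entity."""
    description = defaultdict(list)
    for e, a, v in rows:
        if e == entity:
            description[a].append(v)
    return dict(description)


def follow(entity, attribute, rows):
    """Follow a relationship and describe each entity it points at."""
    return [describe(v, rows) for e, a, v in rows if e == entity and a == attribute]


alice = describe("http://example.org/people/alice", triples)
friends = follow("http://example.org/people/alice", FOAF + "knows", triples)
```

Because the value of `foaf:knows` is itself an identifier, getting from Alice to Bob's description is a lookup rather than an application-specific join.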

2009 is about a reboot on a monumental scale. We need new thinking, new technology, new approaches, and new solutions. No matter what route we take, we can't negate the importance of "Data". When dealing with organic or inorganic computer systems -- Data is simply everything!

The ability of individuals and enterprises to access, mesh, and disseminate data to relevant nodes across public and private networks will ultimately determine the winners and losers in the new frontier, ushered in by 2009.

Do not take data access and data management technology for granted. User interfaces come and go, application logic comes and goes, but your data stays with you forever. If you are mystified by data access technology then make 2009 the year of data access technology demystification :-)

Posted at 18:39

Orri Erling: Linked Data & The Year 2009 (updated)

As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.

Sir Tim said it at WWW08 in Beijing: linked data and the linked data web is the semantic web and the Web done right.

The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.

The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.

Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is none the less important, on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.

It is against this backdrop that this year will play out.

As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].

Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.

I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.

Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?

The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.
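
To give a feel for the work involved, here is a rough sketch of a triple table on a generic RDBMS, using Python's built-in sqlite3 module. It is an assumption-laden toy, not how Virtuoso or any other RDF store is built: the table layout, index names, and `ex:` identifiers are all invented for the example.

```python
# A toy entity-attribute-value (triple) table on a generic RDBMS.
# Table, index, and "ex:" names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# One compromise is visible at once: every value is stored as TEXT.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")

# Viewing the data "from all angles" means indexing the same table
# in several column orders.
conn.execute("CREATE INDEX idx_spo ON triples (s, p, o)")
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")
conn.execute("CREATE INDEX idx_osp ON triples (o, s, p)")

conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:doc1", "ex:author", "ex:alice"),
        ("ex:doc1", "ex:title", "Linked Data"),
        ("ex:doc2", "ex:author", "ex:bob"),
        ("ex:alice", "ex:name", "Alice"),
        ("ex:bob", "ex:name", "Bob"),
    ],
)

# "Titles of documents together with their authors' names":
# each extra attribute in the pattern is another self-join.
rows = conn.execute(
    """
    SELECT t.o AS title, n.o AS author_name
    FROM triples AS a
    JOIN triples AS t ON t.s = a.s AND t.p = 'ex:title'
    JOIN triples AS n ON n.s = a.o AND n.p = 'ex:name'
    WHERE a.p = 'ex:author'
    """
).fetchall()
```

Even in this toy, a three-attribute question becomes a three-way self-join over one table, which hints at why a general-purpose query planner and generic column types are an awkward fit.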

Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.

Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.

Linking Open Data project logo
In hoc signo vinces.

In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.

For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.
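
The "what is found within a set time" idea can be sketched in a few lines. This is only an illustration of the general anytime pattern under assumed names (`anytime_query`, `budget_seconds`); it says nothing about how Virtuoso implements the feature.

```python
# A sketch of an "anytime" query: return whatever matched before the
# time budget ran out, plus a flag saying whether the scan completed.
# Function and parameter names are hypothetical.
import time


def anytime_query(candidates, predicate, budget_seconds):
    deadline = time.monotonic() + budget_seconds
    matches = []
    for item in candidates:
        if time.monotonic() >= deadline:
            # Out of time: hand back the partial answer.
            return matches, False
        if predicate(item):
            matches.append(item)
    return matches, True


def is_multiple_of_100(n):
    return n % 100 == 0


# A generous budget lets the scan finish.
full, done = anytime_query(range(1000), is_multiple_of_100, budget_seconds=5.0)

# A zero budget expires before the first item is examined.
partial, finished = anytime_query(range(1000), is_multiple_of_100, budget_seconds=0.0)
```

The completion flag is what makes a usage-based model possible: a caller who sees an incomplete result can decide whether more processing time is worth paying for.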

For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.

For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.

For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.

2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.

Exciting 2009 to all.

Posted at 16:17

December 31

Uldis Bojars: Happy New Year - 2009 !!!

Happy New Year - 2009 !!!

Wishing everyone a Happy New Year!!!

This photo features a moment from the “Staro Rīga” - Riga light festival.

Posted at 12:35

December 30

Bob DuCharme: Turtles all the way down

A nice early version, without the turtles.

Posted at 19:13

December 29

Kingsley Idehen: Is Linked Data Always Relevant?

I pose the question above because I stumbled across an interesting claim about OpenLink Software and its representatives expressed in the ReadWriteWeb post titled: XBRL: Mashing Up Financial Statements, where the following claim is made:

"..There is evidence that they promote LINKED DATA at any expense without understanding the rationale behind other approaches...".

To answer the question above, Linked Data is always relevant as long as we are actually talking about "Data" which is simply the case all of the time, irrespective of interaction medium.

If XBRL can be disconnected in any way from Linked Data, I desperately would like to be enlightened (as per my comments to the post). Why wouldn't anyone desire the ability to navigate the linked data inherent in any financial report? Every entity in an XBRL instance document is an entity, directly or indirectly related to other entities. Why "Mash" the data when you can harmonize XBRL data via a Generic Financial Dictionary (schema or ontology) such that descriptions of Balance Sheet, P&L, and other entities are navigable via their attributes and relationships? In short, why "Mash" (code based brute force joining across disparately shaped data) when you can "Mesh" (natural joining of structured data entities)?

"Linked Data" is about the ability to connect all our observations (data), perceptions (information), and inferences / conclusions (knowledge) across a spectrum of interaction media. And it just so happens that the RDF data model (Entity-Attribute-Value + Class Relationships + HTTP based Object Identifiers), a range of RDF data model serialization formats, and SPARQL (Query Language and Web Service combo) actually make this possible, in a manner consistent with the essence of the global space we know as the World Wide Web.

Posted at 22:32

December 27

: Multi Model Forms With Rails


Ruby on Rails still doesn’t have a good story to tell with regard to multi-model forms. Multi-model forms are HTML forms that have fields from more than one model, which the user edits and submits as one. Rails should take this single collection of fields, split it up into multiple models, and create, edit, or delete as necessary. Unfortunately, this still requires quite a bit of work.

I’ve collected a set of links that may help those that are new handle multiple models in HTML forms with Rails. None of these are the “official” answer to this complex question. However, it appears that Rails core will one day have an out-of-the-box solution to this. Read on.

Ryan Bates, of Railscasts fame, has probably the best known solution to this problem. He suggests Complex Forms Part 1, Complex Forms Part 2, and Complex Forms Part 3. However, as even he mentions, it doesn’t work with Rails 2. Refer to these screencasts, and their sometimes useful comments, as background study.

Ryan offers an alternative for Rails 2 and above. He extended his original screencasts, fixed them for Rails 2, and published them in the Advanced Rails Recipes book, published by the Pragmatic Programmers. You’ll want to check out Recipe 13, titled “Handle Multiple Models in One Form” (but you really want the whole book, lots of little gems in there).

An alternative to Ryan’s methods can be found at attribute_fu, a Rails plugin by James Golick. attribute_fu is similar to Ryan Bates’ code; however, it’s packaged as a plugin for easy install and use, and uses more conventions to cut down on code. Complete with form helpers and extensions to has_many, you should try this plugin. I don’t know if it works with deeply nested models, though.

In July of 2008, there was a glimmer of hope that the Rails core was going to get a blessed solution to this problem. Ryan Daigle, of Ryan’s Scraps, reported that Nested Model Mass Assignment was added into Rails! However, it was pulled by the core team because they felt it wasn’t ready for prime time. There was a lot of discussion about if and when Rails would have nested model mass assignment. Hopefully it will arrive after 2.2. This is a good discussion and links to other plugins or proposals to handle this tricky problem.

Ryan Bates followed the debate and collected and summarized the various different methods and proposals to handle nested model and nested forms with Rails.

I, for one, will be trying attribute_fu. I’ve used the Advanced Rails Recipes solution, which certainly works. However, it always felt like too much work for me.

Update: it seems that I wrote about handling collections of models with Rails back in 2007.

      

Posted at 04:15

December 26

Bob DuCharme: A belated Christmas wish: a SPARQL endpoint for Digg RDF

Or consider it a lazy semweb wish.

Posted at 19:47

Tetherless World Constellation group RPI: Get a senior scientist blogging (my response)

During some random Web surfing (something I don’t get nearly enough time to do these days), I ran into the Science blogging Challenge (aka “get a senior scientist blogging”) and it got me thinking about how I got blogging, and more recently how I got twittering (which seems to fit my insane life style better).  I sent the following entry to the competition, nominating a few people who were instrumental in getting me blogging and more recently getting me to tweet.

Here’s what I said:

My motivation to start blogging actually came from a different senior scientist starting his blog. In Jan 06, one of my colleagues started a blog, and it got some big notice; since the blogger was Tim Berners-Lee, that made some sense. My first real blog (I had contributed blog comments and done an occasional “guest shot” on other people’s blogs) was called “Time to get a blog” and mentions the influence of Tim’s blogging. I cannot tell you who convinced Tim to blog, but I know that Danny Weitzner, whose blog is at http://people.w3.org/~djweitzner/blog/, was one of the influences.
However, Tim’s starting to blog is the thing that got me to finally do it, but the person who really got me blogging is Jennifer Golbeck, (who blogs in a bunch of different places)  who is the one who convinced me to get my act together and walk the walk if I was going to claim to be a Professor of All Things Web, as I now try to be - she’s also the one who got me signed up on orkut, facebook (beta) and a bunch of other social networking sites long before it became popular - and if I’m not mistaken she’s probably the person who got me my gmail invitation way back when - so Jen should definitely be someone considered in the “I got a senior scientist to blog” category.
Meanwhile, the propagation continues - Peter Fox, who attended this past Sci Foo, and is an occasional blogger has joined my lab, and he and I are trying to convince several of our colleagues, esp. Deborah McGuinness, to get blogging.
I’d also like to point out that while blogging continues to be interesting to look at as a mechanism for propagating science, I’m finding these days that microblogging (I’m jahendler on twitter) has been gaining popularity, especially among the Social Scientists - and it may be an even better way for some of the busy senior scientists you’re trying to reach out to (if they can just learn to use the messaging on their cell phones). I credit “eingang” (Michelle Hoyle - http://einiverse.eingang.org/) for getting me twittering, and I notice that a quick message from my phone during a lecture or seminar is a good way to share a thought or a pointer (although I find it also is fun to add personal observations and such - so it humanizes the scientists who use it).
So anyway - there are three entries for the contest
Danny Weitzner for helping to get Tim Berners-Lee blogging
Jen Golbeck for getting me blogging
Michelle Hoyle for getting me micro-blogging
cheers
Jim H.

Posted at 15:28

December 25

Andrew Matthews: Quantum Reasoners Hold Key to Future Web

Last year, a company called DWave Systems announced their quantum computer (the ‘Orion’) - another milestone on the road to practical quantum computing. Their controversial claims seem worthy in their own right but they are particularly important to the semantic web (SW) community. The significance to the SW community was that their quantum computer solved [...]

Posted at 14:09

December 24

Ebiquity research group UMBC: WWGD: Understanding Google’s Technology Stack

It’s popular to ask “What Would Google Do” these days: Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it and shows a short, but reasonably efficient, Python program that can easily do a graph with a few million nodes. While not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
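
For readers who have not seen the computation itself, here is a minimal power-iteration sketch. It is not Nielsen's program and makes no attempt at his efficiency; the function name, the 0.85 damping factor, the fixed iteration count, and the four-node graph are assumptions picked for illustration.

```python
# A plain-Python PageRank sketch using power iteration.
# Not Nielsen's code; names and default values are illustrative.
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a graph given as {node: [nodes it links to]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution over all pages.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over every page.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Every page passes an equal share of its rank along each out-link.
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this small graph, node `c` collects links from three of the four nodes and ends up with the highest rank, while `d`, which nothing links to, ends up with the lowest.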

The post is part of a series Nielsen is making on the Google Technology Stack including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he’s giving starting earlier this month in Waterloo. Here’s the way that Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”

Videos of the first two lectures (Introduction to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.

Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. In both cases, it is an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!

Posted at 16:15

December 23

Kingsley Idehen: Bio2Rdf EC2 AMI is now Ready! (Updated)

Adding to the collection of Amazon EC2 AMI based knowledgebases already unveiled for DBpedia and NeuroCommons, we now have a Bio2Rdf knowledgebase AMI.

What is Bio2Rdf?

A community developed knowledgebase comprised of Bio Informatics data from across 30 or so public data sources. The standard deployment of Bio2Rdf includes a federation of SPARQL endpoints provided by project members and collaborators.

What is the Bio2Rdf EC2 AMI?

An Amazon EC2 hosted variant of the Bio2Rdf knowledgebase. In addition to providing a SPARQL endpoint, the data exposed by the Amazon AMI is published in compliance with Linked Data publishing best practices espoused by the Linking Open Data community (LOD).

Benefits?

The ability to instantiate a personal or service-specific variant of this powerful knowledgebase via the Amazon EC2 Cloud. Instead of a 22+ hour error-prone odyssey - you simply get down to the task of data analysis and integration within 1.5 hrs (when setting up your AMI for the first time).

How do I get going?

Just follow the instructions in the Bio2Rdf EC2 AMI installation guide.

Posted at 15:37

W3C QA Blog Semantic Web News: RDFa and SVG Tiny (and the RDFa distiller)

W3C has just published the SVG Tiny 1.2 recommendation. Others are far better qualified than I am to describe the changes in core functionality compared to version 1.1, so I will leave that to them. However, the new recommendation also has an interesting aspect for the Semantic Web: SVG Tiny 1.2 has adopted RDFa as one of the means of adding metadata to the SVG file itself. The semantics of the RDFa attributes are the same as for XHTML; in fact, the SVG document simply refers to the RDFa specification. Nevertheless, the fact that the host language is SVG does lead to two small differences:
  1. SVG uses xml:base, whereas XHTML1+RDFa disallows it in favor of the base element
  2. SVG inherits from earlier versions the possibility to add RDF/XML directly into the SVG content via the metadata element. An SVG+RDFa distiller ought to understand this RDF graph and merge it with the graph produced by the regular RDFa processing.
The RDFa distiller has been updated to distill SVG+RDFa files, too. To account for those two differences a separate (host=xml) option has been introduced, although the distiller would work out of the box for most of the SVG cases (ie, for those that do not make use of those two features). As an example, I have updated the SVG version of the horizontal SW cube to SVG 1.2 Tiny. It uses the metadata element for the description of the copyright statements, but reuses SVG’s title and desc elements to generate the corresponding dc:title and dc:description RDF statements using RDFa’s @property attribute. Using the RDFa distiller, one can get to the RDF content. Cool…
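
To make the two differences concrete, here is a small sketch using Python's standard ElementTree. It is not the distiller's code and does no real RDFa or RDF/XML processing; the SVG document, its example.org base URI, and the variable names are all made up. It only shows where an SVG-aware processor has to look.

```python
# Locating the pieces an SVG+RDFa processor needs. Illustration only:
# the SVG content and all variable names are hypothetical.
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

svg_source = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xml:base="http://example.org/cube.svg"
     version="1.2" baseProfile="tiny">
  <title property="dc:title">Semantic Web cube</title>
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="">
        <dc:rights>Example copyright statement</dc:rights>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
</svg>"""

root = ET.fromstring(svg_source)

# Difference 1: the base URI is given by xml:base, not a base element.
base_uri = root.get("{%s}base" % XML_NS)

# Difference 2: RDF/XML may sit inside the metadata element; a processor
# would parse these subtrees and merge them with the RDFa-derived graph.
embedded_graphs = root.findall("{%s}metadata/{%s}RDF" % (SVG_NS, RDF_NS))

# Elements carrying RDFa's @property, e.g. title reused as dc:title.
rdfa_properties = [
    (el.get("property"), el.text) for el in root.iter() if el.get("property")
]
```

A real processor would go on to resolve the `dc:` prefix, apply the base URI to relative references, and parse the embedded RDF/XML; the sketch stops at finding the inputs.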

Posted at 09:49

December 22

Ebiquity research group UMBC: Videos of Semantic Web talks and tutorials from ISWC 2008 now online

High quality videos of tutorials and talks from the Seventh International Semantic Web Conference are now available on the excellent VideoLectures.net site. It’s a great opportunity to benefit from the conference if you were not able to attend or, even if you were, to see presentations you were not able to attend.

Videolectures captured the slides for most of the presentations (which are available for downloading) and their site shows both the speaker’s video and slides in synchronization. Videolectures used three camera crews in parallel so were able to capture almost all of the presentations. Here are some highlights from the ~90 videos to whet your appetite.

Posted at 14:45

Bob DuCharme: Adding metadata value with Pellet

A nice new feature of Pellet 2.0.

Posted at 14:37

Ebiquity research group UMBC: Tom Briggs Ph.D.: Constraint Generation and Reasoning in OWL

Tom Briggs defended his PhD dissertation last month on discovering domain and range constraints in OWL and the final copy is now available.

Thomas H. Briggs, Constraint Generation and Reasoning in OWL, 2008.

The majority of OWL ontologies in the emerging Semantic Web are constructed from properties that lack domain and range constraints. Constraints in OWL are different from the familiar uses in programming languages and databases. They are actually type assertions that are made about the individuals which are connected by the property. Because they are type assertions, they can add vital information to the individuals involved and give information on how the defining property may be used. Three different automated generation techniques are explored in this research: disjunction, least-common named subsumer, and vivification. Each algorithm is compared for the ability to generalize, and the performance impacts with respect to the reasoner. A large sample of ontologies from the Swoogle repository is used to compare real-world performance of these techniques. Using generated facts is a type of default reasoning. This may conflict with future assertions to the knowledge base. While general default reasoning is non-monotonic and undecidable, a novel approach is introduced to support efficient contraction of the default knowledge. Constraint generation and default reasoning, together, enable a robust and efficient generation of domain and range constraints which will result in the inference of additional facts and improved performance for a number of Semantic Web applications.
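
The point that domain and range act as type assertions rather than validity checks can be shown with a toy example. This sketch is not from the dissertation and is not a reasoner; the `ex:` names and the `infer_types` helper are invented for illustration.

```python
# Domain and range as type assertions: using a property on a pair of
# individuals lets us infer their types. All "ex:" names are hypothetical.

# property -> (domain class, range class)
constraints = {"ex:hasAuthor": ("ex:Document", "ex:Person")}

# Asserted facts: (subject, property, object)
assertions = [("ex:thesis", "ex:hasAuthor", "ex:tom")]


def infer_types(facts, domain_range):
    """Return the rdf:type triples implied by domain/range declarations."""
    inferred = set()
    for subject, prop, obj in facts:
        if prop in domain_range:
            domain, range_ = domain_range[prop]
            # Nothing is rejected here; new facts are added instead.
            inferred.add((subject, "rdf:type", domain))
            inferred.add((obj, "rdf:type", range_))
    return inferred


inferred = infer_types(assertions, constraints)
```

A database constraint would refuse the row if `ex:tom` were not already known to be a person; here the statement itself is what tells us so, which is why a property without domain and range yields fewer inferences.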

Posted at 08:00

December 21

Ebiquity research group UMBC: Disco: a Map reduce framework in Python and Erlang

Disco is a Python-friendly, open-source Map-Reduce framework for distributed computing with the slogan “massive data - minimal code”. Disco’s core is written in Erlang, a functional language designed for concurrent programming, and users typically write Disco map and reduce jobs in Python. So what’s wrong with using Hadoop? Nothing, according to the Disco site, but…

“We see that platforms for distributed computing will be of such high importance in the future that it is crucial to have a wide variety of different approaches which produces healthy competition and co-evolution between the projects. In this respect, Hadoop and Disco can be seen as complementary projects, similar to Apache, Lighttpd and Nginx.

It is a matter of taste whether Erlang and Python are more suitable for the task than Java. We feel much more productive with Python than with Java. We also feel that Erlang is a perfect match for the Disco core that needs to handle tens of thousands of tasks in parallel.

Thanks to Erlang, the Disco core is remarkably compact, currently less than 2000 lines of code. It is relatively easy to understand how the core works, and start experimenting with it or adapt it to new environments. Thanks to Python, it is easy to add new features around the core which ensures that Disco can respond quickly to real-world needs.”

The Disco tutorial uses the standard word counting task to show how to set up and use Disco on both a local cluster and Amazon EC2. There is also homedisco, which lets programmers develop, debug, profile and test Disco functions on one local machine before running on a cluster. The word counting example from the tutorial is certainly nicely compact:

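# Python 2, as in the Disco tutorial.
from disco.core import Disco, result_iterator

# Map: emit a (word, 1) pair for every word in an input entry.
def fun_map(e, params):
    return [(w, 1) for w in e.split()]

# Reduce: sum the counts per word and write the totals to the output.
def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

# Submit the job to the local Disco master and wait for the results.
results = Disco("disco://localhost").new_job(
                name = "wordcount",
                input = ["http://discoproject.org/chekhov.txt"],
                map = fun_map,
                reduce = fun_reduce).wait()

for word, frequency in result_iterator(results):
    print word, frequency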
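
For readers without a Disco installation, the same map and reduce logic can be tried on one machine. The following is a minimal sketch in plain Python 3 that does not use Disco at all; the driver function `run_local` and the sample input are assumptions made for illustration, and the function signatures are simplified compared with the Disco ones above.

```python
from collections import defaultdict

def fun_map(line):
    # Emit a (word, 1) pair for every whitespace-separated word.
    return [(word, 1) for word in line.split()]

def fun_reduce(pairs):
    # Sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += int(n)
    return dict(counts)

def run_local(lines):
    # Hypothetical single-process driver: map every line, then reduce once.
    pairs = []
    for line in lines:
        pairs.extend(fun_map(line))
    return fun_reduce(pairs)

counts = run_local(["the cat sat", "the dog sat"])
for word, frequency in sorted(counts.items()):
    print(word, frequency)
```

What a framework like Disco adds is everything this sketch leaves out: splitting the input across nodes, shipping the functions to them, and collecting the results.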

Posted at 17:45

Tetherless World Constellation group RPI: What sweet spot?

I wanted to leave a comment on the Clark and Parsia blog about Kendall’s entry entitled “Our Approach to Modeling, Fidelity, and KR.” However, to leave such a comment I would have to log in, and I have way too many accounts right now, so I thought I’d write my response as a new entry (and by the time I finished, this was too long to be just a comment).

I don’t disagree with the overall “spectrum” that Kendall offers, but his point is that they have picked a point in the middle, and since they are in the middle they can model more than the scalers and scale more than the modelers. The problem is that the middle is very, very wide, and thus there are many places in this space where such a claim could be made. So, for example, a large triple store that can do a small amount of inferencing (Garlik’s JXT is one example) would scale even better and could still claim to do more modeling than a pure triple store.

On the other end is the idea that decidability is somehow a sweet spot (despite known exponential behaviors for DL) compared with a more highly modeled, but perhaps heuristic (or incomplete) logic. In this case the system could claim both to have more expressivity than a DL system and to be more scalable (it just couldn’t guarantee to have all the answers). In fact, right now the systems that probably have the highest score in modeling power vs. scalability would fall in this camp. The thing is that their answer sets would be somewhat different.

In my opinion, the real problem with this blog entry is the idea that there is one sweet spot (Kendall called it the “sweet spot”), which implies that there is a general best answer. This is the point I cannot really live with, and have spent much of my recent career trying to debunk. Depending on what you are trying to do, there are many possible sweet spots. There is a set of problems for which what C&P are doing is exactly the right thing, but there are also many for which it is not.

And that is the key thing: we in the field have to get much better at understanding where the tradeoffs are and what various kinds of applications require. Google taught us years ago that sometimes finding a good answer quickly can be an incredibly powerful thing. Expert systems taught us that for many applications complex modeling is too expensive. Yet there are systems running in real applications that are using expert-level modeling, because sometimes it is the thing you need despite the cost (and the ROI is high enough).

The other problem I have with the argument actually has nothing to do with the issues of logic and such. The traditional database community has for a long time made a similar claim, which is that there is a particular place in the expressivity/scalability space that is “the” correct place. They have spent years claiming that particular sweet spot is the only one that is interesting, and it certainly has proven to be a very important one, with far more commercial success than the DL stuff. However, lately we’ve been learning that there exist problems where we need more expressivity, and thus other things have to be explored. The people in the DB community who’ve started looking at graph stores are, indeed, seeing that there are some applications, both in enterprises and especially on the Web, where the small amount of added expressivity makes a huge difference. (Anyone who has witnessed my debates with Ullmann has certainly heard this argued…)

Anyway, when I gave the first talk at the DARPA Agent Markup Language (DAML) program, lo these many years ago, I showed a slide with the word “THE” under a kill ring and stated that in the Web there is no “the”. Whether I am speaking to the database community, the adherents of DL, the people who cite my work, or anyone else: remember that you are exploring one sweet spot that can be important to some set of applications, but there are many others, and we all win when we remember that.

Cheers - Jim Hendler

p.s. Clearly this is not meant in any way to be an anti-C&P comment, I was just riffing off of what Kendall wrote.

Posted at 03:58

December 20

W3C Semantic Web News: New RIF specification releases

The W3C Rule Interchange Format (RIF) Working Group published five new Working Drafts today. Since the Last Call Working Draft of RIF Basic Logic Dialect (BLD), the group has been developing other key dialects, components, and test cases. The new publications are: The Working Group is nearing Last Call on these remaining elements of RIF, and welcomes feedback from rule system users and designers.

Posted at 08:46

December 19

Clark and Parsia: Our Approach to Modeling, Fidelity, and KR

For some people, the point of the Semantic Web is distributed, web-friendly knowledge management and knowledge representation. Generally we’re in that camp. But that camp breaks down into several factions, and it’s useful to be clear about which faction we’re in.

There is a spectrum that runs from Maximum Fidelity to Maximum Scalability. Given our roots in Description Logic, we lie somewhere in-between these two poles. Notice that I have intentionally avoided calling these “extremes”; they are endpoints, and perfectly respectable, useful ones, depending on who you are and what you’re trying to achieve.

The Max Fidelity folks want to model some world-chunk in as fine-grained and faithful a manner as possible. This often means that they are at least first order logic fans, and sometimes higher-order logic users. They debate edge cases, corner cases, alternate and competing semantics and logics in an attempt to ever more faithfully mirror reality. The price they pay is, generally, computability. For some use cases, that price is perfectly acceptable. For other use cases, that price is entirely too high, since the most perfect representation of the world is useless if you can’t practically compute with it. At least, that’s how Max Fidelity often looks to us.

At the far end of the spectrum we have Max Scalability folks, for whom the point of the Semantic Web is rather more the “Web” than the “Semantic” part; we might playfully call them the “semantic WEB” crowd, in order to reflect their ideal ratio. Here the point isn’t to model perfectly; but, rather, to do something with lots and lots of data, ideally Webfuls of data. This means, in the argot of current tech choices, that they tend to be RDF and Linked Data fans and users, since that’s just about the only approach to doing anything at all interesting with Webfuls of data. The price they pay, of course, is expressivity. For some use cases, that’s just fine, since you don’t always need a lot or even much semantic fidelity to get the job done. Sometimes we build applications for customers that take this approach. But, as above, for other use cases, this is simply a killer, because without enough or the right semantics, you don’t get the right kind of help from the machine in figuring out complex stuff.

So what do we have so far? First, we have a notional (and idealized) spectrum that runs from Webfuls of data to, roughly, at least first order logic. Second, we have obviously tons of interesting use cases at (probably) every point along this spectrum. And, third, we have the suggestion that we aim for some kind of sweet spot in the middle—where “sweet spot” and “in the middle” are not absolute notions, but are interest-relative and goal-specific, and where the interests and goals we care about are, surprise-surprise, ours.

(In other words, I’ve set up a little fantasy where we are the Heroes, where we naturally occupy the “sweet spot”, but then, since I’m not a complete jerk, I’ve ironized or called into question that very fantasy in an effort to suggest that we, just like everyone else, try to spin things to make ourselves look smart, cool, and useful.)

And—will miracles never cease?—that’s just about where Description Logic fits along such an idealized spectrum. Technically, it’s the decidable subset of first order logic, which means that we try to balance Fidelity and Scalability in a way where we can get some of both.

The Max Fidelity folks are forever poking us with sticks to the effect that we can’t model world-chunks nearly as faithfully as they can. Well, no crap, of course we can’t! Then the Max Scalability folks poke us with different sticks to the effect that we can’t scale to Webfuls of data—again, no duh!

And then we poke back at both camps—hey, they started it!—to the effect that we can model far better than Max Scalers and we can scale far further than Max Fideliters (yes, I just made that word up…Rock!)...

Finally, a word about how this positioning issue plays out in our approach to modeling. In short, we model such that we get the right inferences, since getting the inferences is typically what our kind of applications (analysis, decision support kinds of apps, in short) are all about. So that means some edge or corner cases, even if they fit into DL, get ignored or dropped out or even distorted when there’s no point, given requirements analysis, to fidelity for its own sake. And it means, on the flip side, that we don’t worry too much that inference over Webfuls of data is not realistically achievable anytime soon. Fast enough for the customer’s data is sufficient scalability in most cases for us.

Posted at 16:41

Clark and Parsia: Why Reasoning Matters: Motivations

The perceived utility of automated reasoning for a wide range of applications matters to us greatly, which makes sense, given that our biz proposition is “semantic infrastructure OEM”. In other words, we’re trying to make money by licensing reasoning infrastructure, and related pieces, for semantic applications to other developers to use in their apps. With the right APIs and tool maturity, as well as supporting materials, our customers should be able to treat automated reasoning as a black box—not a black art.

A problem with demonstrating automated reasoning’s utility is that automated reasoning is complex, with non-trivial logical background and framework, including oodles of domain-specific vocabulary. Another problem is that automated reasoning is, in the end, just a kind of mechanical term rewriting, often according to rules that, considered individually, are quite trivial. (Pellet isn’t really a rules engine, but we’ll talk about that another time.)

That means that for toy cases, which is what most people new to the subject are ready for, it seems dull and unimpressive. And for the hard cases? Well, most people aren’t ready for hard cases, so they simply tune out. And who can blame them, really? It’s like my example about Emma and Jack. I mean, that example really sucked, but what’s the alternative?

This is not an easy problem to solve.

My approach, rather than showing more toy or real examples, is just to talk about the utility of automated reasoning in plain language, in an attempt to communicate not so much specific details as the general mindset or approach to solving particular sorts of problems using automated reasoning. This approach to marketing mirrors our approach to technology development: both are iterative and experimental, but not just for us. As the man said, even a blind pig occasionally finds an acorn.

Posted at 16:21

Clark and Parsia: APQC in Houston

I don’t have slides for my time at the APQC in Houston; I was not slated to present, so there is no cool slide widget with my presentation in this post. I was merely there to observe and learn, and maybe answer some questions about POPS.

As Kendall mentioned, POPS was nominated as a best practice as part of NASA JPL’s overall efforts in Knowledge Management. The meeting at APQC was for all the nominees to give a short talk and to hear the overall findings of the study conducted by APQC, which in this case was on Expertise Location and Social Networking.

I got to see some great presentations by folks from IBM, Sun, Pratt & Whitney, Rockwell Collins, and Mitre and get a lot of insight into what they’re doing with Expertise Location and Social Networking: challenges they faced in the past, lessons learned, and what they’re doing now, and in the future, to continue their efforts in these areas.

It was a great experience. The people from APQC were fantastic, very friendly, and put on a great event, and all the nominees and study partners, a group which included L3, Marathon Oil, ExxonMobil, and Wyeth, were great and added a lot to the discussions.

Hopefully I get the chance to participate or work with APQC in the future.

Posted at 16:03

W3C QA Blog Semantic Web News: Small update of the RDFa distiller software

I have made a small update to the pyRdfa Python package that drives the RDFa distiller. The main differences between this version and the previous are:
  • via a private communication Dan Brickley made me think about the following: what is the proper return format when an exception is found during processing (e.g., if there is a parsing error)? Up until the previous version the distiller caught the exception and returned an HTML page with the error message. Which is fine if one runs, say, the software from the distiller page. But what should happen if the caller expects an RDF graph and only an RDF graph? Surely the return should be, well, a graph containing the errors… so I have implemented this. I.e., in case of an exception, the distiller now looks into the accept header and, unless the header indicates html with a higher priority, a graph is returned (in RDF/XML or Turtle, depending on the caller’s settings).
  • Richard Cyganiak found a bug in the way the @property attributes with XML subtrees are handled. It is a bit of an edge case; we realized that there is not even a test case for this in the RDFa test suite:-( But there are some nice use cases that indeed went wrong (like the one by Florian Schmedding). (You can look at the mail archive for further details if interested.)
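
The accept-header decision described in the first item can be sketched roughly as follows. This is not pyRdfa's code: the function names `parse_accept` and `error_format`, the lists of media types, and the choice to ignore wildcard entries such as `*/*` are all assumptions made for illustration.

```python
# Hypothetical sketch of "return a graph unless HTML is preferred".
HTML_TYPES = ("text/html", "application/xhtml+xml")
GRAPH_TYPES = ("application/rdf+xml", "text/turtle")  # assumed media types

def parse_accept(header):
    """Parse an HTTP Accept header into a {media_type: q} mapping."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs[media_type] = q
    return prefs

def error_format(accept_header):
    """Return 'html' only if HTML is ranked strictly above RDF, else 'graph'."""
    prefs = parse_accept(accept_header or "")
    html_q = max(prefs.get(t, 0.0) for t in HTML_TYPES)
    graph_q = max(prefs.get(t, 0.0) for t in GRAPH_TYPES)
    return "html" if html_q > graph_q else "graph"

print(error_format("text/html,application/xhtml+xml,*/*;q=0.8"))
print(error_format("application/rdf+xml"))
```

A browser-style header yields an HTML error page, while a client that asks for RDF, or sends no accept header at all, gets the errors back as a graph.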
Nice holidays, everyone!

Posted at 13:28

Ebiquity research group UMBC: Eigenfactor.org measures and visualizes journal impact

eigenfactor.org is a fascinating site that is exploring new ways to measure and visualize the importance of journals to scientific communities. The site is a result of work by the Bergstrom lab in the Department of Biology at the University of Washington. The project defines two metrics for scientific journals based on a PageRank-like algorithm applied to citation graphs.

“A journal’s Eigenfactor score is our measure of the journal’s total importance to the scientific community. With all else equal, a journal’s Eigenfactor score doubles when it doubles in size. Thus a very large journal such as the Journal of Biological Chemistry which publishes more than 6,000 articles annually, will have extremely high Eigenfactor scores simply based upon its size. Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson’s Journal Citation Reports (JCR) is 100.

A journal’s Article Influence score is a measure of the average influence of each of its articles over the first five years after publication. Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to Thomson Scientific’s widely-used Impact Factor. Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00.”
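
The two scalings quoted above can be illustrated with a toy example. The sketch below is a plain PageRank-style power iteration over a small citation matrix, not the Eigenfactor algorithm itself, whose published details are not reproduced here; the damping value, the citation counts, the article counts, and the function names are all made up for illustration.

```python
def rank_journals(citations, damping=0.85, iterations=100):
    """citations[i][j] is the number of citations from journal i to journal j.
    Returns PageRank-style scores scaled so that they sum to 100."""
    n = len(citations)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(citations):
            out = sum(row)
            if out == 0:
                # A journal that cites nothing spreads its weight evenly.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j, count in enumerate(row):
                    new_rank[j] += damping * rank[i] * count / out
        rank = new_rank
    total = sum(rank)
    return [100.0 * r / total for r in rank]

def article_influence(scores, article_counts):
    """Per-article score, normalized so that the mean article has influence 1.00."""
    mean_per_article = sum(scores) / sum(article_counts)
    return [(s / a) / mean_per_article for s, a in zip(scores, article_counts)]

# Hypothetical data: three journals with very different sizes.
citations = [[0, 2, 1], [1, 0, 1], [3, 1, 0]]
article_counts = [100, 50, 10]

scores = rank_journals(citations)
influence = article_influence(scores, article_counts)
print(scores)
print(influence)
```

The first list sums to 100 regardless of how many journals there are, while the second divides out journal size, which is the distinction the quoted passage draws between total importance and per-article influence.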

For example, here are the ISI-indexed journals in the AI subject category ranked by the Article Influence score for 2006.

The site makes good use of Google Docs motion charts to visualize the changes of metrics for top journals in a subject area. You can also interactively explore maps that show the influence of different subject categories on one another as estimated from journal citations.

Map of Science

The details of the approach and algorithms are available in various papers by Bergstrom and his colleagues, such as

M. Rosvall and C. T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proceedings of the National Academy of Sciences USA. 105:1118-1123. Also arXiv physics.soc-ph/0707.0609v3 [PDF]

(spotted on Steve Hsu’s blog)

Posted at 06:01

Copyright © The PANTS Collective. A Useful Production. Contact us.