It's triples all the way down
A List Apart has a new article out on the Semantics in HTML5. John Allsopp writes:
We’ll start by posing the question: “why are we inventing these new elements?” A reasonable answer would be: “because HTML lacks semantic richness, and by adding these elements, we increase the semantic richness of HTML—that can’t be bad, can it?”
By adding these elements, we are addressing the need for greater semantic capability in HTML, but only within a narrow scope. No matter how many elements we bolt on, we will always think of more semantic goodness to add to HTML. And so, having added as many new elements as we like, we still won’t have solved the problem. We don’t need to add specific terms to the vocabulary of HTML, we need to add a mechanism that allows semantic richness to be added to a document as required. In technical terms, we need to make HTML extensible. HTML 5 proposes no mechanism for extensibility.
On reading of which, I hurt my head by banging it, suddenly and with force, against my desk.
Several times.
Posted at 18:36
This website is now powered by Drupal.
As I felt the need to edit more than blog posts (i.e. publication
pages, etc.), WordPress was not enough and so I switched to Drupal 6
thanks to the WP2Drupal6
plug-in.
Posted at 14:04
2009-01-05. The DCMI Usage Board has published a structured set of criteria for reviewing application profiles based on the DCMI Abstract Model.
Posted at 23:59
2009-01-05, A completely new version of the DCMI Conference Paper repository has been installed at the National Library of Korea. The new repository includes the proceedings of all conferences from Tokyo 2001 until Berlin 2008.
Posted at 23:59
2009-01-05, The Dublin Core Metadata Initiative (DCMI) has completed the legal steps for incorporation as a public, not-for-profit Company limited by Guarantee in Singapore. The founding members of the new legal entity are the National Library Board Singapore and the National Library of Finland. The other DCMI Affiliates, the Joint Information Systems Committee (JISC) in the UK, the National Library, National Archives and the State Services Commission of New Zealand and the National Library of Korea, will become Members in the weeks ahead.
Posted at 23:59
We’re welcoming 2009 by making a new release candidate for Pellet 2.0 available for download. Pellet 2.0 RC4 resolves several issues present in previous release candidates, which are documented more fully in a Pellet Trac report.
The resolved issue that was most likely to have frustrated users was broken import behavior when using the Pellet command line with the Jena loader. In addition, our effort to support the built-ins listed in the SWRL submission is closer to complete now that the team has added implementation of the built-ins for URIs.
Special thanks to the users who reported issues, since most of the changes in this release were made in response to user-identified problems. Keep up the good work by sending your bug reports to the Pellet users mailing list.
Posted at 18:16
As indicated in posts from Fred Giasson and Mike Bergman, the Zitgist incubation effort that contributed to the delivery of vital Linked Data Web infrastructure components such as TalkDigger (discourse discovery and participation), PingTheSemanticWeb (ground-zero data source for most Semantic Web search engines), UMBEL (binding layer for Upper and Lower Ontologies amongst other things), Music Ontology (enabling meaningful description of Music), and Bibliographic Ontology (enabling meaningful description of Bibliographic content), is now ready to continue its business development and technology growth as a going concern known as Structured Dynamics.
With great joy and pride, I wish Structured Dynamics all the success they deserve. Naturally, the collaborations and close relationship between OpenLink Software and its latest technology partner will continue -- especially as we collectively work towards a more comprehendible and pragmatic Web of Linked Data for developers (across Web 1.0, 2.0, 3.0, and beyond), end-users (information- and knowledge-workers), and entrepreneurs (driven by quality and tangible value contribution).
Posted at 04:03
Happy New Year!
In 2009 I hope the following happens re. "Linked Data":
2009 is about a reboot on a monumental scale. We need new thinking, new technology, new approaches, and new solutions. No matter what route we take, we can't negate the importance of "Data". When dealing with organic or inorganic computer systems -- Data is simply everything!
The ability of individuals and enterprises to access, mesh, and disseminate data to relevant nodes across public and private networks will ultimately determine the winners and losers in the new frontier, ushered in by 2009.
Do not take data access and data management technology for granted. User interfaces come and go, application logic comes and goes, but your data stays with you forever. If you are mystified by data access technology then make 2009 the year of data access technology demystification :-)
Posted at 18:39
As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.
Sir Tim said it at WWW08 in Beijing — linked data and the linked data web is the semantic web and the Web done right.
The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.
The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.
Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is no less important; on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.
It is against this backdrop that this year will play out.
As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].
Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.
I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.
Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?
The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.
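To make the EAV-on-RDBMS point concrete, here is a minimal sketch of a triple table on a generic relational engine, using Python's bundled sqlite3 module. The table layout, the index choices, and the `ex:` identifiers are assumptions made up for illustration; this is not how Virtuoso or any particular RDF store is built.

```python
import sqlite3

# Hypothetical schema: one generic entity-attribute-value (triple) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")

# Indices in several orders, so the data can be queried "from all angles".
conn.execute("CREATE INDEX idx_spo ON triples (s, p, o)")
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")
conn.execute("CREATE INDEX idx_osp ON triples (o, s, p)")

# Made-up example data.
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "ex:worksFor", "ex:acme"),
        ("ex:bob", "ex:worksFor", "ex:acme"),
        ("ex:acme", "ex:locatedIn", "ex:riga"),
        ("ex:alice", "ex:name", "Alice"),
        ("ex:bob", "ex:name", "Bob"),
    ],
)


def people_working_in(city):
    """Names of people whose employer is located in `city`.

    Every triple pattern in the question costs one self-join
    on the same table.
    """
    rows = conn.execute(
        """
        SELECT n.o
        FROM triples AS w
        JOIN triples AS l ON l.s = w.o
        JOIN triples AS n ON n.s = w.s
        WHERE w.p = 'ex:worksFor'
          AND l.p = 'ex:locatedIn' AND l.o = ?
          AND n.p = 'ex:name'
        ORDER BY n.o
        """,
        (city,),
    )
    return [name for (name,) in rows]
```

Even this three-pattern question needs a three-way self-join and several indices, which hints at the work and compromises involved once the data and the queries grow.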
Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.
Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.
In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.
For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.
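The "what is found within a set time" behavior can be illustrated with a small sketch. This is only a toy model of the idea as described above, under the assumption that results are produced one candidate at a time; the function name and parameters are hypothetical and say nothing about how the Virtuoso Anytime feature is implemented.

```python
import time


def anytime_query(candidates, predicate, budget_seconds):
    """Return (matches, complete) for whatever was found within the budget.

    `complete` is False when the time budget ran out before every
    candidate was examined, so the caller knows the answer is partial.
    """
    deadline = time.monotonic() + budget_seconds
    matches = []
    for item in candidates:
        if time.monotonic() >= deadline:
            return matches, False  # partial answer: budget exhausted
        if predicate(item):
            matches.append(item)
    return matches, True  # every candidate was examined
```

A caller could show the partial matches right away and offer a longer budget, which is where a pay-for-usage model would attach.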
For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.
For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.
For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.
2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.
Exciting 2009 to all.
Posted at 16:17
Wishing everyone a Happy New Year!!!
This photo features a moment from the “Staro Rīga” - Riga light festival.
Posted at 12:35
Posted at 19:13
I pose the question above because I stumbled across an interesting claim about OpenLink Software and its representatives expressed in the ReadWriteWeb post titled: XBRL: Mashing Up Financial Statements, where the following claim is made:
"..There is evidence that they promote LINKED DATA at any expense without understanding the rationale behind other approaches...".
To answer the question above, Linked Data is always relevant as long as we are actually talking about "Data" which is simply the case all of the time, irrespective of interaction medium.
If XBRL can be disconnected in any way from Linked Data, I desperately would like to be enlightened (as per my comments to the post). Why wouldn't anyone desire the ability to navigate the linked data inherent in any financial report? Every entity in an XBRL instance document is an entity, directly or indirectly related to other entities. Why "Mash" the data when you can harmonize XBRL data via a Generic Financial Dictionary (schema or ontology) such that descriptions of Balance Sheet, P&L, and other entities are navigable via their attributes and relationships? In short, why "Mash" (code based brute force joining across disparately shaped data) when you can "Mesh" (natural joining of structured data entities)?
"Linked Data" is about the ability to connect all our observations (data), perceptions (information), and inferences / conclusions (knowledge) across a spectrum of interaction media. And it just so happens that the RDF data model (Entity-Attribute-Value + Class Relationships + HTTP based Object Identifiers), a range of RDF data model serialization formats, and SPARQL (Query Language and Web Service combo) actually make this possible, in a manner consistent with the essence of the global space we know as the World Wide Web.
Posted at 22:32
Ruby on Rails still doesn’t have a good story to tell with regard to multi-model forms. Multi-model forms are HTML forms that have fields from more than one model, which the user edits and submits as one. Rails should take this single collection of fields, split it up into multiple models, and create, edit, or delete as necessary. Unfortunately, this still requires quite a bit of work.
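To show what "split it up into multiple models" means, here is a rough sketch of the grouping step. It is written in Python rather than Ruby purely as an illustration, and it is not Rails code: the function name is hypothetical, and the only thing borrowed from Rails is the familiar `model[attribute]` naming of form fields.

```python
import re


def split_by_model(fields):
    """Group flat form fields such as 'project[name]' by model name."""
    models = {}
    for key, value in fields.items():
        match = re.fullmatch(r"(\w+)\[(\w+)\]", key)
        if match is None:
            continue  # not a model field, e.g. the submit button
        model, attribute = match.groups()
        models.setdefault(model, {})[attribute] = value
    return models
```

Grouping is the easy part. Deciding which of the resulting records to create, update, or delete, and doing so for nested collections, is where the real work lies.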
I’ve collected a set of links that may help those who are new to this handle multiple models in HTML forms with Rails. None of these are the “official” answer to this complex question. However, it appears that Rails core will one day have an out-of-the-box solution to this. Read on.
Ryan Bates, of Railscasts fame, has probably the best known solution to this problem. He suggests Complex Forms Part 1, Complex Forms Part 2, and Complex Forms Part 3. However, as even he mentions, it doesn’t work with Rails 2. Refer to these screencasts, and their sometimes useful comments, as background study.
Ryan offers an alternative for Rails 2 and above. He extended his original screencasts, fixed them for Rails 2, and published them in the Advanced Rails Recipes book, published by the Pragmatic Programmers. You’ll want to check out Recipe 13, titled “Handle Multiple Models in One Form” (but you really want the whole book, lots of little gems in there).
An alternative to Ryan’s methods can be found at attribute_fu, a Rails plugin by James Golick. attribute_fu is similar to Ryan Bates’ code; however, it’s packaged as a plugin for easy install and use, and uses more conventions to cut down on code. Complete with form helpers and extensions to has_many, you should try this plugin. I don’t know if it works with deeply nested models, though.
In July of 2008, there was a glimmer of hope that the Rails core was going to get a blessed solution to this problem. Ryan Daigle, of Ryan’s Scraps, reported that Nested Model Mass Assignment was added into Rails! However, it was pulled by the core team because they felt it wasn’t ready for prime time. There was a lot of discussion about if and when Rails would have nested model mass assignment. Hopefully it will arrive after 2.2. This is a good discussion and links to other plugins or proposals to handle this tricky problem.
Ryan Bates followed the debate and collected and summarized the various different methods and proposals to handle nested model and nested forms with Rails.
I, for one, will be trying attribute_fu. I’ve used the Advanced Rails Recipes solution, which certainly works. However, it always felt like too much work for me.
Update: it seems that I wrote about handling collections of models with Rails back in 2007.
Posted at 04:15
Posted at 19:47
During some random Web surfing (something I don’t get nearly enough time to do these days), I ran into the Science blogging Challenge (aka “get a senior scientist blogging”) and it got me thinking about how I got blogging, and more recently how I got twittering (which seems to fit my insane life style better). I sent the following entry to the competition, nominating a few people who were instrumental in getting me blogging and more recently getting me to tweet.
Here’s what I said:
My motivation to start blogging actually came because of a
different senior scientist starting his blog — In Jan 06, one of my
colleagues started a blog - and it got some big notice — since the
blogger was Tim Berners-Lee that made some sense. My first
real blog (I had contributed blog comments and done an occasional
“guest shot” on other people’s blogs) was called “Time to get a
blog” and mentions the influence of Tim’s blogging.
I cannot tell you who convinced Tim to blog, but I know that Danny
Weitzner, whose blog is at http://people.w3.org/~djweitzner/blog/,
was one of the influences.
However, Tim’s starting to blog is the thing that got me to finally
do it, but the person who really got me blogging is Jennifer
Golbeck, (who blogs in a bunch of different places) who
is the one who convinced me to get my act together and walk the
walk if I was going to claim to be a Professor of All Things Web,
as I now try to be - she’s also the one who got me signed up on
orkut, facebook (beta) and a bunch of other social networking sites
long before it became popular - and if I’m not mistaken she’s
probably the person who got me my gmail invitation way back when -
so Jen should definitely be someone considered in the “I got a
senior scientist to blog” category.
Meanwhile, the propagation continues - Peter Fox, who attended this
past Sci Foo, and is an occasional blogger has joined my lab, and
he and I are trying to convince several of our colleagues, esp.
Deborah McGuinness, to get blogging.
I’d also like to point out that while blogging continues to be
interesting to look at as a mechanism for propagating science, I’m
finding these days that microblogging (I’m jahendler on twitter)
has been gaining popularity, especially among the Social Scientists
- and it may be an even better way for some of the busy senior
scientists you’re trying to reach out to (if they can just
learn to use the messaging on their cell phones). I credit
“eingang” (Michelle Hoyle - http://einiverse.eingang.org/) for
getting me twittering, and I notice that a quick message from my
phone during a lecture or seminar is a good way to share a thought
or a pointer (although I find it also is fun to add personal
observations and such - so it humanizes the scientists who use
it)
So anyway - there are three entries for the contest
Danny Weitzner for helping to get Tim Berners-Lee blogging
Jen Golbeck for getting me blogging
Michelle Hoyle for getting me micro-blogging
cheers
Jim H.
Posted at 15:28
Posted at 14:09
It’s popular to ask “What Would Google Do” these days; Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.
Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it and shows a short, but reasonably efficient, Python program that can easily do a graph with a few million nodes. While not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
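For readers who want a feel for the computation, here is a minimal power-iteration sketch of PageRank in plain Python. It is not Nielsen's program; the function name, the damping factor of 0.85, the tolerance, and the dictionary-based graph format are all assumptions chosen for illustration, and no attempt is made at the efficiency needed for millions of nodes.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    `links` maps each node to the list of nodes it links to.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# A tiny made-up graph: "c" is linked to by every other page.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

On this toy graph the page with the most incoming links ends up with the highest rank, and the page nobody links to ends up with the lowest.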
The post is part of a series Nielsen is making on the Google Technology Stack including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he’s giving starting earlier this month in Waterloo. Here’s the way that Nielsen describes the series.
“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”
Videos of the first two lectures (Introduction to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.
Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. In both cases, it is an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!
Posted at 16:15
Adding to the collection of Amazon EC2 AMI based knowledgebases already unveiled for DBpedia and NeuroCommons, we now have a Bio2Rdf knowledgebase AMI.
A community developed knowledgebase comprised of bioinformatics data from across 30 or so public data sources. The standard deployment of Bio2Rdf includes a federation of SPARQL endpoints provided by project members and collaborators.
An Amazon EC2 hosted variant of the Bio2Rdf knowledgebase. In addition to providing a SPARQL endpoint, the data exposed by the Amazon AMI is published in compliance with Linked Data publishing best practices espoused by the Linking Open Data community (LOD).
The ability to instantiate a personal or service-specific variant of this powerful knowledgebase via the Amazon EC2 Cloud. Instead of a 22+ hour error prone odyssey - you simply get down to the task of data analysis and integration within 1.5 hrs (when setting up your AMI for the first time).
Posted at 15:37
base
elementmetadata element. An SVG+RDFa distiller ought to
understand this RDF graph and merge it with the graph produced by
the regular RDFa processing.host=xml) option has
been introduced, although the distiller would work out of the box
for most of the SVG cases (ie, for those that do not make use of
those two features). As an example, I have updated the SVG version
of the horizontal SW cube
to SVG 1.2 Tiny. It uses the metadata element for the
description of the copyright statements, but reuses SVG’s
title and desc elements to generate the
corresponding dc:title and dc:description
RDF statements using RDFa’s @property attribute. Using
the RDFa distiller, one can get to the
RDF content. Cool…

Posted at 09:49
High quality videos of tutorials and talks from the Seventh International Semantic Web Conference are now available on the excellent VideoLectures.net site. It’s a great opportunity to benefit from the conference if you were not able to attend or, even if you were, to see presentations you were not able to attend.
Videolectures captured the slides for most of the presentations (which are available for downloading) and their site shows both the speaker’s video and slides in synchronization. Videolectures used three camera crews in parallel so were able to capture almost all of the presentations. Here are some highlights from the ~90 videos to whet your appetite.
Posted at 14:45
Posted at 14:37
Tom Briggs defended his PhD dissertation last month on discovering domain and range constraints in OWL and the final copy is now available.
Thomas H. Briggs, Constraint Generation and Reasoning in OWL, 2008.
The majority of OWL ontologies in the emerging Semantic Web are constructed from properties that lack domain and range constraints. Constraints in OWL are different from the familiar uses in programming languages and databases. They are actually type assertions that are made about the individuals which are connected by the property. Because they are type assertions these assertions can add vital information to the individuals involved and give information on how the defining property may be used. Three different automated generation techniques are explored in this research: disjunction, least-common named subsumer, and vivification. Each algorithm is compared for the ability to generalize, and the performance impacts with respect to the reasoner. A large sample of ontologies from the Swoogle repository are used to compare real-world performance of these techniques. Using generated facts is a type of default reasoning. This may conflict with future assertions to the knowledge base. While general default reasoning is non-monotonic and undecidable a novel approach is introduced to support efficient contraction of the default knowledge. Constraint generation and default reasoning, together, enable a robust and efficient generation of domain and range constraints which will result in the inference of additional facts and improved performance for a number of Semantic Web applications.
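The abstract's point that domain and range constraints are type assertions, rather than validity checks, can be seen in a few lines. The sketch below applies the standard domain and range entailments to a set of triples; the function name and the `ex:` terms are invented for the example, and it does not implement any of the dissertation's constraint generation techniques.

```python
def infer_types(triples, domains, ranges):
    """Derive (individual, class) type facts from domain and range axioms.

    `triples` is an iterable of (subject, property, object) tuples;
    `domains` and `ranges` map a property to its declared class.
    """
    inferred = set()
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, domains[p]))  # the subject gets the domain class
        if p in ranges:
            inferred.add((o, ranges[p]))  # the object gets the range class
    return inferred


# Made-up data: nothing states that ex:tom is an author.
facts = [("ex:tom", "ex:wrote", "ex:thesis")]
types = infer_types(
    facts,
    domains={"ex:wrote": "ex:Author"},
    ranges={"ex:wrote": "ex:Document"},
)
```

Where a database constraint would reject a row for having the wrong type, here the reasoner concludes that `ex:tom` is an `ex:Author`. That is why a property without domain and range loses potential inferences.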
Posted at 08:00
Disco is a
Python-friendly, open-source Map-Reduce framework for distributed
computing with the slogan “massive data - minimal code”.
Disco’s core is written in Erlang,
a functional language designed for concurrent programming, and
users typically write Disco map and reduce jobs in Python. So
what’s wrong with using Hadoop? Nothing,
according to the Disco site, but…
“We see that platforms for distributed computing will be of such high importance in the future that it is crucial to have a wide variety of different approaches which produces healthy competition and co-evolution between the projects. In this respect, Hadoop and Disco can be seen as complementary projects, similar to Apache, Lighttpd and Nginx.
It is a matter of taste whether Erlang and Python are more suitable for the task than Java. We feel much more productive with Python than with Java. We also feel that Erlang is a perfect match for the Disco core that needs to handle tens of thousands of tasks in parallel.
Thanks to Erlang, the Disco core is remarkably compact, currently less than 2000 lines of code. It is relatively easy to understand how the core works, and start experimenting with it or adapt it to new environments. Thanks to Python, it is easy to add new features around the core which ensures that Disco can respond quickly to real-world needs.”
The Disco tutorial uses the standard word counting task to show how to set up and use Disco on both a local cluster and Amazon EC2. There is also homedisco, which lets programmers develop, debug, profile and test Disco functions on one local machine before running on a cluster. The word counting example from the tutorial is certainly nicely compact:
    from disco.core import Disco, result_iterator

    def fun_map(e, params):
        return [(w, 1) for w in e.split()]

    def fun_reduce(iter, out, params):
        s = {}
        for w, f in iter:
            s[w] = s.get(w, 0) + int(f)
        for w, f in s.iteritems():
            out.add(w, f)

    results = Disco("disco://localhost").new_job(
        name = "wordcount",
        input = ["http://discoproject.org/chekhov.txt"],
        map = fun_map,
        reduce = fun_reduce).wait()

    for word, frequency in result_iterator(results):
        print word, frequency
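The same map and reduce logic can be tried with no cluster and no Disco installation at all. The sketch below is a plain Python 3 stand-in that runs both phases in one process; it does not use the Disco API or homedisco, and `run_local` is a hypothetical helper named for this example only.

```python
from collections import defaultdict


def fun_map(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    return [(w, 1) for w in line.split()]


def fun_reduce(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += int(count)
    return dict(totals)


def run_local(lines):
    """Run map over every input line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(fun_map(line))
    return fun_reduce(pairs)


counts = run_local(["a b a", "b c"])
```

What a framework such as Disco adds on top of this is the distribution: splitting the input across machines, shipping the functions to them, and collecting the results.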
Posted at 17:45
I wanted to leave a blog comment on the Clark and Parsia blog with respect to the entry Kendall wrote entitled “Our Approach to Modeling, Fidelity, and KR.” However, to leave such a comment I would have to log in, and I have way too many accounts right now, so I thought I’d write my response as a new entry (and by the time I finished, this was too long to be just a comment).
I don’t disagree with the overall “spectrum” that Kendall offers, but his point is that they have picked a point in the middle, and since they are in the middle they can model more than the scalers and scale more than the modelers. The problem is that the middle is very, very wide, and thus there are many places in this space that such a claim could be made. So, for example, a large triple store that can do a small amount of inferencing, say Garlik’s JXT as one example, would scale even better and could still be able to claim to do more modeling than a pure triple store.
On the other end is the idea that decidability is somehow a sweet spot (despite known exponential behaviors for DL) compared with a more highly modeled, but perhaps heuristic (or incomplete) logic. In this case the system could claim both to have more expressivity than a DL system and to be more scalable (it just couldn’t guarantee to have all the answers). In fact, right now the systems that probably have the highest score in modeling power vs. scalability would fall in this camp. The thing is their answer sets would be somewhat different.
In my opinion, the real problem with this blog entry is the idea that there is one sweet spot (Kendall called it the “sweet spot”) which implies that there is a general best answer. This is the point I cannot really live with, and have spent much of my recent career trying to debunk. Depending on what you are trying to do, there are many possible sweet spots. There are a set of problems for which what C&P are doing is exactly the right thing, but there are also many where they are not.
And that is the key thing: we in the field have to get much better at understanding where the tradeoffs are and what various kinds of applications require. Google taught us years ago that sometimes finding a good answer quickly can be an incredibly powerful thing. Expert systems taught us that for many applications complex modeling is too expensive. Yet there are systems running in real applications that are using expert level modeling, because sometimes it is the thing you need despite the cost (and the ROI is high enough).
The other problem I have with the argument made actually has nothing to do with the issues of logic and such. The traditional database community has for a long time made a similar claim, which is that there is a particular place in the expressivity/scalability space that is “the” correct place. They have spent years claiming that particular sweet spot is the only one that is interesting - it certainly has proven to be a very important one, making way more commercial success than the DL stuff. However, lately we’ve been learning that there exist problems where we need more expressivity, and thus other things have to be explored - the people in the DB community who’ve started looking at graph stores are, indeed, seeing that there are some applications, both in enterprises and especially on the Web, where the small amount of added expressivity makes a huge difference. (Anyone who has witnessed my debates with Ullmann has certainly heard this argued…)
Anyway, when I gave the first talk at the DARPA Agent Markup Language (DAML) program, lo these many years, I showed a slide with the word “THE” under a kill ring and stated that in the Web there is no the - and whether to the database community, the adherents of DL, the people who cite my work, or anyone else — remember you are exploring one sweet spot that can be important to some set of applications, but there are many others, and we all win when we remember that.
Cheers - Jim Hendler
p.s. Clearly this is not meant in any way to be an anti-C&P comment, I was just riffing off of what Kendall wrote.
Posted at 03:58
Posted at 08:46
For some people, the point of the Semantic Web is distributed, web-friendly knowledge management and knowledge representation. Generally we’re in that camp. But that camp breaks down into several factions, and it’s useful to be clear about which faction we’re in.
There is a spectrum that runs from Maximum Fidelity to Maximum Scalability. Given our roots in Description Logic, we lie somewhere in-between these two poles. Notice that I have intentionally avoided calling these “extremes”; they are endpoints, and perfectly respectable, useful ones, depending on who you are and what you’re trying to achieve.
The Max Fidelity folks want to model as closely as possible some world-chunk in as fine-grained and faithful manner as is possible. This often means that they are at least first order logic fans, and sometimes higher-order logic users. They debate edge cases, corner cases, alternate and competing semantics and logics in an attempt to ever more faithfully mirror reality. The price they pay is, generally, computability. For some use cases, that price is perfectly acceptable. For other use cases, that price is entirely too high, since the most perfect representation of the world is useless if you can’t practically compute with it—at least, that’s how Max Fidelity often looks to us.
At the far end of the spectrum we have Max Scalability folks, for whom the point of the Semantic Web is rather more the “Web” than the “Semantic” part. We might playfully call them the “semantic WEB” crowd, in order to reflect their ideal ratio. Here the point isn’t to model perfectly; but, rather, to do something with lots and lots of data, ideally Webfuls of data. This means, in the argot of current tech choices, that they tend to be RDF and Linked Data fans and users, since that’s just about the only approach to doing anything at all interesting with Webfuls of data. The price they pay, of course, is expressivity. For some use cases, that’s just fine, since you don’t always need a lot or even much semantic fidelity to get the job done. Sometimes we build applications for customers that take this approach. But, as above, for other use cases, this is simply a killer, because without enough or the right semantics, you don’t get the right kind of help from the machine in figuring out complex stuff.
So what do we have so far? First, we have a notional (and idealized) spectrum that runs from Webfuls of data to, roughly, at least first order logic. Second, we have obviously tons of interesting use cases at (probably) every point along this spectrum. And, third, we have the suggestion that we aim for some kind of sweet spot in the middle—where “sweet spot” and “in the middle” are not absolute notions, but are interest-relative and goal-specific, and where the interests and goals we care about are, surprise-surprise, ours.
(In other words, I’ve set up a little fantasy where we are the Heroes, where we naturally occupy the “sweet spot”, but then, since I’m not a complete jerk, I’ve ironized or called into question that very fantasy in an effort to suggest that we, just like everyone else, try to spin things to make ourselves look smart, cool, and useful.)
And—will miracles never cease?—that’s just about where Description Logic fits along such an idealized spectrum. Technically, it’s the decidable subset of first order logic, which means that we try to balance Fidelity and Scalability in a way where we can get some of both.
The Max Fidelity folks are forever poking us with sticks to the effect that we can’t model world-chunks nearly as faithfully as they can. Well, no crap, of course we can’t! Then the Max Scalability folks poke us with different sticks to the effect that we can’t scale to Webfuls of data—again, no duh!
And then we poke back at both camps—hey, they started it!—to the effect that we can model far better than Max Scalers and we can scale far further than Max Fideliters (yes, I just made that word up…Rock!)...
Finally, a word about how this positioning issue plays out in our approach to modeling. In short, we model such that we get the right inferences, since getting the inferences is typically what our kind of applications (analysis, decision support kinds of apps, in short) are all about. So that means some edge or corner cases, even if they fit into DL, get ignored or dropped out or even distorted when there’s no point, given requirements analysis, to fidelity for its own sake. And it means, on the flip side, that we don’t worry too much that inference over Webfuls of data is not realistically achievable anytime soon. Fast enough for the customer’s data is sufficient scalability in most cases for us.
Posted at 16:41
The perceived utility of automated reasoning for a wide range of applications matters to us greatly, which makes sense, given that our biz proposition is “semantic infrastructure OEM”. In other words, we’re trying to make money by licensing reasoning infrastructure, and related pieces, for semantic applications to other developers to use in their apps. With the right APIs and tool maturity, as well as supporting materials, our customers should be able to treat automated reasoning as a black box—not a black art.
A problem with demonstrating automated reasoning’s utility is that automated reasoning is complex, with non-trivial logical background and framework, including oodles of domain-specific vocabulary. Another problem is that automated reasoning is, in the end, just a kind of mechanical term rewriting often according to, considered individually, quite trivial rules. (Pellet isn’t really a rules engine, but we’ll talk about that another time.)
That means that for toy cases, which is what most people new to the subject are ready for, it seems dull and unimpressive. And for the hard cases? Well, most people aren’t ready for hard cases, so they simply tune out. And who can blame them, really? It’s like my example about Emma and Jack. I mean, that example really sucked, but what’s the alternative?
This is not an easy problem to solve.
My approach, rather than showing more toy or real examples, is just to talk about the utility of automated reasoning in plain language, in an attempt to communicate not so much specific details as the general mindset or approach to solving particular sorts of problems using automated reasoning. This approach to marketing mirrors our approach to technology development: both are iterative and experimental, but not just for us. As the man said, even a blind pig occasionally finds an acorn.
Posted at 16:21
I don’t have slides for my time at the APQC in Houston, since I was not slated to present, so there is no cool slide widget with my presentation in this post. I was merely there to observe and learn, and maybe answer some questions about POPS.
As Kendall mentioned, POPS was nominated as a best practice as part of NASA JPL’s overall efforts in Knowledge Management. The meeting at APQC was for all the nominees to give a short talk and to hear the overall findings of the study conducted by APQC, which in this case was on Expertise Location and Social Networking.
I got to see some great presentations by folks from IBM, Sun, Pratt-Whitney, Rockwell Collins, and Mitre and get a lot of insight into what they’re doing with Expertise Location and Social Networking; challenges they faced in the past, lessons learned, and what they’re doing now, and in the future, to continue their efforts in these areas.
It was a great experience, the people from APQC were fantastic, very friendly and put on a great event, and all the nominees and study partners, a group which included L3, Marathon Oil, ExxonMobil, and Wyeth, were all great and added a lot to the discussions.
Hopefully I get the chance to participate or work with APQC in the future.
Posted at 16:03
@property attributes with
XML subtrees are handled. It is a bit of an edge case; we realized
that there is not even a test case for this in the RDFa test suite:-(
But there are some nice use cases that indeed went wrong (like the
one by Florian Schmedding). (You can look at the
mail archive for further details if interested.)
Posted at 13:28
eigenfactor.org is a fascinating site that is exploring new ways to measure and visualize the importance of journals to scientific communities. The site is a result of work by the Bergstrom lab in the Department of Biology at the University of Washington. The project defines two metrics for scientific journals based on a PageRank-like algorithm applied to citation graphs.
“A journal’s Eigenfactor score is our measure of the journal’s total importance to the scientific community. With all else equal, a journal’s Eigenfactor score doubles when it doubles in size. Thus a very large journal such as the Journal of Biological Chemistry which publishes more than 6,000 articles annually, will have extremely high Eigenfactor scores simply based upon its size. Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson’s Journal Citation Reports (JCR) is 100.
…
A journal’s Article Influence score is a measure of the average influence of each of its articles over the first five years after publication. Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to Thomson Scientific’s widely-used Impact Factor. Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00.”
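To make the two normalizations in the quote concrete, here is a rough Python sketch. It is not the Eigenfactor project's actual algorithm: it assumes a plain damped PageRank-style iteration over a made-up citation graph, and the function names, the damping value, the journal names, and the article counts are all hypothetical. It only illustrates scaling journal scores so they sum to 100 and scaling per-article scores so the mean article comes out at 1.00.

```python
def journal_scores(citations, damping=0.85, iterations=100):
    """PageRank-style scores over a journal citation graph, scaled to sum to 100.

    citations[a][b] is the number of citations from journal a to journal b.
    This is a generic damped random walk, an assumption for illustration,
    not the published Eigenfactor method.
    """
    journals = sorted(set(citations) | {t for ts in citations.values() for t in ts})
    n = len(journals)
    rank = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        new_rank = {j: (1.0 - damping) / n for j in journals}
        for source in journals:
            # Ignore self-citations so a journal cannot boost itself.
            targets = {t: c for t, c in citations.get(source, {}).items() if t != source}
            total = sum(targets.values())
            if total == 0:
                # A journal that cites nothing spreads its weight evenly.
                for j in journals:
                    new_rank[j] += damping * rank[source] / n
            else:
                for t, c in targets.items():
                    new_rank[t] += damping * rank[source] * c / total
        rank = new_rank
    total_rank = sum(rank.values())
    return {j: 100.0 * r / total_rank for j, r in rank.items()}


def per_article_scores(scores, article_counts):
    """Per-article score, normalized so the mean article scores 1.00."""
    total_articles = sum(article_counts.values())
    mean_per_article = sum(scores.values()) / total_articles
    return {j: (scores[j] / article_counts[j]) / mean_per_article for j in scores}


# Hypothetical toy data: four journals and their citation counts.
citations = {
    "A": {"B": 3, "C": 1},
    "B": {"C": 2},
    "C": {"B": 1},
    "D": {"B": 1},
}
article_counts = {"A": 50, "B": 200, "C": 100, "D": 50}

scores = journal_scores(citations)
influence = per_article_scores(scores, article_counts)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2), round(influence[name], 2))
```

Dividing by the article count is what separates the two numbers: a large journal can have a high total score while its per-article score stays near or below 1.00.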
For example, here are the ISI-indexed journals in the AI subject category ranked by the Article Influence score for 2006.
The site makes good use of GoogleDoc’s motion charts to visualize the changes of metrics for top journals in a subject area. You can also interactively explore maps that show the influence of different subject categories on one another as estimated from journal citations.

The details of the approach and algorithms are available in various papers by Bergstrom and his colleagues, such as
M. Rosvall and C. T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proceedings of the National Academy of Sciences USA. 105:1118-1123. Also arXiv physics.soc-ph/0707.0609v3 [PDF]
(spotted on Steve Hsu’s blog)
Posted at 06:01