It's triples all the way down
The last few months have been challenging in terms of the amount of work to get done, in focusing on deliverables, and in getting ready for the release of the conStruct and structWSF source code, documentation, tutorials, web sites and demos.
I am now really happy to finally announce the release of the source code for both projects, along with a new development community website where users and developers can exchange ideas about these two new projects.
The biggest milestone of the last months is now behind us. However, this is just the beginning of everything!
I think that much has already been written about these two projects, and I don't want to write a tutorial at this point. So the only thing I will do right now is point you to the most relevant documentation, web sites, blog posts and demos for each project. The next step will be to write about specific use cases, features, etc.
The community Web site is a place where developers and users of structWSF and conStruct can meet to talk about both projects, to report bugs and issues, to submit new enhancements, to find tips and tricks, etc.
I suggest you create a user profile on the community Web site if you are interested in communicating with other members.
structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).
The structWSF middleware framework is fully RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services for CRUD, browse, search, export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and, optionally, a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.
conStruct is a distro of the Drupal framework that aims to set a new standard in data integration and as a structured content system (SCS). With conStruct, you can let your data and its structure drive your applications. You can easily interoperate your diverse internal information with public content on the Web. And you can leverage a platform designed from the ground up for knowledge management and collaboration.
Posted at 19:59
ComputerWorld has an article on the “nosql” movement and a recent nosql meetup held in San Francisco: “No to SQL? Anti-database movement gains steam”. Nosql systems are distributed, non-relational data stores that typically use a simple key-value approach to indexing and retrieving data and use a simple procedural query API rather than a sophisticated declarative query language.
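To make that contrast concrete, here is a minimal sketch that is not taken from any of the systems presented at the meetup. The KeyValueStore class and its put/get methods are hypothetical stand-ins for a key-value store's procedural API; the relational half uses Python's built-in sqlite3 module purely as an example of a declarative query.

```python
import sqlite3


class KeyValueStore:
    """Toy in-memory stand-in for a key-value store (hypothetical API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


# Key-value style: the application addresses one record by its key
# and navigates the returned value itself.
kv = KeyValueStore()
kv.put("user:42", {"name": "Ada", "city": "London"})
kv_result = kv.get("user:42")["city"]

# Relational style: the application declares what it wants in SQL
# and the database decides how to find it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?, ?)", (42, "Ada", "London"))
sql_result = conn.execute(
    "SELECT city FROM users WHERE id = ?", (42,)
).fetchone()[0]
conn.close()
```

Both halves retrieve the same value; the difference is whether the lookup logic lives in the application (key-value) or in the query language (relational).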
“The inaugural get-together of the burgeoning NoSQL community crammed 150 attendees into a meeting room at CBS Interactive. Like the Patriots, who rebelled against Britain’s heavy taxes, NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.
“Relational databases give you too much. They force you to twist your object data to fit a RDBMS [relational database management system],” said Jon Travis, principal engineer at Java toolmaker SpringSource, one of the 10 presenters at the NoSQL confab (PDF). NoSQL-based alternatives “just give you what you need,” Travis said.”
There were presentations on nine different ‘nosql’ databases: Voldemort, Cassandra, Dynomite, HBase, Hypertable, CouchDB, VPork, MongoDB, as well as general presentations by Google’s Jonas Karlsson and Cloudera’s Todd Lipcon.
Johan Oskarsson of Last.fm wrote a debriefing post on his blog.
“The relatively young but rapidly growing “nosql” community met last Thursday in San Francisco. The idea was to give attendees a solid introduction to how distributed, non relational databases work as well as an overview of the various projects out there.”
and provides links to the presentation slides and videos. You can also search for NOSQL on Vimeo to get the videos.
I learned of this meeting on Hacker News, where you can find some interesting comments.
Of course, there are many popular key-value stores that are not designed to support the highly scalable, distributed needs of many Web applications. I found, for example, that as a persistent RDF store for rdflib, Sleepycat outperformed MySQL.
Posted at 14:17
Back from vacation, my thesis (PDF, CC BY-NC-ND licence) and the slides from the defence (which went very well ;-) are finally online.
Thanks again to everyone who came, encouraged me or asked questions, and to those I was able to work with over the last few years on this thesis... the adventure is only beginning!
Posted at 23:50
On Monday of this week I attended a hearing in New York City organized by the Technology and Government Committee of the New York City Council. On the agenda was a proposal (Int. No. 991) regarding the use of open standards for publishing New York City government data. I picked up a printed copy of the proposal and a summary when I walked into the hearing. To my surprise, the handout referred to W3C by name (the online proposal does not) and included a reference to the eGovernment Interest Group's recent publication, Improving Access to Government through Better Use of the Web.
So I filled out a form requesting to speak. To my surprise, the Chair invited me to testify early in the hearing.
Before I spoke, however, a representative from the Mayor's Office voiced opposition to some specifics of the proposal. Earlier that day, at the Personal Democracy Forum elsewhere in the city, the Mayor himself announced several initiatives regarding publishing government data. This had generated some excitement, and a number of people who had been attending the conference (I had not) were present at the hearing.
The Mayor's Office cited 5 or 6 reasons why it opposed the particular proposal (which I trust will appear in the public record that I've not yet located) but the main ones I recall were cost and burden. I would paraphrase some of the exchange between the city council committee and the Mayor's office as follows:
W3C's eGovernment Interest Group has been working with a growing number of agencies to gather information that will help address these sorts of concerns. Now they will develop best practices and guidelines for publishing government data. This is not an area I know well, so I look forward to being able to refer to the eGov IG's findings. However, I'm sure New York City is not the first government to wrestle with the technology, the cultural issues ("why should I publish my data?"), and how to use taxpayer money to do this.
When my turn came to speak, I said something like this:
I hope my summary here is backed up by the public record.
Posted at 22:51
By Tom Scott | This guest post originally appeared on Tom Scott’s blog; republished under Creative Commons License, and with kind permission of the author.
It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other things are going on, all rather exciting. Below is a round-up of some of the best. But if you don’t know what I’m talking about, you might like to start off with TimBL’s talk at TED.
TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:
The Guardian has an article on the appointment.
Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009), looking at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. It won best paper in the in-use track; congratulations to Silver and Georgie.
The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL, the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.
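As a rough illustration of what querying such an endpoint involves: the SPARQL protocol lets a client send the query text as a "query" parameter on an HTTP GET. The sketch below only builds that URL and does not contact any server. The endpoint address is a placeholder (an assumption, not one of the BBC endpoints), and the query simply asks for any ten triples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder address: substitute the URL of a real SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# A deliberately generic query: return any ten triples in the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL query (query text URL-encoded)."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching the resulting URL with any HTTP client would then return the result set in whatever formats the endpoint supports.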
A social semantic BBC?
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen”, all built on people’s existing social graph. People meet content via activity.
PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and, yes, even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!
sameas.org, a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly, it’s also licensed under CC0, which means all copyright and related or neighboring rights are waived.
Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr
Posted at 13:45
In my previous article I discussed Microsoft's new semantic search engine. In this article I will define precisely concepts such as "semantic search" and "semantic search engine", and I will give examples of their advantages over conventional search engines, as well as of their current limitations.
Many of today's search engines are keyword-based. That is, the user enters the relevant words of the search ("Albert Einstein" and "Nobel", for example), and the application returns all documents that contain those words. Section 3.2 of El futuro de la Web (http://www.javahispano.org/tutorials.item.action?id=55) sets out the disadvantages of these search engines. Two are the most important:
A study by David Hawking and several other researchers evaluated 20 conventional (keyword-based) search engines using 54 queries. The percentage of relevant results after inspecting the first 20 web pages returned was 0.5% for the best search engine (Northern Light), and Google was the second most precise. So the popularity of keyword-based search engines has little to do with their precision and a lot to do with the ox-like patience of their users.
A semantic search is a query that takes into account the context, and therefore the meaning, of what is being asked (and not only the words of the query), with the aim of avoiding ambiguity both in the queries and in the text of the documents being searched. For example, a semantic search with the words "discoverer" and "penicillin" would return documents about Alexander Fleming, even if those two terms did not appear in them, because it would identify the concepts that structure the search (penicillin is a product whose discoverer is sought or, put more formally, Medicine(Penicillin) hasInventor Person(Alexander Fleming)). The ultimate aim of semantic search is to let users formulate more precise and expressive searches that yield relevant results with minimal intervention on the user's part.
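The penicillin example can be sketched as a lookup over subject-predicate-object triples. This is only an illustration under stated assumptions: the tiny knowledge base, the "hasInventor" predicate (an English rendering of the tieneInventor relation in the text) and the helper functions are all hypothetical, and a real semantic search engine would first have to map the user's words onto such concepts.

```python
# Hypothetical mini knowledge base of (subject, predicate, object) triples.
TRIPLES = [
    ("Penicillin", "type", "Medicine"),
    ("Alexander Fleming", "type", "Person"),
    ("Penicillin", "hasInventor", "Alexander Fleming"),
]


def match(subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def discoverer_of(product):
    """Answer 'who discovered X?' by following the hasInventor relation."""
    return [o for (_, _, o) in match(subject=product, predicate="hasInventor")]


answer = discoverer_of("Penicillin")
```

The point of the sketch is that the answer comes from the relation between concepts, so a document about Fleming can be found even if it never contains the words "discoverer" or "penicillin".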
It is generally accepted that semantic searches rely on techniques for extracting information by means of ontologies (see http://www.wshoy.sidar.org/index.php?2005/12/09/30-ontologias-que-son-y-para-que-sirven) or metadata. Ontologies make it possible to define domains of interest (scientific theories, for example) formally and with enough expressive richness for users to specify their searches in considerable detail, either before running the query or while it runs.
From a technical point of view, a semantic search engine is an application that understands users' searches and the text of web documents by using algorithms that simulate comprehension or understanding, and that on that basis provides correct results without the user having to open a document and inspect it personally. A search engine of this kind recognises the correct context for the search words or sentences. Google and Yahoo are not semantic search engines, because they rely fundamentally on algorithms that generate statistics from words and links, not on cognitive algorithms that capture the knowledge implicit in words and their context. For example, a search such as "Who was Uranus?" in either of those engines will return results related to the seventh planet of the Solar System, when the purpose of the search is clearly to find information about the primordial god of the sky in Greek mythology.
Semantic search engines cannot always get the meaning of a polysemous word right the first time. They therefore need means of disambiguation to determine the exact sense the word has in the search. For example, a semantic search engine that internally uses ontologies covering computing concepts and means of transport needs tools to determine what the user means by the word "bus" in a query, since it can mean a road vehicle or a "digital system that transfers data between the components of a computer or between computers". To do so, it can choose the most probable meaning, ask the user to choose among several options (as the Hakia search engine does, presenting options drawn from its ontology), or use the other words of the search to infer the exact meaning of "bus" in that context (for example, in a query such as "What time does the bus to Soria leave Madrid this Friday?").
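The third strategy, using the other words of the query, can be sketched as a simple overlap count between the query and cue words attached to each sense. Everything here is an assumption made for illustration: the sense inventory, the cue words and the scoring rule are hypothetical and far cruder than what a real engine such as Hakia would use. When there is no evidence or a tie, the function returns None, which corresponds to the second strategy of asking the user.

```python
# Hypothetical sense inventory: each sense of a word lists cue words
# whose presence in the query makes that sense more plausible.
SENSES = {
    "bus": {
        "vehicle": {"leave", "friday", "madrid", "soria", "ticket", "station", "time"},
        "computer bus": {"data", "cpu", "memory", "component", "transfer", "usb"},
    }
}


def disambiguate(word, query):
    """Pick the sense whose cue words overlap most with the query.

    Returns None when no cue word matches or when two senses tie,
    so the caller can fall back to asking the user.
    """
    tokens = {t.strip("?.,!¿").lower() for t in query.split()}
    scores = {sense: len(tokens & cues) for sense, cues in SENSES[word].items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


travel = disambiguate("bus", "What time does the bus to Soria leave Madrid this Friday?")
hardware = disambiguate("bus", "data transfer over the memory bus")
unclear = disambiguate("bus", "bus")
```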
Because a semantic search engine relies on algorithms that simulate the understanding of words and therefore establish relations among them, it can run searches of interest to the user even if the search words or expressions do not appear in the documents returned. For example, a semantic search engine given the word "marsupial" would show documents containing terms such as these: kangaroo, koala, New Guinean quoll, monito del monte, rat-kangaroo, opossum, tlacuache, Tasmanian devil. As this example shows, semantic searches are far superior to keyword-based ones: one can find documents of interest that one would never find by searching with keywords. Moreover, someone looking for information on different species of marsupials would not need to phrase the query in different ways, with the name of each species, to obtain the desired information.
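One simple way to get the marsupial behaviour is query expansion over a taxonomy: the search term is replaced by itself plus every narrower term before matching documents. The sketch below assumes a hand-written toy taxonomy and three made-up documents; the names NARROWER, expand and search are hypothetical, and plain substring matching stands in for real text analysis.

```python
# Hypothetical toy taxonomy: concept -> directly narrower concepts.
NARROWER = {
    "marsupial": ["kangaroo", "koala", "opossum", "Tasmanian devil"],
    "kangaroo": ["red kangaroo"],
}

# Made-up documents; none of them contains the word "marsupial".
DOCS = {
    "doc1": "The koala sleeps for most of the day.",
    "doc2": "Keyword search engines rank pages by links.",
    "doc3": "A red kangaroo was seen near the road.",
}


def expand(term):
    """Return the term plus all transitively narrower terms."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(NARROWER.get(current, []))
    return seen


def search(term):
    """Return ids of documents mentioning the term or any narrower term."""
    terms = [t.lower() for t in expand(term)]
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if any(t in text.lower() for t in terms)
    )


hits = search("marsupial")
```

A plain keyword search for "marsupial" over the same documents would return nothing, which is the gap the paragraph above describes.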
The lack of structure and semantic annotation in web resources (Word documents, PDFs, HTML pages, etc.) forces semantic search engines to analyse resources with cognitive algorithms, word by word and sentence by sentence, in order to map words and sentences to ontological concepts. These algorithms are slow and require human supervision. That is why semantic search engines do not yet cover as many web resources as conventional ones, which use statistical algorithms that are much faster and fully automated. This limitation will disappear as the cognitive algorithms improve or once the "semantic islets" join up to form the Semantic Web or, at least, "semantic continents".
"The Semantic Web will never exist," I hear in the distance. "It is as unlikely to work as Leonardo da Vinci's flying machines." I have two objections to that opinion. One: pessimism has no future. Two: there was a time, not long ago, when syntactic interoperability was thought to be impossible without enormous investment, and almost everyone bet that there would be no single winning horse in the race among data interchange languages. They were wrong. And some lost their shirts.
In the absence of the Semantic Web, some have already got down to work. There are semantic search engines that already work by structuring the information that is later accessed through searches. For example, Freebase (http://www.freebase.com/), a social search engine, uses a graph database to define its data structure as a set of nodes and a set of links that establish relations between the nodes. According to the official Freebase documentation, what distinguishes Freebase from other databases is that any topic can be accompanied by many different kinds of information. The example they give is very clear: "For example, Arnold Schwarzenegger could appear as an actor in a movie database, as a governor in a politics database, and as Mr. Universe in a bodybuilding database. In Freebase, there is only one topic for Arnold Schwarzenegger, which contains information about all three facets of his public life. The unified topic acts as an information hub, making it easy to find and contribute information about him, regardless of what kind of information it is."
In principle, semantic search engines could avoid junk pages, which proliferate on the web like weeds in an abandoned field. Because they take into account the context of the words or phrases in documents, they could discard those pages straight away. For example, a web page that includes the phrase "semantic web" surrounded by sentences about how to increase sexual potency, erotic toys and easy sex in some faraway country with relaxed customs would be removed from any search about the Semantic Web or would be given very low relevance, since the context of those sentences (sex) has nothing to do with the Semantic Web.
That a search engine accepts questions in natural language ("What is the weather like in Vienna right now?") and answers them correctly does not necessarily mean it is a semantic search engine: it may merely translate natural-language questions into queries against a database.
For the moment, almost all semantic search engines allow searches only in English, although they are being extended to support other languages. Apart from the predominance of English, this is also due to the inherent difficulty of capturing knowledge of natural languages in data structures that allow fast, scalable searches (arrays, lists, stacks, queues, trees, graphs, etc.). For example, the Hakia search engine uses a vocabulary in the form of an ontology that includes some 100,000 senses of English words, and that number will keep growing as the application is refined. Building any vocabulary of that size is a slow, tedious and very expensive undertaking, and one that must be carried out by a well-coordinated team of linguistics specialists.
Anyone who thinks that, given an ontology of English word senses, converting it into an ontology for another language is simple would be mistaken: converting linguistic ontologies from one language to another is a very complex process that requires constant supervision by a team of translators. To give an example, if we want to go from a linguistic ontology in Spanish to one in German, we must consider all the possible German translations of each Spanish word; otherwise, the results of searches in German will be more limited than those of searches in Spanish. A simple, unambiguous Spanish word such as "automóvil" can be translated into German as "Auto", "Wagen", "Kraftwagen", "Kraftfahrzeug", "Automobil", "Motorfahrzeug" or "KFZ" (there are surely more translations, but that is as far as my basic German goes).
In a semantic Spanish-German cross-lingual search, all of these words would have to be taken into account to find every relevant document when someone types "automóvil" into the search engine. (Cross-lingual searches are those in which a search in one language is translated into another language, and the results are translated back into the first language. Google is working on adding this kind of search to its search engine, which will, among many other things, let a Spanish speaker book tickets for museums and cinemas in Tokyo even if the information on opening times and ticket sales is not available in Spanish.)
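A minimal sketch of the first half of such a cross-lingual search, assuming a hand-built lexicon: the Spanish query word is expanded into every known German translation, and a German document matches if it contains any of them as a whole word. The ES_DE dictionary holds only the translations listed in the text; the three German documents and the function names are made up for illustration, and translating the results back is left out.

```python
import re

# Hypothetical Spanish -> German lexicon, limited to the translations
# of "automóvil" listed in the text.
ES_DE = {
    "automóvil": [
        "Auto", "Wagen", "Kraftwagen", "Kraftfahrzeug",
        "Automobil", "Motorfahrzeug", "KFZ",
    ],
}

# Made-up German documents.
DOCS_DE = {
    "a": "Der Wagen steht in der Garage.",
    "b": "Das Fahrrad ist neu.",
    "c": "KFZ-Versicherung online vergleichen.",
}


def tokens(text):
    """Split text into lower-cased word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def cross_lingual_search(query_es, docs_de):
    """Return ids of German documents matching any translation of the query."""
    targets = {t.lower() for t in ES_DE.get(query_es.lower(), [])}
    return sorted(
        doc_id for doc_id, text in docs_de.items() if targets & tokens(text)
    )


matches = cross_lingual_search("automóvil", DOCS_DE)
```

If the lexicon contained only "Auto", both matching documents here would be missed, which is the loss of recall the text warns about.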
Semantic search engines are likely to change the way information is searched for and displayed, and to mean a big change for occasional users. Consider, for example, the interfaces shown in the following screenshots, from Mnemo (http://www.mnemo.org/), KartOO (http://www.kartoo.com/) and KoolTorch (http://www.kooltorch.com/).
Posted at 07:33
At Semantic Technology 2009 we formally announced PelletDb, our new product that integrates Pellet with Oracle’s Semantic Database system, including the Oracle RDF query engine and OWL reasoner. We’re excited about PelletDb since it makes Pellet available to Oracle users, including its sound and correct OWL 2 reasoning, unique reasoning services like SPARQL-DL and explanations, etc. But we’re also excited because it makes Oracle’s enterprise-class information management facilities available to Pellet users and apps.
Today we’re releasing an extensive PelletDb whitepaper (PDF) that explains in detail what PelletDb is, how it works, who should use it, etc. It includes customer benefits, sample code, and a basic roadmap for future development. If you’re curious about how we’re fusing Pellet and Oracle, check out the whitepaper.
The PelletDb limited beta is on track to begin 15 July, so please get in touch if you want to participate.
Posted at 20:28
SemTech 2009, along with W3C's significant participation in it, is now behind us. Besides catching up on emails, I have spent the past week reflecting on the enthusiasm, presentations, and flurry of activities that constituted this year's event in San Jose, 14 to 18 June.
One strong feeling I had while in San Jose was a sense of déjà vu in the Web world. Stepping back, I realize that 2009 feels a lot like 1999, when I was consulting with Allaire (remember CFML and ColdFusion?) and attended their user group meetings teeming with enthusiastic Web developers with war stories about their successes and failures bringing Web development servers into organizations of all types and sizes.
Ten years ago, many enterprises were just getting onto the "e-commerce bus," having been either eclipsed or inspired by the likes of innovative Web-centric companies such as Amazon.com and eBay who launched in 1995, or early-adopter retailers like JCPenney whose understanding of the catalogue business put them online faster than many other retailers, or businesses for that matter. Many mainline companies were in various phases of their Web evolution in 1999 -- from brochureware to intranets to pilot customer-facing interactive sites. And keep in mind that ten years ago, Google was barely two.
In 1999 there was also a wide cross-section of skill sets and diversity of understanding about what the Web was, how it worked, and what people and tools to trust to bring one's vision onto the Web. I remember sitting in focus groups with a number of HTML Web designers who were impatient with their more senior corporate IT colleagues who insisted on clear roadmaps, risk assessments and cost-benefit analyses for the Web-based tools and technology solutions their companies were considering.
The Java developers, engineers and system architects in other discussion groups also weren't too keen on the irreverent attitudes and huge amounts of money being thrown at these young people, who just a few years earlier were teenagers playing video games at the arcades. But understanding and trust continued to build, innovation accelerated, communities with technical skills increased, and revenues skyrocketed as a direct result of vendors developing and companies embracing new Web technologies.
We fast forward to 2009 and see similar dynamics with Semantic Web technologies. There are the early adopters and evangelists who have already climbed aboard the "RDF-bus," understand what's possible with W3C's Semantic Web technology standards, and can point to impressive results in new tools, pilot projects and even robust deployments within organizations, governments, and enterprises.
Yet skeptics remain both in terms of understanding the paradigm shift that the Semantic Web brings, just as the early Web challenged the status quo, and in the legitimate need for better tools and long-term architectural considerations for how to successfully deploy Semantic Web technologies in large enterprises.
Like the early Web and the W3C standards and subsequent commercial tools, products and services that enabled its rapid growth, the W3C Semantic Web stack is highly stable today. The accelerating uptake of W3C Semantic Web standards, new tools and applications were part of the buzz at this year's Semantic Technologies Conference.
In addition to hearing and seeing many new use cases and case studies, the call for commercialization was clear, as was the amount of enthusiasm among the technologists doing good and exciting work. The community's call to publish and link data in RDF or RDFa is clearly being heard, with The New York Times joining the ranks of large data holders eager and willing to publish to the Linked Open Data Cloud.
Finally, the number of Semantic Web communities flourishing in cities coast to coast across North America and in Europe is another healthy sign that the growth and adoption of Semantic Web technologies have not only "crossed the chasm" (in keeping with Geoffrey Moore's model), but have spawned strong beachheads of support among highly skilled technology professionals across business, industry, and government sectors.
It is my hope that at next year's Semantic Technologies Conference -- which is changing venues to San Francisco -- we will point to an even higher coordinate on the adoption curve and see amazing new results and impact from the use of W3C Semantic Web technologies. If I were Jean Luc Picard, I would say, "Make it so." But for now, I'll continue in my role of education and outreach for W3C.... I look forward to seeing many of you throughout the year and at next year's conference!
Posted at 13:38
Two weeks ago at the Semantic Technology 2009 conference, Evren and Mike presented a four-hour tutorial about building OWL-based applications with Pellet 2. About 50 people attended, which was a surprising turnout given that it was at the rump end of the conference, a notoriously difficult time slot.
After some polishing based on feedback, we’re making the tutorial materials, including sample code, slides, and a bundled download of Pellet, available for use in learning (or teaching others) how to use Pellet, both interactively and programmatically.
Enjoy!
Posted at 19:39
LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000 (1,068 million triples) on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G of 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.
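The quoted rate follows directly from the triple count and the elapsed time. The short calculation below is only a sanity check: it uses the rounded figure of 1,068 million triples, so it lands within a few triples per second of the reported 110,532 rather than exactly on it. The per-partition figure is a simple division by 8 and assumes the load was spread evenly across the processes.

```python
# Figures quoted in the text (triple count is the rounded value).
triples = 1_068_000_000
elapsed_seconds = 161 * 60 + 3   # 161m 3s

# Overall load rate in triples per second.
rate = triples / elapsed_seconds

# Average rate per partition, assuming an even split over 8 processes.
per_partition = rate / 8

print(f"{elapsed_seconds} s elapsed")
print(f"{rate:,.0f} triples/s overall")
print(f"{per_partition:,.0f} triples/s per partition")
```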
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
In comparison, Bigdata reported 200K triples-per-second for the first 8000 LUBM universities on a 15 blade box. We expect to do about that much on one new dual Xeon board; we’ll publish this when this is done.
We think that LUBM loading is not a realistic real-world benchmark, but since other people publish such numbers, so do we.
Posted at 16:12
The Unit Social Software (USS) in DERI is currently looking for Ph.D. candidates. Applications must be sent by the end of the week to hr.ie@deri.org and positions will start in September.
More details in the ad below:
The Unit Social Software (USS) at the Digital Enterprise Research Institute (DERI, http://www.deri.ie/) of the National University of Ireland, Galway invites applications for a four-year, fully funded PhD fellowship position.
DERI is a leading research institute in semantic technologies that offers a stimulating, dynamic and multi-cultural research environment, excellent ties to research-groups worldwide and standardization bodies, close collaboration with industrial partners and up-to-date infrastructure and resources.
The DERI Unit Social Software focuses on the convergence of Social Software and the Semantic Web by developing models and tools that support and take advantage of these two trends. Achievements of DERI USS include SIOC (Semantically-Interlinked Online Communities) and a large number of publications and tutorials on the topic in international venues and journals. USS research is performed in collaboration with other DERI units and industrial partners. The PhD position is funded by Science Foundation Ireland (http://sfi.ie) within the Lion2 project and offers the successful candidate an annual stipend, course fees and conference travel when presenting.
Applicants should have a strong interest in Social Software, Semantic Web and Web Science in general and hold an excellent primary degree or Masters qualification in a relevant discipline (e.g. computer science, information science, knowledge representation), with an emphasis on practical aspects of research (e.g. industrial project experience, ontology development and open-source software development being distinct advantages). Selected candidates are expected to be willing to combine formal scientific work with application-oriented research and development in projects funded by national and international (EU) funding agencies, as well as to participate in open-source projects and standardization activities.
Please submit your application (including cover letter, relevant publications or software implementation, full CV and contact details for two referees) to hr.ie@deri.org by 5pm on Friday, July 3rd with the subject line 'PhD Position - DERI USS'. Candidates will be contacted in the first week of July and interviews will be then conducted for successful applications. For further information please contact Alexandre Passant (alexandre.passant@deri.org) and John Breslin (john.breslin@deri.org).
Posted at 15:24
Over the past few months we have seen an impressive increase in Semantic Web Meetups all over the world. More and more aficionados enjoy this informal and decentralized way of networking with the local community and gaining new input and impressions for projects and business ideas. On July 16, 2009 the first Semantic Web Meetup in Vienna takes place at the headquarters of the Austrian Press Agency.
Join the community! It’s fun and free of charge!
Posted at 14:16
I read the
Posted at 22:22
| important dates | |
| abstracts | 21 Sept 09 |
| submissions | 01 Oct 09 |
| notification | 15 Dec 09 |
| final copy | 15 Jan 10 |
| publication | April 10 |
The Journal of Web Semantics will publish a special issue on Data Mining and Social Network Analysis for integrating Semantic Web and Web 2.0 in the spring of 2010. The special issue will be edited by Bettina Berendt, Andreas Hotho and Gerd Stumme and initial abstracts for papers must be submitted via the Elsevier EES system by September 21, 2009.
The special issue invites contributions that show how synergies between Semantic Web and Web 2.0 techniques can be successfully used. Since both communities work on network-like data structures, analysis methods from different fields of research could form a link between those communities. Techniques can be - but are not limited to - social network analysis, graph analysis, machine learning and data mining methods.
Relevant topics include
Posted at 14:16
As a complement to the most recent Linked Data Design Issues note by TimBL, I would like to add this subtle tweak to the enumerated rules:
If you perform the steps above, on any HTTP network (e.g. World Wide Web), you implicitly bind the Names/Identifiers of things to negotiable representations of their metadata (description) bearing documents.
Also note, you can create and deploy the resulting RDF metadata using any of the following approaches:
Posted at 14:49
I published a blog entry on right before my journey back to Europe (and an addendum because I forgot something in the original blog entry…) with many more details. If you are interested in more detailed impressions of the conference, you can read it there. Suffice it to say: it was a great week!

Posted at 09:16
A recent article by Tim Berners-Lee, “Putting Government Data online“, has attracted significant interest to the datasets published at the US data.gov website. As Berners-Lee discusses the Semantic Web techniques that can be used to get those data into RDF space (something we are now working on), we would like to share our initial investigation of the contents of these government datasets.
I. Translate dataset into RDF
The catalog of the datasets in data.gov, http://www.data.gov/details/92, is published in CSV format as part of data.gov. We converted it into RDF using simple CSV parsing. We kept the translation minimal: (i) the properties are created directly from the column names; (ii) each table row is mapped to an instance of pmlp:Dataset; (iii) all non-header cells are mapped to literals; we don’t create new URIs at this point. The output of our work is published on the TW website at:
http://data-gov.tw.rpi.edu/raw/92/catalog.rdf
(We are now starting to do more integration work, extracting multiple objects from single tables, linking into the linked open data cloud, etc., and will publish a new version when that is done. The purpose of this first work was simply to make the catalog more available to the RDF community.)
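For readers who want to try something similar, here is a minimal Python sketch of the three translation rules. It is not the code used to produce the file above. The base URI, the sample column names and the sample row are made up for illustration, and the expansion of the pmlp: prefix is an assumption; only the class name pmlp:Dataset comes from the description.

```python
import csv
import io
import re

# Assumption: expansion of the pmlp: prefix. The post only names pmlp:Dataset.
PMLP = "http://inference-web.org/2.0/pml-provenance.owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical base URI for the generated rows and properties.
BASE = "http://example.org/catalog/"


def property_uri(column_name):
    """Rule (i): derive a property URI directly from a column name."""
    local = re.sub(r"[^A-Za-z0-9]+", "_", column_name.strip()).strip("_").lower()
    return BASE + "vocab#" + local


def escape_literal(value):
    """Escape a cell value so it can sit inside an N-Triples string literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Yield one N-Triples line per statement derived from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        subject = "<%srow/%d>" % (BASE, row_number)
        # Rule (ii): each table row becomes an instance of pmlp:Dataset.
        yield "%s <%s> <%sDataset> ." % (subject, RDF_TYPE, PMLP)
        for column, cell in row.items():
            if column is None or cell is None or cell == "":
                continue  # skip ragged rows and empty cells
            # Rule (iii): every cell becomes a literal; no new URIs are minted.
            yield '%s <%s> "%s" .' % (subject, property_uri(column), escape_literal(cell))


# Made-up sample data, not taken from the real catalog.
sample = "Title,Agency,CSV Access Point\nSample Dataset,EPA,http://example.org/sample.csv\n"
triples = list(csv_to_ntriples(sample))
for line in triples:
    print(line)
```

Note that the access point stays a plain string literal here even though it looks like a URL, which matches rule (iii).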
II. Browse and query the RDF graph
As an example, we can browse the dataset in Tabulator and then use a SPARQL web service to query it. Here we use a SPARQL query to list the datasets published in CSV format:
http://onto.rpi.edu/sw4j/sparql?queryURL=http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql
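Note that the service is handed the address of a stored query file rather than the query text itself. A small Python sketch of how such a request URL could be assembled follows. The helper name is made up for illustration, and the percent-encoding is simply what the standard library does by default; the link above passes the address unencoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Both addresses are taken from the link above.
SERVICE = "http://onto.rpi.edu/sw4j/sparql"
QUERY_FILE = "http://data-gov.tw.rpi.edu/sparql/select-csv-dataset.sparql"


def build_request_url(service, query_url):
    """Hypothetical helper: point the SPARQL service at a stored query file."""
    return service + "?" + urlencode({"queryURL": query_url})


request_url = build_request_url(SERVICE, QUERY_FILE)
print(request_url)

# Decoding the query string gives back the original query file address.
decoded = parse_qs(urlsplit(request_url).query)["queryURL"][0]
print(decoded == QUERY_FILE)
```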
III. Observations on the RDF graph
Using this service we can answer some basic questions about the data.gov datasets:
1. How many datasets are published, and how many among them can be easily converted into RDF?
There are 332 datasets, which can be partitioned by type: raw data catalog (301) and tool catalog (31).
Not all of the datasets have a link to downloadable data, because some offer only browseable data via their own websites. Others publish datasets in multiple formats. As of today, the online static files associated with the datasets are distributed as follows: 204 datasets offer a CSV format dump, 10 datasets offer an XML format dump, and 21 datasets offer an XLS format dump.
2. How are the datasets categorized?
| Category | Number of datasets |
| Geography and Environment | 227 |
| Labor Force, Employment, and Earnings | 30 |
| Social Insurance and Human Services | 30 |
| Health and Nutrition | 11 |
| Law Enforcement, Courts, and Prisons | 7 |
| Population | 4 |
| Other | 3 |
| Prices | 3 |
| Business Enterprise | 2 |
| Education | 2 |
| Energy and Utilities | 2 |
| Federal Government Finances and Employment | 2 |
| Income, Expenditures, Poverty, and Wealth | 2 |
| Science and Technology | 2 |
| Transportation | 2 |
| Construction and Housing | 1 |
| International Statistics | 1 |
| National Security and Veterans Affairs | 1 |
3. What are some of the key items in the dataset?
4. What are the sources of the datasets?
The majority of the datasets are published by the EPA, and they contain environmental data partitioned by US state for three individual years. The rest come from other government agencies. The distribution is as follows:

IV. Getting Datasets linked
Although the datasets are not explicitly linked, we see a number of opportunities for connecting these datasets to others (and into the Linked Open Data datasets):
V. Conclusions
We are committed to getting more of the data.gov data online soon (in RDF), and then investigating data integration and knowledge discovery. In order to get our datasets linked to the linked data cloud, we will use SPARQL for extracting entities and our Semantic Mediawiki as a platform to capture the owl:sameAs mappings. Scalable dataset publishing is also challenging, as some of these are very large datasets; e.g. “2005-2007 American Community Survey Three-Year PUMS Population File” has a 1.1 GB zipped CSV file. Moreover, some datasets are not directly available in one file but only via a web service. Our current plan is to make RDF documents available for download soon, and to work on bringing more of these datasets into live, SPARQLable forms as we can.
Li Ding, Dominic DiFranzo and Jim Hendler
Posted at 14:05
Some new releases around Apple's iPhone family, like the new OS3.0 or the new 3G S have stimulated another big hype around this “little darling”. I took a look at another facet, namely: Has the Semantic Web entered the iPhone realm yet (or vice versa)? Experts have been talking about the need for semantically enhanced mobile applications for years, so let's see, if they are in place already.
Searching for “semantic web” in the AppStore delivers six results; one of them, called “SemanticWb”, is obviously an interesting match. The application “extracts current life sciences and health care knowledge and place them conveniently at your fingertips on your iPhone”. The application offers search suggestions and moderated search and retrieves articles from PubMed or genetic disorders which are related to the search term. A good start: this is a neat iPhone application which should be interesting for medical doctors and related professions.
Another application on the iPhone which is related to the semantic web is the “English wordnet dictionary” based on WordNet from Princeton University.
So, not much semantic web on the iPhone so far - I thought, until Evriverse was released some weeks ago. The iPhone version of evri.com offers a new way to find connections between all kinds of things. Similar to OpenCalais, Evri can extract people, places, organisations, products etc. from unstructured information like news or blogs. The innovation in Evriverse is the way complex search queries around “anything” can be formulated by just touching the screen. For example, if you are looking for information about “Tim Berners-Lee” the application not only offers auto-complete but also suggests related people, organisations etc. to refine any search query. Such relations are updated constantly and are based on the semantic analysis of news and blogs.
Evriverse offers the most comfortable way to do news research on the iPhone today. It shows how semantic technologies can enhance user experience on a mobile device and it will pave the way to more semantic (web) apps on the iPhone.
Posted at 09:31
For the past few months, there have been a variety of calls for feedback and suggestions on how the US Government can move towards becoming more open and transparent, especially in terms of their dealings with citizens and also for disseminating information about their recent financial stimulus package.
As part of this, the National Dialogue forum was set up to solicit solutions for ways of monitoring the “expenditure and use of recovery funds”. Tim Berners-Lee wrote a proposal on how linked open data could provide semantically-rich, linkable and reusable data from Recovery.gov. I also blogged about this recently, detailing some ideas for how discussions by citizens on the various uses of expenditure (represented using SIOC and FOAF) could be linked together with financial grant information (in custom vocabularies).
More recently, the Open Government Initiative solicited ideas for a government that is “more transparent, participatory, and collaborative”, and the brainstorming and discussion phases have just ended. This process is now in its third phase, where the ideas proposed to solve various challenges are to be more formally drafted in a collaborative manner.
What is surprising about this is how few submissions and contributions have been put into this third and final phase (see graph below), especially considering that there is only one week for this to be completed. Some topics have zero submissions, e.g. “Data Transparency via Data.gov: Putting More Data Online”.
This doesn’t mean that people aren’t still thinking about this. On Monday, Tim Berners-Lee published a personal draft document entitled “Putting Government Data Online“. But we need more contributions from the Linked Data community to the drafts during phase three of the Open Government Directive if we truly believe that this solution can make a difference.
(I watched it again today, and added a little speech bubble to the image below to express my delight at seeing SIOC profiles on the Linked Open Data cloud slide.)
We also have a recently-established Linked Data Research Centre at DERI in NUI Galway.
Posted at 15:25
SemTech 2009 has come and gone, and it was great. I was concerned—as were others—that the state of the economy would depress the turnout and enthusiasm for the show, but it seems that any such effects were at least counterbalanced by a growing interest in semantic technologies. Early reports are that attendance was up about 20% from last year, and at sessions, coffee breaks, and the exhibit hall there seemed to always be more people than I expected. Good stuff.
Eric P. and I gave our SPARQL By Example tutorial to a crowd of about 50 people on Monday. From the feedback I’ve received, it seems that people found the session beneficial, and at least a couple of people remarked on the fact that Eric and I seemed to be having fun. If this whole semantic thing doesn’t work out, at least we can fall back on our ad-hoc comedy routines.
Anyways, I wanted to share a couple of links with everyone. I think they work nicely to supplement other SPARQL tutorials in helping teach SPARQL to newcomers and infrequent practitioners.
Enjoy, and, as always, I’d welcome any feedback, suggestions for improvements, or pointers to how/where you’re able to make use of these materials.
Posted at 04:39
I've finished the re-organization of my web site, though I have odds and ends to finish up. I still have two major changes featuring SVG and RDFa that I need to incorporate, but the structure and web site designs are finished.
Thanks to Drupal's non-aggressive use of .htaccess, I've been able to create a top-level Drupal installation to act as "feeder" to all of the sub-sites. I tried this once before with Wordpress, but the .htaccess entries necessary for that CMS made it impossible to have the sub-sites, much less static pages in sub-directories.
Rather than use Planet or Venus software to aggregate feed entries for all of my sites, I'm manually creating an excerpt describing a new entry, and posting it at Burningbird, with a link back to the full article. I also keep a listing of the last few months' stories for each sub-site in the sidebar, in addition to a random display of images.
There is no longer any commenting directly on a story. One of the drawbacks with XHTML and an unforgiving browser such as Firefox, is that a small error is enough to render the page useless. I incorporate Drupal modules to protect comments, but I also allow people to enter some markup. This combination handles most of the accidentally bad markup, but not all. And it doesn't protect against those determined to inject invalid markup. The only way to eliminate all problems is to not allow any markup, which I find to be too restrictive.
Comments are, however, supported at the Burningbird main site. To allow for discussion on a story, I've embedded a link in every story that leads back to the topmost Burningbird entry, where people can comment. Now, in those infrequent times when a comment causes a problem with a page, the story is still accessible. And there is a single Comment RSS feed that now encompasses all site comments.
The approach may not be ideal, but commentary is now splintered across weblog, twitter, and what not anyway—what's another link among friends?
I call my web site design "Silhouette" and will release it as a Drupal theme as soon as it's fully tested. It's a very simple two column design, with the sidebar column either to the right (standard) or easily adjusted to fall to the left. It's an accessible design, with only the top navigation bar coming between the top of the page and the first story. It is valid markup, as is, with the XHTML+RDFa Doctype, because I've embedded RDFa into the design. It is not valid, however, when you also add SVG silhouettes, as I do with all but the topmost site.
The design is also valid XHTML 5.0, except for a hard coded meta element that was added to Drupal because of security issues. I don't serve the pages up as HTML 5, though, because the RDFa Doctype triggers certain behaviors in RDFa tools. I'm also not using any of the new HTML 5 structural elements.
The site design is plain, but it suits me and that's what matters. The content is legible and easy to locate and navigate, and that's my second criterion. I will be adding some accessibility improvements in the next few months, but they won't impact the overall design.
What differs between all of the sites is the header graphic, and the SVG silhouettes, which I changed to suit the topic or mood of the site. The silhouettes were a lot of fun, but they aren't essential, and you won't be able to see them if you use a browser that doesn't support SVG inline. Which means you IE users will need to use another browser to see the images.
I also incorporate some new CSS features, including some subtle use of text-shadows with headers (to add richness to the stark use of black text on pastel graphics) and background-color: rgba functionality for semi-transparent backgrounds. The effects are not viewable by browsers that don't yet support these newer CSS styles, but loss of functionality does not impact access to the material.
Now, for some implementation basics:
The expanded primary menu footer was simple, using Drupal's API:
<?php
// Load the complete tree of the primary links menu...
$tree = menu_tree_all_data('primary-links');
// ...and print it as nested lists.
print menu_tree_output($tree);
?>
To implement the "Comment on this story" link for each story, I installed the Content Construction Kit (CCK), with the additional link module, and expanded the story content type to add the new "comment on this story" field. When I add the entry, I type in the URL for the comment post at Burningbird, which automatically gets linked in with the text "Comment on this story" as the title.
I manually manage the link from the Burningbird site to the sub-site writing, both because the text and circumstance of the link differs, and the CCK field isn't included as part of the feed. I may play around with automating this process, but I don't plan on writing entries so frequently that I find this workflow to be a burden.
The images were tricky. I have implemented both the piclens and mediaRSS Drupal Modules, and if you access any of my image galleries with an application such as Cooliris, you'll get that wonderful image management capability. (I wish more people would use this functionality for their image libraries.)
I also display sub-site specific random images within the sub-site sidebars, but I wanted the additional capability to display random images from across all of the sites in the topmost Burningbird sidebar.
To get this cross-site functionality, I installed Gallery2 at http://burningbird.net/gallery2, and synced it with the images from all of my sub-sites. I then installed the Gallery2 Drupal module at Burningbird (which you can view directly) and used Gallery2 plug-ins to provide random images within the Drupal sidebar blocks.
Drupal prevented direct access from Gallery2 to the image directories, but it was a simple matter to just copy the images and do a bulk upload. When I add a new image, I'll just pull the image directly from the Drupal Gallery page using Gallery2's image extraction functionality. Again, I don't add so many images that I find this workflow to be onerous, but if others have implemented a different approach, I'd enjoy hearing of alternatives.
One problem that arose is that none of the Gallery2 themes is XHTML compliant because of HTML entity use. All I can say is: folks, please stop using &nbsp;. Use &#160; instead, if you're really, really generating XHTML, not just HTML pretending to be XHTML.
To fix the non-compliant XHTML problem, I copied a version of my site to a separate theme, and just removed the PHP that serves the page up as XHTML for XHTML-capable browsers from this "Silhouette for HTML" theme. The Gallery2 Drupal modules allow you to specify a different theme for the Gallery2 pages, and I use the new HTMLated theme for the Gallery2 pages. I use my XHTML compliant theme for the rest of the site. Over time, I can probably add conditional tests to my main theme to test for the presence of Gallery blocks, but what I have is simple and works for now.
Lastly, I redirected the old Planet/Venus based feed locations to the Burningbird feed. You can still access full feeds from all of my sub-sites, and get full entries for all but the larger stories and books, but the entries at Burningbird will be excerpts, except for Burningbird-only posts. Speaking of which, all of my smaller status updates, and general chit-chat will be made directly at Burningbird—I'm leaving the sub-sites for longer, more in-depth, and "stand alone" writings.
As I mentioned earlier, I still have some work with SVG and RDFa to finish before I'm completely done with the redesign. I also have some additional tweaks to make with the existing infrastructure. For instance, I have custom 404, 403, and 410 error pages, but Drupal overrides the 403 and 404 pages. You can redirect the error handling to specific pages, but not to static pages, only to pages within the Drupal system. However, I'm not too worried about this issue, as I'm finding that there's typically a Drupal module for any problem, just waiting to be discovered.
I know I must come across as a Drupal fangirl in this writing, but after using the application for over a year, and especially after this site redesign, I have found that no other piece of software matches my needs so well as Drupal. It's not perfect software—there is no such thing as perfect software—but it works for me.
* This process convinced me to switch fully from using Firefox to using Safari. It was so much simpler to fix pages with XHTML errors using Safari than with Firefox's overly aggressive XHTML error handling.
Posted at 19:51
I wrote a
Posted at 14:09
The first and possibly most important aspect of
Posted at 21:53
Just saw Jim’s post on What is the Semantic Web really all about?
I have been wondering about this problem too. What is the Semantic Web? Yesterday I asked a question, “Why few (or none?) Web 2.0 sites provide hierarchical tagging?”, on LinkedIn and got some pretty good answers:
http://www.linkedin.com/answers?viewQuestion=&questionID=496785&askerID=14212719
For your convenience, I attached my LinkedIn post at the end of this blog.
There are two things in the answers that draw my attention:
* Many do _not_ believe tags, or even hierarchical tags, are semantic; “semantics” means RDF or triples at least to them;
* Some believe that even implementing a hierarchical tagging system is not easy in engineering or social aspects.
I think these two beliefs, among many other reasons, may explain in part why the “Semantic Web” is still far from a reality. The first is about the overestimation of what “semantics” is: triples are one way to express semantics, but it is an open question whether they are _the_ way. The second is about the underestimation of “Web”-scale: realizing a knowledge system on the Web, even if it is conceptually “simple”, can lead to serious scalability problems, both for machines (can you make <1s response for all queries?) and for people (on changing their way of thinking).
Here is what I believe about “semantic web” (note the lack of capitalization). First, it is not necessarily “the Semantic Web” (just like there is no “the Mobile Web”), as defined by W3C standards or the layer cake model. Semantics is a way of organizing things; RDF and OWL are some ways to express it, but other ways should be encouraged too and sometimes work better. Second, tools and services should be “web-ish”, something like a semanticized version of youtube or gmail; after all, “web users” are rarely bioinformaticians, and few can master a Java-based ontology editor. Third, start deployment with very very basic semantics like trees (yeah, I know some will protest) and sameAs, but do it in a very very efficient way - if we can’t even come up with a Web-efficient tree reasoner, then how realistically can we come up with a Web-efficient RDF or OWL reasoner?
Now I’m prepared to dodge tomatoes :D
by Jie Bao
===============
My original post on LinkedIn (reorganized a bit)
Why few (or none?) Web 2.0 sites provide hierarchical tagging?
Gmail label and delicious tagging are flat, which is troublesome all the time for me. I have to add (unnecessarily) many tags even if they can be easily inferred. I didn’t find an alternative that allows me to organize my tags in a tree or network. Is there any technical or marketing reason?
People have been talking about the semantic web for a while and are looking for a killer app. It’s apparent that hierarchical tagging is semantic, is in high demand, and is relatively easy to do. Why is there none on popular sites?
PS 1: Let me clarify some situations in which hierarchical tagging would save me a lot of time: recently I have been reading a book by Qian Mu, a historian, and tagging my notes on delicious with the tag “qianmu”; I also want all those notes to be tagged with “history”, but I always have to add both “qianmu” and “history”.
Sometimes I want more than one tag to be inferred. For example, when I add “wuxu” (the year 1898), I want the tags “qing”, “china” and “reform” to be added. You will find how troublesome it is to add all 4 tags together when you have about 10 notes on “wuxu”.
In another example, I want to share my tags in both Chinese and English. If I can define two subclass relations between two tags, each in a different language, I will not always have to add both tags.
Now I have about 1000 tags on delicious. I’m really really in desperate need of a hierarchy. I’m willing to pay delicious $100 for such a service.
PS 2: Further clarification: I don’t believe I will need a tagging system that always requires me to pick terms from a tree, DAG, or network. I can still freely add tags. But I need some way to clean up my tags from time to time, and organize them. It is just like how I clean up my “download” folder: put the files into different folders, and if a folder is too big make some subfolders.
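To make the request concrete, here is a rough Python sketch of the kind of tag inference I have in mind. It is only an illustration under my own assumptions: the function name and the way the example tags are arranged into a hierarchy (e.g. that “qing” implies “china”) are made up here, and no existing site works this way as far as I know.

```python
def expand_tags(tags, parents):
    """Return the given tags plus every tag implied by the hierarchy.

    `parents` maps a tag to the broader tags it implies. A tag may have
    several parents, so the structure is a DAG rather than a strict tree.
    """
    expanded = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag in expanded:
            continue  # already handled; this also guards against cycles
        expanded.add(tag)
        stack.extend(parents.get(tag, ()))
    return expanded


# Hypothetical arrangement of the tags from the examples above.
parents = {
    "qianmu": ["history"],
    "wuxu": ["qing", "reform"],
    "qing": ["china"],
}

print(sorted(expand_tags(["wuxu"], parents)))
# → ['china', 'qing', 'reform', 'wuxu']
```

The cycle guard also covers the bilingual case: if two tags in different languages are declared as subclasses of each other, adding either one yields both, and the expansion still terminates.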
Posted at 20:26