No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.
Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file, and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified; generation now took about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values, and skew becomes severe if they all go to the same partition. Hence consecutive small INTs each go to a different partition, but for larger ones the low 8 bits are ignored, which is good for compression: consecutive values must fall in consecutive places, but not for small INTs.
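A minimal sketch of such a partition-key rule. The cutoff of 256 for "small" and the modulo placement are assumptions for illustration; the loader's actual threshold and hash may differ:

```python
def partition_of(key: int, n_partitions: int) -> int:
    """Pick a partition for an integer partition key.

    Small values (e.g., the very common 1 and 2) are fanned out over
    consecutive partitions to avoid skew. For larger values the low
    8 bits are dropped, so blocks of 256 consecutive keys land in the
    same partition, which keeps the stored values compressible.
    """
    if key < 256:
        return key % n_partitions        # consecutive small INTs fan out
    return (key >> 8) % n_partitions     # runs of 256 stay together

# Small keys spread: 1 and 2 go to different partitions.
# Large keys cluster: 1000 and 1001 share a partition.
```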
Another uniquely brain-dead feature of the BSBM generator has also
been rectified: When generating multiple files, the program would
put things in files in a round-robin manner, instead of putting
consecutive numbers in consecutive places, which is how every other
data generator or exporter does it. This impacts bulk load locality
and as you, dear reader, ought to know by now, performance comes
from (1) locality and (2) parallelism.
The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.
The first experiment is with triples, so no characteristic sets, no schema.
So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigE's. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; it does not even respond to ping. We swap it for another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data is in. But now it takes 2h21min to make a checkpoint, i.e., to make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk; we have seen such before. The data can, however, be read back once the write is finally done.
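The workaround can be sketched in a few lines. This is illustrative Python using an anonymous mmap, standing in for the server's actual C allocation path; the size is an assumption kept small for the sketch:

```python
import mmap

# malloc failed beyond ~9 GB on these boxes, but an anonymous mmap gets
# pages straight from the kernel, bypassing the allocator arena. The
# real runs needed tens of GB per process; 64 MB here is illustrative.
size = 64 * 1024 * 1024

buf = mmap.mmap(-1, size)   # -1 = anonymous mapping, not file-backed
buf[:4] = b"load"           # behaves like ordinary writable memory
data = bytes(buf[:4])
buf.close()
```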
Well, this is a non-starter. So, by mid-day of day 4, the faulty machine has been replaced. Now writing to disk completes within expected delays.
In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.
In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples where the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.
But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.
A previous scale-model experiment, loading 37.5 Gt in 192 GB of RAM and paging to a pair of 2 TB disks, was done a week before. This finished in time, keeping a load rate of above 400 Kt/s on a 12-core box.
At 10AM on day 5 (Friday), the cluster is rebooted. It takes about 30 minutes to get back to its former 5 Mt/s load rate. We now consider switching the network back to InfiniBand. The whole ethernet network seems to have crashed at 9PM on day 4. This is of course unexplained, but the experiment had been driving the ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will try IB rather than risk this happening again, especially since, if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).
So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross-sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not the full platform, since there is load imbalance as noted above.
At 9PM on day 5, the rate is around 5.7 Mt/s, with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. Specifically, the node with the highest CPU is the one in whose partition the bsbm:offer class falls, so there is a local peak, since one of every 9 or so triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.
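To see why such an index stays compact, consider a toy run-length encoder over a sorted S column. This illustrates the principle only; it is not Virtuoso's actual on-disk format:

```python
def run_lengths(sorted_ids):
    """Compress a sorted sequence of subject ids into (start, length) runs.

    With P and O fixed and S densely ascending -- e.g., every offer
    getting consecutive ids -- the whole posting list collapses into a
    handful of runs, which is why even a "useless" index is cheap to
    store (though not free to maintain during load).
    """
    runs = []
    for s in sorted_ids:
        if runs and s == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)   # extend the current run
        else:
            runs.append((s, 1))              # start a new run
    return runs

# Five ids with one gap collapse into two runs: [(5, 3), (9, 2)].
runs = run_lengths([5, 6, 7, 9, 10])
```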
The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. The triples have more literals, but the load is more even. The cruising speed is 4.3 Mt/s, down from 6.2, reflecting the different shape of the data.
The last stretch of the data is about reviews. This stretch of the data has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.
At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely finish in under 24h. The average rate for a 4-hour sample between midnight and 4AM of day 6 is 6.8 Mt/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.
At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes for a large number of sometimes long messages, which makes for many mmaps. So we will wait until the morning of day 8 (Monday) for the administrator to set these. In the meantime, we analyze the behavior of the workload on the 37 Gt scale-model cluster on my desktop.
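The kernel setting is not named in the post; on Linux, the per-process cap on memory mappings is vm.max_map_count, which is the likely culprit given the symptoms. A Linux-only sketch for checking how close a process is to that cap:

```python
def mmap_usage():
    """Return (mappings in use, per-process cap) for this process.

    Linux-only: counts lines in /proc/self/maps and reads the
    vm.max_map_count sysctl. Returns None where /proc is unavailable.
    The choice of sysctl is an assumption based on the symptoms above.
    """
    try:
        with open("/proc/self/maps") as f:
            used = sum(1 for _ in f)
        with open("/proc/sys/vm/max_map_count") as f:
            limit = int(f.read())
        return used, limit
    except OSError:
        return None

usage = mmap_usage()
```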
To be continued...
The LOD2 FP7 ends at the end of August, 2014. This post begins a series that will crown the project with a grand finale, another decisive step towards the project’s chief goal of giving RDF and linked data performance parity with SQL systems.
In a nutshell, LOD2 went like this:
Triples were done right, taking the best of the column store world and adapting it to RDF. This is now in widespread use.
SQL was done right, as I have described in detail in the TPC-H series. This is generally available as open source in v7fasttrack. SQL is the senior science and a runner-up like sem-tech will not carry the day without mastering this.
RDF is now breaking free of the triple store. RDF is a very general, minimalistic way of talking about things. It is not a prescription on how to do database. Confusing these two things has given rise to RDF’s relative cost against alternatives. To cap off LOD2, we will have the flexibility of triples with the speed of the best SQL.
In this post we will look at accomplishments so far and outline what is to follow during August. We will also look at what in fact constitutes the RDF overhead, why this is presently so, and why this does not have to stay thus.
This series will be of special interest to anybody concerned with RDF efficiency and scalability.
At the beginning of LOD2, I wrote a blog post discussing the RDF technology and its planned revolution in terms of the legend of Perseus. The classics give us exemplars and archetypes, but actual histories seldom follow them one-to-one; rather, events may have a fractal nature where subplots reproduce the overall scheme of the containing story.
So it is also with LOD2: The Promethean pattern of fetching the fire (state of the art of the column store) from the gods (the DB world) and bringing it to fuel the campfires of the primitive semantic tribes is one phase, but it is not the totality. This is successfully concluded, and Virtuoso 7 is widely used at present. Space efficiency gains are about 3x over the previous, with performance gains anywhere from 3 to 100x. As pointed out in the Star Schema Benchmark series (part 1 and part 2), in the good case one can run circles in SPARQL around anything but the best SQL analytics databases.
In the larger scheme of things, this is just preparation. In the classical pattern, there is the call or the crisis: Presently this is that having done triples about as right as they can be done, the mediocre in SQL can be vanquished, but the best cannot. Then there is the actual preparation: Perseus talking to Athena and receiving the shield of polished brass and the winged sandals. In the present case, this is my second pilgrimage to Mount Database, consisting of the TPC-H series. Now, the incense has been burned and libations offered at each of the 22 stations. This is not reading papers, but personally making one of the best-ever implementations of this foundational workload. This establishes Virtuoso as one of the top-of-the-line SQL analytics engines. The RDF public, which is anyway the principal Virtuoso constituency today, may ask what this does for them.
Well, without this step, the LOD2 goal of performance parity with SQL would be both meaningless and unattainable. The goal of parity is worth something only if you compare the RDF contestant to the very best SQL. And the comparison cannot possibly be successful unless it incorporates the very same hard core of down-to-the-metal competence the SQL world has been pursuing now for over forty years.
It is now time to cut the Gorgon’s head. The knowledge and prerequisite conditions exist.
The epic story is mostly about principles. If it is about personal combat, the persons stand for values and principles rather than for individuals. Here the enemy is actually an illusion, an error of perception, that has kept RDF in chains all this time. Yes, RDF is defined as a data model with triples in named graphs, i.e., quads. If nothing else is said, an RDF Store is a thing that can take arbitrary triples and retrieve them with SPARQL. The naïve implementation is to store things as rows in a quad table, indexed in any number of ways. There have been other approaches suggested, such as property tables or materialized views of some joins, but these tend to flush the baby with the bathwater: If RDF is used in the first place, it is used for its schema-less-ness and for having global identifiers. In some cases, there is also some inference, but the matter of schema-less-ness and identifiers predominates.
We need to go beyond a triple table and a dictionary of URI names while maintaining the present semantics and flexibility. Nobody said that physical structure needs to follow this. Everybody just implements things this way because this is the minimum that will in any case be required. Combining this with a SQL database for some other part of the data/workload hits basically insoluble problems of impedance mismatch between the SQL and SPARQL type systems, maybe using multiple servers for different parts of a query, etc. But if you own one of the hottest SQL racers in DB city and can make it do anything you want, most of these problems fall away.
The idea is simple: Put the de facto rectangular part of RDF data into tables; do not naively index everything in places where an index gives no benefit; keep the irregular or sparse part of the data as quads. Optimize queries according to the table-like structure, as that is where the volume is and where getting the best plan is a make or break matter, as we saw in the TPC-H series. Then, execute in a way where the details of the physical plan track the data; i.e., sometimes the operator is on a table, sometimes on triples, for the long tail of exceptions.
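The idea can be caricatured in a few lines of SQL (here via Python's sqlite3). Table and column names are invented for illustration and are not Virtuoso's internal layout:

```python
import sqlite3

# The regular, table-shaped part of the data lives in an ordinary
# relation; the long tail of irregular triples stays in a quad table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE offer (s INTEGER PRIMARY KEY, price REAL, vendor INTEGER)")
con.execute("CREATE TABLE quad (g INTEGER, s INTEGER, p TEXT, o)")

# Rectangular data goes to the table...
con.execute("INSERT INTO offer VALUES (1, 9.99, 42)")
# ...while a rare, schema-less property stays as a quad.
con.execute("INSERT INTO quad VALUES (0, 1, 'ex:specialNote', 'limited edition')")

# A query runs on the table for the regular part and falls back to the
# quads for the exceptions.
row = con.execute(
    "SELECT o.price, q.o FROM offer o "
    "LEFT JOIN quad q ON q.s = o.s AND q.p = 'ex:specialNote' "
    "WHERE o.s = 1").fetchone()
```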
In the next articles we will look at how this works and what the gains are.
These experiments will for the first time showcase the adaptive schema features of the Virtuoso RDF store. Some of these features will be commercial only, but the interested will be able to reproduce the single server experiments themselves using the v7fasttrack open source preview. This will be updated around the second week of September to give a preview of this with BSBM and possibly some other datasets, e.g., Uniprot. Performance gains for regular datasets will be very large.
To be continued...
On Monday, August 18, at 13:30, Edgard Marx will give a pre-presentation of his SEMANTiCS conference talk about the accepted paper Towards an Open Question Answering Architecture.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Hello!
The 21st European Conference on Artificial Intelligence (ECAI) will be held in Prague, Czech Republic, from 18th to 22nd August 2014. Various excellent papers on artificial intelligence, logic, rule mining, and many more topics will be presented.
AKSW member Ricardo Usbeck will present a poster on AGDISTIS – Agnostic Disambiguation of Named Entities using Linked Data. To the best of our knowledge, AGDISTIS is able to outperform the state-of-the-art approaches in entity linking by up to 29% F-measure. Come and visit him at ECAI 2014.
A demo of AGDISTIS is available here: http://agdistis.aksw.org/demo and the paper can be found here.
In this blog post, I will show you how we can use basic graph analysis metrics to analyze Big Structures. I will analyze the UMBEL reference concept structure to demonstrate how graph and network analysis can be used to better understand the nature and organization of, and possible issues with, Big Structures such as UMBEL.
I will first present the formulas that have been used to perform the UMBEL analysis. To better understand this section, you will need some math knowledge, but there is nothing daunting here. You can probably safely skip that section if you desire.
The second section is the actual analysis of a Big Structure (UMBEL) using the graph measures I presented in the initial section. The results will be presented, analyzed and explained.
After reading this blog post, you should better understand how graph and network analysis techniques can be used to understand, use and leverage Big Structures to help integrate and interoperate disparate data sources.
A Big Structure is a network (a graph) of inter-related concepts, composed of thousands or even hundreds of thousands of such concepts. One characteristic of a Big Structure is its size: by nature, a Big Structure is too big to manipulate by hand, and it requires tools and techniques to understand and assess the nature and the quality of the structure. It is for this reason that we have to leverage graph and network measures to help us in manipulating these Big Structures.
In the case of UMBEL, the Big Structure is a scaffolding of reference concepts used to link external (unrelated) structures, to help data integration and to help unrelated systems interoperate. In a world where the Internet of Things is the focus of big companies and where there are more than 400 standards, such techniques and technologies are increasingly important; otherwise it will end up being the Internet of [Individual] Things. Such a Big Structure can also be used for other tasks, such as helping machine learning techniques to categorize and disambiguate pieces of data by leveraging such a structure of types.
UMBEL is an RDF and OWL ontology of a bit more than 26,000 reference concepts. Because the structure is represented using RDF, it is a directed graph. All of UMBEL's vertices are classes, and all of the edges are properties.
The most important fact to keep in mind until the end of this blog post is that we are manipulating a directed graph. This means that all of the formulas used to analyze the UMBEL graph are formulas applicable to directed graphs only.
I will keep the normal graph analysis language that we use in the literature; however, keep in mind that a vertex is a class or a named individual, and that an edge is a property.
The UMBEL structure we are using is composed of the classes view and the individuals view of the ontology. That means that all the concepts are there: some of them only have a class view, others have an individual view, and others have both (because they got punned).
In this section, I will present the graph measures that we will use to perform the initial UMBEL graph analysis. In the next section, we will make the analysis of UMBEL using these measures, and we will discuss the results.
This section uses math notation. It could be skipped, but I suggest taking some time to understand each measure, since it will help in understanding the analysis.
In this blog post, a graph is represented as G = (V, E), where G is the graph, V is the set of all the vertices and E is the set of all the edges of the same type that relate the vertices.
The UMBEL analysis focuses on the following transitive properties: rdfs:subClassOf, umbel:superClassOf, skos:broaderTransitive, skos:narrowerTransitive and rdf:type. When we perform the analysis, we pick the subgraph composed of all the connections between the vertices that are linked by that edge (property).
The first basic measure is the density of a graph. The density measures how many edges are in the set E compared to the maximum possible number of edges between the vertices in the set V. For a directed graph, it is measured with:

D = |E| / (|V| · (|V| − 1))

where D is the density, |E| is the number of properties (edges) and |V| is the number of classes (vertices).

The density is the ratio between the number of edges that exist and the number of edges that could exist in the graph: |E| is the number of edges in the graph and |V| · (|V| − 1) is the maximum possible number of edges.

The maximum density is 1, and the minimum density is 0. The density of a graph gives us an idea of the number of connections that exist between the vertices.
The degree of a vertex is the number of edges that connect that vertex to other vertices. The average degree of a graph is another measure of how many edges are in the set E compared to the number of vertices in the set V:

D_avg = |E| / |V|

where D_avg is the average degree, |E| is the number of properties (edges) and |V| is the number of classes (vertices).

This measure tells us the average number of nodes to which any given node is connected.
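To make these two formulas concrete, here is a minimal Python sketch that computes the density and average degree of a small, made-up directed graph (the vertex names are purely illustrative, not actual UMBEL data):

```python
# Toy directed graph: vertex -> set of direct successors.
# The names below are hypothetical, chosen only to illustrate the formulas.
graph = {
    "Thing": set(),
    "Animal": {"Thing"},
    "Person": {"Animal", "Thing"},
    "Place": {"Thing"},
}

num_vertices = len(graph)                        # |V|
num_edges = sum(len(s) for s in graph.values())  # |E|

# Directed density: D = |E| / (|V| * (|V| - 1))
density = num_edges / (num_vertices * (num_vertices - 1))

# Average degree: |E| / |V|
avg_degree = num_edges / num_vertices

print(num_edges, round(density, 4), avg_degree)  # 4 0.3333 1.0
```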
The diameter of a graph is the longest shortest path between two vertices in the graph. This means that it is the longest path that excludes all detours, loops, etc. between two vertices.

Let d(vᵢ, vⱼ) be the length of the shortest path between vᵢ and vⱼ, with d(vᵢ, vⱼ) = 0 when vⱼ cannot be reached from vᵢ. The diameter of the graph is defined as:

diam(G) = max over all vᵢ, vⱼ ∈ V of d(vᵢ, vⱼ)
This metric gives us an assessment of the size of the graph. It is useful to understand the kind of graph we are playing with. We will also relate it with the average path length to assess the span of the graph and the distribution of path lengths.
The average path length is the shortest path length averaged over all pairs of vertices. Let d(vᵢ, vⱼ) be the length of the shortest path between vᵢ and vⱼ, with d(vᵢ, vⱼ) = 0 when vⱼ cannot be reached from vᵢ:

l_G = (1 / (n · (n − 1))) · Σ over all i ≠ j of d(vᵢ, vⱼ)

where n = |V| is the number of vertices in the graph G, and n · (n − 1) is the number of pairs of distinct vertices. Note that the number of pairs of distinct vertices is equal to the number of shortest paths between all pairs of vertices if we pick just one in case of a tie (two shortest paths with the same length).
In the context of ontology analysis, I would compare this metric to the speed of the ontology. What I mean is that one of the main tasks we do with an ontology is to infer new facts from known facts, and many inferencing activities require traversing the graph of the ontology. This means that the smaller the average path length between two classes, the more performant these inferencing activities should be.
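As a sketch of how these two measures can be computed, here is a small breadth-first-search implementation in Python over a hypothetical four-node chain (toy data, not UMBEL):

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search; returns {vertex: hop count} for reachable vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter_and_avg_path_length(graph):
    n = len(graph)
    lengths = [d for s in graph
               for t, d in shortest_path_lengths(graph, s).items() if t != s]
    # Unreachable pairs count as 0 (the convention used above), and the
    # average divides by the number of distinct ordered pairs n * (n - 1).
    return max(lengths), sum(lengths) / (n * (n - 1))

# Toy chain a -> b -> c -> d.
graph = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
diam, avg = diameter_and_avg_path_length(graph)
print(diam, avg)  # diameter 3; average 10/12
```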
The local clustering coefficient quantifies how well connected the neighborhood vertices of a given vertex are. It is the ratio between the number of edges that exist between the neighborhood vertices of a given vertex and the maximum possible number of edges between these same neighborhood vertices. For a directed graph:

Cᵢ = |{e_jk : vⱼ, v_k ∈ Nᵢ, e_jk ∈ E}| / (kᵢ · (kᵢ − 1))

where Nᵢ is the set of neighborhood vertices of the vertex vᵢ, kᵢ = |Nᵢ| is the number of neighborhood vertices, and kᵢ · (kᵢ − 1) is the maximum number of edges between the neighborhood vertices.

The average local clustering coefficient is the sum of the local clustering coefficients of all the vertices of a graph G divided by the number of vertices n. It is given by:

C̄ = (1 / n) · Σ from i = 1 to n of Cᵢ
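A minimal Python sketch of the local and average clustering coefficients, using out-neighbours as the neighborhood (one common convention for directed graphs; the toy graph is made up):

```python
def local_clustering(graph, v):
    """Directed local clustering coefficient of v, taking the out-neighbours
    of v as its neighborhood N_v."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # undefined for fewer than 2 neighbours; 0 by convention
    # Count the directed edges that exist between the neighbours themselves.
    links = sum(1 for a in nbrs for b in nbrs if a != b and b in graph[a])
    return links / (k * (k - 1))

def average_clustering(graph):
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

graph = {
    "x": {"a", "b", "c"},
    "a": {"b"},
    "b": set(),
    "c": {"a", "b"},
}
print(local_clustering(graph, "x"))  # 3 edges among {a, b, c} / 6 possible = 0.5
```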
Betweenness centrality is a measure of the importance of a node in a graph. It is based on the number of times a node participates in the shortest paths between other nodes: if a node participates in the shortest paths of many other nodes, it is more important than the other nodes in the graph. It acts like a conduit.

g(v) = Σ over all s ≠ v ≠ t of σ_st(v) / σ_st

where σ_st is the total number of shortest paths from node s to node t and σ_st(v) is the number of those paths that pass through v.
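This formula can be computed efficiently with Brandes' algorithm, sketched here in Python for an unweighted directed graph (toy example only):

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: unweighted, directed betweenness centrality."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        # Forward phase: BFS tracking path counts (sigma) and predecessors.
        dist = {s: 0}
        sigma = dict.fromkeys(graph, 0.0)
        sigma[s] = 1.0
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate pair dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

graph = {"a": {"b"}, "b": {"c"}, "c": set()}
print(betweenness(graph)["b"])  # b is on the only a -> c shortest path: 1.0
```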
In the context of ontology analysis, the betweenness centrality tells us which classes participate most often in the shortest paths of a given transitive property between other classes. This measure is interesting because it helps us understand how a subgraph is constructed. For example, if we take a transitive property such as rdfs:subClassOf, then the subgraph composed of this relationship alone should be more hierarchical, given the semantics of the property. This means that the nodes (classes) with the highest betweenness centrality should be classes that participate in the upper portion of the ontology (the more general classes). However, if we think about a property such as foaf:knows between named individuals, the results should be quite different and suggest a different kind of graph.
Now that we have a good understanding of some core graph analysis measures, we will use them to analyze the graph of relationships between the UMBEL classes and reference concepts, using the subgraphs generated by the properties rdfs:subClassOf, umbel:superClassOf, skos:broaderTransitive, skos:narrowerTransitive and rdf:type.
The maximum number of edges in UMBEL is |V| · (|V| − 1) = 26,345 × 26,344 = 694,032,680, which is about two thirds of a billion edges. This is quite a lot of edges, and it is important to keep this number in mind, since most of the following ratios are based on this maximum number of edges (connections) between the nodes of the UMBEL graph.
Here is the table that shows the density of each subgraph generated by each property:
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 39 410 | 116 792 | 36 016 | 36 322 | 271 810 |
| Density | 0.0000567 | 0.0001402 | 0.0000518 | 0.0000523 | 0.0001021 |
As you can see, the density of any of the UMBEL subgraphs is really low considering that the maximum density of the graph is 1. This gives us a picture of the structure: most of the concepts have no more than a few connections to other nodes for any of the analyzed properties.
This makes sense, since a conceptual structure is meant to model relationships between concepts that represent things in the real world, and in the real world the concepts we create are far from being connected to every other concept.
This is actually what we want: a conceptual structure with a really low density. It suggests that the concepts are unambiguously related and hierarchized.
Having a high density (say 0.01, which would mean nearly 7 million connections between the 26,345 concepts for a given property) might suggest that the concepts are too highly connected, which in turn could mean that using the UMBEL ontology for tagging, classifying and reasoning over its concepts would not be an optimal choice, because of the nature of the structure and its connectivity (and possible lack of hierarchy).
The average degree shows the average number of UMBEL nodes that are connected to any other node for a given property.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 39 406 | 97 323 | 36 016 | 36 322 | 70 921 |
| Number of vertices | 26 345 | 26 345 | 26 345 | 26 345 | 26 345 |
| Average degree | 1.4957 | 3.6941 | 1.3670 | 1.3787 | 2.6920 |
As you can see, the numbers are quite low: from 1.36 to 3.69. This is consistent with what we saw with the density measure above, and it helps confirm our assumption that all of these properties create mostly hierarchical subgraphs.
However, there seems to be one anomaly in these results: the average degree of 3.69 for the umbel:superClassOf property. Intuitively, its degree should be near that of rdfs:subClassOf, but it is far from it: more than twice its average degree. Looking at the OWL serialization of UMBEL version 1.05 reveals that most umbel:RefConcept instances have these 3 triples:

umbel:superClassOf skos:Collection ,
                   skos:ConceptScheme ,
                   skos:OrderedCollection .
It makes no sense for the umbel:RefConcept instances to be super classes of these skos classes. I suspect that this got introduced via punning at some point in the history of UMBEL and went unnoticed until today. This issue will be fixed in a coming maintenance version of UMBEL.
If we go back to the density measure of the graph, we notice that we have a density of 0.0001402 for the umbel:superClassOf property versus 0.0000567 for the rdfs:subClassOf property, which is roughly the same ratio as between the two average degrees. So we could have noticed the same anomaly by taking a closer look at the density measure.
But in any case, this shows how this kind of graph analysis can be used to find such issues in Big Structures (structures too big to find all these issues by scrolling the code only).
The diameter of the UMBEL graph is like a worst-case scenario: it tells us the longest shortest path for a given property's subgraph.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Diameter | 19 | 19 | 19 | 19 | 4 |
In this case, this tells us that the longest shortest path between two given nodes for the rdfs:subClassOf property is 19. So in the worst case, if we infer something between these two nodes, our algorithms will require at most 19 steps (think in terms of a breadth-first search, etc.).
The path length distribution shows the number of paths of each length x. Because of the nature of UMBEL and its relationships, I think we should expect a normal distribution. An interesting observation is that the average path length is situated around 6, which echoes the six degrees of separation.
We saw that the “worst-case scenario” was a shortest path of 19 for all the properties except rdf:type. Now we know that the average is around 6.
Here we can notice an anomaly in the expected normal distribution of path lengths. Considering the other analyses we did, we can assume the anomaly is related to the umbel:superClassOf issue we found. We will have to re-check this metric once we fix the issue; I expect we will then see a return to a normal distribution.
The average local clustering coefficient will tell us how clustered the UMBEL subgraphs are.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Average local clustering coefficient | 0.0001201 | 0.00000388 | 0.03004554 | 0.00094251 | 0.77191429 |
As we can notice, UMBEL does not show the small-world effect, given the small clustering coefficients of the properties we are looking at. It means that there is not a large number of hubs in the network, and so the number of steps to go from one class or reference concept to another is higher than in other kinds of networks, such as airport networks. This makes sense given the average path length and the path length distributions we observed above.
At the same time, this is the nature of the UMBEL graph: it is meant to be a well-structured set of concepts with multiple different specificity layers.
To understand its nature, we can consider the robustness of the network. Networks with high average clustering coefficients are known to be robust: if a random node is removed, it shouldn't impact the average clustering coefficient or the average path length of the network. However, networks that do not have the small-world effect are considered less robust, which means that if a node is removed, it could greatly impact the clustering coefficients (which would become lower) and the average path length (which would become higher).
This makes sense in such a conceptual network: if we remove a concept from the structure, it will more than likely impact the connectivity of the other concepts of the network.
One interesting thing to notice is the clustering coefficient of 0.03 for the skos:broaderTransitive property versus 0.00012 for the rdfs:subClassOf property. I have no explanation for this discrepancy at the moment, but it should be investigated after the fix as well, since intuitively these two coefficients should be close.
As I said above, in the context of ontology analysis the betweenness centrality tells us which classes participate most often in the shortest paths of a given transitive property between other classes. This measure is useful to help us understand how a subgraph is constructed.
If we check the results below, we can see that all the top nodes are nodes that we could easily classify as being part of the upper portion of the UMBEL ontology (the general concepts). Another interesting thing to notice is that the issue we found with the umbel:superClassOf property doesn't seem to have any impact on the betweenness centrality of this subgraph.
rdfs:subClassOf:

| Concept | Betweenness centrality |
|---|---|
| PartiallyTangible | 0.1853398395065167 |
| HomoSapiens | 0.1184076250864128 |
| EnduringThing_Localized | 0.1081317879905902 |
| SpatialThing_Localized | 0.092787668995485 |
| HomoGenus | 0.07956810084399618 |
You can download the full list from here
umbel:superClassOf:

| Concept | Betweenness centrality |
|---|---|
| PartiallyTangible | 0.1538140444614064 |
| HomoSapiens | 0.1053345447606491 |
| EnduringThing_Localized | 0.08982813055659158 |
| SuperType | 0.08545549956197795 |
| Animal | 0.06988077993945754 |
You can download the full list from here
skos:broaderTransitive:

| Concept | Betweenness centrality |
|---|---|
| PartiallyTangible | 0.2051286996513612 |
| EnduringThing_Localized | 0.1298479341295156 |
| HomoSapiens | 0.09859060900526667 |
| Person | 0.09607892589570508 |
| HomoGenus | 0.08806468362881092 |
You can download the full list from here
skos:narrowerTransitive:

| Concept | Betweenness centrality |
|---|---|
| PartiallyTangible | 0.2064665713447941 |
| EnduringThing_Localized | 0.1294085106511612 |
| HomoSapiens | 0.1019893654646639 |
| Person | 0.09366129700219764 |
| HomoGenus | 0.09047063411213117 |
You can download the full list from here
rdf:type:

| Concept | Betweenness centrality |
|---|---|
| owl:Class | 0.3452564052004049 |
| RefConcept | 0.3424169525837704 |
| ExistingObjectType | 0.0856962574437039 |
| ObjectType | 0.01990245954437302 |
| TemporalStuffType | 0.01716817183946576 |
You can download the full list from here
Now, let’s push the analysis further. Remember that I mentioned that all the properties we are analyzing in this blog post are transitive? Let’s perform exactly the same metrics analysis, but this time using the transitive closure of the subgraphs.
The transitivity characteristic of a property is simple to understand. Consider a tiny graph where a transitive property p links a to b and b to c. Since p is transitive, there is also a relationship between a and c: given a p b and b p c, we infer a p c using the transitivity of p.
Now, let’s use the power of these transitive properties and analyze the transitive closure of the subgraphs that we are using to compute the metrics. The transitive closure is simple to understand: from the input subgraph, we generate a new graph where all these transitive relations are made explicit.

Let’s illustrate that with the small graph a → b → c → d. The transitive closure would create a new graph that also contains the edges a → c, a → d and b → d.
This is exactly what we will be doing with the subgraphs created by the properties we are analyzing in this blog post. The end result is that we will be analyzing graphs with many more edges than the versions of the subgraphs without the transitive closure.
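For small graphs, the transitive closure can be computed with one reachability search per vertex. A Python sketch over a toy chain graph (illustrative data, not UMBEL):

```python
from collections import deque

def transitive_closure(graph):
    """For each vertex, add a direct edge to everything reachable from it."""
    closure = {}
    for s in graph:
        reachable = set()
        queue = deque(graph[s])
        while queue:
            v = queue.popleft()
            if v not in reachable:
                reachable.add(v)
                queue.extend(graph[v])
        closure[s] = reachable
    return closure

graph = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
closure = transitive_closure(graph)
print(sorted(closure["a"]))  # ['b', 'c', 'd']: a -> c and a -> d are now explicit
```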
What we will analyze now is the impact of considering the transitive closure upon the ontology metrics analysis.
Remember that the maximum number of edges in UMBEL is |V| · (|V| − 1) = 694,032,680, which is about two thirds of a billion edges.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 789 814 | 922 328 | 674 061 | 661 629 | 76 074 |
| Density | 0.0011380 | 0.0013289 | 0.0009712 | 0.0009533 | 0.0001096 |
As we can see, we have many more edges now with the transitive closure. The density of the graph is higher as well since we inferred new relationships between nodes from the transitive nature of the properties. However it is still low considering the number of possible edges between all nodes of the UMBEL graph.
We now see the impact of the transitive closure on the average degree of the subgraphs: each node of a subgraph is now connected to 25 to 35 other nodes on average.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 789 814 | 922 328 | 674 061 | 661 629 | 76 074 |
| Number of vertices | 26 345 | 26 345 | 26 345 | 26 345 | 26 345 |
| Average degree | 29.97965 | 35.00960 | 25.58591 | 25.11402 | 2.887606 |
One interesting fact is that the anomaly nearly disappears in the transitive closure subgraph for the umbel:superClassOf property. There is still a glitch, but I don’t think it would raise suspicion at first. This is important to note, since we would not have noticed this issue in the current version of the UMBEL ontology if we had analyzed only the transitive closure of the subgraph.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Diameter | 2 | 2 | 2 | 2 | 2 |
As expected, the diameter of any of the transitive closure subgraphs is 2. This is the case because we made explicit a fact (an edge) between two nodes wherever it was not explicit at first. This is good, but it is not quite useful from the perspective of ontology analysis: the number would only be informative if it were not 2, which would suggest errors in the way the diameter of the graph was computed.
However, what we can see here is that the speed of the ontology (as defined in the Average Path Length section above) is greatly improved. Since we forward-chained the facts in the transitive closure subgraphs, knowing whether a class A is a sub-class of a class B is much faster: we have a single lookup to do instead of an average of 6 steps for the subgraphs without the transitive closure.
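To illustrate the single-lookup point, here is a sketch where the closure has been precomputed into a per-class set of superclasses (the class names and contents are hypothetical, chosen for illustration):

```python
# Hypothetical forward-chained closure: class -> all of its superclasses.
superclasses = {
    "HomoSapiens": {"HomoGenus", "Animal", "Thing"},
    "HomoGenus": {"Animal", "Thing"},
    "Animal": {"Thing"},
    "Thing": set(),
}

def is_subclass_of(a, b):
    """A single set-membership lookup, instead of a graph traversal."""
    return b in superclasses.get(a, set())

print(is_subclass_of("HomoSapiens", "Thing"))  # True
print(is_subclass_of("Thing", "HomoSapiens"))  # False
```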
All of the path length distributions will look the same: since the diameter is 2, we have a large number of paths of length 2, and 26 345 of length 1. This is not really helpful from the standpoint of ontology analysis.
(Columns sub class of, super class of, broader and narrower: class view; column type: individual view.)

| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Average local clustering coefficient | 0.3281777 | 0.0604906 | 0.2704963 | 0.0138592 | 0.7590267 |
Some of the properties, like rdfs:subClassOf, show a much stronger coefficient than in the version without the transitive closure. This is normal, since all the nodes are now connected to the other nodes further down the paths. So, if a node in between disappears, it won’t affect the connectivity of the subgraph, since all the linkage that got inferred still remains.
This analysis also suggests that the transitive closure versions of the subgraphs are much more robust (which makes sense too).
However, I don’t think this metric is that important a characteristic to check when we analyze reference ontologies, since they do not need to be robust. They are not airport or telephone networks that need to cope with disappearing nodes.
What the betweenness centrality measure does with the transitive closure of the subgraphs is highlight the real top concepts of the ontology, like Thing, Class, Individual, etc. Like most of the other measures, it blurs the details of the structure (which is not necessarily a good thing).
rdfs:subClassOf:

| Concept | Betweenness centrality |
|---|---|
| SuperType | 0.03317996389023238 |
| AbstractLevel | 0.02810408526564482 |
| Thing | 0.02772171675862925 |
| Individual | 0.02747482318621853 |
| TopicsCategories | 0.02638342698407473 |
You can download the full list from here
The interesting thing here is that this measure now surfaces the actual concepts for which we discovered an issue above.
umbel:superClassOf:

| Concept | Betweenness centrality |
|---|---|
| skos:Concept | 0.02849203320293865 |
| owl:Thing | 0.02849094898994718 |
| SuperType | 0.02848878056396423 |
| skos:ConceptScheme | 0.02825567477079737 |
| skos:OrderedCollection | 0.02825567477079737 |
You can download the full list from here
skos:broaderTransitive:

| Concept | Betweenness centrality |
|---|---|
| Thing | 0.03225428380683926 |
| Individual | 0.03196350419108375 |
| Location_Underspecified | 0.0261924189600178 |
| Region_Underspecified | 0.02618796825161338 |
| TemporalThing | 0.02594169571990208 |
You can download the full list from here
skos:narrowerTransitive:

| Concept | Betweenness centrality |
|---|---|
| Thing | 0.03298126713602109 |
| Location_Underspecified | 0.02680852092899528 |
| Region_Underspecified | 0.02680398659044947 |
| TemporalThing | 0.02648809433842489 |
| SomethingExisting | 0.02582305801837314 |
You can download the full list from here
rdf:type:

| Concept | Betweenness centrality |
|---|---|
| owl:Class | 0.3452564052004049 |
| RefConcept | 0.342338078899975 |
| ExistingObjectType | 0.0856962574437039 |
| ObjectType | 0.01990245954437302 |
| TemporalStuffType | 0.01716817183946576 |
You can download the full list from here
This blog post shows that simple graph analysis metrics applied to Big Structures can be quite helpful for understanding their nature, how they have been constructed, their size, their impact on algorithms that use them, and for finding potential issues in the structure.
One thing we found is that the subgraphs generated by the properties rdfs:subClassOf and skos:broaderTransitive are nearly identical: they have nearly the same values for each metric. If you were new to the UMBEL ontology, you wouldn’t have known this without doing this kind of analysis, or without spending much time looking at the serialized OWL file. It doesn’t tell us anything about how similar the relations are, but it does tell us that they have the same impact on the ontology’s graph structure.
Performing this analysis also led us to discover a few anomalies with the umbel:superClassOf property, suggesting an issue with the current version of the ontology. This issue would have been hard to notice and understand without performing such a graph analysis of the structure.
However, I also had the intuition that the analysis of the transitive closure of the subgraphs would have led to more interesting results. At best that analysis did confirm a few things, but in most of the cases it only blurred the specificities of most of the metrics.
These analysis metrics will soon be made available as standard Web services, so that they may be applied against any arbitrary graph or ontology.
After reading this blog post, you should better understand how graph and network analysis techniques can be used to understand, use and leverage Big Structures to help integrate and interoperate disparate data sources.
A Big Structure is a network (a graph) of inter-related concepts which is composed of thousands or even hundred of thousands of such concepts. One characteristic of a Big Structure is its size. By nature, a Big Structure is too big to manipulate by hand and it requires tools and techniques to understand and assess the nature and the quality of the structure. It is for this reason that we have to leverage graph and network measures to help us in manipulating these Big Structures.
In the case of UMBEL, the Big Structure is a scaffolding of reference concepts used to link external (unrelated) structures to help data integration and to help unrelated systems inter-operate. In a World where the Internet Of Things is the focus of big companies and where there are more than 400 standards, such techniques and technologies are increasingly important, otherwise it will end-up being the Internet of [Individual] Things. Such a Big Structure can also be used for other tasks such as helping machine learning techniques to categorize and disambiguate pieces of data by leveraging such a structure of types.
UMBEL is an RDF
and OWL
ontology of a bit more than 26 000 reference concepts. Because the
structure is represented using RDF, it means that it is a directed graph.
All of UMBEL’s vertices
are classes, and all of the
edges
are properties.
The most important fact to keep in mind until the end of this blog post is that we are manipulating a directed graph. This means that all of the formulas used to analyze the UMBEL graph are formulas applicable to directed graphs only.
I will keep the normal graph analysis language that we use in
the literature, however keep in mind that a vertice
is
a class
or a named individual
and that a
edge
is a property
.
The UMBEL structure we are using is composed of the
classes view
and the individuals view
of
the ontology. That means that all the concepts are there, where
some of them only have a class view,
others have an individual
view and others have both (because they got punned).
In this section, I will present the graph measures that we will use to perform the initial UMBEL graph analysis. In the next section, we will make the analysis of UMBEL using these measures, and we will discuss the results.
This section uses math notations. It could be skipped, but I suggest to try to take time some time to understand each measure since it will help to understand the analysis.
In this blog post, a graph is represented as where is the graph, is the set of all the vertices and is the set of all the edges of the same type that relates vertices.
The UMBEL analysis focuses on one of the following transitive
properties:Â
rdfs:subClassOf
,Â umbel:superClassOf
,Â skos:broaderTransitive
,
skos:narrowerTransitive
and rdf:type
.
When we do perform the analysis, we are picking-up a subgraph
that is composed of all the connections between the vertices that
are linked by thisÂ edge (property).
The first basic measure is the density
of a graph. The density measures how many edges are in set
compared to the maximum possible number
of edges between vertices in set . The density is measured with:
where is the density
, is the number of properties
(edges)
and is the number of classes
(vertices).
The density is a ratio of the number of edges that exists, and the number of edges that could exists in the graph. is the number of edges in the graph and is the number of possible maximum number of edges.
The maximum density is 1, and the minimum density is 0. The density of a graph gives us an idea about the number of connections that exists between the vertices.
The degree
of a vertex is the number of edges that connect that vertex to
other vertices. The average degree of a graph is another measure of how many edges are
in set compared to number of vertices in set
.
where is the average degree
,
is the number of properties
(edges)
and is the number of classes
(vertices).
This measure tells the average number of nodes to which any given node is connected.
The diameter
of a graph is the longest shortest path between two
vertices in the graph. This means that this is the longest path
that excludes all detours, loops, etc. between two vertices.
Let be the length of the shortest path between and . And . The diameter of the graph is defined as:
This metric gives us an assessment of the size of the graph. It is useful to understand the kind of graph we are playing with. We will also relate it with the average path length to assess the span of the graph and the distribution of path lengths.
The average path length is the average of the shortest path length, averaged over all pairs of vertices. Let be the length of the shortest path between and . And .
where, is the number of vertices in the graph ; where, is the number of pairs of distinct vertices. Note that the number of pairs of distinct vertices is equal to the number of shortest paths between all pairs of vertices if we pick just one in case of a tie (two shortest paths with the same length).
In the context of ontology analysis, I would compare this metric as the speed of the ontology. What I mean by that is that one of the main tasks we do with an ontology is to infer new facts from known facts. Many inferring activities requires traversing the graph of an ontology. This means that the smaller the average path length between two classes, the more performant these inferencing activities should be.
The local clustering coefficient quantifies how well connected the neighborhood vertices of a given vertex are. It is the ratio of the edges that exists between all of the neighborhood vertices of a given vertex and the maximum number of possible edges between these same neighborhood vertices.
where is the number of neighborhood vertices, is the maximum number of edges between the neighborhood vertices and is the set of all the neighborhood vertices for a given vertex .
The local clustering coefficient is represented by the sum of the clustering coefficient of all the vertices of a graph divided by the number of vertices in . It is given by:
Betweenness centrality is a measure of importance of a node in a graph. It is represented by the number of times a node participates in the shortest path between other nodes. If a node participates in the shortest path of multiple other nodes, then it means that it is more important than other nodes in the graph. It acts like a conduit.
where is the total number of shortest paths from node to node and is the number of those paths that pass through .
In the context of ontology analysis, the betweenness centrality
will tell us which of the classes that participates the more often
in the shortest paths of a given transitive property between other
classes. This measure is interesting to help us understand how a
subgraph is constructed. For example, if we take a transitive
property such as rdfs:subClassOf,
then the graphs
generated by a subgraph composed of this relationship only should
be more hierarchic by the semantic nature of the property. This
means that the nodes (classes) with the highest betweenness
centrality value should be classes that participate in the upper
portion of the ontology (the more general classes). However, if we
think about the foaf:knows
transitive property between
named individuals, then the results should be quite different and
suggest a different kind of graph.
Now that we have a good understanding of some core graph
analysis measures, we will use them to analyze the graph of
relationship between the UMBEL classes and reference concepts using
the subgraphs generated by the properties:
rdfs:subClassOf
, umbel:superClassOf
,
skos:broaderTransitive
,
skos:narrowerTransitive
and rdf:type
.
The maximum number of edges in UMBEL is: which is about two thirds of a billion of edges. This is quite a lot of edges, but it is important to keep in mind since most of the following ratios are based on this maximum number of edges (connections) between the nodes of the UMBEL graph.
Here is the table that shows the density of each subgraph generated by each property:
Class view | Individual view | ||||
Metric | sub class of | super class of | broader | narrower | type |
Number of edges | 39 410 | 116 792 | 36 016 | 36 322 | 271 810 |
Density | 0.0000567 | 0.0001402 | 0.0000518 | 0.0000523 | 0.0001021 |
As you can see, the density of every UMBEL subgraph is very low, considering that the maximum possible density is 1. This gives us a picture of the structure: for any of the analyzed properties, most concepts have only a few connections to other concepts.
This makes sense, since a conceptual structure is meant to model relationships between concepts that represent things in the real world, and the concepts we create are far from being connected to every other concept.
This is actually what we want: a conceptual structure with a very low density, which suggests that the concepts are unambiguously related and hierarchized.
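The density figures above follow the standard directed-graph formula: edges divided by the maximum possible number of edges, n × (n − 1). A minimal sketch (using networkx purely for illustration; the post does not say which tooling produced these numbers):

```python
import networkx as nx

# Toy directed graph: 3 nodes, 3 edges out of a possible 3 * 2 = 6.
g = nx.DiGraph([("A", "B"), ("B", "C"), ("A", "C")])

n = g.number_of_nodes()
e = g.number_of_edges()
density = e / (n * (n - 1))        # 3 / 6 = 0.5
assert density == nx.density(g)    # networkx uses the same formula
```

Plugging in the table's figures, 39 410 / (26 345 × 26 344) ≈ 0.0000568, in line with the reported 0.0000567.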
Having a high density (say 0.01, which would mean nearly 7 million connections between the 26,345 concepts for a given property) may suggest that the concepts are too highly connected, in which case using the UMBEL ontology for tagging, classifying and reasoning over its concepts would not be an optimal choice, given the nature of the structure and its connectivity (and possible lack of hierarchy).
The average degree shows the average number of UMBEL nodes connected to any given node for each of the given properties.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 39 406 | 97 323 | 36 016 | 36 322 | 70 921 |
| Number of vertices | 26 345 | 26 345 | 26 345 | 26 345 | 26 345 |
| Average degree | 1.4957 | 3.6941 | 1.3670 | 1.3787 | 2.6920 |
As you can see, the numbers are quite low: from 1.37 to 3.69. This is consistent with what we saw with the density measure above, and helps confirm our assumption that all of these properties create mostly hierarchical subgraphs.
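The average degree as used here is simply the edge count divided by the vertex count. A quick sketch reusing the figures from the table above:

```python
# Figures copied from the table above; average degree = edges / vertices.
vertices = 26345
edges = {"sub class of": 39406, "super class of": 97323,
         "broader": 36016, "narrower": 36322, "type": 70921}

avg_degree = {prop: count / vertices for prop, count in edges.items()}
# e.g. avg_degree["sub class of"] is about 1.4957
```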
However, there seems to be one anomaly in these results: the
average degree of 3.69 for the umbel:superClassOf
property. Intuitively, its degree should be close to that of
rdfs:subClassOf, but it is more than twice as high. Looking
at the OWL serialization of UMBEL version 1.05 reveals that
most umbel:RefConcept instances have these three triples:
umbel:superClassOf skos:Collection ,
skos:ConceptScheme ,
skos:OrderedCollection .
It makes no sense for the umbel:RefConcept instances to be
super classes of these skos classes. I suspect that
this was introduced via punning at some point in the
history of UMBEL and went unnoticed until today. This issue will be
fixed in an upcoming maintenance version of UMBEL.
If we go back to the density measure of the graph,
we notice a density of 0.0001402 for the
umbel:superClassOf property versus 0.0000567
for the rdfs:subClassOf property: about the same
ratio (roughly 2.5 to 1) as the average degrees. So we could have
noticed the same anomaly by taking a closer look at the density
measure.
In any case, this shows how this kind of graph analysis can be used to find such issues in Big Structures (structures too big to find all these issues by scrolling through the code alone).
The diameter of the UMBEL graph is like a worst-case
scenario: it tells us the longest shortest path for
a given property's subgraph.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Diameter | 19 | 19 | 19 | 19 | 4 |
In this case, this tells us that the longest shortest path
between two given nodes for the rdfs:subClassOf
property is 19. So in the worst case, if we infer something between
these two nodes, our algorithms will require at most 19 steps
(think in terms of a breadth-first search, etc.).
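For readers who want to reproduce this kind of measure, here is a minimal sketch on a toy graph (networkx assumed; the post's own tooling is unspecified). The diameter is the longest of all shortest paths, each of which a breadth-first search finds:

```python
import networkx as nx

# A 5-node chain 0-1-2-3-4: the longest shortest path runs end to end.
g = nx.path_graph(5)

assert nx.diameter(g) == 4                             # worst case: 4 steps
assert nx.shortest_path(g, 0, 4) == [0, 1, 2, 3, 4]    # found via BFS
```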
The path length distribution shows the number of paths
of each length x. Because of the nature of UMBEL
and its relationships, I think we should expect a normal
distribution. An interesting observation is that the
average path length is situated around 6, the famous six degrees
of separation.
We saw that the "worst-case scenario" was a shortest
path of 19 for all the properties except rdf:type. Now
we know that the average is around 6.
Here we can notice an anomaly in the expected normal
distribution of the path lengths. Considering the other analyses we
did, we can assume that the anomaly is related to the
umbel:superClassOf issue we found. We will have
to re-check this metric once we fix the issue; I expect we will see
a return to a normal distribution.
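A path length distribution like the ones discussed here can be computed as follows, again on a toy graph and with networkx as an assumed toolkit:

```python
import networkx as nx
from collections import Counter

g = nx.path_graph(5)   # chain 0-1-2-3-4

# Shortest-path length for every unordered pair of distinct nodes.
lengths = [d
           for src, dists in nx.all_pairs_shortest_path_length(g)
           for tgt, d in dists.items()
           if tgt > src]

histogram = Counter(lengths)            # how many paths of each length
average = sum(lengths) / len(lengths)   # the average path length
```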
The average local clustering coefficient will tell us how clustered the UMBEL subgraphs are.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Average local clustering coefficient | 0.0001201 | 0.00000388 | 0.03004554 | 0.00094251 | 0.77191429 |
As we can see from these small clustering coefficients, UMBEL does not have the small-world effect for most of the properties we are looking at. This means there is no large number of hubs in the network, and so the number of steps to go from one class or reference concept to another is higher than in other kinds of networks, such as airport networks. This is consistent with the average path length and the path length distributions we observed above.
At the same time, this is the nature of the UMBEL graph: it is meant to be a well-structured set of concepts with multiple layers of specificity.
To understand its nature, we can consider the robustness of the network. Networks with high average clustering coefficients are known to be robust: if a random node is removed, it shouldn't much affect the average clustering coefficient or the average path length of the network. Networks that lack the small-world effect, by contrast, are considered less robust: removing a node can greatly lower the clustering coefficients and raise the average path length.
This makes sense for such a conceptual network: if we remove a concept from the structure, it will more than likely impact the connectivity of the other concepts in the network.
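For reference, the local clustering coefficient of a node is the fraction of its neighbours' possible pairwise links that actually exist. A small illustrative sketch (networkx assumed):

```python
import networkx as nx

# A triangle (0, 1, 2) with a pendant node 3 hanging off node 2.
g = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Node 2's neighbours are {0, 1, 3}: of the 3 possible links among
# them only 0-1 exists, so node 2's local coefficient is 1/3.
c2 = nx.clustering(g, 2)
avg = nx.average_clustering(g)   # mean of the per-node coefficients
```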
One interesting thing to notice is the clustering coefficient of
0.03 for the skos:broaderTransitive
property versus 0.00012 for the
rdfs:subClassOf property. I have no explanation for
this discrepancy at the moment, but it should be investigated
after the fix as well, since intuitively these two coefficients
should be close.
As I said above, in the context of ontology analysis the betweenness centrality tells us which of the classes participates most often in the shortest paths of a given transitive property between other classes. This measure is useful for understanding how a subgraph is constructed.
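As a small illustration of the measure (networkx assumed, toy graph only): in a star, the hub lies on every shortest path between the leaves, so it gets the maximum betweenness, much as the general concepts do in a hierarchy:

```python
import networkx as nx

# Star with hub 0 and leaves 1..4: every leaf-to-leaf shortest
# path goes through the hub.
g = nx.star_graph(4)
bc = nx.betweenness_centrality(g)   # normalised to [0, 1]
```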
If we check the results below, we can see that all the top nodes
are nodes that we could easily classify as being part of the upper
portion of the UMBEL ontology (the general concepts). Another
interesting thing to notice is that the issue we found with the
umbel:superClassOf
property doesn’t seem to have any
impact on the betweenness centrality of this subgraph.
PartiallyTangible | 0.1853398395065167 |
HomoSapiens | 0.1184076250864128 |
EnduringThing_Localized | 0.1081317879905902 |
SpatialThing_Localized | 0.092787668995485 |
HomoGenus | 0.07956810084399618 |
You can download the full list from here
PartiallyTangible | 0.1538140444614064 |
HomoSapiens | 0.1053345447606491 |
EnduringThing_Localized | 0.08982813055659158 |
SuperType | 0.08545549956197795 |
Animal | 0.06988077993945754 |
You can download the full list from here
PartiallyTangible | 0.2051286996513612 |
EnduringThing_Localized | 0.1298479341295156 |
HomoSapiens | 0.09859060900526667 |
Person | 0.09607892589570508 |
HomoGenus | 0.08806468362881092 |
You can download the full list from here
PartiallyTangible | 0.2064665713447941 |
EnduringThing_Localized | 0.1294085106511612 |
HomoSapiens | 0.1019893654646639 |
Person | 0.09366129700219764 |
HomoGenus | 0.09047063411213117 |
You can download the full list from here
owl:Class | 0.3452564052004049 |
RefConcept | 0.3424169525837704 |
ExistingObjectType | 0.0856962574437039 |
ObjectType | 0.01990245954437302 |
TemporalStuffType | 0.01716817183946576 |
You can download the full list from here
Now, let's push the analysis further. Remember that all the
properties we are analyzing in this blog post are transitive? Let's
perform exactly the same metrics analysis, but this time using the
transitive closure of the subgraphs.
The transitivity characteristic of a property is simple to understand. Consider a tiny graph where a transitive property p links A to B and B to C. Since p is transitive, there is also a relationship between A and C: given A p B and B p C, we infer A p C using the transitive relation.
Now, let's use the power of these transitive properties and
analyze the transitive closure of the subgraphs
that we are using to compute the metrics. The transitive
closure is simple to understand: from the input subgraph, we
generate a new graph in which all the transitive relations are
made explicit.
Let's illustrate this with a small chain such as A → B → C: the transitive closure would create a new graph that also contains the inferred edge A → C.
This is exactly what we will be doing with the sub-graphs created by the properties we are analyzing in this blog post. The end result is that we will be analyzing a graph with many more edges than we previously had with the non transitive closure versions of the subgraphs.
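The operation itself can be sketched in a few lines (networkx assumed; the chain A → B → C → D is an illustrative toy graph):

```python
import networkx as nx

# Asserted chain: A -> B -> C -> D
g = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "D")])

# The closure adds the inferred edges A->C, A->D and B->D,
# giving 6 edges in total.
tc = nx.transitive_closure(g)
```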
What we will analyze now is the impact of considering the transitive closure upon the ontology metrics analysis.
Remember that the maximum number of edges in UMBEL is 26,345 × 26,344 = 694,032,680, which is about two thirds of a billion edges.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 789 814 | 922 328 | 674 061 | 661 629 | 76 074 |
| Density | 0.0011380 | 0.0013289 | 0.0009712 | 0.0009533 | 0.0001096 |
As we can see, we have many more edges now with the transitive closure. The density of the graph is higher as well since we inferred new relationships between nodes from the transitive nature of the properties. However it is still low considering the number of possible edges between all nodes of the UMBEL graph.
We now see the impact of the transitive closure on the average degree of the subgraphs: each node is now connected, on average, to 25 to 35 other nodes for the class-view properties.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Number of edges | 789 814 | 922 328 | 674 061 | 661 629 | 76 074 |
| Number of vertices | 26 345 | 26 345 | 26 345 | 26 345 | 26 345 |
| Average degree | 29.97965 | 35.00960 | 25.58591 | 25.11402 | 2.887606 |
One interesting fact is that the anomaly nearly disappears in the
transitive closure subgraph for the umbel:superClassOf
property. There is still a glitch, but I don't think it
would raise suspicion at first. This is important to note: we
would not have noticed this issue in the current version of the UMBEL
ontology had we analyzed only the transitive closure of the
subgraph.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Diameter | 2 | 2 | 2 | 2 | 2 |
As expected, the diameter of any of the transitive closure
subgraphs is 2, since we made explicit a fact (an edge)
between pairs of nodes that were only implicitly connected at first.
This is good, but not very useful from the perspective of
ontology analysis. It would only be informative if the number were
not 2, which would suggest errors in the way the diameter of the
graph was computed.
However, what we can see here is that the speed of the
ontology (as defined in the Average Path Length
section above) is greatly improved. Since we forward-chained
the facts in the transitive closure subgraphs, knowing whether a
class A is a sub-class of a class B is
much faster: a single lookup instead of an average of 6 for the
non-closure version of the subgraphs.
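To make the point concrete, here is a sketch of that single-lookup check, with a toy closure over hypothetical classes A to D:

```python
# Toy closure for a chain A -> B -> C -> D, stored as a set of pairs.
closure = {("A", "B"), ("A", "C"), ("A", "D"),
           ("B", "C"), ("B", "D"), ("C", "D")}

def is_subclass_of(a, b):
    """One set-membership lookup instead of a multi-step traversal."""
    return (a, b) in closure
```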
All of the path length distributions will be the same as this one:
since the diameter is 2, we have a great many paths of length 2, and
26 345 paths of length 1. This is not really helpful from the
standpoint of ontology analysis.
| Metric | sub class of | super class of | broader | narrower | type |
|---|---|---|---|---|---|
| Average local clustering coefficient | 0.3281777 | 0.0604906 | 0.2704963 | 0.0138592 | 0.7590267 |
Some of the properties, such as rdfs:subClassOf,
show a much stronger coefficient than in the non-closure version.
This is normal, since all the nodes are now connected to the other
nodes further down their paths. So if an intermediate node
disappears, it won't affect the connectivity of the subgraph,
because all the linkage that got inferred remains.
This analysis also suggests that the transitive closure versions of the subgraphs are much more robust (which makes sense too).
However, I don't think this metric is that important a characteristic to check when analyzing reference ontologies, since they do not need to be robust. They are not airport or telephone networks that must cope with disappearing nodes.
What the betweenness centrality measure does with the transitive
closure of the subgraphs is highlight the real top concepts
of the ontology, like Thing, Class,
Individual, etc. Like most of the other measures, it
blurs the details of the structure (which is not necessarily a good
thing).
SuperType | 0.03317996389023238 |
AbstractLevel | 0.02810408526564482 |
Thing | 0.02772171675862925 |
Individual | 0.02747482318621853 |
TopicsCategories | 0.02638342698407473 |
You can download the full list from here
The interesting thing here is that this measure actually surfaces the very concepts for which we discovered an issue above.
skos:Concept | 0.02849203320293865 |
owl:Thing | 0.02849094898994718 |
SuperType | 0.02848878056396423 |
skos:ConceptScheme | 0.02825567477079737 |
skos:OrderedCollection | 0.02825567477079737 |
You can download the full list from here
Thing | 0.03225428380683926 |
Individual | 0.03196350419108375 |
Location_Underspecified | 0.0261924189600178 |
Region_Underspecified | 0.02618796825161338 |
TemporalThing | 0.02594169571990208 |
You can download the full list from here
Thing | 0.03298126713602109 |
Location_Underspecified | 0.02680852092899528 |
Region_Underspecified | 0.02680398659044947 |
TemporalThing | 0.02648809433842489 |
SomethingExisting | 0.02582305801837314 |
You can download the full list from here
owl:Class | 0.3452564052004049 |
RefConcept | 0.342338078899975 |
ExistingObjectType | 0.0856962574437039 |
ObjectType | 0.01990245954437302 |
TemporalStuffType | 0.01716817183946576 |
You can download the full list from here
This blog post shows that simple graph analysis metrics applied to Big Structures can be quite helpful for understanding their nature, how they were constructed, their size, their impact on the algorithms that use them, and for finding potential issues in the structure.
One thing we found is that the properties
rdfs:subClassOf and skos:broaderTransitive
behave nearly identically: they have almost the same values for
every metric. If you were new to the UMBEL ontology, you wouldn't
have known this without doing this kind of analysis, or without
spending much time looking at the serialized OWL file. It doesn't
tell us anything about how semantically similar the relations are,
but it does tell us that they have the same impact on the
ontology's graph structure.
Performing this analysis also led us to discover a few anomalies
with the umbel:superClassOf property, suggesting an
issue with the current version of the ontology. This issue would
have been hard to notice, and to understand, without performing such
a graph analysis on the structure.
However, I also had the intuition that analyzing the transitive closure of the subgraphs would lead to more interesting results. At best that analysis confirmed a few things; in most cases it only blurred the specificities of the metrics.
These analysis metrics will soon be made available as standard Web services, so that they may be applied against any arbitrary graph or ontology.
Sratom 0.4.6 is out. Sratom is a small C library for serialising LV2 atoms to/from Turtle.
Changes:
Sord 0.12.2 is out. Sord is a lightweight C library for storing RDF statements in memory.
Changes:
Serd 0.20.0 is out. Serd is a lightweight, high-performance, dependency-free C library for RDF syntax which supports reading and writing Turtle and NTriples.
Changes:
Second, there is also another excellent and interesting series of workshops.
| Date | Title | Hosts | Room 09:00 – 12:30 | Room 14:00 – 17:30 | Website |
|---|---|---|---|---|---|
| 01.09.2014 | Link Discovery of the Web of Data (Organized by GeoKnow & LinkingLOD) | Axel Ngonga (Uni Leipzig) | yes | - | Website |
| 01.09.2014 | GeoLD – Geospatial Linked Data (organised by the GeoKnow Project) | Jens Lehmann (Uni Leipzig), Daniel Hladky (Ontos), Andreas Both (Unister) | yes | yes | Website |
| 01.09.2014 | MLODE 2014 – Content Analysis and the Semantic Web, a LIDER Hackathon | Bettina Klimek (Uni Leipzig), Philipp Cimiano (Uni Bielefeld) | yes | yes | Website |
| 02.09.2014 | MLODE 2014 – Multilingual Linked Open Data for Enterprises, LIDER Roadmapping workshop | Bettina Klimek (Uni Leipzig), Philipp Cimiano (Uni Bielefeld) | yes | yes | Website |
| 02.09.2014 | MLODE 2014 – Community meetings and break-out sessions | Bettina Klimek (Uni Leipzig), Philipp Cimiano (Uni Bielefeld) | - | yes | Website |
Additionally, there are also interesting workshops, for example Link Discovery of the Web of Data (organized by GeoKnow & LinkingLOD), hosted by Axel Ngonga-Ngomo.
We just released a new UMBEL ontology graph analysis web service endpoint: the Shortest Path web service endpoint.
The Shortest Path web service returns the
shortest path between two UMBEL reference concepts, following the
path of a transitive property. The concepts that belong to that
path are returned by the server.
This web service is similar to the degree web
service endpoint, but it also shows the actual path, which makes it
marginally more useful than degree. If
you don't need to know the actual concepts that participate in the
shortest path between two concepts, you should use the
degree web service endpoint instead.
The graph created by the UMBEL reference concepts ontology is mostly a directed acyclic graph (DAG). This means that a given pair of concepts is not necessarily linked via every property. In those cases, the shortest path web service returns an error message rather than the path concepts.
This new web service endpoint is intended for users who want to perform graph/network analysis tasks on the UMBEL structure.
The web service endpoint is freely available. It can return its resultset in JSON or in EDN (Extensible Data Notation).
This endpoint returns a vector (so the order of the results is significant) of the concepts that participate in the shortest path. For each concept, its URI and preferred label are returned.
We also provide an online shortest path tool that people can use to experience interacting with the web service.
The user first selects the two concepts between which to find the shortest path, then selects the transitive property to use to find the path.
Once the user clicks the Get Shortest Path button,
they get the ordered list of concepts that compose the path.
If no path exists between the two concepts for the selected property, an error message is displayed to the user.
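The behaviour described above, a path when one exists and an error otherwise, can be sketched as follows (networkx and the concept names are illustrative assumptions, not the endpoint's actual implementation):

```python
import networkx as nx

# Hypothetical mini-hierarchy; the concept names are made up.
g = nx.DiGraph([("Mammal", "Animal"), ("Animal", "LivingThing")])

def shortest_path_or_error(src, dst):
    try:
        return nx.shortest_path(g, src, dst)
    except nx.NetworkXNoPath:
        return "error: no path"   # mirrors the endpoint's error response
```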
Another improvement included with this release is an enhancement of the UMBEL taggers^{1}^{2}: it is now possible to tag any document accessible on the Web. The only thing you have to do is provide a URL where the tagger will find the document to download and tag.
The user interface for the taggers was also modified to expose this new functionality. You now have the choice of giving either a text or a URL as input to the endpoints.
In mid-July 2014 I attended the DCO summer school at Big Sky Resort, MT, with a 2-day field trip to Yellowstone National Park (YNP). It was a great experience: the venue is wonderful, and so were the topics covered by the curriculum. But what impressed me the most was seeing how the Web brings changes to geoscience work as well as to geoscientists.
We had three excellent field trip guides, Lisa Morgan, Pat Shanks and Bill Inskeep, who prepared and distributed an 82-page YNP field trip guide! Of course, they first shared it online through Dropbox. What also impressed me is that when I showed my golden spike information portal to Lisa, she showed me a few apps on her iPhone with state geologic map services, a useful gadget for field work. But our field trip experience in YNP showed that a paper map is still necessary: it is bigger, provides an overview of a wider area, and needs no battery.
The YNP itself has a virtual observatory website called Yellowstone Volcano Observatory, hosted by USGS and University of Utah. The portal provides “timely monitoring and hazard assessment of volcanic, hydrothermal, and earthquake activity in the Yellowstone Plateau region.” Featured information includes publications, online mapping services, and also images, videos and webcams about YNP.
I was happy to see that Katie Pratt and I were joined by many other summer school participants in tweeting on Twitter. Search the hashtag #DCOSS14 and you will see how active the participants were during the summer school. I was even a little surprised to see that Donato Giovannelli (@d_giovannelli) helped answer a question about Twitter's impact on citation by pasting the link to a paper, a few seconds after I gave a short introduction to Altmetric.com and its use at Nature Publishing Group, Springer and Wiley.
My role at the summer school was two-fold: participant and lecturer. I gave a presentation titled 'Why data science matters and what we can do with it', in which I addressed four sub-topics: data management and publication, interoperability of data, provenance of research, and the era of Science 2.0. The slides are accessible on Slideshare [link].
Last week, a tweet from
ApacheCon brings together the open source community to learn about and collaborate on the technologies and projects driving the future of open source, big data and cloud computing. Apache projects have been, and continue to be, hugely influential in software development across a plethora of categories, from content, databases and servers to big data, cloud, mobile and virtual machines.
The developers, programmers, committers and users driving this innovation and utilising these tools will meet in Budapest on November 17-19, for collaboration, education and community building.
In recent years Linked Data has become an important topic in the Apache Software Foundation, with projects such as Jena, Marmotta, Stanbol, Clerezza and Any23. Redlink supports the event by co-chairing a dedicated track on Linked Data. The track aims to be a place where all these projects can meet to explore synergies across the different projects and developers. It is also particularly interesting for us to connect with other data-intensive projects to discuss their approaches to Semantic Web technologies.
Last week the Apache Software Foundation officially announced the schedule. The programme has many interesting technical talks. Here is where you can meet some Redlinkers presenting our technology:
Looking forward to meeting you in Budapest this coming November!
Press release, Salzburg, Austria – July 24, 2014
Redlink is now a supporter member of the Open Data Institute. As an innovative startup in the enterprise linked data sector, Redlink brings the value of semantic processing and linked data services, built on free and open-source software and delivered as a platform-as-a-service, to a wider audience of developers, public institutions and IT integrators. This membership represents a major step for Redlink in promoting open data culture in Europe and is an integral part of our ongoing work as technology enablers.
Founded by Sir Tim Berners-Lee and Professor Sir Nigel Shadbolt, and opened in December 2012, the ODI is an independent, non-profit, non-partisan company limited by guarantee. With a 5,000 sq ft convening space in the heart of London's thriving Shoreditch area, and a global remit, the ODI works to catalyse an open data culture that creates economic, environmental, and social value. The ODI helps unlock supply, generates demand, and creates and disseminates knowledge to address local and global issues.
Gavin Starks, ODI CEO: “In joining the ODI, Redlink is showing leadership in its sector, recognising the social, economic and environmental potential of open data. More than 70 pioneering member companies have now joined the ODI to deliver new products and services and create value for business, and society”.
Redlink was born in March 2013 from the core committers of Apache Marmotta and Apache Stanbol to democratise semantic technologies and to help organisations take full advantage of linked data made publicly available by governments for structuring any form of unstructured data.
John Pereira, Redlink CEO: “We have come a long way in understanding the importance of freeing data from legacy and proprietary formats; the results are clear from the many initiatives and available open datasets. Now we need to demonstrate the business value. At Redlink, our contribution is to simplify the use of semantic processing and linked data technology to power the new generation of exciting linked-data-driven applications.”
About Redlink
Redlink GmbH (http://redlink.co), headquartered in Austria, helps enterprises make sense of their data by semantically enriching, linking and searching the vast amounts of unstructured data. Redlink is the company behind the open source projects Apache Stanbol and Apache Marmotta, and is committed to the wide adoption of open source semantic technologies to support a broad set of mission-critical and real-time production uses.
Media Contacts
John Pereira
Redlink GmbH
+43 660 277 1228
john.pereira@redlink.co
Andrea Volpini
Redlink GmbH
+39 348 761 7242
andrea.volpini@redlink.co
Emma Thwaites
The ODI Communications Team
emma@theodi.org
On Monday, July 28, in room P702 at 3.00 p.m., Edgard Marx will present a proposal for a question answering system. He has a computer science background (BSc and MSc in Computer Science, PUC-Rio) and is a member of AKSW (Agile Knowledge Engineering and Semantic Web). Edgard has been engaged in Semantic Web technology research since 2010 and works mainly on evangelisation and the development of conversion and mapping tools.
The use of Semantic Web technologies has led to an increasing amount of structured data being published on the Web. Despite advances in question answering systems, retrieving and presenting the desired information from RDF data sources remains substantially challenging. In this talk we will present our proposal and a working draft to address these challenges.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Mike Bergman just published the second part^{1} of his series of blog posts that summarize the evolution of the Semantic Web in the last decade, and how our experience of the last 7 years of research in that field has led to these observations.
The second part of that series is: Big Structure: At The Nexus of Knowledge Bases, the Semantic Web and Artificial Intelligence.
He continues to outline some issues with the Semantic Web, but more importantly how it fits into a much broader ecosystem, namely KBAI (Knowledge-Based AI). He explains the difference between data integration and data interoperability, and how these problems could benefit from leveraging the subset of the Artificial Intelligence domain related to data interoperability.
We welcome hearing from you!