Planet RDF

It's triples all the way down

October 25

Frederick Giasson: Create a Domain Text Classifier Using Cognonto

A common task required by systems that automatically analyze text is to classify an input text into one or multiple classes. A model needs to be created to scope the class (what belongs to it and what does not) and then a classification algorithm uses this model to classify an input text.

Multiple classification algorithms exist to perform such a task: Support Vector Machine (SVM), K-Nearest Neighbours (KNN), C4.5 and others. What is hard with any such text classification task is not so much how to use these algorithms: they are generally easy to configure and use once implemented in a programming language. The hard – and time-consuming – part is to create a sound training corpus that will properly define the class you want to predict. Further, the steps required to create such a training corpus must be duplicated for each class you want to predict.

Since creating the training corpus is what is time consuming, this is where Cognonto provides its advantages.

In this article, we will show you how Cognonto’s KBpedia Knowledge Graph can be used to automatically generate training corpuses that are used to generate classification models. First, we define (scope) a domain with one or multiple KBpedia reference concepts. Second, we aggregate the training corpus for that domain using the KBpedia Knowledge Graph and its linkages to external public datasets that are then used to populate the training corpus of the domain. Third, we use the Explicit Semantic Analysis (ESA) algorithm to create a vectorial representation of the training corpus. Fourth, we create a model using (in this use case) an SVM classifier. Finally, we predict if an input text belongs to the class (scoped domain) or not.

This use case can be used in any workflow that needs to pre-process any set of input texts where the objective is to classify relevant ones into a defined domain.

Unlike more traditional topic taggers where topics are tagged in an input text with weights provided for each of them, we will see how it is possible to use the semantic interpreter to tag main concepts related to an input text even if the surface form of the topic is not mentioned in the text. We accomplish this by leveraging ESA’s semantic interpreter.

General and Specific Domains

In this article, two concepts are at the center of everything: what I call the general domain and the specific domain(s). What I call the general domain can be seen as the set of all specific domains. It includes the set of classes that generally define common things of the World. What we call a specific domain is one or multiple classes that scope a domain of interest. A specific domain is a subset of classes of the general domain.

In Cognonto, the general domain is defined by all the ~39,000 KBpedia reference concepts. A specific domain is any sub-set of the ~39,000 KBpedia reference concepts that adequately scopes a domain of interest.

The purpose of this use case is to show how we can determine if an input text belongs to a specific domain of interest. What we have to do is to create two training corpuses: one that defines the general domain, and one that defines the specific domain. However, how do we go about defining these corpuses? One way would be to do this manually, but it would take an awful lot of time to do.

This is the crux of the matter: we will generate the general domain corpus and specific domain ones automatically using the KBpedia Knowledge Graph and all of its linkages to external public datasets. The time and resources thus saved from creating the training corpuses can be spent testing different classification algorithms, tweaking their parameters, evaluating them, etc.

What is so powerful in leveraging the KBpedia Knowledge Graph in this manner is that we can generate training sets for all kinds of domains of interest automatically.

Training Corpuses

The first step is to define the training corpuses that we will use to create the semantic interpreter and the SVM classification models. We have to create the general domain training corpus and the domain-specific training corpus. The example domain I have chosen for this use case is scoped by the ideas of Music, Musicians, Music Records, Musical Groups, Musical Instruments, etc.

Define The General Training Corpus

The general training corpus is quite easy to create. The only thing I have to do is to query the KBpedia Knowledge Graph to get all the Wikipedia pages linked to all the KBpedia reference concepts. These pages will become the general training corpus.

Note that in this article I will only use the linkages to the Wikipedia dataset, but I could also use any other datasets that are linked to the KBpedia Knowledge Graph in exactly the same way. Here is how we aggregate all the documents that will belong to a training corpus:

Note that all I need to do is to use the KBpedia structure, query it, and then write the general corpus into a CSV file. This CSV file will be used later for most of the subsequent tasks.

(define-general-corpus "resources/kbpedia_reference_concepts_linkage.n3" "resources/general-corpus-dictionary.csv")

Define The Specific Domain Training Corpus

The next step is to define the training corpus of the specific domain for this use case: the music domain. To do so, I merely need to search KBpedia to find all the reference concepts I am interested in that will scope my music domain. These domain-specific KBpedia reference concepts will be the features of the SVM models we will test below.

What the define-domain-corpus function does below is simply to query KBpedia to get all the Wikipedia articles related to these concepts and their sub-classes, and to create the training corpus from them.

In this article we only define a binary classifier. However, if we wanted to create a multi-class classifier, we would have to define multiple specific domain training corpuses in exactly the same way. The only time we would have to spend is to search KBpedia (using the Cognonto user interface) to find the reference concepts we want to use to scope the domains we want to define. We will show how quickly this can be done, with impressive results, in a later use case.

(define-domain-corpus [""

Create Training Corpuses

Once the training corpuses are defined, we want to cache them locally to be able to play with them, without having to re-download them from the Web or re-create them each time.
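The caching code itself is not shown here. A minimal sketch of what such a step could look like, assuming the same io, csv and md5 helpers used later in the evaluation code, and assuming the dictionary CSV stores each page URL in its second column, is:

;; Hypothetical sketch only: fetch each page listed in a corpus dictionary and
;; save its raw HTML locally, keyed by the MD5 hash of its URL.
(defn cache-corpus-pages
  [dictionary-file cache-folder]
  (with-open [in-file (io/reader dictionary-file)]
    (doseq [row (rest (doall (csv/read-csv in-file)))]
      (let [url (second row)
            cache-file (str cache-folder (md5 url) ".html")]
        (when-not (.exists (io/as-file cache-file))
          (spit cache-file (slurp url)))))))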


The cache is composed of 24,374 Wikipedia pages, which is about 2G of raw data. However, we have some more processing to perform on the raw Wikipedia pages, since what we ultimately want is a set of relevant tokens (words) that will be used to calculate the values of the features of our model using the ESA semantic interpreter. Since we may want to experiment with different normalization rules, we re-write each document of the corpus into another folder that we can re-create as required if the normalization rules change in the future. We can quickly re-process these input files and save them in separate folders for testing and comparative purposes.

The normalization steps performed by this function are to:

  1. Defluff the raw HTML page. We convert the HTML into text, and we only keep the body of the page
  2. Normalize the text with the following rules:
    1. remove diacritical characters
    2. remove everything between brackets like: [edit] [show]
    3. remove punctuation
    4. remove all numbers
    5. remove all invisible control characters
    6. remove all [math] symbols
    7. remove all words with 2 characters or fewer
    8. remove line and paragraph separators
    9. remove anything that is not an alpha character
    10. normalize spaces
    11. put everything in lower case, and
    12. remove stop words.

Normalization steps could be dropped or others included, but these are the standard ones Cognonto applies in its baseline configuration.
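As a rough illustration only, a per-document normalization function applying a few of the rules above (brackets, non-alpha characters, short words, stop words) could look like the following sketch. The exact regexes and the stop-word set are assumptions, not Cognonto's implementation:

(require '[clojure.string :as string])

;; Sketch of a few of the normalization rules listed above; stop-words is
;; assumed to be a set of lower-case words.
(defn normalize-text
  [text stop-words]
  (let [cleaned (-> text
                    (string/replace #"\[[^\]]*\]" " ") ; remove [edit] [show] style brackets
                    (string/replace #"[^A-Za-z]" " ")  ; keep only alpha characters
                    string/lower-case
                    (string/replace #"\s+" " ")        ; normalize spaces
                    string/trim)]
    (->> (string/split cleaned #" ")
         (remove #(<= (count %) 2))                    ; drop words with 2 characters or fewer
         (remove stop-words)                           ; drop stop words
         (string/join " "))))

For example, (normalize-text "Some [edit] Text about Music" #{"about"}) returns "some text music".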

(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")

After cleaning, the size of the cache is now 208M (instead of the initial 2G for the raw web pages).

Note that, unlike what is discussed in the original ESA research papers by Evgeniy Gabrilovich, we are not pruning any pages (the ones with fewer than X number of tokens, etc.). This could be done in a subsequent tweaking step, which our results below indicate is not really necessary.

Now that the training corpuses are created, we can build the semantic interpreter to create the vectors that will be used to train the SVM classifier.

Build Semantic Interpreter

What we want to do is to classify (determine) if an input text belongs to a class as defined by a domain. The relatedness of the input text is based on how closely the specific domain corpus is related to the general one. This classification is performed with classifiers like SVM, KNN and C4.5. However, each of these algorithms needs some kind of numerical vector upon which the actual classifier can model and classify the candidate input text. Creating this numeric vector is the job of the ESA Semantic Interpreter.

Let’s dive a little further into the Semantic Interpreter to understand how it operates. Note that you can skip the next section and continue with the following one.

How Does the Semantic Interpreter Work?

The Semantic Interpreter is a process that maps fragments of natural language into a weighted sequence of text concepts ordered by their relevance to the input.

Each concept in the domain is accompanied by a document from the KBpedia Knowledge Graph, which acts as its representative term set to capture the idea (meaning) of the concept. The overall corpus is based on the combined documents from KBpedia that match the slice retrieved from the knowledge graph based on the domain query(ies).

The corpus is composed of $n$ concepts that come from the domain ontology, associated with $n$ KBpedia Knowledge Base documents. We build a sparse matrix $T$ where each of the $n$ columns corresponds to a concept and where each of the rows corresponds to a word that occurs in the related entity documents $d_1, \ldots, d_n$. The matrix entry $T_{ij}$ is the TF-IDF value of the word $w_i$ in document $d_j$.


The TF-IDF value of a given term is calculated as:

$$T_{ij} = \mathrm{tf}_{ij} \cdot \log\frac{n}{\mathrm{df}_i}$$

where $m_j$ is the number of words in the document $d_j$ and where the term frequency is defined as:

$$\mathrm{tf}_{ij} = \frac{\mathrm{count}(w_i, d_j)}{m_j}$$

and where the document frequency $\mathrm{df}_i$ is the number of documents where the term $w_i$ appears.

Unlike the standard ESA system, pruning is not performed on the matrix to remove the least-related concepts for any given word. We are not pruning because the ontologies are highly domain specific, as opposed to really broad and general vocabularies. However, with a different mix of training text, and depending on the use case, the standard ESA model may benefit from pruning the matrix.

Once the matrix is created, we perform cosine normalization on each column:

$$T'_{ij} = \frac{T_{ij}}{\sqrt{\sum_{k} T_{kj}^2}}$$

where $T_{ij}$ is the TF-IDF weight of the word $w_i$ in the concept document $c_j$, and where the denominator is the square root of the sum of the squared TF-IDF weights of all the words in document $d_j$. This normalization removes, or at least lowers, the effect of the length of the input documents.
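To make the formulas above concrete, here is a small stand-alone sketch in Clojure of how one matrix entry and the column normalization could be computed. This is plain illustrative code, not part of Cognonto:

;; Illustrative only: one TF-IDF entry and cosine normalization of one column,
;; following the formulas above.
(defn tf-idf
  [word-count doc-length n-docs doc-freq]
  (* (double (/ word-count doc-length))
     (Math/log (double (/ n-docs doc-freq)))))

(defn cosine-normalize
  [column] ; column = sequence of TF-IDF weights for one concept document
  (let [norm (Math/sqrt (reduce + (map #(* % %) column)))]
    (map #(/ % norm) column)))

For example, (cosine-normalize [1.0 2.0 2.0]) returns (0.3333... 0.6666... 0.6666...), a column of unit Euclidean length.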

Creating the First Semantic Interpreter

The first semantic interpreter we will create is composed of the general corpus which has 24,374 Wikipedia pages and the music domain-specific corpus composed of 62 Wikipedia pages. The 62 Wikipedia pages that compose the music domain corpus come from the selected KBpedia reference concepts and their sub-classes that we defined in the Define The Specific Domain Training Corpus section above.

(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--base.csv")

(build-semantic-interpreter "base" "resources/semantic-interpreters/base/" (distinct (concat (get-domain-pages) (get-general-pages))))

Evaluating Models

Before building the SVM classifier, we have to create a gold standard that we will use to evaluate the performance of the models we will test. What I did was to aggregate a list of news feeds from the CBC and from Reuters, and then crawl each of them to get the news items they contained. Once I had aggregated them in a spreadsheet, I manually classified each of them. The result is a gold standard of 336 news pages, each classified as being related to the music domain or not. It can be downloaded from here.

Subsequently, three days later, I re-crawled the same feeds to create a second gold standard that has 345 new pages. It can be downloaded from here. I will use both to evaluate the different SVM models we will create below. (I created the two standards because of some internal tests and statistics we are compiling.)

Both gold standards were created this way:

(defn create-gold-standard-from-feeds
  [name]                                ; name of the output gold standard file
  ;; the list of CBC and Reuters feed URLs was truncated in the original listing
  (let [feeds [""]]
    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; write-csv expects a sequence of rows; the class column is left empty
          ;; and filled in manually afterwards
          (csv/write-csv out-file [["" (:title item) (:link item)]] :append true))))))

Each of the different models we will test in the next sections will be evaluated using the following function:

(defn evaluate-model
  [evaluation-no gold-standard-file]
  (let [gold-standard (rest
                       (with-open [in-file (io/reader gold-standard-file)]
                          ;; doall realizes the lazy CSV rows before the reader is closed
                          (doall (csv/read-csv in-file))))
        true-positive (atom 0)
        false-positive (atom 0)
        true-negative (atom 0)
        false-negative (atom 0)]

    (with-open [out-file (io/writer (str "resources/evaluate-" evaluation-no ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])

      (doseq [[class title url] gold-standard]
        (when-not (.exists (io/as-file (str "resources/gold-standard-cache/" (md5 url))))
          (spit (str "resources/gold-standard-cache/" (md5 url)) (slurp url)))
        (let [predicted-class (classify-text (-> (slurp (str "resources/gold-standard-cache/" (md5 url)))
                                                 ;; the page normalization applied at this point was truncated in
                                                 ;; the original listing; normalize-page-text is a hypothetical stand-in
                                                 normalize-page-text))]
          (println predicted-class " :: " title)
          (csv/write-csv out-file [[predicted-class title url]] :append true)
          (when (and (= class "1")
                     (= predicted-class 1.0))
            (swap! true-positive inc))

          (when (and (= class "0")
                     (= predicted-class 1.0))
            (swap! false-positive inc))

          (when (and (= class "0")
                     (= predicted-class 0.0))
            (swap! true-negative inc))

          (when (and (= class "1")
                     (= predicted-class 0.0))
            (swap! false-negative inc))))

      (println "True positive: " @true-positive)
      (println "false positive: " @false-positive)
      (println "True negative: " @true-negative)
      (println "False negative: " @false-negative)


      (let [precision (float (/ @true-positive (+ @true-positive @false-positive)))
            recall (float (/ @true-positive (+ @true-positive @false-negative)))]
        (println "Precision: " precision)
        (println "Recall: " recall)
        (println "Accuracy: " (float (/ (+ @true-positive @true-negative) (+ @true-positive @false-negative @false-positive @true-negative))))
        (println "F1: " (float (* 2 (/ (* precision recall) (+ precision recall)))))))))

What this function does is to calculate the number of true-positive, false-positive, true-negative and false-negative results within the gold standard by applying the current model, and then to calculate the precision, recall, accuracy and F1 metrics. You can read more about how binary classifiers can be evaluated from here.
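For reference, the four metrics printed by this function are computed as:

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$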

Build SVM Model

Now that we have numeric vector representations of the music domain, and a way to evaluate the quality of the models we will be creating, we can create and evaluate our prediction models.

The classification algorithm I chose to use for this article is the Support Vector Machine (SVM). I use the Java port of the LIBLINEAR library. Let’s create the first SVM model:

(build-svm-model-vectors "resources/svm/base/")
(train-svm-model "svm.w0" "resources/svm/base/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

This initial model is created using a training set that is composed of 24,311 documents that do not belong to the class (the music specific domain), and 62 documents that do belong to that class.

Now, let’s evaluate how this initial model performs against the two gold standards:

(evaluate-model "w0" "resources/gold-standard-1.csv" )
True positive:  5
False positive:  0
True negative:  310
False negative:  21

Precision:  1.0
Recall:  0.1923077
Accuracy:  0.9375
F1:  0.32258064
(evaluate-model "w0" "resources/gold-standard-2.csv" )
True positive:  2
False positive:  1
True negative:  319
False negative:  23

Precision:  0.6666667
Recall:  0.08
Accuracy:  0.93043476
F1:  0.14285713

Well, this first run looks to be really poor! The issue here is a common one with how the SVM classifier is being used. Ideally, the number of documents that belong to the class and the number of documents that do not belong to the class should be about the same. However, because of the way we defined the music specific domain, and because of the way we created the training corpuses, we ended up with two really unbalanced sets of training documents: 24,311 that do not belong to the class and only 63 that do belong to it. That is the reason why we are getting these kinds of poor results.

What can we do from here? We have two possibilities:

  1. We use LIBLINEAR’s weight modifier parameter to modify the weight of the terms that exist in the 63 documents that belong to the class. Because the two sets are so unbalanced, the weight should theoretically be around 386 (see the quick calculation after this list), or
  2. We add thousands of new documents that belong to the class we want to predict.
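The ~386 figure is simply the ratio of negative to positive training documents:

(float (/ 24311 63))
;; => 385.88889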

Let’s test both options. We will initially play with the weights to see how much we can improve the current situation.

Improving Performance Using Weights

What we will do now is to create a series of models that differ in the weight we assign to the positive class in the SVM process.

Weight 10

(train-svm-model "svm.w10" "resources/svm/base/"
                 :weights {1 10.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w10" "resources/gold-standard-1.csv")
True positive:  17
False positive:  1
True negative:  309
False negative:  9

Precision:  0.9444444
Recall:  0.65384614
Accuracy:  0.9702381
F1:  0.77272725
(evaluate-model "w10" "resources/gold-standard-2.csv")
True positive:  15
False positive:  2
True negative:  318
False negative:  10

Precision:  0.88235295
Recall:  0.6
Accuracy:  0.9652174
F1:  0.71428573

This is already a clear improvement for both gold standards. Let’s see if we continue to see improvements if we continue to increase the weight.

Weight 25

(train-svm-model "svm.w25" "resources/svm/base/"
                 :weights {1 25.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w25" "resources/gold-standard-1.csv")
True positive:  20
False positive:  3
True negative:  307
False negative:  6

Precision:  0.8695652
Recall:  0.7692308
Accuracy:  0.97321427
F1:  0.8163265
(evaluate-model "w25" "resources/gold-standard-2.csv")
True positive:  21
False positive:  5
True negative:  315
False negative:  4

Precision:  0.8076923
Recall:  0.84
Accuracy:  0.973913
F1:  0.82352936

The general metrics continued to improve. By increasing the weight, the precision dropped a little bit, but the recall improved quite a bit. The overall F1 score significantly improved. Let’s see what happens with a weight of 50.

Weight 50

(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w50" "resources/gold-standard-1.csv")
True positive:  23
False positive:  7
True negative:  303
False negative:  3

Precision:  0.76666665
Recall:  0.88461536
Accuracy:  0.9702381
F1:  0.82142854
(evaluate-model "w50" "resources/gold-standard-2.csv")
True positive:  23
False positive:  6
True negative:  314
False negative:  2

Precision:  0.79310346
Recall:  0.92
Accuracy:  0.9768116
F1:  0.8518519

The trend continues: a decline in precision, an increase in recall, and the overall F1 score is better in both cases. Let’s try a weight of 200.

Weight 200

(train-svm-model "svm.w200" "resources/svm/base/"
                 :weights {1 200.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w200" "resources/gold-standard-1.csv")
True positive:  23
False positive:  7
True negative:  303
False negative:  3

Precision:  0.76666665
Recall:  0.88461536
Accuracy:  0.9702381
F1:  0.82142854
(evaluate-model "w200" "resources/gold-standard-2.csv")
True positive:  23
False positive:  6
True negative:  314
False negative:  2

Precision:  0.79310346
Recall:  0.92
Accuracy:  0.9768116
F1:  0.8518519

The results are the same; it looks like increasing the weight only adds to the predictive power up to a certain point. However, the goal of this article is not to be an SVM parametrization tutorial. Many other tests could be done, such as testing different values for the different SVM parameters like the C parameter and others.

Improving Performance Using New Music Domain Documents

Now let’s see if we can improve the performance of the model even more by adding new documents that belong to the class we want to define in the SVM model. The idea of adding documents is good, but how may we quickly process thousands of new documents that belong to that class? Easy: we will use the KBpedia Knowledge Graph and its linkage to entities that exist in the KBpedia Knowledge Base to get thousands of new documents highly related to the music domain we are defining.

Here is how we will proceed. See how we use the type relationship between the classes and their individuals:

The millions of completely typed instances in KBpedia enable us to retrieve such large training sets efficiently and quickly.

Extending the Music Domain Model

To extend the music domain model, I added about 5,000 album, musician and band documents using the relationship querying strategy outlined in the figure above. In effect, I added just 3 new features, but with thousands of new training documents in the corpus.

What I had to do was to:

  1. Extend the domain pages with the new entities
  2. Cache the new entities’ Wikipedia pages
  3. Build a new semantic interpreter that takes the new documents into account, and
  4. Build a new SVM model that uses the new semantic interpreter’s output.

(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--extended.csv")

(build-semantic-interpreter "domain-extended" "resources/semantic-interpreters/domain-extended/" (distinct (concat (get-domain-pages) (get-general-pages))))

(build-svm-model-vectors "resources/svm/domain-extended/")

Evaluating the Extended Music Domain Model

Just like what we did for the first series of tests, we will now create different SVM models and evaluate them. Since we now have a nearly balanced training corpus, we will test much smaller weights (no weight, and then small weights of 2 and 5).

(train-svm-model "svm.w0" "resources/svm/domain-extended/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w0" "resources/gold-standard-1.csv")
True positive:  20
False positive:  12
True negative:  298
False negative:  6

Precision:  0.625
Recall:  0.7692308
Accuracy:  0.9464286
F1:  0.6896552
(evaluate-model "w0" "resources/gold-standard-2.csv")
True positive:  18
False positive:  17
True negative:  303
False negative:  7

Precision:  0.51428574
Recall:  0.72
Accuracy:  0.93043476
F1:  0.6

As we can see, the model is scoring much better than the previous one when the weight is zero. However, it is not as good as the previous one when weights are modified. Let’s see if we can benefit from increasing the weight for this new training set:

(train-svm-model "svm.w2" "resources/svm/domain-extended/"
                 :weights {1 2.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w2" "resources/gold-standard-1.csv")
True positive:  21
False positive:  23
True negative:  287
False negative:  5

Precision:  0.47727272
Recall:  0.8076923
Accuracy:  0.9166667
F1:  0.59999996
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive:  20
False positive:  33
True negative:  287
False negative:  5

Precision:  0.3773585
Recall:  0.8
Accuracy:  0.8898551
F1:  0.51282054

Overall the model seems worse with a weight of 2; let’s try a weight of 5:

(train-svm-model "svm.w5" "resources/svm/domain-extended/"
                 :weights {1 5.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w5" "resources/gold-standard-1.csv")
True positive:  25
False positive:  52
True negative:  258
False negative:  1

Precision:  0.32467532
Recall:  0.96153843
Accuracy:  0.8422619
F1:  0.4854369
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive:  23
False positive:  62
True negative:  258
False negative:  2

Precision:  0.27058825
Recall:  0.92
Accuracy:  0.81449276
F1:  0.41818184

The performance just gets worse. But this makes sense at the same time. Now that the training set is balanced, there are many more tokens that participate in the semantic interpreter, and so in the vectors it generates for the SVM. If we increase the weight of a balanced training set, then this intuitively should re-unbalance the training set and worsen the performance, which is apparently what is happening.

Re-balancing the training set using this strategy does not appear to improve the prediction model, at least not for this domain and not for these SVM parameters.

Improving Using Manual Features Selection

So far, we have been able to test different kinds of strategies to create different training corpuses, to select different features, etc. We have been able to do this within a day, mostly waiting for the desktop computer to build the semantic interpreter and the vectors for the training sets. This has been possible thanks to the KBpedia Knowledge Graph, which enabled us to easily and automatically slice-and-dice the knowledge structure to perform all these tests quickly and efficiently.

There are other things we could do to continue to improve the prediction model, such as manually selecting features returned by KBpedia. Then we could test different parameters of the SVM classifier, etc. However, such tweaks are the possible topics of later use cases.

Multiclass Classification

Let me add a few additional words about multiclass classification. As we saw, we can easily define domains by selecting one or multiple KBpedia reference concepts and all of their sub-classes. This general process enables us to scope any domain we want to cover. Then we can use the KBpedia Knowledge Graph’s relationship with external data sources to create the training corpus for the scoped domain. Finally, we can use SVM as a binary classifier to determine if an input text belongs to the domain or not. However, what if we want to classify an input text with more than one domain?

This can easily be done by using the one-vs-rest (also called one-vs-all) multiclass classification strategy. The only thing we have to do is to define multiple domains of interest, and then to create an SVM model for each of them. As noted above, this effort is almost solely one of posing one or more queries to KBpedia for a given domain. Finally, to predict if an input text belongs to any of the domain models we defined, we need to apply an SVM option (like LIBLINEAR) that already implements multi-class SVM classification.
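As a purely hypothetical sketch of this one-vs-rest strategy using the style of functions shown in this article (the domain names and the classify-text-with-model helper are illustrative assumptions, not part of the article's code):

;; One-vs-rest sketch: one binary SVM model per domain; an input text is tagged
;; with every domain whose model predicts the positive class.
;; classify-text-with-model is a hypothetical helper.
(def domains ["music" "sport" "politics"])

(defn classify-multiclass
  [text]
  (vec (filter (fn [domain]
                 (= 1.0 (classify-text-with-model text domain)))
               domains)))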


In this article, we tested multiple different strategies to create a good prediction model using SVM to classify input texts into a music-related class. We tested unbalanced training corpuses, balanced training corpuses, different sets of features, etc. Some of these tests improved the prediction model; others made it worse. The key point to remember is that any machine learning effort requires bounding, labeling, testing and refining multiple parameters in order to obtain the best results. Use of the KBpedia Knowledge Graph and its linkage to external public datasets enables Cognonto to do these previously lengthy and time-consuming tasks quickly and efficiently.

Within a few hours, we created a classifier with an accuracy of about 97% that determines whether an input text belongs to the music domain or not. We demonstrated how we can create such classifiers more-or-less automatically using the KBpedia Knowledge Graph to define the scope of the domain and to classify new text into that domain based on relevant KBpedia reference concepts. Finally, we noted how we may create multi-class classifiers using exactly the same mechanisms.

Posted at 00:49

October 24

Libby Miller: A presence robot with Chromium, WebRTC, Raspberry Pi 3 and EasyRTC

Here’s how to make a presence robot with Chromium 51, WebRTC, Raspberry Pi 3 and EasyRTC. It’s actually very easy, especially now that Chromium 51 comes with Raspbian Jessie, although it’s taken me a long time to find the exact incantation.

If you’re going to use it for real, I’d suggest using the

Posted at 21:53

October 21

Leigh Dodds: “Open”

For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.

XYZ is “open” because:

  • It’s on the web
  • It’s free to use
  • It’s published under an open licence
  • It’s published under a custom licence, which limits some types of use (usually commercial, often everything except personal)
  • It’s published under an open licence, but we’ve not checked too deeply into whether we can do that
  • It’s free to use, so long as you do so within our app or application
  • There’s a restricted/limited access free version
  • There’s documentation on how it works
  • It was (or is) being made in public, with equal participation by anyone
  • It was (or is) being made in public, led by a consortium or group that has limitations on membership (even if just fees)
  • It was (or is) being made privately, but the results are then being made available publicly for you to use

I gather that at

Posted at 14:51

Leigh Dodds: Current gaps in the open data standards framework

In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)

To define the scope of those standards, let’s try and answer two questions.

Question 1: What are the various activities that we might want to carry out around an open dataset?

  • A. Discover the metadata and documentation about a dataset
  • B. Download or otherwise extract the contents of a dataset
  • C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
  • D. Monitor a dataset for updates
  • E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
  • F. Mirror a dataset to another location, e.g. exporting its metadata and contents
  • G. Link or reconcile some data against a dataset or register

Question 2: What are the various activities that we might want to carry out around an open data catalogue?

  • V. Find whether a dataset exists, e.g. via a search or similar interface
  • X. List the contents of the platform, e.g. its datasets or other published assets
  • Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
  • Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains

Now, based on that quick review: which of these areas of functionality are covered by existing standards?

Posted at 14:22

October 17

AKSW Group - University of Leipzig: AKSW Colloquium, 17.10.2016, Version Control for RDF Triple Stores + NEED4Tweet

In the upcoming Colloquium, October the 17th at 3 PM, two papers will be presented:

Version Control for RDF Triple Stores

Marvin Frommhold will discuss the paper “Version Control for RDF Triple Stores” by Steve Cassidy and James Ballantine which forms the foundation of his own work regarding versioning for RDF.

Abstract:  RDF, the core data format for the Semantic Web, is increasingly being deployed both from automated sources and via human authoring either directly or through tools that generate RDF output. As individuals build up large amounts of RDF data and as groups begin to collaborate on authoring knowledge stores in RDF, the need for some kind of version management becomes apparent. While there are many version control systems available for program source code and even for XML data, the use of version control for RDF data is not a widely explored area. This paper examines an existing version control system for program source code, Darcs, which is grounded in a semi-formal theory of patches, and proposes an adaptation to directly manage versions of an RDF triple store.

NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation

Afterwards, Diego Esteves will present the paper “NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation” by Mena B. Habib and Maurice van Keulen, which was accepted at ACL 2015.

Abstract: In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) for Tweets. The straightforward application of state-of-the-art extraction and disambiguation approaches on informal text widely used in Tweets, typically results in significantly degraded performance due to the lack of formal structure; the lack of sufficient context required; and the seldom entities involved. In this paper, we introduce a novel framework that copes with the introduced challenges. We rely on contextual and semantic features more than syntactic features which are less informative. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 07:55

October 14

AKSW Group - University of Leipzig: LIMES 1.0.0 Released

Dear all,

the LIMES Dev team is happy to announce LIMES 1.0.0.

LIMES, the Link Discovery Framework for Metric Spaces, is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. Our approaches facilitate different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large amount of those instance pairs that do not satisfy the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. The approaches implemented in LIMES include the original LIMES algorithm for edit distances, HR3, HYPPO and ORCHID.

Additionally, LIMES supports the first planning technique for link discovery, HELIOS, which minimizes the overall execution of a link specification without any loss of completeness. Moreover, LIMES implements supervised and unsupervised machine-learning algorithms for finding accurate link specifications. The algorithms implemented here include the supervised, active and unsupervised versions of EAGLE and WOMBAT.





User manual:

Developer manual:


What is new in LIMES 1.0.0:

  • New Controller that supports manual and graphical configuration
  • New machine learning pipeline: supports supervised, unsupervised and active learning algorithms
  • New dynamic planning for efficient link discovery
  • Updated execution engine to handle dynamic planning
  • Added support for qualitative (Precision, Recall, F-measure etc.) and quantitative (runtime duration etc.) evaluation metrics for mapping evaluation, in the presence of a gold standard
  • Added support for configuration files in XML and RDF formats
  • Added support for pointsets metrics such as Mean, Hausdorff and Surjection
  • Added support for MongeElkan, RatcliffObershelp string measures
  • Added support for Allen’s algebra temporal relations for event data
  • Added support for all topological relations derived from the DE-9IM model
  • Migrated the codebase to Java 8 and Jena 3.0.1

We would like to thank everyone who helped to create this release. We also acknowledge the support of the SAKE and HOBBIT projects.

Kind regards,

The LIMES Dev team


Posted at 09:38

October 11

AKSW Group - University of Leipzig: DL-Learner 1.3 (Supervised Structured Machine Learning Framework) Released

Dear all,

the Smart Data Analytics group at AKSW is happy to announce DL-Learner 1.3.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

GitHub page:

DL-Learner is used for data analysis tasks within other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern-based and evolutionary techniques for learning on structured data. For a practical example, see It also offers a plugin for Protégé, which can give suggestions for axioms to add.

In the current release, we added a large number of new algorithms and features. For instance, DL-Learner supports terminological decision tree learning, it integrates the LEAP and EDGE systems as well as the BUNDLE probabilistic OWL reasoner. We migrated the system to Java 8, Jena 3, OWL API 4.2 and Spring 4.3. We want to point to some related efforts here:

We want to thank everyone who helped to create this release, in particular Giuseppe Cota, who visited the core developer team and significantly improved DL-Learner. We also acknowledge support by the recently started SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the Big Data Europe and HOBBIT projects.

Kind regards,

Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin


Posted at 19:41

October 07

Frederick Giasson: Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper

There are many situations where we want to link named entities from two different datasets, or to find duplicate entities to remove in a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we want to use such a linkage system to help save time when creating gold standards for named entity recognition tasks.

There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.

Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
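A minimal sketch of this general pattern, with all names illustrative rather than the Cognonto Mapper API, might look like this:

;; Each comparator returns a similarity in [0,1] for a pair of entities; the
;; scores are combined as a weighted average and checked against a threshold.
(defn entities-match?
  [entity-a entity-b comparators weights threshold]
  (let [scores   (map (fn [compare-fn] (compare-fn entity-a entity-b)) comparators)
        combined (/ (reduce + (map * weights scores))
                    (reduce + weights))]
    (>= combined threshold)))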

The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on their type and the analysis of those types in the KBpedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on basic properties like names and alternative names, even if they are two completely different things. This happens because entities are often ambiguous when only considering these basic properties. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointedness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.

We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.

Usages Of The Cognonto Mapper

When should the Cognonto Mapper, or other deduplication and mapping services, be used? While there are many tasks that warrant the usage of such a system, let’s focus for now on some use cases related to Cognonto and machine learning in general.

Mapping Across Schema

One of Cognonto’s most important use cases is to use the Mapper to link new vocabularies, schemas or ontologies to the KBpedia Knowledge Ontology (KKO). This is exactly what we did for the 24 external ontologies and schemas that we have integrated into KBpedia. Creating such a mapping can be a long and painstaking process. The Mapper greatly helps link similar concepts together by narrowing the initial candidate pool of mappings, thereby increasing the efficiency of the analyst charged with selecting the final mappings between the two ontologies.

Creating ‘Gold Standards’

In my last article, I created a gold standard of 511 random web pages for which I determined the publisher of the web page by hand. That gold standard was used to measure the performance of a named entity recognition task. However, to create the actual gold standard, I had to check in each dataset (5 of them, with millions of entities) whether that publisher already existed in any of them. Performing such a task by hand means that I would have to send at least 2555 search queries to try to find a matching entity. Let’s say that I am fast, and that I can write a query, send it, look at the results, and copy/paste the URI of the right entity into the gold standard within 30 seconds; it still means that I would complete such a task in roughly 21 hours. It is also clearly impossible for a sane person to do that 8 hours per day for ~3 days, so this task would probably take at least 1 week to complete.
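The arithmetic behind those numbers is straightforward:

(* 511 5)                    ;; => 2555 search queries, one per page per dataset
(float (/ (* 2555 30) 3600)) ;; => 21.291666, roughly 21 hours at 30 seconds per query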

This is why automating this mapping process is really important, and this is what the Cognonto Mapper does. The only thing that is needed is to configure 5 mapper sessions. Each session tries to map the entities I identified by hand from the 511 web pages to one of the other datasets. Then I only need to run the mapper for each dataset, review the matches, find the missing ones by hand, and then merge the results into the final gold standard.

Curating Unknown Entities

In Cognonto, we have an unknown entities tagger that is used to detect possible publisher organizations that do not currently exist in the KBpedia knowledge base. In some cases, what we want to do is to save these detected unknown entities in an unknown entities dataset. This dataset is then used to review the detected entities and include them back into the KBpedia knowledge base (such that they become new entities in the knowledge base). In the review workflow, one of the steps should be to try to find similar entities to make sure that what was detected by the entities tagger was a totally new entity, and not a new surface form for an existing entity (which would become an alternative label for that entity and not an entirely new one). Such a checkup in the review workflow would be performed by the Cognonto Mapper.

How Does the SuperType Comparator Work?

As I mentioned in the introduction, the Cognonto Mapper is yet another linkage & deduplication framework. However, it has a special twist: its SuperType Comparator and the leveraging of the KBpedia Knowledge Ontology. Good, but how does it work? There is no better way to understand how it works than studying how two entities can be disambiguated based on their type. So, let’s do this.

Let’s consider this use case. We want to map two datasets together: Wikipedia and Musicbrainz. One of the Musicbrainz entities we want to map to Wikipedia is a music group called Attila, with Billy Joel and Jon Small. Attila also exists in Wikipedia, but it is highly ambiguous and may refer to multiple different things. If we set up our linkage task to only work on the preferred and possible alternative labels, then we would have a match between the name of that group and multiple other things in Wikipedia, with matching likelihoods that are probably nearly identical. However, how could we update the configuration to try to solve this issue? We have no choice: we will have to use the Cognonto Mapper SuperType Comparator.

Musicbrainz RDF dumps normally map a Musicbrainz group to a mo:MusicGroup. In the Wikipedia RDF dump the Attila rock band has a type dbo:Band. Both of these classes are linked to the KBpedia reference concept kbpedia:Band-MusicGroup. This means that the entities of both of these datasets are well connected into KBpedia.

Let’s say that the Cognonto Mapper detects that the Attila entity in the Musicbrainz dataset has 4 candidates in Wikipedia:

  1. Attila, the rock band
  2. Attila, the bird
  3. Attila, the film
  4. Attila, the album

If the comparison is only based on the preferred label, the likelihood will be the same for all these entities. However, what happens when we start using the SuperType Comparator and the KBpedia Knowledge Ontology?

First we have to understand the context of each type. Using KBpedia, we can determine that rock bands, birds, albums and films are disjoint according to their super types: kko:Organizations, kko:Animals, kko:AudioInfo and kko:VisualInfo.

Now that we understand each of the entities the system is trying to link together, and their context within the KBpedia Knowledge Ontology, let’s see how the Cognonto Mapper will score each of these entities based on their type to help disambiguate where labels are identical.

 "mo:MusicGroup -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Album"
 (.compare stc-ex-compare "" ""))
Classes Similarity
mo:MusicGroup -> dbo:Band 1.0
mo:MusicGroup -> dbo:Bird 0.2
mo:MusicGroup -> dbo:Film 0.2
mo:MusicGroup -> dbo:Album 0.2

In these cases, the SuperType Comparator assigned a similarity of 1.0 to the mo:MusicGroup and dbo:Band pair since those two classes are equivalent. All the other checks returned 0.20. When the comparator finds two entities that have disjoint SuperTypes, it assigns them a similarity value of 0.20. Why not 0.00 if they are disjoint? Well, there may be errors in the knowledge base, so by setting the comparator score to a very low level instead, the pair is still available for evaluation, even though its score is much reduced.

In this case the matching is unambiguous and the selection of the right linkage to perform is obvious. However, you will see below that it is not always (and often not) that simple to make such a clear selection.

Now let’s say that the next entity to match from the Musicbrainz dataset is another entity called Attila, but this time it refers to Attila, the album by Mina. Since the basis of the comparison changes (comparing the Musicbrainz Attila album instead of the band), the entire process will yield different results. The main difference is that the album will be compared to a film and an album from the Wikipedia dataset. As you can notice in the graph below, these two entities belong to the super types kko:AudioInfo and kko:VisualInfo, which are not disjoint.

 "mo:MusicalWork -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Album"
 (.compare stc-ex-compare "" ""))
Classes Similarity
mo:MusicalWork -> dbo:Band 0.2
mo:MusicalWork -> dbo:Bird 0.2
mo:MusicalWork -> dbo:Film 0.8762886597938144
mo:MusicalWork -> dbo:Album 0.9555555555555556

As you can see, the main difference is that we don’t have a perfect match between the entities. We thus need to compare their types, and two of the entities are ambiguous based on their SuperType (their super types are non-disjoint). In this case, what the SuperType Comparator does is to check the sets of super classes of both entities and compute a similarity measure between the two sets of classes. That is why we have 0.8762 for one and 0.9555 for the other.
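A minimal sketch of such a set-based similarity is a plain Jaccard overlap between the two sets of super classes; the actual SuperType Comparator measure is more elaborate, as the 0.876 and 0.955 scores above suggest:

(require '[clojure.set :as cset])

;; Illustrative only: Jaccard overlap between the sets of super classes of two entities.
(defn superclass-similarity
  [super-classes-a super-classes-b]
  (let [a (set super-classes-a)
        b (set super-classes-b)]
    (float (/ (count (cset/intersection a b))
              (count (cset/union a b))))))

For example, two entities sharing two of three super classes would score about 0.67.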

A musical work and an album are two nearly identical concepts. In fact, a musical work is the conceptual work behind an album (a record). A musical work is also strongly related to films, since films include musical works, etc. However, the relationship between a musical work and an album is stronger than that with a film, and this is what the similarity measure shows.

In this case, even if we have two ambiguous entities, an album and a film, for which we don’t have disjoint super types, we are still able to determine which one to choose to create the mapping based on the calculation of the similarity measure.


As we saw, there are multiple reasons why we would want to leverage the KBpedia Knowledge Ontology to help mapping and deduplication frameworks such as the Cognonto Mapper to disambiguate possible entity matches. KBpedia is not only good for mapping datasets together, it is also quite effective to help with some machine learning tasks such as creating gold standards or curating detected unknown entities. In the context of Cognonto, it is quite effective to map external ontologies, schemas or vocabularies to the KBpedia Knowledge Ontology. It is an essential tool for extending KBpedia to domain- and enterprise-specific needs.

In this article I focused on the SuperType Comparator that is leveraging the type structure of the KBpedia Knowledge Ontology. However, we can also use other structural features in KBpedia (such as an Aspects Comparator based on the aspects structure of KBpedia), singly or in combination, to achieve other mapping or disambiguation objectives.

Posted at 12:20

October 05

AKSW Group - University of Leipzig: OntoWiki 1.0.0 released

Dear Semantic Web and Linked Data Community,
we are proud to finally announce the releases of OntoWiki 1.0.0 and the underlying Erfurt Framework in version 1.8.0.
After 10 years of development we’ve decided to release the teenager OntoWiki from the cozy home of 0.x versions.
Since the last release of 0.9.11 in January 2014, we did a lot of testing to stabilize OntoWiki’s behavior and accordingly made a lot of bug fixes. We are also now using PHP Composer for dependency management, improved the testing work flow, gave a new structure and home to the documentation, and created a neat project landing page.

The development of OntoWiki is completely open source and we are happy for any contribution, especially to the code and the documentation, which is also kept in a Git repository with easy to edit Markdown pages. If you have questions about the usage of OntoWiki beyond the documentation, you can also use our mailing list or the stackoverflow tag “ontowiki”.

Please see for further information.

We also had a Poster for advertising the OntoWiki release at SEMANTiCS Conference:

OntoWiki 1.0

Philipp Frischmuth, Natanael Arndt, Michael Martin: OntoWiki 1.0: 10 Years of Development – What’s New in OntoWiki

We are happy for your feedback, in the name of the OntoWiki team,
Philipp, Michael and Natanael

Our Fingers on the Mouse

Posted at 14:50

October 04

Frederick Giasson: Improving Machine Learning Tasks By Integrating Private Datasets

In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organizations still rely on private datasets of their own curated data.

In this article, I want to demonstrate how high-value private datasets may be integrated into Cognonto’s KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo that is supported by a “gold standard” of 511 web pages taken at random, for which we have tagged the organization that published each web page. This demo is related to the publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the publisher analyzer; more precisely, we will analyze the performance of the analyzer depending on the datasets it has access to when performing its predictions.

Cognonto Publisher’s Analyzer

The Cognonto publisher’s analyzer is a portion of the overall Cognonto demo that tries to determine the publisher of a web page by analyzing the web page’s content. There are multiple moving parts to this analyzer, but its general internal workflow works as follows:

  1. It crawls a given webpage URL
  2. It extracts the page’s content and extracts its meta-data
  3. It tags all of the organizations (anything that is considered an organization in KBpedia) across the extracted content using the organization entities that exist in the knowledge base
  4. It tries to detect unknown entities that will eventually be added to the knowledge base after curation
  5. It performs an in-depth analysis of the organization entities (known or unknown) that got tagged in the content of the web page, and determines which of these is the most likely to be the publisher of the web page.

Such a machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the Cognonto analyzer is its knowledge base. We leverage Cognonto to detect known organization entities. We use the knowledge in the KB for each of these entities to improve the analysis process. We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of datasets that compose the Cognonto knowledge base, and the KBpedia conceptual reference structure composed of roughly ~39,000 reference concepts.

Given the central role of the knowledge base in such an analysis process, we want to have a better idea of the impact of the datasets in the performance of such a system.

For this demo, I use three public datasets already in KBpedia that are used by the Cognonto demo: Wikipedia (via DBpedia), Freebase and USPTO. Then I add two private datasets of high-quality, highly curated and domain-related information to augment the listing of potential organizations. What I will do is run the Cognonto publisher analyzer on each of these 511 web pages, check which ones were properly identified against the gold standard, and finally calculate different performance metrics to see the impact of including or excluding a certain dataset.

Gold Standard

The gold standard is composed of 511 randomly selected web pages that got crawled and cached. When we run the tests below, the cached version of the HTML pages is used to make sure that we get the same HTML for each page for each test. When the pages are crawled, we execute any possible JavaScript code that the pages may contain before caching the HTML code of the page. That way, if some information in the page was injected by some JavaScript code, then that additional information will be cached as well.

The gold standard is really simple. For each of the URLs in the standard, we determine the publishing organization manually. Once the organization is determined, we search each dataset to see if the entity already exists. If it does, we add the URI (unique identifier) of the entity in the knowledge base to the gold standard. It is this URI reference that is used to determine whether the publisher analyzer properly detects the actual publisher of the web page.
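
To give an idea of what such a record could look like, here is a small, purely hypothetical sketch of two gold standard entries. The URLs and URIs are made up for illustration only; a nil publisher URI marks a page for which no publisher can be determined.

(def gold-standard-sample
  [{:url           "http://example.com/some-article.html"
    :publisher-uri "http://knowledge-base.example/entity/AcmeCorp"}
   {:url           "http://example.org/personal-page.html"
    :publisher-uri nil}]) ;; a True Negative instance: no publisher exists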

We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.

The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.


The goal of this analysis is to determine how well the analyzer performs the task (detecting the organization that published a web page on the Web). To do so, we use a set of metrics that help us understand the performance of the system. The metrics calculation is based on the confusion matrix.

The True Positive, False Positive, True Negative and False Negative counts (see Type I and type II errors for definitions) should be interpreted as follows in the context of a named entity recognition task (a small code sketch mapping these four outcomes appears after the list):

  1. True Positive (TP): test identifies the same entity as in the gold standard
  2. False Positive (FP): test identifies a different entity than what is in the gold standard
  3. True Negative (TN): test identifies no entity; gold standard has no entity
  4. False Negative (FN): test identifies no entity, but gold standard has one
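
Here is the small sketch mentioned above: a function that maps a prediction and a gold standard entry (each being either an entity URI or nil) to its confusion matrix outcome. Counting a predicted entity as a false positive when the gold standard has none is an assumption on my part.

(defn confusion-outcome
  [predicted-uri gold-uri]
  (cond
    ;; the same entity was identified as in the gold standard
    (and predicted-uri (= predicted-uri gold-uri)) :true-positive
    ;; an entity was identified, but it is not the gold standard one
    predicted-uri                                  :false-positive
    ;; no entity identified and the gold standard has no publisher
    (nil? gold-uri)                                :true-negative
    ;; no entity identified but the gold standard has a publisher
    :else                                          :false-negative))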

Then we have a series of metrics that can be used to measure the performance of the system:

  1. Precision: the proportion of properly predicted publishers amongst all of the predictions that have been made, good and bad (TP / (TP + FP))
  2. Recall: the proportion of publishers in the gold standard that were properly predicted (TP / (TP + FN))
  3. Accuracy: the proportion of correctly classified test instances, i.e. the publishers that were properly identified by the system plus the web pages correctly identified as having no identifiable publisher ((TP + TN) / (TP + TN + FP + FN))
  4. f1: the test’s equally weighted combination of precision and recall
  5. f2: the test’s weighted combination of precision and recall, with a preference for recall
  6. f0.5: the test’s weighted combination of precision and recall, with a preference for precision.

The F-score measures the accuracy of the overall prediction system. It is a measure that combines precision and recall: the f1 score is the harmonic mean of the two. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall.

In general, I think that the metric that best reflects the overall performance of this named entity recognition system is accuracy. I emphasize those test results below.
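
To make the definitions above concrete, here is a minimal Clojure sketch that computes all six metrics from the four confusion matrix counts (fneg is the false negative count). It is only a reading aid, not part of the Cognonto code.

(defn f-measure
  "General F-measure; beta = 1, 2 and 0.5 give f1, f2 and f0.5."
  [beta precision recall]
  (let [b2 (* beta beta)]
    (/ (* (+ 1 b2) precision recall)
       (+ (* b2 precision) recall))))

(defn metrics
  [tp fp tn fneg]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fneg))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double (/ (+ tp tn) (+ tp tn fp fneg)))
     :f1        (double (f-measure 1 precision recall))
     :f2        (double (f-measure 2 precision recall))
     :f0.5      (double (f-measure 0.5 precision recall))}))

For example, plugging in the counts of the baseline run below, (metrics 2 5 19 485) gives a precision of ~0.286, a recall of ~0.0041 and an accuracy of ~0.041, matching the reported figures.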

Running The Tests

The goal with these tests is to run the gold standard calculation procedure with different datasets that exist in the Cognonto knowledge base to see the impact of including/excluding these datasets on the gold standard metrics.

Baseline: No Dataset

The first step is to create the starting basis that includes no dataset. Then we will add different datasets, and try different combinations, when computing against the gold standard such that we know the impact of each on the metrics.

(table (generate-stats :js :execute :datasets []))
True positives:  2
False positives:  5
True negatives:  19
False negatives:  485

| key          | value        |
| :precision   | 0.2857143    |
| :recall      | 0.0041067763 |
| :accuracy    | 0.04109589   |
| :f1          | 0.008097166  |
| :f2          | 0.0051150895 |
| :f0.5        | 0.019417476  |

One Dataset Only

Now, let’s see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task.

Wikipedia (via DBpedia) Only

Let’s test the impact of adding a single general purpose dataset, the publicly available: Wikipedia (via DBpedia):

(table (generate-stats :js :execute :datasets [""]))
True positives:  121
False positives:  57
True negatives:  19
False negatives:  314

| key          | value      |
| :precision   | 0.6797753  |
| :recall      | 0.27816093 |
| :accuracy    | 0.2739726  |
| :f1          | 0.39477977 |
| :f2          | 0.31543276 |
| :f0.5        | 0.52746296 |

Freebase Only

Now, let’s test the impact of adding another single general purpose dataset, this one the publicly available: Freebase:

(table (generate-stats :js :execute :datasets [""]))
True positives:  11
False positives:  14
True negatives:  19
False negatives:  467

| key          | value       |
| :precision   | 0.44        |
| :recall      | 0.023012552 |
| :accuracy    | 0.058708414 |
| :f1          | 0.043737575 |
| :f2          | 0.028394425 |
| :f0.5        | 0.09515571  |

USPTO Only

Now, let’s test the impact of adding a different, publicly available, specialized dataset: USPTO:

(table (generate-stats :js :execute :datasets [""]))
True positives:  6
False positives:  13
True negatives:  19
False negatives:  473

| key          | value       |
| :precision   | 0.31578946  |
| :recall      | 0.012526096 |
| :accuracy    | 0.04892368  |
| :f1          | 0.024096385 |
| :f2          | 0.015503876 |
| :f0.5        | 0.054054055 |

Private Dataset #1

Now, let’s test the first private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  231
False positives:  109
True negatives:  19
False negatives:  152

| key          | value      |
| :precision   | 0.67941177 |
| :recall      | 0.60313314 |
| :accuracy    | 0.4892368  |
| :f1          | 0.6390042  |
| :f2          | 0.61698717 |
| :f0.5        | 0.6626506  |

Private Dataset #2

And, then, the second private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  24
False positives:  21
True negatives:  19
False negatives:  447

| key          | value       |
| :precision   | 0.53333336  |
| :recall      | 0.050955415 |
| :accuracy    | 0.08414873  |
| :f1          | 0.093023255 |
| :f2          | 0.0622084   |
| :f0.5        | 0.1843318   |

Combined Datasets – Public Only

A more realistic analysis is to use a combination of datasets. Let’s see what happens to the performance metrics if we start combining public datasets.

Wikipedia + Freebase

First, let’s start by combining Wikipedia and Freebase.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  126
False positives:  60
True negatives:  19
False negatives:  306

| key          | value      |
| :precision   | 0.67741936 |
| :recall      | 0.29166666 |
| :accuracy    | 0.28375733 |
| :f1          | 0.407767   |
| :f2          | 0.3291536  |
| :f0.5        | 0.53571427 |

Adding the Freebase dataset to the DBpedia one had the following effects on the different metrics:

| metric     | Impact in % |
| precision  | -0.03%      |
| recall     | +4.85%      |
| accuracy   | +3.57%      |
| f1         | +3.29%      |
| f2         | +4.34%      |
| f0.5       | +1.57%      |

As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset.
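
As a side note, the “Impact in %” figures in these comparison tables can be reproduced as the relative change of each metric between two runs, for example with a small helper like this (my own reading of the numbers, not Cognonto code):

(defn impact-pct
  "Relative change, in percent, of a metric between two runs."
  [before after]
  (* 100.0 (/ (- after before) before)))

;; accuracy impact of adding Freebase to Wikipedia, reported above as +3.57%
(impact-pct 0.2739726 0.28375733)
;; => ~3.57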

Wikipedia + USPTO

Let’s switch Freebase for the other specialized public dataset, USPTO.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  122
False positives:  59
True negatives:  19
False negatives:  311

| key          | value      |
| :precision   | 0.67403316 |
| :recall      | 0.2817552  |
| :accuracy    | 0.27592954 |
| :f1          | 0.39739415 |
| :f2          | 0.31887087 |
| :f0.5        | 0.52722555 |

Adding the USPTO dataset to the DBpedia instead of Freebase had the following effects on the different metrics:

| metric     | Impact in % |
| precision  | -0.83%      |
| recall     | +1.29%      |
| accuracy   | +0.73%      |
| f1         | +0.65%      |
| f2         | +1.07%      |
| f0.5       | +0.03%      |

As we might have expected, the gains are smaller than with Freebase, probably in part because USPTO is smaller and more specialized. And because it is more specialized (enterprises that have patents registered in the US), the gold standard may not represent well the organizations belonging to this dataset. But in any case, these are still gains.

Wikipedia + Freebase + USPTO

Let’s continue and now include all three datasets.

(table (generate-stats :js :execute :datasets ["" "" ""]))
True positives:  127
False positives:  62
True negatives:  19
False negatives:  303

| key          | value      |
| :precision   | 0.6719577  |
| :recall      | 0.29534882 |
| :accuracy    | 0.2857143  |
| :f1          | 0.41033927 |
| :f2          | 0.3326349  |
| :f0.5        | 0.53541315 |

Now let’s see the impact of adding both Freebase and USPTO to the Wikipedia dataset:

| metric     | Impact in % |
| precision  | +1.14%      |
| recall     | +6.18%      |
| accuracy   | +4.30%      |
| f1         | +3.95%      |
| f2         | +5.45%      |
| f0.5       | +1.51%      |

Now let’s see the impact of using highly curated, domain related, private datasets.

Combined Datasets – Public Enhanced with Private Datasets

The next step is to add the private datasets of highly curated data that are specific to the domain of identifying web page publisher organizations. As the baseline, we will use the three public datasets: Wikipedia, Freebase and USPTO and then we will add the private datasets.

Wikipedia + Freebase + USPTO + PD #1

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  279
False positives:  102
True negatives:  19
False negatives:  111

| key          | value      |
| :precision   | 0.7322835  |
| :recall      | 0.7153846  |
| :accuracy    | 0.58317024 |
| :f1          | 0.7237354  |
| :f2          | 0.7187017  |
| :f0.5        | 0.7288401  |

Now, let’s see the impact of adding the private dataset #1 along with Wikipedia, Freebase and USPTO:

| metric     | Impact in % |
| precision  | +8.97%      |
| recall     | +142.22%    |
| accuracy   | +104.09%    |
| f1         | +76.38%     |
| f2         | +116.08%    |
| f0.5       | +36.12%     |

Adding the highly curated and domain-specific private dataset #1 had a dramatic impact on all the metrics of the combined public datasets. Now let’s look at it the other way around: what is the impact of adding the public datasets to private dataset #1, compared to its metrics when used alone:

| metric     | Impact in % |
| precision  | +7.77%      |
| recall     | +18.60%     |
| accuracy   | +19.19%     |
| f1         | +13.25%     |
| f2         | +16.50%     |
| f0.5       | +9.99%      |

As we can see, the public datasets do significantly increase the performance of the highly curated and domain-specific private dataset #1.

Wikipedia + Freebase + USPTO + PD #2

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  138
False positives:  69
True negatives:  19
False negatives:  285

| key          | value      |
| :precision   | 0.6666667  |
| :recall      | 0.32624114 |
| :accuracy    | 0.3072407  |
| :f1          | 0.43809524 |
| :f2          | 0.36334914 |
| :f0.5        | 0.55155873 |

Not all of the private datasets have an equivalent impact. Let’s see the effect of adding private dataset #2 instead of #1:

| metric     | Impact in % |
| precision  | -0.78%      |
| recall     | +10.46%     |
| accuracy   | +7.52%      |
| f1         | +6.75%      |
| f2         | +9.23%      |
| f0.5       | +3.00%      |

Wikipedia + Freebase + USPTO + PD #1 + PD #2

Now let’s see what happens when we use all the public and private datasets.

(table (generate-stats :js :execute :datasets ["" "" "" "" ""]))
True positives:  285
False positives:  102
True negatives:  19
False negatives:  105

| key          | value      |
| :precision   | 0.7364341  |
| :recall      | 0.7307692  |
| :accuracy    | 0.59491193 |
| :f1          | 0.7335907  |
| :f2          | 0.7318952  |
| :f0.5        | 0.7352941  |

Let’s see the impact of adding the private datasets #1 and #2 to the public datasets:

| metric     | Impact in % |
| precision  | +9.60%      |
| recall     | +147.44%    |
| accuracy   | +108.22%    |
| f1         | +78.77%     |
| f2         | +120.02%    |
| f0.5       | +37.31%     |

Adding Unknown Entities Tagger

There is one last feature with the Cognonto publisher analyzer: it is possible for it to identify unknown entities from the web page. (An “unknown entity” is identified as a likely organization entity, but which does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page.

(table (generate-stats :js :execute :datasets :all))
True positives:  345
False positives:  104
True negatives:  19
False negatives:  43

| key          | value      |
| :precision   | 0.76837415 |
| :recall      | 0.88917524 |
| :accuracy    | 0.7123288  |
| :f1          | 0.82437277 |
| :f2          | 0.86206895 |
| :f0.5        | 0.78983516 |

As we can see, the overall accuracy improved by another 19.73% when considering the unknown entities, compared to using the public and private datasets alone:

| metric     | Impact in % |
| precision  | +4.33%      |
| recall     | +21.67%     |
| accuracy   | +19.73%     |
| f1         | +12.37%     |
| f2         | +17.79%     |
| f0.5       | +7.42%      |


When we first tested the system with single datasets, some of them scored better than others on most of the metrics. Does that mean we could use only those and be done with it? No. What this analysis tells us is that some datasets score better for this particular set of web pages: they cover more of the entities found in those pages. However, even if a dataset scores lower, it is not useless. A worse-scoring dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones.

Even if the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public datasets, which significantly improve its accuracy. We got a gain of 19.19% in accuracy by adding the public datasets to the better scoring private dataset #1. Nearly 20% of improvement in such a predictive system is highly significant.

Another thing that this series of tests tends to demonstrate is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets doesn’t appear to lower the overall performance of the system (even if I am sure that some could); generally the more the better, though more doesn’t necessarily produce significant accuracy increases.

Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the detection of unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improved the overall accuracy by another 19.73%. How is that possible? To understand this we have to consider the domain: random web pages that exist on the Web. A web page can be published by anybody and any organization, which means that the long tail of web page publishers is probably very long. Considering this fact, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why having a system that can detect and predict unknown entities as the publishers of web pages has such a significant impact on the overall accuracy. The flagging of such “unknown” entities also tells us where to focus efforts to add to the known database of existing publishers.


As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others but, overall, each dataset contributes to the overall improvement of the predictions.

Posted at 15:00


October 02

W3C Read Write Web Community Group: Read Write Web — Q3 Summary — 2016


The community group celebrates its 5th birthday this quarter.  With almost 3000 posts (roughly 2 per day) from around 100 members, a large number of topics have been raised, discussed and resolved.  A big thank you to everyone that has been involved!

On the subject of statistics, there was a great paper produced by AKSW: LODStats: The Data Web Census Dataset which provides a comprehensive picture of the current state of a significant part of the Data Web.  There was also a status update from the LDP Next Community Group and Data on the Web Best Practices is now a Candidate Recommendation.

TPAC 2016 got under way in Lisbon.  While there was not a dedicated RWW session this year, many members of the group attended various related topics.  There was some interest reported around the work on Verified Claims, which hopes to form a working group quite soon.

Communications and Outreach

Apart from TPAC, I was able to attend the 3rd annual Hackers Congress at Paralelni Polis, which aims to spread ideas of decentralization in technology.  I was able to interact with some thought leaders in the crypto currency space and try to explain the decentralized nature of the web and how it can grow organically using standards to read and write.  I also got a chance to talk to people from the remote storage project.


Community Group

Having reached the 5 year milestone, it is perhaps a good time to reflect on the direction of the community group.  Do we want to keep going as we are, focus on specific topics, be more discussion oriented or more standards creation oriented?  I’ll send out a questionnaire on this.

A thread on ways to (re) decentralize the web generated some discussion.  There was also some discussion around the Internet of Things and a possible new framework for using Linked Data to read and write.



More work has been done on modularizing the solid linked data browser / editor into separate chunks (solid-ui, solid-app-set) that can be used to create apps on data, using a javascript shim.  An analogy I like to think of is RSS being a structured data format that, with some code, can become a useful application.  Solid app set allows this to happen for any class of data.  I am really enjoying this paradigm and have started to translate the apps I write.  Here is an example of a playlist pane translation of a client-side app.  Tim has written a lot more of these in the same repo.

Node solid server has progressed, with the permissions system being broken down into its own module, solid permissions.  Also improvements have been made to the profile ui and the dashboard, which are still works in progress.

Much progress has been made on the document editor, which is now also driving the Linked Data Notifications spec towards Candidate Recommendation, and integrating with the Web Annotations specs.  A slightly older screencast of the functionality is available here, but I am told new ones will be published very soon.


Last but not Least…

We welcome the launch of Cognonto.

Cognonto (a portmanteau of ‘cognition’ and ‘ontology’) exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact and entity extraction and tagging.

Check a sample term, or read more from this comprehensive blog post.

Posted at 16:04

September 28

Frederick Giasson: Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words within a text corpus, producing a model of all of those relationships. The text corpus that a word2vec process uses to learn these relationships is called the training corpus.

In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate, domain-specific training corpuses that can be used by word2vec to generate word relationship models. Note that what is being discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec. The only thing it does is learn the relationships between the words that exist in the text. However, not all training corpuses are equal. Training corpuses are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not, their errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text to perform different operations on the text, such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus. It does not do anything other than “reading” the text and analyzing the relationships between the words (which are really just groups of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle, which means that if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus where its text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful than a general one, if the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if we want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

It is the kind of problem that Cognonto helps to resolve. Cognonto and KBpedia, its knowledge base, comprise a set of ~39,000 reference concepts that have ~138,000 links to the schema of external data sources such as Wikipedia, Wikidata and USPTO. It is that structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select all of the concepts that should be part of the domain that is being defined. Then we use Cognonto’s inference capabilities to infer all the other hundreds or thousands of concepts that define the full scope of the domain. Then we analyze the hundreds or thousands of concepts we selected that way to get all of the links to external data sources. Finally, we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain are selected. The workflow looks like this:

The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model created by Google, which was trained on about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results/performance between a domain-specific and a general model.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained to create a training corpus that is not too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset ""
        kbpedia-dataset ""
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    ;; page through all the entities belonging to the target KBpedia class, 1000 at a time
    (loop [nb 0
           nb-processed 1]
      (when (< nb nb-entities)
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset nb)]
          ;; append the textual content of each entity as one line of the corpus file
          (spit corpus-file (str (get-entity-content entity) "\n") :append true)
          (println (str nb-processed "/" nb-entities)))
        (recur (+ nb step)
               (inc nb-processed))))))

(create-domain-specific-training-set "" "resources/musicians-corpus.txt")

What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is a natural step related to any NLP pipeline. Before learning from the training corpus, we should clean and normalize the text of its raw form.

;; assumes [clojure.string :as string] is required and that stop-list
;; (a regex pattern string of stop words) is defined elsewhere
(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")))

(defn pre-process-line
  [line]
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and change them to their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)
                     (string/replace line entity (normalize-proper-name entity)))))))
      ;; 3. remove the stop words defined in the stop-list pattern
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 10. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 11. normalize dots with spaces
      (string/replace #"\s\." ".")
      ;; 12. normalize dots
      (string/replace #"\.{1,}" ".")
      ;; 13. normalize underscores
      (string/replace #"\_{1,}" "_")
      ;; 14. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 15. re-normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 16. put everything in lowercase and end the line with a newline
      (string/lower-case)
      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues to the tokenizer used by the word2vec implementation. We also remove unnecessary words and other words that appear too often or that add nothing to the model we want to generate (like the listing of days and months). We also drop all numbers.

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library that is a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)
                (windowSize 5)
                (layerSize 100)
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                (build))]
    (.fit vec)
    ;; persist the trained model so it can be reloaded later
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be defined to train word2vec on a corpus. In fact, that algorithm can be sensitive to parametrization.

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which is trained on billions of words but is highly general. DL4J can import that model without our having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel
   (io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous: they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. All of this depends on the context. In a general context, this situation happens more often than we think and impacts the similarity score of these ambiguous terms. However, as we will see, this phenomenon is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between 3 different music instruments: piano, organ and violin. We want to check the relationship between each of them.

(.similarity musicians-model "piano" "violin")
(.similarity musicians-model "piano" "organ")

As we can see, both tuples have a high likelihood of co-occurrence. This suggests that the terms of each tuple are probably highly related. In this case, it is probably because violins are often played along with a piano, and because an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
(.similarity google-news-model "piano" "organ")

The surprising fact here is the apparent dissimilarity between piano and organ compared with the results we got from the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, organ is an ambiguous word in a general context. An organ can be a musical instrument, but it can also be a part of an anatomy. This means that the word organ will co-occur not only alongside piano, but also alongside all kinds of other words related to human and animal biology. This is why the two words are less similar in the general model than in the domain one: organ is ambiguous in a general context.

Similarity Between Album and Track

Now let’s see another similarity example between two other words album and track where track is an ambiguous word depending on the context.

(.similarity musicians-model "album" "track")
(.similarity google-news-model "album" "track")

As expected, because track is ambiguous, there is a big difference in terms of co-occurrence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, do the two models always differ? Let’s take a look at two words that are domain-specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
(.similarity google-news-model "pianist" "violinist")

In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.
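
To run these pairwise comparisons side by side, a small helper like the following can be handy. It is just a convenience wrapper around the same .similarity call shown above, not part of the original code.

(defn compare-similarity
  "Return the similarity score of a word pair in both models."
  [word-a word-b]
  {:musicians   (.similarity musicians-model word-a word-b)
   :google-news (.similarity google-news-model word-a word-b)})

(compare-similarity "piano" "organ")
(compare-similarity "album" "track")
(compare-similarity "pianist" "violinist")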

Nearest Words

Now let’s compare how the two models behave for individual words. Let’s take a look at a few words and see which other words occur most often with them.


(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.


(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.


Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians’ model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally, we will play with manipulating the co-occurrence vectors themselves. A really popular word2vec equation is king - man + woman = queen. What is happening under the hood with this equation is that we are adding and subtracting the co-occurrence vectors for each of these words, and then checking the nearest word to the resulting co-occurrence vector.
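
With DL4J, this kind of equation is expressed with the same wordsNearest call used in the examples below: the first collection holds the words to add, the second the words to subtract. Whether the general model actually answers “queen” here depends on the model and its pre-processing; the call is only shown to illustrate the mechanics.

;; king - man + woman = ?
(.wordsNearest google-news-model ["king" "woman"] ["man"] 1)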

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kinds of operations are quite interesting. If we add the two co-occurrence vectors for pianist and renowned, then we get that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, then the two models score quite well. The difference between the two examples comes from the way the general training corpus has been created (pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results from the musicians’ model are all highly similar genres of music, like thrash metal, deathcore, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal – Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We subtracted the death co-occurrence vector from the metal one, and then we added the smooth vector. What we end up with is a bunch of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end up with “smooth metal”. The removal of the death vector has no effect on the results, probably because these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One possibility is to link your own private datasets to KBpedia. That way, these private datasets would become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text, like local text documents, or semi-structured text, like a set of HTML web pages, and tag each document with KBpedia reference concepts using the Cognonto topics analyzer. Then we could use the KBpedia structure in exactly the same way to choose which of these documents we want to include in the domain-specific training corpus.


As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, and on how much more meaningful the results can be within the scope of that domain. Another advantage of domain-specific training corpuses is that they produce much smaller models. This is quite an interesting characteristic, since smaller models are faster to generate, faster to download/upload, faster to query, consume less memory, etc.

Of the concepts in KBpedia, roughly 33,000 of them correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.

Posted at 19:27

September 25

Bob DuCharme: Semantic web semantics vs. vector embedding machine learning semantics

It's all semantics.

Posted at 16:01

September 24

John Goodwin: Using Recurrent Neural Networks to Hallucinate New Model Army Lyrics

I decided to follow the example of

Posted at 15:36

September 23

Dublin Core Metadata Initiative: Sutton stepping down as DCMI Managing Director

2016-09-23, Stuart Sutton has announced his intention to step down as DCMI Managing Director effective 30 June 2017. Over the coming months, the DCMI Executive Committee and the Governing Board will be engaged in succession planning and the process of replacing the Managing Director. For additional information concerning the succession and appointment process, contact DCMI Chair-Elect Paul Walk at p[dot]walk[at]ed[dot]ac[dot]uk. Future announcements concerning the succession process will be posted here from time to time.

Posted at 23:59

Dublin Core Metadata Initiative: Synaptica becomes DC-2016 Gold Sponsor

2016-09-23, DCMI is pleased to announce that Synaptica is supporting DC-2016 in Copenhagen as a Gold Sponsor. Since 1995, Synaptica has been developing innovative software tools for organizing, indexing and classifying information, and for discovering knowledge. All of Synaptica's award-winning software products are built on a foundation of open standards and a commitment to client-led solutions and uncompromising customer service. Synaptica's Linked Canvas will be featured in a demonstration during the Conference Opening Reception. Linked Canvas is an easy-to-use tool designed for the cultural heritage community as well as schools and colleges to build interactive educational resources. For more information about Synaptica and Linked Canvas, visit the Synaptica website; for more information about the conference and to register, visit the DC-2016 website.

Posted at 23:59

Dublin Core Metadata Initiative: Danish Bibliographic Centre (DBC) becomes DC-2016 Sponsor

2016-09-23, The Danish Bibliographic Centre (DBC) joins in supporting DC-2016 in Copenhagen as Sponsor of the Conference Delegate Bags. The DBC's main task in Denmark is the development and maintenance of the bibliographic and IT infrastructure of Danish libraries. The DBC handles registration of books, music, AV materials, Internet documents, articles and reviews in newspapers and magazines in the National Bibliography, develops Danbib, the Danish union catalogue, and the infrastructure for interlibrary loan. Danbib is comprised of the National Bibliography and the holdings of the libraries. DBC also develops — the citizen's access to all Danish publications and the holdings of the Danish libraries. DBC's IT development is based on open source and service oriented architecture. DBC is a public limited company owned by Local Government Denmark and the Danish State.

Posted at 23:59

Dublin Core Metadata Initiative: Ana Alice Baptista named Chair-Elect of the DCMI Governing Board

2016-09-23, The DCMI Governing Board is pleased to announce that Ana Alice Baptista has been appointed to the DCMI Governing Board as an Independent Member. She also assumes the role of Chair-Elect of the Governing Board at the closing ceremony of DC-2016 in Copenhagen. She will succeed Paul Walk as Chair of the Board in 2017. Ana is a professor at the Information Systems Department and a researcher at ALGORITMI Center, both at University of Minho, Portugal. She graduated in computer engineering and holds a PhD in Information Systems and Technologies. She is also a member of the Elpub conference series Executive Committee, participated in several R&D projects, and was an evaluator of project proposals under FP7. For more information about the Governing Board, visit the DCMI website.

Posted at 23:59

Dublin Core Metadata Initiative: Join us! DC-2016 in Copenhagen on 13-16 October

2016-09-23, DC-2016 in Copenhagen, Denmark on 13-16 October is rapidly approaching. The program promises a rich array of papers, project reports, presentations, demonstrations, posters, special panels, workshops and an exciting keynote by Elsevier's Bradley Allen. You will not want to miss this one! Register now at

Posted at 23:59

Frederick Giasson: Web Page Analysis With Cognonto

Extract Structured Content, Tag Concepts & Entities


Cognonto is brand new. At its core, it uses a structure of nearly 40,000 concepts. It has about 138,000 links to external classes and concepts that define huge public datasets such as Wikipedia, DBpedia and USPTO. Cognonto is not a children’s toy. It is huge and complex… but it is very usable. Before digging into the structure itself, and before starting to write about all the use cases that Cognonto can support, I will first cover the tools that currently exist to help you understand Cognonto and its conceptual structure and linkages (called KBpedia).

The embodiment of Cognonto that people can see is the set of tools we created and made available on the web site. Their goal is to show the structure at work: what ties to where, how the conceptual structure and its links to external schemas and datasets help discover new facts, how it can drive other services, etc.

This initial blog post will discuss the demo section of the web site. What we call the Cognonto demo is a web page crawler that analyzes web pages to tag concepts, to tag named entities, to extract structured data, to detect language, to identify topics, and so forth. The demo uses the KBpedia structure and its linkages to Wikipedia, Wikidata, Freebase and USPTO to tag content that appears in the analyzed web pages. But there is one thing to keep in mind: the purpose of Cognonto is to link public or private datasets to the structure to expand its knowledge and make these tools (like the demo) even more powerful. This means that a private organization could use Cognonto, add their own datasets and link their own schemas, to improve their own version of Cognonto or to tailor it for their own purpose.

Let’s see what the demo looks like, what information it extracts and analyzes from any web page, and how it ties into the KBpedia structure.

Analyzing a web page

The essence of the Cognonto demo is to analyze a web page. The steps performed by the demo are:

  1. Crawling the web page’s content
  2. Extracting text content, defluffing and normalizing it
  3. Detecting the language used to write the content
  4. Extracting the copyright notice
  5. Extracting metadata from the web page (HTML, microformats, RDFa, etc.)
  6. Querying KBpedia to detect named entities
  7. Querying KBpedia to detect concepts
  8. Querying KBpedia to find information about the web page
  9. Analyzing extracted entities and concepts to determine the publisher
  10. Analyzing extracted concepts to determine the most likely topics
  11. Generating the analysis result set

To test the demo and see how it works, let’s analyze a piece of news recently published by CNN: Syria convoy attack: US blames Russia. You can start the analysis process by following this link. The first page will be:

What the demo shows is the header of the analyzed web page. The header is composed of the title of the web page and possibly a short description and an image. All of this information comes from the extracted metadata content of the page. Then you are presented with 5 tabs:

  1. Concepts: shows the body content and extracted metadata of the web page, tagged with all detected KBpedia concepts
  2. Entities: shows the body content and extracted metadata of the web page, tagged with all detected KBpedia named entities that exist in the knowledge base
  3. Analysis: shows all the different kinds of analysis performed by the demo
  4. Graphs: shows how the topics found during the topic analysis step tie into the KBpedia conceptual structure
  5. Export: shows what the final resultset looks like

Concepts tab

The Concepts tab is the first one presented to you. All of the KBpedia concepts (among its ~40,000 reference concepts) that are detected in the body content of the web page and its extracted metadata are tagged. There is one important thing to keep in mind here: the demo detects what it considers to be the body content of the web page. It will defluff it, which means that it will remove the header, footer, sidebars and all other irrelevant content that can appear in the page surrounding the body content. The model used by the demo works better on article-like web pages, so some web pages may not end up with much extracted body content for that very reason.

All of the concepts that appear in red are the ones that the demo considers to be the core concepts of the web page. The ones in blue are all of the others. If you mouse over any of these tagged terms, you will be presented with a contextual menu that shows you one or multiple concepts that may refer to that surface form (the word in the text). For example, if you mouse over administration, you will be presented with two possible concepts for that word:

However, if you do the same for airstrikes, then you will be presented with a single unambiguous concept:

If you click on any of those links, then you will be redirected to a KBpedia reference concept view page. You will see exactly how that concept ties into the broader KBpedia conceptual structure. You will see all of its related (direct and inferred) concepts, and how it links to external schemas, vocabularies and ontologies. It will also show you lists of related entities, etc.

What all of this shows you is how these tagged concepts are in fact windows to a much broader universe that can be understood because all of its information is fully structured and can be reasoned upon. This is the crux of the demo. It shows that the content of a web page is not just about its content, but its entire context as well.

Entities tab

The entities tab presents information in exactly the same manner as the Concepts tab. However the content that is tagged is different. Instead of tagging concepts, we tag named entities. These entities (in the case of this demo) come from the entities datasets that we linked to KBpedia, namely: Wikipedia, Wikidata, Freebase and USPTO. These are a different kind of window than the concepts. These are the named things of this World that we detect in the content of the web page.

But there is one important thing to keep in mind: these are the named things that exist in the knowledge base at that moment. The demo is constrained to the tens of millions of fully structured named entities that come from these various public data sources. However, the purpose of a knowledge base is to be nurtured and extended. Organizations could add private datasets into the mix to augment the knowledge of the system or to specialize it to specific domains of interest.

Another important thing to keep in mind is that we have constrained this demo to a specific domain of things, namely organizations. The demo only considers a subset of entities from the KBpedia knowledge base: anything that is an organization. This shows how KBpedia can be sliced and diced to be domain-specific. How millions of entities can be categorized into different kinds of domains is what leads to purposeful, dedicated services.

The tag that appears in orange in the text is the organization entity that has been detected to be the organization that published that web page. All the other entities appear in blue. If you click on one of these entities, then you will be redirected to the entity view page. That page will show you all the structured information we have related to these entities in the knowledge base, and you will see how it ties to the KBpedia conceptual structure.

Analysis tab

The Analysis tab is the core of the demo. It presents analyses of the web page that use the tagged concepts and entities to generate new information about the page. These are just some of the analyses we developed for the demo; all kinds of other analyses could be created in the future depending on the needs of our clients.

Language analysis

The first thing we detect is the language used to write the web page. The analysis is performed on the extracted body content of the page. We can detect about 125 languages at the moment. Cognonto is multilingual at its core, but at the moment we only configured the demo to analyze English web pages. Non-English web pages can be analyzed, but only English surface forms will be detected.

Topic analysis

The topic analysis section shows what the demo considers to be the most important concepts detected in the web page. Depending on a full suite of criteria, one concept will score higher than another. Note that all the concepts exist in the KBpedia conceptual structure. This means that we don’t simply “tag” a concept: we tag a concept that is part of an entire structure with hundreds and thousands of parent or child concepts, and that is linked to external schemas, vocabularies and ontologies. Again, these are not simple tags, these are windows into a much broader [conceptual] world.

Publisher analysis

The publisher analysis section shows what we consider to be the organization that published the web page. This analysis is much more complex in its processing: it involves an analysis pipeline that includes multiple machine learning algorithms. However, there is one thing that distinguishes it at its core from other, more conventional machine learning pipelines: the heavy leveraging of the KBpedia conceptual structure. We use the tagged named entities the demo discovered, we check their types and then we analyze their structure within KBpedia, using their SuperTypes for further analysis. Then we check the occurrence of their placements in the page, compute a final likelihood score and determine whether one of these tagged entities can be considered the publisher of the web page.

Organizational Analysis

The organizational analysis is one of the steps performed by the publisher analysis that we wanted to make explicit. What we do here is show all the organization entities that we detected in the web page, and where in the web page (metadata, body content, etc.) they appear.

The important thing to understand here is how we detect organizations. We do not just check if the entities are explicitly typed as Organization; we check if the entities are of type Organization by inference. What does that mean? It means that we use the KBpedia structure to keep all the tagged named entities that can be inferred to be an Organization. Most of these organization entities are not explicitly defined to be of type kbpedia:Organization, so how can the analysis decide? Cognonto uses the KBpedia structure, and its linkages to external vocabularies, schemas and ontologies, to determine which of the tagged named entities are of type kbpedia:Organization by inference.
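
Conceptually, this boils down to filtering the tagged entities by an inferred type. Here is a minimal sketch of the idea, where inferred-types is a hypothetical function returning the set of KBpedia types an entity belongs to, directly or by inference, and :kbpedia/Organization is a stand-in for the kbpedia:Organization reference concept.

;; hypothetical: entity -> set of KBpedia types (direct + inferred)
(declare inferred-types)

(defn organization-entities
  "Keep only the tagged entities that are organizations by inference."
  [tagged-entities]
  (filter #(contains? (inferred-types %) :kbpedia/Organization)
          tagged-entities))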

Take a look at the kbpedia:Organization page. Take a look at all the core structure and external structure linkages this single concept has with external conceptual structures. It is this structure that is used to determine whether a named entity that exists in the KBpedia knowledge base is an Organization or not. There is no magic, but it is really powerful!

Metadata Extraction

All the metadata extracted by the demo is displayed at the end of the Analysis tab. This metadata comes from HTML meta elements or from embedded microdata and RDFa structured content. Everything that got detected is displayed in this tab.
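
As a generic illustration of this step (not the demo’s actual extractor), here is a minimal sketch that pulls HTML meta elements out of a fetched page; embedded microdata and RDFa would require a dedicated parser on top of this.

```python
# Collect <meta name=...> and <meta property=...> values from raw HTML.
from bs4 import BeautifulSoup

def extract_metadata(html):
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    for meta in soup.find_all("meta"):
        key = meta.get("name") or meta.get("property")  # e.g. description, og:title
        if key and meta.get("content"):
            metadata[key] = meta["content"]
    return metadata

html = '<html><head><meta property="og:title" content="Example page"/></head></html>'
print(extract_metadata(html))  # {'og:title': 'Example page'}
```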

Graphs tab

The Graphs tab shows a graphical visualization tool. Its purpose is simply to contextualize the concepts identified by the topic analysis within the upper structure of KBpedia, showing how everything is interconnected. But keep in mind that these are just tiny snapshots of the whole picture: there exist millions of links between these concepts and billions of inferred facts!

Here is a hierarchical view of the graph:


Here is a network view of that same graph:


Export tab

The Export tab is just a way for the user to visualize the resultset generated by Cognonto, the one that enables the web user interface to display the information you are seeing. It shows that all the information is structured and could be used by other computer systems for other purposes.


At the core of everything there is one thing: the KBpedia conceptual structure. It is used across the board: it is what instructs the machine learning algorithms, what helps us analyze textual content such as web pages, what helps us identify concepts and entities, what helps us contextualize content, etc. This is the heart of Cognonto; everything else is just nuts and bolts. KBpedia can, and should, be extended with other private and public data sources. Cognonto/KBpedia is a living thing: it heals, it adapts and it evolves.

Posted at 17:48

September 21

Frederick Giasson: Cognonto

I am proud to announce the start of a new venture called Cognonto. I am particularly proud of it because even if it is just starting, it is in fact more than eight years old. It is the embodiment of eight years of research, of experimentation, of a good deal of frustration and of great joy with my long-time partner Mike.

Eight years ago, we set a 5-to-10-year vision for our work as partners. We defined an initial series of technological goals for which we outlined a series of yearly milestones. The goals were related to helping solve decades-old problems with data integration and interoperability using what was, at the time, a completely new research field: the Semantic Web.

And here we are eight years later, after countless hours of work creating all kinds of projects and services to pay for the research and the pieces of technology we developed for these purposes. Cognonto is the embodiment of that effort, but the effort also produced a series of other purposeful projects such as Structured Dynamics, UMBEL, the Open Semantic Framework and a series of other open source collaterals.

We spent eight years creating, sanitizing, making coherent and consistent, and generating and regenerating a conceptual structure of now 38,930 reference concepts with 138,868 mapping links to 27 external schemas, vocabularies and datasets. This led to the creation of KBpedia, the knowledge graph that drives Cognonto. The full statistics are available here.

I can’t thank Mike enough for this long and wonderful journey that led to the creation of Cognonto. I sent him an endless number of concept lists that he diligently screened, assessed and mapped. We spent hundreds of hours discussing the nuts and bolts of the structure, arguing about its core concepts and how they should be defined and used. It was not without pain, but I believe that the result is truly astonishing.

I won’t copy/paste the Cognonto press release here; a link will suffice. It is just not possible for me to write a better introduction than the two-pager Mike wrote for the press release. I would also suggest that you read his Cognonto introduction blog post: Cognonto is on the Hunt for Big AI Game.

In the coming weeks, I will write a lot about Cognonto: what it is, how it can be used, what its use cases are, how the information presented in the demo and the knowledge graph sections should be interpreted, and what these pages tell you.

Posted at 13:19

September 19

Leigh Dodds: Why are bulk downloads of open data important?

I was really pleased to see that at the GODAN Summit last week

Posted at 17:47

September 15

Leigh Dodds: People like you are in this dataset

One of the recent projects we’ve done at

Posted at 22:38

September 02

Dublin Core Metadata Initiative: Tongfang Co., Ltd. becomes DC-2016 Reception Sponsor

2016-09-02, DCMI is pleased to announce that Tongfang Co., Ltd. has become the DC-2016 Reception Sponsor. Tongfang is a high-tech company established in 1997. Over the years, Tongfang has taken 'developing into a world-class high-tech enterprise' as its goal and 'serving the society with science and technology' as its mission. The company, by making use of the strengths of Tsinghua University in research and human resources, has been implementing such strategies as 'technology + capital', 'cooperation and development' and 'branding + internationalization'. With a corporate culture featuring 'action, exploration and excellence; loyalty, responsibility and value', Tongfang has been making explorations and innovations in the information, energy and environment industries. As of 2013, Tongfang had total assets of more than $5 billion and annual revenue of over $3.5 billion. For more information on becoming a sponsor for DC-2016, see Visit the conference website at

Posted at 23:59

Dublin Core Metadata Initiative: RFID System Technology Co. Ltd. becomes DC-2016 Gold Sponsor

2016-09-02, DCMI is pleased to announce that Shanghai RFID System Technology Co., Ltd. has become a Gold Sponsor of DC-2016 in Copenhagen. Founded on October 10, 2004, Shanghai RFID System Technology Co., Ltd. is a leading automatic management solution provider for libraries in China. The library of Chenyi College, Jimei University, was the first library in China equipped with an RFID automation system. In the past 12 years, the company has provided services for more than 400 libraries in China, such as the National Library of China, Shanghai Library and Hangzhou Library. The services provided include library self-service systems, automatic book sorting systems, digital reading solutions, device monitoring, mini libraries, a cloud and big data platform, a knowledge discovery system, and mobile applications for libraries. For more information on becoming a sponsor for DC-2016, see Visit the conference website at

Posted at 23:59

August 31

Leigh Dodds: Help me use your data

I’ve been interviewed a couple of times recently by people interested in understanding how best to publish data to make it useful for others.  Once by a startup and a couple of times by researchers. The core of the discussion has essentially been the same question: “how do you know if a dataset will be useful to you?”

I’ve given essentially the same answer each time. When I’m sifting through dataset descriptions, either in a portal or via a web search, my first stage of filtering involves looking for:

  1. A brief summary of the dataset: e.g. a title and a description
  2. The licence
  3. Some idea of its coverage, e.g. geographic coverage, scope of time series, level of aggregation, etc
  4. Whether it’s in a usable format

Beyond that, there’s a lot more I’m interested in: the provenance of the data, its timeliness and a variety of quality indicators. But those pieces of information are what I’m looking for right at the start. I’ll happily jump through hoops to massage some data into a better format. But if the licence or coverage isn’t right then it’s useless to me.

We can frame these as questions:

  1. What is it? (Description)
  2. Can I use it? (Licence)
  3. Will it help answer my question? (in whole, or in part)
  4. How difficult will it be to use? (format, technical characteristics)

It’s frustrating how often these essentials aren’t readily available.

Here’s an example of why this is important.

A weather data example

I’m currently working on a project that needs access to local weather observations. I want openly licensed temperature readings for my local area.

My initial port of call was the

Posted at 18:12

AKSW Group - University of Leipzig: AKSW Colloquium, 05.09.2016. LOD Cloud Statistics, OpenAccess at Leipzig University.

On the upcoming Monday (05.09.2016), the AKSW group will discuss topics related to the Semantic Web and LOD Cloud statistics. We will also have an invited speaker from Leipzig University Library (UBL), Dr. Astrid Vieler, talking about Open Access at Leipzig University.

LODStats: The Data Web Census Dataset

by Ivan Ermilov et al.
Presented by: Ivan Ermilov

Abstract: Over the past years, the size of the Data Web has increased significantly, which makes obtaining general insights into its growth and structure both more challenging and more desirable. The lack of such insights hinders important data management tasks such as quality, privacy and coverage analysis. In this paper, we present the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web. LODStats is based on RDF datasets from, and data catalogs and at the time of writing lists over 9 000 RDF datasets. For each RDF dataset, LODStats collects comprehensive statistics and makes these available in adhering to the LDSO vocabulary. This analysis has been regularly published and enhanced over the past five years at the public platform We give a comprehensive overview over the resulting dataset.

OpenAccess at Leipzig University

Invited talk by Dr. Astrid Vieler from Leipzig University Library (UBL). The talk will be about Open Access in general and the Open Access policy of our university in particular. She will tell us more about the rights we have toward publishers, and she will give us advice and hints on how we can increase the visibility of our publications.

After the talks, there will be more time for discussion in smaller groups, as well as coffee and cake. The colloquium starts at 3 p.m. and takes place on the 7th floor (Leipzig, Augustusplatz 10, Paulinum).

Posted at 09:23

August 28

Bob DuCharme: Converting between MIDI and RDF: readable MIDI and more fun with RDF

Listen to my fun!

Posted at 17:24

August 24

Frederick Giasson: Winnipeg City’s NOW [Data] Portal

Winnipeg City’s NOW (Neighbourhoods Of Winnipeg) Portal is an initiative to create a complete neighbourhood web portal for its citizens. At the core of the project is a set of about 47 fully linked, integrated and structured datasets of things of interest to Winnipeggers. The focal point of the portal is Winnipeg’s 236 neighbourhoods, which define its main structure. The portal has six main sections: topics of interest, maps, history, census, images and economic development. The portal is meant to be used by citizens to find things of interest in their neighbourhood, to learn its history, to see images of its points of interest, to find tools that help economic development, etc.

The NOW portal is not new; Structured Dynamics was also its main technical contractor for its first release in 2013. However, we just finished helping Winnipeg City’s NOW team migrate their older NOW portal from OSF 1.x to OSF 3.x and from Drupal 6 to Drupal 7; we also trained them on the new system. Major improvements accompany this upgrade, but the user interface design is essentially the same.

I will first introduce each major section of the portal and explain its main features. Then I will discuss the new improvements to the portal.


A NOW portal user won’t notice any of this, but the main feature of the portal is the data it uses. The portal manages 47 (and growing) fully structured, integrated and linked datasets of things of interest to Winnipeggers. What the portal really does is manage entities. Each kind of entity (swimming pools, parks, places, images, addresses, streets, etc.) is defined with multiple properties and values. Several of the entities reference other entities in other datasets (for example, an assessment parcel from the Assessment Parcels dataset references neighbourhood entities and property address entities from their respective datasets).

The fact that these datasets are fully structured and integrated means that we can leverage these characteristics to create a powerful search experience: filtering the information on any of the properties, biasing the searches depending on where a keyword match occurs, etc.

Here is the list of all 47 datasets that currently exist in the portal:

  1. Aboriginal Service Providers
  2. Arenas
  3. Neighbourhoods of Winnipeg City
  4. Streets
  5. Economic Development Images
  6. Recreation & Leisure Images
  7. Neighbourhoods Images
  8. Volunteer Images
  9. Library Images
  10. Parks Images
  11. Census 2006
  12. Census 2001
  13. Winnipeg Internal Websites
  14. Winnipeg External Websites
  15. Heritage Buildings and Resources
  16. NOW Local Content Dataset
  17. Outdoor Swimming Pools
  18. Zoning Parcels
  19. School Divisions
  20. Property Addresses
  21. Wading Pools
  22. Electoral wards of Winnipeg City
  23. Assessment Parcels
  24. Libraries
  25. Community Centres
  26. Police Service Centers
  27. Community Gardens
  28. Leisure Centres
  29. Parks and Open Spaces
  30. Community Committee
  31. Commercial real estates
  32. Sports and Recreation Facilities
  33. Community Characterization Areas
  34. Indoor Swimming Pools
  35. Neighbourhood Clusters
  36. Fire and Paramedic Stations
  37. Bus Stops
  38. Fire and Paramedic Service Images
  39. Animal Services Images
  40. Skateboard Parks
  41. Daycare Nurseries
  42. Indoor Soccer Fields
  43. Schools
  44. Truck Routes
  45. Fire Stations
  46. Paramedic Stations
  47. Spray Parks Pads

Structured Search

The most useful feature of the portal to me is its full-text search engine. It is simple, clean and quite effective. The search engine is configured to try to give the most relevant results a NOW portal user may be searching for. For example, it will positively bias results that come from specific datasets, or matches that occur in specific property values. The goal of this biasing is to improve the quality of the returned results. This is relatively easy to do since the context of the portal is well known and, because everything is fully structured, we can easily boost the scoring of search results.
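
To illustrate what this kind of biasing looks like in practice, here is a generic Solr-style query sketch; the endpoint, field names and boost values are invented for illustration and are not the NOW portal’s actual configuration.

```python
# A hedged illustration of dataset and field boosting with a Solr "edismax"
# query; all names and values here are invented for the example.
import requests

params = {
    "q": "main street",
    "defType": "edismax",
    # matches on the preferred label count far more than matches in the description
    "qf": "pref_label^10 alt_label^5 description^1",
    # results from some datasets are favoured over others
    "bq": "dataset:streets^3 dataset:property_addresses^2",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/now/select", params=params)
print(response.json()["response"]["numFound"])
```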

Another major gain is that all the search results are fully templated. The search results do not simply return a title and a description; the templates draw on all the information the system has about the matched results and display the most relevant pieces to the users directly in the search results.

For example, if I search for an indoor swimming pool, in most cases it may be because I want to call the front desk to get some information about the pool. This is why key pieces of information are displayed directly in the search results. That way, most users won’t even have to click on a result to get the information they were looking for; it is right there on the search results page.

Here is an example of a search for the keywords main street. As you can see, you get different kinds of results. Each result is templated to present the core information about the entity. You can focus on particular kinds of entities, or filter by their location in specific neighbourhoods.


Templated Search Results

Now let’s see some of the kinds of entities that can be searched on the portal and how they are presented to users.

Here is an example of an assessment parcel that is located in the St. John’s neighbourhood. The address, the value, the type and the location of the parcel on a map are displayed directly in the search results.

Another kind of entity that can be searched is the property address. These are located on a map, and the value of the parcel and building and the zoning of the address are displayed. The property is also linked to its assessment parcel entity, which can be clicked to get additional information about the parcel.

Another interesting type of entity that can be searched is the street. In this case you get the complete outline of the street directly on a map, so you know where it starts, where it ends and where it is located in the city.

There are more than a thousand geo-localized images of all kinds of things in the city that can be searched. A thumbnail of the image and the location of the thing it depicts appear in the search results.

If you were searching for a nursery for your newborn child, you can quickly see the name, the location on a map and the phone number of the nursery directly in the search result.

These are just a few examples of the fifty different kinds of entities that can appear like this in the search results.


The mapping tool is another powerful feature of the portal. You can search as if you were using the full-text search engine (the top search box on the portal), but you will only get results that can be geo-localized on a map. You can also simply browse the entities of a dataset, or filter entities by their properties/values. You can persist entities you find on the map and save the map for future reference.

In the example below, someone searched for a street (main street) and persisted it on the map. Then they searched for other things such as nurseries and selected the ones near the persisted street, and so on. That way they can visualize the different entities known to the portal on a map to better understand where things are located in the city, what exists near a certain location or within a neighbourhood, etc.


Census Analysis

Census information is vital to the good development of a city. It is necessary to understand the trends of a sector, who populates it, etc., so that the city and other organizations may properly plan their projects to have as much impact as possible.

These are some of the reasons why one of the main sections of the site is dedicated to census data. Key census indicators have been configured in the portal. Users can select different kinds of regions (neighbourhood clusters, community areas and electoral wards) to get the numbers for each of these indicators, and they can select several of these regions to compare them with each other. A chart view and a table view are available for presenting the census data.

History, Images & Points of Interest

The City took the time to write the history of each of its neighbourhoods. In addition to that, it hired professional photographers to photograph the points of interest of the city, to geo-localize them and to write a description for each of the photos. Because of this dedication, users of the portal can learn a great deal about the city in general and the neighbourhood they live in. This is what the History and Images sections of the website are about.

Historic buildings are displayed on a map and they can be browsed from there.

Images of points of interests in the neighbourhood are also located on a map.

Find Your Neighbourhood

Ever wondered which neighbourhood you live in? No problem: go to the home page, put your address in the Find your Neighbourhood section and you will know right away. From there you can learn more about your neighbourhood, such as its history, its points of interest, etc.

Your address will be located on a map, and your neighbourhood will be outlined around it. Not only will you know which neighbourhood you live in, but you will also know where you live within it. From there you can click on the name of the neighbourhood to get to its page and start learning more about it: its history, photos of its points of interest, etc.
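
Under the hood, this amounts to geocoding the address and running a point-in-polygon test against the neighbourhood boundaries. Here is a minimal sketch with shapely, assuming the coordinates and boundaries are already at hand; the numbers below are made up.

```python
# Find which neighbourhood polygon contains a geocoded address.
from shapely.geometry import Point, Polygon

neighbourhoods = {
    "Example Neighbourhood": Polygon([(-97.13, 49.93), (-97.11, 49.93),
                                      (-97.11, 49.95), (-97.13, 49.95)]),
}

def find_neighbourhood(lon, lat):
    point = Point(lon, lat)
    for name, boundary in neighbourhoods.items():
        if boundary.contains(point):
            return name
    return None

print(find_neighbourhood(-97.12, 49.94))  # -> "Example Neighbourhood"
```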

Browsing Content by Topic

Because all the content of the portal is fully structured, it is easy to browse it using a well-defined topic structure. The City developed its own ontology that helps users browse the content of the portal by topics of interest. In the example below, I clicked the Economic Development node and then the Land use topic. Finally, I clicked the Map button to display things related to land use: in this case, zoning and assessment parcels are displayed to the user.

This is another way to find meaningful and interesting content from the portal.

Depending on the topic you choose and the kind of information related to it, you may end up with different options such as a map, a list of links to related documents, etc.

Export Content

Now that I have given an overview of each of the main features of the portal, let’s get back to the geeky things. The first thing I said about this portal is that, at its core, all the information it manages is fully structured, integrated and linked data. If you go to the page of an entity, you can see the underlying data that exists about it in the system: simply click the Export tab at the top of the entity’s page and you will have access to the description of that entity in multiple different formats.
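
That per-entity export can also be consumed programmatically, along the lines of the sketch below; the entity URL is a placeholder and the actual formats available are whatever the Export tab exposes.

```python
# Fetch an entity's description as RDF/XML and load it into an rdflib graph.
import requests
from rdflib import Graph

entity_url = "http://example.org/now/entity/123"  # placeholder entity page
response = requests.get(entity_url, headers={"Accept": "application/rdf+xml"})

g = Graph()
g.parse(data=response.text, format="xml")
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```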

In the future, the City should (or at least I hope will) make the whole set of datasets fully downloadable. Right now you only have access to that information via the per-entity export feature. I say “hope” because this NOW portal is fully disconnected from another City initiative, its open data portal, which uses Socrata. The problem is that barely any of the NOW datasets are available on that portal, and the ones that do appear are the raw ones (semi-structured, undocumented, unintegrated and unlinked): all the normalization, integration and linkage work done by the NOW team hasn’t been leveraged to really improve that datasets catalog.

New with the upgrades

Those who are familiar with the NOW portal will notice a few changes. The user interface did not change that much, but multiple little things got improved in the process. I will cover the most notable of these changes.

The major changes happened in the backend of the portal. The data management in OSF for Drupal 7 is incompatible with what was available in Drupal 6. The management of entities became easier and the configuration of OSF networks became a breeze; a revisioning system has been added, the user interface is more intuitive, etc. There is no comparison possible. However, portal users won’t notice any of this, since these are all site administrator functions.

The first thing users will notice is the completely new full-text search experience. The underlying search engine is almost the same, but the presentation is far better. Every entity type now has its own special template, which displays it in a specific way in the search results. Most of the time the results should be much more relevant, and filtering is easier and cleaner. The search experience is much better in my view.

The overall site performance is much better since different caching strategies have been put in place in OSF 3.x and OSF for Drupal. This means that most of the features of the portal should react more swiftly.

Now every type of entity managed by the portal is templated: its web page is laid out in a specific way to optimize the information it conveys to users, and so is its search result “mini page” when it is returned by a search query.

Multilinguality is now fully supported by the portal; however, not everything is currently translated. Expect a fully translated French version of the NOW portal in the future.

Creating a Network of Portals

One of the most interesting features that comes with this upgrade is that the NOW portal is now in a position to participate in a network of OSF instances. What does that mean? Well, it means that the NOW portal could create partnerships with other local (regional, national or international) organizations to share datasets (and their maintenance costs).

Are there other organizations that use this kind of system? Well, there is at least one other right in Winnipeg: MyPeg, also developed by Structured Dynamics. MyPeg uses RDF to model its information and OSF to manage it. MyPeg is a non-profit organization that uses census (and other indicator) data to do studies on the well-being of Winnipeggers. The team behind it are research experts in indicator data, and their indicator datasets (which include census data) are top notch.

Let’s hypothesize that the two groups were interested in collaborating. Let’s say the NOW portal would like to use MyPeg’s census datasets instead of its own, since they are more complete, more accurate and include a larger number of important indicators. What they basically want is to outsource the creation and maintenance of the census/indicator data to a local, dedicated and highly professional organization. The only things they would need to do are:

  1. Formalize their relationship by signing a usage agreement
  2. Configure the OSF network in the NOW portal’s OSF for Drupal instance
  3. Register the datasets the NOW portal wants to use from MyPeg

Once these three steps are done, which would take no more than a couple of minutes, the system administrators of the NOW portal could start using the indicator datasets as if they existed on their own network. (The reverse could also be true for MyPeg.) Everything would be transparent to them. From then on, all the fixes and updates performed by MyPeg on their indicator datasets would immediately appear on the NOW portal and be accessible to its users.

This is one possible way to collaborate. Another would be simply to share the serialized datasets on a routine basis (every month, every six months, every year), so that the NOW portal re-imports the datasets from the shared files. This is possible because both organizations use the same ontology to describe the indicator data, which means no modification is required by the City to take the new information into account; they only have to import the files and update their local datasets. This is the beauty of ontologies.
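
Here is a minimal sketch of what that file-based exchange could look like with rdflib, assuming both dumps are described with the same ontology; the file names are placeholders.

```python
# Merge a shared indicator dump into the local graph; because both datasets
# use the same ontology, no mapping step is needed before the merge.
from rdflib import Graph

local = Graph()
local.parse("now_census_indicators.ttl", format="turtle")     # hypothetical local copy

shared = Graph()
shared.parse("shared_census_indicators.ttl", format="turtle") # hypothetical shared dump

local += shared
local.serialize(destination="now_census_indicators_updated.ttl", format="turtle")
```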


The new NOW portal is a great service for the citizens of Winnipeg. It is also a really good example of a web portal that leverages fully structured, integrated and linked data and, to me, of the features that should go along with a municipal data portal.

Posted at 17:33

Copyright of the postings is owned by the original blog authors. Contact us.