Planet RDF

It's triples all the way down

November 30

Ebiquity research group UMBC: PhD Proposal: Ankur Padia, Dealing with Dubious Facts in Knowledge Graphs

the skeptic

Dissertation Proposal

Dealing with Dubious Facts
in Knowledge Graphs

Ankur Padia

1:00-3:00pm Wednesday, 30 November 2016, ITE 325b, UMBC

Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations between pairs of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high-quality knowledge graphs but are expensive, time-consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs, but the extracted information can be of poor quality due to the presence of dubious facts.

An extracted fact is dubious if it is incorrect, inexact or correct but lacks evidence. A fact might be dubious because of the errors made by NLP extraction techniques, improper design consideration of the internal components of the system, choice of learning techniques (semi-supervised or unsupervised), relatively poor quality of heuristics or the syntactic complexity of underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts from a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.

Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)

Posted at 02:25

November 26

AKSW Group - University of Leipzig: AKSW Colloquium, 28.11.2016, NED using PBOH + Large-Scale Learning of Relation-Extraction Rules.

In the upcoming Colloquium, November the 28th at 3 PM, two papers will be presented:

Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

Diego Moussallem will discuss the paper “Probabilistic Bag-Of-Hyperlinks Model for Entity Linking” by Octavian-Eugen Ganea et al., which was accepted at WWW 2016.

Abstract:  Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods

Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web

Afterward, René Speck will present the paper “Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web” by Sebastian Krause et al., which was accepted at ISWC 2012.

Abstract: We present a large-scale relation extraction (RE) system which learns grammar-based RE rules from the Web by utilizing large numbers of relation instances as seed. Our goal is to obtain rule sets large enough to cover the actual range of linguistic variation, thus tackling the long-tail problem of real-world applications. A variant of distant supervision learns several relations in parallel, enabling a new method of rule filtering. The system detects both binary and n-ary relations. We target 39 relations from Freebase, for which 3M sentences extracted from 20M web pages serve as the basis for learning an average of 40K distinctive rules per relation. Employing an efficient dependency parser, the average run time for each relation is only 19 hours. We compare these rules with ones learned from local corpora of different sizes and demonstrate that the Web is indeed needed for a good coverage of linguistic variation

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:30

November 21

Frederick Giasson: Leveraging KBpedia Aspects To Generate Training Sets Automatically

In previous articles I have covered multiple ways to create training corpuses for unsupervised learning and positive and negative training sets for supervised learning 1, 2, 3 using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive powers depending on the task at hand.

So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:

  1. Using the links that exist between each KBpedia reference concept and its related Wikipedia pages
  2. Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of named entities.

Now we will introduce a third way to create a different kind of training corpus:

  1. Using the KBpedia aspects linkages.

Aspects are aggregations of entities that are grouped according to characteristics other than their direct types. Aspects help to group related entities by situation, rather than by identity or definition. They are another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.

Creating New Domain Using KBpedia Aspects

To continue with the musical domain, there exist two aspects of interest:

  1. Music
  2. Genres

What we will do first is to query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all of the KBpedia reference concepts that are related to the Music or the Genres aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.

prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/> 
prefix schema: <http://schema.org/>

select distinct ?class count(distinct ?entity) as ?nb
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .

  graph <http://kbpedia.org/1.10/>
  {
    {?category <http://kbpedia.org/ontologies/kko#hasMusicAspect> ?class .}
    union
    {?category <http://kbpedia.org/ontologies/kko#hasGenre> ?class .}
  }
}
order by desc(?nb)

reference concept                                           nb
http://kbpedia.org/kko/rc/Album-CW                          128772
http://kbpedia.org/kko/rc/Song-CW                           74886
http://kbpedia.org/kko/rc/Music                             51006
http://kbpedia.org/kko/rc/Single                            50661
http://kbpedia.org/kko/rc/RecordCompany                     5695
http://kbpedia.org/kko/rc/MusicalComposition                5272
http://kbpedia.org/kko/rc/MovieSoundtrack                   2919
http://kbpedia.org/kko/rc/Lyric-WordsToSong                 2374
http://kbpedia.org/kko/rc/Band-MusicGroup                   2185
http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup   2078
http://kbpedia.org/kko/rc/Ensemble                          1438
http://kbpedia.org/kko/rc/Orchestra                         1380
http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup   1335
http://kbpedia.org/kko/rc/Choir                             754
http://kbpedia.org/kko/rc/Concerto                          424
http://kbpedia.org/kko/rc/Symphony                          299
http://kbpedia.org/kko/rc/Singing                           154

Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and to create a new domain corpus with them. We will use the new version of KBpedia to create the full set of reference concepts that will scope our domain by inference.

Next we will try to use this information to create two totally different kinds of training corpuses:

  1. One that will rely on the links between the reference concepts and Wikipedia pages
  2. One that will rely on the linkages to external vocabularies to create a list of named entities that will be used as the training corpus

Creating Model With Reference Concepts

The first training corpus we want to test is one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first thing is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])


(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv" 
                                    "resources/aspects-corpus-normalized/")

Once pruned, we end up with a domain of 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:

;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned" "resources/semantic-interpreters/aspects-concept-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Then we have to evaluate this new model using the gold standard:

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  28
False positive:  0
True negative:  923
False negative:  66

Precision:  1.0
Recall:  0.29787233
Accuracy:  0.93510324
F1:  0.45901638

Now let’s try to find better hyperparameters using grid search:

(svm-grid-search "grid-search-aspects-concept-pruned-tests" 
                       "resources/svm/aspects-concept-pruned/" 
                       "resources/gold-standard-full.csv"
                       :selection-metric :f1
                       :grid-parameters [{:c [1 2 4 16 256]
                                          :e [0.001 0.01 0.1]
                                          :algorithm [:l2l2]
                                          :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.84444445 
 :c 1
 :e 0.001 
 :algorithm :l2l2
 :weight 30}

After running the grid search with these initial broad-range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this is the best score we have obtained for the full gold standard 2, 3. Let’s see all of the metrics for this configuration:

(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights {1 30.0}
                 :v nil
                 :c 1 
                 :e 0.001
                 :algorithm :l2l2)

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  76
False positive:  10
True negative:  913
False negative:  18

Precision:  0.88372093
Recall:  0.80851066
Accuracy:  0.972468
F1:  0.84444445

These results are also the best balance between precision and recall that we have obtained so far 2, 3. Better precision can be obtained if necessary, but only at the expense of lower recall.

Let’s take a look at the improvements we got compared to the previous training corpuses we had:

  • Precision: +4.16%
  • Recall: +35.72%
  • Accuracy: +2.06%
  • F1: +20.63%

This new training corpus based on the KBpedia aspects, after hyperparameter optimization, increased all of the metrics we calculate. The most striking improvement is the recall, which improved by more than 35%.
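
These relative changes are simple ratios against the metrics of the previous model. Here is a minimal sketch of that arithmetic; the previous-run value in the example is a placeholder for illustration, not a number reported in this article:

;; A minimal sketch of how a relative change between two runs is computed.
(defn relative-change
  "Percentage change of current relative to previous."
  [previous current]
  (* 100.0 (/ (- current previous) previous)))

;; e.g. a recall moving from a hypothetical 0.5957 to the 0.80851066 above:
;; (relative-change 0.5957 0.80851066) => ~35.7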

Creating Model With Entities

The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.

The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:

(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc
  [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .
                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND() LIMIT " nb) kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                           (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))


(defn generate-domain-by-rcs 
  [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs] (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/"
                         "http://kbpedia.org/kko/rc/Concerto"
                         "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Religious"
                         "http://kbpedia.org/kko/rc/PunkMusic"
                         "http://kbpedia.org/kko/rc/BluesMusic"
                         "http://kbpedia.org/kko/rc/HeavyMetalMusic"
                         "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic"
                         "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/FolkMusic"
                         "http://kbpedia.org/kko/rc/Verse"
                         "http://kbpedia.org/kko/rc/RockBand"
                         "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain"
                         "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer"
                         "http://kbpedia.org/kko/rc/HouseMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry"
                         "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic"
                         "http://kbpedia.org/kko/rc/AlternativeRockBand"
                         "http://kbpedia.org/kko/rc/AlternativeRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Trance"
                         "http://kbpedia.org/kko/rc/Ensemble"
                         "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic"
                         "http://kbpedia.org/kko/rc/RockabillyMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Blues"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Opera"
                         "http://kbpedia.org/kko/rc/Choir"
                         "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Country"
                         "http://kbpedia.org/kko/rc/CountryMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-PopRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative"
                         "http://kbpedia.org/kko/rc/Chorus"
                         "http://kbpedia.org/kko/rc/FusionMusic"
                         "http://kbpedia.org/kko/rc/MovieSoundtrack"
                         "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque"
                         "http://kbpedia.org/kko/rc/MusicalComposition-NewAge"
                         "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop"
                         "http://kbpedia.org/kko/rc/TranceMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Celtic"
                         "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Baroque"
                         "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/Symphony"
                         "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll"
                         "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic"
                         "http://kbpedia.org/kko/rc/JazzMusic"
                         "http://kbpedia.org/kko/rc/MusicalChord"
                         "http://kbpedia.org/kko/rc/ProgressiveRockMusic"
                         "http://kbpedia.org/kko/rc/GothicMusic"
                         "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic"
                         "http://kbpedia.org/kko/rc/NationalAnthem"
                         "http://kbpedia.org/kko/rc/OldieSong"
                         "http://kbpedia.org/kko/rc/Song-Sung"
                         "http://kbpedia.org/kko/rc/RockMusic"
                         "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco"
                         "http://kbpedia.org/kko/rc/GospelMusic"
                         "http://kbpedia.org/kko/rc/BluegrassMusic"
                         "http://kbpedia.org/kko/rc/FolkRockMusic"
                         "http://kbpedia.org/kko/rc/RockAndRollMusic"
                         "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW"
                         "http://kbpedia.org/kko/rc/Tune"
                         "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/RapMusic"
                         "http://kbpedia.org/kko/rc/RecordCompany"
                         "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica"
                         "http://kbpedia.org/kko/rc/Music"
                         "http://kbpedia.org/kko/rc/GlamRockMusic"
                         "http://kbpedia.org/kko/rc/LoveSong"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Gothic"
                         "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk"
                         "http://kbpedia.org/kko/rc/BluesRockMusic"
                         "http://kbpedia.org/kko/rc/TechnoMusic"
                         "http://kbpedia.org/kko/rc/SoulMusic"
                         "http://kbpedia.org/kko/rc/ChamberMusicComposition"
                         "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition"
                         "http://kbpedia.org/kko/rc/ElectronicMusic"
                         "http://kbpedia.org/kko/rc/CompositionMovement"
                         "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/Riff"
                         "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Industrial"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Funk"
                         "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic"
                         "http://kbpedia.org/kko/rc/Single"
                         "http://kbpedia.org/kko/rc/Singing"
                         "http://kbpedia.org/kko/rc/SwingMusic"
                         "http://kbpedia.org/kko/rc/Song-CW"
                         "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz"
                         "http://kbpedia.org/kko/rc/ClassicalMusic"
                         "http://kbpedia.org/kko/rc/MilitaryBand"
                         "http://kbpedia.org/kko/rc/SkaMusic"
                         "http://kbpedia.org/kko/rc/Orchestra"
                         "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Ambient"
                         "http://kbpedia.org/kko/rc/DiscoMusic"] "resources/aspects-domain-corpus.csv")

Next, let’s create the actual positive training corpus and normalize it:

(cache-aspects-corpus "resources/aspects-entities-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")

We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base. These will be the 22 features of our model. The complete positive training set has 799 documents in it.

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-entities-corpus-dictionary.pruned.csv")

(build-semantic-interpreter "aspects-entities-pruned" "resources/semantic-interpreters/aspects-entities-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

(build-svm-model-vectors "resources/svm/aspects-entities-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

(train-svm-model "svm.aspects.entities.pruned" "resources/svm/aspects-entities-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Now let’s evaluate the model with default hyperparameters:

(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
True positive:  9
False positive:  10
True negative:  913
False negative:  85

Precision:  0.47368422
Recall:  0.095744684
Accuracy:  0.906588
F1:  0.15929204

Now let’s try to improve this F1 score using grid search:

(svm-grid-search "grid-search-aspects-entities-pruned-tests" 
                 "resources/svm/aspects-entities-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.44052863
 :c 4
 :e 0.001
 :algorithm :l2l2
 :weight 15}

We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why having a pipeline that can automatically create the training corpuses, optimize the hyperparameters and evaluate the models is so valuable: this is where a data scientist spends the bulk of the time when creating models.
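
To illustrate, here is a minimal sketch of such a pipeline that chains the functions used in this article; it assumes that (svm-grid-search) returns the winning configuration map shown above, which may not match the actual implementation:

;; A minimal sketch of an automated "optimize then evaluate" pipeline, reusing
;; the functions shown in this article. The assumption that (svm-grid-search)
;; returns the best configuration map is illustrative only.
(defn optimize-and-evaluate
  [model-name svm-folder gold-standard]
  (let [best (svm-grid-search (str "grid-search-" model-name)
                              svm-folder
                              gold-standard
                              :selection-metric :f1
                              :grid-parameters [{:c [1 2 4 16 256]
                                                 :e [0.001 0.01 0.1]
                                                 :algorithm [:l2l2]
                                                 :weight [1 15 30]}])]
    ;; re-train with the winning hyperparameters, then evaluate on the gold standard
    (train-svm-model model-name svm-folder
                     :weights {1 (double (:weight best))}
                     :v nil
                     :c (:c best)
                     :e (:e best)
                     :algorithm (:algorithm best))
    (evaluate-model model-name gold-standard)))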

Conclusion

After automatically creating multiple different positive and negative training sets, after testing multiple learning methods and optimizing hyperparameters, we found the best training sets with the best learning method and the best hyperparameters to create an initial, optimal model that has an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.

The thing that is really interesting and innovative in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the learner can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning 4.

The most wonderful thing from an operational standpoint is that all of this searching, testing and optimizing can be performed by a computer automatically. The only tasks required of a human are to define the scope of the domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.

Posted at 11:14

November 17

Frederick Giasson: Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 2

In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.

So far, we have already reached 95% accuracy by tweaking the hyperparameters and the training corpuses, but the F1 score is still around ~70% on the full gold standard, which can be improved. There are also situations where precision should be nearly perfect (because false positives are really not acceptable) or where the recall should be optimized.

Here we will try to improve this situation by using ensemble learning. Ensemble learning uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have a vote, and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble:

  1. Bootstrap aggregating (bagging)
  2. Asymmetric bagging 1
  3. Random subspace method (feature bagging)
  4. Asymmetric bagging + random subspace method (ABRS) 1
  5. Bootstrap aggregating + random subspace method (BRS)

Different strategies will be used depending on factors such as: are the positive and negative training documents unbalanced? How many features does the model have? etc. Let’s introduce each of these different strategies.

Note that in this article I am only creating ensembles with linear SVM learners. However, an ensemble can be composed of multiple different kinds of learners, such as SVMs with non-linear kernels, decision trees, etc. To simplify this article, we will stick to a single linear SVM with multiple different training corpuses and features.

Ensemble Learning Strategies

Bootstrap Aggregating (bagging)

The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in multiple different training sets.
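
As a quick illustration, here is a minimal sketch of drawing a bootstrap sample with replacement; it is not the code used by (train-ensemble-svm):

;; A minimal sketch of bootstrap sampling (with replacement) from a training set.
(defn bootstrap-sample
  "Draw n documents at random, with replacement, from documents."
  [documents n]
  (let [docs (vec documents)]
    (repeatedly n #(rand-nth docs))))

;; Each model of the ensemble gets its own sample, for example:
;; (bootstrap-sample positive-documents 3500)
;; (bootstrap-sample negative-documents 3500)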

Asymmetric Bagging

Asymmetric bagging was proposed by Tao, Tang, Li and Wu 1. It is used when the number of positive training samples is highly unbalanced relative to the number of negative training samples. The idea is to create a random subset (with replacement) of the negative training samples, while always keeping the full set of positive training samples.

Random Subspace method (feature bagging)

The idea behind feature bagging is the same as bagging, but works on the features of the model instead of the training sets. It attempts to reduce the correlation between estimators (features) in an ensemble by training them on random samples of features instead of the entire feature set.

Asymmetric Bagging + Random Subspace method (ABRS)

Asymmetric Bagging and the Random Subspace Method have also been proposed by Tao, Tang, Li and Wu 1. The problems they had with their content-based image retrieval system are the same ones we have with this kind of automatic training corpus generated from a knowledge graph:

  1. SVM is unstable on small-sized training sets
  2. SVM’s optimal hyperplane may be biased when the number of positive training samples is much smaller than the number of negative training samples (this is why we used weights in this case), and
  3. The training set is smaller than the number of features in the SVM model.

The third point is not immediately an issue for us (except if you have a domain with many more features than we had in our example), but becomes one when we start using asymmetric bagging.

What we want to do here is to implement asymmetric bagging and the random subspace method to create S individual models. This method is called ABRS-SVM, which stands for Asymmetric Bagging Random Subspace Support Vector Machines.

The algorithm we will use is:

  1. Let P be the number of positive training documents, N the number of negative training documents, and F the number of features in the training data.
  2. Choose S, the number of individual models in the ensemble.
  3. For each individual model s, choose N_s (with N_s < N), the number of negative training documents for s.
  4. For each individual model s, choose F_s (with F_s < F), the number of input features for s.
  5. For each individual model s, create a training set by choosing F_s features from the F features with replacement, by choosing N_s negative training documents from the N negative documents with replacement, and by keeping all P positive training documents; then train the model (a minimal sketch of this sampling step follows the list).
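
Here is a minimal sketch of that sampling step for one individual model; the positives, negatives and features collections are placeholders, and this is not the actual (train-ensemble-svm) implementation:

;; A minimal sketch of the ABRS sampling step for a single ensemble member.
(defn abrs-training-set
  "Keep all positive documents, draw nb-negatives negative documents with
   replacement, and draw nb-features features with replacement."
  [positives negatives features nb-negatives nb-features]
  (let [negs  (vec negatives)
        feats (vec features)]
    {:positives positives
     :negatives (repeatedly nb-negatives #(rand-nth negs))
     :features  (repeatedly nb-features #(rand-nth feats))}))

;; Repeating this S times yields the S individual training sets of the ensemble.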

Bootstrap Aggregating + Random Subspace method (BRS)

Bagging combined with feature bagging (BRS) is the same as asymmetric bagging with the random subspace method, except that we use regular bagging instead of asymmetric bagging. (ABRS should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise BRS should be used.)

SVM Learner

We use a linear Support Vector Machine (SVM) as the learner for the ensemble. What we will be creating is a series of SVM models that differ depending on the ensemble method(s) used to create the ensemble.

Build Training Document Vectors

The first step we have to do is to create a structure where all the positive and negative training documents have their vector representation. Since this is the task that takes the most time in the whole process, we will calculate them using the (build-svm-model-vectors) function and we will serialize the structure on the file system. That way, to create the ensemble’s models, we will only have to load it from the file system without having to re-calculate it each time.
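
As an illustration of the caching idea only, assuming the structure is plain Clojure data, it could be persisted and reloaded like this; the actual (build-svm-model-vectors) function manages its own serialization:

;; A minimal sketch of caching a computed structure on disk so it can be
;; reloaded without re-computing it.
(require '[clojure.edn :as edn])

(defn save-structure!
  [file structure]
  (spit file (pr-str structure)))

(defn load-structure
  [file]
  (edn/read-string (slurp file)))

;; (save-structure! "resources/ensemble-svm/model-vectors.edn" model-vectors)
;; (def model-vectors (load-structure "resources/ensemble-svm/model-vectors.edn"))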

Train, Classify and Evaluate Ensembles

The goal is to create a set of X SVM classifiers, each of them using a different model. The models can differ in their features or their training corpus. Each classifier then tries to classify an input text according to its own model. Finally, each classifier votes to determine whether that input text belongs, or not, to the domain.

There are four hyperparameters related to ensemble learning:

  1. The mode to use
  2. The number of models we want to create in the ensemble
  3. The number of training documents we want in the training corpus, and
  4. The number of features.

Other hyperparameters could include the ones of the linear SVM classifier, but in this example we will simply reuse the best parameters we found above. We now train the ensemble using the (train-ensemble-svm) function.

Once the ensemble is created and trained, we have to use the (classify-ensemble-text) function to classify an input text using the ensemble we created. That function takes two parameters: :mode, which is the ensemble’s mode, and :vote-acceptance-ratio, which defines the proportion of positive votes required for the ensemble to positively classify the input text. By default, the ratio is 50%, but if you want to optimize the precision of the ensemble, you may want to increase that ratio to 70% or even 95%, as we will see below.
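
To make the voting rule concrete, here is a minimal sketch of how a vote acceptance ratio can be applied to a collection of individual votes, followed by a hedged example call; the exact position of the input text argument of (classify-ensemble-text) is an assumption:

;; A minimal sketch of the voting rule: the input text is positively classified
;; when the ratio of positive votes reaches the acceptance ratio.
(defn accept?
  [votes vote-acceptance-ratio]
  (>= (/ (count (filter true? votes))
         (count votes))
      vote-acceptance-ratio))

;; (accept? [true true false true] 0.50) => true
;; (accept? [true true false true] 0.90) => false

;; Hedged usage of the ensemble classifier described above:
;; (classify-ensemble-text "Some input text about a new album ..."
;;                         :mode :ab
;;                         :vote-acceptance-ratio 0.70)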

Finally the ensemble, configured with all its hyperparameters, will be evaluated using the (evaluate-ensemble) function, which is the same as the (evaluate-model) function, but which uses the ensemble instead of a single SVM model to classify all of the articles. As before, we will characterize the assignments in relation to the gold standard.

Let’s now train different ensembles to try to improve the performance of the system.

Asymmetric Bagging

The current training corpus is highly unbalanced. This is why the first test we will do is to apply the asymmetric bagging strategy. With this strategy, each of the SVM classifiers uses the same positive training set with the same number of positive documents. However, each of them takes a random sample of the negative training documents (with replacement).

(use 'cognonto-esa.core)
(use 'cognonto-esa.ensemble-svm)

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")
(load-semantic-interpreter "base-pruned" "resources/semantic-interpreters/base-pruned/")

(reset! ensemble [])

(train-ensemble-svm "ensemble.base.pruned.ab.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :ab 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 100
                    :nb-training-documents 3500)

Now let’s evaluate this ensemble with a vote acceptance ratio of 50%

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.50)
True positive:  48
False positive:  6
True negative:  917
False negative:  46

Precision:  0.8888889
Recall:  0.5106383
Accuracy:  0.9488692
F1:  0.6486486

Let’s increase the vote acceptance ratio to 90%:

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.90)
True positive:  37
False positive:  2
True negative:  921
False negative:  57

Precision:  0.94871795
Recall:  0.39361703
Accuracy:  0.94198626
F1:  0.556391

In both cases, the precision increases considerably compared to the non-ensemble learning results. However, the recall drops at the same time, which lowers the F1 score as well. Let’s now try the ABRS method.

Asymmetric Bagging + Random Subspace method (ABRS)

The goal of the random subspace method is to select a random set of features. This means that each model will have its own feature set and will make predictions according to it. With the ABRS strategy, we will end up with highly different models, since none will have the same negative training sets nor the same features.

Here, we define each classifier with 65 features randomly chosen out of 174, and we restrict the negative training corpus to 3500 randomly selected documents. Then we create 300 models to try to get a really heterogeneous population of models.

(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.abrs.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :abrs 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 300
                    :nb-features 65
                    :nb-training-documents 3500)
(evaluate-ensemble "ensemble.base.pruned.abrs.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :abrs
                   :vote-acceptance-ratio 0.50)
True positive:  41
False positive:  3
True negative:  920
False negative:  53

Precision:  0.9318182
Recall:  0.43617022
Accuracy:  0.9449361
F1:  0.59420294

For these features and training sets, using the ABRS method did not improve on the AB method we tried above.

Conclusion

This use case shows three totally different ways to use the KBpedia Knowledge Graph to automatically create positive and negative training sets. We demonstrated how the full process can be automated where the only requirement is to get a list of seed KBpedia reference concepts.

We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models.

Creating prediction models using supervised machine learning algorithms (which are currently the bulk of the learners used) has two global steps:

  1. Label training sets and generate gold standards, and
  2. Test, compare, and optimize different learners, ensembles and hyperparameters.

Unfortunately, today, given the manual efforts required by the first step, the overwhelming portion of time and budget is spent here to create a prediction model. By automating much of this process, Cognonto and KBpedia substantially reduce this effort. Time and budget can now be re-directed to the second step of “dialing in” the learners, where the real payoff occurs.

Further, as we also demonstrated, once we have automated this process of labeling and creating reference standards, we can also automate the testing and optimization of multiple different kinds of prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to desired results.

Footnotes

1Asymmetric Bagging and Random Subspace for Support Vector Machines-Based Relevance Feedback in Image Retrieval

Posted at 11:05

Frederick Giasson: Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 1

In my previous blog post, Create a Domain Text Classifier Using Cognonto, I explained how one can use the KBpedia Knowledge Graph to automatically create positive and negative training corpuses for different machine learning tasks. I explained how SVM classifiers could be trained and used to check if an input text belongs to the defined domain or not.

This article is the first of two. In this first part, I will extend this idea to explain how the KBpedia Knowledge Graph can be used, along with other machine learning techniques, to cope with different situations and use cases. I will cover the concepts of feature selection, hyperparameter optimization, and ensemble learning (in part 2 of this series). The emphasis here is on the testing and refining of machine learners, versus the set up and configuration times that dominate other approaches.

Depending on the domain of interest, and depending on the required precision or recall, different strategies and techniques can lead to better predictions. More often than not, multiple different training corpuses, learners and hyperparameters need to be tested before ending up with the initial best possible prediction model. This is why I will strongly emphasize the fact that the KBpedia Knowledge Graph and Cognonto can be used to automate fully the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models.

New Knowledge Graph and Reasoning

For this article, I will use the latest version of the KBpedia Knowledge Graph, version 1.10, which we just released. A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are created, etc. This growth means that any of these changes can have a [positive] impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models will improve. Better concepts, better structure, and more linkages will often lead to better training sets as well.

Such growth in KBpedia is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions we are able to automate all of these steps to see the final impacts of upgrading the knowledge graph:

  1. Aggregate all the reference concepts that scope the specified domain (by inference)
  2. Create the positive and negative training corpuses
  3. Prune the training corpuses
  4. Configure the classifier (in this case, create the semantic vectors for ESA)
  5. Train the model (in this case, the SVM model)
  6. Optimize the hyperparameters of the algorithm (in this case, the linear SVM hyperparameters), and
  7. Evaluate the model on multiple gold standards.

Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models.
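
As an illustration only, here is a minimal sketch of that workflow chaining the functions shown in these articles; it assumes (define-domain-corpus) accepts the seed list as a regular argument, and the pruning and hyperparameter optimization steps are omitted for brevity:

;; A minimal sketch of the automated workflow; paths and names are placeholders.
(defn rebuild-and-evaluate-domain
  [seed-concepts kbpedia kbpedia-reasoner]
  ;; steps 1-2: scope the domain by inference and create the training corpuses
  (define-domain-corpus seed-concepts
    kbpedia
    "resources/domain-corpus-dictionary.csv"
    :reasoner kbpedia-reasoner)
  (load-dictionaries "resources/general-corpus-dictionary.csv"
                     "resources/domain-corpus-dictionary.csv")
  (cache-corpus)
  (normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
  ;; steps 4-5: configure the classifier (ESA semantic interpreter) and train the SVM model
  (build-semantic-interpreter "base" "resources/semantic-interpreters/base/"
                              (distinct (concat (get-domain-pages) (get-general-pages))))
  (build-svm-model-vectors "resources/svm/base/"
                           :corpus-folder-normalized "resources/corpus-normalized/")
  (train-svm-model "svm.base" "resources/svm/base/"
                   :weights nil :v nil :c 1 :algorithm :l2l2)
  ;; step 7: evaluate against a gold standard (steps 3 and 6 are skipped here)
  (evaluate-model "svm.base" "resources/gold-standard-1.csv"))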

Reasoning Over The Knowledge Graph

A new step I am adding in this current use case is to use a reasoner to reason over the KBpedia Knowledge Graph. The reasoner is used when we define the scope of the domain to classify. We will browse the knowledge graph to see which seed reference concepts we should add to the scope. Then we will use a reasoner to extend the models to include any new sub-classes relevant to the scope of the domain. This means that we may add further specific features to the final model.

Update Domain Training Corpus Using KBpedia 1.10 and a Reasoner

Recall our prior use case used Music as its domain scope. The first step is to use the new KBpedia version 1.10 along with a reasoner to create the full scope of this updated Music domain.

The result of using this new version and a reasoner is that we now end up with 196 features (reference documents) instead of 64. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities).

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  kbpedia
  "resources/domain-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

Create Training Corpuses

The next step is to create the actual training corpuses: the general and domain ones. We have to load the dictionaries we created in the previous step, and then to locally cache and normalize the corpuses. Remember that the normalization steps are:

  1. Defluff the raw HTML page. We convert the HTML into text, and we only keep the body of the page
  2. Normalize the text with the following rules:
    1. remove diacritics characters
    2. remove everything between brackets like: [edit] [show]
    3. remove punctuation
    4. remove all numbers
    5. remove all invisible control characters
    6. remove all [math] symbols
    7. remove all words with 2 characters or fewer
    8. remove line and paragraph separators
    9. remove anything that is not an alpha character
    10. normalize spaces
    11. put everything in lower case, and
    12. remove stop words.

(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

(cache-corpus)

(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")

Create New Gold Standard

Because we never have enough instances in our gold standards to test against, let’s create a third one, but this time adding a music related news feed that will add more positive examples to the gold standard.

(defn create-gold-standard-from-feeds
  [name]
  (let [feeds ["http://www.music-news.com/rss/UK/news"
               "http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]

    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          (csv/write-csv out-file "" (:title item) (:link item) :append true))))))

This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not.

For each piece of news aggregated that way, I manually determined if the candidate document belongs to the domain or not. This task can be tricky, and requires a clear understanding of the proper scope for the domain. In this example, I consider an article to belong to the music domain if it mentions music concepts such as musical albums, songs, multiple music related topics, etc. If only a singer is mentioned in an article because he broke up with his girlfriend, without further mention of anything related to music, I won’t tag it as being part of the domain.

[However, under a different interpretation of what should be in the domain wherein any mention of a singer qualifies, then we could extend the classification process to include named entities (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping decisions.]

You can download this new third gold standard from here.

Evaluate Initial Domain Model

Now that we have updated the training corpuses using the updated scope of the domain compared to the previous tests, let’s analyze the impact of using a new version of KBpedia and a reasoner to increase the number of features in our model. Let’s run our automatic process to evaluate the new models. The remaining steps that need to be run are:

  1. Configure the classifier (in this case, create the semantic vectors for ESA)
  2. Train the model (in this case, the SVM model), and
  3. Evaluate the model on multiple gold standards.

Note: to see the full explanation of how the ESA and SVM classifiers work, please refer to the Create a Domain Text Classifier Using Cognonto article for more background information.

;; Load positive and negative training corpuses
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  3
True negative:  306
False negative:  6

Precision:  0.875
Recall:  0.7777778
Accuracy:  0.97321427
F1:  0.8235294

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +10.33%
  • Recall: -12.16%
  • Accuracy: +0.31%
  • F1: +0.26%

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

What we can say is that the new scope for the domain greatly improved the precision of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the recall of the model suffers: some of the positive cases in our gold standard are now classified as negative, which creates new false negatives. As you can see, there is almost always a tradeoff between precision and recall. You could reach 100% precision by returning only a single correct result, but then the recall would suffer greatly. This is why the F1 score is important: it is the harmonic mean of the precision and the recall.
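
For reference, here is a minimal sketch of how these metrics are derived from the confusion matrix counts reported by (evaluate-model):

;; Precision, recall and F1 from confusion matrix counts.
(defn prf1
  [tp fp fneg]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fneg))
        f1        (/ (* 2 precision recall) (+ precision recall))]
    {:precision (double precision)
     :recall    (double recall)
     :f1        (double f1)}))

;; Using the second gold standard above (TP=16, FP=3, FN=9):
;; (prf1 16 3 9) => {:precision 0.8421..., :recall 0.64, :f1 0.7272...}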

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  28
False positive:  3
True negative:  355
False negative:  22

Precision:  0.9032258
Recall:  0.56
Accuracy:  0.9387255
F1:  0.69135803

Again, with this new gold standard, we can see the same pattern: the precision is pretty good, but the recall is not that great, since almost half of the positive cases in the gold standard were missed by the model.

Now, what could we do to try to improve this situation? The next thing we will investigate is to use feature selection and pruning.

Feature Selection Using Pruning and Training Corpus Pruning

A new method that we will investigate to try to improve the performance of the models is called feature selection. As its name says, we select specific features to create our prediction model. The idea here is that not all features are created equal, and different features may have different (positive or negative) impacts on the model.

In our specific use case, we do feature selection using a pruning technique. We count the number of tokens in the Wikipedia page related to each of our features. If the number of tokens in an article is too small (below 100), then we drop that feature.

[Note: feature selection is a complex topic; other options and nuances are not further discussed here.]

The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, it may, or may not, have an impact on the overall model’s performance.

Pruning the general and domain-specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in the dictionary from the cache, calculate the number of tokens in each, and then keep or drop them depending on whether they meet the minimum-token threshold. Finally we write a new dictionary with the pruned features and documents:

(defn create-pruned-pages-dictionary-csv
  [dictionary-file pruned-file normalized-corpus-folder & {:keys [min-tokens]
                                                           :or {min-tokens 100}}]
  ;; Read the current dictionary (skipping the header row)
  (let [dictionary (rest
                    (with-open [in-file (io/reader dictionary-file)]
                      (doall
                       (csv/read-csv in-file))))]
    ;; Keep only the features whose cached, normalized document has more
    ;; than min-tokens tokens, and write them to the pruned dictionary
    (with-open [out-file (io/writer pruned-file)]
      (csv/write-csv out-file (->> dictionary
                                   (mapv (fn [[title rc]]
                                           (when (.exists (io/as-file (str normalized-corpus-folder title ".txt")))
                                             (when (> (->> (slurp (str normalized-corpus-folder title ".txt"))
                                                           tokenize
                                                           count) min-tokens)
                                               [[title rc]]))))
                                   (apply concat)
                                   (into []))))))

Then we can prune the general and domain specific dictionaries using this simple function:

(create-pruned-pages-dictionary-csv "resources/general-corpus-dictionary.csv"
                                    "resources/general-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

(create-pruned-pages-dictionary-csv "resources/domain-corpus-dictionary.csv"
                                    "resources/domain-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

As a result of this specific pruning approach, the number of features drops from 197 to 175.

Evaluating Pruned Training Corpuses and Selected Features

Now that the training corpuses have been pruned, let’s load them and then evaluate their performance on the gold standards.

;; Load positive and negative pruned training corpuses
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base-pruned/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base-pruned/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.pruned.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  2
True negative:  307
False negative:  6

Precision:  0.9130435
Recall:  0.7777778
Accuracy:  0.97619045
F1:  0.84000003

The performance changes relative to the initial results (using KBpedia 1.02) are:

  • Precision: +18.75%
  • Recall: -12.08%
  • Accuracy: +0.61%
  • F1: +2.26%

In this case, compared with the previous (non-pruned, KBpedia 1.10) results, we improved the precision without decreasing the recall, which is the ultimate goal. For this gold standard, the F1 score now sits 2.26% above the initial baseline, with the additional gain coming from pruning alone.

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

In this case, the results are identical to the non-pruned ones with KBpedia 1.10: pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected since the model did not change drastically either.

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  27
False positive:  7
True negative:  351
False negative:  23

Precision:  0.7941176
Recall:  0.54
Accuracy:  0.9264706
F1:  0.64285713

Now let’s check how these compare to the non-pruned version of the training corpus:

  • Precision: -12.08%
  • Recall: -3.7%
  • Accuracy: -1.31%
  • F1: -7.02%

Both false positives and false negatives increased with this change, which also led to a decrease in the overall metrics. What happened?

In fact, different things may have happened. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier need to be re-tuned. This is what we will try to figure out using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning.

Hyperparameters Optimization Using Grid Search

Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, hyperparameters are C, epsilon, weight and the algorithm used. Hyperparameter optimization is the task of trying to find the right parameter values in order to optimize the performance of the model.

There are multiple different strategies that we can use to try to find the best values for these hyperparameters, but the one we will use is called the grid search, which exhaustively searches across a manually defined subset of possible hyperparameter values.

The grid search function we want to define will enable us to specify the algorithm(s), the weight(s), C and the stopping tolerance (epsilon). Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models.

Here is the function that implements that grid search algorithm:

(defn svm-grid-search
  [name model-path gold-standard & {:keys [grid-parameters selection-metric]
                                    :or {grid-parameters [{:c [1 2 4 16 256]
                                                           :e [0.001 0.01 0.1]
                                                           :algorithm [:l2l2]
                                                           :weight [1 15 30]}]
                                         selection-metric :f1}}]
  (let [best (atom {:gold-standard gold-standard
                    :selection-metric selection-metric
                    :score 0.0
                    :c nil
                    :e nil
                    :algorithm nil
                    :weight nil})
        model-vectors (read-string (slurp (str model-path "model.vectors")))]
    (doseq [parameters grid-parameters]
      (doseq [algo (:algorithm parameters)]
        (doseq [weight (:weight parameters)]
          (doseq [e (:e parameters)]
            (doseq [c (:c parameters)]
              (train-svm-model name model-path
                               :weights {1 (double weight)}
                               :v nil
                               :c c
                               :e e
                               :algorithm algo
                               :model-vectors model-vectors)
              (let [results (evaluate-model name gold-standard :output false)]              
                (println "Algorithm:" algo)
                (println "C:" c)
                (println "Epsilon:" e)
                (println "Weight:" weight)
                (println selection-metric ":" (get results selection-metric))
                (println)

                (when (> (get results selection-metric) (:score @best))
                  (reset! best {:gold-standard gold-standard
                                :selection-metric selection-metric
                                :score (get results selection-metric)
                                :c c
                                :e e
                                :algorithm algo
                                :weight weight}))))))))
    @best))

The possible algorithms are:

  1. :l2lr_primal
  2. :l2l2
  3. :l2l2_primal
  4. :l2l1
  5. :multi
  6. :l1l2_primal
  7. :l1lr
  8. :l2lr

To simplify things a little bit for this task, we will merge the three gold standards we have into one. We will use that gold standard moving forward. The merged gold standard can be downloaded from here. We now have a single gold standard with 1017 manually vetted web pages.
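
One possible way to produce such a consolidated file (a sketch on our part; the downloadable merged file may have been produced differently) is simply to concatenate the data rows of the three gold standard CSV files under a single header:

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn merge-gold-standards
  "Concatenates the data rows of the input gold standard CSV files
   (skipping their header rows) into a single CSV file."
  [out-file in-files]
  (let [rows (mapcat (fn [f]
                       (rest (with-open [in (io/reader f)]
                               (doall (csv/read-csv in)))))
                     in-files)]
    (with-open [out (io/writer out-file)]
      (csv/write-csv out (cons ["class" "title" "url"] rows)))))

;; (merge-gold-standards "resources/gold-standard-full.csv"
;;                       ["resources/gold-standard-1.csv"
;;                        "resources/gold-standard-2.csv"
;;                        "resources/gold-standard-3.csv"])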

Now that we have a new consolidated gold standard, let's calculate the performance of the models when the training corpuses are pruned and when they are not. This will become the new basis for comparing the subsequent results in this article. The metrics when the training corpuses are pruned are:

True positive: 56
False positive: 10
True negative: 913
False negative: 38

Precision: 0.8484849
Recall: 0.59574467
Accuracy: 0.95280236
F1: 0.7
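
For reference, these metrics can be recomputed from the confusion matrix with a small helper like the following (a hypothetical convenience function, not part of the original code):

(defn metrics
  "Computes precision, recall, accuracy and F1 from the confusion matrix counts."
  [{tp :tp fp :fp tn :tn fneg :fn}]
  (let [precision (/ tp (double (+ tp fp)))
        recall    (/ tp (double (+ tp fneg)))
        accuracy  (/ (+ tp tn) (double (+ tp fp tn fneg)))
        f1        (/ (* 2 precision recall) (+ precision recall))]
    {:precision precision :recall recall :accuracy accuracy :f1 f1}))

;; (metrics {:tp 56 :fp 10 :tn 913 :fn 38})
;; => precision ≈ 0.8485, recall ≈ 0.5957, accuracy ≈ 0.9528, F1 ≈ 0.70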

Now, let’s run the grid search that will try to optimize the F1 score of the model using the pruned training corpuses and using the full gold standard:

(svm-grid-search "grid-search-base-pruned-tests" 
                 "resources/svm/base-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.7096774
 :c 2
 :e 0.001
 :algorithm :l2l2
 :weight 30}

With a simple subset of the possible hyperparameter space, we found that by increasing the c parameter to 2 we could improve the F1 score on the gold standard by 1.37%. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let's try with c = [1.5, 2, 2.5] and weight = [30, 40]. Let's also check other algorithms, such as L2-regularized L1-loss support vector regression (dual), as sketched below.
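
As a sketch, using the svm-grid-search function defined above (the run name is hypothetical, and we assume :l2l1 is the keyword of the L2-regularized L1-loss solver in the list shown earlier), the follow-up search could look like this:

(svm-grid-search "grid-search-base-pruned-refined"
                 "resources/svm/base-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1.5 2 2.5]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2 :l2l1]
                                    :weight [30 40]}])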

The goal here is to configure the initial grid search with general parameters with a wide range of possible values. Then subsequently we can use that tool to fine tune some of the parameters that were returning good results. In any case, the more computer power and time you have, the more tests you will be able to perform.

Posted at 11:00

November 16

Semantic Web Company (Austria): Triplifying a real dictionary

The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.

The input data, described further below, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (that is, converted to RDF – the Resource Description Framework) for several purposes, including enriching it with external resources. The triplified data needs to comply with Semantic Web principles.

To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.

Figure 1: Language model example (click to enlarge)

This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles. The extensive model [1] was developed at the Ontology Engineering Group of the Universidad Politécnica de Madrid (UPM).

Language is very complex. With this we all agree! How complex it really is, is probably often underestimated, especially when you need to model all its details and triplify it.

So why is the task so complex?

To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:

  • Pronunciation
  • Inflection
  • Range Of Application
  • Sense Indicator
  • Compositional Phrase
  • Translations
  • Translation Example
  • Alternative Scripting
  • Register
  • Geographical Usage
  • Sense Qualifier
  • Provenance
  • Version
  • Synonyms
  • Lexical sense
  • Usage Examples
  • Homograph information
  • Language information
  • Specific display information
  • Identifiers
  • and more…

Entry fields can have predefined values, which recur, but they can also have so-called free values, which vary. Such fields include:

  • Aspect
  • Tense
  • Subcategorization
  • Subject Field
  • Mood
  • Grammatical Gender
  • Geographical Usage
  • Case
  • and more…

As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling Linked Data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon contains good pieces of information to cover our dictionary needs, but not all of them. We also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover all the missing corners and catch the specific details that did not overlap with the Ontolex model (such as the free values).

An additional level of complexity was the need to identify exactly which pieces were missing from the Ontolex model and its modules, and to create the parts covering that missing information. This was part of creating the dictionary's model, which we called ontolexKD.

As a developer you never sit down to think about all the senses or meanings or translations of a word (except if you specialize in linguistics), so just to understand the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.

The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter. In UnifiedViews the processing pipeline resembles what appears in Figure 2.

Figure 2: UnifiedViews pipeline used to triplify XML (click to enlarge)

The pipeline is composed of data processing units (DPUs) which communicate iteratively. In left-to-right order, the process in Figure 2 represents:

  • A DPU used to upload the XML files into UnifiedViews for further processing;
  • A DPU which transforms XML data to RDF using XSLT. The style sheet is part of the configuration of the unit;
  • The .rdf generated files are stored on the filesystem;
  • And, finally, the .rdf generated files are uploaded into a triple store, such as Virtuoso Universal server.

Basically the XML is transformed using XSLT.

Complexity also increases through the URIs (Uniform Resource Identifiers) that are needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The starting point was to represent a single word (headword) under a desired namespace and to build on it to associate it with its part of speech, grammatical number, grammatical gender, definition, translation – just to begin with.

The base URIs follow the best practices recommended in the ISA study on persistent URIs following the pattern: http://{domain}/{type}/{concept}/{reference}.

An example of such URIs for the forms of a headword is:

These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.

If the dictionary contains two different adjectival endings, as with entendedor which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:

In addition, we should consider that the aim of triplifying the XML was for all these headwords, with their senses, forms and translations, to be connected, identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let's take two dictionaries: a German one which contains translations into English, and an English one which also contains translations into German. We get the following translations:

Bank – bank – German to English

bank – Bank – English to German

The URI of the translation from German to English was designed to look like:

And the translation from English to German would be:

In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?

The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.

One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?

The task of enabling the dictionary to fully benefit from Semantic Web principles is not yet complete. The linguist is probably the first one who can conceptualize how to do this.

As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.


References:

[1] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, “Modelling multilingual lexicographic resources for the web of data: the K Dictionaries case,” in Proc. of the GLOBALEX 2016 workshop at LREC 2016, Portorož, Slovenia, May 2016.

Posted at 12:07

November 14

AKSW Group - University of Leipzig: Accepted paper in AAAI 2017

Hello Community! We are very pleased to announce that our paper “Radon – Rapid Discovery of Topological Relations” was accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), which will be held February 4–9 at the Hilton San Francisco, San Francisco, California, USA.

In more detail, we will present the following paper: “Radon – Rapid Discovery of Topological Relations” by Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, and Axel-Cyrille Ngonga Ngomo.

Abstract. Datasets containing geo-spatial resources are increasingly being represented according to the Linked Data principles. Several time-efficient approaches for discovering links between RDF resources have been developed over the last years. However, the time-efficient discovery of topological relations between geospatial resources has been paid little attention to. We address this research gap by presenting Radon, a novel approach for the rapid computation of topological relations between geo-spatial resources. Our approach uses a sparse tiling index in combination with minimum bounding boxes to reduce the computation time of topological relations. Our evaluation of Radon’s runtime on 45 datasets and in more than 800 experiments shows that it outperforms the state of the art by up to 3 orders of magnitude while maintaining an F-measure of 100%. Moreover, our experiments suggest that Radon scales up well when implemented in parallel.

Acknowledgments
This work is implemented in the link discovery framework LIMES and has been supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227) as well as the BMWI Project GEISER (project no. 01MD16014).

Posted at 13:48

November 13

Bob DuCharme: Pulling RDF out of MySQL

With a command line option and a very short stylesheet.

Posted at 15:09

November 11

Dublin Core Metadata Initiative: SUB Göttingen joins DCMI as Institutional Member

2016-11-11, DCMI is pleased to announce that Göttingen State and University Library (SUB Göttingen) has joined DCMI as an Institutional Member. SUB Göttingen is one of the most important research libraries in Germany and plays a leading role in a large number of national and international projects involving the optimization of literature and information provision and the establishment and development of digital research and information infrastructures. Its scope of activities includes the cooperative development of a Germany-wide service infrastructure for the acquisition, licensing and provision of electronic resources; the coordination of large-scale joint research projects for developing research infrastructures in the humanities and cultural sciences in Germany; and the consortial establishment of Open Access research infrastructures linked across Europe and the world. Stefanie Rühle of the Data Conversion Group will represent SUB Göttingen on the DCMI Governing Board.

Posted at 23:59

Libby Miller: A speaking camera using Pi3 and Tensorflow

Posted at 12:39

November 10

Leigh Dodds: Donate to the commons this holiday season

Holiday season is nearly upon us. Donating to a charity is an alternative form of gift giving that shows you care, whilst directing your money towards helping those that need it. There are a lot of great and deserving causes you can support, and I’m certainly not going to tell you where you should donate your money.

But I’ve been thinking about the various ways in which I can support projects that I care about. There are a lot of them as it turns out. And it occurred to me that I could ask friends and family who might want to buy me a gift to donate to them instead. It’ll save me getting yet another scarf, pair of socks, or (shudder) a

Posted at 19:23

November 08

Leigh Dodds: The practice of open data

Open data is data that anyone can access, use and share.

Open data is the result of several processes. The most obvious one is the release process that results in data being made available for reuse and sharing.

But there are other processes that may take place before that open data is made available: collecting and curating a dataset; running it through quality checks; or ensuring that data has been properly anonymised.

There are also processes that happen after data has been published. Providing support to users, for example. Or dealing with error reports or service issues with an API or portal.

Some processes are also continuous. Engaging with re-users is something that is best done on an ongoing basis. Re-users can help you decide which datasets to release and when. They can also give you feedback on ways to improve how your data is published. Or how it can be connected and enriched against other sources.

Collectively these processes define the practice of open data.

The practice of open data covers much more than the technical details of helping someone else access your data. It covers a whole range of organisational activities.

Releasing open data can be really easy. But developing your open data practice can take time. It can involve other changes in your organisation, such as creating a more open approach to data sharing. Or getting better at data governance and management.

The extent to which you develop an open data practice depends on how important open data is to your organisation. Is it part of your core strategy or just something you’re doing on a more limited basis?

The breadth and depth of the practice of open data is surprising to many people. The learning process is best experienced. Going through the process of opening a dataset, however small, provides useful insight that can help identify where further learning is needed.

One aspect of the practice of open data involves understanding what data can be open, what can be shared and what must stay closed. Moving data along

Posted at 19:52

November 07

Frederick Giasson: Building and Maintaining the KBpedia Knowledge Graph

The Cognonto demo is powered by an extensive knowledge graph called the KBpedia Knowledge Graph, as organized according to the KBpedia Knowledge Ontology (KKO). KBpedia is used for all kinds of tasks, some of which are demonstrated by the Cognonto use cases. KBpedia powers dataset linkage and mapping tools, machine learning training workflows, entity and concept extractions, category and topic tagging, etc.

The KBpedia Knowledge Graph is a structure of more than 39,000 reference concepts linked to 6 major knowledge bases and 20 popular ontologies in use across the Web. Unlike other knowledge graphs that analyze big corpuses of text to extract “concepts” (n-grams) and their co-occurrences, KBpedia has been created, is curated, is linked, and evolves using humans for the final vetting steps. KBpedia and its build process is thus a semi-automatic system.

The challenge with such a project is to be able to grow and refine (add or remove relations) within the structure without creating unknown conceptual issues. The sheer combinatorial scope of KBpedia means it is not possible for a human to fully understand the impact of adding or removing a relation on its entire structure. There is simply too much complexity in the interaction amongst the reference concepts (and their different kinds of relations) within the KBpedia Knowledge Graph.

What I discuss in this article is how Cognonto creates and then constantly evolves the KBpedia Knowledge Graph. In parallel with our creating KBpedia over the years, we also have needed to develop our own build processes and tools to make sure that every time something changes in KBpedia’s structure that it remains satisfiable and coherent.

The Importance of Testing Each Build

As you may experience for yourself with the Knowledge Graph browser, the KBpedia structure is linked to multiple external sources of information. Each of these sources (six major knowledge bases and another 20 ontologies) has its own world view. Each of these sources uses its own concepts to organize its own structure.

What the KBpedia Knowledge Graph does is to merge all these different world views (and their associated instances and entities) into a coherent whole. One of the purposes of the KBpedia Knowledge Graph is to act as a scaffolding for integrating still further external sources, specifically in the knowledge domains relevant to specific clients.

One inherent characteristic of these knowledge sources is that they are constantly changing. Some may be updated only occasionally, others every year, others every few months, others every few weeks, or whatever. In the cases of Wikipedia and Wikidata, two of the most important contributors to KBpedia, thousands of changes occur daily. This dynamism of knowledge sources is an important fact since every time a source is changed, it may mean that its world view may have changed as well. Any of these changes can have an impact on KBpedia and the linkages we have to that external source.

Because of this dynamic environment, we constantly have to regenerate the KBpedia Knowledge Graph and make sure that any changes in its structure, or in the structure of the sources linked to it, do not make it unsatisfiable or incoherent.

It is for these reasons that we developed an extensive knowledge graph building process that includes a series of tests that are run every time the knowledge graph gets modified. Each new build is verified to still be satisfiable and coherent.

The Build Process

The KBpedia Knowledge Graph build process has been developed over the years to create a robust workflow that enables us to regenerate KBpedia every time something changes in it. The build process ensures that no new issues are created when we change something and regenerate KBpedia. It also calculates a series of statistics and metrics that enable us to follow KBpedia's evolution.

The process works as follows:

  1. Prepare log files
  2. Perform pre-checks. If any of these tests fails, the generation process won't start (a minimal sketch of two of these checks follows the list). They check that:
    1. No index file is corrupted
    2. All mentioned reference concept IDs exist
    3. All mentioned Super Type IDs exist
    4. No reference concept IDs are the same as Super Type IDs
    5. No new concepts IDs are the same as existing IDs
  3. Create the classes and individuals that define the knowledge graph
  4. Save the knowledge graph
  5. Generate the mapping between the reference concepts and the external ontologies/schemas/vocabularies
    1. Wikipedia
    2. Wikidata
    3. DBpedia
    4. Schema.org
    5. Geonames
    6. OpenCyc
    7. General Ontologies (Music Ontology, Bibliographic Ontology, FOAF, and 17 others…)
  6. Execute a series of post-creation tests
    1. Check for missing preferred labels
    2. Check for missing definitions
    3. Check for non-distinct preferred labels
    4. Check for reference concepts that do not have any reference to any Super Type (by inference) (also known as ‘orphans’)
    5. Check to make sure that the core KBpedia Knowledge Graph is satisfiable
    6. Check to make sure that the core KBpedia Knowledge Graph with its external linkages is satisfiable
    7. Check to make sure that the core KBpedia Knowledge Graph with its external linkages and extended inference relationships is satisfiable
  7. Finally, calculate a series of statistics and metrics.
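
For illustration only, here is a minimal Clojure sketch (an assumption on our part, not the actual build code) of two of the pre-checks from step 2: that every mentioned reference concept ID exists, and that no reference concept ID collides with a Super Type ID. The IDs used in the example call are made up.

(defn pre-check-ids
  [{:keys [reference-concept-ids super-type-ids mentioned-ids]}]
  {;; mentioned IDs that do not exist in the list of reference concepts
   :missing-reference-concepts (remove (set reference-concept-ids) mentioned-ids)
   ;; reference concept IDs that collide with a Super Type ID
   :rc-super-type-collisions   (filter (set super-type-ids) reference-concept-ids)})

;; A build would abort when either sequence is non-empty:
;; (pre-check-ids {:reference-concept-ids ["rc/Music" "rc/Musician"]
;;                 :super-type-ids        ["kko/Products"]
;;                 :mentioned-ids         ["rc/Music" "rc/MusicalGroup"]})
;; => {:missing-reference-concepts ("rc/MusicalGroup")
;;     :rc-super-type-collisions   ()}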

It is important that we be able to do these builds and tests rapidly, so that we can move along new version releases rapidly. Remember, all changes to the KBpedia Knowledge Graph are manually vetted.

To accomplish this aim we actually build KBpedia from a set of fairly straightforward input files (for easy inspection and modification). We can completely rebuild all of KBpedia in less than two hours. About 45 minutes are required for building the overall structure and applying the satisfiability and coherency tests. The typology aspects of KBpedia and their tests adds another hour or so to complete the build. The rapidity of the build cycle means we can test and refine nearly in real time, useful when we are changing or refining big chunks of the structure.

An Escheresque Building Process

Building the KBpedia Knowledge Graph is like M.C. Escher's hands drawing themselves. Because of the synergy between the Knowledge Graph's reference concepts, its upper structure, its typologies and its numerous external linkages, any addition in one of these areas can lead to improvements in other areas of the knowledge graph. These improvements are informed by analyzing the metrics, statistics, and possible errors logged by the build process.

The Knowledge Graph is constantly evolving, self-healing and expanding. This is why the build process, and more importantly its tests, are crucial to make sure that new issues are not introduced every time something changes within the structure.

To illustrate these points, let’s dig a little deeper into the KBpedia Knowledge Graph build process.

The Nature of the KBpedia Build Process

The KBpedia Knowledge Graph is built from a few relatively straightforward assignment files serialized in CSV. Each file has its purpose in the build process and is encoded using UTF-8 for internationalization purposes. KBpedia is just a set of simple indexes serialized as CSV files that can easily be exchanged, updated and re-processed.

The process is 100% repeatable and testable. If issues found in the future require a new step or a new test, the pipeline can easily be improved by plugging the new step or test into it. In fact, the current pipeline is the incremental result of years of working this process. I'm sure we will add more steps still as time goes on.

The process is also semi-automatic. Certain tests may cause the process to completely fail. If such a failure happens, then the required immediate actions are output to different log files. If the process does complete, then all kinds of log files and statistics about the KBpedia Knowledge Graph structure are written to the file system. Once completed, the human operator can easily check these logs and update the input files to fix or improve anything found while analyzing the output files.

Building KBpedia is really an iterative process. It is often generated hundreds of times before a new version is released.

Checking for Disjointedness and Inconsistencies

The core and most important test in the process is the satisfiability test that is run once the KBpedia Knowledge Graph is generated. An unsatisfiable class is a class that does not “satisfy” (is inconsistent with) the structure of the knowledge graph. In KBpedia, what needs to be satisfied are the disjoint assertions that exist at the upper level of the knowledge graph. If an assertion between two reference concepts (like a sub-class-of or an equivalent-to relationship) leads to the violation of a disjoint assertion, then an error is raised and the issue will need to be fixed by the human operator.

Here is an example of an unsatisfiable class. In this example, someone wants to say that a musical group (kbpedia:MusicPerformanceOrganization) is a sub-class-of a musician (kbpedia:Musician). This new assertion is obviously an error (a musical group is an organization, while a musician is an individual person), but the human operator didn't notice it when creating the new relationship between the two reference concepts. So, how does the build process catch such errors? Here is how:

[Figure: Cognonto build workflow]

Because the two classes belong to two disjoint super classes, the KBpedia generator finds this issue and returns an error along with a logging report that explains why the new assertion makes the structure unsatisfiable. This testing and audit report is pretty powerful (and essential) for maintaining the integrity of the knowledge graph.
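
To make this concrete, here is a minimal, illustrative Clojure sketch (an assumption on our part, not the actual KBpedia build code, and using illustrative Super Type names) of how such a disjointedness violation can be detected:

;; `super-types` maps a reference concept to the Super Types it belongs to
;; (asserted or inferred); `disjoint-pairs` lists the Super Types declared disjoint.
(def super-types
  {:kbpedia/MusicPerformanceOrganization #{:kko/Organizations}
   :kbpedia/Musician                     #{:kko/Persons}})

(def disjoint-pairs
  #{#{:kko/Organizations :kko/Persons}})

(defn unsatisfiable-subclass?
  "Returns the violated disjoint pairs if asserting `sub` sub-class-of `super`
   would force `sub` into two disjoint Super Types, nil otherwise."
  [sub super]
  (let [inherited (into (get super-types sub #{}) (get super-types super #{}))
        violated  (filter #(every? inherited %) disjoint-pairs)]
    (when (seq violated) (set violated))))

;; Asserting that a musical group is a sub-class of Musician trips the check:
;; (unsatisfiable-subclass? :kbpedia/MusicPerformanceOrganization :kbpedia/Musician)
;; => #{#{:kko/Organizations :kko/Persons}}

The real build process of course works over the full knowledge graph structure rather than a pair of hand-built maps, but the principle is the same: the offending assertion is reported, along with the disjoint Super Types that it violates.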

Unsatisfiability of Linked External Concepts

The satisfiability testing of external concepts linked to KBpedia is performed in two steps:

  1. The testing first checks the satisfiability of the core KBpedia Knowledge Graph and, then
  2. It checks the satisfiability of the KBpedia Knowledge Graph in relation to all of its other links to external data sources.

This second step is essential to make sure that any external concept we link to KBpedia is linked properly and does not trigger any linking errors. In fact, we use the satisfiability testing to minimize the number of such errors. The process of checking whether external concepts linked to the KBpedia Knowledge Graph satisfy the structure is the same. If their inclusion leads to such an issue, then the links themselves are the issue, since we know from the previous step that the KBpedia core structure is satisfiable. Once detected, the linkage error(s) will be reviewed and fixed by the human operator and the structure will be regenerated. In the early phases of a new build, these fixes are accumulated and processed in batches. By the end of a new build, only one or a few errors remain to be corrected.

A Fully Connected Graph

Another important test is to make sure that the KBpedia Knowledge Graph is fully connected. We don't want islands of concepts in the graph; we want to make sure that every concept is reachable using sub-class-of, super-class-of or equivalent-to relationships. If the build process detects that some concepts are disconnected from the graph, then new relationships will need to be created to reconnect the graph. These “orphan” tests ensure the integrity and completeness of the overall graph structure.
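
For illustration, here is a minimal Clojure sketch (again an assumption, not the actual build code) of such an orphan check, where `parents` maps each reference concept to the set of its direct super classes and `root` is the top concept of the graph:

(defn reaches-root?
  "True if `concept` can reach `root` by following super-class-of links."
  [parents root concept]
  (loop [frontier #{concept} seen #{}]
    (cond
      (contains? frontier root) true
      (empty? frontier)         false
      :else (recur (->> frontier
                        (mapcat #(get parents % #{}))
                        (remove (into seen frontier))
                        set)
                   (into seen frontier)))))

(defn find-orphans
  "Returns the concepts that are disconnected from the root."
  [parents root]
  (remove #(reaches-root? parents root %) (keys parents)))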

Typologies Have Their Own Tests

What is a typology? As stated by Merriam Webster, a typology is “a system used for putting things into groups according to how they are similar.” The KBpedia typologies, of which there are about 80, are the classification of types that are closely related, which we term Super Types. Three example Super Types are People, Activities and Products. The Super Types are found in the upper reaches of the KBpedia Knowledge Graph. (See further this article by Mike Bergman describing the upper structure of KBpedia and its relation to the typologies.) Thousands of disjointedness assertions have been defined between individual Super Types to other Super Types. These assertions enforce the fact that the reference concepts related to a Super Type A are not similar to the reference concepts related to, say, Super Type B.

These disjointedness assertions are a major factor in how we can rapidly slice-and-dice the KBpedia knowledge space to rapidly create training corpuses and positive and negative training sets for machine learning. These same disjointedness relationships are what we use to make sure that the KBpedia Knowledge Graph structure is satisfiable and coherent.

Another use of the typologies is to have a general overview of the knowledge graph. Each typology is a kind of lens that shows different parts of the knowledge graph. The build process creates a log of each of the typologies with all the reference concepts that belong to it. Similarly, the build process also creates a mini-ontology for each typology that can be inspected in an ontology editor. We use these outputs to more easily assess the various structures within KBpedia and to find possible conceptual issues as part of our manual vetting before final approvals.

Knowledge is Dynamic and So Must Be Builds and Testing

Creating, maintaining and evolving a knowledge graph the size of KBpedia is a non-trivial task. It is also a task that must be done frequently and rapidly whenever the underlying nature of KBpedia’s constituent knowledge bases dynamically changes. These demands require a robust build process with multiple logic and consistency tests. At every step we have to make sure that the entire structure is satisfiable and coherent. Fortunately, after development over a number of years, we now have processes in place that are battle tested and can continue to be expanded as the KBpedia Knowledge Graph constantly evolves.

Posted at 19:57

November 04

Leigh Dodds: Discogs: a business based on public domain data

When I’m discussing business models around open data I regularly refer to a few different examples. Not all of these have well developed case studies, so I thought I’d start trying to capture them here. In this first write-up I’m going to look at

Posted at 22:28

Libby Miller: Machine learning links

[work in progress – I’m updating it gradually]

Machine Learning

Posted at 16:11

November 01

Leigh Dodds: Checking Fact Checkers

As of last month

Posted at 19:39

Leigh Dodds: Elinor Ostrom and data infrastructure

One of the topics that most interests me at the moment is how we design systems and organisations that contribute to the creation and maintenance of the open data commons.

This is more than a purely academic interest. If we can understand the characteristics of successful open data projects like Open Street Map or Musicbrainz then we could replicate them in other areas. My hope is that we may be able to define a useful tool-kit of organisational and technical design patterns that make it more likely for other similar projects to proceed. These patterns might also give us a way to evaluate and improve other existing systems.

A lot of the current discussion around this topic is going on under the “

Posted at 19:03

October 30

Bob DuCharme: My SQL quick reference

Pun intended.

Posted at 16:49

October 25

Frederick Giasson: Create a Domain Text Classifier Using Cognonto

A common task required by systems that automatically analyze text is to classify an input text into one or multiple classes. A model needs to be created to scope the class (what belongs to it and what does not) and then a classification algorithm uses this model to classify an input text.

Multiple classification algorithms exist to perform such a task: Support Vector Machines (SVM), K-Nearest Neighbours (KNN), C4.5 and others. What is hard with any such text classification task is not so much how to use these algorithms: they are generally easy to configure and use once implemented in a programming language. The hard – and time-consuming – part is to create a sound training corpus that will properly define the class you want to predict. Further, the steps required to create such a training corpus must be duplicated for each class you want to predict.

Since creating the training corpus is what is time consuming, this is where Cognonto provides its advantages.

In this article, we will show you how Cognonto’s KBpedia Knowledge Graph can be used to automatically generate training corpuses that are used to generate classification models. First, we define (scope) a domain with one or multiple KBpedia reference concepts. Second, we aggregate the training corpus for that domain using the KBpedia Knowledge Graph and its linkages to external public datasets that are then used to populate the training corpus of the domain. Third, we use the Explicit Semantic Analysis (ESA) algorithm to create a vectorial representation of the training corpus. Fourth, we create a model using (in this use case) an SVM classifier. Finally, we predict if an input text belongs to the class (scoped domain) or not.

This use case can be used in any workflow that needs to pre-process any set of input texts where the objective is to classify relevant ones into a defined domain.

Unlike more traditional topic taggers where topics are tagged in an input text with weights provided for each of them, we will see how it is possible to use the semantic interpreter to tag main concepts related to an input text even if the surface form of the topic is not mentioned in the text. We accomplish this by leveraging ESA’s semantic interpreter.

General and Specific Domains

In this article, two concepts are at the center of everything: what I call the general domain and the specific domain(s). What I call the general domain can be seen as the set of all specific domains. It includes the set of classes that generally define common things of the World. What we call a specific domain is one or multiple classes that scope a domain of interest. A specific domain is a subset of classes of the general domain.

In Cognonto, the general domain is defined by all the ~39,000 KBpedia reference concepts. A specific domain is any sub-set of the ~39,000 KBpedia reference concepts that adequately scopes a domain of interest.

The purpose of this use case is to show how we can determine if an input text belongs to a specific domain of interest. What we have to do is to create two training corpuses: one that defines the general domain, and one that defines the specific domain. However, how do we go about defining these corpuses? One way would be to do this manually, but it would take an awful lot of time to do.

This is the crux of the matter: we will generate the general domain corpus and specific domain ones automatically using the KBpedia Knowledge Graph and all of its linkages to external public datasets. The time and resources thus saved from creating the training corpuses can be spent testing different classification algorithms, tweaking their parameters, evaluating them, etc.

What is so powerful in leveraging the KBpedia Knowledge Graph in this manner is that we can generate training sets for all kind of domains of interests automatically.

Training Corpuses

The first step we have to do is to define the training corpuses that we will use to create the semantic interpreter and the SVM classification models. We have to create the general domain training corpus and the domain specific training corpus. The example domain I have chosen for this use case is scoped by the ideas of Music, Musicians, Music Records, Musical Groups, Musical Instruments, etc.

Define The General Training Corpus

The general training corpus is quite easy to create. The only thing I have to do is to query the KBpedia Knowledge Graph to get all the Wikipedia pages linked to all the KBpedia reference concepts. These pages will become the general training corpus.

Note that in this article I will only use the linkages to the Wikipedia dataset, but I could also use any other datasets that are linked to the KBpedia Knowledge Graph in exactly the same way. Here is how we aggregate all the documents that will belong to a training corpus:

Note that all I need to do is use the KBpedia structure, query it, and then write the general corpus into a CSV file. This CSV file will be used later for most of the subsequent tasks.

(define-general-corpus "resources/kbpedia_reference_concepts_linkage.n3" "resources/general-corpus-dictionary.csv")

Define The Specific Domain Training Corpus

The next step is to define the training corpus of the specific domain; for this use case, the music domain. To do so, I merely need to search KBpedia to find all the reference concepts I am interested in that will scope my music domain. These domain-specific KBpedia reference concepts will be the features of the SVM models we will test below.

What the define-domain-corpus function below does is simply query KBpedia to get all the Wikipedia articles related to these concepts and their sub-classes, and to create the training corpus from them.

In this article we only define a binary classifier. However, if we wanted to create a multi-class classifier, then we would have to define multiple specific domain training corpuses in exactly the same way. The only time we would have to spend is searching KBpedia (using the Cognonto user interface) to find the reference concepts we want to use to scope the domains we want to define. We will show how quickly this can be done, with impressive results, in a later use case.

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  "resources/kbpedia_reference_concepts_linkage.n3"
  "resources/domain-corpus-dictionary.csv")

Create Training Corpuses

Once the training corpuses are defined, we want to cache them locally to be able to play with them, without having to re-download them from the Web or re-create them each time.

(cache-corpus)

The cache is composed of 24,374 Wikipedia pages, which is about 2G of raw data. However, we have some more processing to perform on the raw Wikipedia pages since what we ultimately want is a set of relevant tokens (words) that will be used to calculate the value of the features of our model using the ESA semantic interpreter. Since we may want to experiment with different normalization rules, what we do is to re-write each document of the corpus in another folder that we will be able to re-create as required if the normalization rules change in the future. We can quickly re-process these input files and save them in separate folders for testing and comparative purposes.

The normalization steps performed by this function are to:

  1. Defluff the raw HTML page. We convert the HTML into text, and we only keep the body of the page
  2. Normalize the text with the following rules:
    1. remove diacritics characters
    2. remove everything between brackets like: [edit] [show]
    3. remove punctuation
    4. remove all numbers
    5. remove all invisible control characters
    6. remove all [math] symbols
    7. remove all words with 2 characters or fewer
    8. remove line and paragraph separators
    9. remove anything that is not an alpha character
    10. normalize spaces
    11. put everything in lower case, and
    12. remove stop words.

Normalization steps could be dropped or others included, but these are the standard ones Cognonto applies in its baseline configuration.

(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
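
For illustration, here is a minimal sketch (an assumption on our part, not Cognonto's actual implementation) showing how a few of the normalization rules listed above can be applied to a raw string:

(require '[clojure.string :as string])

(defn normalize-text
  [text]
  (-> text
      (string/replace #"\[[^\]]*\]" " ")      ;; remove everything between brackets, like [edit] [show]
      (string/replace #"[^\p{Alpha}\s]" " ")  ;; keep only alpha characters and whitespace
      (string/lower-case)                     ;; put everything in lower case
      (string/replace #"\s+" " ")             ;; normalize spaces
      (string/trim)))

;; (normalize-text "The Beatles [edit] were an English rock band formed in 1960.")
;; => "the beatles were an english rock band formed in"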

After cleaning, the size of the cache is now 208M (instead of the initial 2G for the raw web pages).

Note that, unlike what is discussed in the original ESA research papers by Evgeniy Gabrilovich, we are not pruning any pages (the ones with fewer than X number of tokens, etc.). This could be done at a subsequent tweaking step, which our results below indicate is not really necessary.

Now that the training corpuses are created, we can build the semantic interpreter to create the vectors that will be used to train the SVM classifier.

Build Semantic Interpreter

What we want to do is classify (determine) whether an input text belongs to the class defined by a domain. The relatedness of the input text is assessed against both the specific domain corpus and the general one. This classification is performed with classifiers like SVM, KNN and C4.5. However, each of these algorithms needs some kind of numerical vector upon which to model and classify the candidate input text. Creating this numeric vector is the job of the ESA Semantic Interpreter.

Let’s dive a little further into the Semantic Interpreter to understand how it operates. Note that you can skip the next section and continue with the following one.

How Does the Semantic Interpreter Work?

The Semantic Interpreter is a process that maps fragments of natural language into a weighted sequence of text concepts ordered by their relevance to the input.

Each concept in the domain is accompanied by a document from the KBpedia Knowledge Graph, which acts as its representative term set to capture the idea (meaning) of the concept. The overall corpus is based on the combined documents from KBpedia that match the slice retrieved from the knowledge graph based on the domain query(ies).

The corpus is composed of N concepts that come from the domain ontology, each associated with a KBpedia Knowledge Base document. We build a sparse matrix T where each of the N columns corresponds to a concept and where each of the rows corresponds to a word that occurs in the related entity documents. The matrix entry T[i][j] is the TF-IDF value of the word w_i in the document d_j associated with concept j:

T[i][j] = tfidf(w_i, d_j)

The TF-IDF value of a given term is calculated as:

tfidf(w_i, d_j) = tf(w_i, d_j) * log(N / df_i)

where m_j is the number of words in the document d_j, where the term frequency is defined as:

tf(w_i, d_j) = count(w_i, d_j) / m_j

and where the document frequency df_i is the number of documents where the term w_i appears.

Unlike the standard ESA system, pruning is not performed on the matrix to remove the least-related concepts for any given word. We are not doing the pruning because the ontologies are highly domain specific as opposed to really broad and general vocabularies. However, with a different mix of training text, and depending on the use case, the standard ESA model may benefit from pruning the matrix.

Once the matrix is created, we perform cosine normalization on each column:

T'[i][j] = T[i][j] / sqrt( sum_k T[k][j]^2 )

where T[i][j] is the TF-IDF weight of the word w_i in the concept document d_j, and where the denominator is the square root of the sum of the squared TF-IDF weights of all the words in document d_j. This normalization removes, or at least lowers, the effect of the length of the input documents.
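
To make the above computation concrete, here is a small, self-contained Clojure sketch (an illustration only, not the actual Cognonto implementation) that builds such a cosine-normalized TF-IDF mapping for a toy corpus of already-tokenized documents:

(defn tf-idf-matrix
  "docs is a map of concept -> sequence of tokens. Returns a map of
   concept -> {word -> cosine-normalized TF-IDF weight}."
  [docs]
  (let [n   (count docs)
        ;; document frequency: in how many documents does each word appear?
        dfs (->> (vals docs)
                 (mapcat distinct)
                 frequencies)
        ;; raw TF-IDF weights per concept column
        raw (into {}
                  (for [[concept tokens] docs]
                    (let [m (count tokens)]
                      [concept
                       (into {}
                             (for [[w c] (frequencies tokens)]
                               [w (* (/ c (double m))
                                     (Math/log (/ n (double (dfs w)))))]))])))]
    ;; cosine-normalize each concept column
    (into {}
          (for [[concept weights] raw]
            (let [norm (Math/sqrt (reduce + (map #(* % %) (vals weights))))]
              [concept (if (zero? norm)
                         weights
                         (into {} (for [[w v] weights] [w (/ v norm)])))])))))

;; (tf-idf-matrix {"Music"   ["guitar" "melody" "melody" "band"]
;;                 "Finance" ["bank" "market" "band"]})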

Creating the First Semantic Interpreter

The first semantic interpreter we will create is composed of the general corpus which has 24,374 Wikipedia pages and the music domain-specific corpus composed of 62 Wikipedia pages. The 62 Wikipedia pages that compose the music domain corpus come from the selected KBpedia reference concepts and their sub-classes that we defined in the Define The Specific Domain Training Corpus section above.

(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--base.csv")

(build-semantic-interpreter "base" "resources/semantic-interpreters/base/" (distinct (concat (get-domain-pages) (get-general-pages))))

Evaluating Models

Before building the SVM classifier, we have to create a gold standard that we will use to evaluate the performance of the models we will test. What I did was aggregate a list of news feeds from the CBC and from Reuters and then crawl each of them to get the news items they contained. Once I had aggregated them in a spreadsheet, I manually classified each of them. The result is a gold standard of 336 news pages classified as being related to the music domain or not. It can be downloaded from here.

Subsequently, three days later, I re-crawled the same feeds to create a second gold standard that has 345 news pages. It can be downloaded from here. I will use both to evaluate the different SVM models we will create below. (I created the two standards because of some internal tests and statistics we are compiling.)

Both gold standards got created this way:

(defn create-gold-standard-from-feeds
  [name]
  (let [feeds ["http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]

    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          (csv/write-csv out-file [["" (:title item) (:link item)]] :append true))))))

Each of the different models we will test in the next sections will be evaluated using the following function:

(defn evaluate-model
  [evaluation-no gold-standard-file]
  (let [gold-standard (rest
                       (with-open [in-file (io/reader gold-standard-file)]
                         (doall
                          (csv/read-csv in-file))))
        true-positive (atom 0)
        false-positive (atom 0)
        true-negative (atom 0)
        false-negative (atom 0)]

    (with-open [out-file (io/writer (str "resources/evaluate-" evaluation-no ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])

      (doseq [[class title url] gold-standard]
        (when-not (.exists (io/as-file (str "resources/gold-standard-cache/" (md5 url))))
          (spit (str "resources/gold-standard-cache/" (md5 url)) (slurp url)))
        (let [predicted-class (classify-text (-> (slurp (str "resources/gold-standard-cache/" (md5 url)))
                                                 defluff-content))]
          (println predicted-class " :: " title)
          (csv/write-csv out-file [[predicted-class title url]] :append true)
          (when (and (= class "1")
                     (= predicted-class 1.0))
            (swap! true-positive inc))

          (when (and (= class "0")
                     (= predicted-class 1.0))
            (swap! false-positive inc))

          (when (and (= class "0")
                     (= predicted-class 0.0))
            (swap! true-negative inc))

          (when (and (= class "1")
                     (= predicted-class 0.0))
            (swap! false-negative inc))))

      (println "True positive: " @true-positive)
      (println "false positive: " @false-positive)
      (println "True negative: " @true-negative)
      (println "False negative: " @false-negative)

      (println)

      (let [precision (float (/ @true-positive (+ @true-positive @false-positive)))
            recall (float (/ @true-positive (+ @true-positive @false-negative)))]
        (println "Precision: " precision)
        (println "Recall: " recall)
        (println "Accuracy: " (float (/ (+ @true-positive @true-negative) (+ @true-positive @false-negative @false-positive @true-negative))))
        (println "F1: " (float (* 2 (/ (* precision recall) (+ precision recall)))))))))

What this function does is calculate the number of true-positive, false-positive, true-negative and false-negative instances within the gold standard by applying the current model, and then calculate the precision, recall, accuracy and F1 metrics. You can read more about how binary classifiers can be evaluated from here.
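
For reference, the metric arithmetic used above can be expressed as small helper functions (a sketch only; these helpers are not part of the project's code):

(defn precision [tp fp] (float (/ tp (+ tp fp))))
(defn recall    [tp fn-] (float (/ tp (+ tp fn-))))
(defn accuracy  [tp tn fp fn-] (float (/ (+ tp tn) (+ tp tn fp fn-))))
(defn f1-score  [p r] (float (/ (* 2 p r) (+ p r))))

;; For example, with the counts of the first evaluation below
;; (TP=5, FP=0, TN=310, FN=21):
;; (precision 5 0)        => 1.0
;; (recall 5 21)          => ~0.1923
;; (accuracy 5 310 0 21)  => 0.9375
;; (f1-score 1.0 0.1923)  => ~0.3226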

Build SVM Model

Now that we have numeric vector representations of the music domain, and a way to evaluate the quality of the models we will be creating, we can create and evaluate our prediction models.

The classification algorithm I chose to use for this article is the Support Vector Machine (SVM). I use the Java port of the LIBLINEAR library. Let’s create the first SVM model:

(build-svm-model-vectors "resources/svm/base/")
(train-svm-model "svm.w0" "resources/svm/base/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

This initial model is created using a training set that is composed of 24,311 documents that don’t belong to the class (the music specific domain), and 62 documents that do belong to that class.

Now, let’s evaluate how this initial model performs against the two gold standards:

(evaluate-model "w0" "resources/gold-standard-1.csv" )
True positive:  5
False positive:  0
True negative:  310
False negative:  21

Precision:  1.0
Recall:  0.1923077
Accuracy:  0.9375
F1:  0.32258064
(evaluate-model "w0" "resources/gold-standard-2.csv" )
True positive:  2
false positive:  1
True negative:  319
False negative:  23

Precision:  0.6666667
Recall:  0.08
Accuracy:  0.93043476
F1:  0.14285713

Well, this first run looks to be really poor! The issue here is a common one with how the SVM classifier is being used. Ideally, the number of documents that belong to the class and the number of documents that do not belong to the class should be about the same. However, because of the way we defined the music specific domain, and because of the way we created the training corpuses, we ended up with two really unbalanced sets of training documents: 24,311 that don’t belong to the class and only 63 that do. That is the reason why we are getting these kinds of poor results.

What can we do from here? We have two possibilities:

  1. We use LIBLINEAR’s weight modifier parameter to modify the weight of the terms that exist in the 63 documents that belong to the class. Because the two sets are so unbalanced, the weight should theoretically be around 386 (roughly 24,311 / 63), or
  2. We add thousands of new documents that belong to the class we want to predict.

Let’s test both options. We will initially play with the weights to see how much we can improve the current situation.

Improving Performance Using Weights

What we will do now is to create a series of models that will differ in the weight we will define to improve the weight of the classified terms in the SVM process.

Weight 10

(train-svm-model "svm.w10" "resources/svm/base/"
                 :weights {1 10.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w10" "resources/gold-standard-1.csv")
True positive:  17
False positive:  1
True negative:  309
False negative:  9

Precision:  0.9444444
Recall:  0.65384614
Accuracy:  0.9702381
F1:  0.77272725
(evaluate-model "w10" "resources/gold-standard-2.csv")
True positive:  15
False positive:  2
True negative:  318
False negative:  10

Precision:  0.88235295
Recall:  0.6
Accuracy:  0.9652174
F1:  0.71428573

This is already a clear improvement for both gold standards. Let’s see if we continue to see improvements if we continue to increase the weight.

Weight 25

(train-svm-model "svm.w25" "resources/svm/base/"
                 :weights {1 25.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w25" "resources/gold-standard-1.csv")
True positive:  20
False positive:  3
True negative:  307
False negative:  6

Precision:  0.8695652
Recall:  0.7692308
Accuracy:  0.97321427
F1:  0.8163265
(evaluate-model "w25" "resources/gold-standard-2.csv")
True positive:  21
False positive:  5
True negative:  315
False negative:  4

Precision:  0.8076923
Recall:  0.84
Accuracy:  0.973913
F1:  0.82352936

The general metrics continued to improve. By increasing the weight, the precision dropped a little bit, but the recall improved quite a bit. The overall F1 score significantly improved. Let’s see with the Weight at 50.

Weight 50

(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w50" "resources/gold-standard-1.csv")
True positive:  23
False positive:  7
True negative:  303
False negative:  3

Precision:  0.76666665
Recall:  0.88461536
Accuracy:  0.9702381
F1:  0.82142854
(evaluate-model "w50" "resources/gold-standard-2.csv")
True positive:  23
False positive:  6
True negative:  314
False negative:  2

Precision:  0.79310346
Recall:  0.92
Accuracy:  0.9768116
F1:  0.8518519

The trend continues: precision declines, recall increases, and the overall F1 score is better in both cases. Let’s try with a weight of 200.

Weight 200

(train-svm-model "svm.w200" "resources/svm/base/"
                 :weights {1 200.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w200" "resources/gold-standard-1.csv")
True positive:  23
False positive:  7
True negative:  303
False negative:  3

Precision:  0.76666665
Recall:  0.88461536
Accuracy:  0.9702381
F1:  0.82142854
(evaluate-model "w200" "resources/gold-standard-2.csv")
True positive:  23
False positive:  6
True negative:  314
False negative:  2

Precision:  0.79310346
Recall:  0.92
Accuracy:  0.9768116
F1:  0.8518519

The results are the same; it looks like increasing the weight beyond a certain point adds nothing further to the predictive power. However, the goal of this article is not to be an SVM parametrization tutorial. Many other tests could be done, such as testing different values for the different SVM parameters like the C parameter and others.
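
For instance, a different value of the C parameter could be tried with exactly the same functions (shown only as an illustration; this configuration was not evaluated in this article):

(train-svm-model "svm.w50.c05" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 0.5
                 :algorithm :l2l2)

(evaluate-model "w50.c05" "resources/gold-standard-1.csv")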

Improving Performance Using New Music Domain Documents

Now let’s see if we can improve the performance of the model even more by adding new documents that belong to the class we want to define in the SVM model. The idea of adding documents is good, but how may we quickly process thousands of new documents that belong to that class? Easy: we will use the KBpedia Knowledge Graph and its linkage to entities that exist in the KBpedia Knowledge Base to get thousands of new documents highly related to the music domain we are defining.

Here is how we will proceed. See how we use the type relationship between the classes and their individuals:

The millions of completely typed instances in KBpedia enable us to retrieve such large training sets efficiently and quickly.

Extending the Music Domain Model

To extend the music domain model, I added about 5,000 album, musician and band documents using the relationships querying strategy outlined in the figure above. What I did is just add 3 new features, but with thousands of new training documents in the corpus.

What I had to do was to:

  1. Extend the domain pages with the new entities
  2. Cache the new entities’ Wikipedia pages
  3. Build a new semantic interpreter that takes the new documents into account, and
  4. Build a new SVM model that uses the new semantic interpreter’s output.

(extend-domain-pages-with-entities)
(cache-corpus)
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--extended.csv")

(build-semantic-interpreter "domain-extended" "resources/semantic-interpreters/domain-extended/" (distinct (concat (get-domain-pages) (get-general-pages))))

(build-svm-model-vectors "resources/svm/domain-extended/")

Evaluating the Extended Music Domain Model

Just like what we did for the first series of tests, we will now create different SVM models and evaluate them. Since we now have a nearly balanced training corpus, we will test much smaller weights (no weight at first, and then small weights).

(train-svm-model "svm.w0" "resources/svm/domain-extended/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w0" "resources/gold-standard-1.csv")
True positive:  20
False positive:  12
True negative:  298
False negative:  6

Precision:  0.625
Recall:  0.7692308
Accuracy:  0.9464286
F1:  0.6896552
(evaluate-model "w0" "resources/gold-standard-2.csv")
True positive:  18
False positive:  17
True negative:  303
False negative:  7

Precision:  0.51428574
Recall:  0.72
Accuracy:  0.93043476
F1:  0.6

As we can see, the model is scoring much better than the previous one when the weight is zero. However, it is not as good as the previous one when weights are modified. Let’s see if we can benefit from increasing the weight for this new training set:

(train-svm-model "svm.w2" "resources/svm/domain-extended/"
                 :weights {1 2.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w2" "resources/gold-standard-1.csv")
True positive:  21
False positive:  23
True negative:  287
False negative:  5

Precision:  0.47727272
Recall:  0.8076923
Accuracy:  0.9166667
F1:  0.59999996
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive:  20
False positive:  33
True negative:  287
False negative:  5

Precision:  0.3773585
Recall:  0.8
Accuracy:  0.8898551
F1:  0.51282054

Overall the model seems worse with a weight of 2; let’s try a weight of 5:

(train-svm-model "svm.w5" "resources/svm/domain-extended/"
                 :weights {1 5.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

(evaluate-model "w5" "resources/gold-standard-1.csv")
True positive:  25
False positive:  52
True negative:  258
False negative:  1

Precision:  0.32467532
Recall:  0.96153843
Accuracy:  0.8422619
F1:  0.4854369
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive:  23
False positive:  62
True negative:  258
False negative:  2

Precision:  0.27058825
Recall:  0.92
Accuracy:  0.81449276
F1:  0.41818184

The performance just gets worse, but this makes sense. Now that the training set is balanced, many more tokens participate in the semantic interpreter, and so in the vectors it generates for the SVM. If we increase the class weight on an already balanced training set, this intuitively should re-unbalance the training set and worsen the performance, which is apparently what is happening.

Re-balancing the training set using this strategy does not appear to improve the prediction model, at least not for this domain and not for these SVM parameters.

Improving Using Manual Features Selection

So far, we have been able to test different kinds of strategies to create different training corpuses, to select different features, etc. We have been able to do this within a day, mostly waiting for the desktop computer to build the semantic interpreter and the vectors for the training sets. This has been possible thanks to the KBpedia Knowledge Graph, which enabled us to easily and automatically slice-and-dice the knowledge structure to perform all these tests quickly and efficiently.

There are other things we could do to continue to improve the prediction model, such as manually selecting features returned by KBpedia. Then we could test different parameters of the SVM classifier, etc. However, such tweaks are possible topics for later use cases.

Multiclass Classification

Let me add a few additional words about multiclass classification. As we saw, we can easily define domains by selecting one or multiple KBpedia reference concepts and all of their sub-classes. This general process enables us to scope any domain we want to cover. Then we can use the KBpedia Knowledge Graph’s relationship with external data sources to create the training corpus for the scoped domain. Finally, we can use SVM as a binary classifier to determine if an input text belongs to the domain or not. However, what if we want to classify an input text with more than one domain?

This can easily be done by using the one-vs-rest (also called one-vs-all) multiclass classification strategy. The only thing we have to do is define multiple domains of interest, and then create an SVM model for each of them. As noted above, this effort is almost solely one of posing one or more queries to KBpedia for a given domain. Finally, to predict if an input text belongs to any of the domain models we defined, we need to apply an SVM implementation (like LIBLINEAR) that already supports multi-class SVM classification.
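
To make the idea concrete, here is a minimal one-vs-rest sketch in Clojure. It assumes a hypothetical classify-with-model function that applies one of the trained binary SVM models to an input text and returns 1.0 when the text belongs to that model's domain; the domain names are also illustrative.

(declare classify-with-model) ; hypothetical: applies a named trained model to a text

(def domain-models ["music" "sports" "politics"]) ; hypothetical domain models

(defn classify-multiclass
  "Returns the set of domains whose binary SVM model accepts the input text."
  [text]
  (->> domain-models
       (filter (fn [domain] (= 1.0 (classify-with-model domain text))))
       (set)))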

Conclusion

In this article, we tested multiple different strategies to create a good prediction model using SVM to classify input texts into a music-related class. We tested unbalanced training corpuses, balanced training corpuses, different sets of features, etc. Some of these tests improved the prediction model; others made it worse. The key point to remember is that any machine learning effort requires bounding, labeling, testing and refining multiple parameters in order to obtain the best results. Use of the KBpedia Knowledge Graph and its linkage to external public datasets enables Cognonto to do these previously lengthy and time-consuming tasks quickly and efficiently.

Within a few hours, we created a classifier with an accuracy of about 97% that determines whether an input text belongs to the music domain or not. We demonstrated how we can create such classifiers more-or-less automatically, using the KBpedia Knowledge Graph to define the scope of the domain and to classify new text into that domain based on relevant KBpedia reference concepts. Finally, we noted how we may create multi-class classifiers using exactly the same mechanisms.

Posted at 00:49

October 24

Libby Miller: A presence robot with Chromium, WebRTC, Raspberry Pi 3 and EasyRTC

Here’s how to make a presence robot with Chromium 51, WebRTC, Raspberry Pi 3 and EasyRTC. It’s actually very easy, especially now that Chromium 51 comes with Raspbian Jessie, although it’s taken me a long time to find the exact incantation.

If you’re going to use it for real, I’d suggest using the

Posted at 21:53

October 21

Leigh Dodds: “Open”

For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.

XYZ is “open” because:

  • It’s on the web
  • It’s free to use
  • It’s published under an open licence
  • It’s published under a custom licence, which limits some types of use (usually commercial, often everything except personal)
  • It’s published under an open licence, but we’ve not checked too deeply into whether we can do that
  • It’s free to use, so long as you do so within our app or application
  • There’s a restricted/limited access free version
  • There’s documentation on how it works
  • It was (or is) being made in public, with equal participation by anyone
  • It was (or is) being made in public, led by a consortium or group that has limitations on membership (even if just fees)
  • It was (or is) being made privately, but the results are then being made available publicly for you to use

I gather that at

Posted at 14:51

Leigh Dodds: Current gaps in the open data standards framework

In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)

To define the scope of those standards, let’s try to answer two questions.

Question 1: What are the various activities that we might want to carry out around an open dataset?

  • A. Discover the metadata and documentation about a dataset
  • B. Download or otherwise extract the contents of a dataset
  • C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
  • D. Monitor a dataset for updates
  • E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
  • F. Mirror a dataset to another location, e.g. exporting its metadata and contents
  • G. Link or reconcile some data against a dataset or register

Question 2: What are the various activities that we might want to carry out around an open data catalogue?

  • V. Find whether a dataset exists, e.g. via a search or similar interface
  • X. List the contents of the platform, e.g. its datasets or other published assets
  • Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
  • Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains

Now, based on that quick review: which of these areas of functionality are covered by existing standards?

Posted at 14:22

October 17

AKSW Group - University of Leipzig: AKSW Colloquium, 17.10.2016, Version Control for RDF Triple Stores + NEED4Tweet

In the upcoming Colloquium, October the 17th at 3 PM, two papers will be presented:

Version Control for RDF Triple Stores

Marvin Frommhold will discuss the paper “Version Control for RDF Triple Stores” by Steve Cassidy and James Ballantine which forms the foundation of his own work regarding versioning for RDF.

Abstract:  RDF, the core data format for the Semantic Web, is increasingly being deployed both from automated sources and via human authoring either directly or through tools that generate RDF output. As individuals build up large amounts of RDF data and as groups begin to collaborate on authoring knowledge stores in RDF, the need for some kind of version management becomes apparent. While there are many version control systems available for program source code and even for XML data, the use of version control for RDF data is not a widely explored area. This paper examines an existing version control system for program source code, Darcs, which is grounded in a semi-formal theory of patches, and proposes an adaptation to directly manage versions of an RDF triple store.

NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation

Afterwards, Diego Esteves will present the paper “NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation” by Mena B. Habib and Maurice van Keulen which was accepted at ACL 2015.

Abstract: In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) for Tweets. The straightforward application of state-of-the-art extraction and disambiguation approaches on informal text widely used in Tweets, typically results in significantly degraded performance due to the lack of formal structure; the lack of sufficient context required; and the seldom entities involved. In this paper, we introduce a novel framework that copes with the introduced challenges. We rely on contextual and semantic features more than syntactic features which are less informative. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 07:55

October 14

AKSW Group - University of Leipzig: LIMES 1.0.0 Released

Dear all,

the LIMES Dev team is happy to announce LIMES 1.0.0.

LIMES, the Link Discovery Framework for Metric Spaces, is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. Our approaches facilitate different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large number of those instance pairs that do not satisfy the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. The approaches implemented in LIMES include the original LIMES algorithm for edit distances, HR3, HYPPO and ORCHID.

Additionally, LIMES supports HELIOS, the first planning technique for link discovery, which minimizes the overall execution of a link specification without any loss of completeness. Moreover, LIMES implements supervised and unsupervised machine-learning algorithms for finding accurate link specifications. The algorithms implemented here include the supervised, active and unsupervised versions of EAGLE and WOMBAT.

 

Website: http://aksw.org/Projects/LIMES.html

Download: https://github.com/AKSW/LIMES-dev/releases/tag/1.0.0

GitHub: https://github.com/AKSW/LIMES-dev

User manual: http://aksw.github.io/LIMES-dev/user_manual/

Developer manual: http://aksw.github.io/LIMES-dev/developer_manual/

 

What is new in LIMES 1.0.0:

  • New LIMES GUI
  • New Controller that supports manual and graphical configuration
  • New machine learning pipeline: supports supervised, unsupervised and active learning algorithms
  • New dynamic planning for efficient link discovery
  • Updated execution engine to handle dynamic planning
  • Added support for qualitative (Precision, Recall, F-measure etc.) and quantitative (runtime duration etc.) evaluation metrics for mapping evaluation, in the presence of a gold standard
  • Added support for configuration files in XML and RDF formats
  • Added support for pointsets metrics such as Mean, Hausdorff and Surjection
  • Added support for MongeElkan, RatcliffObershelp string measures
  • Added support for Allen’s algebra temporal relations for event data
  • Added support for all topological relations derived from the DE-9IM model
  • Migrated the codebase to Java 8 and Jena 3.0.1

We would like to thank everyone who helped to create this release. We also acknowledge the support of the SAKE  and HOBBIT projects.

Kind regards,

The LIMES Dev team

 

Posted at 09:38

October 11

AKSW Group - University of Leipzig: DL-Learner 1.3 (Supervised Structured Machine Learning Framework) Released

Dear all,

the Smart Data Analytics group at AKSW is happy to announce DL-Learner 1.3.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

Website: http://dl-learner.org
GitHub page: https://github.com/AKSW/DL-Learner
Download: https://github.com/AKSW/DL-Learner/releases
ChangeLog: http://dl-learner.org/development/changelog/

DL-Learner is used for data analysis tasks within other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern-based and evolutionary techniques for learning on structured data. For a practical example, see http://dl-learner.org/community/carcinogenesis/. It also offers a plugin for Protégé, which can give suggestions for axioms to add.

In the current release, we added a large number of new algorithms and features. For instance, DL-Learner supports terminological decision tree learning, it integrates the LEAP and EDGE systems as well as the BUNDLE probabilistic OWL reasoner. We migrated the system to Java 8, Jena 3, OWL API 4.2 and Spring 4.3. We want to point to some related efforts here:

We want to thank everyone who helped to create this release, in particular we want to thank Giuseppe Cota who visited the core developer team and significantly improved DL-Learner. We also acknowledge support by the SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the Big Data Europe and HOBBIT projects.

Kind regards,

Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin

 

Posted at 19:41

October 07

Frederick Giasson: Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper

There are many situations where we want to link named entities from two different datasets or to find duplicate entities to remove in a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we want to use such a linkage system to help save time when creating gold standards for named entity recognition tasks.

There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.

Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
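
To illustrate that general mechanism (this is a sketch of the typical pattern, not Cognonto Mapper code), comparator scores in [0,1] can be combined with a weighted average and compared to a threshold:

(defn entity-similarity
  "Weighted average of the scores returned by each comparator."
  [comparators entity-a entity-b]
  (/ (reduce + (map (fn [{:keys [weight compare-fn]}]
                      (* weight (compare-fn entity-a entity-b)))
                    comparators))
     (reduce + (map :weight comparators))))

(defn same-entity?
  "True when the combined similarity reaches the configured threshold."
  [comparators threshold entity-a entity-b]
  (>= (entity-similarity comparators entity-a entity-b) threshold))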

The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on their types and the analysis of those types in the KBpedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on common properties like names and alternative names, even if they are two completely different things. This happens because entities are often ambiguous when considering only these basic properties. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointedness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.

We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.

Usages Of The Cognonto Mapper

When should the Cognonto Mapper, or other deduplication and mapping services, be used? While there are many tasks that warrant the usage of such a system, let’s focus for now on some use cases related to Cognonto and machine learning in general.

Mapping Across Schema

One of Cognonto’s most important use cases is to use the Mapper to link new vocabularies, schemas or ontologies to the KBpedia Knowledge Ontology (KKO). This is exactly what we did for the 24 external ontologies and schemas that we have integrated into KBpedia. Creating such a mapping can be a long and painstaking process. The Mapper greatly helps link similar concepts together by narrowing the initial candidate pool of mappings, thereby increasing the efficiency of the analyst charged with selecting the final mappings between the two ontologies.

Creating ‘Gold Standards’

In my last article, I created a gold standard of 511 random web pages where I determined the publisher of the web page by hand. That gold standard was used to measure the performance of a named entities recognition task. However, to create the actual gold standard, I had to check in each dataset (5 of them, with millions of entities) whether that publisher existed in any of them. Performing such a task by hand means that I would have to send at least 2,555 search queries to try to find a matching entity. Even if I am fast, and can write a query, send it, look at the results, and copy/paste the URI of the right entity into the gold standard within 30 seconds, it still means that I would complete such a task in roughly 21 hours. It is also clearly impossible for a sane person to do that 8 hours per day for ~3 days, so this task would probably take at least a week to complete.

This is why automating this mapping process is really important, and this is what the Cognonto Mapper does. The only thing that is needed is to configure 5 mapper sessions. Each session tries to map the entities I identified by hand from the 511 web pages to one of the other datasets. Then I only need to run the mapper for each dataset, review the matches, find the missing ones by hand and then merge the results into the final gold standard.

Curating Unknown Entities

In Cognonto, we have an unknown entities tagger that is used to detect possible publisher organizations that do not currently exist in the KBpedia knowledge base. In some cases, what we want to do is save these detected unknown entities in an unknown entities dataset. This dataset is then used to review the detected entities so they can be included back into the KBpedia knowledge base (such that they become known). In the review workflow, one of the steps should be to try to find similar entities to make sure that what was detected by the entities tagger was a totally new entity, and not a new surface form of an existing entity (which would become an alternative label for that entity and not an entirely new one). Such a checkup in the review workflow would be performed by the Cognonto Mapper.

How Does the SuperType Comparator Work?

As I mentioned in the introduction, the Cognonto Mapper is yet another linkage & deduplication framework. However, it has a special twist: its SuperType Comparator and the leveraging of the KBpedia Knowledge Ontology. Good, but how does it work? There is no better way to understand how it works than studying how two entities can be disambiguated based on their type. So, let’s do this.

Let’s consider this use case. We want to map two datasets together: Wikipedia and Musicbrainz. One of the Musicbrainz entities we want to map to Wikipedia is a music group called Attila, with Billy Joel and Jon Small. Attila also exists in Wikipedia, but it is highly ambiguous and may refer to multiple different things. If we set up our linkage task to only work on the preferred and possible alternative labels, then we would have a match between the name of that band and multiple other things in Wikipedia, with matching likelihoods that are probably nearly identical. How could we update the configuration to try to solve this issue? We have no choice: we will have to use the Cognonto Mapper SuperType Comparator.

Musicbrainz RDF dumps normally map a Musicbrainz group to a mo:MusicGroup. In the Wikipedia RDF dump the Attila rock band has a type dbo:Band. Both of these classes are linked to the KBpedia reference concept kbpedia:Band-MusicGroup. This means that the entities of both of these datasets are well connected into KBpedia.

Let’s say that the Cognonto Mapper does detect that the Attila entity in the Musicbrainz dataset has 4 candidates in Wikipedia:

  1. Attila, the rock band
  2. Attila, the bird
  3. Attila, the film
  4. Attila, the album

If the comparison is only based on the preferred label, the likelihood will be the same for all these entities. However, what happens when we start using the SuperType Comparator and the KBpedia Knowledge Ontology?

First we have to understand the context of each type. Using KBpedia, we can determine that rock bands, birds, albums and films are disjoint according to their super types: kko:Organizations, kko:Animals, kko:AudioInfo and kko:VisualInfo, respectively.

Now that we understand each of the entities the system is trying to link together, and their context within the KBpedia Knowledge Ontology, let’s see how the Cognonto Mapper will score each of these entities based on their type to help disambiguate where labels are identical.

(println 
 "mo:MusicGroup -> dbo:Band"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicGroup" "http://dbpedia.org/ontology/Band"))
(println 
 "mo:MusicGroup -> dbo:Bird"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicGroup" "http://dbpedia.org/ontology/Bird"))
(println 
 "mo:MusicGroup -> dbo:Film"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicGroup" "http://dbpedia.org/ontology/Film"))
(println 
 "mo:MusicGroup -> dbo:Album"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicGroup" "http://dbpedia.org/ontology/Album"))
Classes Similarity
mo:MusicGroup -> dbo:Band 1.0
mo:MusicGroup -> dbo:Bird 0.2
mo:MusicGroup -> dbo:Film 0.2
mo:MusicGroup -> dbo:Album 0.2

In these cases, the SuperType Comparator assigned a similarity of 1.0 to the mo:MusicGroup and dbo:Band classes since those two classes are equivalent. All the other checks return 0.20: when the comparator finds two entities that have disjoint SuperTypes, it assigns them a similarity value of 0.20. Why not 0.00 if they are disjoint? Well, there may be errors in the knowledge base, so by setting the comparator score to a very low level, the candidate is still available for evaluation, even though its score is much reduced.

In this case the matching is unambiguous and the selection of the right linkage to perform is obvious. However, you will see below that it is not always (and often not) that simple to make such a clear selection.

Now let’s say that the next entity to match from the Musicbrainz dataset is another entity called Attila, but this time it refers to Attila, the album by Mina. Since the basis of the comparison has changed (comparing the Musicbrainz Attila album instead of the band), the entire process will yield different results. The main difference is that the album will be compared to a film and an album from the Wikipedia dataset. As you can notice in the graph below, these two entities belong to the super types kko:AudioInfo and kko:VisualInfo, which are not disjoint.

(println 
 "mo:MusicalWork -> dbo:Band"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicalWork" "http://dbpedia.org/ontology/Band"))
(println 
 "mo:MusicalWork -> dbo:Bird"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicalWork" "http://dbpedia.org/ontology/Bird"))
(println 
 "mo:MusicalWork -> dbo:Film"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicalWork" "http://dbpedia.org/ontology/Film"))
(println 
 "mo:MusicalWork -> dbo:Album"
 (.compare stc-ex-compare "http://purl.org/ontology/mo/MusicalWork" "http://dbpedia.org/ontology/Album"))
Classes Similarity
mo:MusicalWork -> dbo:Band 0.2
mo:MusicalWork -> dbo:Bird 0.2
mo:MusicalWork -> dbo:Film 0.8762886597938144
mo:MusicalWork -> dbo:Album 0.9555555555555556

As you can see, the main difference is that we don’t have a perfect match between the entities. We thus need to compare their types, and two of the entities are ambiguous based on their SuperTypes (their super types are non-disjoint). In this case, what the SuperType Comparator does is check the sets of super classes of both entities and calculate a similarity measure between the two sets of classes. That is why we have 0.8762 for one and 0.9555 for the other.
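
The exact similarity measure used by the SuperType Comparator is not detailed here, but a simple set-overlap measure such as the Jaccard similarity over the two sets of super classes gives the same kind of behaviour (a sketch, shown only for illustration):

(require '[clojure.set :as set])

(defn superclass-similarity
  "Jaccard similarity between the sets of super classes of two entities."
  [superclasses-a superclasses-b]
  (let [a (set superclasses-a)
        b (set superclasses-b)]
    (double (/ (count (set/intersection a b))
               (count (set/union a b))))))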

A musical work and an album are two nearly identical concepts. In fact, a musical work is the conceptual work behind an album (a record). A musical work is also strongly related to films, since films include musical works, etc. However, the relationship between a musical work and an album is stronger than with a film, and this is what the similarity measure shows.

In this case, even if we have two ambiguous entities, an album and a film, for which we don’t have disjoint super types, we are still able to determine which one to choose to create the mapping based on the calculation of the similarity measure.

Conclusion

As we saw, there are multiple reasons why we would want to leverage the KBpedia Knowledge Ontology to help mapping and deduplication frameworks such as the Cognonto Mapper disambiguate possible entity matches. KBpedia is not only good for mapping datasets together; it is also quite effective in helping with some machine learning tasks such as creating gold standards or curating detected unknown entities. In the context of Cognonto, it is quite effective for mapping external ontologies, schemas or vocabularies to the KBpedia Knowledge Ontology. It is an essential tool for extending KBpedia to domain- and enterprise-specific needs.

In this article I focused on the SuperType Comparator, which leverages the type structure of the KBpedia Knowledge Ontology. However, we can also use other structural features in KBpedia (such as an Aspects Comparator based on the aspects structure of KBpedia), singly or in combination, to achieve other mapping or disambiguation objectives.

Posted at 12:20

October 05

AKSW Group - University of Leipzig: OntoWiki 1.0.0 released

Dear Semantic Web and Linked Data Community,
we are proud to finally announce the releases of OntoWiki 1.0.0 and the underlying Erfurt Framework in version 1.8.0.
After 10 years of development we’ve decided to release the teenager OntoWiki from the cozy home of 0.x versions.
Since the last release, 0.9.11 in January 2014, we did a lot of testing to stabilize OntoWiki’s behavior and accordingly made a lot of bug fixes. We are now also using PHP Composer for dependency management, improved the testing workflow, gave a new structure and home to the documentation, and created a neat project landing page.

The development of OntoWiki is completely open source and we are happy for any contribution, especially to the code and the documentation, which is also kept in a Git repository with easy to edit Markdown pages. If you have questions about the usage of OntoWiki beyond the documentation, you can also use our mailing list or the stackoverflow tag “ontowiki”.

Please see https://ontowiki.net/ for further information.

We also had a Poster for advertising the OntoWiki release at SEMANTiCS Conference:

OntoWiki 1.0

Philipp Frischmuth, Natanael Arndt, Michael Martin: OntoWiki 1.0: 10 Years of Development – What’s New in OntoWiki

We are happy for your feedback, in the name of the OntoWiki team,
Philipp, Michael and Natanael

Our Fingers on the Mouse

Posted at 14:50

October 04

Frederick Giasson: Improving Machine Learning Tasks By Integrating Private Datasets

In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organizations still rely on private datasets of their own curated data.

In this article, I want to demonstrate how high-value private datasets may be integrated into the Cognonto’s KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo that is supported by a “gold standard” of 511 web pages taken at random, for which we have tagged the organization that published each web page. This demo is related to the publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the publisher analyzer; more precisely, we will analyze the performance of the analyzer depending on the datasets it has access to when performing its predictions.

Cognonto Publisher’s Analyzer

The Cognonto publisher’s analyzer is a portion of the overall Cognonto demo that tries to determine the publisher of a web page from analyzing the web page’s content. There are multiple moving parts to this analyzer, but its general internal workflow works as follows:

  1. It crawls a given webpage URL
  2. It extracts the page’s content and extracts its meta-data
  3. It tags all of the organizations (anything that is considered an organization in KBpedia) across the extracted content using the organization entities that exist in the knowledge base
  4. It tries to detect unknown entities that will eventually be added to the knowledge base after curation
  5. It performs an in-depth analysis of the organization entities (known or unknown) that got tagged in the content of the web page, and analyzes which of these is the most likely to be the publisher of the web page.
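
A minimal sketch of that workflow expressed as a Clojure pipeline; every function name below is hypothetical and simply stands in for one of the steps above:

(declare crawl-page extract-content-and-meta tag-known-organizations
         detect-unknown-entities select-likely-publisher) ; hypothetical steps

(defn analyze-publisher
  [url]
  (-> url
      crawl-page                 ; 1. crawl the web page
      extract-content-and-meta   ; 2. extract its content and meta-data
      tag-known-organizations    ; 3. tag KBpedia organization entities
      detect-unknown-entities    ; 4. detect unknown candidate entities
      select-likely-publisher))  ; 5. pick the most likely publisher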

Such a machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the Cognonto analyzer is its knowledge base. We leverage Cognonto to detect known organization entities. We use the knowledge in the KB for each of these entities to improve the analysis process. We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of datasets that compose the Cognonto knowledge base, and the KBpedia conceptual reference structure composed of roughly ~39,000 reference concepts.

Given the central role of the knowledge base in such an analysis process, we want to have a better idea of the impact of the datasets in the performance of such a system.

For this demo, I use three public datasets already in KBpedia and used by the Cognonto demo: Wikipedia (via DBpedia), Freebase and USPTO. Then I add two private datasets of high quality, highly curated and domain-related information to augment the listing of potential organizations. What I will do is run the Cognonto publisher analyzer on each of these 511 web pages. Then I will check which ones got properly identified given the gold standard, and finally I will calculate different performance metrics to see the impact of including or excluding a certain dataset.

Gold Standard

The gold standard is composed of 511 randomly selected web pages that got crawled and cached. When we run the tests below, the cached version of the HTML pages is used to make sure that we get the same HTML for each page for each test. When the pages are crawled, we execute any possible JavaScript code that the pages may contain before caching the HTML code of the page. That way, if some information in the page was injected by some JavaScript code, then that additional information will be cached as well.

The gold standard is really simple. For each of the URLs we have in the standard, we determine the publishing organization manually. Then once the organization is determined, we search in each dataset to see if the entity already exists. If it does, then we add the URI (unique identifier) of the entity in the knowledge base into the gold standard. It is this URI reference that is used to determine if the publisher analyzer properly detects the actual publisher of the web page.

We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.

The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.

Metrics

The goal of this analysis is to determine how good the analyzer is at performing the task (detecting the organization that published a web page on the Web). What we have to do is use a set of metrics that will help us understand the performance of the system. The metrics calculation is based on the confusion matrix.

The True Positive, False Positive, True Negative and False Negative (see Type I and type II errors for definitions) should be interpreted that way in the context of a named entities recognition task:

  1. True Positive (TP): test identifies the same entity as in the gold standard
  2. False Positive (FP): test identifies a different entity than what is in the gold standard
  3. True Negative (TN): test identifies no entity; gold standard has no entity
  4. False Negative (FN): test identifies no entity, but gold standard has one

Then we have a series of metrics that can be used to measure the performance of the system:

  1. Precision: the proportion of properly predicted publishers amongst all the predictions that have been made (TP / (TP + FP))
  2. Recall: the proportion of properly predicted publishers amongst all of the publishers that exist in the gold standard (TP / (TP + FN))
  3. Accuracy: the proportion of correctly classified test instances; the publishers that could be identified by the system, and the ones that couldn’t (the web pages for which no publisher could be identified). ((TP + TN) / (TP + TN + FP + FN))
  4. f1: the test’s equally weighted combination of precision and recall
  5. f2: the test’s weighted combination of precision and recall, with a preference for recall
  6. f0.5: the test’s weighted combination of precision and recall, with a preference for precision.

The F-score measures the accuracy of the general prediction system; it combines precision and recall as their harmonic mean. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall.
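
All three F-measures are instances of the general F-beta formula; a small Clojure sketch (not part of the demo’s code) makes the relationship explicit:

(defn f-measure
  "General F-beta measure: beta > 1 favours recall, beta < 1 favours precision."
  [beta precision recall]
  (let [b2 (* beta beta)]
    (/ (* (+ 1 b2) precision recall)
       (+ (* b2 precision) recall))))

;; f1   = (f-measure 1   p r)
;; f2   = (f-measure 2   p r)
;; f0.5 = (f-measure 0.5 p r)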

In general, I think that the metric that gives the best overall view of the performance of this named entities recognition system is accuracy. I emphasize those test results below.

Running The Tests

The goal with these tests is to run the gold standard calculation procedure with different datasets that exist in the Cognonto knowledge base to see the impact of including/excluding these datasets on the gold standard metrics.

Baseline: No Dataset

The first step is to create the starting basis that includes no dataset. Then we will add different datasets, and try different combinations, when computing against the gold standard such that we know the impact of each on the metrics.

(table (generate-stats :js :execute :datasets []))
True positives:  2
False positives:  5
True negatives:  19
False negatives:  485

+--------------+--------------+
| key          | value        |
+--------------+--------------+
| :precision   | 0.2857143    |
| :recall      | 0.0041067763 |
| :accuracy    | 0.04109589   |
| :f1          | 0.008097166  |
| :f2          | 0.0051150895 |
| :f0.5        | 0.019417476  |
+--------------+--------------+

One Dataset Only

Now, let’s see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indicator of the inherent impact of each dataset on the prediction task.

Wikipedia (via DBpedia) Only

Let’s test the impact of adding a single general purpose dataset, the publicly available: Wikipedia (via DBpedia):

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"]))
True positives:  121
False positives:  57
True negatives:  19
False negatives:  314

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.6797753  |
| :recall      | 0.27816093 |
| :accuracy    | 0.2739726  |
| :f1          | 0.39477977 |
| :f2          | 0.31543276 |
| :f0.5        | 0.52746296 |
+--------------+------------+

Freebase Only

Now, let’s test the impact of adding another single general purpose dataset, this one the publicly available: Freebase:

(table (generate-stats :js :execute :datasets ["http://rdf.freebase.com/ns/"]))
True positives:  11
False positives:  14
True negatives:  19
False negatives:  467

+--------------+-------------+
| key          | value       |
+--------------+-------------+
| :precision   | 0.44        |
| :recall      | 0.023012552 |
| :accuracy    | 0.058708414 |
| :f1          | 0.043737575 |
| :f2          | 0.028394425 |
| :f0.5        | 0.09515571  |
+--------------+-------------+

USPTO Only

Now, let’s test the impact of adding still a different publicly available specialized dataset: USPTO:

(table (generate-stats :js :execute :datasets ["http://www.uspto.gov"]))
True positives:  6
False positives:  13
True negatives:  19
False negatives:  473

+--------------+-------------+
| key          | value       |
+--------------+-------------+
| :precision   | 0.31578946  |
| :recall      | 0.012526096 |
| :accuracy    | 0.04892368  |
| :f1          | 0.024096385 |
| :f2          | 0.015503876 |
| :f0.5        | 0.054054055 |
+--------------+-------------+

Private Dataset #1

Now, let’s test the first private dataset:

(table (generate-stats :js :execute :datasets ["http://cognonto.com/datasets/private/1/"]))
True positives:  231
False positives:  109
True negatives:  19
False negatives:  152

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.67941177 |
| :recall      | 0.60313314 |
| :accuracy    | 0.4892368  |
| :f1          | 0.6390042  |
| :f2          | 0.61698717 |
| :f0.5        | 0.6626506  |
+--------------+------------+

Private Dataset #2

And, then, the second private dataset:

(table (generate-stats :js :execute :datasets ["http://cognonto.com/datasets/private/2/"]))
True positives:  24
False positives:  21
True negatives:  19
False negatives:  447

+--------------+-------------+
| key          | value       |
+--------------+-------------+
| :precision   | 0.53333336  |
| :recall      | 0.050955415 |
| :accuracy    | 0.08414873  |
| :f1          | 0.093023255 |
| :f2          | 0.0622084   |
| :f0.5        | 0.1843318   |
+--------------+-------------+

Combined Datasets – Public Only

A more realistic analysis is to use a combination of datasets. Let’s see what happens to the performance metrics if we start combining public datasets.

Wikipedia + Freebase

First, let’s start by combining Wikipedia and Freebase.

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://rdf.freebase.com/ns/"]))
True positives:  126
False positives:  60
True negatives:  19
False negatives:  306

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.67741936 |
| :recall      | 0.29166666 |
| :accuracy    | 0.28375733 |
| :f1          | 0.407767   |
| :f2          | 0.3291536  |
| :f0.5        | 0.53571427 |
+--------------+------------+

Adding the Freebase dataset to the DBpedia one had the following effects on the different metrics:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | -0.03%      |
| recall    | +4.85%      |
| accuracy  | +3.57%      |
| f1        | +3.29%      |
| f2        | +4.34%      |
| f0.5      | +1.57%      |
+-----------+-------------+

As we can see, the impact of adding Freebase to the knowledge base is positive even if not ground breaking considering the size of the dataset.
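
The “Impact in %” figures in these tables appear to be the relative change of each metric between the two runs; a small sketch (not part of the demo’s code) shows how they can be reproduced:

(defn impact-in-percent
  "Relative change, in percent, of a metric between two runs."
  [before after]
  (* 100.0 (/ (- after before) before)))

;; e.g. recall going from 0.27816093 (Wikipedia only) to 0.29166666
;; (Wikipedia + Freebase):
;; (impact-in-percent 0.27816093 0.29166666) => ~4.85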

Wikipedia + USPTO

Let’s switch Freebase for the specialized public dataset, USPTO.

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://www.uspto.gov"]))
True positives:  122
False positives:  59
True negatives:  19
False negatives:  311

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.67403316 |
| :recall      | 0.2817552  |
| :accuracy    | 0.27592954 |
| :f1          | 0.39739415 |
| :f2          | 0.31887087 |
| :f0.5        | 0.52722555 |
+--------------+------------+

Adding the USPTO dataset to the DBpedia one instead of Freebase had the following effects on the different metrics:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | -0.83%      |
| recall    | +1.29%      |
| accuracy  | +0.73%      |
| f1        | +0.65%      |
| f2        | +1.07%      |
| f0.5      | +0.03%      |
+-----------+-------------+

As we might have expected, the gains are smaller than with Freebase, probably in part because USPTO is smaller and more specialized. Because it is more specialized (covering enterprises that have patents registered in the US), the gold standard may not represent well the organizations belonging to this dataset. In any case, these are still gains.

Wikipedia + Freebase + USPTO

Let’s continue and now include all three datasets.

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://www.uspto.gov"
                                               "http://rdf.freebase.com/ns/"]))
True positives:  127
False positives:  62
True negatives:  19
False negatives:  303

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.6719577  |
| :recall      | 0.29534882 |
| :accuracy    | 0.2857143  |
| :f1          | 0.41033927 |
| :f2          | 0.3326349  |
| :f0.5        | 0.53541315 |
+--------------+------------+

Now let’s see the impact of adding both Freebase and USPTO to the Wikipedia dataset:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | +1.14%      |
| recall    | +6.18%      |
| accuracy  | +4.30%      |
| f1        | +3.95%      |
| f2        | +5.45%      |
| f0.5      | +1.51%      |
+-----------+-------------+

Now let’s see the impact of using highly curated, domain-related private datasets.

Combined Datasets – Public enhanced with private datasets

The next step is to add the private datasets of highly curated data that are specific to the domain of identifying web page publisher organizations. As the baseline, we will use the three public datasets (Wikipedia, Freebase and USPTO) and then add the private datasets.

Wikipedia + Freebase + USPTO + PD #1

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://www.uspto.gov"
                                               "http://rdf.freebase.com/ns/"
                                               "http://cognonto.com/datasets/private/1/"]))
True positives:  279
False positives:  102
True negatives:  19
False negatives:  111

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.7322835  |
| :recall      | 0.7153846  |
| :accuracy    | 0.58317024 |
| :f1          | 0.7237354  |
| :f2          | 0.7187017  |
| :f0.5        | 0.7288401  |
+--------------+------------+

Now, let’s see the impact of adding the private dataset #1 along with Wikipedia, Freebase and USPTO:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | +8.97%      |
| recall    | +142.22%    |
| accuracy  | +104.09%    |
| f1        | +76.38%     |
| f2        | +116.08%    |
| f0.5      | +36.12%     |
+-----------+-------------+

Adding the highly curated and domain-specific private dataset #1 had a dramatic impact on all the metrics of the combined public datasets. Conversely, let’s see what impact the public datasets have on the metrics of private dataset #1 compared with using it alone:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | +7.77%      |
| recall    | +18.60%     |
| accuracy  | +19.19%     |
| f1        | +13.25%     |
| f2        | +16.50%     |
| f0.5      | +9.99%      |
+-----------+-------------+

As we can see, the public datasets do significantly increase the performance of the highly curated and domain-specific private dataset #1.

Wikipedia + Freebase + USPTO + PD #2

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://www.uspto.gov"
                                               "http://rdf.freebase.com/ns/"
                                               "http://cognonto.com/datasets/private/2/"]))
True positives:  138
False positives:  69
True negatives:  19
False negatives:  285

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.6666667  |
| :recall      | 0.32624114 |
| :accuracy    | 0.3072407  |
| :f1          | 0.43809524 |
| :f2          | 0.36334914 |
| :f0.5        | 0.55155873 |
+--------------+------------+

Not all of the private datasets have an equivalent impact. Let’s see the effect of adding private dataset #2 instead of #1:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | -0.78%      |
| recall    | +10.46%     |
| accuracy  | +7.52%      |
| f1        | +6.75%      |
| f2        | +9.23%      |
| f0.5      | +3.00%      |
+-----------+-------------+

Wikipedia + Freebase + USPTO + PD #1 + PD #2

Now let’s see what happens when we use all the public and private datasets.

(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"
                                               "http://www.uspto.gov"
                                               "http://rdf.freebase.com/ns/"
                                               "http://cognonto.com/datasets/private/1/"
                                               "http://cognonto.com/datasets/private/2/"]))
True positives:  285
False positives:  102
True negatives:  19
False negatives:  105

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.7364341  |
| :recall      | 0.7307692  |
| :accuracy    | 0.59491193 |
| :f1          | 0.7335907  |
| :f2          | 0.7318952  |
| :f0.5        | 0.7352941  |
+--------------+------------+

Let’s see the impact of adding the private datasets #1 and #2 to the public datasets:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | +9.60%      |
| recall    | +147.44%    |
| accuracy  | +108.22%    |
| f1        | +78.77%     |
| f2        | +120.02%    |
| f0.5      | +37.31%     |
+-----------+-------------+

Adding Unknown Entities Tagger

There is one last feature of the Cognonto publisher analyzer: it can identify unknown entities on a web page. (An “unknown entity” is a likely organization entity that does not already exist in the KB.) Sometimes the unknown entity is the publisher of the web page.
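
Conceptually, the idea is straightforward; the sketch below uses hypothetical helper names (not the actual analyzer API): candidate organization mentions that cannot be linked to any entity in the loaded datasets are flagged as unknown, and such an unlinked candidate can still be proposed as the page’s publisher.

;; Simplified sketch with hypothetical helpers; not the actual Cognonto API.
(defn unknown-entities
  "Candidate organization mentions that cannot be linked to any KB entity.
   `kb-lookup` maps a mention to a KB entity, or returns nil."
  [candidate-org-mentions kb-lookup]
  (remove kb-lookup candidate-org-mentions))

(defn predict-publisher
  "Prefer a KB-linked prediction; otherwise fall back to the first unknown candidate."
  [known-predictions candidate-org-mentions kb-lookup]
  (or (first known-predictions)
      (first (unknown-entities candidate-org-mentions kb-lookup))))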

(table (generate-stats :js :execute :datasets :all))
True positives:  345
False positives:  104
True negatives:  19
False negatives:  43

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.76837415 |
| :recall      | 0.88917524 |
| :accuracy    | 0.7123288  |
| :f1          | 0.82437277 |
| :f2          | 0.86206895 |
| :f0.5        | 0.78983516 |
+--------------+------------+

As we can see, the overall accuracy improved by a further 19.73% when the unknown entities are considered, compared to using the public and private datasets alone:

+-----------+-------------+
| metric    | Impact in % |
+-----------+-------------+
| precision | +4.33%      |
| recall    | +21.67%     |
| accuracy  | +19.73%     |
| f1        | +12.37%     |
| f2        | +17.79%     |
| f0.5      | +7.42%      |
+-----------+-------------+
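
Using the helper functions sketched earlier (again, assumptions for illustration rather than the post’s own code), this last table can be reproduced, up to rounding, from the two confusion matrices printed above:

(impact (metrics 285 102 19 105)   ; all public + private datasets
        (metrics 345 104 19  43))  ; same, plus the unknown entities tagger
;; => precision ~ +4.3%, recall ~ +21.7%, accuracy ~ +19.7%,
;;    f1 ~ +12.4%, f2 ~ +17.8%, f0.5 ~ +7.4%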

Analysis

When we first tested the system with single datasets, some of them scored better than others on most of the metrics. Does that mean we could simply use the best one and be done with it? No. What this analysis tells us is that some datasets score better for this particular set of web pages: they cover more of the entities found in those pages. However, even if a dataset scores lower, it is not useless. The lower-scoring dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the overall prediction power of the system. This is what we see when adding the private datasets to the public ones.

Even if the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still benefits greatly from the contribution of the public datasets, which significantly improve its accuracy. We got a gain of 19.19% in accuracy by adding the public datasets to the better-scoring private dataset #1. Nearly 20% of improvement in such a predictive system is highly significant.

Another thing that this series of tests tends to demonstrate is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets doesn’t appear to lower the overall performance of the system (even if I am sure that some could), and generally the more the better (although more doesn’t necessarily produce significant accuracy increases).

Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improved the overall accuracy by another 19.73%. How is that possible? To understand this, we have to consider the domain: random web pages that exist on the Web. A web page can be published by anybody and any organization, so the long tail of web page publishers is probably quite long. Considering this, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why a system that can detect and predict unknown entities as the publishers of web pages has such a significant impact on the overall accuracy. Flagging these “unknown” entities also tells us where to focus efforts to extend the known database of existing publishers.

Conclusion

As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others, but overall, each dataset contributes to the improvement of the predictions.

Posted at 15:00


October 02

W3C Read Write Web Community Group: Read Write Web — Q3 Summary — 2016

Summary

The community group celebrates its 5th birthday this quarter.  With almost 3000 posts (roughly 2 per day) from around 100 members, a large number of topics have been raised, discussed and resolved.  A big thank you to everyone who has been involved!

On the subject of statistics, there was a great paper produced by AKSW: LODStats: The Data Web Census Dataset which provides a comprehensive picture of the current state of a significant part of the Data Web.  There was also a status update from the LDP Next Community Group and Data on the Web Best Practices is now a Candidate Recommendation.

TPAC 2016 got under way in Lisbon.  While there was not a dedicated RWW session this year, many members of the group attended various related topics.  There was some interest reported around the work on Verified Claims, which hopes to form a working group quite soon.

Communications and Outreach

Apart from TPAC, I was able to attend the 3rd annual Hackers Congress at Paralelni Polis, which aims to spread ideas of decentralization in technology.  I was able to interact with some thought leaders in the cryptocurrency space and to try to explain the decentralized nature of the web and how it can grow organically using standards to read and write.  I also got a chance to talk to people from the remote storage project.

 

Community Group

Having reached the 5-year milestone, it is perhaps a good time to reflect on the direction of the community group.  Do we want to keep going as we are, focus on specific topics, be more discussion oriented, or be more standards-creation oriented?  I’ll send out a questionnaire on this.

A thread on ways to (re) decentralize the web generated some discussion.  There was also some discussion around the Internet of Things and a possible new framework for using Linked Data to read and write.


Applications

More work has been done on modularizing the Solid linked data browser/editor into separate modular chunks (solid-ui, solid-app-set) that can be used to create apps on data, using a JavaScript shim.  An analogy I like to think of is that RSS is a structured data format, but with some code it can become a useful application; solid-app-set allows this to happen for any class of data.  I am really enjoying this paradigm and have started to translate the apps I write.  Here is an example of a playlist pane translation of a client-side app.  Tim has written a lot more of these in the same repo.

Node solid server has progressed, with the permissions system broken out into its own module, solid permissions.  Improvements have also been made to the profile UI and the dashboard, which are still works in progress.

Much progress has been made on the document editor, dokie.li, which is now also driving the Linked Data Notifications spec towards Candidate Recommendation and integrating with the Web Annotations specs.  A slightly older screencast of the functionality is available here, but I am told new ones will be published very soon.


Last but not Least…

We welcome the launch of Cognonto.

Cognonto (a portmanteau of ‘cognition’ and ‘ontology’) exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact and entity extraction and tagging.

Check out a sample term, or read more in this comprehensive blog post.

Posted at 16:04

Copyright of the postings is owned by the original blog authors. Contact us.