Planet RDF

It's triples all the way down

February 24

AKSW Group - University of Leipzig: The USPTO Linked Patent Dataset release

Dear all,

We are happy to announce the release of the USPTO Linked Patent Dataset.

Patents are widely used to protect intellectual property and as a measure of innovation output. Each year, the USPTO grants over 150,000 patents to individuals and companies all over the world. In fact, there were more than 200,000 patent grants issued in the US in 2013. However, accessing, searching and analyzing those patents is often still cumbersome and inefficient.

Our dataset is the output of converting USPTO XML patent data into RDF for the years 2002–2016. This supports integration with other data sources in order to further simplify use cases such as trend analysis, structured patent search & exploration and societal progress measurements.

The USPTO Linked Patent Dataset contains 13,014,651 entities, of which 2,355,579 are patents. Other entities represent applicants, inventors, agents, examiners (primary and secondary) and assignees. All these entities amount to ca. 168 million triples describing the patent information.

The complete description of the dataset and SPARQL endpoint are available on the DataHub: https://datahub.io/dataset/linked-uspto-patent-data.

We really appreciate feedback and are open to collaborations.
If you happen to have use cases utilizing this dataset, please contact us.

 

Posted at 16:18

February 22

AKSW Group - University of Leipzig: Two accepted papers in ESWC 2017

Hello Community! We are very pleased to announce the acceptance of two papers in the ESWC 2017 research track. ESWC 2017 will be held in Portoroz, Slovenia, from the 28th of May to the 1st of June. In more detail, we will present the following papers:

  1. “WOMBAT – A Generalization Approach for Automatic Link Discovery” Mohamed Ahmed Sherif, Axel-Cyrille Ngonga Ngomo, Jens Lehmann

    Abstract. A significant portion of the evolution of Linked Data datasets lies in updating the links to other datasets. An important challenge when aiming to update these links automatically under the open-world assumption is the fact that usually only positive examples for the links exist. We address this challenge by presenting and evaluating WOMBAT, a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. WOMBAT is based on generalisation via an upward refinement operator to traverse the space of link specifications. We study the theoretical characteristics of WOMBAT and evaluate it on 8 different benchmark datasets. Our evaluation suggests that WOMBAT outperforms state-of-the-art supervised approaches while relying on less information. Moreover, our evaluation suggests that WOMBAT’s pruning algorithm allows it to scale well even on large datasets.

  2. “All That Glitters is not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking” Kunal Jha, Michael Röder and Axel-Cyrille Ngonga Ngomo

    Abstract. The evaluation of Named Entity Recognition as well as Entity Linking systems is mostly based on manually created gold standards. However, the current gold standards have three main drawbacks. First, they do not share a common set of rules pertaining to what is to be marked and linked as an entity. Moreover, most of the gold standards have not been checked by other researchers after they have been published and hence commonly contain mistakes. Finally, they lack actuality as in most cases the reference knowledge base used to link the entities has been refined over time while the gold standards are typically not updated to the newest version of the reference knowledge base. In this work, we analyze existing gold standards and derive a set of rules for annotating documents for named entity recognition and entity linking. We derive Eaglet, a tool that supports the semi-automatic checking of a gold standard based on these rules. A manual evaluation of Eaglet’s results shows that it achieves an accuracy of up to 88% when detecting errors. We apply Eaglet to 13 gold standards and detect 38,453 errors. An evaluation of 10 tools on a subset of these datasets shows a performance difference of up to 10% micro F-measure on average.

 

Acknowledgments
This work has been supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation action SLIPO (GA no. 731581), the BMWi project SAKE (project no. 01MD15006E), the BMBF project DIESEL (project no. 01QE1512C) and the BMWi project GEISER (project no. 01MD16014).

Posted at 16:43

Leigh Dodds: Open Data Camp Pitch: Mapping data ecosystems

I’m going to

Posted at 10:39

February 11

Sandro Hawke: Testing RSS.

Just playing around. I figured no one’s reading this anyway.

 

(I’m trying the slack-rss integration.  It worked, after a minute or two.   Let’s try an update…)

 


Posted at 19:08

February 09

AKSW Group - University of Leipzig: AKSW Colloquium, 13th February, 3pm, Evaluating Entity Linking

On the 13th of February at 3 PM, Michael Röder will present the two papers “Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job” by van Erp et al. and “Moving away from semantic overfitting in disambiguation datasets” by Postma et al. in P702.

Abstract 1

Entity linking has become a popular task in both natural language processing and semantic web communities. However, we find that the benchmark datasets for entity linking tasks do not accurately evaluate entity linking systems. In this paper, we aim to chart the strengths and weaknesses of current benchmark datasets and sketch a roadmap for the community to devise better benchmark datasets.

Abstract 2

Entities and events in the world have no frequency, but our communication about them and the expressions we use to refer to them do have a strong frequency profile. Language expressions and their meanings follow a Zipfian distribution, featuring a small amount of very frequent observations and a very long tail of low frequent observations. Since our NLP datasets sample texts but do not sample the world, they are no exception to Zipf’s law. This causes a lack of representativeness in our NLP tasks, leading to models that can capture the head phenomena in language, but fail when dealing with the long tail. We therefore propose a referential challenge for semantic NLP that reflects a higher degree of ambiguity and variance and captures a large range of small real-world phenomena. To perform well, systems would have to show deep understanding on the linguistic tail.

The papers are available at lrec-conf.org and aclweb.org.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 14:53

February 03

AKSW Group - University of Leipzig: SLIPO project kick-off meeting

SLIPO, a new InfAI project, kicked off between the 18th and 20th of January in Athens, Greece. Funded by the EU programme Horizon 2020, the project will run until the 31st of December 2019.

Scalable Linking and Integration of Big POI Data (SLIPO) aims to transfer the results of the GeoKnow research to the specific challenges of POI data, which is becoming increasingly indispensable in fields such as tracking, logistics and tourism. Furthermore, we plan to improve the scalability of our key research frameworks, such as LIMES, DEER and LinkedGeoData.

For a closer look, please visit: http://aksw.org/Projects/SLIPO.html


This project has received funding from the European Union’s H2020 research and innovation action program under grant agreement number 731581.

Posted at 11:59

February 02

Dublin Core Metadata Initiative: From MARC silos to Linked Data silos? Data models for bibliographic Linked Data

Many libraries are experimenting with publishing their metadata as Linked Data to open up bibliographic silos, usually based on MARC records, to the Web. The libraries that have published Linked Data have all used different data models for structuring their bibliographic data. Some are using a FRBR-based model where Works, Expressions and Manifestations are represented separately. Others have chosen basic Dublin Core, dumbing down their data into a lowest common denominator format. And still others are using variations of BIBFRAME. The proliferation of data models limits the reusability of bibliographic data. In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. There is currently no universal model for how to represent bibliographic metadata as Linked Data, even though many attempts at such a model have been made. In this webinar, Osma Suominen of the National Library of Finland will present: (1) a survey of published bibliographic Linked Data, the data models proposed for representing bibliographic data as RDF, and tools used for conversion from MARC records; (2) an analysis of different use cases for bibliographic Linked Data and how they affect the data model; and (3) recommendations for choosing a data model. For additional information and to register, visit the webinar's webpage at http://dublincore.org/resources/training/#2017suominen. Registration is managed by ASIS&T.

Posted at 23:59

January 30

AKSW Group - University of Leipzig: AKSW Colloquium 30.Jan.2017

In the upcoming Colloquium, Simon Bin will discuss the paper “Towards Analytics Aware Ontology Based Access to Static and Streaming Data” by Evgeny Kharlamov et al., which was presented at ISWC 2016.

  Abstract

Real-time analytics that requires integration and aggregation of heterogeneous and distributed streaming and static data is a typical task in many industrial scenarios such as diagnostics of turbines in Siemens. OBDA approach has a great potential to facilitate such tasks; however, it has a number of limitations in dealing with analytics that restrict its use in important industrial applications. Based on our experience with Siemens, we argue that in order to overcome those limitations OBDA should be extended and become analytics, source, and cost aware. In this work we propose such an extension. In particular, we propose an ontology, mapping, and query language for OBDA, where aggregate and other analytical functions are first class citizens. Moreover, we develop query optimisation techniques that allow to efficiently process analytical tasks over static and streaming data. We implement our approach in a system and evaluate our system with Siemens turbine data.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/public/colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 12:24

January 22

Bob DuCharme: Brand-name companies using SPARQL: the sparql.club

Disney! Apple! Amazon! MasterCard!

Posted at 14:37

January 20

AKSW Group - University of Leipzig: AKSW Colloquium, 23.01.2017, Automatic Mappings of Tables to Knowledge Graphs and Open Table Extraction

Automatic Mappings of Tables to Knowledge Graphs and Open Table Extraction

At the upcoming colloquium on 23.01.2017, Ivan Ermilov will present his work on automatic mappings of tables to knowledge graphs, which was published as “TAIPAN: Automatic Property Mapping for Tabular Data” at the EKAW 2016 conference, as well as extensions of this work, including:

  • Open Table Extraction (OTE) approach, i.e. how to generate meaningful information from a big corpus of tables.
  • How to benchmark OTE and which benchmarks are available.
  • OTE use cases and applications.

 

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 13:02

Leigh Dodds: Mega-City One: Smart City

“A smart city is an urban development vision to integrate multiple information and communication technology (ICT) and Internet of Things (IoT) solutions in a secure fashion to manage a city’s assets – the city’s assets include, but are not limited to, local departments’ information systems, schools, libraries, transportation systems, hospitals, power plants, water supply networks, waste management, law enforcement, and other community services…ICT allows city officials to interact directly with the community and the city infrastructure and to monitor what is happening in the city, how the city is evolving, and how to enable a better quality of life. Through the use of sensors integrated with real-time monitoring systems, data are collected from citizens and devices – then processed and analyzed. The information and knowledge gathered are keys to tackling inefficiency.” –

Posted at 10:56

January 14

Leigh Dodds: A river of research, not news

I already hate the phrase “fake news”. We have better words to describe lies, disinformation, propaganda and slander, so let’s just use those.

While the phrase “fake news” might

Posted at 12:47

January 09

AKSW Group - University of Leipzig: PRESS RELEASE: “HOBBIT so far.” is now available

The latest release reports on the conferences our team attended in 2016 as well as on the published blog posts. Furthermore, it gives a short analysis of the survey through which we are able to verify the requirements for our benchmarks and the new HOBBIT platform. Last but not least, the release gives a short outlook on our plans for 2017, including the founding of the HOBBIT association.

Have a look at the whole press release on the HOBBIT website.

Posted at 13:22

January 01

W3C Read Write Web Community Group: Read Write Web — Q4 Summary — 2016

Summary

An eventful 2016 draws to a close; steady progress has been made on standards, implementations and apps for reading and writing to the web.  Some press coverage is starting to emerge, with pieces on a decentralized web and putting data back in the hands of owners.  Also published was a nice review and predictions for trends in 2017 on the (semantic) web.

Linked Data Notifications has become a Candidate Recommendation and is expected to become a full Recommendation next year, as the Working Group has been extended slightly.  Data on the Web Best Practices is also now a Proposed Recommendation.

In the community group a few apps were released and there was a renewed interest in WebID delegation, with a browser extension produced that allows one to delegate authentication to a trusted agent.  I also put together a spec, library and implementation for dealing with GroupURIs.

Communications and Outreach

Some collaboration took place on the WebID front.  There was also an innovative spec put together for working with X.509 fingerprints, some inquiry into using WebID for cultural purposes with Drupal, and interest from divvy dao, qbix and others in implementing Solid.

Some neat slides were created by dmitri on understanding linked data and an introduction to solid.

 

Community Group

A browser extension has been published that allows WebID delegation to function natively in the browser.  One advantage of this is that it makes authentication a point of flexibility, allowing login via password, OAuth or other services to a delegated agent, which can then be trusted via PKI.

I wrote a little spec regarding GroupURIs which, given a set of participants in a group, allows one to deterministically create a new URI for that group that can be generated independently.  This leads to the possibility of things like group chat rooms being created dynamically.  I also have a working implementation and prototype of the spec, which allows you to create conference calls with people you know over WebRTC.
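The spec itself defines the exact construction; purely to illustrate the general idea (and not necessarily the algorithm the spec mandates), a deterministic group URI could be minted in Clojure by hashing the sorted participant WebIDs, so that every participant derives the same URI independently:

(import 'java.security.MessageDigest)
(require '[clojure.string :as string])

(defn group-uri
  "Illustration only: derive a deterministic URI for a set of participant
   WebIDs by hashing their sorted concatenation. The real GroupURIs spec may
   use a different normalization, hash function or URI scheme."
  [base-uri participants]
  (let [canonical (string/join "\n" (sort participants))   ;; order-independent input
        digest    (.digest (MessageDigest/getInstance "SHA-256")
                           (.getBytes canonical "UTF-8"))
        hex       (apply str (map #(format "%02x" %) digest))]
    (str base-uri hex)))

;; Two participants computing this independently obtain the same group URI:
;; (group-uri "https://example.org/group/"
;;            ["https://alice.example/profile#me" "https://bob.example/profile#me"])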


Applications

A fair amount of work has been done refactoring the main Solid reference implementations and making them production-ready.  For example, the permissions module is now standalone, and both gold and node-solid-server now support WebID with delegation, with gold also offering a proxy server.

Work has continued on adding openid connect to solid and there is a live test server that is prototyping the new login flow.  A new linked data server in perl has been written by kjetilk.  The linked data document editor, dokie.li, continues to improve, and as promised here, is a screencast showing off the new features.


Last but not Least…

Linked Data Fragments has now come up with some impressive demos using the Triple Pattern Fragment technique.  In addition to the endpoint on dbpedia, kudos to Ruben Verborgh who has published 25,000+ triples on his own site.  And finally, an excellent integration with wikidata, viaf and dbpedia, all working in one browser, was created.  Try it here!

Posted at 06:51

December 29

Libby Miller: Moving to Kolab from Google mail (“G-Suite”)

I had a seven year old google ‘business’ account (now called ‘G-Suite’, previously Google Apps, I think) from back when it was free. Danbri put me onto it, and it was brilliant because you can use your own domain with it. It’s been very useful, but I’ve been thinking of moving to a paid-for service for a while.

Posted at 16:18

December 23

Libby Miller: A simple Raspberry Pi-based picture frame using Flickr

I made this Raspberry Pi picture frame – initially with a screen – as a present for my parents for their wedding anniversary. After user testing, I realised that what they really wanted was a tiny version that they could plug into their TV, so I recently made a Pi Zero version to do that.

It uses a Raspberry Pi 3 or Pi Zero with full-screen Chromium. I’ve used Flickr as a backend: I made a new account and used their handy upload-by-email function (which you can set to make uploads private) so that all members of the family can send pictures to it.


I initially assumed that a good frame would be ambient – stay on the same picture for say, 5 minutes, only update itself once a day, and show the latest pictures in order. But user testing – and specifically an uproarious party where we were all uploading pictures to it and wanted to see them immediately – forced a redesign. It now updates itself every 15 minutes, uses the latest two plus a random sample from all available pictures, shows each picture for 20 seconds, and caches the images to make sure that I don’t use up all my parents’ bandwidth allowance.
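Since that selection rule is the heart of the frame's behaviour, here is a minimal sketch of it (in Clojure, purely as an illustration; the actual frame is a web page running in full-screen Chromium, not Clojure):

(defn pick-photos
  "Illustrative sketch of the selection rule: always keep the two most recent
   photos and add a random sample of the rest. `photos` is assumed to be a
   collection sorted newest first; `sample-size` is how many extras to mix in."
  [photos sample-size]
  (concat (take 2 photos)
          (take sample-size (shuffle (drop 2 photos)))))

;; The playlist would then be rebuilt every 15 minutes and each entry shown for 20 seconds.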

The only tricky technical issue is finding the best part of the picture to display. It will usually be a landscape display (although you could configure a Pi screen to be vertical), so that means you’re either going to get a lot of black onscreen or you’ll need to figure out a rule of thumb. On a big TV this is maybe less important. I never got amazing results, but I had a play with both heuristics and face detection, and both moderately improved matters.

It’s probably not a great deal different to what you’d get in any off the shelf electronic picture frame, but I like it because it’s fun and easy to make, configurable and customisable. And you can just plug it into your TV. You could make one for several members of a group or family based on the same set of pictures, too.

Version 1:

Posted at 14:13

Libby Miller: LIRC on a Raspberry Pi for a silver Apple TV remote

The idea here is to control a full-screen chromium webpage via a simple remote control (I’m using it for full-screen TV-like prototypes). It’s very straightforward really, but

  • I couldn’t find the right kind of summary of

Posted at 11:54

Libby Miller: Twitter for ESP 8266

I’ve been using the

Posted at 09:51

December 22

Bob DuCharme: A modern neural network in 11 lines of Python

And a great learning tool for understanding neural nets.

Posted at 12:52

December 16

AKSW Group - University of Leipzig: 4th Big Data Europe Plenary at Leipzig University


The meeting, hosted by our partner InfAI e. V., took place from the 14th to the 15th of December at the University of Leipzig.
In total, 29 attendees, including 15 partners, discussed and reviewed the progress of all work packages in 2016 and planned the activities and workshops taking place in the next six months.

On the second day, we talked about several societal challenge pilots in fields such as AgroKnow, transport and security. It was the last plenary of the year, and we thank everybody for their work in 2016. Big Data Europe and our partners are looking forward to 2017.

The next Plenary Meeting will be hosted by VU Amsterdam and will take place in Amsterdam, in June 2017.

Posted at 13:33

December 09

AKSW Group - University of Leipzig: SANSA 0.1 (Semantic Analytics Stack) Released

Dear all,

The Smart Data Analytics group / AKSW are very happy to announce SANSA 0.1 – the initial release of the Scalable Semantic Analytics Stack. SANSA combines distributed computing and semantic technologies in order to allow powerful machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

  • Support for reading and writing RDF files in N-Triples format
  • Support for reading OWL files in various standard formats
  • Querying and partitioning based on Sparqlify
  • Support for RDFS/RDFS Simple/OWL-Horst forward chaining inference
  • Initial RDF graph clustering support
  • Initial support for rule mining from RDF graphs

We want to thank everyone who helped to create this release, in particular, the projects Big Data Europe, HOBBIT and SAKE.

Kind regards,

The SANSA Development Team

Posted at 14:41

AKSW Group - University of Leipzig: AKSW wins award for Best Resources Paper at ISWC 2016 in Japan

Our paper, “LODStats: The Data Web Census Dataset”, won the award for Best Resources Paper at the recent ISWC 2016 conference in Kobe, Japan, the premier international forum for the Semantic Web and Linked Data community. The paper presents the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web.

Congrats to  Ivan Ermilov, Jens Lehmann, Michael Martin and Sören Auer.

Please find the complete list of winners here.

 

Posted at 14:05

November 30

Ebiquity research group UMBC: PhD Proposal: Ankur Padia, Dealing with Dubious Facts in Knowledge Graphs


Dissertation Proposal

Dealing with Dubious Facts
in Knowledge Graphs

Ankur Padia

1:00-3:00pm Wednesday, 30 November 2016, ITE 325b, UMBC

Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations among the pair of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high quality knowledge graphs but are expensive, time consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs but the extracted information can be of poor quality due to the presence of dubious facts.

An extracted fact is dubious if it is incorrect, inexact or correct but lacks evidence. A fact might be dubious because of the errors made by NLP extraction techniques, improper design consideration of the internal components of the system, choice of learning techniques (semi-supervised or unsupervised), relatively poor quality of heuristics or the syntactic complexity of underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts from a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.

Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)

Posted at 02:25

November 26

AKSW Group - University of Leipzig: AKSW Colloquium, 28.11.2016, NED using PBOH + Large-Scale Learning of Relation-Extraction Rules.

In the upcoming Colloquium, November the 28th at 3 PM, two papers will be presented:

Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

Diego Moussallem will discuss the paper “Probabilistic Bag-Of-Hyperlinks Model for Entity Linking” by Octavian-Eugen Ganea et al., which was accepted at WWW 2016.

Abstract:  Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods

Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web

Afterward, René Speck will present the paper “Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web” by Sebastian Krause et al., which was accepted at ISWC 2012.

Abstract: We present a large-scale relation extraction (RE) system which learns grammar-based RE rules from the Web by utilizing large numbers of relation instances as seed. Our goal is to obtain rule sets large enough to cover the actual range of linguistic variation, thus tackling the long-tail problem of real-world applications. A variant of distant supervision learns several relations in parallel, enabling a new method of rule filtering. The system detects both binary and n-ary relations. We target 39 relations from Freebase, for which 3M sentences extracted from 20M web pages serve as the basis for learning an average of 40K distinctive rules per relation. Employing an efficient dependency parser, the average run time for each relation is only 19 hours. We compare these rules with ones learned from local corpora of different sizes and demonstrate that the Web is indeed needed for a good coverage of linguistic variation

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:30

November 21

Frederick Giasson: Leveraging KBpedia Aspects To Generate Training Sets Automatically

In previous articles I have covered multiple ways to create training corpuses for unsupervised learning and positive and negative training sets for supervised learning1, 2, 3 using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive powers depending on the task at hand.

So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:

  1. Using the links that exist between each KBpedia reference concept and their related Wikipedia pages
  2. Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of
    named entities.

Now we will introduce a third way to create a different kind of training corpus:

  1. Using the KBpedia aspects linkages.

Aspects are aggregations of entities that are grouped according to characteristics different from their direct types. Aspects help to group related entities by situation, and not by identity or definition. It is another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.

Creating New Domain Using KBpedia Aspects

To continue with the musical domain, there exist two aspects of interest:

  1. Music
  2. Genres

What we will do first is to query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all the KBpedia reference concepts that are related to the Music or the Genre aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.

prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/> 
prefix schema: <http://schema.org/>

select distinct ?class count(distinct ?entity) as ?nb
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .

  graph <http://kbpedia.org/1.10/>
  {
    {?category <http://kbpedia.org/ontologies/kko#hasMusicAspect> ?class .}
    union
    {?category <http://kbpedia.org/ontologies/kko#hasGenre> ?class .}
  }
}
order by desc(?nb)
reference concept (?class) / number of entities (?nb)
http://kbpedia.org/kko/rc/Album-CW 128772
http://kbpedia.org/kko/rc/Song-CW 74886
http://kbpedia.org/kko/rc/Music 51006
http://kbpedia.org/kko/rc/Single 50661
http://kbpedia.org/kko/rc/RecordCompany 5695
http://kbpedia.org/kko/rc/MusicalComposition 5272
http://kbpedia.org/kko/rc/MovieSoundtrack 2919
http://kbpedia.org/kko/rc/Lyric-WordsToSong 2374
http://kbpedia.org/kko/rc/Band-MusicGroup 2185
http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup 2078
http://kbpedia.org/kko/rc/Ensemble 1438
http://kbpedia.org/kko/rc/Orchestra 1380
http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup 1335
http://kbpedia.org/kko/rc/Choir 754
http://kbpedia.org/kko/rc/Concerto 424
http://kbpedia.org/kko/rc/Symphony 299
http://kbpedia.org/kko/rc/Singing 154

Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and to create a new domain corpus with them. We will use the new version of KBpedia to create the full set of reference concepts that will scope our domain by inference.

Next we will try to use this information to create two totally different kinds of training corpuses:

  1. One that will rely on the links between the reference concepts and Wikipedia pages
  2. One that will rely on the linkages to external vocabularies to create a list of named entities that will be used as
    the training corpus

Creating Model With Reference Concepts

The first training corpus we want to test is one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first thing is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])


(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv" 
                                    "resources/aspects-corpus-normalized/")

Once pruned, we end up with a domain of 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:

;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned" "resources/semantic-interpreters/aspects-concept-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Then we have to evaluate this new model using the gold standard:

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  28
False positive:  0
True negative:  923
False negative:  66

Precision:  1.0
Recall:  0.29787233
Accuracy:  0.93510324
F1:  0.45901638

Now let’s try to find better hyperparameters using grid search:

(svm-grid-search "grid-search-aspects-concept-pruned-tests" 
                       "resources/svm/aspects-concept-pruned/" 
                       "resources/gold-standard-full.csv"
                       :selection-metric :f1
                       :grid-parameters [{:c [1 2 4 16 256]
                                          :e [0.001 0.01 0.1]
                                          :algorithm [:l2l2]
                                          :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.84444445 
 :c 1
 :e 0.001 
 :algorithm :l2l2
 :weight 30}

After running the grid search with these initial broad-range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this is the best score we have obtained for the full gold standard2, 3. Let’s see all of the metrics for this configuration:

(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights {1 30.0}
                 :v nil
                 :c 1 
                 :e 0.001
                 :algorithm :l2l2)

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  76
False positive:  10
True negative:  913
False negative:  18

Precision:  0.88372093
Recall:  0.80851066
Accuracy:  0.972468
F1:  0.84444445

These results are also the best balance between precision and recall that we have gotten so far2, 3. Better precision can be obtained if necessary but only at the expense of lower recall.

Let’s take a look at the improvements we got compared to the previous training corpuses we had:

  • Precision: +4.16%
  • Recall: +35.72%
  • Accuracy: +2.06%
  • F1: +20.63%

This new training corpus based on the KBpedia aspects, after hyperparameter optimization, did increase all the metrics we calculate. The most striking improvement is the recall, which improved by more than 35%.

Creating Model With Entities

The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.

The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:

(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc
  [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .
                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND() LIMIT " nb) kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                           (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))


(defn generate-domain-by-rcs 
  [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs] (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/"
                         "http://kbpedia.org/kko/rc/Concerto"
                         "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Religious"
                         "http://kbpedia.org/kko/rc/PunkMusic"
                         "http://kbpedia.org/kko/rc/BluesMusic"
                         "http://kbpedia.org/kko/rc/HeavyMetalMusic"
                         "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic"
                         "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/FolkMusic"
                         "http://kbpedia.org/kko/rc/Verse"
                         "http://kbpedia.org/kko/rc/RockBand"
                         "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain"
                         "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer"
                         "http://kbpedia.org/kko/rc/HouseMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry"
                         "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic"
                         "http://kbpedia.org/kko/rc/AlternativeRockBand"
                         "http://kbpedia.org/kko/rc/AlternativeRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Trance"
                         "http://kbpedia.org/kko/rc/Ensemble"
                         "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic"
                         "http://kbpedia.org/kko/rc/RockabillyMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Blues"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Opera"
                         "http://kbpedia.org/kko/rc/Choir"
                         "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Country"
                         "http://kbpedia.org/kko/rc/CountryMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-PopRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative"
                         "http://kbpedia.org/kko/rc/Chorus"
                         "http://kbpedia.org/kko/rc/FusionMusic"
                         "http://kbpedia.org/kko/rc/MovieSoundtrack"
                         "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque"
                         "http://kbpedia.org/kko/rc/MusicalComposition-NewAge"
                         "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop"
                         "http://kbpedia.org/kko/rc/TranceMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Celtic"
                         "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Baroque"
                         "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/Symphony"
                         "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll"
                         "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic"
                         "http://kbpedia.org/kko/rc/JazzMusic"
                         "http://kbpedia.org/kko/rc/MusicalChord"
                         "http://kbpedia.org/kko/rc/ProgressiveRockMusic"
                         "http://kbpedia.org/kko/rc/GothicMusic"
                         "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic"
                         "http://kbpedia.org/kko/rc/NationalAnthem"
                         "http://kbpedia.org/kko/rc/OldieSong"
                         "http://kbpedia.org/kko/rc/Song-Sung"
                         "http://kbpedia.org/kko/rc/RockMusic"
                         "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco"
                         "http://kbpedia.org/kko/rc/GospelMusic"
                         "http://kbpedia.org/kko/rc/BluegrassMusic"
                         "http://kbpedia.org/kko/rc/FolkRockMusic"
                         "http://kbpedia.org/kko/rc/RockAndRollMusic"
                         "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW"
                         "http://kbpedia.org/kko/rc/Tune"
                         "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/RapMusic"
                         "http://kbpedia.org/kko/rc/RecordCompany"
                         "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica"
                         "http://kbpedia.org/kko/rc/Music"
                         "http://kbpedia.org/kko/rc/GlamRockMusic"
                         "http://kbpedia.org/kko/rc/LoveSong"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Gothic"
                         "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk"
                         "http://kbpedia.org/kko/rc/BluesRockMusic"
                         "http://kbpedia.org/kko/rc/TechnoMusic"
                         "http://kbpedia.org/kko/rc/SoulMusic"
                         "http://kbpedia.org/kko/rc/ChamberMusicComposition"
                         "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition"
                         "http://kbpedia.org/kko/rc/ElectronicMusic"
                         "http://kbpedia.org/kko/rc/CompositionMovement"
                         "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/Riff"
                         "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Industrial"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Funk"
                         "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic"
                         "http://kbpedia.org/kko/rc/Single"
                         "http://kbpedia.org/kko/rc/Singing"
                         "http://kbpedia.org/kko/rc/SwingMusic"
                         "http://kbpedia.org/kko/rc/Song-CW"
                         "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz"
                         "http://kbpedia.org/kko/rc/ClassicalMusic"
                         "http://kbpedia.org/kko/rc/MilitaryBand"
                         "http://kbpedia.org/kko/rc/SkaMusic"
                         "http://kbpedia.org/kko/rc/Orchestra"
                         "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Ambient"
                         "http://kbpedia.org/kko/rc/DiscoMusic"] "resources/aspects-domain-corpus.csv")

Next let’s create the actual positive training corpus and let’s normalize it:

(cache-aspects-corpus "resources/aspects-entities-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")

We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base. These will be the 22 features of our model. The complete positive training set has 799 documents in it.

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-entities-corpus-dictionary.pruned.csv")

(build-semantic-interpreter "aspects-entities-pruned" "resources/semantic-interpreters/aspects-entities-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

(build-svm-model-vectors "resources/svm/aspects-entities-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

(train-svm-model "svm.aspects.entities.pruned" "resources/svm/aspects-entities-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Now let’s evaluate the model with default hyperparameters:

(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
True positive:  9
False positive:  10
True negative:  913
False negative:  85

Precision:  0.47368422
Recall:  0.095744684
Accuracy:  0.906588
F1:  0.15929204

Now let’s try to improve this F1 score using grid search:

(svm-grid-search "grid-search-aspects-entities-pruned-tests" 
                 "resources/svm/aspects-entities-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.44052863
:c 4
:e 0.001
:algorithm :l2l2
:weight 15}

We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why having a pipeline that can automatically create the training corpuses, optimize the hyperparameters and evaluate the models is more than welcome, since this is where a data scientist spends the bulk of the time when creating models.

Conclusion

After automatically creating multiple different positive and negative training sets, after testing multiple learning methods and optimizing hyperparameters, we found the best training sets with the best learning method and the best hyperparameters to create an initial, optimal model that has an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.

The thing that is really interesting and innovative in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the learner can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning4.

The most wonderful thing from an operational standpoint is that all of this searching, testing and optimizing can be performed by a computer automatically. The only tasks required of a human are to define the scope of a domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.

Posted at 11:14

November 17

Frederick Giasson: Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 2

In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.

So far, we have already reached 95% accuracy by tweaking the hyperparameters and the training corpuses, but the F1 score with the full gold standard is still around ~70%, which can be improved. There are also situations where precision should be nearly perfect (because false positives are really not acceptable) or where the recall should be optimized.

Here we will try to improve this situation by using ensemble learning, which uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have a vote and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble:

  1. Bootstrap aggregating (bagging)
  2. Asymmetric bagging 1
  3. Random subspace method (feature bagging)
  4. Asymmetric bagging + random subspace method (ABRS) 1
  5. Bootstrap aggregating + random subspace method (BRS)

Which strategy to use depends on factors such as whether the positive and negative training documents are unbalanced, how many features the model has, and so on. Let’s introduce each of these strategies.

Note that in this article I am only creating ensembles of linear SVM learners. However, an ensemble can be composed of multiple different kinds of learners, like SVMs with non-linear kernels, decision trees, etc. To simplify this article, we will stick to a single linear SVM with multiple different training corpuses and features.

Ensemble Learning Strategies

Bootstrap Aggregating (bagging)

The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in multiple training sets.
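As a minimal illustration of that sampling step (not code from the actual pipeline), drawing a bootstrap sample of training documents with replacement could look like this:

(defn bootstrap-sample
  "Draw n items from coll uniformly at random, with replacement."
  [coll n]
  (let [v (vec coll)]
    (repeatedly n #(rand-nth v))))

;; Each model of the ensemble gets its own bootstrap sample, e.g.:
;; (bootstrap-sample training-documents (count training-documents))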

Asymmetric Bagging

Asymmetric bagging has been proposed by Tao, Tang, Li and Wu 1 for cases where the number of positive training samples is heavily unbalanced relative to the negative training samples. The idea is to create a random subset (with replacement) of the negative training samples, while always keeping the full set of positive training samples.

Random Subspace method (feature bagging)

The idea behind feature bagging is the same as bagging, but it works on the features of the model instead of the training sets. It attempts to reduce the correlation between the estimators (the individual models) in an ensemble by training them on random samples of features instead of the entire feature set.
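Again purely as an illustration, selecting a random subspace of features could be sketched as follows (here drawn without replacement; the ABRS algorithm below draws with replacement):

(defn sample-features
  "Pick k distinct features at random from the full feature set."
  [features k]
  (take k (shuffle (vec features))))

;; Each model is then trained only on its own feature subset, e.g.:
;; (sample-features all-features 50)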

Asymmetric Bagging + Random Subspace method (ABRS)

Asymmetric bagging combined with the random subspace method has also been proposed by Tao, Tang, Li and Wu 1. The problems they had with their content-based image retrieval system are the same ones we have with this kind of automatic training corpus generated from a knowledge graph:

  1. SVM is unstable on small-sized training sets
  2. SVM’s optimal hyperplane may be biased when the positive training sample is much smaller than the negative feedback sample (this is why we used weights in this case), and
  3. The training set is smaller than the number of features in the SVM model.

The third point is not immediately an issue for us (except if you have a domain with many more features than we had in our example), but becomes one when we start using asymmetric bagging.

What we want to do here is to implement asymmetric bagging and the random subspace method to create dynamic_learning_859c7c3c1099242193bc675bd7b1bf25c900754e number of individual models. This method is called ABRS-SVM which stands for Asymmetric Bagging Random Subspace Support Vector Machines.

The algorithm we will use is:

  1. Let the number of positive training documents be p, the number of negative training documents be n, and the number of features in the training data be f.
  2. Choose k, the number of individual models in the ensemble.
  3. For each individual model i, choose n_i (where n_i < n) to be the number of negative training documents for model i.
  4. For each individual model i, choose f_i (where f_i < f) to be the number of features for model i.
  5. For each individual model i, create a training set by choosing f_i features from f with replacement, by choosing n_i negative training documents from n with replacement, and by keeping all p positive training documents; then train the model.
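A minimal sketch of this construction is shown below; the function and parameter names are illustrative placeholders, not the functions used later in this article:

;; Illustrative sketch of the ABRS training-set construction described above.
;; `positives` and `negatives` are collections of training documents and
;; `features` is the full feature set; all names are placeholders.
(defn abrs-training-sets
  [positives negatives features k nb-negatives nb-features]
  (let [negs  (vec negatives)
        feats (vec features)]
    (for [_ (range k)]
      {:positives positives                                           ;; always keep the full positive set
       :negatives (vec (repeatedly nb-negatives #(rand-nth negs)))    ;; asymmetric bagging (with replacement)
       :features  (vec (repeatedly nb-features #(rand-nth feats)))}))) ;; random subspace (with replacement)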

Bootstrap Aggregating + Random Subspace method (BRS)

Bagging combined with feature bagging is the same as asymmetric bagging with the random subspace method, except that we use regular bagging instead of asymmetric bagging. (ABRS should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise BRS should be used.)

SVM Learner

We use a linear Support Vector Machine (SVM) as the learner for the ensemble. What we will be creating is a series of SVM models that will differ depending on the ensemble method(s) used to create the ensemble.

Build Training Document Vectors

The first step is to create a structure where all the positive and negative training documents have their vector representation. Since this is the task that takes the most time in the whole process, we calculate the vectors using the (build-svm-model-vectors) function and serialize the structure on the file system. That way, to create the ensemble’s models, we only have to load it from the file system without having to re-calculate it each time.

Train, Classify and Evaluate Ensembles

The goal is to create a set of X SVM classifiers, each of which uses a different model. The models can differ in their features or their training corpus. Each classifier then tries to classify an input text according to its own model. Finally, each classifier votes to determine whether that input text belongs, or not, to the domain.

There are four hyperparameters related to ensemble learning:

  1. The mode to use
  2. The number of models we want to create in the ensemble
  3. The number of training documents we want in the training corpus, and
  4. The number of features.

Other hyperparameters could include those of the linear SVM classifier, but in this example we will simply reuse the best parameters we found above. We now train the ensemble using the (train-ensemble-svm) function.

Once the ensemble is created and trained, we use the (classify-ensemble-text) function to classify an input text using the ensemble we created. That function takes two parameters: :mode, which is the ensemble’s mode, and :vote-acceptance-ratio, which defines the proportion of positive votes required for the ensemble to positively classify the input text. By default, the ratio is 50%, but if you want to optimize the precision of the ensemble, then you may want to increase that ratio to 70% or even 95%, as we will see below.
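To illustrate how the vote acceptance ratio comes into play, here is a minimal, simplified voting sketch; it is not the actual (classify-ensemble-text) implementation, and the classifiers collection is an illustrative placeholder:

;; Minimal voting sketch (not the actual (classify-ensemble-text) implementation).
;; `classifiers` is a collection of functions that each return true when they
;; classify the input text as belonging to the domain.
(defn ensemble-classify
  [classifiers text & {:keys [vote-acceptance-ratio]
                       :or {vote-acceptance-ratio 0.50}}]
  (let [votes (count (filter #(% text) classifiers))]
    (>= (/ votes (count classifiers)) vote-acceptance-ratio)))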

Finally the ensemble, configured with all its hyperparameters, will be evaluated using the (evaluate-ensemble) function, which is the same as the (evaluate-model) function, but which uses the ensemble instead of a single SVM model to classify all of the articles. As before, we will characterize the assignments in relation to the gold standard.

Let’s now train different ensembles to try to improve the performance of the system.

Asymmetric Bagging

The current training corpus is highly unbalanced. This is why the first test we will do is to apply the asymmetric bagging strategy. Each of the SVM classifiers will use the same positive training set, with the same number of positive documents. However, each of them will draw its own random subset of negative training documents (with replacement).

(use 'cognonto-esa.core)
(use 'cognonto-esa.ensemble-svm)

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")
(load-semantic-interpreter "base-pruned" "resources/semantic-interpreters/base-pruned/")

(reset! ensemble [])

(train-ensemble-svm "ensemble.base.pruned.ab.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :ab 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 100
                    :nb-training-documents 3500)

Now let’s evaluate this ensemble with a vote acceptance ratio of 50%:

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.50)
True positive:  48
False positive:  6
True negative:  917
False negative:  46

Precision:  0.8888889
Recall:  0.5106383
Accuracy:  0.9488692
F1:  0.6486486

Let’s increase the vote acceptance ratio to 90%:

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.90)
True positive:  37
False positive:  2
True negative:  921
False negative:  57

Precision:  0.94871795
Recall:  0.39361703
Accuracy:  0.94198626
F1:  0.556391

In both cases, the precision increases considerably compared to the non-ensemble learning results. However, the recall drops at the same time, which lowers the F1 score as well. Let’s now try the ABRS method.

Asymmetric Bagging + Random Subspace method (ABRS)

The goal of the random subspace method is to select a random set of features for each model. This means that each model will have its own feature set and will make predictions according to it. With the ABRS strategy, we end up with highly diverse models, since no two of them share the same negative training set or the same features.

Here, what we test is to define each classifier with 65 randomly chosen features out of 174, and to restrict the negative training corpus to 3500 randomly selected documents. We then create 300 models to get a truly heterogeneous population of models.

(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.abrs.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :abrs 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 300
                    :nb-features 65
                    :nb-training-documents 3500)
(evaluate-ensemble "ensemble.base.pruned.abrs.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :abrs
                   :vote-acceptance-ratio 0.50)
True positive:  41
False positive:  3
True negative:  920
False negative:  53

Precision:  0.9318182
Recall:  0.43617022
Accuracy:  0.9449361
F1:  0.59420294

For these features and training sets, using the ABRS method did not improve on the AB method we tried above.

Conclusion

This use case shows three totally different ways to use the KBpedia Knowledge Graph to automatically create positive and negative training sets. We demonstrated how the full process can be automated where the only requirement is to get a list of seed KBpedia reference concepts.

We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models.

Creating prediction models using supervised machine learning algorithms (currently the bulk of the learners in use) involves two global steps:

  1. Label training sets and generate gold standards, and
  2. Test, compare, and optimize different learners, ensembles and hyperparameters.

Unfortunately, given the manual effort required by the first step, the overwhelming portion of time and budget is today spent there to create a prediction model. By automating much of this process, Cognonto and KBpedia substantially reduce this effort. Time and budget can now be re-directed to the second step of “dialing in” the learners, where the real payoff occurs.

Further, as we also demonstrated, once this process of labeling and building reference standards is automated, we can also automate the testing and optimization of multiple different kinds of prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to the desired results.

Footnotes

[1] Asymmetric Bagging and Random Subspace for Support Vector Machines-Based Relevance Feedback in Image Retrieval

Posted at 11:05

Frederick Giasson: Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 1

In my previous blog post, Create a Domain Text Classifier Using Cognonto, I explained how one can use the KBpedia Knowledge Graph to automatically create positive and negative training corpuses for different machine learning tasks. I explained how SVM classifiers could be trained and used to check if an input text belongs to the defined domain or not.

This article is the first of two parts. In this first part I will expand on this idea to explain how the KBpedia Knowledge Graph can be used, along with other machine learning techniques, to cope with different situations and use cases. I will cover the concepts of feature selection, hyperparameter optimization, and ensemble learning (in part 2 of this series). The emphasis here is on the testing and refining of machine learners, versus the set up and configuration times that dominate other approaches.

Depending on the domain of interest, and depending on the required precision or recall, different strategies and techniques can lead to better predictions. More often than not, multiple different training corpuses, learners and hyperparameters need to be tested before arriving at the best possible initial prediction model. This is why I will strongly emphasize the fact that the KBpedia Knowledge Graph and Cognonto can be used to fully automate the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models.

New Knowledge Graph and Reasoning

For this article, I will use the latest version of the KBpedia Knowledge Graph, version 1.10, which we just released. A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are created, etc. This growth means that any of these changes can have a [positive] impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models will improve. Better concepts, better structure, and more linkages will often lead to better training sets as well.

Such growth in KBpedia is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions we are able to automate all of these steps to see the final impacts of upgrading the knowledge graph:

  1. Aggregate all the reference concepts that scope the specified domain (by inference)
  2. Create the positive and negative training corpuses
  3. Prune the training corpuses
  4. Configure the classifier (in this case, create the semantic vectors for ESA)
  5. Train the model (in this case, the SVM model)
  6. Optimize the hyperparameters of the algorithm (in this case, the linear SVM hyperparameters), and
  7. Evaluate the model on multiple gold standards.

Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models.

Reasoning Over The Knowledge Graph

A new step I am adding in this current use case is to use a reasoner to reason over the KBpedia Knowledge Graph. The reasoner is used when we define the scope of the domain to classify. We will browse the knowledge graph to see which seed reference concepts we should add to the scope. Then we will use a reasoner to extend the models to include any new sub-classes relevant to the scope of the domain. This means that we may add further specific features to the final model.

Update Domain Training Corpus Using KBpedia 1.10 and a Reasoner

Recall our prior use case used Music as its domain scope. The first step is to use the new KBpedia version 1.10 along with a reasoner to create the full scope of this updated Music domain.

The result of using this new version and a reasoner is that we now end up with 196 features (reference documents) instead of 64. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities).

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  kbpedia
  "resources/domain-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

Create Training Corpuses

The next step is to create the actual training corpuses: the general and domain ones. We have to load the dictionaries we created in the previous step, and then locally cache and normalize the corpuses. Remember that the normalization steps are:

  1. Defluff the raw HTML page. We convert the HTML into text, and we only keep the body of the page
  2. Normalize the text with the following rules:
    1. remove diacritics characters
    2. remove everything between brackets like: [edit] [show]
    3. remove punctuation
    4. remove all numbers
    5. remove all invisible control characters
    6. remove all [math] symbols
    7. remove all words with 2 characters or fewer
    8. remove line and paragraph separators
    9. remove anything that is not an alpha character
    10. normalize spaces
    11. put everything in lower case, and
    12. remove stop words.
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

(cache-corpus)

(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
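As a rough illustration of some of the normalization rules above, here is a partial sketch only, not the actual implementation used in this article; diacritics and control-character handling are omitted:

;; Partial, illustrative sketch of a few of the normalization rules above.
(require '[clojure.string :as string])

(defn normalize-text
  [text stop-words]
  (let [cleaned (-> text
                    (string/replace #"\[[^\]]*\]" " ")        ;; remove everything between brackets like [edit] [show]
                    (string/replace #"[^a-zA-Z\s]" " ")       ;; keep only alpha characters (drops punctuation and numbers)
                    (string/replace #"\b[a-zA-Z]{1,2}\b" " ") ;; remove all words with 2 characters or fewer
                    (string/replace #"\s+" " ")               ;; normalize spaces
                    string/lower-case
                    string/trim)]
    ;; remove stop words; `stop-words` is assumed to be a set of lower-cased words
    (->> (string/split cleaned #" ")
         (remove stop-words)
         (string/join " "))))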

Create New Gold Standard

Because we never have enough instances in our gold standards to test against, let’s create a third one, but this time adding a music related news feed that will add more positive examples to the gold standard.

;; Assumes clojure.java.io (io), clojure.data.csv (csv) and an RSS/Atom feed
;; parsing library (feed) are already required in the current namespace.
(defn create-gold-standard-from-feeds
  [name]
  (let [feeds ["http://www.music-news.com/rss/UK/news"
               "http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]

    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          (csv/write-csv out-file "" (:title item) (:link item) :append true))))))
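With this function in place, the third gold standard file used later (resources/gold-standard-3.csv) would presumably be generated with a call along these lines:

(create-gold-standard-from-feeds "gold-standard-3")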

This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not.

For each piece of news aggregated that way, I manually determined if the candidate document belongs to the domain or not. This task can be tricky, and requires a clear understanding of the proper scope for the domain. In this example, I consider an article to belong to the music domain if it mentions music concepts such as musical albums, songs, multiple music related topics, etc. If only a singer is mentioned in an article because he broke up with his girlfriend, without further mention of anything related to music, I won’t tag it as being part of the domain.

[However, under a different interpretation of what should be in the domain wherein any mention of a singer qualifies, then we could extend the classification process to include named entities (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping decisions.]

You can download this new third gold standard from here.

Evaluate Initial Domain Model

Now that we have updated the training corpuses using the updated scope of the domain compared to the previous tests, let’s analyze the impact of using a new version of KBpedia and of using a reasoner to increase the number of features in our model. Let’s run our automatic process to evaluate the new models. The remaining steps that need to be run are:

  1. Configure the classifier (in this case, create the semantic vectors for ESA)
  2. Train the model (in this case, the SVM model), and
  3. Evaluate the model on multiple gold standards.

Note: to see the full explanation of how ESA and the SVM classifiers work, please refer to the Create a Domain Text Classifier Using Cognonto article for more background information.

;; Load positive and negative training corpuses
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  3
True negative:  306
False negative:  6

Precision:  0.875
Recall:  0.7777778
Accuracy:  0.97321427
F1:  0.8235294

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +10.33%
  • Recall: -12.16%
  • Accuracy: +0.31%
  • F1: +0.26%

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

What we can say is that the new scope for the domain greatly improved the precision of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the recall of the model suffers, since some of the positive cases of our gold standard are now classified as negative, which creates new false negatives. As you can see, there is almost always a tradeoff between precision and recall. You could reach 100% precision by returning a single correct result, but then the recall would suffer greatly. This is why the F1 score is important: it is the harmonic mean of precision and recall.
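As a quick, standalone sanity check on how these metrics are computed (not part of the article’s code base):

;; Standalone sketch: compute precision, recall and F1 from the confusion counts.
(defn metrics
  [tp fp fneg]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fneg))
        f1        (/ (* 2 precision recall) (+ precision recall))]
    {:precision (double precision)
     :recall    (double recall)
     :f1        (double f1)}))

;; For the second gold standard above (TP = 16, FP = 3, FN = 9):
;; (metrics 16 3 9) => {:precision 0.842..., :recall 0.64, :f1 0.727...}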

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  28
False positive:  3
True negative:  355
False negative:  22

Precision:  0.9032258
Recall:  0.56
Accuracy:  0.9387255
F1:  0.69135803

Again, with this new gold standard, we can see the same pattern: the precision is pretty good, but the recall is not that great, since almost half of the positive cases were missed by the model.

Now, what could we do to try to improve this situation? The next thing we will investigate is to use feature selection and pruning.

Feature Selection Using Pruning and Training Corpus Pruning

A new method that we will investigate to try to improve the performance of the models is called feature selection. As its name suggests, feature selection means selecting specific features to create our prediction model. The idea here is that not all features are created equal, and different features may have different (positive or negative) impacts on the model.

In our specific use case, we want to do feature selection using a pruning technique. What we will do is count the number of tokens for each of our features, that is, for each of the Wikipedia pages related to these features. If the number of tokens in an article is too small (below 100), then we will drop that feature.

[Note: feature selection is a complex topic; other options and nuances are not further discussed here.]

The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, it may, or may not, have an impact on the overall model’s performance.

Pruning the general and domain specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in the dictionaries from the cache, calculate the number of tokens in each, and then keep or drop them depending on whether they reach a certain threshold. Finally we write a new dictionary with the pruned features and documents:

(defn create-pruned-pages-dictionary-csv
  [dictionary-file pruned-file normalized-corpus-folder & {:keys [min-tokens]
                                                           :or {min-tokens 100}}]
  (let [dictionary (rest
                    (with-open [in-file (io/reader dictionary-file)]
                      (doall
                       (csv/read-csv in-file))))]
    (with-open [out-file (io/writer pruned-file)]
      (csv/write-csv out-file (->> dictionary
                                   (mapv (fn [[title rc]]
                                           (when (.exists (io/as-file (str normalized-corpus-folder title ".txt")))
                                             (when (> (->> (slurp (str normalized-corpus-folder title ".txt"))
                                                           tokenize
                                                           count) min-tokens)
                                               [[title rc]]))))
                                   (apply concat)
                                   (into []))))))

Then we can prune the general and domain specific dictionaries using this simple function:

(create-pruned-pages-dictionary-csv "resources/general-corpus-dictionary.csv"
                                    "resources/general-corpus-dictionary.pruned.csv" 
                                    "resources/corpus-normalized/"
                                    min-tokens 100)

(create-pruned-pages-dictionary-csv "resources/domain-corpus-dictionary.csv"
                                    "resources/domain-corpus-dictionary.pruned.csv" 
                                    "resources/corpus-normalized/"
                                    min-tokens 100)

As a result of this specific pruning approach, the number of features drops from 197 to 175.

Evaluating Pruned Training Corpuses and Selected Features

Now that the training corpuses have been pruned, let’s load them and then evaluate their performance on the gold standards.

;; Load positive and negative pruned training corpuses
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base-pruned/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base-pruned/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.pruned.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  2
True negative:  307
False negative:  6

Precision:  0.9130435
Recall:  0.7777778
Accuracy:  0.97619045
F1:  0.84000003

The performance changes relative to the initial results (using KBpedia 1.02) are:

  • Precision: +18.75%
  • Recall: -12.08%
  • Accuracy: +0.61%
  • F1: +2.26%

In this case, compared with the previous results (non-pruned with KBpedia 1.10), we improved the precision without decreasing the recall which is the ultimate goal. This means that the F1 score increased by 2.26% just by pruning, for this gold standard.

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

In this case, the results are identical to the non-pruned results with KBpedia 1.10; pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected, since the model also did not drastically change.

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  27
False positive:  7
True negative:  351
False negative:  23

Precision:  0.7941176
Recall:  0.54
Accuracy:  0.9264706
F1:  0.64285713

Now let’s check how these compare to the non-pruned version of the training corpus:

  • Precision: -12.08%
  • Recall: -3.7%
  • Accuracy: -1.31%
  • F1: -7.02%

Both false positives and false negatives increased with this change, which also led to a decrease in the overall metrics. What happened?

In fact, different things may have happened. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier are now off. This is what we will try to figure out by using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning.

Hyperparameters Optimization Using Grid Search

Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, hyperparameters are C, epsilon, weight and the algorithm used. Hyperparameter optimization is the task of trying to find the right parameter values in order to optimize the performance of the model.

There are multiple different strategies that we can use to try to find the best values for these hyperparameters, but the one we will use is called the grid search, which exhaustively searches across a manually defined subset of possible hyperparameter values.

The grid search function we want to define will enable us to specify the algorithm(s), the weight(s), C and the stopping tolerance. Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models.

Here is the function that implements that grid search algorithm:

(defn svm-grid-search
  [name model-path gold-standard & {:keys [grid-parameters selection-metric]
                                    :or {grid-parameters [{:c [1 2 4 16 256]
                                                           :e [0.001 0.01 0.1]
                                                           :algorithm [:l2l2]
                                                           :weight [1 15 30]}]
                                         selection-metric :f1}}]
  (let [best (atom {:gold-standard gold-standard
                    :selection-metric selection-metric
                    :score 0.0
                    :c nil
                    :e nil
                    :algorithm nil
                    :weight nil})
        model-vectors (read-string (slurp (str model-path "model.vectors")))]
    (doseq [parameters grid-parameters]
      (doseq [algo (:algorithm parameters)]
        (doseq [weight (:weight parameters)]
          (doseq [e (:e parameters)]
            (doseq [c (:c parameters)]
              (train-svm-model name model-path
                               :weights {1 (double weight)}
                               :v nil
                               :c c
                               :e e
                               :algorithm algo
                               :model-vectors model-vectors)
              (let [results (evaluate-model name gold-standard :output false)]              
                (println "Algorithm:" algo)
                (println "C:" c)
                (println "Epsilon:" e)
                (println "Weight:" weight)
                (println selection-metric ":" (get results selection-metric))
                (println)

                (when (> (get results selection-metric) (:score @best))
                  (reset! best {:gold-standard gold-standard
                                :selection-metric selection-metric
                                :score (get results selection-metric)
                                :c c
                                :e e
                                :algorithm algo
                                :weight weight}))))))))
    @best))

The possible algorithms are:

  1. :l2lr_primal
  2. :l2l2
  3. :l2l2_primal
  4. :l2l1
  5. :multi
  6. :l1l2_primal
  7. :l1lr
  8. :l2lr

To simplify things a little bit for this task, we will merge the three gold standards we have into one. We will use that gold standard moving forward. The merged gold standard can be downloaded from here. We now have a single gold standard with 1017 manually vetted web pages.
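Merging the gold standards is straightforward; here is a sketch (it assumes the three CSV files share the same class/title/url header used above):

;; Sketch: merge the three gold standard CSV files into a single one.
;; Assumes all files share the same "class","title","url" header.
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn merge-gold-standards
  [files out-file]
  (let [rows (mapcat (fn [file]
                       (with-open [in (io/reader file)]
                         (doall (rest (csv/read-csv in)))))   ;; drop each file's header
                     files)]
    (with-open [out (io/writer out-file)]
      (csv/write-csv out (cons ["class" "title" "url"] rows)))))

;; (merge-gold-standards ["resources/gold-standard-1.csv"
;;                        "resources/gold-standard-2.csv"
;;                        "resources/gold-standard-3.csv"]
;;                       "resources/gold-standard-full.csv")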

Now that we have a new consolidated gold standard, let’s calculate the performance of the models when the training corpuses are pruned and when they are not. This will become the new basis for comparing the subsequent results in this article. The metrics when the training corpuses are pruned are:

True positive: 56
False positive: 10
True negative: 913
False negative: 38

Precision: 0.8484849
Recall: 0.59574467
Accuracy: 0.95280236
F1: 0.7

Now, let’s run the grid search that will try to optimize the F1 score of the model using the pruned training corpuses and using the full gold standard:

(svm-grid-search "grid-search-base-pruned-tests" 
                 "resources/svm/base-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.7096774
 :c 2
 :e 0.001
 :algorithm :l2l2
 :weight 30}

With a simple subset of the possible hyperparameter space, we found that by increasing the C parameter to 2 we could improve the F1 score on the gold standard by 1.37%. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let’s try with c = [1.5, 2, 2.5] and weight = [30, 40]. Let’s also check other algorithms as well, like L2-regularized L1-loss support vector regression (dual).
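Such a refined search could look like the following sketch (the name given to the search run is arbitrary, and :l2l1 is the L2-regularized L1-loss option from the list of algorithms above):

(svm-grid-search "grid-search-base-pruned-refined"
                 "resources/svm/base-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1.5 2 2.5]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2 :l2l1]
                                    :weight [30 40]}])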

The goal here is to configure the initial grid search with general parameters with a wide range of possible values. Then subsequently we can use that tool to fine tune some of the parameters that were returning good results. In any case, the more computer power and time you have, the more tests you will be able to perform.

Posted at 11:00

November 16

Semantic Web Company (Austria): Triplifying a real dictionary

The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.

The input data, described further below, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (that is, converted to RDF, the Resource Description Framework) for several purposes, including enriching it with external resources. The triplified data needs to comply with Semantic Web principles.

To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.

Figure 1: Language model example

This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles. The extensive model [1] was developed at the Ontology Engineering Group of the Universidad Politécnica de Madrid (UPM).

Language is very complex. With this we all agree! How complex it really is, is probably often underestimated, especially when you need to model all its details and triplify it.

So why is the task so complex?

To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:

  • Pronunciation
  • Inflection
  • Range Of Application
  • Sense Indicator
  • Compositional Phrase
  • Translations
  • Translation Example
  • Alternative Scripting
  • Register
  • Geographical Usage
  • Sense Qualifier
  • Provenance
  • Version
  • Synonyms
  • Lexical sense
  • Usage Examples
  • Homograph information
  • Language information
  • Specific display information
  • Identifiers
  • and more…

Entries can have predefined values, which can recur but their fields can also have so-called free values, which can vary too. Such fields are:

  • Aspect
  • Tense
  • Subcategorization
  • Subject Field
  • Mood
  • Grammatical Gender
  • Geographical Usage
  • Case
  • and more…

As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling Linked Data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon covers a good portion of our dictionary needs, but not all of them. We therefore also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover all missing corners and catch the specific details that did not overlap with the Ontolex model (such as the free values).

An additional level of complexity was the need to identify exactly the missing pieces in the Ontolex model and its modules, and to create the part covering the missing information. This was part of creating the dictionary’s model, which we called ontolexKD.

As a developer you never sit down to think about all the senses or meanings or translations of a word (except if you specialize in linguistics), so just to understand the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.

The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter. In UnifiedViews the processing pipeline resembles what appears in Figure 2.

Figure 2: UnifiedViews pipeline used to triplify XML

The pipeline is composed of data processing units (DPUs) which communicate iteratively. In left-to-right order, the process in Figure 2 represents:

  • A DPU used to upload the XML files into UnifiedViews for further processing;
  • A DPU which transforms XML data to RDF using XSLT. The style sheet is part of the configuration of the unit;
  • The .rdf generated files are stored on the filesystem;
  • And, finally, the .rdf generated files are uploaded into a triple store, such as Virtuoso Universal server.

Basically the XML is transformed using XSLT.
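For readers who want to experiment outside of UnifiedViews, the core XML-to-XSLT transformation step can be sketched with the JDK’s built-in XSLT support; the snippet below is an illustration only, and the file names are placeholders rather than the project’s actual files:

;; Illustration only (not part of the LDL4HELTA pipeline): apply an XSLT
;; stylesheet to an XML file using the JDK's built-in XSLT support.
;; File names are placeholders.
(import '[javax.xml.transform TransformerFactory]
        '[javax.xml.transform.stream StreamSource StreamResult])
(require '[clojure.java.io :as io])

(defn xslt-transform
  [xml-file xslt-file out-file]
  (let [transformer (.newTransformer (TransformerFactory/newInstance)
                                     (StreamSource. (io/file xslt-file)))]
    (.transform transformer
                (StreamSource. (io/file xml-file))
                (StreamResult. (io/file out-file)))))

;; (xslt-transform "dictionary-entry.xml" "xml-to-rdf.xslt" "dictionary-entry.rdf")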

Complexity also increases with the URIs (Uniform Resource Identifiers) needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The starting point was to represent a single word (headword) under a desired namespace and build on it to associate it with its part of speech, grammatical number, grammatical gender, definition, translation, just to begin with.

The base URIs follow the best practices recommended in the ISA study on persistent URIs, following the pattern: http://{domain}/{type}/{concept}/{reference}.

An example of such URIs for the forms of a headword is:

These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.

If the dictionary contains two different adjectival endings, as with entendedor which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:

In addition, we should consider that the aim of triplifying the XML was for all these headwords with senses, forms and translations to be connected, identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let’s take two dictionaries: a German one which contains translations into English, and an English one which also contains translations into German. We get the following translations:

Bank – bank – German to English

bank – Bank – English to German

The URI of the translation from German to English was designed to look like:

And the translation from English to German would be:

In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?

The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.

One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?

The task at hand is not yet complete from the point of view of enabling the dictionary to fully benefit from Semantic Web principles. The linguist is probably the first one who can conceptualize how to do this.

As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.


References:

[1] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, “Modelling multilingual lexicographic resources for the web of data: the K Dictionaries case,” in Proc. of the GLOBALEX’16 workshop at LREC 2016, Portorož, Slovenia, May 2016.

Posted at 12:07

November 14

AKSW Group - University of Leipzig: Accepted paper in AAAI 2017

Hello Community! We are very pleased to announce that our paper “Radon– Rapid Discovery of Topological Relations” was accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), which will be held February 4–9 at the Hilton San Francisco, San Francisco, California, USA.

In more detail, we will present the following paper: “Radon– Rapid Discovery of Topological Relations” Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, and Axel-Cyrille Ngonga Ngomo

Abstract. Datasets containing geo-spatial resources are increasingly being represented according to the Linked Data principles. Several time-efficient approaches for discovering links between RDF resources have been developed over the last years. However, the time-efficient discovery of topological relations between geospatial resources has been paid little attention to. We address this research gap by presenting Radon, a novel approach for the rapid computation of topological relations between geo-spatial resources. Our approach uses a sparse tiling index in combination with minimum bounding boxes to reduce the computation time of topological relations. Our evaluation of Radon’s runtime on 45 datasets and in more than 800 experiments shows that it outperforms the state of the art by up to 3 orders of magnitude while maintaining an F-measure of 100%. Moreover, our experiments suggest that Radon scales up well when implemented in parallel.

Acknowledgments
This work is implemented in the link discovery framework LIMES and has been supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227) as well as the BMWI Project GEISER (project no. 01MD16014).

Posted at 13:48

November 13

Bob DuCharme: Pulling RDF out of MySQL

With a command line option and a very short stylesheet.

Posted at 15:09

Copyright of the postings is owned by the original blog authors. Contact us.