Planet RDF

It's triples all the way down

April 23

Bob DuCharme: The Wikidata data model and your SPARQL queries

Reference works to get you taking advantage of the fancy parts quickly.
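
As a quick, hedged illustration (mine, not from the post) of the "fancy parts" in question: Wikidata's wdt: properties give plain values, while the p:/ps:/pq: prefixes expose the full statement so its qualifiers can be queried, for example:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX p:   <http://www.wikidata.org/prop/>
PREFIX ps:  <http://www.wikidata.org/prop/statement/>
PREFIX pq:  <http://www.wikidata.org/prop/qualifier/>

# Spouses of Douglas Adams (Q42), plus the marriage start date (P580),
# which lives on the spouse (P26) statement as a qualifier.
SELECT ?spouse ?start WHERE {
  wd:Q42 p:P26 ?statement .
  ?statement ps:P26 ?spouse ;
             pq:P580 ?start .
}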

Posted at 14:43

April 21

Dublin Core Metadata Initiative: Webinar: Me4MAP - A method for the development of metadata application profiles

2017-04-21, A metadata application profile (MAP) is a construct that provides a semantic model for enhancing interoperability when publishing data to the Web of Data. When a community of practice agrees to follow a MAP's set of rules for publishing data as Linked Open Data, it makes it possible for such data to be processed automatically by software agents. Therefore, the existence of a method for MAP development is essential to providing developers with a common ground on which to work. The absence of such a method leads to a non-systematic set of MAP development activities that frequently results in MAPs of lesser quality. This Webinar with Mariana Curado Malta, Polytechnic of Porto, Portugal, will present Me4MAP, a method for the development of metadata application profiles. The webinar will be presented twice, once in English and once in Portuguese. For more information about the webinar and to register, visit http://dublincore.org/resources/training/#2017Malta.

Posted at 23:59

Dublin Core Metadata Initiative: NKOS Workshop at DC-2017 in Washington, DC

2017-04-21, The 11th U.S. Networked Knowledge Organization Systems (NKOS) Workshop will take place on Saturday, October 28 as part of DC-2017 in Crystal City, VA (Washington, D.C.). The Call for Participation including presentations and demos is available at http://dcevents.dublincore.org/IntConf/index/pages/view/nkosCall.

Posted at 23:59

April 19

AKSW Group - University of Leipzig: ESWC 2017 accepted two Demo Papers by AKSW members

Hello Community! The 14th ESWC, which takes place from May 28th to June 1st 2017 in Portoroz, Slovenia, accepted two demos to be presented at the conference. Read more about them in the following:                                                                        

1. “KBox – Distributing Ready-to-query RDF Knowledge Graphs” by Edgard Marx, Ciro Baron, Tommaso Soru and Sandro Athaide Coelho

Abstract: The Semantic Web community has successfully contributed to a remarkable number of RDF datasets published on the Web. However, to use and build applications on top of Linked Data is still a cumbersome and time-demanding task. We present KBox, an open-source platform that facilitates the distribution and consumption of RDF data. We show the different APIs implemented by KBox, as well as the processing steps from a SPARQL query to its corresponding result. Additionally, we demonstrate how KBox can be used to share RDF knowledge graphs and to instantiate SPARQL endpoints.

Please see: https://www.researchgate.net/publication/315838619_KBox_Distributing_Ready-to-query_RDF_Knowledge_Graphs

and

https://www.researchgate.net/publication/305410480_KBox_–_Transparently_Shifting_Query_Execution_on_Knowledge_Graphs_to_the_Edge

2. “EAGLET – All That Glitters is not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking” by Kunal Jha, Michael Röder and Axel-Cyrille Ngonga Ngomo

The desideratum to bridge the unstructured and structured data on the web has led to the advancement of a considerable number of annotation tools, and the evaluation of these Named Entity Recognition and Entity Linking systems is incontrovertibly one of the primary tasks. However, these evaluations are mostly based on manually created gold standards. As much as these gold standards have the upper hand of being created by a human, they also leave room for a major proportion of oversights. We will demonstrate EAGLET, a tool that supports the semi-automatic checking of a gold standard based on a set of uniform annotation rules.

Please also see: https://svn.aksw.org/papers/2017/ESWC_EAGLET_2017/public.pdf

Posted at 08:19

April 08

Ebiquity research group UMBC: Google search now includes schema.org fact check data

Google claims on their search blog that “Fact Check now available in Google Search and News”.  We’ve sampled searches on Google and found that some results did indeed include Fact Check data from schema.org’s ClaimReview markup.  So we are including the following markup on this page.

    
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "ClaimReview",
      "datePublished": "2017-04-08",
      "url": "http://ebiquity.umbc.edu/blogger/2017/04/08/google-search-now-including-schema-org-fact-check-data",
      "itemReviewed":
      {
        "@type": "CreativeWork",
        "author":
        {
          "@type": "Organization",
          "name": "Google"
        },
        "datePublished": "2017-04-07"
      },
      "claimReviewed": "Fact Check now available in Google search and news",
      "author":
      {
        "@type": "Organization",
        "name": "UMBC Ebiquity Research Group",
        "url": "http://ebiquity.umbc.edu/"
      },
      "reviewRating":
      {
        "@type": "Rating",
        "ratingValue": "5",
        "bestRating": "5",
        "worstRating": "1",
        "alternateName": "True"
      }
    }</script>

Google notes that

“Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion. Finally, the content must adhere to the general policies that apply to all structured data markup, the Google News Publisher criteria for fact checks, and the standards for accountability and transparency, readability or proper site representation as articulated in our Google News General Guidelines. If a publisher or fact check claim does not meet these standards or honor these policies, we may, at our discretion, ignore that site’s markup.”

and we hope that the algorithms will find us to be an authoritative source of information.

You can see the actual markup by viewing this page’s source or looking at the markup that Google’s structured data testing tool finds on it here by clicking on ClaimReview in the column on the right.

Update: We’ve been algorithmically determined to be an authoritative source of information!

Posted at 14:39

April 07

AKSW Group - University of Leipzig: AKSW Colloquium, 10.04.2017, GeoSPARQL on geospatial databases

At the AKSW Colloquium, on Monday 10th of April 2017, 3 PM, Matthias Wauer will discuss a paper titled “Ontop of Geospatial Databases“. Presented at ISWC 2016, this work extends an ontology based data access (OBDA) system with support for GeoSPARQL for querying geospatial relational databases. In the evaluation section, they compare their approach to Strabon. The work is partially supported by the Optique and Melodies EU projects.
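
For context, a minimal sketch (not taken from the paper) of the kind of GeoSPARQL query such an OBDA system rewrites into SQL over the underlying spatial database:

PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

# Features whose WKT geometry lies within a (hypothetical) bounding polygon.
SELECT ?feature WHERE {
  ?feature geo:hasGeometry ?g .
  ?g geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt, "POLYGON((12 45, 14 45, 14 47, 12 47, 12 45))"^^geo:wktLiteral))
}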

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/public/colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 08:43

April 02

W3C Read Write Web Community Group: Read Write Web — Q1 Summary — 2017

Summary

A quiet start to 2017 as people prepare for WWW 2017 and ESWC.  An active political quarter saw the inauguration of a new US president, and numerous concerns raised about new laws regarding privacy at the ISP level.

The Linked Open Data cloud continues to grow and has a neat update here.  There has also been a release of the SHACL playground which allows data to be validated according to various “shapes”.

Linked Data Notifications has become a Proposed Recommendation, and will allow users of the web to have a data inbox, and enable a whole host of use cases.

Communications and Outreach

Collaboration has begun with two cloud providers, Nextcloud and Cozy Cloud.  Hopefully this will bring read and write web standards to a wider audience over time.

 

Community Group

Some ideas for extending the way PATCH works have been described by TimBL; a sketch of one such patch appears after the list below.  I found it interesting how data could be transmitted over protocols other than the web:

– When clients listening to the same resource are in fact located physically close, they could exchange patches through another medium like WiFi or Bluetooth.

– The system can evolve (under stress) to work entirely with distributed patches, making the original HTTP server unnecessary.

– The patches could be combined with hashes of versions of folders to form the basis for a git-like version control system, or connect to git itself.
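
For readers unfamiliar with the mechanism being extended, here is a minimal sketch (mine, not TimBL's) of the kind of patch involved: a SPARQL Update body sent to a Linked Data resource with HTTP PATCH (typically with Content-Type: application/sparql-update), here renaming a person in a hypothetical profile document:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Remove the old value and add the new one in a single patch;
# <#me> is resolved relative to the patched resource.
DELETE DATA { <#me> foaf:name "Old Name" } ;
INSERT DATA { <#me> foaf:name "New Name" }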

solid

Applications

There is a new test website for the OpenID authentication branch of node solid server, and solid client has been updated to work with it.  There have been various fixes to rdf and solid libraries, and two new repositories for solid notifications and solid permissions.

Good work has continued on rabel, a program for reading and writing linked data in various formats.  In addition, the browser-shimmed apps built on solid-ui and solid-app-set continue to improve.  Finally, *shameless plug*, I am writing a gitbook on a skinned version of node solid server, bitmark storage, which hopes to integrate solid with cryptocurrencies, creating self-funding storage.

Last but not Least…

On the topic of cryptocurrencies, I’m very excited about a draft paper released on semantic blockchains.  There was some buzz generated around this topic, and it will hopefully feature in a workshop next quarter.

Posted at 11:04

March 31

AKSW Group - University of Leipzig: AKSW Colloquium, 03.04.2017, RDF Rule Mining

At the AKSW Colloquium, on Monday 3rd of April 2017, 3 PM, Tommaso Soru will present the state of his ongoing research titled “Efficient Rule Mining on RDF Data”, where he will introduce Horn Concerto, a novel scalable SPARQL-based approach for the discovery of Horn clauses in large RDF datasets. The presentation slides will be available at this URL.
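
As a hedged illustration (not from the talk): a Horn clause such as spouse(x, y) ∧ child(y, z) ⇒ child(x, z) can be written as a SPARQL CONSTRUCT, which is roughly the shape of rule a SPARQL-based miner searches for and then scores against the data:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Rule body as the graph pattern, rule head as the constructed triple.
CONSTRUCT { ?x dbo:child ?z }
WHERE {
  ?x dbo:spouse ?y .
  ?y dbo:child ?z .
}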

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 11:39

March 30

Leigh Dodds: The limitations of the open banking licence

The

Posted at 18:45

March 27

AKSW Group - University of Leipzig: AKSW Colloquium, 27.03.2017, PPO & PPM 2.0: Extending the privacy preference framework to provide finer-grained access control for the Web of Data

In the upcoming Colloquium on March the 27th at 3 PM, Marvin Frommhold will discuss the paper “PPO & PPM 2.0: Extending the Privacy Preference Framework to provide finer-grained access control for the Web of Data” by Owen Sacco and John G. Breslin, published in the I-SEMANTICS ’12 Proceedings of the 8th International Conference on Semantic Systems.

Abstract:  Web of Data applications provide users with the means to easily publish their personal information on the Web. However, this information is publicly accessible and users cannot control how to disclose their personal information. Protecting personal information is deemed important in use cases such as controlling access to sensitive personal information on the Social Semantic Web or even in Linked Open Government Data. The Privacy Preference Ontology (PPO) can be used to define fine-grained privacy preferences to control access to personal information and the Privacy Preference Manager (PPM) can be used to enforce such preferences to determine which specific parts of information can be granted access. However, PPO and PPM require further extensions to create more control when granting access to sensitive data; such as more flexible granularity for defining privacy preferences. In this paper, we (1) extend PPO with new classes and properties to define further fine-grained privacy preferences; (2) provide a new light-weight vocabulary, called the Privacy Preference Manager Ontology (PPMO), to define characteristics about privacy preference managers; and (3) present an extension to PPM to enable further control when publishing and sharing personal information based on the extended PPO and the new vocabulary PPMO. Moreover, the PPM is extended to provide filtering data over SPARQL endpoints.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 08:13

March 26

Bob DuCharme: Wikidata's excellent sample SPARQL queries

Learning about the data, its structure, and more.
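
To give a flavour of the collection, here is a sketch in the same style as its introductory examples (my paraphrase, not a verbatim sample): list items that are instances of house cat, with labels filled in by the query service's label service:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# Instance of (P31) house cat (Q146); ?itemLabel is bound by the label service.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10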

Posted at 17:40

March 24

Dublin Core Metadata Initiative: Sayeed Choudhury to deliver Keynote at DC-2017

2017-03-24, The Governing Board and the Chairs of the DC-2017 Program Committee are pleased to announce that Sayeed Choudhury, Associate Dean for Research Data Management and Hodson Director of the Digital Research and Curation Center at the Sheridan Libraries of Johns Hopkins University, will deliver the keynote address at DC-2017 in Washington, D.C. Choudhury has oversight for data curation research and development and data archive implementation at the Sheridan Libraries at Johns Hopkins University. Choudhury is a President Obama appointee to the National Museum and Library Services Board. He is a member of the Executive Committee for the Institute of Data Intensive Engineering and Science (IDIES) based at Johns Hopkins. He is also a member of the Board of the National Information Standards Organization (NISO) and a member of the Advisory Board for OpenAIRE2020. He has been a member of the National Academies Board on Research Data and Information, the ICPSR Council, the DuraSpace Board, the Digital Library Federation advisory committee, the Library of Congress' National Digital Stewardship Alliance Coordinating Committee, the Federation of Earth Science Information Partners (ESIP) Executive Committee and the Project MUSE Advisory Board. He is the recipient of the 2012 OCLC/LITA Kilgour Award. Choudhury has testified for the U.S. Research Subcommittee of the Congressional Committee on Science, Space and Technology. For additional information, see http://dcevents.dublincore.org/IntConf/index/pages/view/keynote17.

Posted at 23:59

Dublin Core Metadata Initiative: ZBW German National Library of Economics joins DCMI as Institutional Member

2017-03-24, The DCMI Governing Board is pleased to announce that ZBW German National Library of Economics has joined DCMI as an Institutional Member. ZBW Director Klaus Tochtermann will serve as the Library's representative to the Board. ZBW German National Library of Economics - Leibniz Information Centre for Economics is the world's largest research infrastructure for economic literature, online as well as offline. Its disciplinary repository EconStor provides a large collection of more than 127,000 articles and working papers in Open Access. EconBiz, the portal for international economic information, allows students and researchers to search among nine million datasets. The ZBW edits two journals in economic policy, Wirtschaftsdienst and Intereconomics, and in cooperation with the Kiel Institute for the World Economy produces the peer-reviewed journal Economics based on the principle of Open Access. For information on becoming a DCMI Institutional Member, visit the DCMI membership page at http://dublincore.org/support/.

Posted at 23:59

Dublin Core Metadata Initiative: Webinar: Nailing Jello to a Wall: Metrics, Frameworks, & Existing Work for Metadata Assessment

2017-03-24, With the increasing number of repositories, standards and resources we manage for digital libraries, there is a growing need to assess, validate and analyze our metadata - beyond our traditional approaches such as writing XSD or generating CSVs for manual review. Being able to further analyze and determine measures of metadata quality helps us better manage our data and data-driven development, particularly with the shift to Linked Open Data leading many institutions to large-scale migrations. Yet, the semantically-rich metadata desired by many Cultural Heritage Institutions, and the granular expectations of some of our data models, makes performing assessment, much less going on to determine quality or performing validation, that much trickier. How do we handle analysis of the rich understandings we have built into our Cultural Heritage Institutions’ metadata and enable ourselves to perform this analysis with the systems and resources we have? This webinar with Christina Harlow, Cornell University Library, sets up this question and proposes some guidelines, best practices, tools and workflows around the evaluation of metadata used by and for digital libraries and Cultural Heritage Institution repositories. The goal is for webinar participants to walk away prepared to handle their own metadata assessment needs by using existing works and being better aware of the open questions in this domain. For additional information and to register, go to http://dublincore.org/resources/training/#2017harlow.

Posted at 23:59

Leigh Dodds: What is data asymmetry?

You’ve just parked your car. Google Maps offers to

Posted at 18:01

March 23

schema.org: Schema.org 3.2 release: courses, fact-checking, digital publishing accessibility, menus and more...

Schema.org 3.2 is released! This update brings many improvements, including new vocabulary for describing courses, fact-check reviews, and digital publishing accessibility, as well as a more thorough treatment of menus and a large number of pending proposals which are offered for early-access use, evaluation and improvement. We also introduce a new "hosted extension" area, iot.schema.org, which provides an entry point for schema collaborations relating to the Internet of Things field. As always, our releases page has full details.

These efforts depend entirely on a growing network of collaborations, within our own W3C Community Group and beyond. Many thanks are due to the Schema Course Extension Community Group, the IDPF's Epub Accessibility Working Group, members of the international fact-checking network including the Duke Reporters Lab and Full Fact, the W3C Web of Things and Spatial Web initiatives, the Bioschemas project, and to Wikipedia's Wikidata project.

This release also provides the opportunity to thank two of our longest-serving steering group members, whose careers have moved on from the world of structured data markup. Peter Mika and Martin Hepp have both played leading roles in Schema.org since its earliest days, and the project has benefited greatly from their insight, commitment and attention to detail.

As we look towards future developments, it is worth briefly recapping how we have organized things recently. Schema.org's primary discussion forum is a W3C group, although its most detailed collaborations typically take place on GitHub, organized around specific issues and proposed changes. These discussions are open to all interested parties. Schema designs frequently draw upon related groups that have a more specific topical focus. For example, the Courses group became a hub for education/learning metadata experts from LRMI and others. This need to engage with relevant experts also motivated the creation of the "pending" area introduced in our previous release. GitHub is a site oriented towards computer programmers. By surfacing proposed, experimental and other early-access designs at pending.schema.org we hope we can reach a wider audience who may have insight to share. With today's release, we add 14 new "pending" designs, with courses, accessibility and fact-checking markup graduating from pending into the core section of schema.org. Future releases will follow this pipeline approach, encouraging greater consistency, quality and clarity as our vocabulary continues to evolve.



Posted at 15:29

March 21

Leigh Dodds: Fearful about personal data, a personal example

I was recently at a workshop on making better use of (personal) data for the benefit of specific communities. The discussion, perhaps inevitably, ended up focusing on many of the attendees' concerns around how data about them was being used.

The group was asked to share what made them afraid or fearful about how personal data might be misused. The examples were mainly about the use of that data by Facebook, by advertisers, for surveillance, etc. There was a view that being in control of that data would remove the fear and put the individual back in control. This same argument pervades a lot of the discussion around personal data. The narrative is that if I own my data then I can decide how and where it is used.

But this overlooks the fact that data ownership is not a clear-cut thing. Multiple people might reasonably claim to have ownership over some data: bank transactions between individuals, for example.

Posted at 19:10

Gregory Williams: SPARQL Limit by Resource

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

Project(
    Filter(
        ?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
                ?rank,
                Rank,
                (),
                (DESC(?age))
            )
        )
    ),
    {?age, ?name, ?school}
)

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
HAVING (RANK(ORDER BY DESC(?age)) <= 2)

or:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
        }
    }
    FILTER(?rank <= 2)
}

Posted at 03:17

March 17

Ebiquity research group UMBC: SemTk: The Semantics Toolkit from GE Global Research, 4/4

The Semantics Toolkit

Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY

10:00-11:00 Tuesday, 4 April 2017, ITE 346, UMBC

SemTk (Semantics Toolkit) is an open source technology stack built by GE Scientists on top of W3C Semantic Web standards.  It was originally conceived for data exploration and simplified query generation, and later expanded to a more general semantics abstraction platform. SemTk is made up of a Java API and microservices along with JavaScript front ends that cover drag-and-drop query generation, path finding, data ingestion and the beginnings of stored procedure support.  In this talk we will give a tour of SemTk, discussing its architecture and direction, and demonstrate its features using the SPARQLGraph front-end hosted at http://semtk.research.ge.com.

Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, and modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks, and big data ontologies.  He is one of the creators of the open source software “Semantics Toolkit” (SemTk) which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion.  Paul holds over twenty U.S. patents.

Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He worked as a systems architect and programmer for large-scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.

Posted at 13:00

March 13

Leigh Dodds: Some tips for open data ecosystem mapping

At Open Data Camp last month I pitched to run a session on

Posted at 20:11

AKSW Group - University of Leipzig: DBpedia @ Google Summer of Code – GSoC 2017

DBpedia, one of InfAI’s community projects, will be part of the 5th Google Summer of Code program.

The GSoC's goal is to bring students from all over the globe into open source software development. In this regard, we are calling for students to take part in the Summer of Code. During three funded months, you will be able to work on a specific task, the results of which are presented at the end of the summer.

Have we sparked your interest in participating? Great, then check out the DBpedia website for further information.

Posted at 10:12

March 11

Leigh Dodds: The British Hypertextual Society (1905-2017)

With their globe-spanning satellite network nearing completion, Peter Linkage reports on some of the key milestones in the history of the British Hypertextual Society.

The British Hypertextual Society was founded in 1905 with a parliamentary grant from the Royal Society of London. At the time there was growing international interest in finding better ways to manage information, particularly scientific research. Undoubtedly the decision to invest in the creation of a British centre of expertise for knowledge organisation was also influenced by the rapid progress being made in Europe.

Posted at 11:52

March 10

Frederick Giasson: A Machine Learning Workflow

I am giving a talk (in French) at the 85th edition of the ACFAS congress on May 9. I will discuss the engineering aspects of doing machine learning. But more importantly, I will discuss how Semantic Web techniques, technologies and specifications can help solve the engineering problems, and how they can be leveraged and integrated into a machine learning workflow.

My talk draws on my work in the Semantic Web field over the last 15 years, on my more recent work creating the KBpedia Knowledge Graph at Cognonto, and on how both influenced the machine learning solutions we developed to integrate data, to extend knowledge structures, and to tag and disambiguate concepts and entities in text corpora, etc.

One thing we experienced is that most of the work involved in such projects is not directly related to machine learning problems (or at least not to the usage of machine learning algorithms). I then recently read a survey conducted by CrowdFlower in 2016 that supports what we experienced. They surveyed about 80 data scientists to find out “where they feel their profession is going, [and] what their day-to-day job is like”. To the question “What data scientists spend the most time doing”, they answered:

As you can see, 77% of the time spent by a data scientist goes to tasks other than machine learning algorithm selection, testing and refinement. If you read the survey, you will see that at the same time about 93% of the respondents said that these are the least enjoyable tasks! Which is a bit depressing… But on the other hand, we don’t know how much they disliked these tasks.

What I find interesting in the survey is that most of these non-machine-learning tasks (again, 77% of the time!) are data manipulation tasks that need to be engineered into some process/workflow/pipeline.

To put these numbers into context, I created my own general machine learning workflow schema. This is the one I have used multiple times in the last few years while working on different projects for Cognonto. Depending on the task at hand, some steps may differ or be added, but the core is there. Also note that all the tasks where this machine learning workflow has been used are related to natural language processing, entity matching, and concept and entity tagging and disambiguation.

This workflow is split into four general areas:

  1. Data Processing
  2. Training Sets creation
  3. Machine Learning Algorithms testing, evaluation and selection
  4. Deployment and A/B testing

The only “real” machine learning work happens in the top-right corner of this schema and accounts for only about 13% of the time spent by data scientists, according to the survey. All other tasks are related to data acquisition, data analysis, data normalization and transformation, and then data integration and data filtering/slicing/reduction, with which we create a series of different training sets or training corpora that then lead to the creation (after proper splitting) of the training, validation and test sets.

It is only at that point that data scientists start testing algorithms to create different models, evaluate them and fine-tune the hyper-parameters. Once the best model(s) are selected, we gradually put them into production with different steps of A/B testing.

Again, 77% of these tasks are unrelated to machine learning algorithms as such. They belong instead to an engineered pipeline which includes ETL and A/B testing frameworks. There is nothing sexy in this reality, but if data scientists spend 3/4 of their time working on these tasks, then it suggests that they are highly important!

Note that every task with a small purple brain on it would benefit from leveraging a Knowledge Graph structure such as Cognonto’s KBpedia Knowledge Graph.

Posted at 19:25

Ebiquity research group UMBC: SemTk: The Semantics Toolkit from GE Global Research

The Semantics Toolkit

Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY

10:30-11:30 Tuesday, 14 March 2017, ITE 346, UMBC

SemTk (Semantics Toolkit) is an open source technology stack built by GE Scientists on top of W3C Semantic Web standards.  It was originally conceived for data exploration and simplified query generation, and later expanded to a more general semantics abstraction platform. SemTk is made up of a Java API and microservices along with JavaScript front ends that cover drag-and-drop query generation, path finding, data ingestion and the beginnings of stored procedure support.  In this talk we will give a tour of SemTk, discussing its architecture and direction, and demonstrate its features using the SPARQLGraph front-end hosted at http://semtk.research.ge.com.

Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, and modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks, and big data ontologies.  He is one of the creators of the open source software “Semantics Toolkit” (SemTk) which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion.  Paul holds over twenty U.S. patents.

Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He worked as a systems architect and programmer for large-scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.

Posted at 13:14

AKSW Group - University of Leipzig: New GERBIL release v1.2.5 – Benchmarking entity annotation systems

Dear all,
the Smart Data Management competence center at AKSW is happy to announce GERBIL 1.2.5.

GERBIL is a general entity annotation benchmarking system and offers an easy-to-use web-based platform for the agile comparison of annotators using multiple datasets and uniform measuring approaches. To add a tool to GERBIL, all the end user has to do is provide a URL to a REST interface to their tool which abides by a given specification. The integration and benchmarking of the tool against user-specified datasets is then carried out automatically by the GERBIL platform. Currently, our platform provides results for 20 annotators and 46 datasets, with more coming.

Website: http://aksw.org/Projects/GERBIL.html
Demo: http://gerbil.aksw.org/gerbil/
GitHub page: https://github.com/AKSW/gerbil
Download: https://github.com/AKSW/gerbil/releases/tag/v1.2.5

New features include:

  • Added annotators (DoSeR, NERFGUN, PBOH, xLisa)
  • Added datasets (Derczynski, ERD14 and GERDAQ, Microposts 2015 and 2016, Ritter, Senseval 2 and 3, UMBC, WSDM 2012)
  • Introduced the RT2KB experiment type that comprises recognition and typing of entities
  • Introduced index-based sameAs relation retrieval and entity checking for KBs that do not change very often (e.g., DBpedia). Downloading the indexes is optional and GERBIL can run without them (but then has the same performance drawbacks as previous versions).
  • A warning is shown in the GUI if the server is currently busy.
  • Implemented checks for certain datasets and annotators: if dataset files are missing (because of licensing) or annotator API keys are absent, they are not available in the front end.

We want to thank everyone who helped to create this release, in particular we want to thank Felix Conrads and Jonathan Huthmann. We also acknowledge support by the DIESEL, QAMEL and HOBBIT projects.

We really appreciate feedback and are open to collaborations.
If you happen to have use cases utilizing this dataset, please contact us.

Michael and Ricardo on behalf of the GERBIL team

Posted at 10:49

March 09

AKSW Group - University of Leipzig: DBpedia Open Text Extraction Challenge – TextExt

DBpedia, a community project affiliated with the Institute for Applied Informatics (InfAI) e.V., extracts structured information from Wikipedia & Wikidata. DBpedia has now started the DBpedia Open Text Extraction Challenge – TextExt. The aim is to increase the amount of structured DBpedia/Wikipedia data and to provide a platform for benchmarking various extraction tools. DBpedia wants to polish the knowledge of Wikipedia and then spread it on the web, free and open for any IT users and businesses.

Procedure

Compared to other challenges, which are often just one-time calls, TextExt is a continuous challenge focusing on lasting progress and exceeding limits in a systematic way. DBpedia provides the extracted and cleaned full text for all Wikipedia articles from 9 different languages at regular intervals, for download and as a Docker image, in the machine-readable NIF-RDF format (example for Barack Obama in English). Challenge participants are asked to wrap their NLP and extraction engines in Docker images and submit them to the DBpedia team, which will run participants’ tools at regular intervals in order to extract:

  1. Facts, relations, events, terminology, ontologies as RDF triples (Triple track)
  2. Useful NLP annotations such as pos-tags, dependencies, co-reference (Annotation track)
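
To make the NIF-RDF format a little more concrete, here is a hedged sketch (mine, not from the announcement) of querying the corpus once it is loaded into a triple store, pulling the cleaned article text out of each context resource:

PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>

# Each article's full text is attached to its nif:Context via nif:isString.
SELECT ?context ?text WHERE {
  ?context a nif:Context ;
           nif:isString ?text .
}
LIMIT 10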

DBpedia allows submissions 2 months prior to selected conferences (currently http://ldk2017.org/ and http://2017.semantics.cc/ ). Participants that fulfil the technical requirements and provide a sufficient description will be able to present at the conference and be included in the yearly proceedings. At each conference, the challenge committee will select a winner among the challenge participants, who will receive 1,000 €.

Results

Starting in December 2017, DBpedia will publish a summary article and proceedings of participants’ submissions at http://ceur-ws.org/ every year.

For further news and next events please have a look at http://wiki.dbpedia.org/textext or contact DBpedia via email dbpedia-textext-challenge@infai.org.

The project was created with the support of the H2020 EU projects HOBBIT (GA-688227) and ALIGNED (GA-644055), as well as the BMWi project Smart Data Web (GA-01MD15010B).

Challenge Committee

  • Sebastian Hellmann, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig
  • Sören Auer, Fraunhofer IAIS, University of Bonn
  • Ricardo Usbeck, AKSW, Simba Competence Center, Leipzig University
  • Dimitris Kontokostas, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig
  • Sandro Coelho, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig

 

Posted at 11:15

March 07

Leigh Dodds: Designing CSV files

A couple of the projects I’m involved with at the moment are at a stage where there’s some thinking going on around how to best provide CSV files for users. This has left me thinking about what options we actually have when it comes to designing a CSV file format.

CSV is a very useful, but pretty mundane format. I suspect many of us don’t really think very much about how to organise our CSV files. It’s just a table, right? What decisions do we need to make?

But there are actually quite a few different options we have that might make a specific CSV format more or less suited for specific audiences. So I thought I'd write down some of the options that occurred to me. It might be useful input into both my current projects as well as future work on standard formats.

Starting from the “outside in”, we have decisions to make about all of the following:

File naming

How are you going to name your CSV file? A good file naming convention can help ensure that a data file has an unambiguous name within a data package or after a user has downloaded it.

Including a name, timestamp or other version indicator will avoid clobbering existing files if a user is archiving or regularly collecting data.

Adopting a similar policy to generating

Posted at 22:22

March 04

Ebiquity research group UMBC: SADL: Semantic Application Design Language

SADL – Semantic Application Design Language

Dr. Andrew W. Crapo
GE Global Research

 10:00 Tuesday, 7 March 2017

The Web Ontology Language (OWL) has gained considerable acceptance over the past decade. Building on prior work in Description Logics, OWL has sufficient expressivity to be useful in many modeling applications. However, its various serializations do not seem intuitive to subject matter experts in many domains of interest to GE. Consequently, we have developed a controlled-English language and development environment that attempts to make OWL plus rules more accessible to those with knowledge to share but limited interest in studying formal representations. The result is the Semantic Application Design Language (SADL). This talk will review the foundational underpinnings of OWL and introduce the SADL constructs meant to capture, validate, and maintain semantic models over their lifecycle.

 

Dr. Crapo has been part of GE’s Global Research staff for over 35 years. As an Information Scientist he has built performance and diagnostic models of mechanical, chemical, and electrical systems, and has specialized in human-computer interfaces, decision support systems, machine reasoning and learning, and semantic representation and modeling. His work has included a graphical expert system language (GEN-X), a graphical environment for procedural programming (Fuselet Development Environment), and a semantic-model-driven user-interface for decision support systems (ACUITy). Most recently Andy has been active in developing the Semantic Application Design Language (SADL), enabling GE to leverage worldwide advances and emerging standards in semantic technology and bring them to bear on diverse problems from equipment maintenance optimization to information security.

Posted at 14:19

March 03

Dublin Core Metadata Initiative: Webinar: Data on the Web Best Practices: Challenges and Benefits

2017-03-03, There is a growing interest in the publication and consumption of data on the Web. Government and non-governmental organizations already provide a variety of data on the Web, some open, others with access restrictions, covering a variety of domains such as education, economics, e-commerce and scientific data. Developers, journalists, and others manipulate this data to create visualizations and perform data analysis. Experience in this area reveals that a number of important issues need to be addressed in order to meet the requirements of both publishers and data consumers.

In this webinar, Bernadette Farias Lóscio, Caroline Burle dos Santos Guimarães and Newton Calegari will discuss the key challenges faced by publishers and data consumers when sharing data on the Web. We will also introduce the W3C Best Practices set (https://www.w3.org/TR/dwbp/) to address these challenges. Finally, we will discuss the benefits of engaging data publishers in the use of Best Practices, as well as improving the way data sets are made available on the Web. The webinar will be presented on two separate dates, once in Portuguese (30 March) and again in English (6 April).

For additional information and to register for either the Portuguese or English version, visit the webinar's webpage at http://dublincore.org/resources/training/#2017DataBP. Registration is managed by DCMI's partner ASIS&T.

Posted at 23:59

Dublin Core Metadata Initiative: DC-2017 Call for Participation published

2017-03-03, The DC-2017 Call for Participation (CfP) has been published. DC-2017 will take place in Washington, D.C. and will be collocated with the ASIST Annual Meeting. The theme of DC-2017 is "Advancing metadata practice: Quality, Openness, Interoperability". The conference program will include peer reviewed papers, project reports, and poster tracks. In addition, an array of presentations, panels, tutorials and workshops will round out the program. The Conference Committee is seeking submissions in all tracks. The CfP is available at http://dcevents.dublincore.org/index.php/IntConf/dc-2017/schedConf/cfp.

Posted at 23:59

Copyright of the postings is owned by the original blog authors. Contact us.