Planet RDF

It's triples all the way down

March 27

AKSW Group - University of Leipzig: AKSW Colloquium, 27.03.2017, PPO & PPM 2.0: Extending the privacy preference framework to provide finer-grained access control for the Web of Data

In the upcoming Colloquium, on March 27th at 3 PM, Marvin Frommhold will discuss the paper “PPO & PPM 2.0: Extending the Privacy Preference Framework to provide finer-grained access control for the Web of Data” by Owen Sacco and John G. Breslin, published in the I-SEMANTICS ’12 Proceedings of the 8th International Conference on Semantic Systems.

Abstract:  Web of Data applications provide users with the means to easily publish their personal information on the Web. However, this information is publicly accessible and users cannot control how to disclose their personal information. Protecting personal information is deemed important in use cases such as controlling access to sensitive personal information on the Social Semantic Web or even in Linked Open Government Data. The Privacy Preference Ontology (PPO) can be used to define fine-grained privacy preferences to control access to personal information and the Privacy Preference Manager (PPM) can be used to enforce such preferences to determine which specific parts of information can be granted access. However, PPO and PPM require further extensions to create more control when granting access to sensitive data; such as more flexible granularity for defining privacy preferences. In this paper, we (1) extend PPO with new classes and properties to define further fine-grained privacy preferences; (2) provide a new light-weight vocabulary, called the Privacy Preference Manager Ontology (PPMO), to define characteristics about privacy preference managers; and (3) present an extension to PPM to enable further control when publishing and sharing personal information based on the extended PPO and the new vocabulary PPMO. Moreover, the PPM is extended to provide filtering data over SPARQL endpoints.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

Posted at 08:13

March 26

Bob DuCharme: Wikidata's excellent sample SPARQL queries

Learning about the data, its structure, and more.

Posted at 17:40

March 24

Dublin Core Metadata Initiative: Sayeed Choudhury to deliver Keynote at DC-2017

2017-03-24, The Governing Board and the Chairs of the DC-2017 Program Committee are pleased to announce that Sayeed Choudhury, Associate Dean for Research Data Management and Hodson Director of the Digital Research and Curation Center at the Sheridan Libraries of Johns Hopkins University, will deliver the keynote address at DC-2017 in Washington, D.C. Choudhury has oversight for data curation research and development and data archive implementation at the Sheridan Libraries at Johns Hopkins University. Choudhury is a President Obama appointee to the National Museum and Library Services Board. He is a member of the Executive Committee for the Institute of Data Intensive Engineering and Science (IDIES) based at Johns Hopkins. He is also a member of the Board of the National Information Standards Organization (NISO) and a member of the Advisory Board for OpenAIRE2020. He has been a member of the National Academies Board on Research Data and Information, the ICPSR Council, the DuraSpace Board, the Digital Library Federation advisory committee, the Library of Congress' National Digital Stewardship Alliance Coordinating Committee, the Federation of Earth Scientists Information Partnership (ESIP) Executive Committee and the Project MUSE Advisory Board. He is the recipient of the 2012 OCLC/LITA Kilgour Award. Choudhury has testified for the U.S. Research Subcommittee of the Congressional Committee on Science, Space and Technology. For additional information, see http://dcevents.dublincore.org/IntConf/index/pages/view/keynote17.

Posted at 23:59

Dublin Core Metadata Initiative: ZBW German National Library of Economics joins DCMI as Institutional Member

2017-03-24, The DCMI Governing Board is pleased to announce that ZBW German National Library of Economics has joined DCMI as an Institutional Member. ZBW Director Klaus Tochtermann will serve as the Library's representative to the Board. ZBW German National Library of Economics - Leibniz Information Centre for Economics is the world's largest research infrastructure for economic literature, online as well as offline. Its disciplinary repository EconStor provides a large collection of more than 127,000 articles and working papers in Open Access. EconBiz, the portal for international economic information, allows students and researchers to search among nine million datasets. The ZBW edits two journals in economic policy, Wirtschaftsdienst and Intereconomics, and in cooperation with the Kiel Institute for the World Economy produces the peer-reviewed journal Economics based on the principle of Open Access. For information on becoming a DCMI Institutional Member, visit the DCMI membership page at http://dublincore.org/support/.

Posted at 23:59

Dublin Core Metadata Initiative: Webinar: Nailing Jello to a Wall: Metrics, Frameworks, & Existing Work for Metadata Assessment

2017-03-24, With the increasing number of repositories, standards and resources we manage for digital libraries, there is a growing need to assess, validate and analyze our metadata - beyond our traditional approaches such as writing XSD or generating CSVs for manual review. Being able to further analyze and determine measures of metadata quality helps us better manage our data and data-driven development, particularly with the shift to Linked Open Data leading many institutions to large-scale migrations. Yet the semantically rich metadata desired by many Cultural Heritage Institutions, and the granular expectations of some of our data models, make performing assessment, much less determining quality or performing validation, that much trickier. How do we handle analysis of the rich understandings we have built into our Cultural Heritage Institutions’ metadata and enable ourselves to perform this analysis with the systems and resources we have? This webinar with Christina Harlow, Cornell University Library, sets up this question and proposes some guidelines, best practices, tools and workflows around the evaluation of metadata used by and for digital libraries and Cultural Heritage Institution repositories. The goal is for webinar participants to walk away prepared to handle their own metadata assessment needs by using existing works and being better aware of the open questions in this domain. For additional information and to register, go to http://dublincore.org/resources/training/#2017harlow.

Posted at 23:59

Leigh Dodds: What is data asymmetry?

You’ve just parked your car. Google Maps offers to

Posted at 18:01

March 23

schema.org: Schema.org 3.2 release: courses, fact-checking, digital publishing accessibility, menus and more...

Schema.org 3.2 is released! This update brings many improvements, including new vocabulary for describing courses, fact-check reviews and digital publishing accessibility, as well as a more thorough treatment of menus and a large number of pending proposals which are offered for early-access use, evaluation and improvement. We also introduce a new "hosted extension" area, iot.schema.org, which provides an entry point for schema collaborations relating to the Internet of Things field. As always, our releases page has full details.
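
For example, the new Course vocabulary can be used like any other schema.org type. Here is a minimal, purely illustrative sketch in Turtle (the course URI and the choice of properties are hypothetical, not taken from the release notes):

@prefix schema: <http://schema.org/> .

<http://example.org/courses/kg101> a schema:Course ;
    schema:name "Introduction to Knowledge Graphs" ;
    schema:courseCode "KG101" ;
    schema:provider <http://example.org/university> .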

These efforts depend entirely on a growing network of collaborations, within our own W3C Community Group and beyond. Many thanks are due to the Schema Course Extension Community Group, the IDPF's EPUB Accessibility Working Group, members of the international fact-checking network including the Duke Reporters Lab and Full Fact, the W3C Web of Things and Spatial Web initiatives, the Bioschemas project, and to the Wikidata project.

This release also provides the opportunity to thank two of our longest-serving steering group members, whose careers have moved on from the world of structured data markup. Peter Mika and Martin Hepp have both played leading roles in Schema.org since its earliest days, and the project has benefited greatly from their insight, commitment and attention to detail.

As we look towards future developments, it is worth briefly recapping how we have organized things recently. Schema.org's primary discussion forum is a W3C group, although its most detailed collaborations typically happen on GitHub, organized around specific issues and proposed changes. These discussions are open to all interested parties. Schema designs frequently draw upon related groups that have a more specific topical focus. For example, the Courses group became a hub for education/learning metadata experts from LRMI and others. This need to engage with relevant experts also motivated the creation of the "pending" area introduced in our previous release. GitHub is a site oriented towards computer programmers. By surfacing proposed, experimental and other early-access designs at pending.schema.org, we hope we can reach a wider audience who may have insight to share. With today's release, we add 14 new "pending" designs, with courses, accessibility and fact-checking markup graduating from pending into the core section of schema.org. Future releases will follow this pipeline approach, encouraging greater consistency, quality and clarity as our vocabulary continues to evolve.



Posted at 15:29

March 21

Leigh Dodds: Fearful about personal data, a personal example

I was recently at a workshop on making better use of (personal) data for the benefit of specific communities. The discussion, perhaps inevitably, ended up focusing on many of the attendees’ concerns around how data about them was being used.

The group was asked to share what made them afraid or fearful about how personal data might be misused. The examples were mainly about use of the data by Facebook, by advertisers, as surveillance, etc. There was a view that being in control of that data would remove the fear and put the individual back in control. This same argument pervades a lot of the discussion around personal data. The narrative is that if I own my data then I can decide how and where it is used.

But this overlooks the fact that data ownership is not a clear-cut thing. Multiple people might reasonably claim to have ownership over some data: for example, bank transactions between individuals.

Posted at 19:10

Gregory Williams: SPARQL Limit by Resource

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

Project(
    Filter(
        ?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
                ?rank,
                Rank,
                (),
                (DESC(?age)),
            )
        )
    ),
    {?age, ?name, ?school}
)

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.
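
To make this concrete, here is a small, purely illustrative dataset (hypothetical resources, using the same :name, :school and :age properties as the BGP above), followed by the rankings the algebra above would produce:

@prefix : <http://example.org/> .

:alice :name "Alice" ; :school :school1 ; :age 18 .
:bob   :name "Bob"   ; :school :school1 ; :age 17 .
:carol :name "Carol" ; :school :school1 ; :age 16 .
:dave  :name "Dave"  ; :school :school2 ; :age 19 .
:eve   :name "Eve"   ; :school :school2 ; :age 15 .

# Grouping by ?school and ranking by DESC(?age) yields:
#   :school1: Alice (?rank=1), Bob (?rank=2), Carol (?rank=3, removed by the filter)
#   :school2: Dave (?rank=1), Eve (?rank=2)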

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
}
GROUP BY ?school
HAVING (RANK(ORDER BY DESC(?age)) <= 2)

or:

PREFIX : <http://example.org/>
SELECT ?age ?name ?school WHERE {
    {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
        }
    }
    FILTER(?rank <= 2)
}

Posted at 03:17

March 17

Ebiquity research group UMBC: SemTk: The Semantics Toolkit from GE Global Research, 4/4

The Semantics Toolkit

Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY

rescheduled due to weather
10:30-11:30 Tuesday, 4 April 2017, ITE 346, UMBC

SemTk (Semantics Toolkit) is an open source technology stack built by GE scientists on top of W3C Semantic Web standards. It was originally conceived for data exploration and simplified query generation, and later expanded to a more general semantics abstraction platform. SemTk is made up of a Java API and microservices along with JavaScript front ends that cover drag-and-drop query generation, path finding, data ingestion and the beginnings of stored procedure support. In this talk we will give a tour of SemTk, discussing its architecture and direction, and demonstrate its features using the SPARQLGraph front end hosted at http://semtk.research.ge.com.

Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks, and big data ontologies. He is one of the creators of the open source software “Semantics Toolkit” (SemTk) which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion. Paul holds over twenty U.S. patents.

Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He has worked as a systems architect and programmer for large-scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.

Posted at 13:00

March 13

Leigh Dodds: Some tips for open data ecosystem mapping

At Open Data Camp last month I pitched to run a session on

Posted at 20:11

AKSW Group - University of Leipzig: DBpedia @ Google Summer of Code – GSoC 2017

DBpedia, one of InfAI’s community projects, will be part of the 5th Google Summer of Code program.

GSoC aims to bring students from all over the globe into open source software development. In this regard, we are calling for students to take part in the Summer of Code. During three funded months, you will work on a specific task whose results are presented at the end of the summer.

Have we sparked your interest in participating? Great, then check out the DBpedia website for further information.

Posted at 10:12

March 11

Leigh Dodds: The British Hypertextual Society (1905-2017)

With their globe-spanning satellite network nearing completion, Peter Linkage reports on some of the key milestones in the history of the British Hypertextual Society.

The British Hypertextual Society was founded in 1905 with a parliamentary grant from the Royal Society of London. At the time there was growing international interest in finding better ways to manage information, particularly scientific research. Undoubtedly the decision to invest in the creation of a British centre of expertise for knowledge organisation was also influenced by the rapid progress being made in Europe.

Posted at 11:52

March 10

Frederick Giasson: A Machine Learning Workflow

I am giving a talk (in French) at the 85th edition of the ACFAS congress, May 9. I will discuss the engineering aspects of doing machine learning. But more importantly, I will discuss how Semantic Web techniques, technologies and specifications can help solve the engineering problems and how they can be leveraged and integrated into a machine learning workflow.

The talk draws on my work in the Semantic Web field over the last 15 years and my more recent work creating the KBpedia Knowledge Graph at Cognonto, and on how both have influenced the machine learning solutions we have developed to integrate data, extend knowledge structures, and tag and disambiguate concepts and entities in corpora of text.

One thing we experienced is that most of the work involved in such projects is not directly related to machine learning problems (or at least not to the use of machine learning algorithms). I recently read a survey conducted by CrowdFlower in 2016 that supports what we experienced. They surveyed about 80 data scientists to find out “where they feel their profession is going, [and] what their day-to-day job is like.” The answers to the question “What do data scientists spend the most time doing?” were summarized in a chart (reproduced in the original post).

As the chart shows, 77% of the time spent by a data scientist goes to tasks other than machine learning algorithm selection, testing and refinement. If you read the survey, you will see that at the same time about 93% of the respondents said that these are the least enjoyable tasks, which is a bit depressing… But on the other hand, we don’t know how much they disliked these tasks.

What I find interesting in the survey is that most of these non-algorithmic tasks (again, 77% of the time!) are data manipulation tasks that need to be engineered into some process, workflow or pipeline.

To put these numbers into context, I created my own general machine learning workflow schema, the one I have used multiple times in the last few years while working on different projects for Cognonto. Depending on the task at hand, some steps may differ or be added, but the core is there. Also note that all the tasks where this machine learning workflow has been used relate to natural language processing, entity matching, and concept and entity tagging and disambiguation. (A larger version of the schema is available in the original post.)

This workflow is split into four general areas:

  1. Data Processing
  2. Training Sets creation
  3. Machine Learning Algorithms testing, evaluation and selection
  4. Deployment and A/B testing

The only “real” machine learning work happens in the top-right corner of this schema and accounts for only about 13% of the time spent by data scientists, according to the survey. All other tasks are related to data acquisition, analysis, normalization and transformation, followed by data integration and data filtering/slicing/reduction, from which we create a series of different training sets or training corpora that then lead (after proper splitting) to the training, validation and test sets.

It is only at that point that data scientists start testing algorithms to create different models, evaluate them and fine-tune the hyper-parameters. Once the best model(s) are selected, we gradually put them into production through successive steps of A/B testing.

Again, 77% of the time goes to tasks that are not about machine learning algorithms. These tasks belong to an engineered pipeline that includes ETL and A/B testing frameworks. There is nothing sexy in this reality, but if data scientists spend three quarters of their time on these tasks, then they are clearly highly important!

Note that every task with a small purple brain on it would benefit from leveraging a Knowledge Graph structure such as Cognonto’s KBpedia Knowledge Graph.

Posted at 19:25

Ebiquity research group UMBC: SemTk: The Semantics Toolkit from GE Global Research

The Semantics Toolkit

Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY

10:30-11:30 Tuesday, 14 March 2017, ITE 346, UMBC

SemTk (Semantics Toolkit) is an open source technology stack built by GE scientists on top of W3C Semantic Web standards. It was originally conceived for data exploration and simplified query generation, and later expanded to a more general semantics abstraction platform. SemTk is made up of a Java API and microservices along with JavaScript front ends that cover drag-and-drop query generation, path finding, data ingestion and the beginnings of stored procedure support. In this talk we will give a tour of SemTk, discussing its architecture and direction, and demonstrate its features using the SPARQLGraph front end hosted at http://semtk.research.ge.com.

Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks, and big data ontologies. He is one of the creators of the open source software “Semantics Toolkit” (SemTk) which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion. Paul holds over twenty U.S. patents.

Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He has worked as a systems architect and programmer for large-scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.

Posted at 13:14

AKSW Group - University of Leipzig: New GERBIL release v1.2.5 – Benchmarking entity annotation systems

Dear all,
the Smart Data Management competence center at AKSW is happy to announce GERBIL 1.2.5.

GERBIL is a general entity annotation benchmarking system and offers an easy-to-use web-based platform for the agile comparison of annotators using multiple datasets and uniform measuring approaches. To add a tool to GERBIL, all the end user has to do is provide a URL to a REST interface for the tool that abides by a given specification. The integration and benchmarking of the tool against user-specified datasets is then carried out automatically by the GERBIL platform. Currently, our platform provides results for 20 annotators and 46 datasets, with more coming.

Website: http://aksw.org/Projects/GERBIL.html
Demo: http://gerbil.aksw.org/gerbil/
GitHub page: https://github.com/AKSW/gerbil
Download: https://github.com/AKSW/gerbil/releases/tag/v1.2.5

New features include:

  • Added annotators (DoSeR, NERFGUN, PBOH, xLisa)
  • Added datasets (Derczynski, ERD14 and GERDAQ, Microposts 2015 and 2016, Ritter, Senseval 2 and 3, UMBC, WSDM 2012)
  • Introduced the RT2KB experiment type that comprises recognition and typing of entities
  • Introduced index based sameAs relation retrieval and entity checking for KBs that do not change very often (e.g., DBpedia). Downloading the indexes is optional and GERBIL can run without them (but has the same performance drawbacks as the last versions).
  • A warning should be shown in the GUI if the server is busy at the moment.
  • Implemented checks for certain datasets and annotators. If dataset files (e.g., because of licenses) or annotator API keys are missing, the affected datasets and annotators are not available in the front end.

We want to thank everyone who helped to create this release, in particular we want to thank Felix Conrads and Jonathan Huthmann. We also acknowledge support by the DIESEL, QAMEL and HOBBIT projects.

We really appreciate feedback and are open to collaborations.
If you happen to have use cases utilizing GERBIL, please contact us.

Michael and Ricardo on behalf of the GERBIL team

Posted at 10:49

March 09

AKSW Group - University of Leipzig: DBpedia Open Text Extraction Challenge – TextExt

DBpedia, a community project affiliated with the Institute for Applied Informatics (InfAI) e.V., extracts structured information from Wikipedia & Wikidata. DBpedia has now started the DBpedia Open Text Extraction Challenge – TextExt. The aim is to increase the amount of structured DBpedia/Wikipedia data and to provide a platform for benchmarking various extraction tools. DBpedia wants to polish the knowledge of Wikipedia and then spread it on the web, free and open for any IT users and businesses.

Procedure

Compared to other challenges, which are often just one-time calls, TextExt is a continuous challenge focused on lasting progress and exceeding limits in a systematic way. DBpedia provides the extracted and cleaned full text for all Wikipedia articles from 9 different languages at regular intervals, for download and as a Docker image, in the machine-readable NIF-RDF format (example for Barack Obama in English). Challenge participants are asked to wrap their NLP and extraction engines in Docker images and submit them to the DBpedia team. The DBpedia team will run participants’ tools at regular intervals in order to extract:

  1. Facts, relations, events, terminology, ontologies as RDF triples (Triple track; illustrated below)
  2. Useful NLP annotations such as pos-tags, dependencies, co-reference (Annotation track)
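
For the Triple track, the expected kind of output is ordinary RDF statements about entities and facts found in the article text. As a rough, purely illustrative sketch (the triples below simply use standard DBpedia namespaces and are not a challenge requirement), an extractor processing the Barack Obama article might emit:

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Barack_Obama dbo:birthPlace dbr:Honolulu ;
                 dbo:almaMater  dbr:Harvard_Law_School .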

DBpedia allows submissions 2 months prior to selected conferences (currently http://ldk2017.org/ and http://2017.semantics.cc/ ). Participants that fulfil the technical requirements and provide a sufficient description will be able to present at the conference and be included in the yearly proceedings. For each conference, the challenge committee will select a winner among the challenge participants, who will receive 1,000 €.

Results

Starting in December 2017, DBpedia will publish a summary article and proceedings of participants’ submissions at http://ceur-ws.org/ every year.

For further news and next events please have a look at http://wiki.dbpedia.org/textext or contact DBpedia via email dbpedia-textext-challenge@infai.org.

The project was created with the support of the H2020 EU project HOBBIT (GA-688227) and ALIGNED (GA-644055) as well as the BMWi project Smart Data Web (GA-01MD15010B)

Challenge Committee

  • Sebastian Hellmann, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig
  • Sören Auer, Fraunhofer IAIS, University of Bonn
  • Ricardo Usbeck, AKSW, Simba Competence Center, Leipzig University
  • Dimitris Kontokostas, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig
  • Sandro Coelho, AKSW, DBpedia Association, KILT Competence Center, InfAI e.V., Leipzig

 

Posted at 11:15

March 07

Leigh Dodds: Designing CSV files

A couple of the projects I’m involved with at the moment are at a stage where there’s some thinking going on around how to best provide CSV files for users. This has left me thinking about what options we actually have when it comes to designing a CSV file format.

CSV is a very useful, but pretty mundane format. I suspect many of us don’t really think very much about how to organise our CSV files. It’s just a table, right? What decisions do we need to make?

But there are actually quite a few different options we have that might make a specific CSV format more or less suited for specific audiences. So I thought I’d write down some of the options that occurred to me. It might be useful input into both my current projects and future work on standard formats.

Starting from the “outside in”, we have decisions to make about all of the following:

File naming

How are you going to name your CSV file? A good file naming convention can help ensure that a data file has an unambiguous name within a data package or after a user has downloaded it.

Including a name, timestamp or other version indicator will avoid clobbering existing files if a user is archiving or regularly collecting data.

Adopting a similar policy to generating

Posted at 22:22

March 04

Ebiquity research group UMBC: SADL: Semantic Application Design Language

SADL – Semantic Application Design Language

Dr. Andrew W. Crapo
GE Global Research

 10:00 Tuesday, 7 March 2017

The Web Ontology Language (OWL) has gained considerable acceptance over the past decade. Building on prior work in Description Logics, OWL has sufficient expressivity to be useful in many modeling applications. However, its various serializations do not seem intuitive to subject matter experts in many domains of interest to GE. Consequently, we have developed a controlled-English language and development environment that attempts to make OWL plus rules more accessible to those with knowledge to share but limited interest in studying formal representations. The result is the Semantic Application Design Language (SADL). This talk will review the foundational underpinnings of OWL and introduce the SADL constructs meant to capture, validate, and maintain semantic models over their lifecycle.

 

Dr. Crapo has been part of GE’s Global Research staff for over 35 years. As an Information Scientist he has built performance and diagnostic models of mechanical, chemical, and electrical systems, and has specialized in human-computer interfaces, decision support systems, machine reasoning and learning, and semantic representation and modeling. His work has included a graphical expert system language (GEN-X), a graphical environment for procedural programming (Fuselet Development Environment), and a semantic-model-driven user-interface for decision support systems (ACUITy). Most recently Andy has been active in developing the Semantic Application Design Language (SADL), enabling GE to leverage worldwide advances and emerging standards in semantic technology and bring them to bear on diverse problems from equipment maintenance optimization to information security.

Posted at 14:19

March 03

Dublin Core Metadata Initiative: Webinar: Data on the Web Best Practices: Challenges and Benefits

2017-03-03, There is a growing interest in the publication and consumption of data on the Web. Government and non-governmental organizations already provide a variety of data on the Web, some open, others with access restrictions, covering a variety of domains such as education, economics, e-commerce and scientific data. Developers, journalists, and others manipulate this data to create visualizations and perform data analysis. Experience in this area reveals that a number of important issues need to be addressed in order to meet the requirements of both publishers and data consumers.

In this webinar, Bernadette Farias Lóscio, Caroline Burle dos Santos Guimarães and Newton Calegari will discuss the key challenges faced by publishers and data consumers when sharing data on the Web. We will also introduce the W3C Best Practices set (https://www.w3.org/TR/dwbp/) to address these challenges. Finally, we will discuss the benefits of engaging data publishers in the use of Best Practices, as well as improving the way data sets are made available on the Web. The webinar will be presented on two separate dates, once in Portuguese (30 March) and again in English (6 April).

For additional information and to register for either the Portuguese or English version, visit the webinar's webpage at http://dublincore.org/resources/training/#2017DataBP. Registration is managed by DCMI's partner ASIS&T.

Posted at 23:59

Dublin Core Metadata Initiative: DC-2017 Call for Participation published

2017-03-03, The DC-2017 Call for Participation (CfP) has been published. DC-2017 will take place in Washington, D.C. and will be collocated with the ASIST Annual Meeting. The theme of DC-2017 is "Advancing metadata practice: Quality, Openness, Interoperability". The conference program will include peer reviewed papers, project reports, and poster tracks. In addition, an array of presentations, panels, tutorials and workshops will round out the program. The Conference Committee is seeking submissions in all tracks. The CfP is available at http://dcevents.dublincore.org/index.php/IntConf/dc-2017/schedConf/cfp.

Posted at 23:59

Libby Miller: Libbybot – a presence robot with Chromium 51, Raspberry Pi and RTCMultiConnection for WebRTC

I’ve been working on a

Posted at 14:23

Libby Miller: Immutable preferences, economics, social media and algorithmic recommendations

One of the things that encouraged me to leave economics after doing a PhD was that – at the time, and still in textbook microeconomics – the model of a person was so basic it could not encompass wants and needs that change.

You, to an economist, usually look like this:

“Indifference Curves” over two goods by

Posted at 13:22

February 27

Frederick Giasson: KBpedia Knowledge Graph 1.40: Extended Using Machine Learning

I am proud to announce the immediate release of the KBpedia Knowledge Graph version 1.40. This new version of the knowledge graph includes 53,739 concepts, which is 14,687 more than the previous version. It also includes 251,848 new alternative labels for 20,538 concepts that already existed in version 1.20, and 542 new definitions.

This new version of KBpedia will have an impact on multiple knowledge graph related tasks, such as concept and entity tagging, and on most of the existing Cognonto use cases. I will be discussing these updates and their effects on the use cases in a forthcoming series of blog posts.

But the key topic of this blog post is this: How have we been able to increase the coverage of the KBpedia Knowledge Graph by 37.6% while keeping it consistent (that is, there are no contradictory facts) and satisfiable (that is, no candidate addition violates any existing class disjointness assertions), all within roughly a single month of FTE effort?

Reciprocal Mapping: Leverage Linkages

At the core of the KBpedia Knowledge Graph are hundreds of thousands of class links between the KBpedia concepts and external core and extended data sources. To generate the initial mappings, we use the Cognonto Mapper to find potential linkage matches between the tens of thousands of KBpedia concepts and the millions of entities that exist in its external sources. We then vet the narrowed candidate pool of assignments by hand.

Then, through reciprocal mapping (see related article), we leverage these initial mappings: 1) to find more alternative labels and definitions for existing KBpedia concepts; and, more importantly, 2) to extend the scope and coverage of the KBpedia Knowledge Graph structure.

Extending KBpedia’s Coverage

Extending the scope and the coverage of a knowledge graph structure that contains tens of thousands of consistent and satisfiable classes is not a simple thing to do, particularly when we want to improve its coverage by more than 37% while trying to keep it consistent and satisfiable.

We have been able to achieve these aims with this new version 1.40 of the KBpedia Knowledge Graph by:

  1. Leveraging existing linkages between KBpedia concepts and Wikipedia categories
  2. Leveraging the inner structure of KBpedia using graph embeddings.

The Wikipedia Categories structure is a more-or-less consistent taxonomy used to categorize Wikipedia pages. Its structure is quite dissimilar to the KBpedia Knowledge Graph structure. However, even though the structures differ, the categories themselves can be used to extend knowledge areas not currently covered in KBpedia.

As explained in the Cognonto use case on Extending KBpedia With Wikipedia Categories, we first created graph embeddings for each of the Wikipedia categories. Then we created a classifier where the positive training examples are the graph embeddings of Wikipedia categories already linked to KBpedia concepts, and the negative training examples are the graph embeddings of other Wikipedia categories known to be bad candidates for new KBpedia concepts.

Once the classifier is trained, we classify every sub-category of Wikipedia categories linked to KBpedia using that model. When the classification is done, a person reviews all of the positive classifications to determine the final candidates that will become new KBpedia concepts.

Once we have the list of vetted KBpedia concepts that we want to add to the core structure, we then use the KBpedia Generator to create a new KBpedia structure and to make sure that all the facts we have added to the Knowledge Graph are consistent and satisfiable. Inconsistencies and unsatisfiable class issues are fixed in an iterative process until the entire KBpedia Knowledge Graph structure is fully consistent and satisfiable with prior knowledge. It is only at this point that we can release the new version of KBpedia. Without the KBpedia Generator and its reasoning capabilities to check the consistency and the satisfiability of the knowledge structure, we simply could not extend the structure without adding hundreds of inconsistency and unsatisfiability issues. The number of relationships between all the concepts is simply too big to understand all its ramifications simply by looking at it.

We estimate that this semi-automated process takes about 5% of the time it would normally take to conduct this entire process by comparable manual means. We know, since we have been doing the manual approach for nearly a decade.

Adding Alternative Labels and Definitions

Another thing we gain by leveraging the external linkages of KBpedia is that we can use them to extend the descriptions of existing KBpedia concepts, since all the linkages are of high quality, both because they have been reviewed by a human and because they have been made consistent and satisfiable by the generation framework.

Reciprocal mapping shows how we can leverage the Wikipedia page linkages. Via this rather efficient method, we also added 251,848 new alternative labels and 542 new definitions for 20,538 concepts that already existed in version 1.20. Adding that many alternative labels to the knowledge graph greatly improves the coverage of the Cognonto conceptual tagger by adding 251,848 previously unknown surface forms.

Posted at 20:15

February 26

Bob DuCharme: Getting to know Wikidata

First (SPARQL-oriented) steps.

Posted at 15:23

February 24

AKSW Group - University of Leipzig: The USPTO Linked Patent Dataset release

Dear all,

We are happy to announce the USPTO Linked Patent Dataset release.

Patents are widely used to protect intellectual property and as a measure of innovation output. Each year, the USPTO grants over 150,000 patents to individuals and companies all over the world. In fact, there were more than 200,000 patent grants issued in the US in 2013. However, accessing, searching and analyzing those patents is often still cumbersome and inefficient.

Our dataset is the output of converting USPTO XML patent data from the years 2002–2016 into RDF. This supports integration with other data sources in order to further simplify use cases such as trend analysis, structured patent search & exploration, and societal progress measurements.

The USPTO Linked Patent Dataset contains 13,014,651 entities, of which 2,355,579 are patents. Other entities represent applicants, inventors, agents, examiners (primary and secondary), and assignees. All these entities amount to ca. 168 million triples describing the patent information.

The complete description of the dataset and SPARQL endpoint are available on the DataHub: https://datahub.io/dataset/linked-uspto-patent-data.
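
As a rough sketch of the kind of question the endpoint can answer, consider counting 2013 grants per inventor. Note that the class and property names below (uspto:Patent, uspto:inventor, uspto:grantYear) are hypothetical placeholders, not the dataset's actual vocabulary; consult the DataHub description for the real URIs:

PREFIX uspto: <http://example.org/uspto/>   # hypothetical namespace

SELECT ?inventor (COUNT(?patent) AS ?grants)
WHERE {
  ?patent a uspto:Patent ;            # hypothetical class
          uspto:inventor ?inventor ;  # hypothetical property
          uspto:grantYear 2013 .      # hypothetical property
}
GROUP BY ?inventor
ORDER BY DESC(?grants)
LIMIT 10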

We really appreciate feedback and are open to collaborations.
If you happen to have use cases utilizing this dataset, please contact us.

 

Posted at 16:18

February 22

AKSW Group - University of Leipzig: Two accepted papers in ESWC 2017

Hello Community! We are very pleased to announce the acceptance of two papers in the ESWC 2017 research track. ESWC 2017 will be held in Portoroz, Slovenia, from the 28th of May to the 1st of June. In more detail, we will present the following papers:

  1. “WOMBAT – A Generalization Approach for Automatic Link Discovery” Mohamed Ahmed Sherif, Axel-Cyrille Ngonga Ngomo, Jens Lehmann

    Abstract. A significant portion of the evolution of Linked Data datasets lies in updating the links to other datasets. An important challenge when aiming to update these links automatically under the open-world assumption is the fact that usually only positive examples for the links exist. We address this challenge by presenting and evaluating WOMBAT , a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. WOMBAT is based on generalisation via an upward refinement operator to traverse the space of link specification. We study the theoretical characteristics of WOMBAT and evaluate it on 8 different benchmark datasets. Our evaluation suggests that WOMBAT outperforms state-of-the-art supervised approaches while relying on less information. Moreover, our evaluation suggests that WOMBAT’s pruning algorithm allows it to scale well even on large datasets.

  2. “All That Glitters is not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking” Kunal Jha, Michael Röder and Axel-Cyrille Ngonga Ngomo

    Abstract. The evaluation of Named Entity Recognition as well as Entity Linking systems is mostly based on manually created gold standards. However, the current gold standards have three main drawbacks. First, they do not share a common set of rules pertaining to what is to be marked and linked as an entity. Moreover, most of the gold standards have not been checked by other researchers after they have been published and hence commonly contain mistakes. Finally, they lack actuality as in most cases the reference knowledge base used to link the entities has been refined over time while the gold standards are typically not updated to the newest version of the reference knowledge base. In this work, we analyze existing gold standards and derive a set of rules for annotating documents for named entity recognition and entity linking. We derive Eaglet, a tool that supports the semi-automatic checking of a gold standard based on these rules. A manual evaluation of Eaglet’s results shows that it achieves an accuracy of up to 88% when detecting errors. We apply Eaglet to 13 gold standards and detect 38,453 errors. An evaluation of 10 tools on a subset of these datasets shows a performance difference of up to 10% micro F-measure on average.

 

Acknowledgments
This work has been supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation action SLIPO (GA no. 731581), the BMWI Project SAKE (project no. 01MD15006E), the BMBF project DIESEL (project no. 01QE1512C) and the BMWI Project GEISER (project no. 01MD16014).

Posted at 16:43

Leigh Dodds: Open Data Camp Pitch: Mapping data ecosystems

I’m going to

Posted at 10:39

Copyright of the postings is owned by the original blog authors. Contact us.