It's triples all the way down
… or, the lack of it.
A recent discussion at a customer made me having a closer look around support for encryption in the context of XaaS cloud service offerings as well as concerning Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather seldom to find.
Different reasons might exits why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is do systems support this (transparently) or are developers forced to code this in the application logic.
IaaS-level. Especially in this category, file storage for app development, one would expect wide support for built-in encryption.
On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3) and concerning Google’s App Engine, good practices for data encryption only seem to emerge.
Offerings on the SaaS level provide an equally poor picture:
In Hadoop-land things also look rather sobering; there are few activities around making HDFS or the likes do encryption such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.
Posted at 20:12
function getCalaisResult($id, $text) { $parms = ' <c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="XML/RDF" c:calculateRelevanceScore="true" c:enableMetadataType="SocialTags" c:docRDFaccessible="false" c:omitOutputtingOriginalText="true" ></c:processingDirectives> <c:userDirectives c:allowDistribution="false" c:allowSearch="false" c:externalID="' . $id . '" c:submitter="http://semsol.com/" ></c:userDirectives> <c:externalMetadata></c:externalMetadata> </c:params> '; $args = array( 'licenseID' => $this->a ['calais_key'], 'content' => urlencode($text), 'paramsXML' => urlencode(trim($parms)) ); $qs = substr($this->qs($args), 1); $url = 'http://api.opencalais.com/enlighten/rest/'; return $this->getAPIResult($url, $qs); }
function getZemantaResult($id, $text) { $args = array( 'method' => 'zemanta.suggest', 'api_key' => $this->a ['zemanta_key'], 'text' => urlencode($text), 'format' => 'rdfxml', 'return_rdf_links' => '1', 'return_articles' => '0', 'return_categories' => '0', 'return_images' => '0', 'emphasis' => '0', ); $qs = substr($this->qs($args), 1); $url = 'http://api.zemanta.com/services/rest/0.0/'; return $this->getAPIResult($url, $qs); }
function getAPIResult($url, $qs) { ARC2::inc('Reader'); $reader = new ARC2_Reader($this->a, $this); $reader->setHTTPMethod('POST'); $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded"); $reader->setMessageBody($qs); $reader->activate($url); $r = ''; while ($d = $reader->readStream()) { $r .= $d; } $reader->closeStream(); return $r; }
SELECT DISTINCT ?id ?obj ?cnf ?name FROM <' . $g . '> WHERE { ?rec a z:Recognition ; z:object ?obj ; z:confidence ?cnf . ?obj z:target ?id . ?id z:targetType <http://s.zemanta.com/targets#rdf> ; z:title ?name . FILTER(?cnf >= 0.4) } ORDER BY ?id
Posted at 09:09
Posted at 09:09
Dear all,
We are proud to announce that we will organize an online tutorial at the Web Conference on 25th of April 2022. A particular focus will be put on the DBpedia Infrastructure, i.e. DBpedia’s Databus publishing platform and the associated DBpedia services (Spotlight, Lookup and the DBpedia endpoints). In practical examples we will illustrate the potential and the benefit of using DBpedia in the context of the Web of Data.
We will organize a DBpedia Knowledge Graph hands-on tutorial. Although the tutorial will be shaped in a way that no specific prerequisites will be required, the participants would benefit if they have some background knowledge in Semantic Web concepts and technologies (RDF, OWL, SPARQL), general overview of the Web Architecture (HTTP, URI, JSON, etc.) and basic programming skills (bash, Java, JavaScript). The online tutorial will be 90 minutes.
Please register at the Web Conference page to be part of the masterclass. You need to buy a full access pass to join the DBpedia Tutorial.
We are looking forward to meeting you online!
Kind regards,
Julia
on behalf of the DBpedia Association
Posted at 13:05
We had the official book launch of “The Social Semantic Web” last month in the President’s Drawing Room at NUI Galway. The book was officially launched by Dr. James J. Browne, President of NUI Galway. The book was authored by myself, Dr. Alexandre Passant and Prof. Stefan Decker from the Digital Enterprise Research Institute at NUI Galway (sponsored by SFI). Here is a short blurb:
Web 2.0, a platform where people are connecting through their shared objects of interest, is encountering boundaries in the areas of information integration, portability, search, and demanding tasks like querying. The Semantic Web is an ideal platform for interlinking and performing operations on the diverse data available from Web 2.0, and has produced a variety of approaches to overcome limitations with Web 2.0. In this book, Breslin et al. describe some of the applications of Semantic Web technologies to Web 2.0. The book is intended for professionals, researchers, graduates, practitioners and developers.
Some photographs from the launch event are below.
Posted at 02:05
Uldis Bojars submitted his PhD thesis entitled “The SIOC MEthodology for Lightweight Ontology Development” to the University in September 2009. We had a nice night out to celebrate in one of our favourite haunts, Oscars Bistro.
This was followed by a successful defense at the end of November 2009. The examiners were Chris Bizer and Stefan Decker. Uldis even wore a suit for the event, see below.
Uldis established a formal ontology design process called the SIOC MEthodology, based on an evolution of existing methodologies that have been streamlined, experience developing the SIOC ontology, and observations regarding the development of lightweight ontologies on the Web. Ontology promotion and dissemination is established as a core part of the ontology development process. To demonstrate the usage of the SIOC MEthodology, Uldis described the SIOC project case study which brings together the Social Web and the Semantic Web by providing semantic interoperability between social websites. This framework allows data to be exported, aggregated and consumed from social websites using the SIOC ontology (in the SIOC application food chain). Uldis’ research work has been published in 4 journal articles, 8 conference papers, 13 workshop papers, and 1 book chapter. The SIOC framework has also been adopted in 33 third-party applications. The Semantic Radar tool he initiated for Firefox has been downloaded 24,000 times. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).
We wish Uldis all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.
Posted at 02:05
This is a few months late but better late then never! We said goodbye to PhD researcher Haklae Kim in May of this year when he returned to Korea and took up a position with Samsung Electronics soon afterward. We had a nice going away lunch for Haklae with the rest of the team from the Social Software Unit (picture below).
Haklae returned to Galway in September to defend his PhD entitled “Leveraging a Semantic Framework for Augmenting Social Tagging Practices in Heterogeneous Content Sharing Platforms”. The examiners were Stefan Decker, Tom Gruber and Philippe Laublet. Haklae successfully defended his thesis during the viva, and he will be awarded his PhD in 2010. We got a nice photo of the examiners during the viva which was conducted via Cisco Telepresence, with Stefan (in Galway) “resting” his hand on Tom’s shoulder (in San Jose)!
Haklae created a formal model called SCOT (Social Semantic Cloud of Tags) that can semantically describe tagging activities. The SCOT ontology provides enhanced features for representing tagging and folksonomies. This model can be used for sharing and exchanging tagging data across different platforms. To demonstrate the usage of SCOT, Haklae developed the int.ere.st open tagging platform that combined techniques from both the Social Web and the Semantic Web. The SCOT model also provides benefits for constructing social networks. Haklae’s work allows the discovery of social relationships by analysing tagging practices in SCOT metadata. He performed these analyses using both Formal Concept Analysis and tag clustering algorithms. The SCOT model has also been adopted in six applications (OpenLink Virtuoso, SPARCool, RelaxSEO, RDFa on Rails, OpenRDF, SCAN), and the int.ere.st service has 1,200 registered members. Haklae’s research work was published in 2 journal articles, 15 conference papers, 3 workshop papers, and 2 book chapters. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).
We wish Haklae all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.
Posted at 02:05
BlogTalk 2009
The 6th International Conf. on Social Software
September 1st and 2nd, 2009
Jeju Island, Korea
Overview
Following the international success of the last five BlogTalk events, the next BlogTalk – to be held in Jeju Island, Korea on September 1st and 2nd, 2009 – is continuing with its focus on social software, while remaining committed to the diverse cultures, practices and tools of our emerging networked society. The conference (which this year will be co-located with Lift Asia 09) is designed to maintain a sustainable dialog between developers, innovative academics and scholars who study social software and social media, practitioners and administrators in corporate and educational settings, and other general members of the social software and social media communities.
We invite you to submit a proposal for presentation at the BlogTalk 2009 conference. Possible areas include, but are not limited to:
Participants and proposal categories
Due to the interdisciplinary nature of the conference, audiences will come from different fields of practice and will have different professional backgrounds. We strongly encourage proposals to bridge these cultural differences and to be understandable for all groups alike. Along those lines, we will offer three different submission categories:
For academics, BlogTalk is an ideal conference for presenting and exchanging research work from current and future social software projects at an international level. For developers, the conference is a great opportunity to fly ideas, visions and prototypes in front of a distinguished audience of peers, to discuss, to link-up and to learn (developers may choose to give a practical demonstration rather than a formal presentation if they so wish). For practitioners, this is a venue to discuss use cases for social software and social media, and to report on any results you may have with like-minded individuals.
Submitting your proposals
You must submit a one-page abstract of the work you intend to present for review purposes (not to exceed 600 words). Please upload your submission along with some personal information using the EasyChair conference area for BlogTalk 2009. You will receive a confirmation of the arrival of your submission immediately. The submission deadline is June 27th, 2009.
Following notification of acceptance, you will be invited to submit a short or long paper (four or eight pages respectively) for the conference proceedings. BlogTalk is a peer-reviewed conference.
Timeline and important dates
(Due to the tight schedule we expect that there will be no deadline extension. As with previous BlogTalk conferences, we will work hard to endow a fund for supporting travel costs. As soon as we review all of the papers we will be able to announce more details.)
Topics
Application Portability Bookmarking Business Categorisation Collaboration Content Sharing Data Acquisition Data Mining Data Portability Digital Rights Education Enterprise Ethnography Folksonomies and Tagging Human Computer Interaction Identity Microblogging Mobile Multimedia |
Podcasting Politics Portals Psychology Recommender Systems RSS and Syndication Search Semantic Web Social Media Social Networks Social Software Transparency and Openness Trend Analysis Trust and Reputation Virtual Worlds Web 2.0 Weblogs Wikis |
Posted at 02:05
|
Figure 2 of
the AOP paper showing the content of the AOP-DB. |
Skin lipids
The first such papers is a review of the state of studying lipids in our skin: Research Techniques Made Simple: Lipidomic Analysis in Skin Research (doi:10.1016/j.jid.2021.09.017). This paper originates from WG4 of the EpiLipidNet project (funded by COST) where we are developing molecular pathways involving lipids. Florian Gruber is chair of WG4 and our role is the pathways. We have set up lipids.wikipathways.org for this and the article further mentions the pathway and network system biology approaches used in our group.
Adverse Outcome Pathways
Combining Adverse Outcome Pathways (AOPs) with molecular pathways is one of the longer running research lines in our group. This already started during the eNanoMapper project and Marvin has been studying alternative approaches in his PhD projected funded by OpenRiskNet and EU-ToxRisk. During OpenRiskNet he collaborated with the team of Holly Mortensen of the USA EPA resulting in some collaborative projects. One outcome is the recent EPA paper on an RDF version of their AOP-DB: The AOP-DB RDF: Applying FAIR Principles to the Semantic Integration of AOP Data Using the Research Description Framework (doi:10.3389/ftox.2022.803983).
Thanks again to everyone involved in these papers for these nice collaborations!
Posted at 09:52
Makevember and lockdown have encouraged me to make an improved version of libbybot, which is a physical version of a person for remote participation. I’m trying to think of a better name – she’s not all about representing me, obviously, but anyone who can’t be somewhere but wants to participate. [update Jan 15: she’s now called “sock_puppet”].
This one is much, much simpler to make, thanks to the addition of a pan-tilt hat and a simpler body. It’s also more expressive thanks to these lovely little 5*5 led matrixes.
Her main feature is that – using a laptop or phone – you can see, hear and speak to people in a different physical place to you. I used to use a version of this at work to be in meetings when I was the only remote participant. That’s not much use now of course. But perhaps in the future it might make sense for some people to be remote and some present.
New recent features:
* ish
**a sock
I’m still writing docs, but the repo is here.
Posted at 19:12
This is the second in a short series of posts in which I’m sharing my notes and thoughts on a variety of different approaches for assessing data infrastructure and data institutions.
The first post in the series looked at The Principles of Open Scholarly Infrastructure.
In this post I want to take a look at the Digital Public Good (DPG) registry developed by The Digital Public Goods Alliance.
The Digital Public Goods Alliance define digital public goods as:
open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals (SDGs)
Digital Public Goods Alliance
While the links to the Sustainable Development Goals narrows the field this definition still encompasses a very diverse set of openly licensed resources.
Investing in the creation and use of DPGs was one of eight key actions in the UN Roadmap for Digital Cooperation published in 2020.
The Digital Public Goods Standard consists of 9 indicators and requirements that are used to assess whether a dataset, AI model, standard, software package or content can be considered a DPG.
To summarise, the indicators and requirements cover:
In contrast to the Principles of Open Scholarly Infrastructure, which defines principles for infrastructure services (i.e. data infrastructure and data institutions) the Digital Public Goods Standard can be viewed as focusing on the outputs of that infrastructure, e.g. the datasets that they publish or the software or standards that they produce.
But assessing a resource to determine if it is a Digital Public Good inevitably involves some consideration of the processes by which it has been produced.
A recent Rockefeller Foundation report on co-developing Digital Public Infrastructure endorsed by the Digital Public Goods Alliance, highlights that Digital Public Goods might also be used to create new digital infrastructure. E.g. by deploying open platforms in other countries or using data and AI models to build new infrastructure.
So Digital Public Goods are produced by, used by, and support the deployment of data and digital infrastructure.
The Digital Public Goods Standard was developed by the Digital Public Good Alliance (DPGA), “a multi-stakeholder initiative with a mission to accelerate the attainment of the sustainable development goals in low- and middle-income countries by facilitating the discovery, development, use of, and investment in digital public goods“
An early pilot of the standard was developed to assess Digital Public Goods focused on Early Grade Reading. The initial assessment criteria were developed by a technical group that explored cross-domain indicators and an expert group that focused on topics relevant to literacy.
This ended up covering 11 categories and 51 different indicators.
That results of that pilot was turned into the initial version of the DPG Standard and published in September 2020. In that process the 51 indicators were reduced down to just 9.
It is interesting to see what was removed, for example:
The process of engaging with domain experts has continued, with the DPGA developing Communities of Practice that have produced reports highlighting key digital public goods in specific domains. An example of what we called “data landscaping” at the ODI.
The assessment process is as follows:
While nominations can be made by third-parties, some indicators are only assessed based on evidence provided directly by the publisher of the resource.
At the time of writing there are 651 nominees and 87 assessed public goods in the registry. The list of Digital Public Goods consists of the following (items can be in multiple categories):
Category | Count |
Software | 68 |
Content | 17 |
Data | 8 |
AI Model | 4 |
Standard | 4 |
Its worth noting that several of the items in the “Data” category are actually APIs and services.
The assessment of a verified Digital Public Good is publicly included in the registry. For example here is the recently published assessment of the Mozilla Common Voice dataset. However all of the data supporting the individual nominations can be found in this public repository.
The documentation and the submission guide explain that the benefits of becoming a Digital Public Good include
Indirectly, by providing a standard for assessment, the DPGA will be influencing the process by which openly licensed resources might be created.
Is the Standard useful as a broader assessment tool, e.g. for projects that are not directly tied to the SDGs? Or for organisations looking to improve their approach to publishing open data, open source or machine-learning models?
I think the detailed submission questions provide some useful points of reflection.
But I think the process of trying to produce a single set of assessment criteria that covers data, code, AI models and content means that useful and important detail is lost.
Even trying to produce a single set of criteria for assessing (open) datasets across domains is difficult. We tried that at the ODI with Open Data Certificates. Others are trying to do this now with the FAIR data principles. You inevitably end up with language like “using appropriate best practices and standards” which is hard to assess without knowledge of the context in which data is being collected and published.
I also think the process of winnowing down the original 51 indicators to just 9 and some supporting questions, has lost some important elements. Particularly around sustainability.
Again, in Open Data Certificates, we asked questions about longer-term sustainability and access to data. This is also seems highly relevant in the context in which the DPGA are operating.
I think the standard might have been better having separate criteria for different types of resource and then directly referencing existing criteria (e.g. FAIR data assessment tools for data) or best practices (e.g. Model Cards for Model Reporting for AI), etc.
Posted at 14:07
How do we create well-designed, trustworthy, sustainable data infrastructure and institutions?
This is a question that I remain deeply interested in. Much of the freelance work I’ve been doing since leaving the ODI has been in that area. For example, I’m currently helping with a multi-year evaluation of an grant-funded data institution.
I’m particularly interested in frameworks and methodologies for assessing infrastructure and institutions. With a view towards helping them become more open, more trustworthy and more sustainable.
This is the first in a series of blog posts looking at some of this existing work.
The Principles of Open Scholarly Infrastructure (POSI) consist of 16 principles grouped into three themes: governance, sustainability and insurance.
The seven Governance principles touch on how the infrastructure will be governed and managed. These highlight the need for, e.g. stakeholder led governance, transparency, and the need to plan across the entire lifecycle of the infrastructure, including its wind-down.
The five Sustainability principles highlight the need to for revenue generation to align with the mission of the organisation, and emphases generating revenue from services rather than data. They also highlight the need to generate a surplus and finding long-term sources of revenue, rather than relying on grant funding.
The five Insurance principles centre openness: open source and open data, as well as IP issues. In short, ensuring that if the infrastructure fails (or loses the trust of its community) its core assets can be reused.
The principles were first presented in a 2015 blog post. The principles attempted to codify a set of rules and norms that was already informing the operations of CrossRef, this was prompted by an growing distrust in infrastructure services by the scholarly community.
Reliance on time-limited grant funding was impacting sustainability and reliability of services, alongside growing concerns over commercial ownership of key services.
Since then the principles have been discussed and adopted by a number of others.
POSI is intended to help guide the development and operations of infrastructure that supports any kind of scholarly activities (e.g. both research and teaching) across a range of domains (e.g. both the sciences and humanities).
This includes infrastructure services that support the management and publication of research data and metadata, scholarly research archives, identifier schemes, etc.
The FAQ highlights that the principles were also intended to help support procurement and comparison of different services.
Any set of principles will reflect the priorities of community that produced them. Care should be taken before blindly applying principles from one context to another. Some issues might be foregrounded that are less important. While other important concerns might be appropriately centred.
See my post on the FAIR data principles for more on that.
However, I think much of POSI is applicable to all types of open infrastructure. Good governance and sustainability is important in any context. Open data and open source also play a fundamental role.
However there are some elements that might not apply in all contexts, or might be presented differently. And others which might be missing. For example:
The POSI FAQ is also worth reading as it includes a number of clarifications about the scope and intent behind some of the principles.
In short, I think it would be useful to compare POSI with approaches originating in other sectors, in order to identify common themes. Within the research space, IOI is also exploring ways to assess and compare infrastructure.
A range of organisations have adopted the principles, most recently Europe PMC.
No organisation meets all of the principles.
But the intention isn’t that an organisation should comply with all of them before doing an assessment. An assessment is intended to prompt reflection and development of a plan for improvement.
There is an assumption that organisations will regularly reassess themselves against the principles.
All of the existing assessments have taken the same broad approach, replicating that used by CrossRef:
I’ve produced a public spreadsheet listing the current RAG ratings for each organisation.
To help with future assessments, I think there’s scope to:
Posted at 14:07
I recently finished reading “How to Do Nothing” by Jenny Odell. It’s a great, thought-provoking read.
Despite the title the book isn’t a treatise on disconnecting or a guide to mindfulness. It’s an exploration of attention: what is it? how do we direct it? can it be trained? And how is it hijacked by social media?
While the book makes a powerful case of the importance of stepping away from technology to reconnect with our local environments and communities, Odell’s recommendation isn’t that we should just disconnect from social media or technology. Her argument is that we need to redesign the ways that we connect with one another. Reframing rather than disengaging.
She illustrates the power of redirecting our attention with examples from birdwatching, art and music.
Having spent much of last summer, identifying the bees in my garden, I’m very aware of how a change in focus can suddenly bring a small corner of the world to life.
In her discussion of social media, Odell touches on context collapse. But she also highlights how social feeds themselves lack context. There is no organising principle of geography, theme or community to that tumble of posts. This leaves us endlessly scrolling, searching for meaning our attention (and emotions) flitting from one thing to the next.
This crystallised for me my recent frustrations with Twitter: no matter how well I curate my list of followers what I see is rarely what I’m looking for in the moment. Feeds lack structure and there’s no way real way for me to assert any control over it.
It’s also why I uninstalled Tik Tok when I realised that an endless scroll of algorithmically recommended content was the only real way to engage.
Odell argues that we need new frameworks for connecting online, touching on Community Memory and decentralised systems like Mastodon. Systems that build or provide context across communities.
One of the first articles I read after finishing “How To Do Nothing” was a post called “What using RSS feeds feels like” by Giles Turnbull. It neatly describes the flexibility that using a Feed Reader provides. They can support us in focusing our attention on the things we enjoy. While not a social space, they can connect to them.
The broader themes of “How To Do Nothing” are the importance of (re)connecting to our local communities and environment so that we can deal with the big challenges ahead of us. And perhaps as an antidote to an increasingly polarised society.
This really challenged me to reflect on what it would mean to reduce my use of social media. What would fill that space both online and off? I’m not sure of the answer to that yet.
But I feel like social spaces — at least the ones I currently use, at any rate — have become less fulfilling. Trading connectivity over context. I’d like to find some different options.
Posted at 13:05
I’ve been reading about different approaches to watermarking AI and the datasets used to train them.
This seems to be an active area of research within the machine learning community. But, of the papers I’ve looked at so far, there hasn’t been much discussion of how these techniques might be applied and what groundwork needs to be done to achieve that.
This post is intended as a brief explainer on watermarking AI and machine-learning datasets. It also includes some potential use cases for how these might be applied in real-world systems to help build trust, e.g. by supporting audits or other assurance activities.
Please leave any comments or corrections if you think I’ve misrepresented something.
There are techniques that allow watermarking of digital objects, like images, audio clips and video. It’s also been applied to data.
Sometimes the watermark is visible. Sometimes it’s not.
Watermarking is frequently used as part of rights management, e.g. to track the provenance of an image to support copyright claims.
There are sophisticated techniques to apply hidden watermarks to digital objects in ways that resist attempts to remove them.
Watermarking involves add message, logo, signature or some other data to something in order to determine its origin. There’s a very long history of watermarking physical objects like bank notes and postage stamps.
Researchers are currently exploring ways to apply watermarking techniques to machine-learning models and the data used to produce them.
There are broadly two approaches:
The are various ways of implementing and using these approaches.
For example, a model can be watermarked by:
Dataset watermarking assumes that the publisher of the dataset isn’t involved in the later training of an AI. So it relies on adjusting the training dataset only. It’s a way of finding out how a model was produced. Whereas model watermarking allows the detection of a model when it is deployed.
Dataset watermarking requires new techniques to be developed because existing watermarking approaches don’t work in a machine-learning context.
For example, when training a image classification model, any watermarks present in the training images will just be discarded. The watermarks are not relevant to the learning process. To be useful, watermarking a machine-learning dataset, involves modifying the data in ways that are consistent with the labelling, so that they induce changes in the model that can later be generated.
In this context then, dataset watermarking is a technique that is specifically intended to apply to machine-learning datasets: labelled dataset intended to be used in machine-learning applications and research. It’s not a technique you would just apply to a random dataset published to a government data portal.
Checking whether a model is watermarked, e.g. to determine where the model came from or whether it was trained on a specific dataset can be done without having direct access to the model.
Importantly, it’s possible to verify either a model or dataset watermark without having direct access to the model. The watermark can be checked by inspecting the output of the model in response to specific inputs that are designed to expose it.
If you want a more technical introduction to model and dataset watermarking, then I recommend starting with these papers:
These techniques are related to areas like “data poisoning” (modifying training data to cause defects in a model) and “membership inference” (determining whether some sample data was used to train a model as a privacy measure).
In their blog introducing the concept of “Radioactive data” the Facebook research team suggest that the technique:
“…can help researchers and engineers to keep track of which dataset was used to train a model so they can better understand how various datasets affect the performance of different neural networks.”
Using ‘radioactive data’ to detect if a dataset was used for training
They later expand on this a little:
Techniques such as radioactive data can also help researchers and engineers better understand how others in the field are training their models. This can help detect potential bias in those models, for example. Radioactive data could also help protect against the misuse of particular datasets in machine learning.
Using ‘radioactive data’ to detect if a dataset was used for training
The paper on “Open Source Dataset Protection” suggests it would be a useful way to confirm that commercial AI models have not been trained on datasets that are only licensed for academic or educational use.
I’ve yet to see a more specific set of use cases, so I’d like to suggest some.
I think all of the below are potential uses that align with the capabilities of the existing techniques. Future research might open up other potential uses.
Model watermarking
Dataset watermarking
There’s undoubtedly a lot more of these. In general watermarking can help us determine what model is being used in a service and what dataset(s) were used in training them.
For some of the use cases, there are likely to be other ways to achieve the same goal. Regulators could directly require companies to inform them about sources of data they are using, rather than independently checking for watermarks.
But sometimes you need multiple approaches to help build trust. And I wanted to flesh out a list of possible users that were outside of research and which were not about IP enforcement.
Assuming that these types of watermarking techniques are useful in the ways I’ve suggested, then there are a range of things required to build a proper ecosystem around them.
All of the hard work to create the appropriate standards, governance and supporting infrastructure to make them work.
This includes:
Posted at 16:05
I like reading old magazines and books over at the Internet Archive.
They’ve got a great online reader that works just fine in the browser. But sometimes I want a local copy I can put on my tablet or other device. And reading locally saves them some bandwidth.
Downloading individual items is simple, but it can be tedious to grab multiple items. So here’s a quick tutorial on automatically downloading items in bulk and then doing something with them.
The Archive provide an open API to their collections and a command-line tool that uses that API to let you access metadata, and upload and download content.
The Getting Started guide has plenty of examples and installation instructions for Unix systems. I also found some Windows instructions.
The Archive organises items into “collections”. The issues of a magazine will be organised into a single collection.
There are also collections of collections. The Magazine Rack collection is a great entry point into a whole range of magazine collections, so its a good starting point to explore if you want to see what the Archive currently holds.
To download every issue of a magazine you just need to first identify the name of its collection.
The easiest way to do that is to take the identifier from the URL. E.g. the INPUT magazine collection, has the following URL:
https://archive.org/details/inputmagazine
The identifier is the last part of the URL
(“inputmagazine
“).
You can also click on the “About” section of the collection and
look at its metadata. The Identifier
is listed
there.
Assuming you have the ia
tool installed, the
following command-line will let you download the contents of a
named collection. Just change the identifier from
“inputmagazine
” to the one you want.
ia download --search 'collection:inputmagazine' --glob="*.pdf"
The “glob
” parameter asks the tool to only download
PDFs (“files that have a .pdf
extension”). If you
don’t do this then you will end up downloading every format that
the Archive holds for every item. You almost certainly don’t want
that that. Its slow and uses up bandwidth.
If you’re downloading to put the content on an ereader or
kindle, then you could use “*.epub
” or
“.mobi
” instead.
When downloading the files, the ia
tool will put
each one into a separate folder.
And that’s it: you now have you own local copy of all the magazines. You can use the same approach to download any type of content, not just magazines.
Now to do something with them.
Magazines usually have great cover art. I like turning them into animated GIFs. Here’s one I made for Science for the People. And another for INPUT magazine.
To do that you need to do two things:
To extract the images, I use the pdftoppm tool. This is also cross-platform so should work on any system.
The following command will extract the first page of a file
called example.pdf
and save it into a new file called
example-01.jpg
.
pdftoppm -f 1 -l 1 -jpeg -r 300 example.pdf example
See the documentation page for more information on the parameters.
Having downloaded an entire collection using the ia tool, you will have a set of folders each containing a single PDF file. Here’s a quick bash script that you can run from the folder where you downloaded the content.
It will find every PDF you’ve downloaded, then use
pdftoppm
to extract the first page, storing the images
in a separate “images” directory.
#!/bin/bash
mkdir -p images
i=0
for FILE in **/*.pdf;
do
i=$((i + 1))
echo $FILE;
pdftoppm -f 1 -l 1 -jpeg -r 300 $FILE images/issue-$i
done
Finally, to create a GIF from those JPEG files I use the
ImagicMagick convert
tool.
If you create an animated GIF from a lot of files, especially at higher-resolutions, then you’re going to end up with a very large file size. So I resize the images when creating the GIF.
The following command will find all the images we’ve just created, resize them by 25% and turn them into a GIF that will loop forever.
convert -resize 25% -loop 0 `ls -v images/*.jpg` input.gif
The ls
command gives us a listing of the image
files in a natural order. So we get the issues appearing in the
expected sequence.
You can add other options to
tweak the delay
between frames of the GIF. Or
change the loop
variable to only do a fixed number of
loops.
If you want to post the GIF to Twitter then there is a maximum
15MB file size via the web interface. So you may need to tweak the
resize
command further.
Happy reading and animating.
Posted at 12:05
We’re two years into the COVID-19 pandemic and I still keep having moments of “Holy shit, we’re in a global pandemic“.
We’ve all been through so many emotions. And there’s more to come. But it still seems surreal at times.
I say that not to deny or dismiss what is happening. It’s just a lot to deal with at times.
Both my daughter and wife now have COVID. They’re doing fine thankfully. We’re all isolating in separate parts of the house. We’re lucky to be able to do that. Lucky that I work from home and can look after them.
But that surreal feeling has been particularly acute this week.
Across the planet, a virus mutated. It infected someone. It’s been passed from person to person across the globe until now it’s inside the house. Inside the people who live here. At some point it will be inside me.
There’s an extra visceral feeling from knowing that the virus is inside these four walls. It feels different to knowing it is out there.
That process of transmission happens all the time. It’s how I’ve caught every cold I’ve ever had. It’s a fact of life. But, like so many other facts of our lives, it’s one that doesn’t always get the attention it deserves.
We’re lucky to have vaccines for COVID-19. We’re lucky to have been triple-vaccinated. I knew that we’d catch it at some point. I’m so glad it’s now, after we’ve been vaccinated. Many other people have not been so lucky.
Although it privilege really, rather than luck.
I assume at least some of the denial that I hear that “things will go back to normal” stems from other people also feeling that the current situation is weird, unusual, temporary. That “it will all be over soon”. But that’s not going to happen.
Things have changed now for good. I know that. Doesn’t stop me from occasionally having these “Holy shit” moments though.
Posted at 17:05
In 1984 a new magazine hit the racks in W. H Smiths: INPUT.
It offered to help you to “Learn programming for fun and the future” via a weekly course in programming and computing. It ran for a total of 52 issues.
I was 12 when it came out. And I collected every issue.
INPUT gave me my first proper introduction to programming. The ZX Spectrum user guide and the BASIC programming manual were useful references and supplied some example programmes. (I played a lot of Pangolins).
But it was INPUT that taught me more than just the basics and introduced me to a whole range of new topics.
Sadly, I got rid of my copies, along with all of my other 80s computer magazines many years ago. Happily, the full collection of INPUT is available on the Internet Archive.
I had a dig through it again recently. They covered a surprising range of material. Both simple and advanced programming, as well as how the hardware worked.
All of the code was provided for a range of desktop computers. I always found it interesting to compare the different versions of BASIC. I even read the C64 only features, which were usually showing of its more advanced features.
I never did get the hang of Assembly though.
The magazine included some longer tutorials that built up a programme or covered a topic over a number of issues.
One of my favourites started in Issue 9. A five part series on how to write text adventure games. I spent a lot of time playing at making my own games.
Looking back now, I can see that one of the biggest impacts it had on me came from a listing in Issue 2 and 3. These two articles introduced a simple filing system with basic record management and search functions. A little database to help people track their hobbies.
I used it to catalogue all the games I copied off my mates.
I made a database for my Dad to keep track of his pigeons. How many races they’d won. And which ones he’d been breeding. He never used it. But I had fun designing it.
Most of my career has involved working with data and databases. INPUT gave me not just an introduction to programming, but my first introduction to that whole topic.
There’s a line that starts with me working with that BASIC code on the ZX Spectrum which continues all the way to the present day.
During the second year of my biology degree I took a computing module. As part of that I wrote a Pascal programme to simulate the activity of restriction enzymes. It sounds fancy, but it was just string manipulation.
Looking at that code its clearly informed by what I had learned from that INPUT article. It had a simple menu system to access different functions. I even wrote some tools to help me create the data files that drove the application, so I could manage those records separately.
I had a lot of fun writing it.
I described that Pascal programme in the interview I had to take to get accepted for my Masters in Computing. It was a conversion course for people who didn’t have a Computing or Maths background. It definitely helped me get on that course.
I took all the modules about databases, obviously. As well as the networking ones. Because I was started to get interested in the web.
And then I got interested in the web as a database. And data on the web.
It’s funny how things work out.
So thanks Mum for bringing that magazine home every Thursday night. And thank you Internet Archive for making them available for me to read again.
Posted at 15:05
We (John G. Breslin and Guangyuan Piao, Unit for Social Semantics, Insight Centre for Data Analytics, NUI Galway) have created a prototype ontology for web archives based on two existing ontologies: Semantically-Interlinked Online Communities (SIOC) and the Common Data Model (CDM).
Figure 1: Initial Prototype of Web Archive Ontology, Linking to SIOC and CDM
In Figure 1, we give an initial prototype for a general web archive ontology, linked to concepts in the CDM, but allowing flexibility in terms of archiving works, media, web pages, etc. through the “Item” concept. Items are versioned and linked to each other, as well as to concepts appearing in the archived items themselves.
We have not shown the full CDM for ease of display in this document, but rather some of the more commonly used concepts. We can also map to other vocabulary terms shown in the last column of Table 1 below; some mappings and reused terms are shown in Figure 1.
Essentially, the top part of the model differentiates between the archive / storage mechanism for an item in an area (Container) on a website (Site), i.e. where it originally came from , who made it, when it was created / modified, when it was archived, the content stream, etc., and on the bottom, what the item actually is (for example, in terms of CDM, the single exemplar of the manifestation of an expression of a work).
Also, the agents who make the item and the work may differ (e.g. a bot may generate a HTML copy of a PDF publication written by Ms. Smith).
In Table 1, we list some relevant public ontologies and terms of interest. Some terms can be reused, and others can be mapped to for interoperability purposes.
Ontology Name | Overview | Why relevant? | What terms are useful? |
FRBR | For describing functional requirements for bibliographic records. | To describe bibliographic records. |
Expression Work |
FRBRoo | Express the conceptualisation of FRBR with an object-oriented methodology instead of the entity-relationship methodology, as an alternative. | In general, FRBRoo inherits all concepts of CIDOC-CRM and harmonises with it. |
ClassicalWork LegalWork ScholarlyWork Publication Expression |
BIBFrame | For describing bibliographic descriptions, both on the Web and in the broader networked world. | To represent and exchange bibliographic data. |
Work Instance Annotation Authority |
EDM | The Europeana Data Model models data in and supports functionality for Europeana, an internet portal that acts as an interface to millions of books, paintings, films, museum objects and archival records that have been digitised throughout Europe. | Complements FRBRoo with additional properties and classes. |
incorporate isDerivativeOf WebResource TimeSpan Agent Place PhysicalThing |
CIDOC- | For describing the implicit and explicit concepts and relationships used in the cultural heritage domain. | To describe cultural heritage information. |
EndofExistence Creation Time-Span |
EAC-CPF | Encoded Archival Context for Corporate Bodies, Persons and Families is used for encoding the names of creators of archival materials and related information. | Used closely in association with EAD to provide a formal method for recording the descriptions of record creators. |
lastDateTimeVerified Control Identity |
EU PO CDM | Ontology based on the FRBR model, for describing the relationships between resource types managed by the EU Publications Office and their views, according to the FRBR model. | To describe records. |
Expression Work Manifestation Agent Subject Item |
OAI-ORE | Defines standards for the description and exchange of aggregations of Web resources. | To describe relationships among resources (also used in EDM). |
aggregates Aggregation ResourceMap |
EAD | Standard used for hierarchical descriptions of archival records. | Terms are designed to describe archival records. |
audience abbreviation certainty repositorycode AcquisitionInformation ArchivalDescription |
WGS84 Geo | For describing information about spatially located things. | Terms can be used with the Place ontology for describing place information. |
lat long |
Media | For describing media resources on the Web. | To describe media contents for web archiving. |
compression format MediaType |
Places | For describing places of geographic interest. | To describe place information for events, etc. |
City Country Continent |
Event | For describing events. | To describe specific event in content. Also can be used for representing events at an administrative level. |
agent product place Agent Event |
SKOS | A common data model for sharing and linking knowledge organisation systems. | To capture similarities among ontologies and makes the relationships explicit. |
broader related semanticRelation relatedMatch Concept Collection |
SIOC | For describing social content. | Terms are general enough to be used for web archiving. |
previous_version next_version earlier_version later_version latest_version Item Container Site embed_knowledge |
Dublin Core | Provide a metadata vocabulary of “core” properties that is able to provide basic descriptive information about any kind of resource. | Fundamental terms used with other ontologies. |
creator date description identifier language publisher |
LOC METS Profile | The Metadata Encoding and Transmission Standard (METS) is a metadata standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. The METS profile expresses the requirements that a METS document must satisfy. | To describe and organise the components of a digital object. |
controlled_ vocabularies external_schema |
DCAT and DCAT-AP | A specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. | Enable the exchange of description metadata between data portals. |
downloadURL accessURL Distribution Dataset CatalogRecord |
Formex | A format for the exchange of data between the Publication Office and its contractors. In particular (but not only), it defines the logical markup for documents, which are published in the different series of the Official Journal of the European Union. | Useful for annotating archived items as well for exchange purposes. |
Archived Annotation FT Note |
ODP | Ontology describing the metadata vocabulary for the Open Data Portal of the European Union. | To describe dataset portals. |
datasetType datasetStatus accrualPeriodicity DatasetDocumentation |
LOC PREMIS | Used to describe preservation metadata. | Applicable to archives. |
ContentLocation CreatingApplication Dependency |
VIAF | Virtual International Authority File is an international service designed to provide convenient access to the world’s major name authority files (lists of names of people, organisations, places, etc. used by libraries). Enables switching of the displayed form of names to the preferred language of a web user. | Useful for linking to name authority files and helping to serve different language communities in Europe. |
AuthorityAgency NameAuthority NameAuthorityCluster |
Table 1: Relevant Ontologies and Terms
Posted at 15:05
SIOC is a Social Semantic Web project that originated at DERI, NUI Galway (funded by SFI) and which aims to interlink online communities with semantic technologies. You can read more about SIOC on the Wikipedia page for SIOC or in this paper. But in brief, SIOC provides a set of terms that describe the main concepts in social websites: posts, user accounts, thread structures, reply counts, blogs and microblogs, forums, etc. It can be used for interoperability between social websites, for augmenting search results, for data exchange, for enhanced feed readers, and more. It’s also one of the metadata formats used in the forthcoming Drupal 7 content management system, and has been deployed on hundreds of websites including Newsweek.com.
As part of our dissemination activities, I’ve tried to regularly summarise recent developments in the project so as to give an overview of what’s going on and also to help in connecting interested parties. It’s been much too long (over a year) since my last report, so this will be a long one! In reverse chronological order, here’s a list of recent applications and websites that are using SIOC:
Posted at 15:05
I was interviewed by Marie Boran from Silicon Republic recently for an interesting article she was writing entitled “Will Google Wave topple the e-mail status quo and change the way we work?“. I thought that maybe my longer answers may be of interest and am pasting them below.
Disclaimer: My knowledge of Google Wave is second hand through various videos and demonstrations I’ve seen… Also, my answers were written pretty quickly!
As someone who is both behind Ireland’s biggest online community boards.ie and a researcher at DERI on the Semantic Web, are you excited about Google Wave?
Technically, I think it’s an exciting development – commercially, it obviously provides potential for others (Google included) to set up a competing service to us (!), but I think what is good is the way it has been shown that Google Wave can integrate with existing platforms. For example, there’s a nice demo showing how Google Wave plus MediaWiki (the software that powers the Wikipedia) can be used to help editors who are simultaneously editing a wiki page. If it can be done for wikis, it could aid with lots of things relevant to online communities like boards.ie. For example, moderators could see what other moderators are online at the same time, communicate on issues such as troublesome users, posts with questionable content, and then avoid stepping on each other’s toes when dealing with issues.
Does it potential for collaborative research projects? Or is it heavyweight/serious enough?
I think it has some potential when combined with other tools that people are using already. There’s an example from SAP of Google Wave being integrated with a business process modelling application. People always seem to step back to e-mail for doing various research actions. While wikis and the like can be useful tools for quickly drafting research ideas, papers, projects, etc., there is that element of not knowing who is doing stuff at the same time as you. Just as people are using Gtalk to augment Gmail by being able to communicate in contacts in real-time when browsing e-mails, Google Wave could potentially be integrated with other platforms such as collaborative work environments, document sharing systems, etc. It may not be heavyweight enough on its own but at least it can augment what we already use.
Where does Google Wave sit in terms of the development of the Semantic Web?
I think it could be a huge source of data for the Semantic Web. What we find with various social and collaborative platforms is that people are voluntarily creating lots of useful related data about various objects (people, events, hobbies, organisations) and having a more real-time approach to creating content collaboratively will only make that source of data bigger and hopefully more interlinked. I’d hope that data from Google Wave can be made available using technologies such as SIOC from DERI, NUI Galway and the Online Presence Ontology (something we are also working on).
If we are to use Google Wave to pull in feeds from all over the Web will both RSS and widgets become sexy again?
I haven’t seen the example of Wave pulling in feeds, but in theory, what I could imagine is that real-time updating of information from various sources could allow that stream of current information to be updated, commented upon and forwarded to various other Waves in a very dynamic way. We’ve seen how Twitter has already provided some new life for RSS feeds in terms of services like Twitterfeed automatically pushing RSS updates to Twitter, and this results in some significant amounts of rebroadcasting of that content via retweets etc.
Certainly, one of the big things about Wave is its integration of various third-party widgets, and I think once it is fully launched we will see lots of cool applications building on the APIs that they provide. There’s been a few basic demonstrator gadgets shown already like polls, board games and event planning, but it’ll be the third-party ones that make good use of the real-time collaboration that will probably be the most interesting, as there’ll be many more people with ideas compared to some internal developers.
Is Wave the first serious example of a communications platform that will only be as good as the third-party developers that contribute to it?
Not really. I think that title applies to many of the communications platforms we use on the Web. Facebook was a busy service but really took off once the user-contributable applications layer was added. Drupal was obviously the work of a core group of people but again the third-party contributions outweigh those of the few that made it.
We already have e-mail and IM combined in Gmail and Google Docs covers the collaborative element so people might be thinking ‘what is so new, groundbreaking or beneficial about Wave?’ What’s your opinion on this?
Perhaps the real-time editing and updating process. Often times, it’s difficult to go back in a conversation and add to or fix something you’ve said earlier. But it’s not just a matter of rewriting the past – you can also go back and see what people said before they made an update (“rewind the Wave”).
Is Google heading towards unified communications with Wave, and is it possible that it will combine Gmail, Wave and Google Voice in the future?
I guess Wave could be one portion of a UC suite but I think the Wave idea doesn’t encompass all of the parts…
Do you think Google is looking to pull in conversations the way FriendFeed, Facebook and Twitter does? If so, will it succeed?
Yes, certainly Google have had interests in this area with their acquisition of Jaiku some time back (everyone assumed this would lead to a competitor to Twitter; most recently they made the Jaiku engine available as open source). I am not sure if Google intends to make available a single entry point to all public waves that would rival Twitter or Facebook status updates, but if so, it could be a very powerful competitor.
Is it possible that Wave will become as widely used and ubiquitous as Gmail?
It will take some critical mass to get it going, integrating it into Gmail could be a good first step.
And finally – is the game changing in your opinion?
Certainly, we’ve moved from frequently updated blogs (every few hours/days) to more frequently updated microblogs (every few minutes/seconds) to being able to not just update in real-time but go back and easily add to / update what’s been said any time in the past. People want the freshest content, and this is another step towards not just providing content that is fresh now but a way of freshening the content we’ve made in the past.
Posted at 15:05
For the past few months, there have been a variety of calls for feedback and suggestions on how the US Government can move towards becoming more open and transparent, especially in terms of their dealings with citizens and also for disseminating information about their recent financial stimulus package.
As part of this, the National Dialogue forum was set up to solicit solutions for ways of monitoring the “expenditure and use of recovery funds”. Tim Berners-Lee wrote a proposal on how linked open data could provide semantically-rich, linkable and reusable data from Recovery.gov. I also blogged about this recently, detailing some ideas for how discussions by citizens on the various uses of expenditure (represented using SIOC and FOAF) could be linked together with financial grant information (in custom vocabularies).
More recently, the Open Government Initiative solicited ideas for a government that is “more transparent, participatory, and collaborative”, and the brainstorming and discussion phases have just ended. This process is now in its third phase, where the ideas proposed to solve various challenges are to be more formally drafted in a collaborative manner.
What is surprising about this is how few submissions and contributions have been put into this third and final phase (see graph below), especially considering that there is only one week for this to be completed. Some topics have zero submissions, e.g. “Data Transparency via Data.gov: Putting More Data Online”.
This doesn’t mean that people aren’t still thinking about this. On Monday, Tim Berners-Lee published a personal draft document entitled “Putting Government Data Online“. But we need more contributions from the Linked Data community to the drafts during phase three of the Open Government Directive if we truly believe that this solution can make a difference.
(I watched it again today, and added a little speech bubble to the image below to express my delight at seeing SIOC profiles on the Linked Open Data cloud slide.)
We also have a recently-established Linked Data Research Centre at DERI in NUI Galway.
Posted at 15:05
Tim Berners-Lee recently posted an important request for the provision of Linked Open Data from the US recovery effort website Recovery.gov.
The National Dialogue website (set up to solicit ideas for data collection, storage, warehousing, analysis and visualisation; website design; waste, fraud and abuse detection; and other solutions for transparency and accountability) says that for Recovery.gov to be a useful portal for citizens, it “requires finding innovative ways to integrate, track, and display data from thousands of federal, state, and local entities”.
If you support the idea of Linked Open Data from Recovery.gov, you can have a read and provide some justifications on this thread.
(I’ve recently given some initial ideas about how grant feed data could be linked with user contributions in the form of associated threaded discussions on different topics, see picture below, all to be exposed as Linked Open Data using custom schemas plus SIOC and FOAF across a number of agencies / funding programs / websites.)
Posted at 15:05
Our forthcoming book entitled “The Social Semantic Web”, to be published by Springer in Autumn 2009, is now available to pre-order from both Springer and Amazon.
An accompanying website for the book will be at socialsemanticweb.net.
Posted at 15:05
Previously we saw how ACLs can be used in Virtuoso to protect different types of resources. Today we will look into conditional groups which allow to share resources or grant permissions to a dynamic group of individuals. This means that we do not maintain a list of group members but instead define a set of conditions which an individual needs to fulfill in order to be part of the group in question.
That does sound very dry. Let’s just jump to an example:
@prefix oplacl: <http://www.openlinksw.com/ontology/acl#> . [] a oplacl:ConditionalGroup ; foaf:name "People I know" ; oplacl:hasCondition [ a oplacl:QueryCondition ; oplacl:hasQuery """ask where { graph <urn:my> { <urn:me> foaf:knows ^{uri}^ } }""" ] .
This group is based on a single condition which uses a simple
SPARQL ASK query. The ask query contains a variable
^{uri}^
which the ACL engine will replace with the URI
of the authenticated user. The group contains anyone who is in a
foaf:knows
relationship to urn:me
in
named graph urn:my
. (Ideally the latter graph should
be
write-protected using ACLs as described before.)
Now we use this group in ACL rules. That means we first create it:
$ curl -X POST \ --data-binary @group.ttl \ -H"Content-Type: text/turtle" \ -u dba:dba \ http://localhost:8890/acl/groups
As a result we get a description of the newly created group
which also contains its URI. Let’s imagine this URI is
http://localhost:8890/acl/groups/1
.
To mix things up we will use the group for sharing permission to access a service instead of files or named graphs. Like many of the Virtuoso-hosted services the URI Shortener is ACL controlled. We can restrict access to it using ACLs.
As always the URI Shortener has its own ACL scope which we need to enable for the ACL system to kick in:
sparql prefix oplacl: <http://www.openlinksw.com/ontology/acl#> with <urn:virtuoso:val:config> delete { oplacl:DefaultRealm oplacl:hasDisabledAclScope <urn:virtuoso:val:scopes:curi> . } insert { oplacl:DefaultRealm oplacl:hasEnabledAclScope <urn:virtuoso:val:scopes:curi> . };
Now we can go ahead and create our new ACL rule which allows anyone in our conditional group to shorten URLs:
[] a acl:Authorization ; oplacl:hasAccessMode oplacl:Write ; acl:accessTo <http://localhost:8890/c> ; acl:agent <http://localhost:8890/acl/groups/1> ; oplacl:hasScope <urn:virtuoso:val:scopes:curi> ; oplacl:hasRealm oplacl:DefaultRealm .
Finally we add one URI to the conditional group as follows:
sparql insert into <urn:my> { <urn:me> foaf:knows <http://www.facebook.com/sebastian.trug> . };
As a result my facebook account has access to the URL
Shortener:
The example we saw here uses a simple query to determine the members of the conditional group. These queries could get much more complex and multiple query conditions could be combined. In addition Virtuoso handles a set of non-query conditions (see also oplacl:GenericCondition). The most basic one being the following which matches any authenticated person:
[] a oplacl:ConditionalGroup ; foaf:name "Valid Identifiers" ; oplacl:hasCondition [ a oplacl:GroupCondition, oplacl:GenericCondition ; oplacl:hasCriteria oplacl:NetID ; oplacl:hasComparator oplacl:IsNotNull ; oplacl:hasValue 1 ] .
This shall be enough on conditional groups for today. There will be more playing around with ACLs in the future…
Posted at 21:10
Disclaimer: Many of the features presented here are rather new and can not be found in the open-source version of Virtuoso.
Last time we saw how to share files and folders stored in the Virtuoso DAV system. Today we will protect and share data stored in Virtuoso’s Triple Store – we will share RDF data.
Virtuoso is actually a quadruple-store which means each triple lives in a named graph. In Virtuoso named graphs can be public or private (in reality it is a bit more complex than that but this view on things is sufficient for our purposes), public graphs being readable and writable by anyone who has permission to read or write in general, private graphs only being readable and writable by administrators and those to which named graph permissions have been granted. The latter case is what interests us today.
We will start by inserting some triples into a named graph as dba – the master of the Virtuoso universe:
This graph is now public and can be queried by anyone. Since we want to make it private we quickly need to change into a SQL session since this part is typically performed by an application rather than manually:
$ isql-v localhost:1112 dba dba Connected to OpenLink Virtuoso Driver: 07.10.3211 OpenLink Virtuoso ODBC Driver OpenLink Interactive SQL (Virtuoso), version 0.9849b. Type HELP; for help and EXIT; to exit. SQL> DB.DBA.RDF_GRAPH_GROUP_INS ('http://www.openlinksw.com/schemas/virtrdf#PrivateGraphs', 'urn:trueg:demo'); Done. -- 2 msec.
Now our new named graph urn:trueg:demo
is private
and its contents cannot be seen by anyone. We can easily test
this by logging out and trying to query the graph:
But now we want to share the contents of this named graph with someone. Like before we will use my LinkedIn account. This time, however, we will not use a UI but Virtuoso’s RESTful ACL API to create the necessary rules for sharing the named graph. The API uses Turtle as its main input format. Thus, we will describe the ACL rule used to share the contents of the named graph as follows.
@prefix acl: <http://www.w3.org/ns/auth/acl#> . @prefix oplacl: <http://www.openlinksw.com/ontology/acl#> . <#rule> a acl:Authorization ; rdfs:label "Share Demo Graph with trueg's LinkedIn account" ; acl:agent <http://www.linkedin.com/in/trueg> ; acl:accessTo <urn:trueg:demo> ; oplacl:hasAccessMode oplacl:Read ; oplacl:hasScope oplacl:PrivateGraphs .
Virtuoso makes use of the ACL ontology proposed by the
W3C and extends on it with several custom classes and
properties in the OpenLink ACL
Ontology. Most of this little Turtle snippet should be obvious:
we create an Authorization resource which grants Read access to
urn:trueg:demo
for agent http://www.linkedin.com/in/trueg. The only
tricky part is the scope. Virtuoso has the concept of ACL scopes
which group rules by their resource type. In this case the scope is
private graphs, another typical scope would be DAV resources.
Given that file rule.ttl contains the above resource we can post the rule via the RESTful ACL API:
$ curl -X POST --data-binary @rule.ttl -H"Content-Type: text/turtle" -u dba:dba http://localhost:8890/acl/rules
As a result we get the full rule resource including additional properties added by the API.
Finally we will login using my LinkedIn identity and are granted read access to the graph:
We see all the original triples in the private graph. And as before with DAV resources no local account is necessary to get access to named graphs. Of course we can also grant write access, use groups, etc.. But those are topics for another day.
Using ACLs with named graphs as described in this article requires some basic configuration. The ACL system is disabled by default. In order to enable it for the default application realm (another topic for another day) the following SPARQL statement needs to be executed as administrator:
sparql prefix oplacl: <http://www.openlinksw.com/ontology/acl#> with <urn:virtuoso:val:config> delete { oplacl:DefaultRealm oplacl:hasDisabledAclScope oplacl:Query , oplacl:PrivateGraphs . } insert { oplacl:DefaultRealm oplacl:hasEnabledAclScope oplacl:Query , oplacl:PrivateGraphs . };
This will enable ACLs for named graphs and SPARQL in general. Finally the LinkedIn account from the example requires generic SPARQL read permissions. The simplest approach is to just allow anyone to SPARQL read:
@prefix acl: <http://www.w3.org/ns/auth/acl#> . @prefix oplacl: <http://www.openlinksw.com/ontology/acl#> . <#rule> a acl:Authorization ; rdfs:label "Allow Anyone to SPARQL Read" ; acl:agentClass foaf:Agent ; acl:accessTo <urn:virtuoso:access:sparql> ; oplacl:hasAccessMode oplacl:Read ; oplacl:hasScope oplacl:Query .
I will explain these technical concepts in more detail in another article.
Posted at 21:10
Dropbox, Google Drive, OneDrive, Box.com – they all allow you to share files with others. But they all do it via the strange concept of public links. Anyone who has this link has access to the file. On first glance this might be easy enough but what if you want to revoke read access for just one of those people? What if you want to share a set of files with a whole group?
I will not answer these questions per se. I will show an alternative based on OpenLink Virtuoso.
Virtuoso has its own WebDAV file storage system built in. Thus, any instance of Virtuoso can store files and serve these files via the WebDAV API (and an LDP API for those interested) and an HTML UI. See below for a basic example:
This is just your typical file browser listing – nothing fancy. The fancy part lives under the hood in what we call VAL – the Virtuoso Authentication and Authorization Layer.
We can edit the permissions of one file or folder and share it with anyone we like. And this is where it gets interesting: instead of sharing with an email address or a user account on the Virtuoso instance we can share with people using their identifiers from any of the supported services. This includes Facebook, Twitter, LinkedIn, WordPress, Yahoo, Mozilla Persona, and the list goes on.
For this small demo I will share a file with my LinkedIn identity http://www.linkedin.com/in/trueg. (Virtuoso/VAL identifier people via URIs, thus, it has schemes for all supported services. For a complete list see the Service ID Examples in the ODS API documentation.)
Now when I logout and try to access the file in question I am presented with the authentication dialog from VAL:
This dialog allows me to authenticate using any of the supported authentication methods. In this case I will choose to authenticate via LinkedIn which will result in an OAuth handshake followed by the granted read access to the file:
It is that simple. Of course these identifiers can also be used in groups, allowing to share files and folders with a set of people instead of just one individual.
Next up: Sharing Named Graphs via VAL.
Posted at 21:10
Digitally signing Emails is always a good idea. People can verify that you actually sent the mail and they can encrypt emails in return. A while ago Kingsley showed how to sign emails in Thunderbird.I will now follow up with a short post on how to do the same in Evolution.
The process begins with actually getting an X.509 certificate including an embedded WebID. There are a few services out there that can help with this, most notably OpenLink’s own YouID and ODS. The former allows you to create a new certificate based on existing social service accounts. The latter requires you to create an ODS account and then create a new certificate via Profile edit -> Security -> Certificate Generator. In any case make sure to use the same email address for the certificate that you will be using for email sending.
The certificate will actually be created by the web browser, making sure that the private key is safe.
If you are a Google Chrome user you can skip the next step since Evolution shares its key storage with Chrome (and several other applications). If you are a user of Firefox you need to perform one extra step: go to the Firefox preferences, into the advanced section, click the “Certificates” button, choose the previously created certificate, and export it to a .p12 file.
Back in Evolution’s settings you can now import this file:
To actually sign
emails with your shiny new certificate stay in the Evolution
settings, choose to edit the Mail Account in question,
select the certificate in the Secure MIME (S/MIME) section
and check “Digitally sign outgoing messages (by
default)“:
The nice thing about
Evolution here is that in contrast to Thunderbird there is no need
to manually import the root certificate which was used to sign your
certificate (in our case the one from OpenLink). Evolution will
simply ask you to trust that certificate the first time you try to
send a signed email:
Posted at 21:10
Posted at 21:10
After almost two
years spent working at Asemantics, I left it to join the Fondazione Bruno Kessler (FBK),
a quite large research institute based in
Trento.
These last two years have been amazing: I met very skilled and enthusiastic people working with them on a broad set of different technologies. Every day spent there has been an opportunity for me to learn something new from them, and at the very end they are now very good friends more than colleagues. Now Asemantics is part of the bigger Pro-netics Group.
Moved from Rome, I decided to follow Giovanni Tummarello and Michele Mostarda to launch from scratch a new research unit at FBK called “Web of Data”. FBK is a well-established organization with several units acting on a plethora of different research fields. Every day there is the opportunity to join workshops and other kind of events.
Just to give you an idea of how the things work here, in the April 2009 David Orban gave a talk here on “The Open Internet of Things” attended by a large number of researchers and students. Aside FBK, in Trento there is a quite active community hanging out around the Semantic Web.
“The Semantic Valley”, that’s how they call this euphoric movement around these technologies.
Back to me, the new “Web of Data” unit has joined the Sindice.com army and the last minute release of Any23 0.2 is only the first outcome of this joint effort on the Semantic Web Index between DERI and FBK.
In particularly, the Any23 0.2 release has been my first task here. It’s library, a service, an RDF distiller. It’s used on board the Sindice ingestion pipeline, it’s publicly available here and yesterday I spent a couple of minutes to write this simple bookmarklet:
javascript:window.open(‘http://any23.org/best/’%20+%20window.location);
Once on your browser, it returns a bunch of distilled RDF triples using the Any23 servlet if pressed on a Web page.
So, what’s next?
The Web of Data unit has just started. More things, from the next release of Sindice.com to other projects currently in inception, will see the light. I really hope to keep on contributing on the concrete consolidation of the Semantic Web, the Web of Data or Web3.0 or whatever we’d like to call it.
Posted at 21:10
This is a (short) technical post.
Everyday, I face the problem of getting some Linked Data URIs that uniquely identify a “thing” starting from an ambiguous, poor and flat keyword or description. One of the first step dealing with the development of application that consumes Linked Data is to provide a mechanism that allows to link our own data sets to one (or more) LoD bubble. To gain a clear idea on why identifiers matters I suggest you to read this note from Dan Brickley: starting from some needs we encountered within the NoTube project he clearly underlined the importance of LoD identifiers. Even if the problem of uniquely identifying words and terms falls in the biggest category usually known as term disambiguation, I’d like to clarify here, that what I’m going to explain is a narrow restriction of the whole problem.
What I really need is a simple mechanism that allows me to convert one specific type of identifiers to a set of Linked Data URIs.
For example, I need something that given a book ISBN number it returns me a set of URIs that are referring to that book. Or, given the title of a movie I expect back some URIs (from DBpedia or LinkedMDB or whatever) identifying and describing it in a unique way.
Isn’t SPARQL enough for you to do that?
Yes, obviously the following SPARQL query may be sufficient:
but what I need is something quicker that I may invoke as an HTTP GET like:
http://localhost:8080/resolver?value=978-0-374-16527-7&category=isbn
returning back to me a simple JSON:
{ "mappings": [
"http://dbpedia.org/resource/Gomorrah_%28book%29"],
"status": "ok"
}
But the real issue here is the code overhead necessary if you want to add other kind of identifiers resolution. Let’s imagine, for instance, that I already implemented this kind of service and I want to add another resolution category. What I should do is to hard code another SPARQL query, modify the code allowing to invoke it as a service and redeploy it.
I’m sure we could do better.
If we give a closer look at the above SPARQL query, we easily figure out that the problem could be highly generalized. In fact, often resolving such kind of resolution means perform a SPARQL query asking for URIs that have a certain value for a certain property. As dbprop:isbn for the ISBN case.
And this is what I did the last two days: The NoTube Identity Resolver.
A simple Web service (described in the figure below) fully customizable by simply editing an XML configuration file.
The resolvers.xml file allows you to provide a simple description of the resolution policy that will be accessible with a simple HTTP GET call.
Back to the ISBN example, the following piece of XML is enough to describe the resolver:
<resolver id=”2″
type=”normal”>
<category>isbn</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<lookup>dbpedia-owl:isbn</lookup>
<sameas>true</sameas>
<matching>LITERAL</matching>
</resolver>
Where:
Moreover, the NoTube Identity Resolver gives you also the possibility to specify more complex resolution policies through a SPARQL query as shown below:
<resolver id="3"
type="custom">
<category>movie</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<sparql><![CDATA[SELECT DISTINCT ?subject
WHERE { ?subject a <http://dbpedia.org/ontology/Film>.
?subject <http://dbpedia.org/property/title> ?title.
FILTER (regex(?title, "#VALUE#")) }]]>
</sparql>
<sameas>true</sameas>
</resolver>
In other words, every resolver described in the resolvers.xml file allows you to enable one kind of resolution mechanism without writing a line af Java code.
Do you want to try?
Just download the war package, get this resolvers.xml (or write your own), export the RESOLVERS_XML_LOCATION environment variable pointing to the folder where the resolvers.xml is located, deploy the war on your Apache Tomcat application server, start the application and try it out heading your browser to:
http://localhost:8080/notube-identity-resolver/resolver?value=978-0-374-16527-7&category=isbn
That’s all folks
Posted at 21:10