Planet RDF

It's triples all the way down

April 03

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion at a customer prompted me to take a closer look at support for encryption in the context of XaaS cloud service offerings, as well as in Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather rare.

There are different reasons why one might want to encrypt data, ranging from preserving a competitive advantage to end-user privacy. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to implement it in the application logic.

At the IaaS level, especially for file storage used in app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you count S3), and for Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft SkyDrive do not seem to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering: there are a few activities around adding encryption to HDFS and the like, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available in AWS’s EMR by using S3.
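
To make the S3 option mentioned above concrete, here is a minimal sketch of requesting server-side encryption when writing an object to S3. It uses the current boto3 SDK (not referenced in the original post) and placeholder bucket and key names:

```python
import boto3

# Minimal sketch: store an object in S3 with server-side encryption at rest (SSE-S3).
# Bucket and key names are placeholders; credentials come from the usual AWS config.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-bucket",
    Key="datasets/records.csv",
    Body=b"id,value\n1,42\n",
    ServerSideEncryption="AES256",  # S3-managed keys; "aws:kms" would delegate to KMS
)

# Confirm which encryption scheme S3 reports for the stored object.
head = s3.head_object(Bucket="example-bucket", Key="datasets/records.csv")
print(head.get("ServerSideEncryption"))
```

The point is that the encryption happens transparently on the storage side; the application only has to ask for it, which is exactly the kind of built-in support the post is looking for.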

Posted at 05:07

April 02

Jeen Broekstra: Jürgen Klopp says he cried when NHS staff sang You’ll Never Walk Alone

Klopp, speaking to the club’s website, added: “There are numerous people out there who have much bigger problems, so it would feel really embarrassing to me if I were to speak about my ‘problems’ – I have the same issues everyone in the world has at the moment. That’s the lesson we learn in this moment.

“Four or five weeks ago it’s like a lot of countries thought: ‘That’s our problem, that’s our problem, that’s our problem, we have a problem with them’ and stuff like this. Now nature shows us we are all the same and we all have the same problems in the same moment, and we have to work together on the solution. There is nothing good in that situation apart from maybe what we can learn from it.”

Liverpool hosted the last high-profile match in England before football was suspended when Atlético Madrid, and approximately 3,000 of their supporters, visited Anfield in the Champions League on 11 March. Klopp recalled: “We played the Bournemouth game on Saturday, we won it, then on Sunday City lost, so the information for us was ‘two wins to go’. But then on Monday morning I woke up and heard about the situation in Madrid, that they would close the schools and universities from Wednesday, so it was really strange to prepare for that game to be honest.

“I usually don’t struggle with things around me, I can build barriers right and left when I prepare for a game, but in that moment it was really difficult. Wednesday we had the game, I loved the game, I loved what I saw from the boys, it was a really special performance apart from the result – we didn’t score enough, we conceded too many, that’s all clear, but between these two main pieces of information it was a great game!

“Thursday [we were] off, then on Friday when we arrived it was already clear this was not a normal session. Yes, we trained, but it was more of a meeting. We had a lot of things to talk about, a lot of things to think about, things I had never thought about before in my life.

“Nobody knew exactly – and nobody knows exactly – how it will continue, so the only way we could do it was to organise it as well as possible for the boys and make sure everything is sorted as much as we can sort it in our little space, in the little area where we are responsible, really. That’s what we did in a very short time, then we sent the boys home, went home ourselves and here we are still.”

The post Jürgen Klopp says he cried when NHS staff sang You’ll Never Walk Alone appeared first on Rivull Development.

Posted at 04:05

Jeen Broekstra: ‘It’s horrible’: Halesowen halted with promotion and Wembley in sight

Simeon Cobourne scores Halesowen’s winner in the FA Trophy at Barnet.
“It was a touch surreal at certain stages,” said the captain Paul McCone. “As people from a lower division you always say: ‘This is your final.’ Every game should have been our final, but we were looking at getting to the next stage, the next stage and then the next stage.”

The FA stated that it “remains hopeful” of completing the final rounds of the FA Trophy but Halesowen are the only team from either step three or four to reach this stage, which means that they stand alone on an island with no remaining league games and no one else to play.

“If the games do get played, we’ve got to try to find some opposition to get a couple of friendlies in before that,” said Smith. “Well, all step three to six sides have been stood down now. So, I don’t know how we would manage to do that.”

Halesowen’s success also reflects the transient nature of non-league football. Smith arrived at a revolving door of a football club that he says used nearly 100 players last season. He rebuilt the side, tying the squad players to contracts in order to foster a tight-knit team and enable fans to familiarise themselves with players. Smith convinced most of his players to step down from higher divisions, assuring them that they would go straight back up. In the end, they performed even better than they imagined but the final result was out of their hands. Now Smith must convince them to stay in step four for another season.

“I’m hoping that everyone wants to stay at the club because of that factor,” says McCone. “In theory, we’ve got a loose end. Unforeseen circumstances have caused that loose end, but it’s definitely something to get ticked off the list. The plan was to go there, get promoted and get Halesowen up the league. No matter what has gone on, that plan hasn’t changed.”

Smith seems to be in the right place for another attempt. The club was near bankruptcy when it was taken over by Keith McKenna and Karen Brooks at the end of 2018 and they have turned it around. The finances are healthy enough to survive these unprecedented stoppages and it seems that the solidarity is, too.

“There’s no point looking at what ifs and how the season could have progressed,” said Smith. “It’s so, so important that we take great pride in how far we’ve come … It’s been a magical, magical journey. As much as it could be wiped off the record books, nobody can take those memories away from these fans.”

The post ‘It’s horrible’: Halesowen halted with promotion and Wembley in sight appeared first on Rivull Development.

Posted at 04:05

March 17

schema.org: Schema for Coronavirus special announcements, Covid-19 Testing Facilities and more

The COVID-19 pandemic is causing a large number of “Special Announcements” pertaining to changes in schedules and other aspects of everyday life. This includes not just closure of facilities and rescheduling of events but also new availability of medical facilities such as testing centers.

We have today published Schema.org 7.0, which includes fast-tracked new vocabulary to assist the global response to the Coronavirus outbreak.

It includes a "SpecialAnnouncement" type that provides for simple date-stamped textual updates, as well as markup to associate the announcement with a situation (such as the Coronavirus pandemic), and to indicate URLs for various kinds of update such as school closures, public transport closures, quarantine guidelines, travel bans, and information about getting tested.

Many new testing facilities are being rapidly established worldwide, to test for COVID-19. Schema.org now has a CovidTestingFacility type to represent these, regardless of whether they are part of long-established medical facilities or temporary adaptations to the emergency.
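
As an illustrative sketch (my own example rather than one taken from the release notes, with placeholder names and URLs), a SpecialAnnouncement pointing at a CovidTestingFacility could be emitted as JSON-LD along these lines:

```python
import json

# Hedged sketch of Schema.org 7.0 markup: a SpecialAnnouncement that links to a
# CovidTestingFacility. Property names follow the SpecialAnnouncement proposal;
# the facility, dates and URLs are placeholders.
announcement = {
    "@context": "https://schema.org",
    "@type": "SpecialAnnouncement",
    "name": "Drive-through COVID-19 testing now available",
    "datePosted": "2020-03-17",
    "category": "https://www.wikidata.org/wiki/Q81068910",  # the COVID-19 pandemic
    "gettingTestedInfo": "https://example.org/covid19/testing",
    "schoolClosuresInfo": "https://example.org/covid19/schools",
    "announcementLocation": {
        "@type": "CovidTestingFacility",
        "name": "Example Community Testing Centre",
        "url": "https://example.org/testing-centre",
    },
}

print(json.dumps(announcement, indent=2))
```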

We are also making improvements to other areas of Schema.org to help with the worldwide migration to working online and working from home, for example by helping event organizers indicate when an event has moved from having a physical location to being conducted online, and
whether the event's "eventAttendanceMode" is online, offline or mixed.
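
Again as a hedged sketch (placeholder names and URLs, terms as proposed in this release), an event that has moved online might be described like this:

```python
import json

# Hedged sketch: an Event using the new attendance-mode and event-status terms.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example Data Standards Meetup",
    "eventAttendanceMode": "https://schema.org/OnlineEventAttendanceMode",
    "eventStatus": "https://schema.org/EventMovedOnline",
    "location": {
        "@type": "VirtualLocation",
        "url": "https://example.org/meetup/live-stream",
    },
    "startDate": "2020-04-01T18:00:00+00:00",
}

print(json.dumps(event, indent=2))
```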

We will continue to improve this vocabulary in the light of feedback (github; doc), and welcome suggestions for improvements and additions particularly from organizations who are publishing such updates. 

Dan Brickley, R.V.Guha, Google.
Tom Marsh, Microsoft.

Posted at 03:16

March 15

Leigh Dodds: Quick tips for chairing remote meetings

There’s a growing set of useful resources and guidance to help people run better remote meetings. I’ve been compiling a list of a few of them. At the risk of repeating other, better advice, I’m going to write down some brief tips for running remote meetings.

For a year or so I was chairing fortnightly meetings of the OpenActive standards group. Those meetings were an opportunity to share updates with a community of collaborators, get feedback on working documents and have debates and discussion around a range of topics. So I had to get better at doing it. Not sure whether I did, but here’s a few things I learned.

I’ll skip over general good meeting etiquette (e.g. around circulating an agenda and working documents in advance), to focus on the remote bits.

  1. Give people time to arrive. Just because everyone is attending remotely doesn’t mean that everyone will be able to arrive promptly. They may be working through technical difficulties, for example. Build in a bit of deliberate slack time at the start of the meeting. I usually gave it around 5-10 minutes. As people arrive, greet them and let them know this is happening. You can then either chat as a group or people can switch to emails, etc while waiting for things to start.
  2. Call the meeting to order. Make it clear when the meeting is formally starting and you’ve switched from general chat and waiting for late arrivals. This will help ensure you have people’s attention.
  3. Use the tools you have as a chair. Monitor the side chat. Monitor the video feeds to see if people look like they have something to say. And, most importantly, mute people that aren’t speaking but are typing or have lots of background noise. You can usually avoid the polite dance around asking people to do that, or suffering in silence, by using the option to mute people. Just tell them you’ve done that. I usually had Zoom meetings set up so that people were muted on entry.
  4. Do a roll call. Ask everyone to introduce themselves at the start. Don’t just ask everyone to do that at once, as they’ll talk over each other. Go through people individually and ask them to say hello or do an introduction. This helps with putting voices to names (if not everyone is on video), ensures that everyone knows how to mute/unmute and puts some structure to the meeting.
  5. Be aware of when people are connecting in different ways. Some software, like Zoom, allows people to join in several ways. Be aware of when you have people on the phone and on video, especially if you’re presenting material. Try to circulate links either before or during the meeting so they can see them.
  6. Use slides to help structure the meeting. I found that doing a screenshare of a set of slides for the agenda and key talking points helps to give people a sense of where you’re at in the meeting. So, for example, if you have four items on your agenda, have a slide for each one, with key questions or decision points. It can help to focus discussion, keeps people’s attention on the meeting (rather than a separate doc) and gives people a sense of where you are. The latter is especially helpful if people are joining late.
  7. Don’t be afraid of a quick recap. If people join late, give them a quick recap of where you’re at and ask them to introduce themselves. I often did this if people joined a few minutes late, but not if they dropped in 30 minutes into a one-hour meeting.
  8. Don’t be afraid of silence or directly asking people questions. Chairing remote meetings can be stressful and awkward for everyone. It can be particularly awkward to ask questions and then sit in silence. Often this is because people are worried about talking over each other. Or they just need time to think. Don’t be afraid of a bit of silence. Doing a roll call to ask everyone individually for feedback can be helpful if you want to make decisions. Check in on people who have not said anything for a while. It’s slow, but it provides some order for everyone.
  9. Keep to time. I tried very hard not to let meetings over-run even if we didn’t cover everything. People have other events in their calendars. Video and phone calls can be tiring. It’s better to wrap up at a suitable point and follow up on things you didn’t get to cover than to have half the meeting drop out at the end.
  10. Follow-up afterwards. Make sure to follow up afterwards. Especially if not everyone was able to attend. For OpenActive we video the calls and share those as well as a summary of discussion points.

Those are all things I consciously tried to get better at, and I think they helped things go more smoothly.

Posted at 15:05

Leigh Dodds: What is collaborative maintenance of data? A short talk at the Royal Society

Following the publication of their report on data governance in the 21st century, the Royal Society are running a number of workshops to explore data governance in different sectors. In October 2019 they ran one exploring data governance in the auto insurance sector.

Last week they held a workshop looking at data governance in the civil society sector. The ODI were invited to help out, and I chaired a session looking at collaborative maintenance of data. I believe the Royal Society will be publishing a longer write-up of the workshop over the coming weeks.

This blog post is a written version of a short ten minute talk I gave during the workshop. The slides are public.

Let’s start with a definition. What is collaborative maintenance?

You might already be familiar with terms like “crowd-sourcing” or “citizen science”. Both of those are examples of collaborative maintenance. But it can take other forms too. At the ODI we use collaborative maintenance of data to refer to any scenario where organisations and communities are sharing the work of collecting and maintaining data.

It might be helpful to position collaborative maintenance alongside other approaches that are part of “open culture”. These include open standards, open source, and open data. Let’s look at each of them in turn.

Open standards for data are reusable, shared agreements that shape how we collect, share, govern and use data. There are different types of open standards. Some are technical, and describe file formats and methods of exchanging data. Others are higher-level and capture codes of practices and protocols for collecting data. Open standards are best developed collaboratively, so that everyone impacted by or benefiting from the standard can help shape it.

Open source involves collaborating to create reusable, openly licensed code and applications. Some open source projects are run by individuals or small communities. Others are backed by larger commercial organisations. This collaborative work is different to that of open standards. For example, it involves identifying and agreeing features, writing and testing code and producing documentation to allow others to use it.

Open data is about publishing data under an open licence, so it can be accessed, used and shared by anyone for any purpose. Different communities engage in publication of open data for different purposes.

For example, the open government movement originally focused on open data as a means to increase transparency of governments. More recently there is a shift towards using open data to help address a variety of social, economic and environmental challenges. In contrast, as part of the open science movement, there is a different role for open data. Recent attention has been on the use of open data to address the reproducibility crisis around research. Or to help respond to emerging health issues, like Coronavirus.

With a few exceptions, the main approach to open data has been a single organisation (or researcher) publishing data that they have already collected. There may be some collaboration around use of that data, but not in its collection or maintenance.

This makes open data quite distinct from open source or open standards.

We can think of collaborative maintenance as taking the approach used in open source and applying it to data. Collaborative maintenance involves collaboration across the full lifecycle of a dataset.

Some examples might be helpful.

OpenStreetMap is a collaboratively produced spatial database of the entire world. While it was originally produced by individuals and communities, it is now contributed to by large organisations like Facebook, Microsoft and Apple. The Humanitarian OpenStreetMap community focuses on the collection and use of data to support humanitarian activities. The community are involved in deciding what data to collect, prioritising maintenance of data following disasters, and mapping activities either on the ground or remotely. The community works across the lifecycle and is self-directing.

Common Voice is a Mozilla project. It aims to build an open dataset to support voice recognition applications. By asking others to contribute to the dataset, they hope to make it more comprehensive and inclusive. Mozilla have defined what data will be collected and the tasks to be carried out, but anyone can contribute to the dataset by adding their voice or transcribing a recording. It’s this open participation that could help ensure that the dataset represents a more diverse set of people.

Edubase is maintained by the Department for Education (DfE). It’s our national database of schools. It’s used in a variety of different applications. Like Mozilla, DfE are acting as the steward of the data and have defined what information should be collected. But the work of populating and maintaining the shared directory is carried out by people in the individual schools. This is the best way to keep that data up to date. Those who know when the data has changed have the ability to update it. The contributors all benefit from the shared resource.

Building a shared directory is a common use of collaborative maintenance. But there are others.

Looking across these projects and other examples that we’ve studied in our desk and user research, we can see that there are different ways we can collaborate around data.

For example, we can work together to decide what data to collect. We can share the work of collecting and maintaining data, ensuring its quality and governing access to it. We can use open source to help to build the tools to support those communities.

We’ve developed the collaborative maintenance guidebook to help support the design of new services and platforms. It includes some background and a worked example. The bulk of the guidebook is a set of “design patterns” that describe solutions to common problems. For example how to manage quality when many different people are contributing to the same dataset.

We think collaborative maintenance can be useful in more projects. For civil society organisations collaborative maintenance might help you engage with communities that you’re supporting to collect and maintain useful data. It might also be a tool to support collaboration across the sector as a means of building common resources.

The guidebook is at an early stage and we’d love to get feedback on its contents. Or help you apply it to a real-world project. Let us know what you think!

 

Posted at 14:05

March 02

Leigh Dodds: How can publishing more data increase the value of existing data?

There’s lots to love about the “Value of Data” report. Like the fantastic infographic on page 9. I’ll wait while you go and check it out.

Great, isn’t it?

My favourite part about the paper is that it’s taught me a few terms that economists use, but which I hadn’t heard before. Like “Incomplete contracts” which is the uncertainty about how people will behave because of ambiguity in norms, regulations, licensing or other rules. Finally, a name to put to my repeated gripes about licensing!

But it’s the term “option value” that I’ve been mulling over for the last few days. Option value is a measure of our willingness to pay for something even though we’re not currently using it. Data has a large option value, because it’s hard to predict how its value might change in the future.

Organisations continue to keep data because of its potential future uses. I’ve written before about data as stored potential.

The report notes that the value of a dataset can change because we might be able to apply new technologies to it. Or think of new questions to ask of it. Or, and this is the interesting part, because we acquire new data that might impact its value.

So, how does increasing access to one dataset affect the value of other datasets?

Moving data along the data spectrum means that increasingly more people will have access to it. That means it can be used by more people, potentially in very different ways than you might expect. Applying Joy’s Law then we might expect some interesting, innovative or just unanticipated uses. (See also: everyone loves a laser.)

But more people using the same data is just extracting additional value from that single dataset. It’s not directly impacting the value of other datasets.

To do that we need to use the new data in some specific ways. So far I’ve come up with seven ways that new data can change the value of existing data.

  1. Comparison. If we have two or more datasets then we can compare them. That will allow us to identify differences, look for similarities, or find correlations. New data can help us discover insights that aren’t otherwise apparent.
  2. Enrichment. New data can enrich an existing dataset by adding new information. It gives us context that we didn’t have access to before, unlocking further uses.
  3. Validation. New data can help us identify and correct errors in existing data.
  4. Linking. A new dataset might help us to merge some existing datasets, allowing us to analyse them in new ways. The new dataset acts like a missing piece in a jigsaw puzzle.
  5. Scaffolding. A new dataset can help us to organise other data. It might also help us collect new data.
  6. Improve Coverage. Adding more data, of the same type, into an existing pool can help us create a larger, aggregated dataset. We end up with a more complete dataset, which opens up more uses. The combined dataset might have better spatial or temporal coverage, be less biased or capture more of the world we want to analyse.
  7. Increase Confidence. If the new data measures something we’ve already recorded, then the repeated measurements can help us to be more confident about the quality of our existing data and analyses. For example, we might pool sensor readings about the weather from multiple weather stations in the same area. Or perform a meta-analysis of a scientific study.

I don’t think this is exhaustive, but it was a useful thought experiment.

A while ago, I outlined ten dataset archetypes. It’s interesting to see how these align with the above uses:

  • A meta-analysis to increase confidence will draw on multiple studies
  • Combining sensor feeds can also help us increase confidence in our observations of the world
  • A register can help us with linking or scaffolding datasets. They can also be used to support validation.
  • Pooling together multiple descriptions or personal records can help us create a database that has improved coverage for a specific application
  • A social graph is often used as scaffolding for other datasets

What would you add to my list of ways in which new data improves the value of existing data? What did I miss?

Posted at 21:05

February 21

Leigh Dodds: Three types of agreement that shape your use of data

Whenever you’re accessing, using or sharing data you will be bound by a variety of laws and agreements. I’ve written previously about how data governance is a nested set of rules, processes, legislation and norms.

In this post I wanted to clarify the differences between three types of agreements that will govern your use of data. There are others. But from a data consumer point of view these are most common.

If you’re involved in any kind of data project, then you should have read all of the relevant agreements that relate to the data you’re planning to use. So you should know what to look for.

Data Sharing Agreements

Data sharing agreements are usually contracts that will have been signed between the organisations sharing data. They describe how, when, where and for how long data will be shared.

They will include things like the purpose and legal basis for sharing data. They will describe the important security, privacy and other considerations that govern how data will be shared, managed and used. Data sharing agreements might be time-limited. Or they might describe an ongoing arrangement.

When the public and private sector are sharing data, then publishing a register of agreements is one way to increase transparency around how data is being shared.

The ICO Data Sharing Code of Practice has more detail on the kinds of information a data sharing agreement should contain. As does the UK’s Digital Economy Act 2017 code of practice for data sharing. In a recent project the ODI and CABI created a checklist for data sharing agreements.

Data sharing agreements are most useful when organisations, of any kind, are sharing sensitive data. A contract with detailed, binding rules helps everyone be clear on their obligations.

Licences

Licences are a different approach to defining the rules that apply to use of data. A licence describes the ways that data can be used without any of the organisations involved having to enter into a formal agreement.

A licence will describe how you can use some data. It may also place some restrictions on your use (e.g. “non-commercial”) and may spell out some obligations (“please say where you got the data”). So long as you use the data in the described ways, then you don’t need any kind of explicit permission from the publisher. You don’t even have to tell them you’re using it. Although it’s usually a good idea to do that.

Licences remove the need to negotiate and sign agreements. Permission is granted in advance, with a few caveats.

Standard licences make it easier to use data from multiple sources, because everyone is expecting you to follow the same rules. But only if the licences are widely adopted. Where licences don’t align, we end up with unnecessary friction.

Licences aren’t time-limited. They’re perpetual. At least as long as you follow your obligations.

Licences are best used for open and public data. Sometimes people use data sharing agreements when a licence might be a better option. That’s often because organisations know how to do contracts, but are less confident in giving permissions. Especially if they’re concerned about risks.

Sometimes, even if there’s an open licence to use data, a business would still prefer to have an agreement in place. That might be because the licence doesn’t give them the freedoms they want, or they’d like some additional assurances in place around their use of data.

Terms and Conditions

Terms and conditions, or “terms of use”, are a set of rules that describe how you can use a service. Terms and conditions are the things we all ignore when signing up to a website. But if you’re using a data portal, platform or API then you need to have definitely checked the small print. (You have, haven’t you?)

Like a Data Sharing Agreement, a set of terms and conditions is something that you formally agree to. It might be by checking a box rather than signing a document, but it’s still an agreement.

Terms of use will describe the service being offered and the ways in which you can use it. Like licences and data sharing agreements, they will also include some restrictions. For example whether you can build a commercial service with it. Or what you can do with the results.

A good set of terms and conditions will clearly and separately identify those rules that relate to your use of the service (e.g. how often you can use it) from those rules that relate to the data provided to you. Ideally the terms would just refer to a separate licence. The Met Office Data Point terms do this.

A poorly defined set of terms will focus on the service parts but not include enough detail about your rights to use and reuse data. That can happen if the emphasis has been on the terms of use of the service as a product, rather than around the sharing of data.

The terms and conditions for a data service and the rules that relate to the data are two of the important decisions that shape the data ecosystem that service will enable. It’s important to get them right.

Hopefully that’s a helpful primer. Remember, if you’re in any kind of role using data then you need to read the small print. If not, then you’re potentially exposing yourself and others to risks.

Posted at 20:05

February 14

Libby Miller: Links for my Pervasive Media Studio talk

I’m giving a talk this afternoon at the Pervasive Media Studio in Bristol about some low-resolution people experiments I’ve been making. Here are some related links:

Exploring the Affect of Abstract Motion in Social Human-Robot Interaction, John Harris and Ehud Sharlin

Libbybot 11 code and instructions

Zamia and scripts for running it on the Raspberry Pi 3

Real_libby code

GPT-2

Matt Jones – Butlers or Centaurs

Tim Cowlishaw‘s colab notebook for GPT-2 retraining

Janelle Shane‘s site and book

This is the voice synthesis code that my colleagues used. They are hoping to open up their version of it soon.

There’s a voice assistant survey that uses the synthesised voices and there’s a blog post about that.

Pimoroni sell Respeaker 4-mics.

I’m libbymiller on twitter.

 

Posted at 16:06

February 09

Leigh Dodds: GUIDE, a retrospective

“Tyntesfield servants’ bells” by Caroline. CC-BY-NC-ND licence. https://www.flickr.com/photos/carolineld/4608720906/

This article was first published in the February 2030 edition of Sustain magazine. Ten years since the public launch of GUIDE we sit down with its designers to chat about its origin and what’s made it successful.

It’s a Saturday morning and I’m sitting in the bustling cafe at Tyntesfield house, a National Trust property south of Bristol. I’m enjoying a large pot of tea and a slice of cake with Joe Shilling and Gordon Leith, designers of one of the world’s most popular social applications: GUIDE. I’d expected to meet somewhere in the city, but Shilling suggested this as a suitable venue. It turns out Tyntesfield plays a part in the origin story of GUIDE. So it’s fitting that we are here for the tenth anniversary of its public launch.

SHILLING: “Originally we were just playing. Exploring the design parameters of social applications.”

He stirs the pot of tea while Leith begins sectioning the sponge cake they’ve ordered.

SHILLING: “People did that more in the early days of the web. But Twitter, Facebook, Instagram…they just kind of sucked up all the attention and users. It killed off all that creativity. For a while it seemed like they just owned the space…But then TikTok happened…”

He pauses while I nod to indicate I’ve heard of it.

SHILLING: “…and small experiments like Yap. It was a slow burn, but I think a bunch of us started to get interested again in designing different kinds of social apps. We were part of this indie scene building and releasing bespoke social networks. They came and went really quickly. People just enjoyed them whilst they were around.”

Leith interjects around a mouthful of cake:

LEITH: “Some really random stuff. Social nets with built-in profile decay so they were guaranteed to end. Made them low commitment, disposable. Messaging services where you could only post at really specific, sometimes random times. Networks that only came online when their members were in precise geographic coordinates. Spatial partitioning to force separation of networks for home, work and play. Experimental, ritualised interactions.”

SHILLING: “The migratory networks grew out of that movement too. They didn’t last long, but they were intense. ”

LEITH: “Yeah. Social networks that just kicked into life around a critical mass of people. Like in a club. Want to stay a member…share the memes? Then you needed to be in its radius. In the right city, at the right time. And then keep up as the algorithm shifted it. Social spaces herding their members.”

SHILLING: “They were intense and incredibly problematic. Which is why they didn’t last long. But for a while there was a crowd that loved them. Until the club promoters got involved and then that commercial aspect killed it.”

RENT-SEEKING

GUIDE had a very different starting point. Flat sharing in Bristol, the duo needed money. Their indie credibility was high, but what they were looking for was a more mainstream hit with some likelihood of revenue. The break-up of Facebook and the other big services had created an opportunity which many were hoping to capitalise on. But investment was a problem.

LEITH: “We wrote a lot of grant proposals. Goal was to use the money to build out a decent code base. Pay for some servers that we could use to launch something bigger”.

Shilling pours the tea, while Leith passes me a slice of cake.

SHILLING: “It was a bit more principled than that. There was plenty of money for apps to help with social isolation. We thought maybe we could build something useful, tackle some social problems, work with a different demographic than we had before. But, yeah, we had our own goals too. We had to take what opportunities were out there.”

LEITH: “My mum had been attending this Memory Skills group. Passing around old photos and memorabilia to get people talking and reminiscing. We thought we could create something digital.”

SHILLING: “We managed to land a grant to explore the idea. We figured that there was a demographic that had spent time connecting not around the high street or the local football club. But with stuff they’d all been doing online. Streaming the same shows. Revisiting old game worlds. We thought those could be really useful touch points and memory triggers too. And not everyone can access some of the other services.”

LEITH: “Mum could talk for hours about Skyrim and Fallout”.

SHILLING: “So we prototyped some social spaces based around that kind of content. It was during the user testing that we had the real eye-opener”.

“Memory Box” by judy_and_ed. CC-BY-NC. https://www.flickr.com/photos/65924740@N00/18516079841/

ITERATIONS

The first iterations of the app that ultimately became GUIDE were pretty rough. Shilling and Leith have been pretty open about their early failures.

LEITH: “The first iteration was basically a Twitch knock-off. People could join the group remotely, chat to each other and watch whatever the facilitator decided to stream.”

SHILLING: “Engagement was low. We didn’t have cash to license a decent range of content. The facilitators needed too much training on the streaming interface and real-time community management.”

LEITH: “I then tried getting a generic game engine to boot up old game worlds, so we could run tours. But the tech was a nightmare to get working. Basically needed different engines for different games”

SHILLING: “Some of the users loved it, mainly those that had the right hardware and were already into gaming. But it didn’t work for most people. And again…I…we were worried about licensing issues”

LEITH: “So we started testing a customised, open source version of Yap. Hosted chat rooms, time-limited rooms and content embedding…that ticked a lot of boxes. I built a custom index over the Internet Archive, so we could use their content as embeds”.

SHILLING: “There’s so much great stuff that people love in the Internet Archive. At the time, not many services were using it. Just a few social media accounts. So we made using it a core feature. It neatly avoided the licensing issues. We let the alpha testers run with the service for a while. We gave them and the memory service facilitators tips on hosting their own chats. And basically left them to it for a few weeks. It was during the later user testing that we discovered they were using it in different ways than we’d expected.”

Instead of having conversations with their peer groups, the most engaged users were using it to chat with their families. Grandparents showing their grandchildren stuff they’d watched, listened to, or read when they were younger.

SHILLING: “They were using it to tell stories”

Surrounded by the bustle in the cafe, we pause to enjoy the tea and cake. Then Shilling gestures around the room.

SHILLING: “We came here one weekend. To get out of the city. Take some time to think. They have these volunteers here. One in every room of the house. People just giving up their free time to answer any questions you might have as you wander around. Maybe, point out interesting things you might not have noticed? Or, if you’re interested, tell you about some of things they love about the place. It was fascinating. I realised that’s how our alpha testers were using the prototype…just sharing their passions with their family.”

LEITH: “So this is where GUIDE was born. We hashed out the core features for the next iteration in a walk through the grounds. Fantastic cake, too.”

“Walkman and mix tapes” by henry… CC-BY-NC-ND. https://www.flickr.com/photos/henrybloomfield/5136897807/

MEMORY PALACE

The familiar, core features of GUIDE have stayed roughly the same since that day.

Anyone can become a Guide and create a Room which they can use to curate and showcase small collections of public domain or openly licensed content. But no more than seven videos, photos, games or whatever else you can embed from the Internet Archive. Room contents can be refreshed once a week.

Visitors are limited to a maximum of five people. Everyone else gets to wait in a lobby, with new visitors being admitted every twenty minutes. Audio feeds only from the Guides, allowing them to chat to Visitors. But Visitors can only interact with Guides via a chat interface that requires building up messages — mostly questions — from a restricted set of words and phrases that can be tweaked by Guides for their specific Room. Each visitor limited to one question every five minutes.

LEITH: “The asymmetric interface, lobby system and cool-down timers were lifted straight from games. I looked up the average number of grandchildren people had. Turns out it’s about five, so we used that to size Rooms. The seven item limit was because I thought it was a lucky number. We leaned heavily on the Internet Archive’s bandwidth early on for the embeds, but we now mirror a lot of stuff. And donate, obviously.”

SHILLING: “The restricted chat interface has helped limit spamming and moderation. No video feeds from Guides means that the focus stays on the contents of the Room, not the host. Twitch had some problematic stuff which we wanted to avoid. I think it’s more inclusive.”

LEITH: “Audio only meant the ASMR crowd were still happy though”.

Today there are tens of thousands of Rooms. Shilling shows me a Room where the Guide gives tours of historical maps of Bath, mixing in old photos for context. Another, “Eleanor’s Knitting Room” curates knitting patterns. The Guide alternating between knitting tips and cultural critiques.

Leith has a bookmarked collection of retro-gaming Rooms. Doom WAD teardowns and classic speed-runs analysis for the most part.

In my own collection, my favourite is a Room showing a rota of Japanese manhole cover designs, the Guide an expert on Japanese art and infrastructure. I often have this one on a second screen whilst writing. The lobby wait time is regularly over an hour. Shilling asks me to share that one with him.

LEITH: “There are no discovery tools in Guide. That was deliberate from the start. Strictly no search engine. Want to find a Room? You’ll need to be invited by a Guide or grab a link from a friend”.

SHILLING: “Our approach has been to allow the service to grow within the bounds of existing communities. We originally marketed the site to family groups, and an older demographic. The UK and US were late adopters, the service was much more popular elsewhere for a long time. Things really took off when the fandoms grabbed hold of it.”

An ecosystem of recommendation systems, reviews and community Room databases has grown up around the service. I asked whether that defeated the purpose of not building those features into the core app.

LEITH: “It’s about power. If we ran those features then it would be our algorithms. Our choice. We didn’t want that.”

SHILLING: “We wanted the community to decide how to best use GUIDE as social glue. There’s so many more creative ways in which people interact with and use the platform now”.

The two decline to get into discussion of the commercial success of GUIDE. It’s well-documented that the two have become moderately wealthy from the service. More than enough to cover that rent in the city centre. Shilling only touches on it briefly:

SHILLING: “No ads and a subscription-based service has kept us honest. The goal was to pay the bills while running a service we love. We’ve shared a lot of that revenue back with the community in various ways”.

Photo by Jacques Bopp on Unsplash. https://unsplash.com/photos/pvtA7r3jBTc

SLOW WEB

GUIDE can be situated within the Slow Web movement. There are a host of services offering quieter online experiences. Videos of walks through foreign cities. Live feeds from orbiting satellites and VR outposts mounted on marine buoys and in wild locations around the world. Social features as bolt-on features. But GUIDE’s focus on the curation of small spaces, story telling and shared discovery sets it apart.

Of course, all of this was possible before. YouTube and Twitch supported broadcasts and streaming for years, and many people used them in similar ways. But the purposeful design of a more dedicated interface highlights how constraints can shape a community and spark creativity. Removal of many of the asymmetries inherent in the design of those older platforms has undoubtedly helped.

While we finished the last of the tea, I asked them what they thought made the service successful.

SHILLING: “You can find, watch and listen to any of the material that people are sharing in GUIDE on the open web. Just Google it. But I don’t think people just want more content. They want context. And its people that bring that context to life. You can find Rooms now where there’s a relay of Guides running 24×7. Each Guide highlighting different aspects of the exact same collection. Costume design, narrative arcs and character bios. Historical and cultural significance. Personal stories. There’s endless context to discover around the same content. That’s what fandoms have understood for years.”

LEITH: “People just like stories. We gave them a place to tell them. And an opportunity to listen.”

Posted at 16:05

February 04

Leigh Dodds: Can the regulation of hazardous substances help us think about regulation of AI?

This post is a thought experiment. It considers how existing laws that cover the registration and testing of hazardous substances like pesticides might be used as an analogy for thinking through approaches to regulation of AI/ML.

As a thought experiment it’s not a detailed or well-researched proposal, but there are elements which I think are interesting. I’m interested in feedback and also pointers to more detailed explorations of similar ideas.

A cursory look at substance registration legislation in the EU and US

Under EU REACH legislation, if you want to manufacture or import large amounts of potentially hazardous chemical substances then you need to register with the ECHA. The registration process involves providing information about the substance and its potential risks.

“No data no market” is a key principle of the legislation. The private sector carries the burden of collecting data and demonstrating safety of substances. There is a standard set of information that must be provided.

In order to demonstrate safety, companies may need to carry out animal testing. The legislation has been designed to minimise unnecessary animal testing. While there is an argument that all testing is unnecessary, current practice requires testing in some circumstances. Where testing is not required, then other data sources can be used. But controlled animal tests are the proof of last resort if no other data is available.

To further minimise the need to carry out tests on animals, the legislation is designed to encourage companies registering the same (or similar) substances to share data with one another in a “fair, transparent and non-discriminatory way”. There is detailed guidance around data sharing, including a legal framework and guidance on cost sharing.

The coordination around sharing data and costs is achieved via a SIEF (PDF), a loose consortium of businesses looking to register the same substance. There is guidance to help facilitate the creation of these sharing forums.

The US has a similar set of laws which also aim to encourage sharing of data across companies to minimise animal testing and other regulatory burdens. The practice of “data compensation” provides businesses with a right to charge fees for use of data. The legislation doesn’t define acceptable fees, but does specify an arbitration procedure.

The compensation, along with some exclusive use arrangements, is intended to avoid discouraging original research, testing and registration of new substances. Companies that bear the costs of developing new substances can have exclusive use for a period and expect some compensation for the research costs of bringing them to market. Later manufacturers can benefit from the safety testing results, but have to pay for the privilege of access.

Summarising some design principles

Based on my reading, I think both sets of legislation are ultimately designed to:

  • increase safety of the general public, by ensuring that substances are properly tested and documented
  • require companies to assess the risks of substances
  • take an ethical stance on reducing unnecessary animal testing and other data collection by facilitating data sharing
  • require companies to register their intention to manufacture or import substances
  • enable companies to coordinate in order to share costs and other burdens of registration
  • provide an arbitration route if data is not being shared
  • avoid discouraging new research and development by providing a cost sharing model to offset regulatory requirements

Parallels to AI regulation

What if we adopted a similar approach towards the regulation of AI/ML?

When we think about some of the issues with large scale, public deployment of AI/ML, I think the debate often highlights a variety of needs, including:

  • greater oversight about how systems are being designed and tested, to help understand risks and design problems
  • understanding how and where systems are being deployed, to help assess impacts
  • minimising harms to either the general public, or specific communities
  • thorough testing of new approaches to assess immediate and potential long-term impacts
  • reducing unnecessary data collection that is otherwise required to train and test models
  • exploration of potential impacts of new technologies to address social, economic and environmental problems
  • to continue to encourage primary research and innovation

That list is not exhaustive. I suspect not everyone will necessarily agree on the importance of all elements.

However, if we look at these concerns and the principles that underpin the legislation of hazardous substances, I think there are a lot of parallels.

Applying the approach to AI

What if, for certain well-defined applications of AI/ML such as facial recognition, autonomous vehicles, etc, we required companies to:

  • register their systems, accompanied by a standard set of technical, testing and other documentation
  • carry out tests of their system using agreed protocols, to encourage consistency in comparison across testing
  • share data, e.g. via a data trust or similar model, in order to minimise the unnecessary collection of data and to facilitate some assessment of bias in training data
  • demonstrate and document the safety of their systems to agreed standards, allowing public and private sector users of systems and models to make informed decisions about risks, or to support enforcement of legal standards
  • coordinate to share costs of collecting and maintaining data, conducting tests of standard models, etc
  • and, perhaps, after a period, accept that trained models would become available for others to reuse, similarly to how medicines or other substances may ultimately be manufactured by other companies

In addition to providing more controls and assurance around how AI/ML is being deployed, an approach based on facilitating collaboration around collection of data might help nudge new and emerging sectors into a more open direction, right from the start.

There are a number of potential risks and issues which I will acknowledge up front:

  • sharing of data about hazardous substance testing doesn’t have to address data protection. But this could be factored in to the design, and some uses of AI/ML draw on non-personal data
  • we may want to simply ban, or discourage use of some applications of AI/ML, rather than enable it. But at the moment there are few, if any controls
  • the approach might encourage collection and sharing of data which we might otherwise want to restrict. But strong governance and access controls, via a data trust or other institution might actually raise the bar around governance and security, beyond that which individual businesses can, or are willing to achieve. Coordination with a regulator might also help decide on how much is “enough” data
  • the utility of data and openly available models might degrade over time, requiring ongoing investment
  • the approach seems most applicable to uses of AI/ML with similar data requirements. In practice there may be only a small number of these, or data requirements may vary enough to limit the benefits of data sharing

Again, not an exhaustive list. But as I’ve noted, I think there are ways to mitigate some of these risks.

Let me know what you think, what I’ve missed, or what I should be reading. I’m not in a position to move this forward, but welcome a discussion. Leave your thoughts in the comments below, or ping me on twitter.

Posted at 21:05

February 02

Leigh Dodds: When can we expect more from data portability?

We’re at the end of week 5 of 2020, of the new decade and I’m on a diet.

I’m back to using MyFitnessPal again. I’ve used it on and off for the last 10 years whenever I’ve decided that now is the time to be more healthy. The sporadic, but detailed history of data collection around my weight and eating habits mark out each of the times when this time was going to be the time when I really made a change.

My success has been mixed. But the latest diet is going pretty well, thanks for asking.

This morning the app chose the following feature to highlight as part of its irregular nudges for me to upgrade to premium.

Downloading data about your weight, nutrition and exercise history is a premium feature of the service. This gave me pause for thought for several reasons.

Under UK legislation, and for as long as we maintain data adequacy with the EU, I have a right to data portability. I can request access to any data about me, in a machine-readable format, from any service I happen to be using.

The company that produce MyFitnessPal, Under Armour, do offer me a way to exercise this right. It’s described in their privacy policy, as shown in the following images.

Note about how to exercise your GDPR rights in MyFitnessPal. Data portability in MyFitnessPal.

Rather than enabling this access via an existing product feature, they’ve decided to make me and everyone else request the data directly. Every time I want to use it.

This might be a deliberate decision. They’re following the legislation to the letter. Perhaps it’s a conscious decision to push people towards a premium service, rather than make it easy by default. Their user base is international, so they don’t have to offer this feature to everyone.

Or maybe it’s the legal and product teams not looking at data portability as an opportunity. That’s something that the ODI has previously explored.

I’m hoping to see more exploration of the potential benefits and uses of data portability in 2020.

I think we need to re-frame the discussion away from compliance and on to commercial and consumer benefits. For example, by highlighting how access to data contributes to building ecosystems around services, to help retain and grow a customer base. That is more likely to get traction than a continued focus on compliance and product switching.

MyFitnessPal already connects into an ecosystem of other services. A stronger message around portability might help grow that further.  After all, there are more reasons to monitor what you eat than just weight loss.

Clearer legislation and stronger guidance from organisations like ICO and industry regulators describing how data portability should be implemented would also help. Wider international adoption of data portability rights wouldn’t hurt either.

There’s also a role for community driven projects to build stronger norms and expectations around data portability. Projects like OpenSchufa demonstrate the positive benefits of coordinated action to build up an aggregated view of donated, personal data.

But I’d also settle for a return to the ethos of the early 2010s, when making data flow between services was the default. Small pieces, loosely joined.

If we want the big platforms to go on a diet, then they’re going to need to give up some of those bytes.

Posted at 15:05

January 31

Leigh Dodds: Do data scientists spend 80% of their time cleaning data? Turns out, no?

It’s hard to read an article about data science or really anything that involves creating something useful from data these days without tripping over this factoid, or some variant of it:

Data scientists spend 80% of their time cleaning data rather than creating insights.

Or

Data scientists only spend 20% of their time creating insights, the rest wrangling data.

It’s frequently used to highlight the need to address a number of issues around data quality, standards, access. Or as a way to sell portals, dashboards and other analytic tools.

The thing is, I think it’s a bullshit statistic.

Not because I think there aren’t improvements to be made around how we access and share data. Far from it. My issue is more about how that statistic is framed and because it’s just endlessly parroted without any real insight.

What did the surveys say?

I’ve tried to dig out the underlying survey or source of that factoid, to see if there’s more context. While the figure is widely referenced, it’s rarely accompanied by a link to a survey or results.

Amusingly this IBM data science product marketing page cites this 2018 HBR blog post which cites this 2017 IBM blog which cites this 2016 Crowdflower survey. Why don’t people link to original sources?

In terms of sources of data on how data scientists actually spend their time, I’ve found two ongoing surveys.

So what do these surveys actually say?

  • Crowdflower, 2015: “66.7% said cleaning and organizing data is one of their most time-consuming tasks”.
    • They didn’t report estimates of time spent
  • Crowdflower, 2016: “What data scientists spend the most time doing? Cleaning and organizing data: 60%, Collecting data sets: 19% …”.
    • Only 80% of time spent if you also lump in collecting data as well
  • Crowdflower, 2017: “What activity takes up most of your time? 51% Collecting, labeling, cleaning and organizing data”
    • Less than 80% and also now includes tasks like labelling of data
  • Figure Eight, 2018: Doesn’t cover this question.
  • Figure Eight, 2019: “Nearly three quarters of technical respondents (73.5%) spend 25% or more of their time managing, cleaning, and/or labeling data”
    • That’s pretty far from 80%!
  • Kaggle, 2017: Doesn’t cover this question
  • Kaggle, 2018: “During a typical data science project, what percent of your time is spent engaged in the following tasks? ~11% Gathering data, 15% Cleaning data…”
    • Again, much less than 80%

Only the Crowdflower survey reports anything close to 80%, but you need to lump in actually collecting data as well.
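
To make the gap concrete, here’s a small sketch in plain Python using the figures transcribed from the surveys quoted above; the decision to group “collecting” with “cleaning” is my own reading of the 80% claim, not something the surveys do themselves.

    # Figures transcribed from the surveys quoted above; grouping collecting
    # with cleaning is my interpretation of the "80%" claim, not the surveys'.
    surveys = {
        "Crowdflower 2016": {"cleaning and organizing": 60, "collecting": 19},
        "Crowdflower 2017": {"collecting, labeling, cleaning and organizing": 51},
        "Kaggle 2018": {"gathering": 11, "cleaning": 15},
    }

    CLAIM = 80  # the endlessly repeated figure

    for name, tasks in surveys.items():
        total = sum(tasks.values())
        verdict = "close to" if total >= CLAIM - 5 else "well below"
        print(f"{name}: {total}% of time on data preparation ({verdict} the {CLAIM}% claim)")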

Are there other sources? I’ve not spent too much time on it. But this 2015 bizreport article mentions another survey which suggests “between 50% and 90% of business intelligence (BI) workers’ time is spend prepping data to be analyzed“.

And an August 2014 New York Times article states that: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data“. But doesn’t link to the surveys, because newspapers hate links.

It’s worth noting that “Data Scientist” as a job only really became a thing around 2009, although the concept of data science is older. So there may not be much more to dig up. If you’ve seen some earlier surveys, then let me know.

Is it a useful statistic?

So looking at the figures, it looks to me like this is a bullshit statistic. Data scientists do a whole range of different types of task. If you arbitrarily label some of these as analysis and others not, then you can make the rest add up to 80%.

But that’s not the only reason why I think its a bullshit statistic.

Firstly there’s the implication that cleaning and working with data is somehow not worth the time of a data scientist. It’s “data janitor” work. And “It’s a waste of their skills to be polishing the materials they rely on”. Ugh.

Who, might I ask, is supposed to do this janitorial work?

I would argue that spending time working with data, to transform, explore and understand it better, is absolutely what data scientists should be doing. This is the medium they are working in.

Understand the material better and you’ll get better insights.

Secondly, I think data science use cases and workflows are a poor measure of how well data is published. Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardised a dataset is, it’s likely to need some work.

A sculptor has different needs than a bricklayer. They both use similar materials. And they both create things of lasting value and worth.

We could measure utility better with assessments other than the time spent on bespoke work.

Thirdly, it’s measuring the wrong thing. Actually, maybe some friction around the use of data is a good thing. Especially if it encourages you to spend more time understanding a dataset. Even more so if it in any way puts a brake on dumb uses of machine learning.

If we want the process of accessing, using and sharing data to be as frictionless as possible in a technical sense, then let’s make sure that is offset by adding friction elsewhere. For example, by adding checkpoints for reviews of ethical impacts. No matter how highly paid a data scientist is, the impacts of poor use of data and AI can be much, much larger.

Don’t tell me that data scientists are spending too much time working with data and not enough time getting insights into production. Tell me that data scientists are increasingly spending 50% of their time considering the ethical and social impacts of their work.

Let’s measure that.

Posted at 21:05

January 29

Leigh Dodds: Long live RSS! How I manage my reading

“LONG LIVE RSS!”

I shout these words from my bedroom window every morning. Reaffirming my love for this century’s most criminally neglected data standard.

If you’ve either forgotten, or never enjoyed, the ease of managing your information consumption via the magic of RSS and a feed reader, then you’re missing out mate.

Struggling with the noise, gloom and general bombast of social media? Get yourself a feed reader and fill it full of interesting subscriptions for a most measured and sedate way to consume words.

Once upon a time everyone(*) used them. We engaged in educated discourse, shared blog rolls, sent trackbacks and wrote comments on each others websites. Elegant weapons for a more civilized age (**).

I like to read things when I have time, to reduce distractions and give me a chance to absorb several viewpoints rather than simply the latest, hottest takes.

I’ve fine-tuned my approach to managing my reading and research. A few of the tools and services have changed, but the essentials stay the same. If you’re interested, here’s how I’ve made things work for me:

  • Feedbin
    • Manages all my subscriptions for blogs, newsletters and more into one easily accessible location
    • Lots of sites still support RSS; it’s not dead, merely resting
    • Feedbin is great at discovering feeds if you just paste in a site URL. One of the magic parts of RSS (a minimal polling sketch follows this list)
    • You can also subscribe to newsletters with a special Feedbin email address and they’ll get delivered to your reader. Brilliant. You’re not making me go back into my inbox, it’s scary in there.
  • Feedme. Feedbin allows me to read posts anywhere, but I use this Android app (there are others) as a client instead
    • Regularly syncs with Feedbin, so I can have all the latest unread posts on my phone for the commute or an idle few minutes
    • It provides a really quick interface to skim through posts and either immediately read them or add them to my “to read” list, in Pocket…
  • Pocket. Mobile and web app that I basically use as a way to manage a backlog of things “to read”.
    • Gives me a clutter free (no ads!) way to read content either in the browser (which I rarely do) or on my phone
    • It has its issues with some content, but you can easily switch to a full web view
    • Not everything I want to read comes in via my feed reader, so I take links from Slack, Twitter or elsewhere and use the Pocket browser extension or its share button integration to stash things away for later reading. Basically, if it’s not a 1-2 minute read it goes into Pocket until I’m ready for it. Keeps the number of browser tabs under control too.
    • The offline content syncing makes it great for using on my commute, especially on the tube
  • IFTTT. I use this service to do two things:
    • Once I archive something in Pocket, it automatically gets added to Pinboard for me, using the right tags.
    • If I favourite something, it tweets out the link without me having to go and actually look at twitter.
  • Pinboard. Basically a complete archive of articles I’ve read.
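
Here’s the minimal polling sketch mentioned above: it shows roughly what a reader like Feedbin does on your behalf, using the feedparser package; the subscription URLs are just illustrative examples.

    # Minimal feed polling sketch; assumes the feedparser package is installed
    # (pip install feedparser). The subscription URLs are illustrative only.
    import feedparser

    subscriptions = [
        "https://blog.ldodds.com/feed/",     # example feed URL
        "https://planetrdf.com/index.rdf",   # example feed URL
    ]

    for url in subscriptions:
        feed = feedparser.parse(url)
        print(feed.feed.get("title", url))
        for entry in feed.entries[:5]:
            # Each entry carries a title and a link: enough to build a reading list.
            print("  -", entry.get("title", "(untitled)"), entry.get("link", ""))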

The end result is a fully self-curated feed of interesting stuff. I’m no longer fighting someone else’s algorithm, so I can easily find things again.

I can minimise the number of organisations I’m following on twitter, and just subscribe to their blogs. It also helps to buck the trend towards more email newsletters, which are just blogs but you’re all in denial.

It also helps to reduce the number of distractions, and fight the pressure to keep checking on twitter in case I’ve missed something interesting. It’ll be in the feed reader when I’m ready.

Long live RSS!

It’s about time we stopped rebooting social networks and rediscovered more flexible ways to create, share and read content online. Go read

Say it with me. Go on.

LONG LIVE RSS!

(*) not actually everyone, but all the cool kids anyway. Alright, just us nerds, but we loved it.

(**) not actually more civilised, but it was more decentralised


Posted at 19:05

January 25

Egon Willighagen: MetaboEU2020 in Toulouse and the ELIXIR Metabolomics Community assemblies

This week I attended the European RFMF Metabomeeting 2020, aka #MetaboEU2020, held in Toulouse. Originally, I had hoped to travel there by train, but that turned out to be unfeasible. Co-located with this meeting were ELIXIR Metabolomics Community meetings. We're involved in two implementation studies, which together amount to less than a month of work. But both this community and the conference are great places to talk about WikiPathways, BridgeDb (our website is still disconnected from the internet), and cheminformatics.

Toulouse was generally great. It comes with its big-city issues, like fairly expensive hotels, but also a very frequent public transport system. It also had a great food market where we had our "gala dinner". Toulouse is also home to Airbus, so it was hard to miss the Beluga:


The MetaboEU2020 conference itself had some 400 participants, of course with a lot of wet lab metabolomics. As a chemist, with a good pile of training in analytical chemistry, it's great to see the progress. From a data analysis perspective, the community still has a long way to go. We're still talking about known knowns, unknown knowns, and unknown unknowns. The posters were often cryptic, e.g. stating they found 35 interesting metabolites, without actually listing them. The talks were also really interesting.

Now, if you read this, there is a good chance you were not at the meeting. You can check the above linked hashtag for coverage on Twitter, but we can do better. I loved Lanyrd, but their business model was not scalable and the service no longer exists. But Scholia (see doi:10.3897/rio.5.e35820) could fill the gap (it uses the Wikidata RDF and SPARQL queries). I followed Finn's steps and created a page for the meeting and started associating speakers with it (I've done this in the past for other meetings too):


Finn also created proceedings pages in the past, which I also followed. So, I asked people on Twitter to post their slide decks and posters on Figshare or Zenodo, and so far we ended up with 10 "proceedings" (thanks to everyone who did!!!):



As you can see, there is an RSS feed which you can follow (e.g. with Feedly) to get updates if more material appears online! I wish all conferences did this!

Posted at 10:05

January 24

Leigh Dodds: Licence Friction: A Tale of Two Datasets

For years now at the Open Data Institute we’ve been working to increase access to data, to create social and economic benefits across a range of sectors. While the details change across projects, one of the more consistent aspects of our work and guidance has been to support data stewards in making data as open as possible, whilst ensuring that it is clearly licensed.

Reference data, like addresses and other geospatial data, that underpins our national and global data infrastructure, needs to be available under an open licence. If it’s not, which is the ongoing situation in the UK, then other data cannot be made as open as possible.

Other considerations aside, data can only be as open as the reference data it relies upon. Ideally, reference data would be in the public domain, e.g. using a CC0 waiver. Attribution should be a consistent norm regardless of what licence is used.

Data becomes more useful when it is linked with other data. When it comes to data, adding context adds value. It can also add risks, but more value can be created from linking data. 

When data is published using bespoke or restrictive licences then it is harder to combine different datasets together, because there are often limitations in the licensing terms that restrict how data can be used and redistributed.

This means data needs to be licensed using common, consistent licences. Licences that work with a range of different types of data, collected and used by different communities across jurisdictions. 

Incompatible licences create friction that can make it impossible to create useful products and services. 

It’s well-reported that data scientists and other users spend huge amounts of time cleaning and tidying data because it’s messy and non-standardised. It’s probably less well-reported how many great ideas are simply shelved because of lack of access to data. Or are impossible because of issues with restrictive or incompatible data licences. Or are cancelled or simply needlessly expensive due to the need for legal consultations and drafting of data sharing agreements.

These are the hurdles you often need to overcome before you even get started with that messy data.

Here’s a real-world example of where the lack of open geospatial data in the UK, and ongoing incompatibilities between data licences, are getting in the way of useful work.

Introducing Active Places

Active Places is a dataset stewarded by Sport England. It provides a curated database of sporting facilities across England. It includes facilities provided by a range of organisations across the public, private and third-sectors. It’s designed to help support decision making about the provision of tens of thousands of sporting sites and facilities around the UK to drive investment and policy making. 

The dataset is rich and includes a wide range of information from disabled access through to the length of ski slopes or the number of turns on a cycling track.

While Sport England are the data steward, the curation of the dataset is partly subcontracted to a data management firm and partly carried out collaboratively with the owners of those sites and facilities.

The dataset is published under a standard open licence, the Creative Commons Attribution 4.0 licence. So anyone can access, use and share the data so long as they acknowledge its source. Contributors to the dataset agree to this licence as part of registering to contribute to the site.

The dataset includes geospatial data, including the addresses and locations of individual sites. This data includes IP from Ordnance Survey and Royal Mail, which means they have a say over what happens to it. In order to release the data under an open licence, Sport England had to request an exemption from the Ordnance Survey to their default position, which is that data containing OS IP cannot be sublicensed. When granted an exemption, an organisation may publish their data under an open licence. In short, OS waive their rights over the geographic locations in the data. 

The OS can’t, however, waive any rights that Royal Mail has over the address data. In order to grant Sport England an exemption, the OS also had to seek permission from Royal Mail. The Sport England team were able to confirm this for me.

Unfortunately it’s not clear, without having checked, that this is actually the case. It’s not evident in the documentation of either Active Places or the OS exemption process. Is clarifying all third-party rights a routine part of the exemption process or not?

It would be helpful to know. As the ODI has highlighted, lack of transparency around third-party rights in open data is a problem. For many datasets the situation remains unclear. And unclear positions are fantastic generators of legal and insurance fees.

So, to recap: Sport England has invested time in convincing Ordnance Survey to allow it to openly publish a rich dataset for the public good. A dataset in which geospatial data is clearly important, but is not its main feature. The reference data is dictating how open the dataset can be and, as a result, how much value can be created from it.

In case you’re wondering, lots of other organisations have had to do the same thing. The process is standardised to try and streamline it for everyone. A 2016 FOI request shows that between 2011 and 2015 the Ordnance Survey handled more than 1,000 of these requests.

Enter OpenStreetMap

At the end of 2019, members of the OpenStreetmap community contacted Sport England to request permission to use the Active Places dataset. 

If you’re not familiar with OpenStreetmap, then you should be. It’s an openly licensed map of the world maintained by a huge community of volunteers, humanitarian organisations, public and private sector businesses around the world.

The OpenStreetmap Foundation is the official steward of the dataset, with the day to day curation and operations happening through its volunteer network. As a small not-for-profit, it has to be very cautious about legal issues relating to the data. It can’t afford to be sued. The community is careful to ensure that data that is imported or added into the database comes from openly licensed sources.

In March 2017, after a consultation with the Creative Commons, the OpenStreetmap Licence/Legal Working Group concluded that data published under the Creative Commons Attribution licence is not compatible with the licence used by OpenStreetmap which is called the Open Database Licence. They felt that some specific terms in the licence (and particularly in its 4.0 version) meant that they needed additional permission in order to include that data in OpenStreetmap.

Since then the OpenStreetmap community has been contacting data stewards to ask them to sign an additional waiver that grants the OSM community explicit permission to use the data. This is exactly what open licensing of data is intended to avoid.

CC-BY is one of the most frequently used open data licences, so this isn’t a rare occurrence. 

As an indicator of the extra effort required, in a 2018 talk from the Bing Maps team in which they discuss how they have been supporting the OpenStreetmap community in Australia, they called out their legal team as one of the most important assets they had to provide to the local mapping community, helping them to get waivers signed. At the time of writing nearly 90 waivers have been circulated in Australia alone, not all of which have been signed.

So, to recap, due to a perceived incompatibility between two of the most frequently used open data licences, the OpenStreetmap community and its supporters are spending time negotiating access to data that is already published under an open licence.

I am not a lawyer. So these are like, just my opinions. But while I understand why the OSM Licence Working Group needs to be cautious, it feels like they are being overly cautious. Then again, I’m not the one responsible for stewarding an increasingly important part of a global data infrastructure. 

Another opinion is that perhaps the Microsoft legal team might be better deployed to solve the licence incompatibility issues. Instead they are now drafting their own new open data licences, which are compatible with CC-BY.

Active Places and OpenStreetmap

Still with me?

At the end of last year, members of the OpenStreetMap community contacted Sport England to ask them to sign a waiver so that they could use the Active Places data. Presumably to incorporate some of the data into the OSM database.

The Sport England data and legal teams then had to understand what they were being asked to do and why. And they asked for some independent advice, which is where I provided some support through our work with Sport England on the OpenActive programme. 

The discussion included:

  • questions about why an additional waiver was actually necessary
  • the differences in how CC-BY and ODbL are designed to require data to remain open and accessible – CC-BY includes limitation on use of technical restrictions, which is allowed by the open definition, whilst ODbL adopts a principle of encouraging “parallel distribution”. 
  • acceptable forms and methods of attribution
  • who, within an organisation like Sport England, might have responsibility to decide what acceptable attribution looked like
  • why the OSM community had come to its decisions
  • who actually had authority to sign-off on the proposed waiver
  • whether signing a waiver and granting a specific permission undermined Sport England’s goal to adopt standard open data practices and licences, and a consistent approach for every user
  • whether the OS exemption, which granted permission to SE to publish the dataset under an open licence, impacted any of the above

All reasonable questions from a team being asked to do something new. 

Like a number of organisations asked to sign waivers in Australia, SE have not yet signed one and may choose not to do so. Like all public sector organisations, SE are being cautious about taking risks.

The discussion has spilled out onto twitter. I’m writing this to provide some context and background to the discussion in that thread. I’m not criticising anyone as I think everyone is trying to come to a reasonable outcome. 

As the twitter thread highlights, the OSM community are not just concerned about the CC-BY licence but also about the potential that additional third-party rights are lurking in the data. Clarifying that may require SE to share more details about how the address and location data in the dataset is collected, validated and normalised for the OSM community to be happy. But, as noted earlier in the blog, I’ve at least been able to determine the status of any third-party rights in the data. So perhaps this will help to move things further.

The End

So, as a final recap, we have two organisations both aiming to publish and use data for the public good. But, because of complexities around derived data and licence compatibilities, data that might otherwise be used in new, innovative ways is instead going unused.

This is a situation that needs solving. It needs the UK government and Geospatial Commission to open up more geospatial data.

It needs the open data community to invest in resolving licence incompatibilities (and less in creating new licences) so that everyone benefits. 

We also need to understand when licences are the appropriate means of governing how data is used and when norms, e.g. around attribution, can usefully shape how data is accessed, used and shared.

Until then these issues are going to continue to undermine the creation of value from open (geospatial) data.

Posted at 20:05

January 22

schema.org: Schema.org 6.0

Schema.org version 6.0 has been released. As always, the release notes have full details and links (including to previous releases, e.g. 5.0 and 4.0).

We are now aiming to release updated schemas on an approximately monthly basis (with longer gaps around vacation periods). Typically, new terms are first added to our "Pending" area to give time for the definitions to benefit from implementation experience before they are added to the "core" of Schema.org. As always, many thanks to everyone who has contributed to this release of Schema.org.

--
Dan Brickley, for Schema.org.

Posted at 16:13

January 18

Leigh Dodds: [Paper Review] The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work

This blog post is a quick review and notes relating to a research paper called: The Coerciveness of the Primary Key: Infrastructure Problems in Human Services Work (PDF available here)

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

This paper explores the impact of data infrastructure, and in particular the use of identifiers and the design of databases, on the delivery of human (public) services. By reviewing the use of identifiers and data in service delivery to support homelessness and those affected by AIDS, the authors highlight a number of tensions, showing how the design of data infrastructure and the need to share data with funders and other agencies have an inevitable impact on frontline services.

For example, the need to evidence impact to funders requires the collection of additional personal, legal identifiers. Even when that information is not critical to the delivery of support.

The paper also explores the interplay between the well defined, unforgiving world of database design, and the messy nature of delivering services to individuals. Along the way the authors touch on aspects of identity, identification, and explore different types of identifiers and data collection practices.

The authors draw out a number of infrastructure problems and provide some design provocations for alternate approaches. The three main problems are the immutability of identifiers in database schema, the “hegemony of NOT NULL” (or the need for identification), and the demand for uniqueness across contexts.
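
To make the “hegemony of NOT NULL” problem concrete, here’s a small hypothetical sketch using Python’s built-in sqlite3 module; the table and column names are invented for illustration and aren’t taken from the paper.

    # Hypothetical schemas illustrating the point; table and column names invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # A schema that demands a legal identifier before a person can be recorded.
    conn.execute("""
        CREATE TABLE client_strict (
            national_id TEXT NOT NULL PRIMARY KEY,  -- required, treated as immutable
            name        TEXT NOT NULL
        )""")

    # An alternative that lets a record exist before (or without) identification.
    conn.execute("""
        CREATE TABLE client_flexible (
            record_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal surrogate key
            national_id TEXT,                               -- nullable: can come later
            name        TEXT
        )""")

    # The flexible schema can register someone who declines to share an identifier.
    conn.execute("INSERT INTO client_flexible (name) VALUES (?)", ("anonymous visitor",))

    # The strict schema cannot: this insert fails.
    try:
        conn.execute("INSERT INTO client_strict (national_id, name) VALUES (NULL, ?)",
                     ("anonymous visitor",))
    except sqlite3.IntegrityError as e:
        print("strict schema refused the record:", e)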

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. If, like me, you’re often advocating for the use of consistent, open identifiers, then this paper provides a useful perspective on how this approach might create issues or unwanted side effects outside of the simpler world of reference data
  2. If you’re designing digital public services then the design provocations around identifiers and approaches to identification are definitely worth reading. I think there’s some useful reflections about how we capture and manage personal information
  3. If you’re a public policy person advocating for consistent use of identifiers across agencies, then there are some important considerations around the policy, privacy and personal impacts of data collection in this paper

Three things I learned

Here’s three things that I learned from reading the paper.

  1. In a section on “The Data Work of Human Services Provision“, the authors highlighted three aspects of frontline data collection which I found it useful to think about:
    • data compliance work – collecting data purely to support the needs of funders, which might be at odds with the needs of both the people being supported and the service delivery staff
    • data coordination work – which stems from the need to link and aggregate data across agencies and funders to provide coordinated support
    • data confidence work – the need to build a trusted relationship with people, at the front-line, in order to capture valid, useful data
  2. Similarly, the authors tease out four reasons for capturing identifiers, each of which have different motivations, outcomes and approaches to identification:
    • counting clients – a basic need to monitor and evaluate service provision, identification here is only necessary to avoid duplicates when counting
    • developing longitudinal histories – e.g. identifying and tracking support given to a person over time can help service workers to develop understanding and improve support for individuals
    • as a means of accessing services – e.g. helping to identify eligibility for support
    • to coordinate service provision – e.g. sharing information about individuals with other agencies and services, which may also have different approaches to identification and use of identifiers
  3. The design provocations around database design were helpful to highlight some alternate approaches to capturing personal information and the needs of the service vs that of the individual

Thoughts and impressions

As someone who has not been directly involved in the design of digital systems to support human services, I found the perspectives and insight shared in this paper really useful. If you’ve been working in this space for some time, then it may be less insightful.

However I haven’t seen much discussion about good ways to design more humane digital services and, in particular, the databases behind them, so I suspect the paper could do with a wider airing. It’s useful reading alongside things like Falsehoods Programmers Believe About Names and Falsehoods Programmers Believe About Gender.

Why don’t we have a better approach to managing personal information in databases? Are there solutions out there already?

Finally, the paper makes some pointed comments about the role of funders in data ecosystems. Funders are routinely collecting and aggregating data as part of evaluation studies, but this data might also help support service delivery if it were more accessible. It’s interesting to consider the balance between minimising unnecessary collection of data simply to support evaluation versus the potential role of funders as intermediaries that can provide additional support to charities, agencies or other service delivery organisations that may lack the time, funding and capability to do more with that data.


Posted at 13:05

January 17

AKSW Group - University of Leipzig: SANSA 0.7.1 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find usage guidelines and examples at http://sansa-stack.net/user-guide.
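
As a very rough illustration of the kind of distributed triple processing that SANSA builds on, here is a plain PySpark sketch; note that this is not SANSA's own (Scala-based) API, the N-Triples parsing is deliberately naive, and the input path is a placeholder.

    # Plain PySpark illustration only; SANSA's real API is Scala/Spark based.
    # "data.nt" is a placeholder path to an N-Triples file.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("naive-ntriples-stats").getOrCreate()
    lines = spark.sparkContext.textFile("data.nt")

    def parse(line):
        # Naive split into subject, predicate, object (dropping the trailing " .").
        s, p, o = line.strip().split(" ", 2)
        return s, p, o.rstrip(" .")

    triples = (lines
               .filter(lambda l: l.strip() and not l.startswith("#"))
               .map(parse))

    # Count how often each predicate appears across the (distributed) dataset.
    for predicate, count in (triples
                             .map(lambda t: (t[1], 1))
                             .reduceByKey(lambda a, b: a + b)
                             .take(10)):
        print(predicate, count)

    spark.stop()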

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad, TRIX format
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify and Ontop and Tensors
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • TRIX support
  • A new query engine over compressed RDF data
  • OWL/XML Support

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team


Posted at 09:05

January 16

Egon Willighagen: Help! Digital Object Identifiers: Usability reduced if given at the bottom of the page

The (for J. Cheminform.) new SpringerNature article template has the Digital Object Identifier (DOI) at the bottom of the article page. So, every time I want to use the DOI I have to scroll all the way down the page. That could be fine for abstracts, but it is totally unusable for Open Access articles.

So, after our J. Cheminform. editors telcon this Monday, I started a Twitter poll:


Where do I want the DOI? At the top, with the other metadata:
Recent article in the Journal of Cheminformatics.
If you agree, please vote. With enough votes, we can engage with upper SpringerNature management to have journals choose where they want the DOI to be shown.

(Of course, the DOI as semantic data in the HTML is also important, but there is quite good annotation of that in the HTML <head>. A link out to RDF about the article is still missing, I think.)
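
For what it's worth, the DOI in the <head> is easy to get at programmatically. Here's a small sketch using requests and BeautifulSoup, assuming the page exposes one of the common DOI meta tag names (citation_doi, prism.doi, dc.identifier), which most publisher pages do; the URL is a placeholder.

    # Pull the DOI out of a publisher page's <head> metadata.
    # Assumes one of the common DOI meta tag names below is present.
    import requests
    from bs4 import BeautifulSoup

    def doi_from_page(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for name in ("citation_doi", "prism.doi", "dc.identifier", "DC.identifier"):
            tag = soup.find("meta", attrs={"name": name})
            if tag and tag.get("content"):
                return tag["content"]
        return None

    # Placeholder URL; substitute any article landing page.
    print(doi_from_page("https://example.org/article-landing-page"))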

Posted at 06:59

January 14

Leigh Dodds: [Paper review] Open data for electricity modeling: Legal aspects

This blog post is a quick review and notes relating to a research paper called: Open data for electricity modeling: Legal aspects.

It’s part of my new research notebook to help me collect and share notes on research papers and reports.

Brief summary

The paper reviews the legal status of publicly available energy data (and some related datasets) in Europe, with a focus on German law. The paper is intended to help identify some of the legal issues relevant to creation of analytical models to support use of energy data, e.g. for capacity planning.

As background, the paper describes the types of data relevant to building these types of model, the relevant aspects of database and copyright law in the EU and the properties of open licences. This background is used to assess some of the key data assets published in the EU and how they are licensed (or not) for reuse.

The paper concludes that the majority of uses of this data to support energy modelling in the EU, whether for research or other purposes, is likely to be infringing on the rights of the database holders, meaning that users are currently carrying legal risks. The paper notes that in many cases this is likely not the intended outcome.

The paper provides a range of recommendations to address this issue, including the adoption of open licences.

Three reasons to read

Here’s three reasons why you might want to read this paper:

  1. It provides a helpful primer on the range of datasets and data types that are used to develop applications in the energy sector in the EU. Useful if you want to know more about the domain
  2. The background information on database rights and related IP law is clearly written and a good introduction to the topic
  3. The paper provides a great case study of how licensing and legal protections applies to data use in a sector. The approach taken could be reused and extended to other areas

Three things I learned

Here’s three things that I learned from reading the paper.

  1. That a database might be covered by copyright (an “original” database) in addition to database rights. But the authors note this doesn’t apply in the case of a typical energy dataset
  2. That individual member states might have their own statutory exemptions to the Database Directive. E.g. in Germany it doesn’t apply to use of data in non-commercial teaching. So there is variation in how it applies.
  3. The discussion on how the Database Directive relates to statutory obligations to publish data was interesting, but highlights that the situation is unclear.

Thoughts and impressions

Great paper that clearly articulates the legal issues relating to publication and use of data in the energy sector in the EU. It’s easy to extrapolate from this work to other use cases in energy and by extension to other sectors.

The paper concludes with a good set of recommendations: the adoption of open licences, the need to clarify rights around data reuse and the role of data institutions in doing that, and how policy makers can push towards a more open ecosystem.

However there’s a suggestion that funders should just mandate open licences when funding academic research. While this is the general trend I see across research funding, in the context of this article it lacks a bit of nuance. The paper clearly indicates that the current status quo is that data users do not have the rights to apply open licences to the data they are publishing and generating. I think funders also need to engage with other policy makers to ensure that upstream provision of data is aligned with an open research agenda. Otherwise we risk perpetuating an unclear landscape of rights and permissions. The authors do note the need to address wider issues, but I think there’s a potential role of research funders in helping to drive change.

Finally, in their review of open licences, the authors recommend a move towards adoption of CC0 (public domain waivers and marks) and CC-BY 4.0. But they don’t address the fact that upstream licensing might limit the choice of how researchers can licence downstream data.

Specifically, the authors note the use of OpenStreetmap data to provide infrastructure data. However, depending on your use, you may need to adopt OpenStreetmap’s ODbL licence when republishing data. This can be at odds with a mandate to use other licences, or with restrictive licences used by other data stewards.


Posted at 19:06

December 10

Peter Mika: Common Tag semantic tagging format released today

The Common Tag format for semantic tagging has finally been released today, after almost a year of intense work on it by a group of Web companies active in the semantic technologies area, among them Yahoo. It’s been great fun working on this and I’m proud to have been involved: while there have been vocabularies before for representing tags in RDF, this effort is different in at least two respects.

First, a significant amount of time has been spent on making sure the specification meets the needs of all partners involved. The support of these companies for the specification will ensure that developers in the future can rely on a single format for annotation with semantic tags and for interchanging tag data. The website already lists a number of applications, but I’m pretty sure that a common tagging format will open entirely new possibilities in searching, navigating and aggregating web content.

Second, the format has been developed with publishers in mind, in particular in making it as easy as possible to embed semantic tags in HTML using RDFa, a syntax universally embraced by all those involved. The choice of RDF also means that, unlike in the case of the rel-tag microformat, Common Tags can be applied to any object, not just documents.
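
To give a flavour of the model in code rather than RDFa, here is a hedged sketch with rdflib; I am assuming the ctag namespace and the Tag/tagged/means/label terms as published in the Common Tag spec, and the subject and tag URIs are purely illustrative.

    # Sketch of Common Tag triples with rdflib; the namespace and terms are assumed
    # to match the published Common Tag spec, and the URIs are illustrative only.
    from rdflib import Graph, Namespace, URIRef, Literal, BNode
    from rdflib.namespace import RDF

    CTAG = Namespace("http://commontag.org/ns#")  # assumed Common Tag namespace

    g = Graph()
    g.bind("ctag", CTAG)

    page = URIRef("http://example.org/posts/semantic-tagging")  # the tagged resource
    tag = BNode()

    g.add((page, CTAG.tagged, tag))
    g.add((tag, RDF.type, CTAG.Tag))
    g.add((tag, CTAG.label, Literal("Semantic Web")))
    # ctag:means points at a resource that says what the tag actually refers to.
    g.add((tag, CTAG.means, URIRef("http://dbpedia.org/resource/Semantic_Web")))

    print(g.serialize(format="turtle"))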

So, it’s time for a new era in tagging!


Posted at 17:10

Sandro Hawke: Simplified RDF

I propose that we designate a certain subset of the RDF model as “Simplified RDF” and standardize a method of encoding full RDF in Simplified RDF. The subset I have in mind is exactly the subset used by Facebook’s Open Graph Protocol (OGP), and my proposed encoding technique is relatively straightforward.

I’ve been mulling over this approach for a few months, and I’m fairly confident it will work, but I don’t claim to have all the details perfect yet. Comments and discussion are quite welcome, on this posting or on the semantic-web@w3.org mailing list. This discussion, I’m afraid, is going to be heavily steeped in RDF tech; simplified RDF will be useful for people who don’t know all the details of RDF, but this discussion probably won’t be.

My motivation comes from several directions, including OGP. With OGP, Facebook has motivated a huge number of Web sites to add RDFa markup to their pages. But the RDF they’ve added is quite constrained, and is not practically interoperable with the rest of the Semantic Web, because it uses simplified RDF. One could argue that Facebook made a mistake here, that they should be requiring full “normal” RDF, but my feeling is their engineering decisions were correct, that this extreme degree of simplification is necessary to get any reasonable uptake.

I also think simplified RDF will play well with JSON developers. JRON is pretty simple, but simplified RDF would allow it to be simpler still. Or, rather, it would mean folks using JRON could limit themselves to an even smaller number of “easy steps” (about three, depending on how open design issues are resolved).

Cutting Out All The Confusing Stuff

Simplified RDF makes the following radical restrictions to the RDF model and to deployment practice:

  1. The subject URIs are always web page addresses. The content-negotiation hack for “hash” URIs and the 303-see-other hack for “slash” URIs are both avoided.

    (Open issue: are html fragment URIs okay? Not in OGP, but I think it will be okay and useful.)

  2. The values of the properties (the “object” components of the RDF triples) are always strings. No datatype information is provided in the data, and object references are done by just putting the object URI into the string, instead of making it a normal URI-label node.

    (Open issue: what about language tags? I think RDFa will provide this for free in OGP, if the html has a language tag.)

    (Open issue: what about multi-valued (repeated) properties? Are they just repeated, or are the multiple values packing into the string, perhaps? OGP has multiple administrators listed as “USER_ID1,USER_ID2”. JSON lists are another factor here.)

At first inspection this reduction appears to remove so much from RDF as to make it essentially useless. Our beloved RDF has been blown into a hundred pieces and scattered to the wind. It turns out, however, that it still has enough magic to reassemble itself (with a little help from its friends, http and rdfs).

This image may give a feeling for the relationship of full RDF and simplified RDF:

Reassembling Full RDF

The basic idea is that given some metadata (mostly: the schema), we can construct a new set of triples in full RDF which convey what the simplified RDF intended. The new set will be distinguished by using different predicates, and the predicates are related by schema information available by dereferencing the predicate URI. The specific relations used, and other schema information, allows us to unambiguously perform the conversion.

For example, og:title is intended to convey the same basic notion as rdfs:label. They are not the same property, though, because og:title is applied to a page about the thing which is being labeled, rather than the thing itself. So rather than saying they are related by owl:equivalentProperty, we say:

  og:title srdf:twin rdfs:label.

This ties them together, saying they are “parallel” or “convertible”, and allows us to use other information in the schema(s) for og:title and rdfs:label to enable conversion.

The conversion goes something like this (a rough code sketch follows these steps):

  1. The subject URLs should usually be taken as pages whose foaf:primaryTopic is the real subject. (Expressing the XFN microformat in RDF provides a gentle introduction to this kind of idea.) That real subject can be identified with a blank node or with a constructed URI using a “thing described by” service such as t-d-b.org. A little more work is needed on how to make such services efficient, but I think the concept is proven. I’d expect facebook to want to run such a service.

    In some cases, the subject URL really does identify the intended subject, such as when the triple is giving the license information for the web page itself. These cases can be distinguished in the schema by indicating the simplified RDF property is an IndirectProperty or MetadataProperty.

  2. The object (value) can be reconstructed by looking at the range of the full-RDF twin. For example, given that something has an og:latitude of “37.416343”, og:latitude and example:latitude are twins, and example:latitude has a range of xs:decimal, we can conclude the thing has an example:latitude of “37.416343”^^xs:decimal.

    Similarly, the Simplified RDF technique of putting URIs in strings for the object can be undone by knowing that the twin is an ObjectProperty, or has some non-Literal range.

    I believe language tagging could also be wrapped into the predicate (like comment_fr, comment_en, comment_jp, etc.) if that kind of thing turns out to be necessary, using OWL 2 range restrictions on the rdf:langRange facet.
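
As mentioned above, here is a rough rdflib sketch of the reassembly step; the srdf:twin property URI, the example schema triples and the data are all illustrative, and the “thing described by” style subject construction is replaced by a plain blank node for brevity.

    # Rough sketch of reassembling full RDF from simplified RDF via "twin" schema
    # information; namespaces and example data are illustrative only.
    from rdflib import Graph, Namespace, URIRef, Literal, BNode
    from rdflib.namespace import RDFS, FOAF, XSD

    OG   = Namespace("http://ogp.me/ns#")
    SRDF = Namespace("http://example.org/srdf#")    # placeholder for the proposed vocabulary
    EX   = Namespace("http://example.org/schema#")

    # Simplified RDF as harvested from a page: the subject is the page itself,
    # and the objects are plain strings.
    page = URIRef("http://example.org/products/widget.html")
    simplified = Graph()
    simplified.add((page, OG.title, Literal("Acme Widget")))
    simplified.add((page, OG.latitude, Literal("37.416343")))

    # Schema: twin relationships plus range information used to rebuild typed literals.
    schema = Graph()
    schema.add((OG.title, SRDF.twin, RDFS.label))
    schema.add((OG.latitude, SRDF.twin, EX.latitude))
    ranges = {EX.latitude: XSD.decimal}

    full = Graph()
    thing = BNode()                          # the real subject the page describes
    full.add((page, FOAF.primaryTopic, thing))

    for s, p, o in simplified:
        target = schema.value(p, SRDF.twin)
        if target is None:
            continue
        full.add((thing, target, Literal(str(o), datatype=ranges.get(target))))

    print(full.serialize(format="turtle"))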

Next Steps

So, that’s a rough sketch, and I need to wrap this up. If you’re at ISWC, I’ll be giving a 2 minute lightning talk about this at lunch later today. But if you’ve read this far, the talk won’t say anything you don’t already know.

FWIW, I believe this is implementable in RIF Core, which would mean data consumers which do RIF Core processing could get this functionality automatically. But since we don’t have any data consumer libraries which do that yet, it’s probably easiest to implement this with normal code for now.

I think this is a fairly urgent topic because of the adoption curve (and energy) on OGP, and because it might possibly inform the design of a standard JSON serialization for RDF, which I’m expecting W3C to work on very soon.

Posted at 17:09

Leigh Dodds: Lets talk about plugs

This is a summary of a short talk I gave internally at the ODI to help illustrate some of the important aspects of data standards for non-technical folk. I thought I’d write it up here too, in case it’s useful for anyone else. Let me know what you think.

We benefit from standards in every aspect of our daily lives. But because we take them for granted, we don’t tend to think about them very much. At the ODI we’re frequently talking about standards for data which, if you don’t have a technical background, might be even harder to wrap your head around.

A good example can help to illustrate the value of standards. People frequently refer to telephone lines, railway tracks, etc. But there’s an example that we all have plenty of personal experience with.

Lets talk about plugs!

You can confidently plug any of your devices into a wall socket and it will just work. No thought required.

Have you ever thought about what it would be like if plugs and wall sockets were all different sizes and shapes?

You couldn’t rely on being able to consistently plug your device into any random socket, so you’d have to carry around loads of different cables. Manufacturers might not design their plugs and sockets very well, so there might be greater risks of electrocution or fires. Or maybe the company that built your new house decided to only fit a specific type of wall socket because it had agreed a deal with an electrical manufacturer, so when you move in you need to buy a completely new set of devices.

We don’t live in that world thankfully. As a nation we’ve agreed that all of our plugs should be designed the same way.

That’s all a standard is. A documented, reusable agreement that everyone uses.

Notice that a single standard, “how to design a really great plug“, has multiple benefits. Safety is increased. We save time and money. Manufacturers can be confident that their equipment will work in any home or office.

That’s true of different standards too. Standards have economic, policy, technical and social impacts.

Open up a UK plug and it looks a bit like this.

Notice that there are colours for different types of wires (2, 3, 4). And that fuses (5) are expected to be the same size and shape. Those are all standards too. The wiring and voltages are standardised too.

So the wiring, wall sockets and plugs in your house are designed according to a whole family of different standards that are designed to work with one another.

We can design more complex systems from smaller standards. It helps us make new things faster, because we are reusing existing work.

That’s a lot of time and agreement that we all benefit from. Someone somewhere has invested the time and energy into thinking all of that through. Lucky us!

When we visit other countries, we learn that their plugs and sockets are different. Oh no!

That can be a bit frustrating, and means we have to spend a bit more money and remember to pack the right adapters. It’d be nice if the whole world agreed on how to design a plug. But that seems unlikely. It would cost a lot of time and money in replacing wiring and sockets.

But maybe those different designs are intentional? Perhaps there are different local expectations around safety, for example. Or in what devices people might be using in their homes. There might be reasons why different communities choose to design and adopt slightly different standards. Because they’re meeting slightly different needs. But sometimes those differences might be unnecessary. It can be hard to tell sometimes.

The people most impacted by these differences aren’t tourists, it’s the manufacturers that have to design equipment to work in different locations. Which is why your electrical devices normally have a separate cable. So, depending on whether you travel, or whether you’re a device manufacturer, you’ll have different perceptions of how much of a problem that is.

All of the above is true for data standards.

Standards for data are agreements that help us collect, access, share, use and publish data in consistent ways.  They have a range of different impacts.

There are lots of different types of standard and we combine them together to create different ways to successfully exchange data. Different communities often have their own standards for similar things, e.g. for describing metadata or accessing data via an API.

Sometimes those are simple differences that an adapter can easily fix. Sometimes those differences are because the standards are designed to meet different needs.

Unfortunately we don’t live in a world of standardised data plugs and wires and fuses. We live in that other world. The one where it’s hard to connect one thing to another thing. Where the stuff coming down the wires is completely unexpected. And we get repeated shocks from accidental releases of data.

I guarantee that in every piece of user research, every interview, government consultation or call for evidence, people will be consistently highlighting the need for more standards for data. People will often say this explicitly, “We need more standards!”. But sometimes they refer to the need in other ways: “We need to make data more discoverable!” (metadata standards) or “We need to make it easier to safely release data!” (standardised codes of practice).

Unfortunately that’s not always that helpful because when you probe a little deeper you find that people are talking about lots of different things. Some people want to standardise the wiring. Others just want to agree on a voltage. While others are still debating the definition of “fuse”. These are all useful and important things. You just need to dig a little deeper to find the most useful place to start.

It’s also not always clear whose job it is to actually create those standards. Because we take standards for granted, we’re not always clear about how they get created. Or how long it takes and what process to follow to ensure they’re well designed.

The reason we published the open standards for data guidebook was to help communities get started in designing the standards they need.

Standards development needs time and investment, as someone somewhere needs to do the work of creating them. That, as ever, is the really hard part.

Standards are part of the data infrastructure that help us unlock value from data. We need to invest in creating and maintaining them like we do other parts of our infrastructure.

Don’t just listen to me, listen to some of the people who’ve been creating standards for their communities.

Posted at 17:07

John Breslin: Book launch for "The Social Semantic Web"

We had the official book launch of “The Social Semantic Web” last month in the President’s Drawing Room at NUI Galway. The book was officially launched by Dr. James J. Browne, President of NUI Galway. The book was authored by myself, Dr. Alexandre Passant and Prof. Stefan Decker from the Digital Enterprise Research Institute at NUI Galway (sponsored by SFI). Here is a short blurb:

Web 2.0, a platform where people are connecting through their shared objects of interest, is encountering boundaries in the areas of information integration, portability, search, and demanding tasks like querying. The Semantic Web is an ideal platform for interlinking and performing operations on the diverse data available from Web 2.0, and has produced a variety of approaches to overcome limitations with Web 2.0. In this book, Breslin et al. describe some of the applications of Semantic Web technologies to Web 2.0. The book is intended for professionals, researchers, graduates, practitioners and developers.

Some photographs from the launch event are below.


Posted at 17:05

John Breslin: Another successful defense by Uldis Bojars in November

Uldis Bojars submitted his PhD thesis entitled “The SIOC MEthodology for Lightweight Ontology Development” to the University in September 2009. We had a nice night out to celebrate in one of our favourite haunts, Oscars Bistro.

Jodi, John, Alex, Julie, Liga, Sheila and Smita

This was followed by a successful defense at the end of November 2009. The examiners were Chris Bizer and Stefan Decker. Uldis even wore a suit for the event, see below.

I will rule the world!

Uldis established a formal ontology design process called the SIOC MEthodology, based on an evolution of existing methodologies that have been streamlined, experience developing the SIOC ontology, and observations regarding the development of lightweight ontologies on the Web. Ontology promotion and dissemination is established as a core part of the ontology development process. To demonstrate the usage of the SIOC MEthodology, Uldis described the SIOC project case study which brings together the Social Web and the Semantic Web by providing semantic interoperability between social websites. This framework allows data to be exported, aggregated and consumed from social websites using the SIOC ontology (in the SIOC application food chain). Uldis’ research work has been published in 4 journal articles, 8 conference papers, 13 workshop papers, and 1 book chapter. The SIOC framework has also been adopted in 33 third-party applications. The Semantic Radar tool he initiated for Firefox has been downloaded 24,000 times. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Uldis all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.


Posted at 17:05

John Breslin: Haklae Kim and his successful defense in September

This is a few months late but better late than never! We said goodbye to PhD researcher Haklae Kim in May of this year when he returned to Korea and took up a position with Samsung Electronics soon afterward. We had a nice going away lunch for Haklae with the rest of the team from the Social Software Unit (picture below).

Sheila, Uldis, John, Haklae, Julie, Alex and Smita

Haklae returned to Galway in September to defend his PhD entitled “Leveraging a Semantic Framework for Augmenting Social Tagging Practices in Heterogeneous Content Sharing Platforms”. The examiners were Stefan Decker, Tom Gruber and Philippe Laublet. Haklae successfully defended his thesis during the viva, and he will be awarded his PhD in 2010. We got a nice photo of the examiners during the viva which was conducted via Cisco Telepresence, with Stefan (in Galway) “resting” his hand on Tom’s shoulder (in San Jose)!

Philippe Laublet, Haklae Kim, Tom Gruber, Stefan Decker and John Breslin

Haklae created a formal model called SCOT (Social Semantic Cloud of Tags) that can semantically describe tagging activities. The SCOT ontology provides enhanced features for representing tagging and folksonomies. This model can be used for sharing and exchanging tagging data across different platforms. To demonstrate the usage of SCOT, Haklae developed the int.ere.st open tagging platform that combined techniques from both the Social Web and the Semantic Web. The SCOT model also provides benefits for constructing social networks. Haklae’s work allows the discovery of social relationships by analysing tagging practices in SCOT metadata. He performed these analyses using both Formal Concept Analysis and tag clustering algorithms. The SCOT model has also been adopted in six applications (OpenLink Virtuoso, SPARCool, RelaxSEO, RDFa on Rails, OpenRDF, SCAN), and the int.ere.st service has 1,200 registered members. Haklae’s research work was published in 2 journal articles, 15 conference papers, 3 workshop papers, and 2 book chapters. His scholarship was funded by Science Foundation Ireland under grant numbers SFI/02/CE1/I131 (Líon) and SFI/08/CE/I1380 (Líon 2).

We wish Haklae all the best in his future career, and hope he will continue to communicate and collaborate with researchers in DERI, NUI Galway in the future.


Posted at 17:05

John Breslin: Some of my (very) preliminary opinions on Google Wave

I was interviewed by Marie Boran from Silicon Republic recently for an interesting article she was writing entitled “Will Google Wave topple the e-mail status quo and change the way we work?”. I thought that my longer answers might be of interest, so I am pasting them below.

Disclaimer: My knowledge of Google Wave is second hand through various videos and demonstrations I’ve seen… Also, my answers were written pretty quickly!

As someone who is both behind Ireland’s biggest online community boards.ie and a researcher at DERI on the Semantic Web, are you excited about Google Wave?

Technically, I think it’s an exciting development – commercially, it obviously provides potential for others (Google included) to set up a competing service to us (!), but I think what is good is the way it has been shown that Google Wave can integrate with existing platforms. For example, there’s a nice demo showing how Google Wave plus MediaWiki (the software that powers the Wikipedia) can be used to help editors who are simultaneously editing a wiki page. If it can be done for wikis, it could aid with lots of things relevant to online communities like boards.ie. For example, moderators could see what other moderators are online at the same time, communicate on issues such as troublesome users, posts with questionable content, and then avoid stepping on each other’s toes when dealing with issues.

Does it have potential for collaborative research projects? Or is it heavyweight/serious enough?

I think it has some potential when combined with other tools that people are using already. There’s an example from SAP of Google Wave being integrated with a business process modelling application. People always seem to fall back to e-mail for various research activities. While wikis and the like can be useful tools for quickly drafting research ideas, papers, projects, etc., there is that element of not knowing who is doing stuff at the same time as you. Just as people are using Gtalk to augment Gmail by being able to communicate with contacts in real time while browsing e-mails, Google Wave could potentially be integrated with other platforms such as collaborative work environments, document sharing systems, etc. It may not be heavyweight enough on its own, but at least it can augment what we already use.

Where does Google Wave sit in terms of the development of the Semantic Web?

I think it could be a huge source of data for the Semantic Web. What we find with various social and collaborative platforms is that people are voluntarily creating lots of useful related data about various objects (people, events, hobbies, organisations), and having a more real-time approach to creating content collaboratively will only make that source of data bigger and hopefully more interlinked. I’d hope that data from Google Wave can be made available using technologies such as SIOC from DERI, NUI Galway and the Online Presence Ontology (something we are also working on).
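As a rough illustration of what exporting Wave content in this way could look like, the sketch below (Python with rdflib again) describes a single Wave message as a sioc:Post held in a sioc:Forum, with a FOAF creator and a presence note. The example.org URIs and the OPO namespace and term are placeholders invented for the example, since the post does not spell out any Online Presence Ontology terms.

```python
# A minimal sketch (not an actual Google Wave export) of how one message in a
# Wave might be published as SIOC data. The example.org URIs and the OPO
# namespace/term are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS, FOAF

SIOC = Namespace("http://rdfs.org/sioc/ns#")
OPO = Namespace("http://example.org/opo/ns#")  # placeholder namespace

g = Graph()
g.bind("sioc", SIOC)
g.bind("opo", OPO)

wave = URIRef("http://example.org/wave/abc123")
blip = URIRef("http://example.org/wave/abc123#blip1")
author = URIRef("http://example.org/people/john#me")

g.add((wave, RDF.type, SIOC.Forum))          # the Wave as a container
g.add((blip, RDF.type, SIOC.Post))           # one message ("blip") in the Wave
g.add((blip, SIOC.has_container, wave))
g.add((blip, SIOC.content, Literal("Draft agenda for the project meeting")))
g.add((blip, DCTERMS.created, Literal("2009-06-24T10:15:00Z")))
g.add((blip, SIOC.has_creator, author))
g.add((author, RDF.type, FOAF.Person))

# A made-up presence statement in the spirit of the Online Presence Ontology.
g.add((author, OPO.currentStatus, Literal("editing the wave")))

print(g.serialize(format="turtle"))
```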

If we are to use Google Wave to pull in feeds from all over the Web, will both RSS and widgets become sexy again?

I haven’t seen the example of Wave pulling in feeds, but in theory, what I could imagine is that real-time updating of information from various sources could allow that stream of current information to be updated, commented upon and forwarded to various other Waves in a very dynamic way. We’ve seen how Twitter has already provided some new life for RSS feeds through services like Twitterfeed automatically pushing RSS updates to Twitter, and this results in significant amounts of rebroadcasting of that content via retweets, etc.

Certainly, one of the big things about Wave is its integration of various third-party widgets, and I think once it is fully launched we will see lots of cool applications building on the APIs that they provide. There have been a few basic demonstrator gadgets shown already, like polls, board games and event planning, but it will probably be the third-party gadgets making good use of the real-time collaboration that are the most interesting, as there will be many more people with ideas outside than among a few internal developers.

Is Wave the first serious example of a communications platform that will only be as good as the third-party developers that contribute to it?

Not really. I think that title applies to many of the communications platforms we use on the Web. Facebook was a busy service but really took off once the user-contributable applications layer was added. Drupal was obviously the work of a core group of people but again the third-party contributions outweigh those of the few that made it.

We already have e-mail and IM combined in Gmail and Google Docs covers the collaborative element so people might be thinking ‘what is so new, groundbreaking or beneficial about Wave?’ What’s your opinion on this?

Perhaps the real-time editing and updating process. Oftentimes it’s difficult to go back in a conversation and add to or fix something you’ve said earlier. But it’s not just a matter of rewriting the past – you can also go back and see what people said before they made an update (“rewind the Wave”).

Is Google heading towards unified communications with Wave, and is it possible that it will combine Gmail, Wave and Google Voice in the future?

I guess Wave could be one portion of a UC suite but I think the Wave idea doesn’t encompass all of the parts…

Do you think Google is looking to pull in conversations the way FriendFeed, Facebook and Twitter does? If so, will it succeed?

Yes, certainly Google have had interests in this area with their acquisition of Jaiku some time back (everyone assumed this would lead to a competitor to Twitter; most recently they made the Jaiku engine available as open source). I am not sure if Google intends to make available a single entry point to all public waves that would rival Twitter or Facebook status updates, but if so, it could be a very powerful competitor.

Is it possible that Wave will become as widely used and ubiquitous as Gmail?

It will take some critical mass to get it going; integrating it into Gmail could be a good first step.

And finally – is the game changing in your opinion?

Certainly, we’ve moved from frequently updated blogs (every few hours/days) to more frequently updated microblogs (every few minutes/seconds), and now to being able to not just update in real time but also go back and easily add to or update what’s been said at any time in the past. People want the freshest content, and this is another step towards not just providing content that is fresh now but also offering a way of freshening the content we’ve made in the past.


Posted at 17:05

John Breslin: Open government and Linked Data; now it's time to draft…

For the past few months, there have been a variety of calls for feedback and suggestions on how the US Government can move towards becoming more open and transparent, especially in terms of their dealings with citizens and also for disseminating information about their recent financial stimulus package.

As part of this, the National Dialogue forum was set up to solicit solutions for ways of monitoring the “expenditure and use of recovery funds”. Tim Berners-Lee wrote a proposal on how linked open data could provide semantically-rich, linkable and reusable data from Recovery.gov. I also blogged about this recently, detailing some ideas for how discussions by citizens on the various uses of expenditure (represented using SIOC and FOAF) could be linked together with financial grant information (in custom vocabularies).
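To make the linking idea a little more concrete, here is a small hypothetical sketch in Python with rdflib: a citizen’s forum post (SIOC/FOAF) points at a grant record described with a made-up recovery vocabulary, and a short SPARQL query then finds discussions of large grants. All of the rec: terms, URIs and figures are invented for illustration; only the SIOC and FOAF terms are standard.

```python
# Illustrative only: the rec: vocabulary, URIs and amounts below are invented
# to show how SIOC/FOAF discussion data could be linked to grant records.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF, XSD

SIOC = Namespace("http://rdfs.org/sioc/ns#")
REC = Namespace("http://example.org/recovery/ns#")   # hypothetical grant vocabulary

g = Graph()
grant = URIRef("http://example.org/recovery/grant/2009-0042")
post = URIRef("http://example.org/forum/post/17")
citizen = URIRef("http://example.org/people/jane#me")

g.add((grant, RDF.type, REC.Grant))
g.add((grant, REC.amountUSD, Literal("2500000", datatype=XSD.decimal)))
g.add((post, RDF.type, SIOC.Post))
g.add((post, SIOC.has_creator, citizen))
g.add((citizen, RDF.type, FOAF.Person))
g.add((post, SIOC.about, grant))   # the link between discussion and spending data

# Find posts discussing grants above a (made-up) threshold.
query = """
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX rec:  <http://example.org/recovery/ns#>
SELECT ?post ?amount WHERE {
  ?post a sioc:Post ;
        sioc:about ?grant .
  ?grant rec:amountUSD ?amount .
  FILTER (?amount > 1000000)
}
"""
for row in g.query(query):
    print(row.post, row.amount)
```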

More recently, the Open Government Initiative solicited ideas for a government that is “more transparent, participatory, and collaborative”, and the brainstorming and discussion phases have just ended. This process is now in its third phase, where the ideas proposed to solve various challenges are to be more formally drafted in a collaborative manner.

What is surprising about this is how few submissions and contributions have been put into this third and final phase (see graph below), especially considering that there is only one week for this to be completed. Some topics have zero submissions, e.g. “Data Transparency via Data.gov: Putting More Data Online”.

[Image: 20090624b]

This doesn’t mean that people aren’t still thinking about this. On Monday, Tim Berners-Lee published a personal draft document entitled “Putting Government Data Online”. But we need more contributions from the Linked Data community to the drafts during phase three of the Open Government Directive if we truly believe that this solution can make a difference.

For those who want to learn more about Linked Data, click on the image below to go to Tim Berners-Lee’s TED talk on Linked Data.

(I watched it again today, and added a little speech bubble to the image below to express my delight at seeing SIOC profiles on the Linked Open Data cloud slide.)

We also have a recently-established Linked Data Research Centre at DERI in NUI Galway.

[Image: 20090624a]


Posted at 17:05

John Breslin: BlogTalk 2009 (6th International Social Software Conference) – Call for Proposals – September 1st and 2nd – Jeju, Korea

[Image: 20090529a]

BlogTalk 2009
The 6th International Conf. on Social Software
September 1st and 2nd, 2009
Jeju Island, Korea

Overview

Following the international success of the last five BlogTalk events, the next BlogTalk – to be held on Jeju Island, Korea on September 1st and 2nd, 2009 – is continuing with its focus on social software, while remaining committed to the diverse cultures, practices and tools of our emerging networked society. The conference (which this year will be co-located with Lift Asia 09) is designed to maintain a sustainable dialogue between developers, innovative academics and scholars who study social software and social media, practitioners and administrators in corporate and educational settings, and other general members of the social software and social media communities.

We invite you to submit a proposal for presentation at the BlogTalk 2009 conference. Possible areas include, but are not limited to:

  • Forms and consequences of emerging social software practices
  • Social software in enterprise and educational environments
  • The political impact of social software and social media
  • Applications, prototypes, concepts and standards

Participants and proposal categories

Due to the interdisciplinary nature of the conference, audiences will come from different fields of practice and will have different professional backgrounds. We strongly encourage proposals to bridge these cultural differences and to be understandable for all groups alike. Along those lines, we will offer three different submission categories:

  • Academic
  • Developer
  • Practitioner

For academics, BlogTalk is an ideal conference for presenting and exchanging research work from current and future social software projects at an international level. For developers, the conference is a great opportunity to fly ideas, visions and prototypes in front of a distinguished audience of peers, to discuss, to link up and to learn (developers may choose to give a practical demonstration rather than a formal presentation if they so wish). For practitioners, this is a venue to discuss use cases for social software and social media, and to report on any results you may have with like-minded individuals.

Submitting your proposals

For review purposes, you must submit a one-page abstract (not to exceed 600 words) of the work you intend to present. Please upload your submission along with some personal information using the EasyChair conference area for BlogTalk 2009. You will receive an immediate confirmation that your submission has arrived. The submission deadline is June 27th, 2009.

Following notification of acceptance, you will be invited to submit a short or long paper (four or eight pages respectively) for the conference proceedings. BlogTalk is a peer-reviewed conference.

Timeline and important dates

  • One-page abstract submission deadline: June 27th, 2009
  • Notification of acceptance or rejection: July 13th, 2009
  • Full paper submission deadline: August 27th, 2009

(Due to the tight schedule we expect that there will be no deadline extension. As with previous BlogTalk conferences, we will work hard to endow a fund for supporting travel costs. As soon as we review all of the papers we will be able to announce more details.)

Topics

Application Portability
Bookmarking
Business
Categorisation
Collaboration
Content Sharing
Data Acquisition
Data Mining
Data Portability
Digital Rights
Education
Enterprise
Ethnography
Folksonomies and Tagging
Human Computer Interaction
Identity
Microblogging
Mobile
Multimedia
Podcasting
Politics
Portals
Psychology
Recommender Systems
RSS and Syndication
Search
Semantic Web
Social Media
Social Networks
Social Software
Transparency and Openness
Trend Analysis
Trust and Reputation
Virtual Worlds
Web 2.0
Weblogs
Wikis

Posted at 17:05

Copyright of the postings is owned by the original blog authors. Contact us.