Planet RDF

It's triples all the way down

July 09

Michael Hausenblas: Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion with a customer prompted me to take a closer look at support for encryption in the context of XaaS cloud service offerings as well as in Hadoop. In general, this breaks down into over-the-wire encryption (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rather rare.

Different reasons might exist why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to implement it in the application logic.

At the IaaS level, especially for file storage used in app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and concerning Google’s App Engine, good practices for data encryption only seem to be emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft Skydrive seem to not offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering; there are a few activities around making HDFS or the like support encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available in AWS’s EMR by using S3.
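For the S3 case at least, back-end encryption can be requested per object at upload time. A minimal sketch using today's boto3 SDK (the bucket and key names are made up for illustration, and boto3 itself postdates this post):

import boto3  # assumes AWS credentials are already configured in the environment

s3 = boto3.client("s3")
# Ask S3 to encrypt the object at rest with its managed keys (SSE-S3).
with open("report.csv", "rb") as f:
    s3.put_object(
        Bucket="example-bucket",
        Key="reports/report.csv",
        Body=f,
        ServerSideEncryption="AES256",
    )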

Posted at 20:34

June 24

Jeen Broekstra: Making play series on Poker

Any other exposed card has to be replaced. Check-raise is permitted on any hand after the draw, and a seven or better is no longer required to bet. No-Limit and Pot-Limit Lowball: all the rules for no-limit and pot-limit poker apply to no-limit and pot-limit lowball. All other lowball rules apply, except as noted. A player is not entitled to know that an opponent cannot have the best possible hand, so these rules for exposed cards before the draw apply: in ace-to-five lowball, a player must take an exposed card of A, 2, 3, 4, or 5, and any other card has to be replaced. In deuce-to-seven lowball, the player must take an exposed card of 2, 3, 4, 5, or 7, and any other card, including a six, has to be replaced. After the draw, any exposed card has to be replaced. Check-raise is allowed after the draw, and a player can check any hand without penalty.

There are two betting rounds, one before the draw and one after the draw. The game is played with an ante and a button. Players in turn can fold, open for the minimum, or open with a raise. After the opening betting round, players have the opportunity to draw new cards to replace the ones they discard. Action after the draw begins with the opener, or the next player proceeding clockwise if the opener has folded. The betting limit after the draw is twice the amount of the betting limit before the draw.

Some high forms of the game permit a player to open with any hand; others require the opener to have a pair of jacks or better. A maximum of one bet and four raises is permitted in multi-handed pots. See Explanations for more information on this rule. Check-raise is permitted both before and after the draw. Any card that is exposed by the dealer before the draw must be kept. Five cards constitute a playing hand. More or fewer than five cards for a player before action has been taken constitutes a misdeal. If action has been taken, a player with fewer than five cards may draw the number of cards necessary to complete a five-card hand.

The button can receive the fifth card even if action has taken place. More or fewer than five cards after the draw constitutes a fouled hand. A player can draw up to four consecutive cards. If a player wishes to draw five new cards, four are dealt right away, and the fifth card after everyone else has drawn cards. If the last player wishes to draw five new cards, four are dealt right away, and a card is burned before the player receives a fifth card. See Explanations for more information on this rule. You can change the number of cards you wish to draw, provided no cards have been dealt off the deck in response to your request.

The post Making play series on Poker appeared first on Rivull Development.

Posted at 19:09

Jeen Broekstra: Jürgen Klopp says he cried when NHS staff sang You’ll Never Walk Alone

Klopp, chatting with the club’s website, added: “There are numerous people out there that have much bigger problems, so it would feel really embarrassing to myself if I was to talk about my ‘problems’ – I have the problems everyone in the world has at the moment. That’s the lesson we learn in this moment.

“Four or five weeks ago it’s like a lot of nations thought: ‘That’s our problem, that’s our problem, that’s our problem, we have a problem with them’ and stuff like this. Now nature shows us we are all the same and that we all have the same problems in the same moment, and that we need to work together on the solution. There’s nothing good in that situation apart from maybe what we can learn from that.”

Liverpool hosted the last high-profile match in England before football was suspended when Atlético Madrid, and approximately 3,000 of their supporters, visited Anfield in the Champions League on 11 March. Klopp recalled: “We played the Bournemouth game on Saturday, we won it, then Sunday City lost, so the message for us was ‘two wins to go’. But then on Monday morning I woke up and heard about the situation in Madrid, that they would close the schools and universities from Wednesday, so it was really strange to prepare for that game to be honest.

“I usually don’t struggle with things around me, I can build barriers right and left when I prepare for a game, but in that moment it was really difficult. Wednesday we had the game, I loved the game, I loved what I saw from the boys, it was a really special performance apart from the result – we didn’t score enough, we conceded too many, that’s all clear, but between these two main pieces of information it was a great game!

“Thursday [we were] off, then Friday when we arrived it was already clear this was not a normal session. Yes, we trained, but it was more of a meeting. We had tons of things to talk about, tons of things to think about, things I had never thought about before in my life.

“Nobody knew exactly – and no one knows exactly – how it will continue, so the only way we could do it was to organise it as well as possible for the boys and make sure everything is sorted as much as we can sort it in our little space, in the little area where we are responsible, really. That’s what we did in a very short time, then we sent the boys home, went home ourselves and here we are still.”

The post Jürgen Klopp says he cried when NHS staff sang You’ll Never Walk Alone appeared first on Rivull Development.

Posted at 19:09

Jeen Broekstra: ‘It’s horrible’: Halesowen halted with promotion and Wembley in sight

Simeon Cobourne scores Halesowen’s winner in the FA Trophy at Barnet.
“It was a touch surreal in certain stages,” said the captain Paul McCone. “As people from a lower division you usually say: ‘This is your final.’ Every game should be our final, but we were looking at getting to the next stage, the next stage and then the next stage.”

The FA stated that it “remains hopeful” of completing the final rounds of the FA Trophy but Halesowen are the only team in either step three or four to reach this stage, which means that they stand alone on an island with no remaining league games and no one else to play.

“If the games do get played, we’ve got to try to find some opposition to get a couple of friendlies in before that,” said Smith. “Well, all step three to six sides have been stood down now. So, I don’t know how we would manage to do that.”

Halesowen’s success also reflects the transient nature of non-league football. Smith arrived to a revolving door of a football club that he says utilised nearly 100 players last season. He rebuilt the side, tying the squad players to contracts so as to foster a tight-knit team and enable fans to familiarise themselves with players. Smith convinced most of his players to step down from higher divisions, assuring them that they would go straight back up. In the end, they performed even better than they imagined but the final result was out of their hands. Now Smith must convince them to stay in step four for another season.

“I’m hoping that everyone wants to remain at the club due to that factor,” says McCone. “In theory, we’ve got a loose end. Unforeseen circumstances have caused that loose end, but it’s definitely something to get ticked off the list. The plan was to go there, get promoted and get Halesowen up the league. No matter what has gone on, that plan hasn’t changed.”

Smith seems to be at the right place for another attempt. The club was near bankruptcy when it was taken over by Keith McKenna and Karen Brooks at the end of 2018 and they have turned it around. The finances are healthy enough to survive these unprecedented stoppages and it seems that the solidarity is, too.

“There’s no point looking at what ifs and the way the season could have progressed,” said Smith. “It’s so, so important that we take great pride in how far we’ve come … It’s been a magical, magical journey. As much as it might be wiped off the record books, nobody can take those memories away from these fans.”

The post ‘It’s horrible’: Halesowen halted with promotion and Wembley in sight appeared first on Rivull Development.

Posted at 19:09

June 11

Libby Miller: Pi / openCV / Tensorflow again

Cat detector code updated for Raspbian Buster. I used the Lite image. A few things have changed since the last time. The code is here.

Download Raspbian

I got Raspbian Buster Lite (from https://www.raspberrypi.org/downloads/raspbian/ )

Burn it onto an SD card. Then (assuming a Mac) enable ssh:

touch /Volumes/boot/ssh
Add the wifi
nano /Volumes/boot/wpa_supplicant.conf

The file should contain something like:

country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
   ssid="foo"
   psk="bar"
}

Then eject the card and put it in the Pi.

ssh into it from your laptop

ssh pi@raspberrypi.local
password: raspberry
sudo nano /etc/hosts
sudo nano /etc/hostname

Reboot

sudo reboot

Set up a virtualenv for python

This is not strictly necessary but keeps things tidy. You can also just use the built-in Python; just make sure you are using python3 and pip3 if so.

ssh into the pi again, then:

sudo apt update
sudo apt-get install python3-pip
sudo pip3 install virtualenv
virtualenv env
source env/bin/activate
(env) pi@birdbot:~ $ python --version
Python 3.7.3 # or similar

Enable the camera

sudo raspi-config # and enable camera under 'interfacing'; reboot

Install Tensorflow

Increase the swap size:

sudo nano /etc/dphys-swapfile

The default value in Raspbian is:

CONF_SWAPSIZE=100

We will need to change this to:

CONF_SWAPSIZE=1024

Restart the service that manages the swapfile on Raspbian:

sudo /etc/init.d/dphys-swapfile restart

Install tensorflow dependencies

sudo apt-get install libatlas-base-dev
sudo apt-get install git
pip install --upgrade tensorflow

(this takes a few minutes)

Test that tensorflow installed ok:

python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

You may see an error about hadoop –

HadoopFileSystem load error: libhdfs.so: cannot open shared object file: No such file or directory.

See also tensorflow/tensorflow#36141. That doesn’t seem to matter.

You could try some user-built tensorflow binaries – I tried this one, which seemed to corrupt my SD card, but have not tried this one. Tensorflow 2 would be better to learn (the APIs all changed between 1.4 and 2).
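If you do end up on Tensorflow 2, the equivalent sanity check is simpler, since eager execution is the default and the random op moved to tf.random (a minimal sketch):

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"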

Install OpenCV

sudo apt-get install libjasper-dev libqtgui4 libqt4-test libhdf5-dev libharfbuzz0b libilmbase-dev libopenexr-dev libgstreamer1.0-dev libavcodec-dev libavformat-dev libswscale5

pip install opencv-contrib-python==3.4.3.18 #(see this)

test

python -c 'import cv2; print(cv2.__version__)'

Install camera dependencies

pip install imutils picamera

Install speaking dependencies

sudo apt-get install espeak-ng

Finally:

git clone https://github.com/libbymiller/cat_loving_robot
cd cat_loving_robot
python classify_image.py

If you want to add the servos and so on for cat detecting and running towards cats, or start it up automatically, there’s more info on GitHub.
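For reference, grabbing a frame from the Pi camera into OpenCV with the packages installed above looks roughly like the sketch below. This is illustrative only, not the code from the repo (classify_image.py does the actual classification):

import time
import cv2
from picamera import PiCamera
from picamera.array import PiRGBArray  # needs numpy, which comes in with OpenCV

camera = PiCamera()
camera.resolution = (640, 480)
raw = PiRGBArray(camera, size=(640, 480))
time.sleep(2)  # give the sensor a moment to warm up

# Capture a single frame as a BGR array that OpenCV understands,
# then save it; a classifier can be run on the same array.
camera.capture(raw, format="bgr")
frame = raw.array
cv2.imwrite("frame.jpg", frame)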

Posted at 16:06

June 09

John Goodwin: On Beyond OWL: challenges for ontologies on the Web by James Hendler

Posted at 12:06

May 30

Ebiquity research group UMBC: paper: Context Sensitive Access Control in Smart Home Environments


Context Sensitive Access Control in Smart Home Environments


Sofia Dutta, Sai Sree Laya Chukkapalli, Madhura Sulgekar, Swathi Krithivasan, Prajit Kumar Das, and Anupam Joshi, Context Sensitive Access Control in Smart Home Environments, 6th IEEE International Conference on Big Data Security on Cloud, May 2020

The rise in popularity of Internet of Things (IoT) devices has opened doors for privacy and security breaches in Cyber-Physical systems like smart homes, smart vehicles, and smart grids that affect our daily existence. IoT systems are also a source of big data that gets shared via the cloud. IoT systems in a smart home environment have sensitive access control issues since they are deployed in a personal space. The collected data can also be of a highly personal nature. Therefore, it is critical to build access control models that govern who, under what circumstances, can access which sensed data or actuate a physical system. Traditional access control mechanisms are not expressive enough to handle such complex access control needs, warranting the incorporation of new methodologies for privacy and security. In this paper, we propose the creation of the PALS system, which builds upon existing work in an attribute-based access control model, captures physical context collected from sensed data (attributes), and performs dynamic reasoning over these attributes and context-driven policies using Semantic Web technologies to execute access control decisions. Reasoning over user context, details of the information collected by the cloud service provider, and device type, our mechanism generates access control decisions as a consequence. Our system’s access control decisions are supplemented by another sub-system that detects intrusions into smart home systems based on both network and behavioral data. The combined approach serves to determine indicators that a smart home system is under attack, as well as limit what such attacks can achieve in terms of data breaches.

The post paper: Context Sensitive Access Control in Smart Home Environments appeared first on UMBC ebiquity.

Posted at 22:05

Ebiquity research group UMBC: paper: Automating GDPR Compliance using Policy Integrated Blockchain

Automating GDPR Compliance using Policy Integrated Blockchain


Abhishek Mahindrakar and Karuna Pande Joshi, Automating GDPR Compliance using Policy Integrated Blockchain, 6th IEEE International Conference on Big Data Security on Cloud, May 2020.

Data protection regulations, like GDPR, mandate security controls to secure personally identifiable information (PII) of the users which they share with service providers. With the volume of shared data reaching exascale proportions, it is challenging to ensure GDPR compliance in real-time. We propose a novel approach that integrates GDPR ontology with blockchain to facilitate real-time automated data compliance. Our framework ensures data operation is allowed only when validated by data privacy policies in compliance with privacy rules in GDPR. When a valid transaction takes place the PII data is automatically stored off-chain in a database. Our system, built using Semantic Web and Ethereum Blockchain, includes an access control system that enforces data privacy policy when data is shared with third parties.

The post paper: Automating GDPR Compliance using Policy Integrated Blockchain appeared first on UMBC ebiquity.

Posted at 16:05

May 29

Libby Miller: Zoom on a Pi 4 (4GB)

It works using Chromium, not the Zoom app (which only runs on x86, not ARM). I tested it with a two-person, two-video-stream call. You need a screen (I happened to have a spare 7″ touchscreen). You also need a keyboard for the initial setup, and a mouse if you don’t have a touchscreen.

The really nice thing is that Video4Linux (bcm2835-v4l2) support has improved so it works with both v1 and v2 raspi cameras, and no need for options bcm2835-v4l2 gst_v4l2src_is_broken=1 🎉🎉

IMG_4695

So:

  • Install Raspbian Buster
  • Connect the screen, keyboard, mouse, camera and speaker/mic. I used a Sennheiser USB speaker/mic, and a standard 2.1 Raspberry Pi camera.
  • Boot up. I had to add lcd_rotate=2 in /boot/config.txt for my screen to rotate it 180 degrees.
  • Don’t forget to enable the camera in raspi-config
  • Enable bcm2835-v4l2 – add it to /etc/modules (sudo nano /etc/modules)
  • I increased swapsize using sudo nano /etc/dphys-swapfile -> CONF_SWAPSIZE=2000 -> sudo /etc/init.d/dphys-swapfile restart
  • I increased GPU memory using sudo nano /boot/config.txt -> gpu_mem=512

You’ll need to set up Zoom and pass captchas using the keyboard and mouse. Once you have logged into Zoom you can often ssh in and start it remotely like this:

export DISPLAY=:0.0
/usr/bin/chromium-browser --kiosk --disable-infobars --disable-session-crashed-bubble --no-first-run https://zoom.us/wc/XXXXXXXXXX/join/

Note the url format – this is what you get when you click “join from my browser”. If you use the standard Zoom url you’ll need to click this url yourself, ignoring the Open xdg-open prompts.

IMG_4699

You’ll still need to select the audio and start the video, including allowing it in the browser. You might need to select the correct audio and video, but I didn’t need to.

I experimented a bit with an ancient logitech webcam-speaker-mic and the speaker-mic part worked and video started but stalled – which made me think that a better / more recent webcam might just work.

Posted at 17:06

May 27

Libby Miller: Removing rivets

I wanted to stay away from the computer during a week off work so I had a plan to fix up some garden chairs whose wooden slats had gone rotten:

IMG_4610

Looking more closely I realised the slats were riveted on. How do you get rivets off? I asked my hackspace buddies and Barney suggested drilling them out. They have an indentation in the back and you don’t have to drill very far to get them out.

The first chair took me two hours to drill out 15 rivets, and was a frustrating and sweaty experience. I checked YouTube to make sure I wasn’t doing anything stupid and tried a few different drill bits. My last chair today took 15 minutes, so! My amateurish top tips / reminder for me next time:

  1. Find a drill bit the same size as the hole that the rivet’s gone through
  2. Make sure it’s a tough drill bit, and not too pointy. You are trying to pop off the bottom end of the rivet – it comes off like a ring – and not drill a hole into the rivet itself.
  3. Wear eye protection – there’s the potential for little bits of sharp metal to be flying around
  4. Give it some welly – I found it was really fast once I started to put some pressure on the drill
  5. Get the angle right – it seemed to work best when I was drilling exactly vertically down into the rivet, and not at a slight angle.
  6. Once drilled, you might need to pop them out with a screwdriver or something of the right width plus a hammer

IMG_4616

More about rivets.

Posted at 21:06

May 26

Leigh Dodds: Cooking up a new approach to supporting purposeful use of data

In my last post I explored how we might better support the use of datasets. To do that I applied the BASEDEF framework to outline the ways in which communities might collaborate to help unlock more value from individual datasets.

But what if we changed our focus from supporting discovery and use of datasets and instead focused on helping people explore specific types of problems or questions?

Our paradigm around data discovery is based on helping people find individual datasets. But unless a dataset has been designed to answer the specific question you have in mind, then it’s unlikely to be sufficient. Any non-trivial analysis is likely to need multiple datasets.

We know that data is more useful when it is combined, so why isn’t our approach to discovery based around identifying useful collections of datasets?

A cooking metaphor

To explore this further let’s use a cooking metaphor. I love cooking.

Many cuisines are based on a standard set of elements. Common spices or ingredients that become the base of most dishes. Like a mirepoix, a sofrito, the holy trinity of Cajun cooking, or the mother sauces in French cuisine.

As you learn to cook you come to appreciate how these flavour bases and sauces can be used to create a range of dishes. Add some extra spices and ingredients and you’ve created a complete dish.

Recipes help us consistently recreate these sauces.

A recipe consists of several elements. It will have a set of ingredients and a series of steps to combine them. A good recipe will also include some context. For example some background on the origins of the recipe and descriptions of unusual spices or ingredients. It might provide some things to watch out for during the cooking (“don’t burn the spices”) or suggest substitutions for difficult to source ingredients.

Our current approach to dataset discovery involves trying to document the provenance of an individual ingredient (a dataset) really well. We aren’t helping people combine them together to achieve results.

Efforts to improve dataset metadata, documentation and provenance reporting are important. Projects like the dataset nutrition label are great examples of that. We all want to be ethical, sustainable cooks. To do that we need to make informed choices about our ingredients.

But, to whisk these food metaphors together, nutrition labels are there to help you understand what’s gone into your supermarket pasta sauce. It’s not giving you a recipe to cook it from scratch for yourself. Or an idea of how to use the sauce to make a tasty dish.

Recipes for data-informed problem solving

I think we should think about sharing dataset recipes: instructions for how to mix up a selection of dataset ingredients. What would they consist of?

Firstly, the recipe would need to be based around a specific type of question, problem or challenge. Examples might include:

  • How can I understand air quality in my city?
  • How is deprivation changing in my local area?
  • What are the impacts of COVID-19 in my local authority?

Secondly, a recipe would include a list of datasets that have to be sourced, prepared and combined together to explore the specific problem. For example, if you’re exploring impacts of COVID-19 in your local authority you’re probably going to need:

  • demographic data from the most recent census
  • spatial boundaries to help visualise and present results
  • information about deprivation to help identify vulnerable people

Those three datasets are probably the holy trinity of any local spatial analysis?

Finally, you’re going to need some instructions for how to combine the datasets together. The instructions might identify some tools you need (Excel or QGIS), reference some techniques (reprojection) and maybe give some hints about how to substitute for key ingredients if you can’t get them in your local area (FOI).

The recipe might suggest ways to vary it for different purposes: add a sprinkle of Companies House data to understand your local business community, and a dash of OpenStreetMap to identify greenspaces?
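Purely as an illustration, a recipe like that could even be written down in a lightweight, machine-readable form. The sketch below uses invented dataset names, fields and steps, so treat it as a shape rather than a proposal:

covid_impacts_recipe = {
    "question": "What are the impacts of COVID-19 in my local authority?",
    "ingredients": [
        {"dataset": "Census demographic estimates", "role": "population denominators"},
        {"dataset": "Local authority boundaries", "role": "spatial units for mapping"},
        {"dataset": "Index of Multiple Deprivation", "role": "identifying vulnerable areas"},
    ],
    "tools": ["Excel", "QGIS"],
    "steps": [
        "Join deprivation and demographic data on area codes",
        "Reproject boundaries to a common CRS before mapping",
        "Visualise rates per 1,000 population by area",
    ],
    "substitutions": {"missing local data": "request it via FOI"},
}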

As a time saver maybe you can find some pre-made versions of some of the steps in the recipe?

Examples in the wild

OK, it’s easy to come up with a metaphor and an idea. But would this actually meet a need? There are a few reasons why I’m reasonably confident that dataset recipes could be helpful. Mostly because I can see this same approach re-appearing in some related contexts. For example:

If you have examples then let me know in the comments or on twitter.

How can dataset recipes help?

I think there’s a whole range of ways in which these types of recipe can be useful.

Data analysis always starts by posing a question. Documenting how datasets can be applied to specific questions will make them easier to find on search engines. It just fits better with what people want to do.

Data discovery is important during periods where there is a sudden influx of new potential users. For example, where datasets have just been published under an open licence and are now available to more people, for a wider range of purposes.

In my experience data analysts and scientists who understand a domain, e.g. population or transport modelling, have built up a tacit understanding of what datasets are most useful in different contexts. They understand the limitations and the process of combining datasets together. This thread from Chris Gale with a recipe about doing spatial analysis using PHE’s COVID-19 data is a perfect example. Documenting and sharing this knowledge can help others to do similar analyses. It’s like a cooking masterclass.

Discovery is also difficult when there is a sudden influx of new data available. Such as during this pandemic. Writing recipes is a good way to share learning across a community.

Documenting useful recipes might help us scale innovation across local areas.

Lastly, we’re still trying to understand which datasets are the most important part of our local, national and international data infrastructure. We’re currently lacking any real quantitative information about how datasets are combined together. In the same way that recipes can be analysed to create ingredient networks, dataset recipes could be analysed to find out how datasets are being used together. We can then strengthen that infrastructure.

If you’ve built something that helps people publish dataset recipes then send me a link to your app. I’d like to try it.

Posted at 19:05

May 24

Ebiquity research group UMBC: Why does Google think Raymond Chandler starred in Double Indemnity?

In my knowledge graph class yesterday we talked about the SPARQL query language and I illustrated it with DBpedia queries, including an example getting data about the movie Double Indemnity. I had brought a Google Assistant device and used it to compare its answers to those from DBpedia. When I asked the Google assistant “Who starred in the film Double Indemnity”, the first person it mentioned was Raymond Chandler. I knew this was wrong, since he was one of its screenwriters, not an actor, and shared an Academy Award nomination for the screenplay. DBpedia’s data was correct and did not list Chandler as one of the actors.
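A query along those lines, run against the public DBpedia endpoint with SPARQLWrapper, looks like this (a minimal sketch, not necessarily the exact query used in class):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?actor WHERE { dbr:Double_Indemnity dbo:starring ?actor }
""")
# Print the actors DBpedia lists for the film; Chandler is not among them.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["actor"]["value"])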

I did not feel too bad about this — we shouldn’t expect perfect accuracy in these huge, general purpose knowledge graphs and at least Chandler played an important role in making the film.

After class I looked at the Wikidata page for Double Indemnity (Q478209) and saw that it did list Chandler as an actor. I take this as evidence that Google’s knowledge Graph got this incorrect fact from Wikidata, or perhaps from a precursor, Freebase.

The good news 🙂 is that Wikidata had flagged the fact that Chandler (Q180377) was a cast member in Double Indemnity with a “potential issue”. Clicking on this revealed that the issue was that Chandler was not known to have an occupation of the kind that the “cast member” property (P161) expects, which includes twelve types, such as actor, opera singer, comedian, and ballet dancer. Wikidata lists Chandler’s occupations as screenwriter, novelist, writer and poet.

More good news 😀 is that the Wikidata fact had provenance information in the form of a reference stating that it came from CSFD (Q3561957), a “Czech and Slovak web project providing a movie database”. Following the link Wikidata provided eventually led me to the resource, which allowed me to search for and find its Double Indemnity entry. Indeed, it lists Raymond Chandler as one of the movie’s Hrají. All that was left to do was to ask for a translation, which confirmed that Hrají means “starring”.

Case closed? Well, not quite. What remains is fixing the problem.

The final good news 🙂 is that it’s easy to edit or delete an incorrect fact in Wikidata. I plan to delete the incorrect fact in class next Monday. Over the weekend I’ll look into possible options for adding an annotation in some way to ignore the incorrect CSFD source for Chandler being a cast member.

Some possible bad news 🙁 is that public knowledge graphs like Wikidata might be exploited by unscrupulous groups or individuals in the future to promote false or biased information. Wikipedia is reasonably resilient to this, but the problem may be harder to manage for public knowledge graphs, which get much of their data from other sources that could be manipulated.

The post Why does Google think Raymond Chandler starred in Double Indemnity? appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition

The CCS Dashboard’s sections provide information on sources and targets of network events, file operations monitored and sub-events that are part of the APT kill chain. An alert is generated when a likely complete APT is detected after reasoning over events.

Early Detection of Cybersecurity Threats Using Collaborative Cognition

Sandeep Narayanan, Ashwinkumar Ganesan, Karuna Joshi, Tim Oates, Anupam Joshi and Tim Finin, Early Detection of Cybersecurity Threats using Collaborative Cognition, 4th IEEE International Conference on Collaboration and Internet Computing, Philadelphia, October 2018.

 

The early detection of cybersecurity events such as attacks is challenging given the constantly evolving threat landscape. Even with advanced monitoring, sophisticated attackers can spend more than 100 days in a system before being detected. This paper describes a novel, collaborative framework that assists a security analyst by exploiting the power of semantically rich knowledge representation and reasoning integrated with different machine learning techniques. Our Cognitive Cybersecurity System ingests information from various textual sources and stores them in a common knowledge graph using terms from an extended version of the Unified Cybersecurity Ontology. The system then reasons over the knowledge graph that combines a variety of collaborative agents representing host and network-based sensors to derive improved actionable intelligence for security administrators, decreasing their cognitive load and increasing their confidence in the result. We describe a proof of concept framework for our approach and demonstrate its capabilities by testing it against a custom-built ransomware similar to WannaCry.

The post paper: Early Detection of Cybersecurity Threats Using Collaborative Cognition appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: talk: Design and Implementation of an Attribute Based Access Controller using OpenStack Services

Design and Implementation of an Attribute Based Access Controller using OpenStack Services

Sharad Dixit, Graduate Student, UMBC
10:30am Monday, 24 September 2018, ITE346

With the advent of cloud computing, industries began a paradigm shift from the traditional way of computing towards cloud computing, as it fulfilled organizations’ present requirements such as on-demand resource allocation, lower capital expenditure, scalability and flexibility, but with that it brought a variety of security and user data breach issues. To solve the issues of user data and security breaches, organizations have started to implement hybrid cloud, where the underlying cloud infrastructure is set up by the organization and is accessible from anywhere around the world because of the distinguishable security edges provided by it. However, most cloud platforms provide a Role Based Access Controller, which is not adequate for complex organizational structures. A novel mechanism is proposed using OpenStack services and semantic web technologies to develop a module which evaluates a user’s and project’s multi-varied attributes and runs them against access policy rules defined by an organization before granting access to the user. Hence, an organization can deploy our module to obtain robust and trustworthy access control based on multiple attributes of a user and the project the user has requested in a hybrid cloud platform like OpenStack.

The post talk: Design and Implementation of an Attribute Based Access Controller using OpenStack Services appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: paper: Attribute Based Encryption for Secure Access to Cloud Based EHR Systems

Attribute Based Encryption for Secure Access to Cloud Based EHR Systems

Maithilee Joshi, Karuna Joshi and Tim Finin, Attribute Based Encryption for Secure Access to Cloud Based EHR Systems, IEEE International Conference on Cloud Computing, San Francisco CA, July 2018

 

Medical organizations find it challenging to adopt cloud-based electronic medical records services, due to the risk of data breaches and the resulting compromise of patient data. Existing authorization models follow a patient-centric approach for EHR management, where the responsibility of authorizing data access is handled at the patients’ end. This, however, creates a significant overhead for the patient, who has to authorize every access of their health record. This is not practical given the multiple personnel involved in providing care and that at times the patient may not be in a state to provide this authorization. Hence there is a need to develop a proper authorization delegation mechanism for safe, secure and easy cloud-based EHR management. We have developed a novel, centralized, attribute-based authorization mechanism that uses Attribute Based Encryption (ABE) and allows for delegated secure access of patient records. This mechanism transfers the service management overhead from the patient to the medical organization and allows easy delegation of cloud-based EHRs’ access authority to the medical providers. In this paper, we describe this novel ABE approach as well as the prototype system that we have created to illustrate it.

The post paper: Attribute Based Encryption for Secure Access to Cloud Based EHR Systems appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: Videos of ISWC 2017 talks

Videos of almost all of the talks from the 16th International Semantic Web Conference (ISWC) held in Vienna in 2017 are online at videolectures.net. They include 89 research presentations, two keynote talks, the one-minute madness event and the opening and closing ceremonies.

The post Videos of ISWC 2017 talks appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: paper: Automated Knowledge Extraction from the Federal Acquisition Regulations System

Automated Knowledge Extraction from the Federal Acquisition Regulations System (FARS)

Srishty Saha and Karuna Pande Joshi, Automated Knowledge Extraction from the Federal Acquisition Regulations System (FARS), 2nd International Workshop on Enterprise Big Data Semantic and Analytics Modeling, IEEE Big Data Conference, December 2017.

With increasing regulation of Big Data, it is becoming essential for organizations to ensure compliance with various data protection standards. The Federal Acquisition Regulations System (FARS) within the Code of Federal Regulations (CFR) includes facts and rules for individuals and organizations seeking to do business with the US Federal government. Parsing and gathering knowledge from such lengthy regulation documents is currently done manually and is time and human intensive. Hence, developing a cognitive assistant for automated analysis of such legal documents has become a necessity. We have developed a semantically rich approach to automate the analysis of legal documents and have implemented a system to capture various facts and rules contributing towards building an efficient legal knowledge base that contains details of the relationships between various legal elements, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules. In this paper, we describe our framework along with the results of automating knowledge extraction from the FARS document (Title 48, CFR). Our approach can be used by Big Data users to automate knowledge extraction from large legal documents.

The post paper: Automated Knowledge Extraction from the Federal Acquisition Regulations System appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: W3C Recommendation: Time Ontology in OWL

W3C Recommendation: Time Ontology in OWL

The Spatial Data on the Web Working Group has published a W3C Recommendation of the Time Ontology in OWL specification. The ontology provides a vocabulary for expressing facts about  relations among instants and intervals, together with information about durations, and about temporal position including date-time information. Time positions and durations may be expressed using either the conventional Gregorian calendar and clock, or using another temporal reference system such as Unix-time, geologic time, or different calendars.
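As a quick flavour of the vocabulary, here is a minimal sketch (using rdflib, with made-up example resources) that describes an interval and the instant at which it begins:

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

TIME = Namespace("http://www.w3.org/2006/time#")
g = Graph()
g.bind("time", TIME)

meeting = URIRef("http://example.org/meeting")      # an interval
start = URIRef("http://example.org/meetingStart")   # the instant it begins

g.add((meeting, RDF.type, TIME.Interval))
g.add((start, RDF.type, TIME.Instant))
g.add((meeting, TIME.hasBeginning, start))
g.add((start, TIME.inXSDDateTimeStamp,
       Literal("2017-10-19T09:00:00Z", datatype=XSD.dateTimeStamp)))

print(g.serialize(format="turtle"))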

The post W3C Recommendation: Time Ontology in OWL appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: Agniva Banerjee on Managing Privacy Policies through Blockchain

Link before you Share: Managing Privacy Policies through Blockchain

Agniva Banerjee

11:00am Monday, 16 October 2017

An automated access-control and audit mechanism that enforces users’ data privacy policies when sharing their data across third parties, by utilizing privacy policy ontology instances with the properties of blockchain.

The post Agniva Banerjee on Managing Privacy Policies through Blockchain appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: talk: Automated Knowledge Extraction from the Federal Acquisition Regulations System

In this week’s meeting, Srishty Saha, Michael Aebig and Jiayong Lin will talk about their work on extracting knowledge from the US FAR System.

Automated Knowledge Extraction from the Federal Acquisition Regulations System

Srishty Saha, Michael Aebig and Jiayong Lin

11am-12pm Monday, 25 September 2017, ITE346, UMBC

The Federal Acquisition Regulations System (FARS) within the Code of Federal Regulations (CFR) includes facts and rules for individuals and organizations seeking to do business with the US Federal government. Parsing and extracting knowledge from such lengthy regulation documents is currently done manually and is time and human intensive. Hence, developing a cognitive assistant for automated analysis of such legal documents has become a necessity. We are developing a semantically rich legal knowledge base representing legal entities and their relationships, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules.

The post talk: Automated Knowledge Extraction from the Federal Acquisition Regulations System appeared first on UMBC ebiquity.

Posted at 17:05

Ebiquity research group UMBC: 2018 Ontology Summit: Ontologies in Context

2018 Ontology Summit: Ontologies in Context

The OntologySummit is an annual series of online and in-person events that involves the ontology community and communities related to each year’s topic. The topic chosen for the 2018 Ontology Summit will be Ontologies in Context, which the summit describes as follows.

“In general, a context is defined to be the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed. Some examples of synonyms include circumstances, conditions, factors, state of affairs, situation, background, scene, setting, and frame of reference. There are many meanings of “context” in general, and also for ontologies in particular. The summit this year will survey these meanings and identify the research problems that must be solved so that contexts can succeed in achieving the full understanding and assessment of an ontology.”

Each year’s Summit comprises a series of both online and face-to-face events that span about three months. These include a vigorous three-month online discourse on the theme, online panel discussions and research activities, which will culminate in a two-day face-to-face workshop and symposium.

Over the next two months, there will be a sequence of weekly online meetings to discuss, plan and develop the 2018 topic. The summit itself will start in January with weekly online sessions of invited speakers. Visit the 2018 Ontology Summit site for more information and to see how you can participate in the planning sessions.

The post 2018 Ontology Summit: Ontologies in Context appeared first on UMBC ebiquity.

Posted at 17:05

May 23

Leigh Dodds: How can you help support the use of a dataset?

Getting the most value from data, whilst minimising its harmful impacts, is a community activity. Datasets need to be governed and published well. Most of that responsibility falls on the data publisher. Because the choices they make shape data ecosystems.

But other people have a role to play too. Being a good data user means engaging with that process.

Helping others to find data and find the value in it, feels particularly important at the moment. During the pandemic there are many new datasets becoming available. And there are lots of questions to be answered. Some of them can be answered through better use of data.

So, how can communities work together to support use of data?

There are a lot of different ways to explore that question. But there’s a framework called BASEDEF, created by the open source community, which I find helpful.

BASEDEF stands for Blog, Apply, Suggest, Extend, Document, Evangelize and Fix. It describes the different types of contributions that can support an open source project. It can also be applied to help organise a small team in doing that work. Here’s a handy cheat sheet.

But the framework can also be applied to the task of supporting the use of an openly licensed dataset. Let’s run through the framework with that in mind.


Blog

You can write about a dataset to help others to discover it. You can help explain the potential value of applying the dataset to specific problems. Or perhaps you can see some downsides that others should consider.

Writing about how a dataset has been useful to you, by describing how you’ve successfully applied it in a project, will also help others see its potential value.

Apply

You can show how a dataset can be used, by creating something with it. You might do a detailed analysis of the data, but some simpler contributions can also be helpful.

For example you might create a simple visualisation. Or write and publish some code that illustrates how the dataset can be accessed and used. You could publish a quick demo showing how the dataset can be imported and used in some frequently used tools and platforms.

At the moment everyone is a bit tired of charts and graphs. And I agree with the first principle in the visualisation design principles for the pandemic. But a helpful visualisation can do a range of things. Visualisation can be exploratory rather than explanatory.

A visualisation could support other people in understanding the shape of a dataset, to inform their analysis and interpretation of it. It can help identify outliers, gaps, or highlight some of the richness in the data. I’d recommend making it clear when you’re doing this type of visualisation, rather than trying to derive specific insights.

Suggest

Read the documentation. Download and explore the dataset. Ask questions. Give feedback.

Make suggestions to the publisher about changes they could make to publish the data better. Rather than just offer academic critique, be clear about how suggested changes will support your needs or that of your community.

Extend

The freedoms granted by an open licence allow you to enrich and improve a dataset.

Sometimes the smallest changes can have the most impact. Convert the data into other common or standard formats. Extract data from spreadsheets into CSV files. Convert data published in more complex formats or via APIs into simpler tabular data to make it more accessible to analysts rather than programmers.

Or maybe you can enrich a dataset by adding identifiers that will allow it to be linked to other sources. Do the work of merging with other datasets to bring in more context.

The downside here is that if the original data changes your extended version will get out of date. If you can’t commit to keeping your version up to date, then be sure to share your code and document your methods.

Allow others to repeat the steps you’ve taken. And don’t forget to suggest the improvements to the publisher.

Document

Write additional documentation to fill in gaps where the publisher has not provided sufficient background or explanation. Explain technical concepts or academic terms to a non-specialist audience.

As a user of the data, you’re able to write that documentation from a perspective that reflects the needs and questions of your specific community and the kinds of questions you need to ask. The original publisher might not have all that context or understand those needs, so this work can be really helpful.

Good documentation can be a finding aid. There are structured ways that you can go about writing documentation, such as this tool for writing civic data guides. (Check out some of the examples).

Evangelise

Email people that might have a need for the data. Tweet about it to a wider community. Highlight it in a presentation. Talk about it over coffee or, these days, Zoom.

Fix

If the dataset is collaboratively maintained then go ahead and fix errors and omissions. If you’re not confident about making a fix, then submit an error report. In addition to fixing errors you might be able to help verify that data is correct.

If a dataset isn’t collaboratively maintained then, when you find errors, be sure to flag them to the publisher and highlight the issue for others. Or consider publishing an enriched version with fixes applied.


This framework isn’t perfect. The name is a bit clunky for a start. But there’s a couple of things that I like about it.

Firstly, it recognises that not all contributions need to be technical. There’s room for others to use different skills and in different ways.

Secondly, the elements overlap and reinforce one another. Writing documentation and blogging about how you’ve used a dataset helps to evangelise it. Enriching a dataset can help demonstrate in a practical way how a publisher can improve how data is published.

Finally, it serves to highlight some important aspects of community curation which aren’t always well supported in existing data platforms and portals. We can do better here.

If you’re interested in working on adapting this further then I’m happy to chat! It might be useful to have a cheat sheet that supports its application to data and more examples of how to do these different elements well.

Posted at 08:05

May 22

Libby Miller: Cat detector with Tensorflow on a Raspberry Pi 3B+

Like this

Edit: code is now here. These are more recent instructions.

Download Raspbian Stretch with Desktop

Burn a card with Etcher.

(Assuming a Mac) Enable ssh

touch /Volumes/boot/ssh

Put a wifi password in

nano /Volumes/boot/wpa_supplicant.conf
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="foo"
  psk="bar"
}

Connect the Pi camera, attach a dial to GPIO pin 12 and ground, boot up the Pi, ssh in, then

sudo apt-get update
sudo apt-get upgrade
sudo raspi-config # and enable camera; reboot

install tensorflow

sudo apt install python3-dev python3-pip
sudo apt install libatlas-base-dev
pip3 install --user --upgrade tensorflow

Test it

python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

get imagenet

git clone https://github.com/tensorflow/models.git
cd ~/models/tutorials/image/imagenet
python3 classify_image.py

install openCV

pip3 install opencv-python
sudo apt-get install libjasper-dev
sudo apt-get install libqtgui4
sudo apt install libqt4-test
python3 -c 'import cv2; print(cv2.__version__)'

install the pieces for talking to the camera

cd ~/models/tutorials/image/imagenet
pip3 install imutils picamera
mkdir results

download the edited version of classify_image

curl -O https://gist.githubusercontent.com/libbymiller/d542d596566774a35752d134f80b1332/raw/471f066e4dc498501bab7731a07fa0c1926c1575/classify_image_dial.py

Run it, and point at a cat

python3 classify_image_dial.py

Posted at 08:06

April 24

Sandro Hawke: Elevator Pitch for the Semantic Web

SemanticWeb.com invited people to make video elevator pitches for the Semantic Web, focused on the question “What is the Semantic Web?”. I decided to give it a go.

I’d love to hear comments from folks who share my motivation, trying to solve this ‘every app is a walled garden’ problem.

In case you’re curious, here’s the script I’d written down, which turned out to be wayyyy too long for the elevators in my building, and also too long for me to remember.

Eric Franzon of SemanticWeb.Com invited people to send in an elevator pitch for the Semantic Web. Here’s mine, aimed at a non-technical audience. I’m Sandro Hawke, and I work for W3C at MIT, but this is entirely my own view.

The problem I’m trying to solve comes from the fact that if you want to do something online with other people, your software has to be compatible with theirs. In practice this usually means you all have to use the same software, and that’s a problem. If you want to share photos with a group, and you use facebook, they all have to use facebook. If you use flickr, they all have to use flickr.

It’s like this for nearly every kind of software out there.

The exceptions show what’s possible if we solve this problem. In a few cases, through years of hard work, people have been able to create standards which allow compatible software to be built. We see this with email and we see this with the web. Because of this, email and the Web are everywhere. They permeate our lives and now it’s hard to imagine modern life without them.

In other areas, though, we’re stuck, because we don’t have these standards, and we’re not likely to get them any time soon. So if you want to create, explore, play a game, or generally collaborate with a group of people on line, every person in the group has to use the same software you do. That’s a pain, and it seriously limits how much we can use these systems.

I see the answer in the Semantic Web. I believe the Semantic Web will provide the infrastructure to solve this problem. It’s not ready yet, but when it is, programs will be able to use the Semantic Web to automatically merge data with other programs, making them all — automatically — compatible.

If I were up to doing another take, I’d change the line about the Semantic Web not being much yet. And maybe add a little more detail about how I see it working. I suppose I’d go for this script:

Okay, elevator pitch for the Semantic Web.

What is the Semantic Web?

Well, right now, it’s a set of technologies that are seeing some adoption and can be useful in their own right, but what I want it to become is the way everyone shares their data, the way all software works together.

This is important because every program we use locks us into its own little silo, its own walled garden

For example, imagine I want to share photos with you. If I use facebook, you have to use facebook. If I use flickr, you have to use flickr. And if I want to share with a group, they all have to use the same system.

That’s a problem, and I think it’s one the Semantic Web can solve with a mixture of standards, downloadable data mappings, and existing Web technologies.

I’m Sandro Hawke, and I work for W3C at MIT. This has been entirely my own opinion.

(If only I could change the video as easily as that text. Alas, that’s part of the magic of movies.)

So, back to the subject at hand. Who is with me on this?

Posted at 17:06

Michael Hausenblas: Turning tabular data into entities

Two widely used data formats on the Web are CSV and JSON. In order to enable fine-grained access in a hypermedia-oriented fashion I’ve started to work on Tride, a mapping language that takes one or more CSV files as inputs and produces a set of (connected) JSON documents.

In the 2 min demo video I use two CSV files (people.csv and group.csv) as well as a mapping file (group-map.json) to produce a set of interconnected JSON documents.

So, the following mapping file:

{
 "input" : [
  { "name" : "people", "src" : "people.csv" },
  { "name" : "group", "src" : "group.csv" }
 ],
 "map" : {
  "people" : {
   "base" : "http://localhost:8000/people/",
   "output" : "../out/people/",
   "with" : { 
    "fname" : "people.first-name", 
    "lname" : "people.last-name",
    "member" : "link:people.group-id to:group.ID"
   }
  },
  "group" : {
   "base" : "http://localhost:8000/group/",
    "output" : "../out/group/",
    "with" : {
     "title" : "group.title",
     "homepage" : "group.homepage",
     "members" : "where:people.group-id=group.ID link:group.ID to:people.ID"
    }
   }
 }
}

… produces JSON documents representing groups. One concrete example output is shown below:
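(Roughly speaking, and with illustrative values rather than the actual demo output, a generated group document has this shape:)

{
 "title" : "Example group",
 "homepage" : "http://example.org/",
 "members" : [
  "http://localhost:8000/people/1",
  "http://localhost:8000/people/2"
 ]
}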

Posted at 17:06

John Goodwin: Using Machine Learning to write the Queen’s Christmas Message

In their excellent book “The Indisputable Existence of Santa Claus” Hannah Fry (aka @FryRsquared) and Thomas Oléron Evans (aka @Mathistopheles) talked about using Markov Chains to generate the Queen’s Christmas message. You can read a bit about that here. After reading that chapter I asked Hannah and Thomas if they had considered repeating this using recurrent neural networks. A couple of years ago Andrej Karpathy wrote a blog that he summarised as follows:

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

In his blog he posed the question:

It looks like we can learn to spell English words. But how about if there is more structure and style in the data?

and went on to train an rNN to write text in the style of  Shakespeare. I recommend you read his blog to see the examples and get an idea of what is going on.

You might also want to watch this video from the excellent Computerphile on a similar subject.

In a previous blog I decided to have a go at using rNNs to generate lyrics for my favourite band New Model Army. Thomas was kind enough to send me the text for the Queen’s previous speeches that he and Hannah had used in their work, and I thought I’d see how good rNNs were at generating a Royal Christmas message. I didn’t do anything particularly fancy here…just followed the code and examples in Karpathy’s original blog on a fairly standard home desktop machine.

Here are some of the results… (I take no responsibility for any text generated by the neural network :)).

One parameter that can be adjusted to generate the output is ‘temperature’. This takes a value between 0 and 1. Lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes. Running with a temperature of 0.5 we get:

i want to send a special message of the human spirit are much to be for everyone this is a happy and part of the christmas story so in the same streed. we are all truly new zealand in an example of the commonwealth has an equal place in the lives of men and women and what they have to show that they are all life in their own peoples of the commonwealth which we have come to the sacrifice of all the fortunes of hope and there are many heart of our grandchildren the presents the prince of wales and will have a preacher group of the commonwealth. on a strong and spreading the responsibility to the commonwealth who are much on the courage of the modern world and the commonwealth games and there are many others at home and the rest of the world have been inspired by the courage and enjoyment and the present peace of a commonwealth which is how to promote the games and from all walks of life and to become a growing peace on earth which we should me to see the strength to try to do so in the world today. it was a great conflict remain inspired by the courage and prince philip and i were able to see buckingham palace which the most special year for the commonwealth is to be movement and in a different furmestand and they are all source of spirit all over the world have shared a moment in the year and the christmas message is to be consed rather a quicter and the contribution of the family of nations and i have meen and wonderful in both these ordarisasions or a common sense and encouragement. this year i am speaking to you from the same family make the world in the world. the christmas message is so many of them and i have difference to the best of ourselves. and this year there are something to look at the world they are may be a life of all the world is the commonwealth is a sense of belonging to the world i was well as they are all sort of moment the more depression

I’m not sure this makes a lot of sense, and there are a few oddities/errors in the text.

I tried lowering the temperature to 0.3:

i hope that the christmas message is to be there and it is not only a very happy christmas to you all. i am speaking to you from the commonwealth and i have been able to see that the commonwealth and i have been deeply touched this year i am speaking to you from the same family friends and communities as the future of the commonwealth and as i have always given to many of you will have heard of the more conscious of the commonwealth and around us we should be able to see that the commonwealth and i have been deeply touched the strength to try to do the same family gathering in the commonwealth and i have all been about the future as well as a great comfort to us all the problems of the commonwealth and i have seen in their own way the problems of the commonwealth and i have been deeply touched the world the progress of the commonwealth and around us we are common throughout the commonwealth who are struck by the state of the commonwealth and i have depended to see in the world today that we should remember those who have seen in their own way a celebration of the child who was born at christmas time for families and friends will never be very different problems but it is not only a time for reflection and confidence to the commonwealth and i have all been about the world that we should not be our lives and to remind us of the future. i am speaking to you from the commonwealth and i have been deeply touched the world that we can all try to make a splendid birthday and the commonwealth and i have been floods and sadness and the best of the world have been able to discuss the best of ourselves. i believe that this christmas day i want to send a special message of hope in the face of hardship is nothing new that of the commonwealth who have seen in their lives in the commonwealth and in the commonwealth and i have been able to discuss the best of ourselves. we are all live together as a great daily and its war the commonwealth

As suggested, the result here is a bit more predictable/boring.

It is also possible to prime the model with some starting text. This starts the RNN off with some hardcoded characters to warm it up with some context before it starts generating text.
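
As a rough illustration of what the temperature and priming settings do (a generic sketch, not the code used for these experiments), sampling the next character from a model’s predicted scores might look like this:

import numpy as np

def sample_char(logits, temperature=0.5):
    # Scale the unnormalised scores: a low temperature sharpens the distribution
    # (more conservative text), a high temperature flattens it (more variety,
    # but more mistakes).
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)

# Priming: feed the characters of e.g. "each christmas" through the model first
# to build up its hidden state, then start sampling new characters from it.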

Using a temperature of 0.5 and prime text of ‘Each Christmas’ we get:

each christmas is a time for reflection and grandchildren the more depends of a dedication of the commonwealth and around us of the future of the commonwealth games in the rich and proud of life in a great religions and its members. but i am also speaking by instince in the right through the members of my family and i have been great and personal responsibility to our lives and they say to me that there are many happy or so many countries of the future with the prince of peace and happiness and self-respect for the commonwealth and is a happy and prosperous future of the world we will be achieved with the opportunity to help them are a contribution to the powerful ways of spirit and learning to see the problems of the commonwealth and i have seen in their inestivation the peoples of the commonwealth and i have been for the better but that is possible to the people and the christians it’s all those who are great practical form we have all been for them to be a chance to get the gift of a carren of different lives the service of their people throughout the world. we are all live in the rolin to our children and grandchildren the present generation to see this country and arrived the same for one another which is most popularity and the rest of the world and all the great commonwealth. pronomed news the responsibilities for the duty of events which we have all been the holidic of science. it is they all who are so there and shared heritage and sometimes in saint paul’s and much considerate and to communication the human spirit. and they can be a contralle commonwealth and a firm landmark in the course of the many servicemen and women who are broken for many that can be a precious given us to be a witness this continuing that we can all be the contribution of the commonwealth and as we all have all the features of the love of hope and goodwill. this year has constantly will be overcome. i believe that this year i am speaking to you from the hope of a determination and continues to all

Using a temperature of 0.5 and prime text of ‘This  past year’ we get:

this past year has been one to see that what i have been for the commonwealth and around us we are thinking of the problems show in the face of the fact that he wanted the strongly that we should show many of them service is the birth of the commonwealth. it is a contribution of a lives of family and friends for parents and religious difficulties at the commonwealth and as we all share the future as the future of the future. it was a moment and that the christmas message of the things that have been god shown in the earliest care and for the commonwealth can give the united kingdom and across the commonwealth. it is a helping one of the march of people who are so easy to go but the rest of the world have shaped for their determination and courage of what is right the life of life is word if we can do the state of the same time last month i was welcomed as you and that the opportunity to honour the most important of the thread which have provided a strong and family. even the commonwealth is a common bond that the old games and in the courage which the generations of the commonwealth and i have those to succeed without problems which have their course all that the world has been complete strangers which could have our response. i believe that the triditings of reconciliation but there is nothing in and happy or acts of the commonwealth and around us by science the right of all the members of the world have been difficult and the benefits of dreads and happiness and service to the commonwealth of the future. i wish you all a very happy christmas to you all.

So there you have it. Not sure we’ll be replacing the Queen with an AI anytime soon. Merry Christmas!

Posted at 17:06

April 17

Leigh Dodds: Why is change discovery important for open data?

Change discovery is the process of identifying changes to a resource. For example, that a document has been updated. Or, in the case of a dataset, whether some part of the data has been amended, e.g. to add data, fill in missing values, or correct existing data. If we can identify that changes have been made to a dataset, then we can update our locally cached copies, re-run analyses or generate new, enriched versions of the original.

Any developer who is building more than a disposable prototype will be looking for information about the ongoing stability and change frequency of a dataset. Typical questions might be:

  • How often will a dataset get routinely updated and republished?
  • What types of data updates are anticipated? E.g. are only new records added, or might data be amended and removed?
  • How will the dataset, or parts of it be version controlled?
  • How will changes to the dataset, or part of it (e.g. individual rows or objects) in the dataset be flagged?
  • How will planned and unplanned updates and changes be communicated to users of the dataset?
  • How will data updates be published, e.g. will there be a means of monitoring for or accepting incremental updates, or just refreshed data downloads?
  • Are large scale changes to the data model expected, and if so over what timescale?
  • Are changes to the technical infrastructure planned, and if so over what timescale?
  • How will planned (and unplanned) service downtime, e.g. for upgrades, be notified and reported?

These questions span a range of levels: from changes to individual elements of a dataset, through to the system by which it is delivered. These changes will happen at different frequencies and will be communicated in different ways.

Some types of change discovery can be done after the fact, e.g. by comparing two versions of a dataset. But in practice this is an inefficient way to synchronize and share data, as the consumer needs to reconstruct a series of edits and changes that have already been applied by the publisher of the data. To efficiently publish and distribute data we need to be able to understand when changes have happened.

Some types of changes, e.g. to data models and formats, will just break downstream systems if not properly advertised in advance. So it’s even more important to consider the impacts of these types of change.

A robust data infrastructure will include an appropriate change notification system for different levels of the system. Some of these will be automated. Some will be part of the process of supporting end users. For example:

  • changes to a row in a dataset might be flagged with a timestamp and a change notice
  • API responses might indicate the version of the object being retrieved
  • dataset metadata might include an indication of the planned frequency of publication and a timestamp for when the dataset was last modified
  • a data portal might include a calendar indicating when key datasets will be updated or a feed of recently updated or changed datasets
  • changes to the data model and the API used to deliver a dataset might be announced and discussed via a developer support forum

These might be implemented as technical features of the platform. But they might also be as simple as an email to users, or a public tweet.

Versioning of data can also help data publishers improve the scalability of their infrastructure and reduce the costs of data publishing. For example, adding features to data portals that might let data users:

  • make API calls that will only return responses if data has been updated since the user last requested it, e.g. using HTTP Conditional GET (see the sketch after this list). This can reduce bandwidth and load on the publisher by encouraging local caching of data
  • use a checksum and/or timestamps to detect whether bulk downloads have changed to reduce bandwidth
  • subscribe to machine-readable feeds of dataset level changes, to avoid the need for users to repeatedly re-download large datasets
  • subscribe to machine-readable feeds of new datasets, to facilitate mirroring of data across systems
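
As a sketch of the first of those ideas, a client could use an HTTP conditional GET to avoid re-downloading an unchanged dataset. The URL below is illustrative, and this assumes the server returns ETag and/or Last-Modified headers:

import requests

url = "https://example.org/datasets/my-dataset.csv"  # illustrative URL

# First fetch: keep the validators the server returns alongside the data.
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later fetch: ask for the data only if it has changed since then.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

update = requests.get(url, headers=headers)
if update.status_code == 304:
    print("Not modified - keep using the cached copy")
else:
    print("Changed - refresh the cached copy")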

Supporting change notification and discovery, even if it’s just through documentation rather than more automated means, is an important part of engineering any good data platform.

I think it’s particularly important for open data (and other data that is liberally licensed) because these datasets are frequently copied, distributed and republished across different platforms. The ability to distribute a dataset, in different formats or with improvements and corrections, is one of the key freedoms that an open licence provides.

The downside to secondary publishing is that we end up with multiple copies of a dataset, some or all of which might be out of date, or have diverged from the original at different points in time.

Without robust approaches to provenance, change control and discovery, we run the risk of that data becoming out of date and leading to poor analyses and decision making. Multiple copies of the same dataset, while increasing ease of use, also increase friction by requiring users to find the original, authoritative data among all the copies, or to figure out whether the copy available in their preferred platform is completely up to date with the original.

Documentation and linking to original sources can help mitigate those problems. But automating change notifications, to allow copies of datasets to be easily synchronised between platforms at the point they are updated, is also important. I’ve not seen a lot of recent work on documenting these as best practices. I think there are still some gaps in the standards landscape around data platforms. So I’d be interested to hear of examples.

In the meantime, if you’re building a data platform, think about how you can enable users to more efficiently and automatically consume updated data.

And if you’re republishing primary data in other platforms, make sure you’re including detailed information and documentation about how and when you last refreshed the dataset. Ideally your copies will update automatically as the source changes. Linking to the open source code you ran to make the secondary copy will allow others to repeat that process if they need an updated version faster than you plan to produce one.

Posted at 16:05

Leigh Dodds: How can publishing more data decrease the value of existing data?

Last month I wrote a post looking at how publishing new data might increase the value of existing data. I ended up listing seven different ways including things like improving validation, increasing coverage, supporting the ability to link together datasets, etc.

But that post only looked at half of the issue. What about the opposite? Are there ways in which publishing new data might reduce the value of data that’s already available?

The short answer is: yes there are. But before jumping into that, let’s take a moment to reflect on the language we’re using.

A note on language

The original post was prompted by an economic framing of the value of data. I was exploring how the option value for a dataset might be affected by increasing access to other data. While this post is primarily looking at how option value might be reduced, we need to acknowledge that “value” isn’t the only way to frame this type of question.

We might also ask, “how might increasing access to data increase potential for harms?” As part of a wider debate around the issues of increasing access to data, we need to use more than just economic language. There’s a wealth of good writing about the impacts of data on privacy and society which I’m not going to attempt to precis here.

It’s also important to highlight that “increasing value” and “decreasing value” are relative terms.

Increasing the value of existing datasets will not seem like a positive outcome if your goal is to attempt to capture as much value as possible, rather than benefit a broader ecosystem. Similarly, decreasing value of existing data, e.g. through obfuscation, might be seen as a positive outcome if it results in better privacy or increased personal safety.

Decreasing value of existing data

Having acknowledged that, let’s try and answer the earlier question. In what ways can publishing new data reduce the value we can derive from existing data?

Increased harms leading to retraction and reduced trust

Publishing new data always runs the risk of re-identification and the enabling of unintended inferences. While the impacts of these harms are likely to be most directly felt by both communities and individuals, there are also broader commercial and national security issues. Together, these issues might ultimately reduce the value of the existing data ecosystem in several ways:

  • Existing datasets may need to be retracted, have their scope changed, or have their circulation reduced in order to avoid further harm. Data privacy impact assessments will need to be updated as the contexts in which data is being shared and published change
  • Increased concerns over potential privacy impacts might lead organisations to choose not to increase access to similar or related datasets
  • Increased concerns might also lead communities and individuals to reduce the amount of data they are willing to share with previously trusted sources

Overall, this can lead to a reduction in the coverage, quality and linking of data across a data ecosystem. It’s likely to be one of the most significant impacts of poorly considered data releases. It can be mitigated through proper impact assessments, consultation and engagement.

Reducing overall quality

Newly published data might be intended to increase coverage, enrich, link, validate or otherwise improve existing data. But it might actually have the opposite effect because it’s of poor quality. I’ve briefly touched on this in a previous post on fictional data.

Publication of poor quality data might be unintended. For example an organisation may just be publishing the data it has to help address an issue, without properly considering or addressing underlying problems with it. Or a researcher may publish data that contains honest mistakes.

But publication of poor quality data might also be deliberate. For example, as spam or misinformation intended to “poison the well”.

More subtly, practices like p-hacking and falsification of data which might be intended to have a short-term direct benefit to the publisher or author, might have longer term issues by impacting the use of other datasets.

This is why understanding and documenting the provenance of data, monitoring of retractions, fixes and updates to data, and the ability to link analyses with datasets are all so important.

Creating unnecessary competition or increasing friction

Publishing new datasets containing new observations and data about an area or topic of interest can lead to positive impacts, e.g. by increasing confidence or coverage. But datasets are also competing with one another. The same types of data might be available from different sources, but under different licences, access arrangements, pricing, etc.

This competition isn’t necessarily positive. For example, the data ecosystem might not benefit as much from the network effects that follow from linking data because key datasets are not linked or cannot be used together. Incompatible and competing datasets can add friction across an ecosystem.

Building poor foundations

Data is often published as a means of building stronger data infrastructure for a sector, or to address a specific challenge. But if that data is poorly maintained or is not sustainably funded, then the energy that goes into building the communities, tools and other datasets around that infrastructure might be wasted.

That reduces the value of existing datasets which might otherwise have provided a better foundation to build upon. Or whose quality is dependent on the shared infrastructure. While this issue is similar to that of the previous one about competition, its root causes and impacts are slightly different.

As I noted in my earlier post, I don’t think this is an exhaustive list and it can be improved by contributions. Leave a comment if you have any thoughts.

Posted at 12:05

April 15

Leigh Dodds: Exploring registration agencies as data institutions

A key focus for our research and delivery work at the ODI at the moment is exploring how to design sustainable and trustworthy data institutions. Data institutions are organisations that steward data on behalf of a community. They have a variety of legal forms, roles and purposes.

Yesterday I wrote (again!) about identifiers and specifically, how different communities have been designing and using identifier systems within their business and data ecosystems. In that post I provided an outline of centralised and federated models for assigning identifiers. Both of those models rely on organisations that are known as registration agencies, registration authorities or registrars.

In this post, I’m going to briefly explore the role of registration agencies as a specific form of data institution.

What problem are registration agencies solving?

Organisations working within the same sector, whether they are publishing books, shipping cargo, manufacturing cars or streaming media, need to be able to consistently identify things. Which book has been sold? Where did this cargo container come from? When was this car manufactured? Which artist produced this song?

Whether a group of organisations are competing with one another, providing services or funding to each other, or collaborating as part of a supply chain, they need to be able to refer to the physical and digital objects, people, places and things that are core to their businesses.

Consistent, unique identifiers are one of the building blocks of data infrastructure. As I described in my previous blog post, there are different ways to create identifiers, but a common pattern is to use a registration agency as a central point of coordination.

Registration agencies fulfill the role of an independent, cross-industry organisation responsible for assigning and managing identifiers for those things of shared interest.

What data does a registration agency steward?

The core role of a registration agency is to govern the identifier scheme. That will involve deciding on details such as the syntax and rules for constructing identifiers, how they are assigned and by whom. It will also manage how the scheme evolves over time in order to support the changing needs of its community. Identifier schemes are standards for data and need to be maintained over the long term.

Registration agencies might directly create and assign identifiers at the request of its community. Or it might delegate that activity to other organisations. Depending on the specifics of the identifier scheme, the agency may only manage a small amount of data.

For example, the IFPI is the Registration Agency for the ISRC identifier used in the music industry since 1986. As an organisation, to create an ISRC for music you are publishing, you first apply for a registration code (a prefix used in the identifiers) from a national agency. You can then locally assign identifiers to your recordings. There is no requirement to register the individual codes with either IFPI or the national agency. There isn’t a central database of the identifiers. So for a long time the IFPI will likely only have had a small database listing the prefixes that had been assigned to specific organisations.

Other registration agencies capture more information about the things that are being identified. Organisations requesting an identifier either provide that data at the point of assignment or later deposit it with the agency. This seems to me to be a more common setup: having a central database supports a variety of additional use cases. For example, it can help answer some of the questions I posed above, e.g. when was this car manufactured?

In 2016, IFPI worked with a vendor called SoundExchange to launch a search engine and database, although this is not a complete source of all the data. This presumably addressed needs not covered by the existing system.

So, the data stewarded by a registration agency may vary. It may range from basic administrative information about the identifier scheme to a much broader set of data deemed to be useful to the community. Registration agencies may be key data intermediaries in their sector and so fulfill a wider purpose. This is why there is often commercial interest in, and competing projects for, creating identifier schemes for specific industries: there is a lot of potential value to be captured.

How are they setup, and how do they approach sustainability?

In practice any community could work together to set up a common identifier scheme and an organisation to manage it. It just needs a shared understanding of the value of common identifiers and/or a common registry. For example, ZooBank and the LSID in the biosciences. Or the role of the IEEE in managing identifiers in the electronics industry.

Existing data intermediaries may branch out into launching identifier schemes to support aggregation and distribution of other data. For example, Refinitiv’s PermId.

Governments also often set up registers and organisations to steward them. For example, Companies House in the UK. Registers frequently address a different set of needs, but assigning identifiers is frequently part of the task of maintaining a register.

Governments can create registers and registration agencies whenever they see fit. As can commercial organisations and community initiatives, given sufficient agreement, funding and resources.

A fourth approach to starting a registration agency is via ISO. Some identifier schemes end up being published as international standards. According to ISO policy, if a new standard identifier is going to require a registration process, then ISO will appoint an organisation as the official registration authority for that standard. This creates a monopoly situation, so there is a process of review of the proposed approach, the agency and its approach to sustainability.

ISO publish a list of registration agencies for ISO standards. It includes IFPI as the agency for the ISRC standard.

Registration agencies can charge fees for providing the registration services. But ISO requires those to be done on a cost recovery basis only. Approval for the charging of fees requires an additional level of review within ISO. But an agency might provide other supporting services.

Looking across some of the ISO appointed authorities, many appear to charge fees for registration both at the point of assignment of an identifier and on an annual basis. Many also seem to offer additional services and/or operate on a membership basis.

Different approaches to governance

From my reading so far, it seems that registration agencies supporting identifier schemes that are part of the public sector, commercial or community initiatives tend to be more centralised.

Looking across the ISO nominated registration agencies, these tend to use a federated assignment approach, similar to the IFPI, where much of the work is delegated to national agencies, with the primary agency acting as the custodian of the overall scheme and a point of coordination. The primary registration agency might also be a fallback for circumstances where a national agency hasn’t been appointed.

This country based approach makes sense for international standards: national agencies can work more closely with their communities.

Another example of this approach is the International Standard Name Identifier (ISNI), which is governed by the ISNI International Agency, which appears to have been set up specifically for this purpose. Its work is delegated to a long list of specific assignment agencies, one of which is the British Library. As it happens, the British Library fulfills a similar role for a number of identifier schemes. This suggests that long-term sustainability for the identifier scheme and the primary registration agency is related to the sustainability of a broader set of organisations, which might be acting as a national registration agency only as part of their operations.

One slightly different approach to governance is that of the DOI Foundation, which is the ISO appointed registration agency for DOI identifiers. DOIs can be assigned to a very broad category of different things and so, while the Foundation does delegate to other agencies, these aren’t along national lines. Instead there are different DOI registration agencies for different communities and purposes.

One example is CrossRef, which works in the publishing industry; another is EIDR, which operates in the entertainment industry. Both are covered by common rules published by the DOI Foundation which outline acceptable business models, roles and responsibilities.

While the individual agencies run their own technical platforms, the DOI Foundation also provides some common technical infrastructure to support its registration agencies and enable long-term persistence of the identifiers. This common infrastructure was moved to a separate not-for-profit in 2014, apparently as a means to increase trust.

Posted at 12:05

April 14

Leigh Dodds: How do different communities create unique identifiers?

Identifiers are part of data infrastructure. They play an important role, helping to publish, structure and link together data. Identifiers are boundary objects that cross communities. That means they need to be well-documented in order to be most useful.

Understanding how identifiers are created, assigned and governed can help us think through how to strengthen our data infrastructure. With that in mind, let’s take a quick tour of how different communities and systems have created identifier systems to help to uniquely refer to different digital and physical objects.

The simplest way to generate identifiers is by a serial number: a steadily increasing number that is assigned to whatever you need to identify next. This is the approach used in most internal databases as well as in some commonly encountered public identifiers.

For example the Ordnance Survey TOID identifier is a serial number that looks like this: osgb1000006032892. UPRNs are similar.

Serial numbers work well when you have a single organisation and/or system generating the identifiers. They’re simple to implement, but can have their downsides, especially when they’re shared with others.

Some serial numbering systems include built in error-checking to deal with copying errors, using a check digit. Examples include the CAS registry number for identifying chemicals, and the basic form of the ISSN for identifying academic journals.
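
As a sketch of how a modulus-11 check digit like the ISSN’s works (a simplified illustration, not a full implementation of the standard):

def issn_check_digit(first_seven_digits: str) -> str:
    # Weight the seven digits 8 down to 2, sum them, and pick the check digit
    # that brings the total to a multiple of 11 ('X' stands for a value of 10).
    total = sum(int(d) * w for d, w in zip(first_seven_digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(issn_check_digit("0028083"))  # prints 6, matching the ISSN 0028-0836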

[Image: bar code form of an ISSN]

As we can see in the bar code form of the ISSN shown above, identifiers often have more structure to them. And they may not be assigned as a simple serial number.

The second way of providing unique identifiers is using a name or code. These are typically still assigned by a central authority, sometimes known as a registration agency, but they are constructed in different ways.

Identifiers for geographic locations typically rely on administrative regions or other areas to help structure identifiers. For example the statistics community in the EU created the NUTS codes to help identify country sub-divisions in statistical datasets. These are assigned based on hierarchy beginning with the country and then smaller geographic regions. Bath is UKK12 for example.

[Image: example NUTS codes]

Postal codes are another geographically based set of codes. Both the UK and US postal codes use a geographical hierarchy. Only here the regions are those meaningful to how the Royal Mail and USPS manage their delivery operations, rather than being administratively defined by the government.

[Image: example UK and US postal codes]

Hierarchies that are based on geography and/or organisational structures are common patterns in identifiers. Existing hierarchies provide a handy way to partition up sets of things for identification purposes.

The SWIFT code used in banking has a mixture of organisational and geographic hierarchies.

[Image: structure of a SWIFT code]

Encoding information about geography and hierarchy within codes can be useful. It can make them easier to validate. It also means you can manipulate them, e.g. by truncation, to find the identifiers for broader regions.
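
For example, truncating the NUTS code for Bath mentioned above recovers codes for progressively broader regions:

code = "UKK12"    # NUTS 3 level
print(code[:4])   # UKK1 - broader NUTS 2 region
print(code[:3])   # UKK  - broader NUTS 1 region
print(code[:2])   # UK   - country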

But encoding lots of information in identifiers also has its downsides. The main one being dealing with changes to administrative areas that mean the hierarchy has changed. Do you reassign all the identifiers?

Assigning identifiers from a single, central authority isn’t always ideal. It can add coordination overhead which can be particularly problematic if you need to assign lots of identifiers quickly. So some identifier systems look at reducing the burden on that central authority.

A solution to this is to delegate identifier assignment to other organisations. There are two ways this is done in practice.

The first is what we might call federated assignment. This is where the registration agency shares the work of assigning identifiers with other organisations. A typical approach is to delegate the work of registration and assignment to national organisations. Although other approaches are possible.

The delegation of work might be handled entirely “behind the scenes” as an operational approach. But sometimes it ends up being a feature of the identifier system.

For example the Legal Entity Identifier (LEI) uses federated assignment, where “Local Operating Units” do the work of assigning identifiers. As you can see below, the identifiers for the LOUs become part of the identifiers they assign.

[Image: structure of an LEI]

The International Standard Recording Code uses a similar approach with national agencies assigning identifiers.

[Image: structure of an ISRC]

Another approach to reducing dependence on, and coordination with, a single registration agency is to use what I’ll call “local assignment”. In this approach individual organisations are empowered to assign identifiers as they need them.

A simplistic approach to local assignment is “block allocation”: handing out blocks of pregenerated identifiers to organisations which can locally assign them. Blocks of IP addresses are handed out to Internet Service Providers. Similarly, blocks of UPRNs are handed out to local authorities.

Here the registration agency still generates the identifiers, but the assignment of identifier to “thing” is done locally. And, in the second case at least, a record of this assignment will still be shared with the agency.

A more common approach is to use “prefix allocation”. In this approach the registration agency assigns individual organisations a prefix within the identifier system. The organisation then generates new unique identifiers by combining their prefix with a locally generated suffix.

A suffix might be generated by adding a local serial number to the prefix. Or by some other approach. Again, after generating and assigning an identifier they are commonly still centrally registered.
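
A minimal sketch of prefix allocation (the prefix format, separator and zero-padded serial suffix are invented for illustration, not taken from any particular scheme):

class PrefixAllocator:
    def __init__(self, prefix: str):
        self.prefix = prefix  # assigned once by the registration agency
        self.counter = 0      # suffix managed locally by the organisation

    def new_identifier(self) -> str:
        # Combine the agency-assigned prefix with a locally generated serial.
        self.counter += 1
        return f"{self.prefix}-{self.counter:06d}"

org = PrefixAllocator("ACME42")  # hypothetical prefix
print(org.new_identifier())      # ACME42-000001
print(org.new_identifier())      # ACME42-000002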

Many identifiers use this approach. The EIDR identifiers used in the entertainment industry look like this:

[Image: example EIDR identifier]

A GTIN looks like this:

[Image: example GTIN]

And the BIC code for shipping containers looks like this:

[Image: example BIC container code]

One challenge with prefix allocation is ensuring that the rules for locally assigned suffixes work in every context where the identifier needs to appear. This typically means providing some rules about how suffixes are constructed.

The DOI system encountered problems because publishers were generating identifiers that didn’t work well when DOIs were expressed as URLs, due to the need for extra encoding. This made them tricky to work with.

For a complicated example that mixes use of prefixes, country codes and check digits, then we can look at the VIN, which is a unique identifier for vehicles. This 17 digit code includes multiple segments but there are four competing standards for what the segments mean. Sigh.

[Image: structure of a VIN]

It’s possible to go further than just reducing dependency on registration agencies. They can be eliminated completely.

In distributed assignment of identifiers, anyone can create an identifier. Rather than requesting an identifier, or a prefix from a registration agency, these systems operate by agreeing rules for how unique identifiers can be constructed.

One approach to distributed assignment is to use an element of randomness to generate a unique identifier at the point in time it’s needed. The goal is to design an algorithm that uses a random number generator, and sometimes additional information like a timestamp or a MAC address, to construct an identifier where there is an extremely low chance that someone could have created the same identifier at the same moment in time. (Known as a “collision”).

This is how UUIDs work. You can play with generating some using online tools.
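
For example, Python’s standard library can generate both kinds directly:

import uuid

# Version 4: built purely from random numbers.
print(uuid.uuid4())

# Version 1: mixes a timestamp with the machine's MAC address.
print(uuid.uuid1())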

Identifiers like UUIDs are cheap to generate and require no coordination beyond an agreed algorithm. They work very well when you just need a reliable way to assign an identifier to something with reasonable confidence that if our data is later combined then we won’t encounter any issues.

But what if we need to independently assign an identifier to the same thing? So that when we later combine our datasets, then our data will link up?

For this we need to use a hash-based identifier. A hash-based identifier takes some properties of the thing we want to identify and then uses those to construct an identifier. If we have a good enough algorithm then, even if we do this independently, we should end up constructing the same identifier.

This is sometimes referred to as creating a “digital fingerprint” of the object. It’s commonly used to identify copies of objects. For example, the approach is used to construct content identifiers in the IPFS system. And as part of YouTube’s Content ID system to manage copyright claims.
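
A minimal sketch of the idea (the choice of properties, the canonicalisation step and the hash function are all illustrative):

import hashlib
import json

def fingerprint(properties: dict) -> str:
    # Canonicalise the properties so that two parties describing the same thing
    # in the same way derive exactly the same identifier.
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(fingerprint({"title": "An Example Paper", "year": 2001, "volume": 12}))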

But hash-based identifiers don’t have to be used for managing content; they can be used as pure identifiers. The most complex example I’m familiar with is the InChI, which is a means of generating a unique identifier for chemicals by using information about their structure.

[Image: example InChI]

By using a consistent algorithm provided as open source software, chemists can reliably create identifiers for the same structures.

The SICI code used to identify academic papers was a hash-based system that used metadata about the publication to generate an identifier. However in practice it was difficult to work with due to the variety of ways in which content was actually published and the variety of contexts in which identifiers needed to be generated.

Hash-based identifiers are very tricky to get right as you need a robust algorithm, that is widely adopted. Those needing to generate identifiers will also need to be able to reliably access all of the information required to create the identifier. Variations in availability of metadata, object formats, etc can all impact how well they work in practice.

Posted at 14:05

Copyright of the postings is owned by the original blog authors. Contact us.