It's triples all the way down
We are happy to invite you to join the 13th edition of the DBpedia Community Meeting, which will be held in Leipzig. Following the LDK conference, May 20-22, the DBpedia Community will get together on May 23rd, 2019 at Mediencampus Villa Ida. Once again the meeting will be accompanied by a varied program of exciting lectures and showcases.
Highlights/ Sessions
Call for Contribution
What cool things do you do with DBpedia? Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.
Tickets
Attending the DBpedia Community meeting costs 40 €. You need to buy a ticket via eshop.sachsen.de. DBpedia members get free admission; please contact your nearest DBpedia chapter for a promotion code, or contact the DBpedia Association.
If you would like to attend the LDK conference, please register here.
We are looking forward to meeting you in Leipzig!
Posted at 11:22
[Image: What a traditional research article looks like. Nice layout, hard to reuse the knowledge from. Image: CC BY-SA 4.0.]
Posted at 16:11
The American Geophysical Union (AGU) Fall Meeting 2018 was the first conference I attended of such magnitude in every aspect – attendees, arrangements, content and information. It is an overwhelming experience for a first-timer, but totally worth it. The sheer amount of knowledge and information on offer is the biggest takeaway; how much of it you absorb depends on your own abilities, but getting the most out of it is what everyone aims for.
There were 5 to 6 types of events held throughout the day for all 5 days. The ones that stood out for me were the poster sessions, e-lightning talks, oral sessions and the centennial plenary sessions.
The poster sessions helped me see at a glance the research going on in various fields all over the world. No matter how much I tried, I found it hard to cover all the sections in the poster hall that piqued my interest. The e-lightning talks were a good way to strike up a conversation on the topic of the talks and get a discussion going among all the attendees. Being a group-discussion format, I felt there was more interaction compared to the other venues. The oral sessions were a great place to learn how people are exploring their areas of interest and the various methods and approaches they are using. However, I felt it is hard for the presenter to cover everything that is important and relevant in the given time span. The time constraints are there for a very valid reason, but they might lead to someone losing out on leads if the audience doesn't fully get the concept. Not all presenters were up to the mark. I could feel a stark difference between the TWC presenters (who knew how to get all the right points across) and the rest of the presenters. The centennial plenary sessions were special this year, as AGU is celebrating its centennial. These sessions highlighted the best of research practices, innovations, achievements and studies. The time slots for these sessions were very short, but the work spoke for itself.
The Exhibit Hall had all the companies and organisations that are in the field or related to it. Google, NASA and AGU had sessions, talks and events running here as well. While Google and NASA focussed on showcasing the 'Geo-' aspect of their work, AGU also focussed on the data aspect, which was refreshing. They had sessions about data from the domain scientists' point of view. This comes across as fundamental or elementary knowledge to us at TWC, but the way they are trying to enable domain scientists to communicate better with data scientists is commendable. AGU is also working on an initiative called "Make data FAIR (Findable, Accessible, Interoperable, Reusable) again", which once again tries to spread awareness amongst the domain scientists. The exhibit hall is also a nice place to interact with industry, universities and organisations that have research programs for doctoral students and postdocs.
In retrospect, I think planning REALLY far ahead of time is a good idea, so that you know what to skip and what not to miss. A list of 'must attend' sessions would have helped with the decision-making process. A group discussion at one of our meetings before AGU, where everyone shares what they find important, could be a good idea. Being just an audience member is great and one gets to learn a lot, but contributing to this event would be even better. This event was amazing and has given me a good idea of how to prepare the next time I attend.
Posted at 14:49
If this is your first American Geophysical Union (AGU) conference, be ready! Below are a few pointers for future first-timers.
The conference I attended was hosted in Washington, D.C. at the Walter E. Washington Convention Center during the week of December 10th, 2018. It brought together over 25,000 people. Until this conference, I had not experienced the pleasure and the power of so many like-minds in one space. The experience, while exhausting, was exhilarating!
One of the top universal concerns at the AGU Conference is scheduling. You should know that I was not naïve to the opportunities and scheduling difficulties prior to 2018, my first year of attendance. I had spent the last several months organizing an application development team that successfully created a faceted browsing app with calendaring for this particular conference using live data. Believe me when I say, “Schedule before you go”. Engage domain scientists and past participants about sessions, presentations, events, and posters that are a must-see. There is so much to learn at the conference. Do not miss the important stuff. The possibilities are endless, and you will need the expertise of those prior attendees. Plan breaks for yourself. Use those breaks to wander the poster hall, exhibit hall, or the vendor displays.
Finally, take some time to see the city that holds the conference. There are many experiences to be had that will add to your education.
So. Many. Sessions!
There are e-lightning talks. There are oral sessions. There are poster sessions. There are town hall sessions. There are scientific workshops. There are tutorial talks. There are keynotes. Wow!
The e-lightning talks are exciting. There are lots of opportunities to interact in this presentation mode. The e-lightning talks are held in the Poster Hall. A small section provides chairs for about 15–20 attendees, with plenty of standing-room-only space. This informal session leads to great discussion amongst attendees. Be sure to put one of these in your schedule!
Oral sessions are what you would expect: people working on the topic, sitting in chairs at the front of the room, each giving a brief talk, then, time permitting, a Q&A session at the end. Remember, these panels are filled with knowledge. For the oral sessions that you schedule to attend, read the papers prior to attending. More importantly, have some questions prepared.
//Steps onto soapbox//
//Steps down from soapbox//
The poster sessions are a great way to unwind by getting in some walking. There are e-posters, which are presented on screens provided by AGU or the venue. There are the usual posters as well. The highlights of attending a poster session, besides the opportunity to stretch your legs, include the opportunity to practice meeting new people, asking in-depth questions on topics of interest, talking to the people doing the research, and checking out the data being used for the research. You will want to have a notepad with you for the poster sessions. Don't just take notes; take business cards! Remember, what makes poster sessions special is that they are an example of the latest research that has not yet become a published paper. The person doing the research is quite likely the presenter of the poster.
All those special sessions – the town halls, the scientific workshops, the tutorial talks, and keynotes – these are the ones to ask prior attendees, past participants, and experts about: which ones are the must-sees? Get them in your schedule. Pay attention. Take notes. Read the papers behind the sessions; if not the papers, then at minimum the abstracts. Have your questions ready before you go!
This is really important. Do NOT arrive without your time at this conference well planned. To do that you are going to need to spend several weeks preparing: reading papers, studying schedules, writing questions, and more. In order to have a really successful, time-well-spent kind of experience, you are going to need to begin preparing for this immense conference by November 1st.
Put an hour per day in your calendar, from November 1st until AGU Conference Week, to study and prepare for this conference. I promise you will not regret the time you spent preparing.
The biggest thing to remember and the one thing that all attendees must do is:
Nature International Journal of Science. (2018, October 17). Why fewer women than men ask questions at conferences. Retrieved from Nature International Journal of Science Career Brief: https://www.nature.com/articles/d41586-018-07049-x
Posted at 18:47
This week I was at the third pidapalooza conference in Dublin. It's a conference dedicated to open identifiers: how to create them, steward them, drive adoption and promote their benefits.
Anyone who has spent any time reading this blog or following me on twitter will know that this is a topic close to my heart. Open identifiers are infrastructure.
I’ve separately written up the talk I gave on documenting identifiers to help drive adoption and spur the creation of additional services. I had lots of great nerdy discussions around URIs, identifier schemes, compact URIs, standards development and open data. But I wanted to briefly capture and share a few general impressions.
Firstly, while the conference topic is very much my thing, and the attendees were very much my people (including a number of ex-colleagues and collaborators), I was approaching the event from a very different perspective to the majority of other attendees.
Pidapalooza as a conference has been created by organisations from the scholarly publishing, research and archiving communities. Identifiers are a key part of how the integrity of the scholarly record is maintained over the long term. They’re essential to support archiving and access to a variety of research outputs, with data being a key growth area. Open access and open data were very much in evidence.
But I think I was one of only a few attendees (perhaps the only one?) from what I'll call the "broader" open data community. That wasn't a complete surprise, but I think the conference as a whole could benefit from a wider audience and set of participants.
If you’re working in and around open data, I’d encourage you to go to pidapalooza, submit some talk ideas and consider sponsoring. I think that would be beneficial for several reasons.
Firstly, in the pidapalooza community, the idea of data infrastructure is just a given. It was refreshing to be around a group of people who were already past the idea of thinking of data as infrastructure and were instead focusing on how to build, govern and drive adoption of that infrastructure. There are a lot of lessons there that are more generally applicable.
For example I went to a fascinating talk about how EIDR, an identifier for movie and television assets, had helped to drive digital transformation in that sector. Persistent identifiers are critical to digital supply chains (Netflix, streaming services, etc). There are lessons here for other sectors around benefits of wider sharing of data.
I also attended a great talk by the Australian Research Data Commons that reviewed the ways in which they were engaging with their community to drive adoption and best practices for their data infrastructure. They have a programme of policy change, awareness raising, skills development, community building and culture change which could easily be replicated in other areas. It paralleled some of the activities that the Open Data Institute has carried out around its sector programmes like OpenActive.
The need for transparent governance and long-term sustainability were also frequent topics, as was the recognition that data infrastructure takes time to build. Technology is easy; it's growing a community and building consensus around an approach that takes time.
(btw, I’d love to spend some time capturing some of the lessons learned by the research and publishing community, perhaps as a new entry to the series of data infrastructure papers that the ODI has previously written. If you’d like to collaborate with or sponsor the ODI to explore that, then drop me a line?)
Secondly, the pidapalooza community seems to have generally accepted (with a few exceptions) the importance of web identifiers and open licensing of reference data. But that practice is still not widely adopted in other domains. Few of the identifiers I encounter in open government data, for example, are well documented, openly licensed or supported by a range of APIs and services.
Finally, much of the focus of pidapalooza was on identifying research outputs and related objects: papers, conferences, organisations, datasets, researchers, etc. I didn’t see many discussions around the potential benefits and consequences of use of identifiers in research datasets. Again, this focus follows from the community around the conference.
But as the research, data science and machine-learning communities begin exploring new approaches to increase access to data, it will be increasingly important to explore the use of standard identifiers in that context. Identifiers have a clear role in helping to integrate data from different sources, but there are wider risks around data privacy, and ethical considerations around identification of individuals, for example, that will need to be addressed.
I think we should be building a wider community of practice around use of identifiers in different contexts, and I think pidapalooza could become a great venue to do that.
Posted at 12:33
This is a rough transcript of a talk I recently gave at a session at Pidapalooza 2019. You can view the slides from the talk here. I'm sharing my notes for the talk here, with a bit of light editing. I'd also really welcome your thoughts and feedback on this discussion document.
At the Open Data Institute we think of data as infrastructure. Something that must be invested in and maintained so that we can maximise the value we get from data. For research, to inform policy and for a wide variety of social and economic benefits.
Identifiers, registers and open standards are some of the key building blocks of data infrastructure. We’ve done a lot of work to explore how to build strong, open foundations for our data infrastructure.
A couple of years ago we published a white paper highlighting the importance of openly licensed identifiers in creating open ecosystems around data. We used that to introduce some case studies from different sectors and to explore some of the characteristics of good identifier systems.
We've also explored ways to manage and publish registers. "Register" isn't a word that I've encountered much in this community, but it's frequently used to describe a whole set of government data assets.
Registers are reference datasets that provide unique and/or persistent identifiers for things, and data about those things. The datasets of metadata that describe ORCIDs and DOIs are registers. So are lists of doctors, countries and locations where you can get your car taxed. We've explored different models for stewarding registers and ways to build trust around how they are created and maintained.
In the work I’ve done and the conversations I’ve been involved with around identifiers, I think we tend to focus on two things.
The first is persistence. We need identifiers to be persistent in order to be able to rely on them enough to build them into our systems and processes. I’ve seen lots of discussion about the technical and organisational foundations necessary to ensure identifiers are persistent.
There’s also been great work and progress around giving identifiers affordance. Making them actionable.
Identifiers that are URIs can be clicked on in documents and emails. They can be used by humans and machines to find content, data and metadata. Where identifiers are not URIs, there are often resolvers that help to integrate them with the web.
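For example, a DOI like 10.1000/182 is not itself a URI, but the doi.org resolver turns it into a link that can be shared and dereferenced on the web: https://doi.org/10.1000/182.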
Persistence and affordance are both vital qualities for identifiers that will help us build a stronger data infrastructure.
But lately I’ve been thinking that there should be more discussion and thought put into how we document identifiers. I think there are three reasons for this.
Firstly, identifiers are boundary objects. As we increase access to data, by sharing it between organisations or publishing it as open data, then an increasing number of data users and communities are likely to encounter these identifiers.
I'm sure everyone in this room knows what a DOI is (aside: they did). But how many people know what a TOID is? (Aside: none of them did). TOIDs are a national identifier scheme. There's a TOID for every geographic feature on Ordnance Survey maps. As access to OS data increases, more developers will be introduced to TOIDs and could start using them in their applications.
As identifiers become shared between communities, it's important that the context around how those identifiers are created and managed is accessible, so that we can properly interpret the data that uses them.
Secondly, identifiers are standards. There are many different types of standard. But they all face common problems of achieving wide adoption and impact. Getting a sector to adopt a common set of identifiers is a process of agreement and implementation. Adoption is driven by engagement and support.
To help drive adoption of standards, we need to ensure that they are well documented. So that users can understand their utility and benefits.
Finally identifiers usually exist as part of registers or similar reference data. So when we are publishing identifiers we face all the general challenges of being good data publishers. The data needs to be well described and documented. And to meet a variety of data user needs, we may need a range of services to help people consume and use it.
Together I think these different issues can lead to additional friction that can hinder the adoption of open identifiers. Better documentation could go some way towards addressing some of these challenges.
So what documentation should we publish around identifier schemes?
I've created a discussion document to gather and present some thoughts around this. Please have a read and leave your comments and suggestions on that document. For this presentation I'll just talk through some of the key categories of information.
I think these are:
I take it pretty much as a given that this type of important documentation and metadata should be machine-readable in some form. So we need to approach all of the above in a way that can meet the needs of both human and machine data users.
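As a purely illustrative sketch (not any existing vocabulary or schema; the field names, URLs and example pattern below are hypothetical placeholders), machine-readable documentation for an identifier scheme might bundle the kinds of facts a new user needs:

{
  "scheme": "TOID",
  "publisher": "Ordnance Survey",
  "licence": "https://example.org/licence",
  "pattern": "^osgb[0-9]+$",
  "resolver": "https://example.org/toid/{id}",
  "documentation": "https://example.org/toid/docs",
  "contact": "identifiers@example.org"
}

Even a small, agreed set of fields like these would let tools automatically display context about an unfamiliar identifier.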
Before jumping into bike-shedding around formats, there are a few immediate questions to consider:
I'm interested to know whether others think this would be a useful exercise to take further, and also the best forum for doing that. For example, should there be a W3C community group or similar that we could use to discuss and publish some best practice?
Please have a look at the discussion document. I’m keen to learn from this community. So let me know what you think.
Thanks for listening.
Posted at 11:41
This is a rough transcript of a talk I recently gave at a workshop on Linked Open Statistical Data. You can view the slides from the talk here. I’m sharing my notes for the talk here, with a bit of light editing.
At the Open Data Institute our mission is to work with companies and governments to build an open trustworthy data ecosystem. An ecosystem in which we can maximise the value from use of data whilst minimising its potential for harmful impacts.
An important part of building that ecosystem will be ensuring that everyone — including governments, companies, communities and individuals — can find and use the data that might help them to make better decisions and to understand the world around them.
We’re living in a period where there’s a lot of disinformation around. So the ability to find high quality data from reputable sources is increasingly important. Not just for us as individuals, but also for journalists and other information intermediaries, like fact-checking organisations.
Combating misinformation, regardless of its source, is an increasingly important activity. To do that at scale, data needs to be more than just easy to find. It also needs to be easily integrated into data flows and analysis. And the context that describes its limitations and potential uses needs to be readily available.
The statistics community has long had standards and codes of practice that help to ensure that data is published in ways that help to deliver on these needs.
Technology is also changing. The ways in which we find and consume information is evolving. Simple questions are now being directly answered from search results, or through agents like Alexa and Siri.
New technologies and interfaces mean new challenges in integrating and using data. This means that we need to continually review how we are publishing data. So that our standards and practices continue to evolve to meet data user needs.
So how do we integrate data with the web? To ensure that statistics are well described and easy to find?
We’ve actually got a good understanding of basic data user needs. Good quality metadata and documentation. Clear licensing. Consistent schemas. Use of open formats, etc, etc. These are consistent requirements across a broad range of data users.
What standards can help us meet those needs? We have DCAT and Data Packages. Schema.org Dataset metadata, and its use in Google dataset search, now provides a useful feedback loop that will encourage more investment in creating and maintaining metadata. You should all adopt it.
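As a minimal sketch of that kind of markup (the dataset name, description and URLs here are invented placeholders), a publisher can embed schema.org Dataset metadata as JSON-LD in a dataset's landing page:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example statistical dataset",
  "description": "Monthly observations published as open data.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/observations.csv"
  }
}

Google dataset search crawls exactly this kind of markup, which is what creates the feedback loop mentioned above.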
And we also have CSV on the Web. It does a variety of things which aren’t covered by some of those other standards. It’s a collection of W3C Recommendations that:
The primer provides an excellent walk through of all of the capabilities and I’d encourage you to explore it.
One of the nice examples in the primer shows how you can annotate individual cells or groups of cells. As you all know this capability is essential for statistical data. Because statistical data is rarely just tabular: it’s usually decorated with lots of contextual information that is difficult to express in most data formats. Users of data need this context to properly interpret and display statistical information.
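As a rough illustration (the file name, title and column names are invented), a minimal CSV on the Web metadata document sits alongside the CSV file and describes its columns, with the richer annotations the primer describes layered on top:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "observations.csv",
  "dc:title": "Example statistical observations",
  "tableSchema": {
    "columns": [
      {"name": "area", "titles": "Area", "datatype": "string"},
      {"name": "period", "titles": "Period", "datatype": "date"},
      {"name": "value", "titles": "Value", "datatype": "decimal"}
    ]
  }
}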
Unfortunately, CSV on the Web is still not that widely adopted, even though it's relatively simple to implement.
(Aside: several audience members noted they are using it internally in their data workflows. I believe the Office for National Statistics is also moving to adopt it)
This might be because of a lack of understanding of some of the benefits it provides. Or that those benefits are limited in scope.
There also aren’t a great many tools that support CSV on the web currently.
It might also be that there are some other missing pieces of data infrastructure that are blocking us from making best use of CSV on the Web and other similar standards and formats. Perhaps we need to invest further in creating open identifiers to help us describe statistical observations, e.g. so that we can clearly describe what type of statistics are being reported in a dataset.
But adoption could be driven from multiple angles. For example:
Not everyone needs to implement or use the full set of capabilities. But with some small changes to tools and processes, we could collectively improve how tabular data is integrated into the web.
Thanks for listening.
Posted at 10:55
In 2018, AGU celebrated its centennial year. TWC had a good showing at this AGU, with 8 members attending and presenting on a number of projects.
We arrived in DC on Saturday night to attend the DCO Virtual Reality workshop organized by Louis Kellogg and the DCO Engagement Team, where researchers from the greater DCO community came together to present, discuss and understand how the use of VR can facilitate and improve both research and teaching. Oliver Kreylos and Louis Kellogg spent various sessions presenting the results of the DCO VR project, which involved recreating some of the visualizations used commonly at TWC, i.e. the mineral networks. For a preview of the VR environment, check out these three tweets. Visualizing mineral networks in VR has yielded some promising results: we observed interesting patterns in the networks which need to be explored and validated in the near future.
With a successful pre-AGU workshop behind us, we geared up for the main event. First thing Monday morning was the "Predictive Analytics" poster session, which Shaunna Morrison, Fang Huang, and Marshall Ma helped me convene. The session, while low on submitted abstracts, was full of very interesting applications of analytics methods in various earth and space science domains.
Fang Huang also co-convened a VGP session on Tuesday, titled "Data Science and Geochemistry". It was a very popular session, with 38 abstracts, and it was very encouraging to see divisions other than ESSI hold Data Science sessions. This session also highlighted the work of many of TWC's collaborators from the DTDI project. Kathy Fontaine convened an e-lightning session on data policy. This new format was very successful in drawing a large crowd to the event and enabled a great discussion on the topic. The day ended with Fang's talk, presenting our findings from the network analysis of samples from the Cerro Negro volcano.
Over the next 2 days, many of TWC's collaborators presented, but no one from TWC presented until Friday. Friday, though, was the busiest day for all of us from TWC. Starting with Peter Fox's talk in the morning, Mark Parsons, Ahmed Eleish, Kathy Fontaine and Brenda Thomson all presented their work during the day. Oh yeah…and I presented too! My poster on the creation of the "Global Earth Mineral Inventory" got good feedback. Last, but definitely not least, Peter represented the ESSI division during the AGU centennial plenary, where he talked about the future of Big Data and Artificial Intelligence in the Earth Sciences. The video of the entire plenary can be found here.
Overall, AGU18 was great. Besides the talks mentioned above, multiple productive meetings and potential collaborations emerged from meeting various scientists and talking to them about their work. It was an incredible learning experience for me and the other students (for whom this was also their first AGU).
As for other posters and talks I found interesting: I tweeted a lot about them during AGU, and fortunately I also made a list of some interesting posters.
Posted at 17:41
When you move a Raspberry Pi between wifi networks and you want it to behave like an appliance, one way to set the wifi network easily as a user rather than a developer is to have it create an access point itself that you can connect to with a phone or laptop, enter the wifi information in a browser, and then reconnect to the proper network. Balena have a video explaining the idea.
Andrew Nicolaou has written things to do this periodically as part of Radiodan. His most recent suggestion was to try Resin (now Balena)’s wifi-connect. Since Andrew last tried, there’s a bash script from Balena to install it as well as a Docker file, so it’s super easy with just a few tiny pieces missing. This is what I did to get it working:
Provision an SD card with Stretch e.g. using Etcher or manually
Enable ssh e.g. by
touch /Volumes/boot/ssh
Share your network with the pi via ethernet, ssh in and enable wifi by setting your country:
sudo raspi-config
then Localisation Options -> Set wifi country.
Install wifi-connect
bash <(curl -L https://github.com/balena-io/wifi-connect/raw/master/scripts/raspbian-install.sh)
Add a slightly-edited version of their bash script
curl https://gist.githubusercontent.com/libbymiller/e8fe6821e122e0a0ac921c8e557320a9/raw/46138fb4d28b494728e66515e46bd7d736b19132/start.sh > /home/pi/start-wifi-connect.sh
Add a systemd script to start it on boot.
sudo nano /lib/systemd/system/wifi-connect-start.service
-> contents:
[Unit]
Description=Balena wifi connect service
After=NetworkManager.service

[Service]
Type=simple
ExecStart=/home/pi/start-wifi-connect.sh
Restart=on-failure
StandardOutput=syslog
SyslogIdentifier=wifi-connect
Type=idle
User=root

[Install]
WantedBy=multi-user.target
Enable the systemd service
sudo systemctl enable wifi-connect-start.service
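(If you want to check the unit is registered before rebooting, sudo systemctl status wifi-connect-start.service should report it as loaded and enabled; it won't be active until the next boot.)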
Reboot the pi
sudo reboot
A wifi network should come up called “Wifi Connect”. Connect to it, add in your details into the captive portal, and wait. The portal will go away and then you should be able to ping your pi over the wifi:
ping raspberrypi.local
(You might need to disconnect the ethernet cable from the Pi before connecting to the Wifi Connect network if you were sharing your network that way).
Posted at 17:21
[Image: Compound found in Taphrorychus bicolor (doi:10.1002/JLAC.199619961005). Published in Liebigs Annalen; see this post about the history of that journal.]
Posted at 06:59
[Image: 2D structure of caffeine, also known as theine.]
Posted at 08:15
We are happy to announce SANSA 0.5 – the fifth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.
You can find the FAQ and usage examples at http://sansa-stack.net/faq/.
The following features are currently supported by SANSA:
Noteworthy changes or updates since the previous release are:
Deployment and getting started:
We want to thank everyone who helped to create this release, in particular the projects HOBBIT, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin and Simple-ML.
Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.
Greetings from the SANSA Development Team
Posted at 08:25
Like this:
Download Raspbian Stretch with Desktop
Burn a card with Etcher.
(Assuming a Mac) Enable ssh
touch /Volumes/boot/ssh
Put a wifi password in
nano /Volumes/boot/wpa_supplicant.conf
country=GB
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
  ssid="foo"
  psk="bar"
}
Connect the Pi camera, attach a dial to GPIO pin 12 and ground, boot up the Pi, ssh in, then
sudo apt-get update
sudo apt-get upgrade
sudo raspi-config # and enable camera; reboot
install tensorflow
sudo apt install python3-dev python3-pip
sudo apt install libatlas-base-dev
pip3 install --user --upgrade tensorflow
Test it
python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
get imagenet
git clone https://github.com/tensorflow/models.git
cd ~/models/tutorials/image/imagenet
python3 classify_image.py
install openCV
pip3 install opencv-python
sudo apt-get install libjasper-dev
sudo apt-get install libqtgui4
sudo apt install libqt4-test
python3 -c 'import cv2; print(cv2.__version__)'
install the pieces for talking to the camera
cd ~/models/tutorials/image/imagenet
pip3 install imutils picamera
mkdir results
download an edited version of classify_image
curl -O https://gist.githubusercontent.com/libbymiller/d542d596566774a35752d134f80b1332/raw/471f066e4dc498501bab7731a07fa0c1926c1575/classify_image_dial.py
Run it, and point at a cat
python3 classify_image_dial.py
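I haven't reproduced the gist here, but the general idea is roughly the sketch below. It is my own illustration, not the actual classify_image_dial.py, and it assumes the dial acts as a simple switch between BCM pin 12 and ground with the internal pull-up enabled: wait for the dial, capture a frame from the Pi camera, then hand the saved image to the ImageNet classifier.

import time
import RPi.GPIO as GPIO
from picamera import PiCamera

DIAL_PIN = 12  # assumed BCM numbering; wired between the pin and ground

GPIO.setmode(GPIO.BCM)
GPIO.setup(DIAL_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

camera = PiCamera()
try:
    while True:
        # Block until the dial pulls the pin low
        GPIO.wait_for_edge(DIAL_PIN, GPIO.FALLING)
        filename = 'results/%d.jpg' % int(time.time())
        camera.capture(filename)
        print('captured', filename)
        # classify_image.py-style code would then run the ImageNet
        # model over this file and print the top labels
finally:
    GPIO.cleanup()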
Posted at 20:48
I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.
HDT's single binary file format has benefits for network and disk IO when loading and transferring graphs. That's its main selling point, and it does a reasonably good job at that. HDT's use of an RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.
I ran into a number of outright problems when trying to implement HDT from scratch:
The HDT documentation is incomplete/incorrect in places, and required reverse engineering the existing implementations to determine critical format details; questions remain about specifics (e.g. canonical dictionary escaping):
Here are some of the issues I found during implementation:
DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8 bit value for this purpose
FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)
The paper cited for "VByte" encoding claims that value data is stored in "the seven most significant bits in each byte", but the HDT implementation uses the seven least significant bits (see the sketch after this list)
“Log64” referenced in BitmapTriples does not seem to be defined anywhere
There doesn't seem to be documentation on exactly how RDF term data ("strings") is encoded in the dictionary. Example datasets are enough to intuit the format, but it's not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup
The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding
There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible
The secondary index format seems to be entirely undocumented
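To make the VByte point above concrete, here is a minimal decoding sketch in Python. It assumes one plausible convention consistent with the issue noted in the list (the low seven bits of each byte carry value data, least significant group first, and a set high bit marks the final byte of a number); treat it as an illustration rather than a normative description, and check the flag and group ordering against the existing C++/Java implementations.

def vbyte_decode(buf, offset=0):
    # Decode one variable-length integer starting at `offset`.
    # Assumption: low 7 bits of each byte hold data, least significant
    # group first; a set high bit marks the last byte of the number.
    value = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, offset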
In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:
The default dictionary encoding format (plain front coding) is inefficient for datatyped literals and unnecessarily allows escaped content, resulting in inefficient parsing
Distinct value space for predicate and subject/object dictionary IDs is at odds with many triple stores, and makes interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)
The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit
A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more generally, there seems to be a general lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)
The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging
Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, allowing multiple triple orderings? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.
In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.
Beyond improvements to existing HDT resources, I think it's also important to think about use cases that aren't fully addressed by HDT yet. The HDTQ extension to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I'd also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TriG files.
Posted at 16:58
This weekend I started a side project which I plan to spend some time on this winter. The goal is to create a web interface that will let people explore geospatial datasets published by the three local authorities that make up the West of England Combined Authority: Bristol City Council, South Gloucestershire Council and Bath & North East Somerset Council.
Through Bath: Hacked we’ve already worked with the council to publish a lot of geospatial data. We’ve also run community mapping events and created online tools to explore geospatial datasets. But we don’t have a single web interface that makes it easy for anyone to explore that data and perhaps mix it with new data that they have collected.
Rather than build something new, which would be fun but time consuming, I've decided to try out TerriaJS. It's an open-source, web-based mapping tool that is already being used to publish the Australian National Map. It should handle the West of England quite comfortably. It's got a great set of features and can connect to existing data catalogues and endpoints. It seems to be perfect for my needs.
I decided to start by configuring the datasets that are already in the Bath: Hacked Datastore, the Bristol Open Data portal, and data.gov.uk. Every council also has to publish some data via standard APIs as part of the INSPIRE regulations, so I hoped to be able to quickly bring together a list of existing datasets without having to download and manage them myself.
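For what it's worth, TerriaJS catalogues of this sort are configured with JSON init files; a rough sketch of the kind of thing I mean is below (the group name, WMS endpoint and layer name are placeholders, not real council services):

{
  "catalog": [
    {
      "name": "South Gloucestershire Council (example)",
      "type": "group",
      "items": [
        {
          "name": "Example INSPIRE layer",
          "type": "wms",
          "url": "https://example.org/geoserver/ows",
          "layers": "example:layer"
        }
      ]
    }
  ]
}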
Unfortunately this hasn’t proved as easy as I’d hoped. Based on what we’ve learned so far about the state of geospatial data infrastructure in our project at the ODI I had reasonably low expectations. But there’s nothing like some practical experience to really drive things home.
Here’s a few of the challenges and issues I’ve encountered so far.
The goal of the INSPIRE legislation was to provide a common geospatial data infrastructure across Europe. What I’m trying to do here should be relatively quick and easy to do. Looking at this graph of INSPIRE conformance for the UK, everything looks rosy.
But, based on an admittedly small sample of only three local authorities, the reality seems to be that:
It’s important that we find ways to resolve these problems. As this recent survey by the ODI highlights, SMEs, startups and local community groups all need to be able to use this data. Local government needs more support to help strengthen our geospatial data infrastructure.
Posted at 18:07
[Image: Figure from the article showing the interactive Open PHACTS documentation to access interactions.]
Posted at 10:30
[Image: Download statistics of J. Cheminform. Additional Files show a clear growth.]
Posted at 09:45