Interview with Open Knowledge Foundation’s Jonathan Gray

Jonathan Gray (@jwyg) went to OKF Co-Founder Rufus Pollock about five years ago with a one-page document which, many moons later, turned into WhereDoesMyMoneyGo.org. Since then he’s been working hard to make it easier to find and reuse all kinds of content and data – including but not limited to government data.

Here he answers a few questions…

1. What does the Open Knowledge Foundation do and what is your role?

The OKF is a not-for-profit organisation dedicated to promoting ‘open knowledge’ in all its forms. This includes open data, open content and material which is in the public domain. We work to make knowledge easier for anyone to reuse in lots of different domains – from government information to scientific research data to cultural works. For a while our informal motto has been ‘sonnets to statistics, genes to geodata’.

I’m the OKF’s Community Coordinator – which means I spend lots of time talking, writing, emailing and meeting people in order to ultimately build a stronger and better connected community around the Foundation. I try to close loops between those who have common interests and encourage them to collaborate with each other, and to engage in shared projects and initiatives at the OKF. If I had to depict my role in a game of Pictionary, I’d probably draw an address book with legs and a hat.

2. How can OKF be used as a tool for journalists?

We have lots of different projects and initiatives which might be relevant to journalists. For example you can request data you’re interested in on GetTheData.org. You can find open datasets from around the world using OpenDataSearch.org, or from around Europe at PublicData.eu. You can explore UK public spending on WhereDoesMyMoneyGo.org – and soon you will be able to explore spending data from around the world at OpenSpending.org. We also help to run numerous data catalogues – from the community driven CKAN.net (where anyone can add an open dataset) to official government data catalogues such as data.gov.uk, data.norge.no, or the official Dutch data catalogue.

Perhaps most importantly of all, the OKF serves as a decentralised network for people interested in reusing open data – from developers to designers to data journalists to data literate citizens. If you want to find someone to help you with something you can ping one of our many mailing lists, and there are lots of people who are very knowledgeable about all things related to finding, getting and reusing datasets. We also put on a mixture of events – from big events like the Open Government Data Camp last autumn to small hands-on workshops aimed at making things, like the Eurostat Hackday.

3. As governments become more open (e.g. data.gov.uk), is the OKF made redundant?

Not really. In a way I love the idea of initiatives that are like a pair of bicycle stabilisers – used for initial support and then discarded when they are no longer needed. One can imagine a world in which the OKF and other organisations like it helped to open up the world’s data and then faded into the fog. I even used to joke about the built-in obsolescence of my own role (i.e. building communities – then stepping quietly away).

But alas I don’t think this is going to be the case in relation to open government data, for several reasons. First, we need not only data but strong reuser communities around it. The recipe for a thriving open data initiative is not just raw data now, but data plus tools plus communities. I think there will always be room for loose-knit, network-based organisations like the OKF to put on fun events, to keep people with shared interests in touch, and to act as a hub for people who want to work on shared projects (like CKAN or OpenSpending.org or the Open Data Manual).

Jonathan Gray, OKF Community Coordinator

4. Is data approachable to the non-programmer?

I believe so. Of course not all data is equally approachable – just as not all pieces of text are equally approachable (compare Harry Potter to James Joyce’s Ulysses or a university physics textbook).

  1. There is a lot of very valuable data that is very easy to interpret with the naked eye.
  2. There is a lot of very valuable data that is pretty easy to interpret, with a little bit of effort, patience, research and possibly guidance.
  3. Then there is also (not unexpectedly) a lot of data which is a complete nightmare and very difficult to interpret – and that is why we have clever tools and clever geeks and lots of cunning experts who know things about the data because perhaps they have built or administered the database, or gathered the data themselves.

Either way – the non-approachability of data to some people is an exceptionally poor argument in itself for not opening datasets up. And in the long term, while more data literacy is probably a Good Thing for society broadly speaking, not everyone needs to be a data geek in order to benefit from more open data – just as not everyone needs to be a plumber or an urban planner in order to benefit from pipes or roads.

5. What is the future for open data?

Frankly – I have no idea. Some people argue we’ll see ‘small pieces, loosely joined’. The OKF is very keen on seeing how the open data ‘movement’ can learn from methodologies and techniques from the world of open source software – where you see lots of quite sophisticated, distributed collaboration. My colleague Dr. Rufus Pollock and others at the Foundation are very keen on this and have a strong vision of an ‘ecosystem’ of open data, similar to the ecosystem of open source software.

I think generally we’d like to hope that open data will be both ubiquitous and routine – i.e. as its value is recognised we’ll hopefully go from saying an enthusiastic ‘wow’ to saying an impatient ‘yes of course’. A lot of this is about unlocking potential innovation – i.e. useful and interesting things which we can’t anticipate.

Ultimately I hope that open data will enable more evidence based policy-making, better reportage and richer, more informed conversation across society. More open data in and of itself will not improve the world – but having better ‘maps’ showing where we’ve come from, where we are and where we’re going might help us plot our path into the future more intelligently, and to have more inclusive discussions about the various possible routes that we might take.

6. How could Driven by Data’s involvement help with the OKF?

We’d be delighted if any of you are interested in spending data – in the UK, in Europe or internationally – and would like to help put the numbers we now have into context. We’re also always very keen to understand more about what journalists (and aspiring journalists) want, and what would be useful for them. Hopefully, if there are particular datasets you need, you’ll consider posting requests on GetTheData.org so others can help dig around with you!

If you have any ideas for projects you’d like to undertake, you can post them on ideas.okfn.org and we can try and find people for you to collaborate with, or perhaps even funding. And do get in touch if there is anything you need. We’re here to be helpful.

By Michael Greenfield (@mgreenfield13)


Data Journalism – The Hidden Truth

Data-driven journalism is respected for the credibility of its facts and figures. If you ever need to clarify information, it is presented right there in front of you. It is truthful and valid – but only IF the information is analysed correctly.

The ability to interpret data is an impressive must-have skill on any journalist’s C.V. After all, it would be pointless to be presented with the most incredible set of raw data and not know what to do with it.

Why is it such an important skill? The wealth of stories that can be sourced from data is endless. What you have to remember with data is that it can be manipulated in so many ways to give a journalist different angles to tackle a story.

But without a good understanding of how to scrutinize data, you could be left with stories that don’t discuss the results of the data, but instead debate the concept of data!

For example, WikiLeaks revealed so much information that was hidden from the public. It shocked, appalled and uncovered the wrongdoings of people in high-flying positions. Yet while what was uncovered was momentarily discussed, the focus soon turned to whether or not the data itself should have been released.

The exposure’s potential threat to national security was discussed. However, I believe the journalists had no idea how to extract the stories from the data. So instead, the actual leak of the data became the story.

Accessing data sometimes poses few problems, but acquiring the tools to edit and explore that data can be difficult. For example, data released as .pdf files can only be read, not manipulated. For a journalist working to deadlines, trawling through .pdf files can be very time-consuming.

Plus, it is very difficult to copy data cleanly from .pdf files into a spreadsheet. Instead, the journalist often has to type the data out manually into another, more user-friendly format.

This use of .pdf files could be intentional – a way of limiting scrutiny of the data – but that is debatable.
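
There are workarounds, though. As a rough sketch: the open-source Python library pdfplumber can often pull a machine-readable table out of a .pdf and into spreadsheet-ready rows (the filename here is hypothetical, and this assumes a simple, text-based table rather than a scanned image):

    import csv

    import pdfplumber  # open-source PDF text and table extraction library

    # Hypothetical filename: a report containing a simple machine-readable table
    with pdfplumber.open("spending_report.pdf") as pdf, \
            open("spending_report.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for page in pdf.pages:
            table = page.extract_table()  # rows as lists of cell strings
            if table:
                writer.writerows(table)

Scanned .pdf files are another matter – they need optical character recognition first – which is partly why campaigners push for data to be released in machine-readable formats in the first place.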

However, the government have plans to change the law so that all information released under the Freedom of Information Act is accessible by computer. This could make the journalist’s task easier, as SA Mathieson and Gill Hitchcock reported in an article on the Guardian Datablog.

The Combined Online Information System (COINS) was opened up online in June 2010. It is an enormous database containing HM Treasury’s detailed analysis of departmental spending, sorted into thousands of categories.

Prior to the 2010 general election, the Treasury turned down requests under the Freedom of Information Act to release the data held in COINS. However, the Conservatives pledged to release the information if they came to power, and so the 120GB of data was made publicly available on 4 June 2010.

Releasing data can also be damaging to a company’s reputation. After tooting their own “iHorn” about their world domination of portable media devices, Apple have yet to release sales figures for the iPad 2. With the release of the first iPad, they bragged about selling a phenomenal 300,000 units over one weekend, publishing the sales figures first thing on the Monday morning after the Friday release.

Market analysts have suggested that Apple could have sold anywhere from 400,000 to 1,000,000 iPad 2 units over its launch weekend.

Are Apple refraining from releasing their sales figures because of a lack of units shifted? Or perhaps boasting about sales figures is a marketing technique they can no longer use to promote the product, as the data would be there in black and white.

In an interview with Guardian Datablog editor Simon Rogers, I asked him if there was a negative side to data journalism.

He commented, “Poor visualisations, poor analysis, an overly-heavy focus on things that might seem diverting [but] don’t actually tell us anything.”

 


Data in classical music

The harp is an instrument that couldn’t POSSIBLY be related to data in any way. Or could it? Although it is one of the oldest, and arguably one of the most traditional and beautifully classical, instruments – with a sound so unique and therapeutic – it has slowly evolved.

There was the classical harp, then emerged the electric harp and now, most astonishingly of all, the MIDI harp. MIDI stands for Musical Instrument Digital Interface.

It is now possible, with the wonders of data and technology, to pair a harp with a MIDI controller on a computer: on the basis of frequency analysis, modified Axon technology converts the sound of the strings into MIDI data at extremely high speed, with no noticeable delay.

The MIDI harp is a regular concert harp that you play quite normally, but every string can be made to trigger audio from your PC or laptop – anything of orchestral magnitude, the bark of a dog, or even human speech.

It is essentially what a keyboard is to a piano but much more advanced and vastly more technological.
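
At the heart of that conversion is pitch detection: the frequency of a vibrating string is mapped to a MIDI note number. Here is a minimal sketch of the standard equal-temperament mapping (an illustration of the principle, not the harp’s actual firmware):

    import math

    def frequency_to_midi_note(freq_hz: float) -> int:
        """Map a detected string frequency to the nearest MIDI note number.

        Uses the equal-temperament convention: A4 = 440 Hz is MIDI note 69,
        and each semitone is a factor of 2**(1/12) in frequency.
        """
        return round(69 + 12 * math.log2(freq_hz / 440.0))

    # Example: a string vibrating at ~261.63 Hz maps to MIDI note 60 (middle C)
    print(frequency_to_midi_note(261.63))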

Sioned Williams is the lead harpist for the BBC Symphony Orchestra and has spent the last few months premiering the MIDI harp with her technician, Graham Fitkin. She told me:

‘It really is an enormous privilege to be doing this. Beautiful as the harp is, I think the MIDI adds another element to an instrument already so mysterious.’

‘People ask me how I make those sounds on the MIDI harp! Of course I say that I don’t. The harp itself is silent, it’s the data from the computer that is being translated to the harp.’

I asked her which she preferred, classical or MIDI harp.

‘I have to say I prefer classical but that’s because I’m a traditionalist. With the surge of technology and the data that we can use for the MIDI, it means endless possibilities for the harp. It’s a very exciting time.’

This is Sioned playing her MIDI harp concerto piece, ‘No Doubt’.

Here is another example of what the MIDI harp can do. All the sounds you hear are coming from one harp; there are no other people with instruments in the room, just the cameraman.

This shows the real power data has, and how its capabilities have leapt from the computer to the harp – an instrument one would think would remain untouched by complex technology and data intelligence.

By Alex Lawton (@AlexandraLawton)


Google, data and privacy abuse

Yesterday, the Federal Trade Commission (FTC) ruled that Google should face 20 years of privacy monitoring. The company was charged with deceptive privacy practices dating back to the launch of its ‘Twitter-esque’ social networking site, Buzz, in early 2010.

Thousands of people who had subscribed to Buzz complained that their privacy had been abused.

Instead of being an opt-in service, Google Buzz opted users in automatically, followed people for them, and disclosed email contacts on a very public scale.

This resulted in the Electronic Privacy Information Center (EPIC) and eleven members of the US House of Representatives sending a complaint letter to the chairman of the FTC, Jon Leibowitz, criticising Google’s role as a potential curator of private data.

The FTC has declared the company used “deceptive tactics and violated its own privacy promises to consumers.”

While the ‘buzz’ surrounding the launch of Buzz was mostly negative, some Googlers did find a good use for it.

However, early-stage problems included users’ inability to prohibit or block followers, many of whom were chosen automatically from their most-contacted people within Google‘s email service, Gmail.

Many people who simply used Gmail were automatically placed into the Buzz ecosystem – which seems a little presumptuous.

Google has been barred from any future privacy misrepresentations and will be consistently audited over the next 20 years – a forceful slap on the proverbial wrist.

They have already issued a formal apology and will hope to be back in users’ good graces before the rumoured launch of their second social networking site, ‘Google Circles‘… Hmmn…

Hopefully this ruling will mean better protection of personal data and information as well as a higher privacy bar for users now concerned with how much they’re actually giving away when joining a social networking site.

By Alex Lawton (@AlexandraLawton).


‘Data porn’ – where is the journalism?

With a name like Chart Porn, you might get a few disappointed visitors to your website. It describes itself as ‘An addictive collection of beautiful charts, graphs, maps, and interactive data visualization toys — on topics from around the world.’

You can find example after example of ‘beautiful’ infographics, but it is the word ‘porn’ that speaks volumes. It suggests quick visual gratification, rather than explanation and analysis.

So the big question must be… is the journalism being lost behind the graphic design?

As Claire Gilmore discussed in an earlier post on this blog, there is a distinct trend towards turning data visualisations into artwork.

The two dangers facing data journalists that emerge from this trend are neatly summarised by Paul Bradshaw:

  • DATA CHURNALISM = producing stories from data sets without context or proper interrogation
  • DATA PORN = where journalists look for big, attention-grabbing numbers or produce visualisations of data that add no value to the story

These are major concerns. My worry is that in the rush to embrace an exciting form of storytelling, the journalism is being left behind.

There are murmurs of discontent among some journalists. One source told me about criticisms of David McCandless and Andy Perkins’ Snake Oil piece. The evidence behind the evidence-based medicine presented in the image (if that makes sense) ‘is a touch flaky’, the source tells me.

In other words, the whole story is based on questionable medical research.  This is an issue that David McCandless tackles head on:

‘This piece was doggedly researched by myself, and researchers Pearl Doughty-White and Alexia Wdowski. We looked at the abstracts of over 1500 studies on PubMed (run by US National Library Of Medicine) and Cochrane.org (which hosts meta-studies of scientific research). It took us several months to seek out the evidence – or lack of.’

Whichever way you look at it, the fundamental point it raises is that proper journalism is required for proper data journalism.

The everyday principles of good journalism have to apply; involving data doesn’t change a thing:

  1. Accuracy
  2. Objectivity
  3. Originality

An interesting discussion on a Martin Belam post really gets to the heart of the issue:

We [journalists] can do more [than developers and designers]: where a developer might make a graphic, we can find a story. We’re more likely to chase up the anomalies, look for wrongdoing, and to pick up the phone and talk to the people involved.

Grabbing a data set and throwing up a visualisation because it looks amazing is just data porn. Journalism is a highly skilled profession, and just because you can generate a front-page infographic (ahem, The Independent, Tuesday 25 May 2010) doesn’t qualify you as a data journalist. It makes you a data artist.

It feels only right to finish on this quote from James Ball:

it’s crucial journalists learn to treat it [data] properly – and that’ll only come when it’s treated with the same respect (and fear) that surrounds misspelling someone’s name.

By Michael Greenfield (@mgreenfield13)


Youth unemployment: is it really as drastic as they say?

The Office for National Statistics (ONS) published some dramatic figures last month relating to youth unemployment.

The opposition leader, Ed Miliband, jibes at the coalition over a ‘lost generation’. But figures that were not taken into account suggest that today’s youth perhaps don’t warrant this special alarm.

Nearly 800,000 18-24s were unemployed in the early 1990s, compared to around the 700,000-750,000 mark in recent months. So, in terms of actual numbers, as opposed to rates, on the same measure over the same period, the situation is not quite so bad now as it was then.

The figures and rates the ONS has published seem to have been calculated in rather a strange way. Instead of considering the UK’s youth as a whole, they take the unemployed as a percentage of the economically active (the employed plus the unemployed), leaving out those who are not active.

This subtle, albeit important, difference is vital, as the rate of inactive youths has been rising for over a decade due to factors (not taken into account in the survey) like wider access to education.

In other words, the denominator used to calculate the rate has been shrinking. Youths exiting the labour market to enter education is by no means a bad thing, but it DOES make figures showing rising youth unemployment all but inevitable!

The data as presented by the ONS could well be regarded as misleading. The fact of the matter is that youth unemployment is neither soaring nor falling dramatically; publishing results without all the data stokes the fire and risks conjuring a negative impression, when the figures should be investigated further.

This distortion is particularly apparent in the ONS‘s published unemployment rates for 16-17 year olds. It states that 36% of people in this age bracket are unemployed, with 200,000 out of work and 350,000 in work. It fails to mention the one million other people in this ‘group’ who are in education.

If you add them to the denominator, the unemployment rate falls dramatically – to around 15%, NOT the 36% the original ONS statistics suggest.
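
As a back-of-the-envelope sketch of that denominator effect, using the figures quoted above (this naive recalculation gives roughly 13%; the 15% cited presumably reflects details of the official adjustment):

    unemployed = 200_000      # 16-17 year olds out of work
    employed = 350_000        # 16-17 year olds in work
    in_education = 1_000_000  # 16-17 year olds in full-time education

    # Headline rate: unemployed as a share of the economically active only
    headline_rate = unemployed / (unemployed + employed)
    print(f"Active-only denominator: {headline_rate:.1%}")  # ~36.4%

    # Widened rate: count those in education in the denominator too
    widened_rate = unemployed / (unemployed + employed + in_education)
    print(f"Whole-age-group denominator: {widened_rate:.1%}")  # ~12.9%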

Data can be very misleading if it is not presented responsibly and in its entirety.

By Alex Lawton (@AlexandraLawton).


Testing visualisation tools on the world’s most endangered species

In the Guardian Datastore, I found a spreadsheet outlining the number of endangered species in each country in 2008 and 2009. The document also subdivided the species into groups of mammals, reptiles, birds and so on.

Usually, I start data experiments with a question and then look for the appropriate data. But in this case, the dataset was so interesting that the questions came naturally: What groups of species were the most at risk of extinction? Where did they live? Had the number of threatened species increased from 2008 to 2009?

I couldn’t even hazard a guess at patterns in species or continental trends by looking at the dataset because the countries were displayed in alphabetical order. I soon realised I would have to call upon various visualisation tools to find answers to my questions.

First of all, using Tableau Public, I created a world map with circles over each country – the smaller the circle, the fewer the endangered species.

This visualisation helped me locate the countries where there was a major threat to wildlife: Ecuador, the USA, Malaysia, Indonesia, China and Mexico.

I was no closer to analysing continental trends, though. So I created another map using the same information, this time with OpenHeatMap.

Instead of marking each country with a circle, OpenHeatMap filled in each country with varying shades of one colour according to the number of endangered species living there.

The continental patterns were suddenly easy to spot:

  • a dark blue strip ran from the United States to Brazil via Mexico and Ecuador
  • nearly all of South-East Asia was dark blue, from China and India to Australia
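
For anyone wanting to reproduce this kind of map in code rather than with OpenHeatMap, here is a rough sketch using Plotly in Python (the filename and column names are hypothetical – adjust them to match the Guardian spreadsheet saved as a CSV with one row per country):

    import pandas as pd
    import plotly.express as px

    # Hypothetical filename and columns: "country" and "total_endangered"
    df = pd.read_csv("endangered_species_2009.csv")

    fig = px.choropleth(
        df,
        locations="country",           # column holding country names
        locationmode="country names",  # match on names rather than ISO codes
        color="total_endangered",      # shade each country by its count
        color_continuous_scale="Blues",
        title="Endangered species by country, 2009",
    )
    fig.show()

The advantage of a point-and-click tool like OpenHeatMap, of course, is that it does all of this without any code.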

I now wanted to find out whether the number of endangered species had increased in 2009. Again using Tableau Public, I created a scatter chart where blue dots represented the 2008 numbers and orange dots the 2009 numbers.

I could easily compare the heights of the dots and establish whether the number of endangered species had increased from one year to the next. For most countries the two dots were superimposed but in some cases, to my surprise, the 2008 dot was higher than the 2009 one. So some regions had seen a decrease in the number of endangered species within a year.

I had also hoped to find out from the Guardian dataset which wildlife families were more at risk than others. So I created a simple bar chart with Tableau Public.

It shows that plants were the most threatened group in 2009 with 11,025 endangered species. Molluscs, on the other hand, were the least threatened with ‘just’ 1,144 endangered species.
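
The same breakdown can be sketched without Tableau – for instance with pandas and matplotlib in Python (again, the filename and column names are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical filename and columns: one row per country, one column per group
    df = pd.read_csv("endangered_species_2009.csv")
    groups = ["mammals", "birds", "reptiles", "amphibians",
              "fish", "molluscs", "plants"]

    # Sum each wildlife group across all countries and sort for readability
    totals = df[groups].sum().sort_values()

    totals.plot(kind="barh", color="steelblue")
    plt.xlabel("Number of endangered species (2009)")
    plt.title("Endangered species by wildlife group")
    plt.tight_layout()
    plt.show()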

Lastly, I wanted to find out in which countries or regions each wildlife group was most at risk. I achieved this by creating a map where the constant value was the country and the wildlife groups were the variables. I used Tableau Public‘s excellent drag-and-drop tool to chop and change between variables.

Some very intriguing trends appeared:

  • amphibians are almost exclusively endangered in Central America
  • 273 species of molluscs are threatened in the United States, but next to none are endangered in neighbouring Canada and Mexico
  • the only wildlife group notably under threat in Europe is fish
  • mammals are more endangered in Malaysia than anywhere else

I realise I could have obtained these figures from the initial dataset but by actually seeing them on the map, it was much easier to pick up and analyse regional trends. For example, water pollution in Europe’s rivers is probably to blame for the endangerment of hundreds of fish species.

If you find a great dataset and choose your graphics tools carefully, you can draw so many interesting conclusions from your visualisations.

By Claire Gilmore (@ClaireEGilmore)
