‘Publish and polish’: how to clean up dirty data

By Natalie Leal on 05/06/2019
Debating data: At Putting Citizens First, the international panel debated how to handle flawed datasets

Public sectors may be short of cash, but they’re rich in data – giving them the tools to dramatically improve public services. Before picking up those tools, though, civil servants must ensure they’re fit for purpose; at Putting Citizens First, a panel debated how to clean and deploy flawed datasets. Natalie Leal reports

“We really are awash in an ocean of data,” said Jane Wiseman. “There are five billion internet searches a day, 14 million text messages sent every minute; all of this is data that’s created.” And this ocean is growing rapidly: “90% of the data ever created in the history of the world was created in the last couple of years.”

Government creates “tons of data,” she continued – including “all kinds of data about requests for service and service usage and quality of service usage.” But while governments are saturated in data, only 1% of it has so far been analysed, said Wiseman – who, as CEO at the US Institute for Excellence in Government, helps agencies to improve their performance using data analytics. There is enormous “untapped potential” for governments to make much better use of their vast banks of information, she argued.

Wiseman was talking during the Data and Analytics session at Putting Citizens First 2019: a Global Government Forum conference, organised in association with Yesser and EY. Three other panelists joined Wiseman at the event to give their thoughts on the subject: John Kedar, director of international engagement at the UK’s Ordnance Survey; Ott Vatter, managing director of Estonia’s E-Residency scheme; and Basem Aljedai, VP of innovation & digital capability at Saudi Arabia’s e-government programme Yesser.

Let the public polish

But before governments can make use of their data, they must understand and address its flaws. Most datasets are riddled with omissions, errors, and the consequences of bias or duplicity among their human creators.

According to Wiseman, data entry carries a 13% error rate. “There’s a lot of missing data in a typical government data set,” she said – along with inaccurate or transposed values. “Typically, what we have isn’t perfect. And we don’t know that until we publish it,” she commented. “In fact, most data scientists tell me they spend 80, 85, 95 per cent of their time cleaning the data.”
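The cleaning work Wiseman describes is routine but time-consuming. As a minimal sketch – using invented service-request records and field names, not any real government dataset – it might look like this:

```python
# Hypothetical service-request records illustrating the flaws Wiseman
# lists: missing values, unusable entries, and duplicate rows.
records = [
    {"id": "1", "district": "north", "requests": "42"},
    {"id": "2", "district": "", "requests": "17"},        # missing value
    {"id": "3", "district": "south", "requests": "n/a"},  # unusable entry
    {"id": "1", "district": "north", "requests": "42"},   # duplicate row
]

def clean(rows):
    """Drop duplicate rows, then keep only rows whose fields parse."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen:          # exact duplicate of an earlier row
            continue
        seen.add(key)
        if row["district"] and row["requests"].isdigit():
            kept.append({**row, "requests": int(row["requests"])})
    return kept

cleaned = clean(records)
print(len(cleaned))  # only 1 of the 4 raw rows survives
```

Even in this toy case, three-quarters of the rows are lost to quality problems – a crude illustration of why cleaning dominates analysts’ time.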

The way government agencies handle their data can also be problematic. Analysts require raw data: the unprocessed, original information. But instead, governments often publish aggregate data, gleaned from various sources and processed – reducing its value to data scientists.

For example, said Aljedai, the Saudi Ministry of Education may publish aggregate data on schools, combining all the information from 400 or 500 schools into one record. “That benefits nobody,” he said. “What we need is all the schools, geolocations of these schools, size of the schools, how many students there are, the grades of the students.”
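Aljedai’s point can be made concrete with a small sketch. The school figures below are invented for illustration; the contrast is between the single aggregate record a ministry might publish and the per-school raw rows an analyst actually needs:

```python
# Invented per-school records (the "raw data" analysts want).
schools = [
    {"name": "A", "students": 300,  "lat": 24.71, "lon": 46.67},
    {"name": "B", "students": 1200, "lat": 24.63, "lon": 46.72},
    {"name": "C", "students": 450,  "lat": 24.77, "lon": 46.60},
]

# The aggregate record a ministry might publish instead: totals only.
aggregate = {"schools": len(schools),
             "students": sum(s["students"] for s in schools)}

# A question answerable only from the raw rows, not the aggregate:
largest = max(schools, key=lambda s: s["students"])
print(aggregate)        # totals across all schools
print(largest["name"])  # which school is biggest: unrecoverable from totals
```

Once the rows are summed into one record, questions about individual schools – size, location, grades – can no longer be answered, which is why the aggregate “benefits nobody”.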

Public bodies also often publish information in PDF format, but this renders it inaccessible to analytics applications. “My term for PDF is ‘pretty darn frustrating’,” said Wiseman, “because we can’t use it.” Operators need Excel files, or other formats from which data can be exported.

Wiseman: “My term for PDF is ‘pretty darn frustrating’.”

Computers can be racist too

There is also the tricky issue of bias. In the US, for instance, “there’s a lot of discussion around our justice system data having racial bias: decades of heavily-policed urban neighbourhoods, that are primarily African-American, has caused a disproportionate number of arrests of black men in our country,” said Wiseman. “So if you look at arrest data, you see it’s racially biased.”

Prejudice on the part of American police officers, of course, also skews arrests. And these inherent biases in the data represent a major danger, because feeding this raw information into – for example – a predictive analytics system could generate results encouraging officers to arrest African-Americans, embedding human prejudice into the technology’s operation.

Aljedai: Publish raw data, not aggregated datasets

Tackling prejudice

People can be reluctant to use data analytics because of these in-built biases. But Wiseman encouraged the audience to instead identify traces of such discrimination in datasets, and to compensate for them by amending the data or tweaking the algorithms. “Don’t throw the data out with the bathwater: just recognise the bias, and say we are going to do our best, because human decision-making is biased; data decision-making is biased,” she said.

To tackle these quality issues, governments first need to “publish and polish” the data they hold, said Wiseman. Then feedback from businesses, academics and communities can alert agencies to its problems, enabling them to take the next step of “polishing” it before it’s used.

So the open data agenda is an important facilitator of public sector analytics, said Wiseman: “It’s how we get to the next step of having better data and more data; and the more high-quality data we have, the better we are.”

Vatter: “When you are injured, I would like the hospital to know [your medical] information in advance.”

Data in action

One New Orleans project demonstrates the power of data, said Wiseman. Following a lethal house fire, the city’s fire and data chiefs brought together data on housing – covering key variables such as age and construction method – and information on which kinds of homes were most likely to have a smoke alarm. “Putting disparate data sources into one database, developing risk profiles,” they then handed out free smoke alarms at the points of greatest risk; very soon, lives were being saved. “Data really does have the ability to transform government: it can change lives, it can save lives,” added Wiseman.
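A rough sketch of that approach – joining disparate sources and ranking by risk – might look like the following. All addresses, figures and the risk formula are illustrative assumptions, not the city’s actual model:

```python
# Two hypothetical data sources to be joined on address.
housing = {
    "12 Oak St": {"built": 1948},
    "5 Elm Ave": {"built": 2005},
    "9 Pine Rd": {"built": 1962},
}
# Estimated probability that a home of this kind already has an alarm.
alarm_prob = {"12 Oak St": 0.3, "5 Elm Ave": 0.9, "9 Pine Rd": 0.5}

def risk(address):
    """Crude score: older homes less likely to have an alarm rank higher."""
    age = 2019 - housing[address]["built"]
    return age * (1 - alarm_prob[address])

# Rank addresses so free alarms go to the points of greatest risk first.
ranked = sorted(housing, key=risk, reverse=True)
print(ranked[0])  # the oldest, least-likely-alarmed home tops the list
```

The substance of the New Orleans work lay in assembling and validating the underlying data; the ranking step itself, as the sketch suggests, is simple once the sources sit in one database.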

Another life-saving example comes from the UK, explained Ordnance Survey’s John Kedar. “By combining data from many sources – predictions for rain, Environment Agency flooding likelihood statistics, and so on – we can predict where flooding will happen, and then the local authorities can help move citizens out the way before the flooding occurs,” he explained.

In Estonia, efficient data-sharing between government agencies and health providers improves emergency medical care, said Ott Vatter – ensuring that ambulance teams and receiving hospitals have accurate information on patients’ allergies and medical histories. “When you are injured, I would like the hospital to know [your medical] information in advance, so the likelihood of my survival would be greater,” he said.

Kedar: “By combining data from many sources… we can predict where flooding will happen.”

Good AI demands emotional intelligence

As countries move towards ever-greater reliance on data to drive operations, it’s important to ensure that citizens are familiar with the issues around generating and using data. A proportion, of course, need the technical abilities to help lead the digital revolution – using the science, technology, engineering and maths (STEM) skills so heavily promoted in many countries. But automation and machine learning are reducing the need for basic and mid-level programmers and IT staff; in the panelists’ view, it’s just as important that people understand the ethical and privacy issues around using data – and develop the very human skills required to ensure that technologies are used for good.

As well as supporting children to study STEM subjects, the panelists agreed, schools and parents should help them develop the essential human characteristics of curiosity and empathy.

“If we have those two things, we’re going to make really good government,” said Wiseman. “Because curiosity is continuously looking at the data and saying: ‘What does it mean? How do we do better? Are we meeting our customers’ needs?’ And empathy is that connection to what the customer, what the citizen, what the public needs.”

Previous Global Government Forum articles on the Putting Citizens First conference focused on the value of building services around users; blockchain’s journey through the ‘hype cycle’; and the potential to create fully-automated public services.

About Natalie Leal

Natalie Leal is an NCTJ qualified journalist based in the UK. She holds a BSc and Master's degree in Social Anthropology and writes about society, poverty, politics, welfare reform, innovation and sustainable business. Her work has appeared in The Guardian, Positive News, The Brighton Argus, UCAS, Welfare Weekly, Bdaily News and more.
