‘Publish and polish’: how to clean up dirty data

By Natalie Leal on 05/06/2019 | Updated on 05/08/2019
Debating data: At Putting Citizens First, the international panel debated how to handle flawed datasets

Public sectors may be short of cash, but they’re rich in data – giving them the tools to dramatically improve public services. Before picking up those tools, though, civil servants must ensure they’re fit for purpose; at Putting Citizens First, a panel debated how to clean and deploy flawed datasets. Natalie Leal reports

“We really are awash in an ocean of data,” said Jane Wiseman. “There are five billion internet searches a day, 14 million text messages sent every minute; all of this is data that’s created.” And this ocean is growing rapidly: “90% of the data ever created in the history of the world was created in the last couple of years.”

Government creates “tons of data,” she continued – including “all kinds of data about requests for service and service usage and quality of service usage.” But while governments are saturated in data, only 1% of it has so far been analysed, said Wiseman – who, as CEO at the US Institute for Excellence in Government, helps agencies to improve their performance using data analytics. There is enormous “untapped potential” for governments to make much better use of their vast banks of information, she argued.

Wiseman was talking during the Data and Analytics session at Putting Citizens First 2019: a Global Government Forum conference, organised in association with Yesser and EY. Three other panelists joined Wiseman at the event to give their thoughts on the subject: John Kedar, director of international engagement at the UK's Ordnance Survey; Ott Vatter, managing director of Estonia's E-Residency scheme; and Basem Aljedai, VP of innovation & digital capability at Saudi Arabia's e-government programme Yesser.

Let the public polish

But before governments can make use of their data, they must understand and address its flaws: most datasets are riddled with omissions, errors, and the consequences of bias or duplicity among their human creators.

According to Wiseman, there is a 13% error rate in data entry. “There’s a lot of missing data in a typical government data set,” she said – and just as many inaccurate or transposed values. “Typically, what we have isn’t perfect. And we don’t know that until we publish it,” she commented. “In fact, most data scientists tell me they spend 80, 85, 95 per cent of their time cleaning the data.”
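
To give a flavour of where that cleaning time goes, here is a minimal sketch in Python’s pandas library – hypothetical column names and values, not any agency’s real dataset – covering the kinds of missing, mistyped and implausible entries Wiseman describes:

```python
import pandas as pd

# Hypothetical service-request extract with typical data-entry problems:
# missing values, an implausible date, and a numeric field stored as text.
df = pd.DataFrame({
    "request_id": [101, 102, 103, 104],
    "district":   ["North", None, "north ", "South"],
    "opened":     ["2019-05-01", "2019-05-02", "2105-05-03", "2019-05-04"],
    "days_open":  ["3", "5", None, "seven"],
})

# 1. Quantify missingness before touching anything.
print(df.isna().mean())                      # share of missing values per column

# 2. Normalise free-text categories (stray spaces, inconsistent case).
df["district"] = df["district"].str.strip().str.title()

# 3. Coerce types; values that cannot be parsed become NaN/NaT for review.
df["days_open"] = pd.to_numeric(df["days_open"], errors="coerce")
df["opened"] = pd.to_datetime(df["opened"], errors="coerce")

# 4. Flag implausible values (e.g. dates in the future) rather than silently dropping them.
df["suspect_date"] = df["opened"] > pd.Timestamp.today()
print(df)
```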

The way government agencies handle their data can also be problematic. Analysts require raw data: the unprocessed, original information. But instead, governments often publish aggregate data, gleaned from various sources and processed – reducing its value to data scientists.

For example, said Aljedai, the Saudi Ministry of Education may publish aggregate data on schools, combining all the information from 400 or 500 schools into one record. “That benefits nobody,” he said. “What we need is all the schools, geolocations of these schools, size of the schools, how many students there are, the grades of the students.”
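
To make the raw-versus-aggregate distinction concrete, a small illustration in pandas – with invented figures, not the ministry’s actual data – shows how a single aggregate record discards exactly the per-school fields Aljedai lists:

```python
import pandas as pd

# Hypothetical per-school records: this is the "raw" data analysts need.
schools = pd.DataFrame({
    "school":    ["A", "B", "C"],
    "latitude":  [24.71, 24.63, 24.80],
    "longitude": [46.68, 46.72, 46.61],
    "students":  [420, 655, 310],
    "avg_grade": [71.2, 64.5, 78.9],
})

# The aggregate release collapses everything into a single record...
aggregate = schools.agg({"students": "sum", "avg_grade": "mean"})
print(aggregate)   # locations, sizes and per-school grades are gone

# ...whereas the raw table still supports school-level analysis,
# e.g. which schools combine large enrolments with low average grades.
print(schools.sort_values("avg_grade").head(1))
```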

Public bodies also often publish information in PDF format, but this renders it inaccessible to analytics applications. “My term for PDF is ‘pretty darn frustrating’,” said Wiseman, “because we can’t use it.” Operators need Excel files, or other formats from which data can be exported.

Wiseman: “My term for PDF is ‘pretty darn frustrating’.”

Computers can be racist too

There is also the tricky issue of bias. In the US, for instance, “there’s a lot of discussion around our justice system data having racial bias: decades of heavily-policed urban neighbourhoods, that are primarily African-American, has caused a disproportionate number of arrests of black men in our country,” said Wiseman. “So if you look at arrest data, you see it’s racially biased.”

Prejudice on the part of American police officers, of course, also skews arrests. And these inherent biases in the data represent a major danger, because feeding this raw information into – for example – a predictive analytics system could generate results encouraging officers to arrest African-Americans, embedding human prejudice into the technology’s operation.

Aljedai: Publish raw data, not aggregated datasets

Tackling prejudice

People can be reluctant to use data analytics because of these in-built biases. But Wiseman encouraged the audience to instead identify traces of such discrimination in datasets, and to compensate for them by amending the data or tweaking the algorithms. “Don’t throw the data out with the bathwater: just recognise the bias, and say we are going to do our best, because human decision-making is biased; data decision-making is biased,” she said.
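
One simple way to “recognise the bias” before any model is trained – a hedged sketch with invented numbers, not a method the panel endorsed – is to compare outcome rates by group and then reweight records so that no group/label combination dominates the training data:

```python
import pandas as pd

# Invented arrest records: the positive label is far more common for group "b",
# mirroring the kind of skew Wiseman describes in historical justice data.
df = pd.DataFrame({
    "group":    ["a"] * 6 + ["b"] * 6,
    "arrested": [0, 0, 0, 0, 0, 1,   1, 1, 1, 1, 0, 1],
})

# 1. Recognise the bias: compare arrest rates across groups.
print(df.groupby("group")["arrested"].mean())   # group a ~0.17, group b ~0.83

# 2. One crude mitigation: reweight records so each group/label cell
#    contributes equally to whatever model is trained downstream.
counts = df.groupby(["group", "arrested"]).size()
df["weight"] = df.apply(
    lambda r: len(df) / (counts[(r["group"], r["arrested"])] * len(counts)), axis=1
)
print(df.groupby(["group", "arrested"])["weight"].sum())   # now balanced across cells
```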

To tackle these quality issues, governments first need to “publish and polish” the data they hold, said Wiseman. Then feedback from businesses, academics and communities can alert agencies to its problems, enabling them to take the next step of “polishing” it before it’s used.

So the open data agenda is an important facilitator of public sector analytics, said Wiseman: “It’s how we get to the next step of having better data and more data; and the more high-quality data we have, the better we are.”

Vatter: “When you are injured, I would like the hospital to know [your medical] information in advance.”

Data in action

One New Orleans project demonstrates the power of data, said Wiseman. Following a lethal house fire, the city’s fire and data chiefs brought together data on housing – covering key variables such as age and construction method – and information on which kinds of homes were most likely to have a smoke alarm. “Putting disparate data sources into one database, developing risk profiles,” they then handed out free smoke alarms at the points of greatest risk; very soon, lives were being saved. “Data really does have the ability to transform government: it can change lives, it can save lives,” added Wiseman.
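
For the technically curious, the mechanics of “putting disparate data sources into one database, developing risk profiles” can be sketched in a few lines of pandas – invented fields and weights, not the New Orleans team’s actual pipeline:

```python
import pandas as pd

# Invented extracts standing in for the disparate sources described above.
housing = pd.DataFrame({
    "block_id":   [1, 2, 3],
    "year_built": [1952, 1988, 2005],
    "wood_frame": [True, True, False],
})
alarm_survey = pd.DataFrame({
    "block_id":         [1, 2, 3],
    "alarm_likelihood": [0.35, 0.60, 0.90],   # estimated share of homes with a smoke alarm
})

# Put the sources into one table keyed on a shared identifier.
merged = housing.merge(alarm_survey, on="block_id")

# A simple illustrative risk score: older wood-frame housing with few alarms scores highest.
merged["risk"] = (
    (2025 - merged["year_built"]) / 100
    + merged["wood_frame"].astype(int)
    - merged["alarm_likelihood"]
)

# Target free smoke-alarm distribution at the highest-risk blocks first.
print(merged.sort_values("risk", ascending=False))
```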

Another life-saving example comes from the UK, explained Ordnance Survey’s John Kedar. “By combining data from many sources – predictions for rain, Environment Agency flooding likelihood statistics, and so on – we can predict where flooding will happen, and then the local authorities can help move citizens out the way before the flooding occurs,” he explained.

In Estonia, efficient data-sharing between government agencies and health providers improves emergency medical care, said Ott Vatter – ensuring that ambulance teams and receiving hospitals have accurate information on patients’ allergies and medical histories. “When you are injured, I would like the hospital to know [your medical] information in advance, so the likelihood of my survival would be greater,” he said.

Kedar: “By combining data from many sources… we can predict where flooding will happen.”

Good AI demands emotional intelligence

As countries move towards ever-greater reliance on data to drive operations, it’s important to ensure that citizens are familiar with the issues around generating and using data. A proportion, of course, need the technical abilities to help lead the digital revolution – using the science, technology, engineering and maths (STEM) skills so heavily promoted in many countries. But automation and machine learning are reducing the need for basic and mid-level programmers and IT staff; in the panelists’ view, it’s just as important that people understand the ethical and privacy issues around using data – and develop the very human skills required to ensure that technologies are used for good.

As well as supporting children to study STEM subjects, the panelists agreed, schools and parents should help them develop the essential human characteristics of curiosity and empathy.

“If we have those two things, we’re going to make really good government,” said Wiseman. “Because curiosity is continuously looking at the data and saying: ‘What does it mean? How do we do better? Are we meeting our customers’ needs?’ And empathy is that connection to what the customer, what the citizen, what the public needs.”

Previous Global Government Forum articles on the Putting Citizens First conference focused on the value of building services around users; technologies and blockchain’s journey through the ‘hype cycle’; and the potential to create fully-automated public services.

About Natalie Leal

Natalie is a freelance journalist whose work has been published by The Sun Online, The Guardian, Novara Media, Positive News, and Welfare Weekly, among others. She also writes reports and case studies on global business trends for the behavioural insights agency Canvas8. Prior to working as a journalist, Natalie worked in public sector social services for several years. She switched careers in 2013 after winning a fully funded NCTJ qualification in a national writing competition. She holds a master’s degree in social anthropology from Sussex University, where she specialised in processes of social change and international conflict and reconciliation.
