Big data is data collected from the internet. The information is supposed to be anonymised, but it is becoming increasingly apparent how easy it is to identify individuals in the datasets. (Photo: Shutterstock)

Your identity is not safe in anonymous data

New study reveals how easy it is to identify anonymous credit card users.

Scientists from the Massachusetts Institute of Technology (MIT) and Aarhus University have shown how easy it is to identify people from datasets containing anonymous credit card transactions.

The method is described in a study recently published in Science.

"The results came as a surprise because we had not envisioned that the chance of identifying people would be so big," says co-author Laura Radaelli, now a post-doc at Tel Aviv University after having completed her Ph.D. at Aarhus University in 2014.

Radaelli studied big data at the Department of Computer Science at Aarhus University and it was in connection with an extended visit to MIT that the research into credit card data fell into place.

The researchers at MIT had already demonstrated how easy it was to identify users in a mobile phone dataset even though both their names and their phone numbers had been removed.

Simultaneously, an increasing number of scientific journals started requiring that the datasets behind articles based on big data be published, so that other scientists could scrutinise the results.

"Although it’s become easier to collect big data, not all scientists are able to do so. Sharing the data opens up the opportunity for many different researchers to find different interesting results in the dataset. But this data is on people -- so before we can share it we need to know that it doesn’t pose any risk to them," says Radaelli.

Just a few purchases can reveal identity

The scientists set out to find out whether anonymisation -- which normally involves the removal of names, addresses, account numbers, telephone numbers, and other indicators which can identify a person -- is enough.

The names are often replaced with a unique user number, making it possible to follow the person's purchasing habits without knowing who the person actually is.

In order to test the anonymity, the scientists used a dataset containing the credit card transactions of 1.1 million people in 10,000 shops over a period of three months.

The aim was to find out how much you had to know about a person in the real world to find out who they were in the dataset.

The dataset also contained, in addition to the anonymous user number, the shop in which the purchase was made, the date of the purchase, and the price of the goods.

"Let's assume that I know you visited four specific shops on four specific dates," says Radaelli. "In 90 per cent of cases you’ll be the only person to have done that."

So nine out of ten people could be identified in the dataset on the basis of knowledge about the place and date of only four transactions.
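
The kind of test described above can be sketched in a few lines of Python. This is only an illustration on a made-up toy dataset, not the researchers' code: the records, the user numbers, and the function name `matching_users` are all assumptions introduced for the example. It shows how each extra known (shop, date) point shrinks the set of users who could be the person in question.

```python
from collections import defaultdict

# Hypothetical toy records: (anonymous user number, shop, date).
transactions = [
    ("u1", "bakery", "2014-01-03"),
    ("u1", "shoe shop", "2014-01-05"),
    ("u1", "cafe", "2014-01-07"),
    ("u2", "bakery", "2014-01-03"),
    ("u2", "cafe", "2014-01-07"),
    ("u3", "bakery", "2014-01-03"),
    ("u3", "shoe shop", "2014-01-06"),
]

# Index every user's set of (shop, date) visits.
visits = defaultdict(set)
for user, shop, date in transactions:
    visits[user].add((shop, date))


def matching_users(known_points):
    """Return the users whose records contain every known (shop, date) point."""
    known = set(known_points)
    return sorted(user for user, points in visits.items() if known <= points)


# Knowing one visit leaves several candidates ...
one_point = matching_users([("bakery", "2014-01-03")])
# ... while knowing a second visit may already single out one user.
two_points = matching_users(
    [("bakery", "2014-01-03"), ("shoe shop", "2014-01-05")]
)

print(one_point)
print(two_points)
```

In this toy data, one known visit matches three users, while two known visits leave only one. When a single user remains, the "anonymous" user number has effectively been tied to a real person.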

Knowing the precise price would naturally make it easy for the scientists to find people since they could simply search the transactions with the precise amount and see who was responsible for it. Instead, they made do with looking at approximate prices, which could be guessed by looking at what the purchaser left the shop with.
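
A rough sketch of how an approximate price can narrow the search is given here in Python. It is an illustration under stated assumptions, not the method used in the study: the toy records, the fixed band width of 50, and the names `price_band` and `candidates` are all hypothetical.

```python
# Hypothetical toy records: (anonymous user number, shop, date, price).
transactions = [
    ("u1", "shoe shop", "2014-01-05", 79.90),
    ("u2", "shoe shop", "2014-01-05", 24.50),
    ("u3", "shoe shop", "2014-01-05", 82.00),
    ("u4", "shoe shop", "2014-01-05", 310.00),
]


def price_band(price, width=50):
    """Map a price to a coarse band index (the fixed width is an assumption)."""
    return int(price // width)


def candidates(shop, date, guessed_price=None):
    """Users with a purchase at this shop on this date.

    If a guessed price is given, keep only purchases in the same price band.
    """
    return sorted({
        user
        for user, s, d, price in transactions
        if s == shop
        and d == date
        and (guessed_price is None
             or price_band(price) == price_band(guessed_price))
    })


without_price = candidates("shoe shop", "2014-01-05")
with_price = candidates("shoe shop", "2014-01-05", guessed_price=90)

print(without_price)
print(with_price)
```

Even a coarse guess removes the purchases that are clearly cheaper or more expensive, so fewer users remain as possible matches.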

"If you made your purchase in a shoe shop and we could guess an approximate price, the probability of your being identified increased by 22 per cent," says Radaelli, adding that women and people with high incomes were the easiest to identify.

"We have not examined the reason for this. It was simply an observation we found worth noting, to make people aware that some individuals can be easier to identify than others," she says.

This should be taken seriously: professor

At the Technical University of Denmark (DTU), assistant professor Sune Lehmann, who studies big data at the Department of Mathematics and Computer Science, says the debate caused by the new study is important.

"It points to an important problem and presents a fine analysis," says Lehmann.

To him, this is yet another reminder of the fact that it’s really dangerous to share data if all you do is remove the indicators. Such data should not be regarded as anonymised data but rather as de-identified, he says.

"I could've gone on Twitter and posted that I'd bought new runners or logged on Amazon and shared that I'd bought new jogging pants," says Lehmann. "If you make these things public, it's possible to connect a few dots and find people in the datasets," he says.

"You need to take this pretty seriously," says Lehmann. "It can be really difficult to know beforehand what it takes to re-identify people. As citizens we need to start recognising these concepts."


Read the original story in Danish on

Translated by: Hugh Matthews
