The Biggest Data Breaches Aren’t Even Breaches Anymore: The Rise of the Scraped Data Breach

When you picture a “data breach”, you’re probably thinking of an attacker exploiting a vulnerability to sneak confidential data out of an organization and sell it on the dark web. Or perhaps you picture ransomware: an attacker installs malware on a device, pivots through the network, and exfiltrates files before encrypting everything, then demands one ransom for the decryption key and another to keep the stolen files from being posted online.

However, we seem to be hearing more and more about data scraping, whether it’s 23andMe’s large breach, which resulted from only a few compromised accounts, or the Trello breach from publicly available boards.

These breaches aren’t necessarily caused by a vulnerability. Instead, attackers use bots to hoover up data that is already publicly accessible. In an ironic twist, we’ve also seen the MOAB, the “Mother of All Breaches”: a collection scraped from publicly available data breaches.

To scrape at this scale, attackers employ bots, usually pointed at APIs. APIs suit scraping well: they return cleanly formatted JSON, scale up to meet client demand, and expose predictable endpoints.

We’ve seen this in our customers too. In fact, we recently identified that bots make up 80% of login traffic on a single API of one of our customers.

As we take advantage of cloud infrastructure, so do they.


Playing the Blame Game

In the case of 23andMe, the blame for the breach was placed on users reusing passwords. While password reuse is poor practice, it does not explain the scale: the attackers got into 14,000 accounts yet exposed the data of 6.9 million users. How? Well, the attackers were able to use the DNA Relatives feature, which lets users voluntarily share their information with relatives.

This kind of scraping doesn’t rely on the typical vulnerabilities that a WAF can block, because there is no malicious payload with a clear signature to match.

Data scraping looks like legitimate traffic. What sets it apart is usually just volume, and on a high-traffic API even that can be difficult to detect, especially when more sophisticated bot owners spread their attack over a longer time or across more IP addresses, which we call a “low and slow” attack.

Fundamentally, data scraping traffic may not be recognized as malicious, but it should be flagged as anomalous. It generates a lot of traffic on specific API endpoints, and it tends to come from IP addresses associated with data centers rather than home internet users, or to carry mismatched device information, such as an Android version that doesn’t exist.
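
As a rough illustration of those three signals, here is a minimal Python sketch. It is not how any particular product works. The function name, the threshold, the network list, and the Android version allowlist are all assumptions made for the example; in practice the network list would come from a maintained feed of data center ranges.

```python
import ipaddress
import re

# Assumption: placeholder ranges (RFC 5737 documentation blocks) standing in
# for a real, regularly updated feed of data center / cloud provider CIDRs.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Assumption: illustrative allowlist of Android major versions; it would need
# to be kept up to date as new versions ship.
KNOWN_ANDROID_MAJOR_VERSIONS = set(range(1, 17))

ANDROID_VERSION_RE = re.compile(r"Android (\d+)")


def anomaly_signals(ip, user_agent, requests_last_hour, hourly_threshold=1000):
    """Return the names of the anomaly signals a single client trips."""
    signals = []

    # Signal 1: traffic originates from a data center rather than a home ISP.
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_NETWORKS):
        signals.append("datacenter_ip")

    # Signal 2: the device claims an Android version that doesn't exist.
    match = ANDROID_VERSION_RE.search(user_agent)
    if match and int(match.group(1)) not in KNOWN_ANDROID_MAJOR_VERSIONS:
        signals.append("unknown_android_version")

    # Signal 3: unusually high request volume (hypothetical threshold).
    if requests_last_hour > hourly_threshold:
        signals.append("high_volume")

    return signals
```

No single signal proves a client is a bot. Plenty of legitimate users sit behind VPNs hosted in data centers, so a sketch like this is better suited to prioritizing traffic for investigation than to blocking it outright.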

While it can be tempting to blame users for opting into features that allow public sharing of sensitive information, there are steps that organizations can take to combat these attacks, and regulations which say they must.


How To Stop Scraping

Part of what makes APIs the perfect solution for a lot of apps also makes them the perfect target for bot-driven data scraping. You should always investigate anomalous traffic to an API, whether the anomaly is in location, IP address, timing, or volume. Many bot attacks will attempt to fly under the radar, but an attacker may also conduct a burst of requests, so it’s always important to look for spikes in traffic you wouldn’t expect.
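
One simple way to look for unexpected spikes is to compare each interval’s request count with a trailing baseline. The sketch below is a generic statistical check and not a description of any specific tool. The function name, the window size, and the threshold are illustrative assumptions.

```python
from statistics import mean, pstdev


def find_spikes(counts, window=6, threshold=3.0):
    """Return indexes whose count exceeds the trailing mean by
    `threshold` standard deviations.

    `counts` is a list of request counts per interval (for example, per
    hour) for one endpoint.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        # Floor sigma at 1.0 so a perfectly flat baseline does not turn
        # every tiny fluctuation into a "spike".
        limit = mu + threshold * max(sigma, 1.0)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


hourly_requests = [100, 104, 98, 101, 99, 102, 100, 950, 103]
print(find_spikes(hourly_requests))  # prints [7]
```

A check like this only catches bursts. A low and slow attack stays inside the baseline by design, so it needs longer aggregation windows and per-account or per-endpoint totals rather than a short-term comparison.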

On particularly sensitive endpoints, ensure that rate limiting is turned on, and use account lockouts rather than IP bans, so an attacker cannot simply move their attack to a new cloud provider. Regularly review public data and data available to other users, and practice data minimization where you can by only implementing endpoints that return the data the user needs and no more. It is difficult to stop all data scraping completely, but early detection is key to avoiding large-scale data breaches.
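
To make the account-versus-IP point concrete, here is a minimal sketch of a sliding-window rate limiter keyed on the account rather than the source IP, with a temporary lockout once the limit is hit. The class name, the limits, and the lockout duration are assumptions chosen for illustration, and the state is held in memory, which would not survive a restart or work across multiple servers.

```python
from collections import defaultdict, deque


class AccountRateLimiter:
    """Sliding-window limiter keyed by account ID, not by source IP."""

    def __init__(self, max_requests=100, window_seconds=60, lockout_seconds=900):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.lockout_seconds = lockout_seconds
        self._history = defaultdict(deque)  # account_id -> request timestamps
        self._locked_until = {}             # account_id -> unlock timestamp

    def allow(self, account_id, now):
        """Return True if the request may proceed at time `now` (seconds)."""
        # Reject immediately while the account is locked out.
        if self._locked_until.get(account_id, 0) > now:
            return False

        # Drop timestamps that have fallen out of the sliding window.
        history = self._history[account_id]
        while history and history[0] <= now - self.window_seconds:
            history.popleft()

        # Too many requests in the window: lock the account, whichever
        # IP addresses the requests came from.
        if len(history) >= self.max_requests:
            self._locked_until[account_id] = now + self.lockout_seconds
            history.clear()
            return False

        history.append(now)
        return True
```

In a running service, `now` would typically come from something like `time.monotonic()`; it is passed in explicitly here so the behavior is easy to test. Because the key is the account, rotating through IP addresses or cloud providers does not reset the counter.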

It is also important to recognize the regulatory side of data scraping. GDPR and similar legislation consider personal data to be any data that can identify someone. That is not necessarily a single data point: collections of data can also be considered personal data under GDPR. This legislation obliges organizations to take data protection seriously and comply, or face large fines.


The Traceable Advantage

We’ve invested a lot in fraud prevention. Our Digital Fraud Prevention combines our comprehensive threat prevention, advanced anomaly detection, and deep data lake with our context-first API security. We’ve already helped our customers discover large, low and slow bot attacks, in which attackers test the balance of payment accounts or brute force logins, even when the attack is distributed. If you’d like to find out more, you can read the solutions brief or request a demo.