Searching a data breach with ElasticSearch
Introduction
Facebook managed to somehow degrade user trust even further when 500m user records showed up online. The breach included users' email addresses, locations, workplaces and, most concerningly, phone numbers. I was curious whether I was affected, as I have an account on many of their services.
Thankfully I was not, but after scanning through the files with grep I wondered if there was an easier way to do it. So I turned to Elasticsearch.
Elasticsearch allows you to index data from numerous data sources. It’s sort of like search-in-a-box. You install it on a server, point your data at it and fire away.
It handles all the complicated indexing, sharding and computer-science parts of making search fast and effective, and it gives you a lovely web interface called Kibana to search through the data.
I had used ES in many projects before, both professionally and in side projects, and I knew it could handle the data I was going to throw at it.
Here are the steps I went through to analyse the UK data set:
Steps
Step 1 - Set up an Elasticsearch stack
There are a few main parts to an “Elastic stack”: I needed Elasticsearch to handle the upload and indexing, and Kibana to view the data in a web interface. This is very simple if you use Docker. Here is the docker compose file I used. Save it somewhere as docker-compose.yml, run docker compose up, and it will be running in minutes. Once it’s done loading you can reach Kibana at http://localhost:5601
version: "3.3"
services:
  elasticsearch:
    container_name: es-container
    image: docker.elastic.co/elasticsearch/elasticsearch:7.11.0
    environment:
      - xpack.security.enabled=false
      - "discovery.type=single-node"
    networks:
      - es-net
    ports:
      - 9200:9200
  kibana:
    container_name: kb-container
    image: docker.elastic.co/kibana/kibana:7.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://es-container:9200
    networks:
      - es-net
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601
networks:
  es-net:
    driver: bridge
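Once both containers are up, a quick sanity check is to curl Elasticsearch directly (security is disabled in this compose file, so no auth is needed); it should return a small JSON blob with the cluster name and version:

curl http://localhost:9200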
Step 2 - Clean the data
Once my stack was up and running I needed to take the 3 very large CSVs containing the Facebook UK data (they were between 200 and 400MB each) and get them into Elasticsearch somehow. The first thing I tried was the fancy AI/ML File Search tool. Weirdly, it has a 100MB file limit, so it didn’t even suit my smallest file, and it doesn’t have a bulk importer either. Useless! So I needed to split the CSVs into smaller parts.
I did some napkin math and found that if each file had about 200k rows, the chunks would end up at around 15MB each. That meant I could use the file uploader to check my data looked about right in Kibana, then move on to the Elasticsearch REST API to upload the parts one by one.
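If you want to redo that napkin math on your own files, standard coreutils are enough. This is just a sketch; the figures in the comments are my rough numbers, not exact:

wc -l 1.txt   # total rows in the source file (~1.7m for the biggest one)
du -h 1.txt   # total size on disk
# bytes per row ≈ size / rows, so a 200k-row chunk ≈ 200,000 × that average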
At this point I wasn’t sure why the data was split into three files (1.txt, 2.txt and 3.txt). I assumed there might have been some subtle differences between the data in each set (not my first time cleaning data), so I decided to keep them separate so I could validate that they were all roughly the same shape.
I created three different directories (part_1, part_2 & part_3) and wrote a little bash script to split each file into 200k-row chunks, attach a header I guessed was right from manually scanning through the data, and spit each chunk out into its corresponding folder. E.g. 1.txt, with about 1.7m rows, gets divided into 200k-row pieces, and for each piece the script writes the header, appends the rows and saves the result to disk as part_1/Part{N}.csv.
Here is that script (listen here, I know I am not good at Bash, and this probably sucks, but it works!)
for idx in 1 2 3
do
  # The source file and the header row I guessed from eyeballing the data
  FILENAME="${idx}.csv"
  HDR='phone_number:id:firstname:lastname:gender:location_one:location_two:relationship_status:works_at:account_create_date:email:birthdate'

  # Split into 200k-row pieces with a temporary "xyz" prefix
  split -l 200000 "$FILENAME" xyz

  # Prepend the header to each piece and move it into its part_N folder
  n=1
  for f in xyz*
  do
    echo "${HDR}" > "part_${idx}/Part${n}.csv"
    cat "$f" >> "part_${idx}/Part${n}.csv"
    rm "$f"
    ((n++))
  done
done
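To sanity-check the output I’d suggest eyeballing a couple of the chunks, something along these lines:

ls part_1 | wc -l            # how many chunks came out of the first file
head -n 2 part_1/Part1.csv   # the guessed header plus the first data row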
Step 3 - Load the data
Now I had my little chunks, I wanted to do a test run using the crappy file uploader. I dropped a part into the interface and changed a few settings to let Elasticsearch know that I was using : rather than , as my delimiter, and that the data had a header row.
I uploaded it all and it looked about right. First time, every time, baby. The benefit of doing this part in the web interface is that you can wander your way around the data and turn it into structured output that Elastic will understand. In turn you get an index out of it, and that index can then take more uploads of data with the same shape, which is exactly what I wanted to do.
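If you’re curious what mapping the uploader settled on, you can ask Elasticsearch directly. I’m assuming here the index is called facebook_uk_breach, the same name the bulk upload below targets; swap in whatever you named yours:

curl http://localhost:9200/facebook_uk_breach/_mapping?pretty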
Step 4 - Really load the data
Okay cool, the data looks OK. There’s a tonne of missing data, but that’s fine, not everything in this world is perfect, and it looked like I was ready to rock. I needed to use the REST API to upload all the little chunks because I wasn’t about to click 69 times on that stupid little screen (seriously, the shift-click thing doesn’t work; there is an honest-to-god limit of a single file per upload in that interface).
I knew I had an index from the previous step and I wanted all the rest of the parts indexed against it, and I figured there was some tool out there that could do that without me fudging around writing my own script. I settled on elasticsearch-loader (a Python lib that you can install through pip) because it was the first one I found and it worked. That is as good a reason as any.
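Installing it is a one-liner; the PyPI package is elasticsearch-loader and the command it gives you is elasticsearch_loader:

pip install elasticsearch-loader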
So I made another little bash script to loop through all the parts, throwing each part at elasticsearch_loader with the index and the delimiter, and I was done with the upload. Very simple!
for idx in 1 2 3
do
  for file in part_${idx}/*
  do
    echo "Uploading: ${file}"
    # Send each chunk to the existing index, using ':' as the CSV delimiter
    elasticsearch_loader --index facebook_uk_breach csv "${file}" --delimiter ":"
  done
done
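Once the loop finishes, hitting the standard _count API is an easy way to confirm everything landed in the index:

curl http://localhost:9200/facebook_uk_breach/_count?pretty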
Step 5 - Investigate
So now I had my data, I wanted to have a peek at all my friends’ names to let them know if they had been affected. A few had been. :( I am sure they will enjoy the countless spam they will receive in the years to come.
The way I did this was with the “Discover” feature in Kibana. This lets me write queries using KQL, a weird language that sort of looks like SQL. By creating a query like firstname:"Adam" AND lastname:"Fallon", Elasticsearch quickly races through the data in ways you can’t even fathom and returns a result in milliseconds.
It was like a slightly nerdier Google I had made over this data, and I sat there thinking of things to type into the search box. I was able to filter using the fields, e.g. works_at:'Github' and location_one:'London'.
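If you’d rather skip Kibana, roughly the same search can be fired at the REST API directly. This is just a sketch using the standard _search endpoint with a query_string query; the field names come from the header I guessed earlier:

curl -X GET "http://localhost:9200/facebook_uk_breach/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"query_string": {"query": "firstname:\"Adam\" AND lastname:\"Fallon\""}}}'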
Some results:
- * (a star in Elasticsearch matches every result): 10,620,521 hits (the dump I had was a subset)
- Rows who have put their workplace as “MI6”: 136 hits
- Rows whose hometown matches my hometown: 721 hits
- Rows where the members work at the Conservative Party: 82 hits
- Rows where the name matched an old school friend who I hadn’t talked to in a while: 1 hit, and the number was correct.
Most common jobs:
Most common location and relationship status:
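Those two breakdowns are just terms aggregations under the hood. The “most common jobs” one looks roughly like this against the REST API; I’m assuming works_at ended up with a keyword mapping (or a .keyword sub-field), so adjust the field name to whatever the _mapping call above reports:

curl -X GET "http://localhost:9200/facebook_uk_breach/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"size": 0, "aggs": {"top_jobs": {"terms": {"field": "works_at.keyword", "size": 10}}}}'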
Step 6 - Mapping regions where users were affected
This didn’t work very well. The data in the breach had two address fields, and they seem to be allowed free-text entry here. I tried to use Elasticsearches in-built mapping data to join on the location fields by their full name, but as you can see it didn’t work very well.
- Add a layer to a map;
- Create the layer;
- Modify the layer;
- Add ISO-3166-2 Code;
- The final mapped data;
Conclusion
So there you have it: a better way to search over this huge data breach, with the ability to build complex queries and get results back from millions of rows in milliseconds.
This data breach is really not good news. Phone numbers should be private because, for some reason, we treat them as security tokens.
Having 500m of them floating around in an easily accessible text file is a gigantic pain in the ass for many people. You can’t change your phone number without a lot of trouble, which makes it especially bad. To the people who have been affected by this, I am truly sorry.