Elasticsearch Data Breach Analysis: Searching 500M Records

Searching a data breach with Elasticsearch (Dec 22, 2022)

Introduction

When 500 million Facebook records appeared online, I wanted to know if I was in the leak. I was not. But the files were huge, and searching them with `grep` was painful.

So I used Elasticsearch and Kibana to index the data and query it like a search engine. It handles indexing, sharding, and performance, and Kibana gives you a friendly UI. I had used Elastic before and knew it could handle the scale.


Here is how I analyzed the UK data set.

Steps

Step 1 - Set up an Elasticsearch stack

For this project, the Elastic stack is just Elasticsearch + Kibana, and Docker makes both easy to run. If you also need to send email alerts from a container, I wrote a short guide here: docker send email with bytemark/smtp.

Save this as `docker-compose.yml` and run `docker compose up`. Once it is running, open Kibana at http://localhost:5601.

```yaml
version: "3.3"
services:
  elasticsearch:
    container_name: es-container
    image: docker.elastic.co/elasticsearch/elasticsearch:7.11.0
    environment:
      - xpack.security.enabled=false
      - "discovery.type=single-node"
    networks:
      - es-net
    ports:
      - 9200:9200
  kibana:
    container_name: kb-container
    image: docker.elastic.co/kibana/kibana:7.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://es-container:9200
    networks:
      - es-net
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601
networks:
  es-net:
    driver: bridge
```
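Elasticsearch can take a little while to start inside the container, so it may be worth waiting for it before opening Kibana. The following is a minimal Python sketch, not part of the original setup: it polls the cluster health endpoint on port 9200 (the port published in the compose file) until the status is yellow or green. The function names and retry settings are illustrative assumptions.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """Return the parsed JSON from Elasticsearch's cluster health endpoint."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts=30, delay=2.0):
    """Poll until the cluster reports yellow or green; return the status, else None."""
    for _ in range(attempts):
        try:
            status = fetch().get("status")
            if status in ("yellow", "green"):
                return status
        except OSError:
            # Connection refused or timed out while the container is still starting.
            pass
        time.sleep(delay)
    return None


# Example: wait_until_ready() polls http://localhost:9200 for up to about a minute.
```

A single-node cluster often reports yellow rather than green once it holds indices configured with replicas, which is why both are accepted here.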


Step 2 - Clean the data

The UK data was three large CSV files (200-400 MB each). Kibana's file uploader accepts at most 100 MB by default and has no bulk import, so I split each file into smaller chunks.

I aimed for 200k rows per chunk (about 15 MB), which let me validate the data in Kibana and then switch to the REST API for the full import. I kept the three files separate in case they had different shapes.

I created three directories (part_1, part_2, part_3) and wrote a bash script to split each file into 200k-row chunks, prepend a header row to each chunk, and write the parts out. It is rough, but it worked.

```bash
# The part_1, part_2 and part_3 directories must already exist.
for idx in 1 2 3
do
    FILENAME="${idx}.csv"
    HDR='phone_number:id:firstname:lastname:gender:location_one:location_two:relationship_status:works_at:account_create_date:email:birthdate'

    # Split into 200k-line pieces named xyzaa, xyzab, ...
    split -l 200000 "$FILENAME" xyz

    n=1
    for f in xyz*
    do
        # Write the header first, then append the chunk's rows.
        echo "${HDR}" > "part_${idx}/Part${n}.csv"
        cat "$f" >> "part_${idx}/Part${n}.csv"
        rm "$f"
        (( n++ ))
    done
done
```
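Since the three source files might have different shapes, a quick field count per row can flag problems before import. This is a minimal Python sketch, assuming the `:` delimiter and the 12-column header from the script above; `field_count_histogram` is a hypothetical helper, not something from the original workflow.

```python
import csv
from collections import Counter

EXPECTED_FIELDS = 12  # number of column names in the header used by the split script


def field_count_histogram(lines, delimiter=":"):
    """Count how many rows have each number of fields.

    `lines` is any iterable of strings, for example an open file.
    """
    counts = Counter()
    # QUOTE_NONE treats quote characters as ordinary data (a plain split per line).
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        counts[len(row)] += 1
    return counts


# Example (the path is illustrative):
# with open("part_1/Part1.csv", encoding="utf-8", errors="replace", newline="") as handle:
#     print(field_count_histogram(handle))
```

If any count other than `EXPECTED_FIELDS` shows up, some rows probably contain the delimiter inside a value and would need cleaning before import.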


Step 3 - Load the data (test)

I first used the file uploader to check that the data looked correct. I told Kibana the delimiter was `:` instead of `,`, and that the data had a header row. That gave me a working index and a schema I could reuse.
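For readers who prefer to define the schema by hand, the sketch below builds an index body with an explicit mapping from the header fields. It is an illustration only: the choice of which fields are `keyword` versus `text` is an assumption, and it is not the mapping Kibana generated for this data.

```python
HEADER = (
    "phone_number:id:firstname:lastname:gender:location_one:location_two:"
    "relationship_status:works_at:account_create_date:email:birthdate"
)

# Assumption: these fields are only ever matched exactly, so keyword is enough.
KEYWORD_FIELDS = {"phone_number", "id", "gender", "email"}


def build_index_body(header=HEADER, delimiter=":"):
    """Return a body suitable for `PUT /<index>` on Elasticsearch 7.x."""
    properties = {}
    for name in header.split(delimiter):
        if name in KEYWORD_FIELDS:
            properties[name] = {"type": "keyword"}
        else:
            # Full-text search plus an exact-match sub-field for aggregations.
            # Dates stay as text because their format is not stated in this post.
            properties[name] = {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    return {"mappings": {"properties": properties}}
```

The resulting dictionary can be serialized with `json.dumps` and sent when creating the index, before any documents are loaded.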


Step 4 - Load the data (real)

The UI only lets you upload one file at a time, so I switched to the REST API. I used elasticsearch-loader (a Python command-line tool) to load each chunk into the same index.

```bash
for idx in 1 2 3
do
    for file in part_${idx}/*
    do
        echo "Uploading: ${file}"
        elasticsearch_loader --index facebook_uk_breach csv "${file}" --delimiter ":"
    done
done
```
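To show what a bulk load over the REST API involves, here is a minimal Python sketch that turns delimited rows into the newline-delimited JSON body accepted by Elasticsearch's `_bulk` endpoint. It is not a description of elasticsearch-loader's internals; `bulk_body` and the `_extra` field name are hypothetical.

```python
import csv
import json


def bulk_body(lines, index, delimiter=":"):
    """Turn delimited text (header row first) into an NDJSON _bulk request body."""
    reader = csv.DictReader(
        lines,
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
        restkey="_extra",        # hypothetical field name for surplus columns
    )
    out = []
    for row in reader:
        out.append(json.dumps({"index": {"_index": index}}))  # action line
        out.append(json.dumps(row))                            # document line
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in out)
```

The returned string would be sent as a `POST` to `/_bulk` with the `Content-Type: application/x-ndjson` header, ideally in batches of a few thousand rows rather than one request per file.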

Step 5 - Investigate

With the data indexed, I searched for friends and family to see if they were affected. Some were.

Kibana's Discover view lets you query with KQL (the Kibana Query Language). For example: `firstname:"Adam" AND lastname:"Fallon"`. Queries that would take minutes with `grep` return in milliseconds.

I also filtered by fields, for example `works_at:"Github"` and `location_one:"London"`.
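The same kind of search can be run outside Kibana through the `_search` endpoint. The sketch below builds a Query DSL body that is roughly equivalent to a KQL query of quoted phrases joined with AND; `phrase_filter_query` is a hypothetical helper, and the exact query Kibana generates may differ.

```python
def phrase_filter_query(**fields):
    """Build a bool query that requires every field to match its phrase."""
    return {
        "query": {
            "bool": {
                # Filter clauses must all match and do not affect scoring.
                "filter": [
                    {"match_phrase": {name: value}}
                    for name, value in fields.items()
                ]
            }
        }
    }


# Example: the body for firstname:"Adam" AND lastname:"Fallon"
example = phrase_filter_query(firstname="Adam", lastname="Fallon")
```

Serialized as JSON, `example` could be posted to `/facebook_uk_breach/_search`, the index name used by the load script.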


Some quick results:

  • `*` (a wildcard that matches every document): 10,620,521 hits (the dump I had was a subset of the full leak)
  • Rows with workplace set to "MI6": 136 hits
  • Rows where hometown matched my hometown: 721 hits
  • Rows where members worked at the Conservative Party: 82 hits
  • Rows where the name matched an old school friend: 1 hit (the phone number was correct)

[Figure: most common jobs]

[Figure: most common location and relationship status]
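Rankings such as the most common jobs can also be computed directly with a terms aggregation. This is a minimal sketch, assuming the field has an exact-match `keyword` sub-field (for example `works_at.keyword`); whether that sub-field exists depends on the mapping, and both helper names are hypothetical.

```python
def top_values_request(field, size=10):
    """Build a search body that returns only the most frequent values of `field`."""
    return {
        "size": 0,  # skip the hits themselves, only the aggregation is needed
        "aggs": {"top_values": {"terms": {"field": field, "size": size}}},
    }


def buckets_to_pairs(response):
    """Extract (value, count) pairs from a terms aggregation response."""
    buckets = response["aggregations"]["top_values"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Example: top_values_request("works_at.keyword") is posted to /<index>/_search,
# and buckets_to_pairs() is applied to the parsed JSON response.
```

Terms aggregations run against exact values, so pointing one at an analyzed `text` field fails unless fielddata is enabled for it.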

Step 6 - Mapping regions where users were affected

This did not work well. The breach data had free-text location fields, so matching them against geographic regions was messy. I tried the Maps app in Kibana, but the results were noisy.

  • Add a layer to a map
  • Create the layer
  • Modify the layer
  • Add ISO-3166-2 Code

[Figure: the final mapped data]
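Part of the noise comes from turning free text into region codes. The sketch below shows one naive way to do that, assuming locations look like "City, Country" (an assumption about the data, not something confirmed above). The lookup table covers only the four UK constituent countries and `to_iso_3166_2` is a hypothetical helper.

```python
# ISO 3166-2 codes for the four UK constituent countries.
COUNTRY_CODES = {
    "england": "GB-ENG",
    "scotland": "GB-SCT",
    "wales": "GB-WLS",
    "northern ireland": "GB-NIR",
}


def to_iso_3166_2(location):
    """Map the last comma-separated part of a location to a code, or None."""
    if not location:
        return None
    last_part = location.split(",")[-1].strip().lower()
    return COUNTRY_CODES.get(last_part)


# Values that do not end in a known name stay unmapped, which is where
# most of the noise on the map would come from.
unmapped = to_iso_3166_2("London, United Kingdom")  # None
```

A real cleanup would need a much larger gazetteer and fuzzy matching, which is beyond what this sketch attempts.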

Conclusion

Elasticsearch is a fast way to search huge data sets. Once indexed, you can ask complex questions and get answers in milliseconds.

This breach is still bad news. Phone numbers are treated like security tokens, and they are hard to change. Having 500 million of them floating around is a real problem for anyone affected.

If you are interested in other database techniques, check out my post on building location-based features with Postgres geospatial queries.