Detecting cheaters in Advent of Code

Introduction


Since the release of ChatGPT, there have been many articles discussing how the tool could be used for cheating in various contexts, such as school exams and GCSE History.

ChatGPT was also used in this year's Advent of Code competition. Its effectiveness appears to have declined after around day 3, though some people claimed it remained useful until day 8 when given additional input. Bots dominating the leaderboard in future competitions using improved versions of ChatGPT is a real concern, and I have been thinking about how to detect it.

AI Detection

Stack Overflow recently announced a policy change to ban auto-generated responses from ChatGPT. I wonder how they will actually enforce that policy as these tools become more powerful, but I think the writing is on the wall for the company.

Once a degree of trust can be established in the veracity of answers from a text transformer trained on code, Stack Overflow ceases to be useful for most use cases.

There are still times it will be useful, but for a programmer looking up something like "network request python", tools like ChatGPT or GitHub Copilot are much easier to use, as they have IDE integration. (Why didn't Stack Overflow make IDE plugins for this stuff years ago? Maybe they did, I am too lazy to look. The point is that it didn't catch on, whereas something like Copilot seems to have a market, given GitHub think they can charge £10 a month for it.)

ChatGPT did my homework

In the very early days of AoC this year there were a few tweets mentioning that the leaders on the board seemed to be answering in superhuman time. Indeed, many people wrote tools to watch the webpage for the moment a new question went live and send the text to ChatGPT to get absurdly fast solves.

As I mentioned, over time this started to die down, but still I feel like next year we might find that the text transformers can perform better.

I had a bit of a think about how you could figure out that someone was doing this, and one of the detection methods I came up with was how quickly someone solved the second star after the first. Some questions can be read and understood quickly (watch someone like Jonathan Paulson to see how fast a human can parse these questions once they know what patterns to look for), but as the competition goes on the delta between star 1 and star 2 seems to grow.

With this in mind I wanted to see if I could detect suspicious users on a leaderboard by getting the times for star 1 and star 2 on any given day.

Advent of Code has an API for private leaderboards that reports when each star was collected, and you can get it for any private leaderboard.
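To make the shape of that data concrete, here is a minimal sketch of reading star timestamps out of a leaderboard export. The field names (`members`, `name`, `stars`, `completion_day_level`, `get_star_ts`) are the ones the script later in this post reads. The sample payload itself is hypothetical: the member IDs, names and timestamps are made up for illustration, and a real export may contain additional fields.

```python
import json

# Hypothetical sample in the shape the script expects.
# IDs, names and timestamps are invented for illustration.
SAMPLE = """
{
  "members": {
    "101": {
      "name": "alice",
      "stars": 4,
      "completion_day_level": {
        "1": {"1": {"get_star_ts": 1669871000}, "2": {"get_star_ts": 1669871300}},
        "2": {"1": {"get_star_ts": 1669957500}, "2": {"get_star_ts": 1669957530}}
      }
    },
    "102": {
      "name": "bob",
      "stars": 1,
      "completion_day_level": {
        "1": {"1": {"get_star_ts": 1669872000}}
      }
    }
  }
}
"""


def star_times(data):
    """Yield (name, day, star1_ts, star2_ts) for every day with both stars."""
    for member in data["members"].values():
        for day, parts in member["completion_day_level"].items():
            if "1" in parts and "2" in parts:
                yield (
                    member["name"],
                    int(day),
                    int(parts["1"]["get_star_ts"]),
                    int(parts["2"]["get_star_ts"]),
                )


# Sort by day so the output reads chronologically.
rows = sorted(star_times(json.loads(SAMPLE)), key=lambda row: row[1])
for name, day, first, second in rows:
    print(f"{name} day {day}: {second - first} seconds between stars")
```

Members who only have the first star for a day (like "bob" here) are skipped, since there is no second timestamp to compare against.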

Using this API and a Python script, it is possible to mark any user who solves the second part within a minute of the first as suspicious, potentially indicating the use of external resources or tools like ChatGPT or Copilot.

The code

#!/usr/bin/python3
# Aims to find users who solved part 2 unreasonably fast after part 1.

import json
import datetime

suspect = []

# Randomly selected point that is sufficiently far in the competition
# 2 stars a day, so days * 2
how_many_completes = 36

# How many minutes do you think it would be weird to get the second part of the question in?
too_fast_minutes = 1

with open('aoc.json') as json_file:
    data = json.load(json_file)
    members = data["members"]

    for member_id in members:
        member = members[member_id]
        name = member["name"]
        stars = int(member["stars"])

        if stars > how_many_completes:
            completes = member["completion_day_level"]
            for day in completes:
                if '1' in completes[day] and '2' in completes[day]:
                    one = datetime.datetime.fromtimestamp(
                        float(completes[day]['1']['get_star_ts']))
                    two = datetime.datetime.fromtimestamp(
                        float(completes[day]['2']['get_star_ts']))

                    delta = two - one

                    if delta <= datetime.timedelta(minutes=too_fast_minutes):
                        print(f"{name} solved day {day} star 2 {delta.total_seconds():.0f} seconds after star 1")
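A fixed one-minute cutoff ignores the observation that the gap between star 1 and star 2 seems to grow as the competition goes on. One possible refinement, which is not part of the script above, is to compare each user's gap against the median gap for that day on the same leaderboard. The sketch below assumes the same JSON field names as the script; `flag_relative`, `fraction` and `min_solvers` are hypothetical names with arbitrary defaults, and the demo members are made up.

```python
from statistics import median


def deltas_by_day(members):
    """Map day -> list of (name, seconds between star 1 and star 2)."""
    by_day = {}
    for member in members.values():
        for day, parts in member.get("completion_day_level", {}).items():
            if "1" in parts and "2" in parts:
                seconds = int(parts["2"]["get_star_ts"]) - int(parts["1"]["get_star_ts"])
                by_day.setdefault(int(day), []).append((member["name"], seconds))
    return by_day


def flag_relative(members, fraction=0.1, min_solvers=3):
    """Return (name, day, seconds, day_median) for unusually small gaps.

    A gap is flagged when it is below fraction * the day's median gap.
    Both defaults are arbitrary illustrative choices, not tuned values.
    """
    flagged = []
    for day, rows in sorted(deltas_by_day(members).items()):
        if len(rows) < min_solvers:
            continue  # too few solvers for the median to mean much
        typical = median(seconds for _, seconds in rows)
        for name, seconds in rows:
            if seconds < fraction * typical:
                flagged.append((name, day, seconds, typical))
    return flagged


def _member(name, day, first, second):
    """Build a made-up member record with both stars on a single day."""
    return {
        "name": name,
        "completion_day_level": {
            str(day): {"1": {"get_star_ts": first}, "2": {"get_star_ts": second}}
        },
    }


demo = {
    "1": _member("alice", 10, 1000, 1600),  # 600 seconds
    "2": _member("bob", 10, 1000, 1900),    # 900 seconds
    "3": _member("carol", 10, 1000, 1020),  # 20 seconds
}
print(flag_relative(demo))
```

On a small private leaderboard the median is noisy, so this is better treated as a prompt to look closer than as proof of anything. Some puzzles also have a second part that really is a one-line change.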

Conclusion

This was ultimately a fun but silly little fraud detection exercise on something with low stakes, but the ramifications of these tools continue to be something that we will have to address at some point.

Detecting fraud is an interesting problem, but it gets harder as these tools become more widespread. The technology is already widely available, and we have yet to work out how to deal with it in our cultures.

While some people may dismiss the potential impact of these tools, they are already capable of solving some problems and are constantly improving. It is important to recognize that the capabilities of these tools are rapidly advancing and could potentially surpass human capabilities in the near future.

Further reading