Training GPT-2 to Win at Connect Four

I wanted to learn how to train GPT-2, as it is a powerful model. For simplicity, I chose Connect Four. This project serves as a basic example; I plan to tackle more complex applications later.

I published the code in my GitHub repo.

Game Overview

Connect Four is a simple strategy game in which two players drop colored discs into a grid, each aiming to be the first to connect four of their own discs in a row. For more information, see the Wikipedia page.

Generating Training Data

Using the easyAI module, I created a script to generate game data and added a random player to make the games diverse.

The AIs implemented in the module are deterministic, so AI vs. AI games would always yield the same result. That is why I used a random player.

import random

class RandomPlayer:
    def __init__(self, name="Random"):
        self.name = name

    def ask_move(self, game):
        # Pick a uniformly random move from the legal ones.
        return random.choice(game.possible_moves())
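For completeness, the generation loop might look roughly like this. It is a sketch, not the exact script from the repo: it assumes the ConnectFour game class and Negamax AI from the easyAI examples, and that the game object records the chosen columns in game.history (the same attribute GPTPlayer reads later).

from easyAI import AI_Player, Negamax

# ConnectFour: the game class from the easyAI examples (see the repo).
def generate_games(n_games, depth=5):
    lines = []
    for _ in range(n_games):
        game = ConnectFour([AI_Player(Negamax(depth)), RandomPlayer()])
        game.play(verbose=False)
        # game.history holds the chosen columns, e.g. [1, 3, 5, 3]
        lines.append("".join(str(move) for move in game.history))
    return lines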

The data generated by the script is in the following format, where each digit is the column chosen on that turn:

1353

I decided to convert it to:

[A1][B3][A5][B3]

This way, every possible move is represented by a unique token, and the network can tell which player is making each move.
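The conversion itself is short: players alternate, so even-indexed moves get the A prefix and odd-indexed moves the B prefix (the same encoding GPTPlayer uses later). A minimal sketch:

def encode_history(moves):
    # moves: list of column indices, e.g. [1, 3, 5, 3]
    return "".join(f"[{'AB'[i % 2]}{col}]" for i, col in enumerate(moves))

assert encode_history([1, 3, 5, 3]) == "[A1][B3][A5][B3]"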

For this, I had to create new tokens:

'[A0]', '[A1]', '[A2]', '[A3]', '[A4]', '[A5]', '[A6]', '[B0]', '[B1]', '[B2]', '[B3]', '[B4]', '[B5]', '[B6]'

And add them to the tokenizer:

special_tokens = [f'[{letter}{number}]' for letter in 'AB' for number in range(7)]
tokenizer.add_special_tokens({'pad_token': '[PAD]', 'additional_special_tokens': special_tokens})
tokenizer.save_pretrained('./tokenizer')
model.resize_token_embeddings(len(tokenizer))  # allocate embeddings for the new tokens
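Each training sample pairs a game prefix with the next move as its label; that is how the input_text and label_text fields used below should be read. A minimal sketch of building such samples (my assumption about the dataset layout, not the repo's exact code):

def build_samples(games):
    # games: list of games, each a list of move tokens like '[A1]'
    samples = []
    for game in games:
        for i in range(1, len(game)):
            samples.append({
                'input_text': "".join(game[:i]),  # everything played so far
                'label_text': game[i],            # the move to predict
            })
    return samples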

To be sure that the network generates something that makes sense, I created a custom callback that fires after every evaluation:

import random

import torch
from transformers import TrainerCallback

class PrintRandomSampleCallback(TrainerCallback):
    def __init__(self, _validation_dataset):
        self.validation_dataset = _validation_dataset

    def on_evaluate(self, args, state, control, **kwargs):
        try:
            # Pick a random validation sample and let the model predict the next move.
            random_idx = random.randint(0, len(self.validation_dataset) - 1)
            sample = self.validation_dataset[random_idx]
            input_decoded = sample['input_text']
            label_decoded = sample['label_text']
            # Uses the module-level tokenizer and model.
            inputs = tokenizer(input_decoded, return_tensors='pt', padding=True, truncation=True).to("cuda:0")
            model.eval()
            with torch.no_grad():
                outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                                         num_return_sequences=1, max_new_tokens=1,
                                         pad_token_id=tokenizer.pad_token_id)
            prediction_decoded = tokenizer.decode(outputs[0], clean_up_tokenization_spaces=True).replace(" ", "")
            print(f"Random validation sample:\nI: {input_decoded}\nL: {label_decoded}\nR: {prediction_decoded}")
        except Exception as e:
            # A printing problem should not crash training.
            print(e)
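The callback is registered through the Trainer's callbacks argument. A sketch of the wiring; the training arguments below are placeholders, not the values I actually used:

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./fine-tuned-gpt2',
                           evaluation_strategy='steps',
                           eval_steps=500),
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    callbacks=[PrintRandomSampleCallback(validation_dataset)],
)
trainer.train()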

The model can be tested with the last script. I created a new player class that uses the trained model to make moves:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class GPTPlayer:
    def __init__(self, name="GPT"):
        checkpoint_path = './fine-tuned-gpt2'
        self.model = GPT2LMHeadModel.from_pretrained(checkpoint_path).to("cuda:0")
        self.tokenizer = GPT2Tokenizer.from_pretrained(checkpoint_path)
        # Maps a move token like '[A3]' back to its column number.
        self.special_tokens = {f'[{letter}{number}]': number for letter in 'AB' for number in range(7)}
        self.name = name

    def ask_move(self, game):
        # Encode the game history the same way as the training data.
        input_decoded = "".join([["[A", "[B"][num % 2] + str(x) + "]" for num, x in enumerate(game.history)])
        inputs = self.tokenizer(input_decoded, return_tensors='pt', padding=True, truncation=True).to("cuda:0")
        self.model.eval()
        with torch.no_grad():
            outputs = self.model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                                          num_return_sequences=1, max_new_tokens=1,
                                          pad_token_id=self.tokenizer.pad_token_id)
        # The generated sequence ends with the predicted move token.
        move_token = self.tokenizer.decode(outputs[0], clean_up_tokenization_spaces=True).split(" ")[-1]
        return self.special_tokens[move_token]
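Wiring it into a match is then a few lines; a sketch assuming the ConnectFour game class from the easyAI examples:

# ConnectFour: the game class from the easyAI examples (see the repo).
game = ConnectFour([GPTPlayer("GPT-2"), RandomPlayer("Random")])
game.play()
# lose() is evaluated from the point of view of the player to move,
# so if it is true, the opponent of the current player has won.
if game.lose():
    print(f"Player {game.opponent_index} wins")
else:
    print("Draw")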

The model wins 70-80% of games against a random player but consistently loses to strategic players. This may be due to the simplicity of the training data, short training time, or the model’s inability to understand the game. Including the board state at each step could improve performance.
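One speculative way to do that: serialize the grid into the prompt after every move. The encoding below is only an illustration of the idea (it would also need its own tokens and much longer sequences):

def encode_board(board):
    # board: 6x7 grid, 0 = empty, 1 = player A, 2 = player B
    cells = {0: '.', 1: 'a', 2: 'b'}
    return "|".join("".join(cells[c] for c in row) for row in board)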

For more details, visit my GitHub repository.
