
BSc Thesis

Investigating the Impact of Recurrent Connections on Pre-Trained Transformer-based Models for Modeling Long-Range Dependencies in Sequential Tasks

  • For my thesis, I built a Transformer-based generator model. The project involved replicating an autoregressive Transformer model from scratch in Python with the PyTorch library, following an article by Professor Peter Bloem, who supervised the thesis together with Dimitrios Alivanistos (see the sketch below).
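
    A minimal sketch of the kind of causal self-attention block such a from-scratch replication involves is shown here; the class and variable names are illustrative assumptions, not the thesis code itself.

        # Sketch of a causal self-attention block in PyTorch (illustrative, not the thesis code).
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CausalSelfAttention(nn.Module):
            def __init__(self, emb_dim: int, num_heads: int):
                super().__init__()
                assert emb_dim % num_heads == 0
                self.num_heads = num_heads
                self.to_qkv = nn.Linear(emb_dim, 3 * emb_dim)  # joint query/key/value projection
                self.to_out = nn.Linear(emb_dim, emb_dim)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                b, t, e = x.size()
                h = self.num_heads
                q, k, v = self.to_qkv(x).chunk(3, dim=-1)
                # reshape to (batch, heads, time, head_dim)
                q, k, v = (z.view(b, t, h, e // h).transpose(1, 2) for z in (q, k, v))
                scores = q @ k.transpose(-2, -1) / (e // h) ** 0.5
                # causal mask: each position attends only to itself and earlier positions
                mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
                scores = scores.masked_fill(mask, float("-inf"))
                out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).contiguous().view(b, t, e)
                return self.to_out(out)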

  • The primary goal of this research was to explore the impact and trade-offs of integrating a recurrent layer (a GRU) into a pure self-attention architecture. To that end, I designed a modified architecture that adds recurrent connections to a self-attention-based Transformer, and analyzed how the recurrent layer affected both the generated output and the model's overall performance.

  • The main research question of this study was: How does the integration of recurrent connections into pre-trained models impact their ability to model long-range dependencies and improve performance on sequential tasks? To address this, I investigated how recurrent connections enhance sequential context modeling, compared the modified model against other state-of-the-art models, and assessed its computational requirements.

  • My model inserts a recurrent layer (a GRU) into the pre-trained GPT-2 Transformer; one possible integration is sketched below. The datasets used for training were the enwik8 dataset, the Alice in Wonderland corpus, and a dataset of podcast sequences.
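
    One possible way to combine a pre-trained GPT-2 backbone with a GRU layer is sketched here, using Hugging Face's transformers library; the placement of the GRU after the backbone and the class name GPT2WithGRU are assumptions for illustration, not the exact thesis architecture.

        # Hedged sketch: a GRU layer added on top of pre-trained GPT-2 hidden states.
        import torch
        import torch.nn as nn
        from transformers import GPT2Model, GPT2Tokenizer

        class GPT2WithGRU(nn.Module):
            def __init__(self, vocab_size: int = 50257, hidden: int = 768):
                super().__init__()
                self.gpt2 = GPT2Model.from_pretrained("gpt2")        # pre-trained backbone
                self.gru = nn.GRU(hidden, hidden, batch_first=True)  # added recurrent layer
                self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

            def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
                hidden_states = self.gpt2(input_ids).last_hidden_state  # (batch, time, hidden)
                recurrent_out, _ = self.gru(hidden_states)              # propagate sequential context
                return self.lm_head(recurrent_out)                      # next-token logits

        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        model = GPT2WithGRU()
        inputs = tokenizer("Alice was beginning to get very tired", return_tensors="pt")
        logits = model(inputs.input_ids)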

  • The modified model was trained and evaluated on the DAS-5 SLURM cluster. Performance metrics such as accuracy, precision, recall, F1 score, and AUC-ROC were used to assess its effectiveness on sequential tasks, and memory usage, training time, and inference speed were also measured to understand the trade-offs introduced by the recurrent layer (see the sketch below).
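
    An illustrative sketch of how token-level predictions and resource costs might be gathered is given here; the helper evaluate and the macro-averaged metric choices are assumptions rather than the thesis code, and AUC-ROC (which requires per-class probability scores) is omitted for brevity.

        # Illustrative evaluation sketch: classification metrics plus simple resource measurements.
        import time
        import torch
        from sklearn.metrics import accuracy_score, precision_recall_fscore_support

        def evaluate(model, input_ids, target_ids):
            start = time.perf_counter()
            with torch.no_grad():
                logits = model(input_ids)                 # (batch, time, vocab)
            elapsed = time.perf_counter() - start         # inference time in seconds

            preds = logits.argmax(dim=-1).flatten().cpu().numpy()
            targets = target_ids.flatten().cpu().numpy()
            acc = accuracy_score(targets, preds)
            prec, rec, f1, _ = precision_recall_fscore_support(
                targets, preds, average="macro", zero_division=0
            )
            peak_mem = torch.cuda.max_memory_allocated() if torch.cuda.is_available() else None
            return {"accuracy": acc, "precision": prec, "recall": rec,
                    "f1": f1, "inference_seconds": elapsed, "peak_memory_bytes": peak_mem}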

References (among others)

  • Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.
  • Dai, Z., et al. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." Proceedings of ACL, 2019.
  • Yoshida, D., et al. "Adding Recurrence to Pretrained Transformers." 2021.
  • Radford, A., et al. "Language Models Are Unsupervised Multitask Learners." OpenAI Blog, 2019.