Transformers in AI: What They Are and How They Work

In 2017, a team of researchers at Google Brain, which is now part of Google DeepMind, introduced a deep learning architecture called the transformer to solve problems in natural language processing, or NLP. The team was headed by computer scientist and machine learning researcher Ashish Vaswani. Algorithms and models based on transformers have advanced the NLP subfield and its applications, and have also contributed to the development of generative artificial intelligence and the subfield of computer vision.

Simplified Explainer of What Transformers Are and How They Work

General Overview and Definitions

A transformer is a type of artificial neural network architecture used in deep learning. It uses “self-attention” to process an input sequence and produce an output sequence. The introduction of transformers has advanced NLP because they have outperformed other NLP models on a multitude of tasks, including language translation, sentiment analysis, and question answering.

Self-attention enables a transformer-based model to attend to the different parts of the input sequence, such as individual words in a sentence, and use that information to make predictions or produce output. Each part of the sequence can “attend” to the other parts of the same sequence, which allows the model to capture relationships and dependencies between them.
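
The mechanism described above can be sketched in a few lines of Python. This is a simplified illustration rather than the full design from the original paper: it assumes the queries, keys, and values are the word vectors themselves, whereas a real transformer first passes them through learned projection matrices. The function names and the tiny two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Simplified self-attention over a sequence of vectors.

    Assumption: queries, keys, and values are the input vectors themselves.
    A real transformer applies learned projections first.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, scaled by sqrt(dim).
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all vectors in the sequence.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-word sentence.
words = ["the", "cat", "sat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence, including itself, and the weights in each row sum to 1.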

Vaswani and his team explained that previous models for converting one sequence to another, such as translating from English to German or French, were complex and took a long time to train. Their solution was a simpler design based solely on an attention mechanism. They reported that it is faster to train, easier to parallelize, and achieves better results than alternative models.

The biggest advantage of transformers is that they are amenable to parallelization, which makes it practical to train them on large datasets. This has led to pre-trained systems such as Bidirectional Encoder Representations from Transformers, or BERT, from Google; the Generative Pre-trained Transformer, or GPT, from OpenAI; and the Large Language Model Meta AI, or LLaMA, from Meta.
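
A small Python sketch can illustrate why attention is easier to parallelize than a step-by-step sequence model. It rests on stated assumptions: `recurrent_pass` is a toy recurrence invented for contrast, `attend` is a simplified one-number-per-word attention step, and the thread pool only demonstrates that the positions are independent of one another. Real implementations get their speed from large matrix multiplications on GPUs or similar hardware, not from Python threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def recurrent_pass(inputs):
    """Toy sequential model: each step needs the state from the previous step."""
    state, states = 0.0, []
    for x in inputs:
        state = 0.5 * state + x  # hypothetical recurrence, for contrast only
        states.append(state)
    return states


def attend(i, inputs):
    """Attention-style step: position i depends only on the raw inputs."""
    scores = [inputs[i] * x for x in inputs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


inputs = [0.1, 0.4, -0.2, 0.3]  # hypothetical one-number "embeddings"

# Computed one position after another.
sequential = [attend(i, inputs) for i in range(len(inputs))]

# Computed concurrently: no position waits for another position's output.
with ThreadPoolExecutor() as pool:
    concurrent = list(pool.map(lambda i: attend(i, inputs), range(len(inputs))))

print(sequential == concurrent)
print(recurrent_pass([1.0, 1.0, 1.0]))
```

In the recurrent function, step three cannot start until step two has finished. In the attention function, every position can be computed at the same time because each one reads only the original inputs.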

Analogical Explanation of Operation

An analogy can help explain how transformers work and why they have become critical to the practical applications of AI, including recent developments in natural language processing, large language models, and multimodal language models. To start, think of a transformer as a very smart machine that can understand and produce language at a level comparable to that of humans.

This machine is also capable of understanding different languages, including English and French. Note that translating a sentence from English to French requires a human to understand what each word means and how the words relate to each other in the sentence. Each part of the sentence should be given special attention for a more accurate translation.

Attending to each part of the sentence is called self-attention. This is central to how transformers work and to their biggest advantages. The machine in this analogy would translate the provided sentence from English to French by focusing on its different parts while giving more attention to the words deemed more important. It would know which words are critical because it was previously trained on a large dataset.

In actual natural language processing, a transformer-based model likewise processes the individual parts of an input sequence. This is the case for the GPT models used in chatbots like ChatGPT or Bing Chat. The model processes the individual words in a given prompt to find relations, determine context, and generate the best possible output.


  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. 2017. “Attention Is All You Need.” arXiv. DOI: 10.48550/arXiv.1706.03762