Transformers in AI: What They Are and How They Work

A team of researchers at Google Brain, which is now part of Google DeepMind, including computer scientist and machine learning researcher Ashish Vaswani, introduced a deep learning architecture called the transformer in 2017 to solve problems in natural language processing, or NLP. Algorithms and models based on transformers have advanced the subfield and applications of NLP and have also contributed to the further development of generative artificial intelligence and the specific field of computer vision.

Simplified Explainer of What Transformers Are and How They Work

General Overview and Definitions

Transformers are a specific type of artificial neural network architecture, a deep learning model, and an NLP model that uses “self-attention” to process input sequences and produce output sequences. Their introduction helped advance NLP because of their demonstrated ability to outperform earlier NLP models on a multitude of tasks, including language translation, sentiment analysis, and question answering.

Self-attention enables the transformer model to attend to the different parts of the input sequence, such as individual words in a sentence, and use that information to make predictions or produce output. Each part of the sequence can “attend” to other parts of the same sequence, allowing the model to capture relationships and dependencies between the different parts.
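As a rough sketch, the scaled dot-product attention at the core of this mechanism can be written in plain Python. Note that this toy version uses the token embeddings directly as queries, keys, and values; a real transformer first multiplies them by learned projection matrices. The `self_attention` function and the two-dimensional “embeddings” below are illustrative assumptions, not code from the original paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax: turns raw scores into weights
    # that are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(seq):
    """Toy scaled dot-product self-attention over a list of vectors.

    For clarity, each embedding serves as its own query, key, and
    value; real transformers apply learned projections first.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Each position attends to every position, including itself.
        scores = [dot(q, k) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])
    return out

# Three toy "word embeddings" in a 2-dimensional space.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence)
```

Because each output vector is a weighted average of all the input vectors, every word's new representation blends in information from the rest of the sequence, which is how the model captures the relationships described above.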

Vaswani et al. explained that previous models for converting one sequence to another, such as translating from English to German or French, were complex and took a long time to train. Their solution was the so-called transformer model, which uses a simpler design based on an attention mechanism, or self-attention. It is faster to train, easier to parallelize, and achieves better results than the best previously known models.

The biggest advantage of transformers is that they can be trained on large datasets because they are amenable to parallelization. This has led to pre-trained systems such as Bidirectional Encoder Representations from Transformers, or BERT, from Google; the Generative Pre-trained Transformer, or GPT, from OpenAI; and the Large Language Model Meta AI, or LLaMA, from Meta.
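The contrast with earlier recurrent models can be sketched schematically. In the hypothetical `rnn_pass` below, each step depends on the previous hidden state and must run sequentially, while each position in `transformer_pass` reads the whole input but not the other positions' outputs, so the loop could in principle run in parallel (in practice, as one batched matrix multiplication on a GPU). The `step` and `attend` callables are illustrative stand-ins, not real model components.

```python
# Recurrent pass: each hidden state depends on the previous one,
# so positions must be processed strictly one after another.
def rnn_pass(tokens, step, h0):
    h = h0
    outputs = []
    for tok in tokens:
        h = step(h, tok)  # sequential dependency on h
        outputs.append(h)
    return outputs

# Transformer-style pass: each position's result reads the whole
# input sequence but not the other positions' outputs, so every
# iteration of this loop is independent and parallelizable.
def transformer_pass(tokens, attend):
    return [attend(i, tokens) for i in range(len(tokens))]

tokens = [1, 2, 3]
sequential = rnn_pass(tokens, lambda h, t: h + t, 0)
parallelizable = transformer_pass(tokens, lambda i, ts: ts[i] * sum(ts))
```

It is this independence between positions that lets transformers exploit modern hardware and train on very large datasets.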

Analogical Explanation of Operation

An analogy can help explain how transformers work and why they have become critical in advancing the practical applications of AI through recent developments in natural language processing and the emergence of large language models and multimodal language models. For starters, think of a particular transformer as a super-smart machine that can understand and talk at the same level as a human.

This machine is also capable of understanding different languages, including English and French. Take note that translating a sentence from English to French would require an average human to understand what each word means and how the words relate to each other in the sentence. Each part of the sentence should be given special attention for a more accurate translation.

Attending to each part of the sentence is called self-attention. This is central to how transformers work and is one of their biggest advantages. The analogical machine would then translate the provided sentence from English to French by focusing on its different parts while also giving more attention to words that are deemed more important. It would know which words are critical because it was previously trained on a large dataset.

Likewise, in natural language processing, a particular transformer model processes the individual parts of an input sequence. The GPT model used in chatbots like ChatGPT or Bing Chat, for example, processes the individual words in a written input or prompt to find relations, determine context, and generate the best possible output.


  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. 2017. “Attention Is All You Need.” arXiv. DOI: 10.48550/ARXIV.1706.03762