Hello and welcome back to the NLP Tutorials series. In the last two articles, we have looked into a few applications in the NLP domain ,  NER and Summarization, which can be found in various real-world settings. In this article, we shall get back to the understanding one of the latest and most interesting architecture s –  Infinite-former aka Infinite Transformer! The paper vouches for an attention mechanism which can cater to unbounded long-term contextual memory, which is groundbreaking since many architectures have tried solving the complexity and memory constraints of vanilla transformers albeit with a limited memory (longer sequences but limited). The authors also introduced sticky memories which are able to model very long contexts with a fixed computational requirement. 

Performance benchmarks. Note how the Perplexity metric of Infinite-former is better than Transformer-XL and Compressive Transformer. Image Credits — Infinite-former paper

Performance benchmarks. Note how the Perplexity metric of Infinite-former is better than Transformer-XL and Compressive Transformer. Image Credits — Infinite-former paper

Sounds too good to be true? Let’s conclude after we understand the inner workings.


Right off the bat it seems like it draws similarities from the LSTM model which also can model very long sequence lengths (Vanishing gradient problem encountered). We have seen the Transformer-XL and Compressive Transformer which are quite similar to the approach Infinite-former has taken — Ability to model very long sequences. All the above discussed models have delivered exceptional performance, but with a finite context length. The authors have come up with a clever heuristic way of modelling the self-attention — Linear combination of Radial Basis Functions. Intuition is to have a continuous attention lookup rather than a set of discrete points, which of course result in extensive memory and computation required for updating and maintaining them. This is termed as Continuous Attention mechanism.

Architecture & Working Principle

Continuous Attention is modelled by first converting the discrete signals into a continuous attention function X(t) along with having a probability density function which can span over a range on X(t) and select the segments to attend.

Continuous attention function. Image by author

where B(T) is the coefficient matrix which is representing the past segments in the sequence and Ψ(t) is the Radial Basis Function (RBF) within the range [0,1]. Further, B(T) is computed by a Ridge Regression function which will determine the right coefficients to be used along with the RBF such that X(t) is able to represent the context in a continuous waveform manner. Below is the formula for B(T).

Image by author

where F contains all the basis vectors for all l locations (F = [Ψ(t1), Ψ(t2), …Ψ(tL)]). We can notice a tradeoff here, B(T) is dependent on the length of sequence via the X variable (Discrete text sequence representation). This will lead to a drop in performance (more mistakes) because the basis vectors F are fixed as opposed to X. 

Now that the mechanism for X(t) is clear, let’s see how they have put this concept to practice.

In vanilla transformers, we have a discrete probability distribution (DPD) for the attention to focus on particular segments in the sequence. Since the attention function is continuous now, the authors replaced the (DPD) with a gaussian probability density p with 2 variables σ² and μ which help in fine-tuning the function to the surrounding context to be attended and the strength of attention. If we see a step by step overview of things in computation of attention for a current segment it will look like this,

  1. Query the long-term memory (LTM) via the B(T) matrix as K = BW(k) and V = BW(v)
  2. Compute σ² and μ which constitute the gaussian function representing the spread and strength of the context to be covered
  3. Compute Z(LTM) by imposing the gaussian p and the existing X(t)
  4. Concatenate Z(T) and Z(LTM) to get the final context representation Z
Infinite-former’s continuous attention mechanism. Image Credits — Infinite-former paper
Computation of σ² and μ. Image Credits — Infinite-former paper

Another way to describe this process,

Interpolate the current interpolation + Append the interpolated new signal = Compressed on a single waveform. This is no different than LSTM if we abstract the details.

The infinite-former’s attention is modelled in a continuous space mechanism, dependent on the number of basis vectors which is constant and independent of the context length resulting in a complexity of O(L x L(LTM)). Due to this property, it can attend to infinite length sequences without a major increase in computation. How is it able to do this? The figure below illustrates this clearly.

Representation of unbounded memory. Image Credits — Infinite-former paper

Coming to the sticky memories concept, it is the technique to give proper weightage to the context (some regions are larger and stronger than others). The process here is as follows:

  1. Construct a histogram based on the strength of attention at each interval by dividing the signal into D bins
  2. Compute the probability density p for each bin
  3. Sample M points within the range [0,1] based on p
Illustration of sticky memory effect on the memory space. Image Credits — Infinite-former paper

Now that we have understood the architecture, the question is — How good is it?


The baselines considered for comparison are Transformer-XL and Compressive Transformers. The Infinite-former fared better than both the above-mentioned models in the Sorting and Language modelling (also resulting in a better perplexity than both of them). In the pre-trained language modelling task, Infinite-former managed to have a better performance than GPT-2 over the Wikitext-103 and PG19 datasets.


Quite an experience understanding this, wasn’t it? It requires a very sound understanding of Radial Basis Functions to follow the architecture closely. The results are up there with the best architectures but there is a major caveat —  “Are the discrete signals undergoing a smoothening process?” Consider the scenario where the discrete points are noisy and random and are not falling into a curve. We are of course passing them over a smoothing function to get them into a nice shape for calculation and representation efficiency. But, we are losing importance and genuinity of information at that discrete location which leads to a drop in accuracy. Give a thought to this fact and try to comprehend the actual workings again to have a clearer understanding of the Infinite-former.


Pranav Raikote


  1. Transformer (Attention is All You Need): https://arxiv.org/pdf/1706.03762.pdf
  2. Infinite Transformer: https://arxiv.org/pdf/2108.09084.pdf
  3. LSTM: https://www.bioinf.jku.at/publications/older/2604.pdf
  4. Transformer-XL: https://arxiv.org/pdf/1901.02860
  5. Compressive Transformer: https://arxiv.org/pdf/1911.05507

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s