Modern deep learning heavily relies on having enough computational resources to try out sufficiently complex models and also to keep iteration costs as low as possible. Particularly in the microcosmos of sequential models, Recurrent Neural Networks (RNNs) have proven their effectiveness in various tasks such as machine translation, text-to-speech and handwriting recognition. Apart from these empirical results, they possess convenient properties such as being able to cope with dynamic sequence lengths. Variants such as the Long-Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997] cell or the Gated Recurrent Units (GRU) cell [Cho et al., 2014] have moreover addressed the vanishing gradient problem, making them a solid a priori choice for modeling sequences.

Having said that, there is an obvious shortcoming that becomes more apparent these days. Owed to their iterative nature, it takes quite a long time for information to propagate (linear in the length of the sequence), making it virtually impossible to exploit the merits of concurrent hardware (such as GPUs). Convolutional Neural Networks (CNNs) [Yann LeCun et al., 1998], on the other hand, are immune to this problem as each kernel application is performed in isolation (within the same layer). Unfortunately this comes at the cost of only being able to account for fixed-sized contexts. Finally, fully-connected layers do not account for sequential information and have yet to prove their suitability for these tasks.

Being released in late 2017, Attention Is All You Need [Vaswani et al.], has had a big impact on the deep learning community and can already be considered as being a go-to method for sequence transduction tasks. From a superficial point of view, the so-called Transformer leverages fully-connected networks with a cleverly-designed attention mechanism. This enables us to circumvent interdependent computations that RNNs are facing. Surprisingly, the authors have proven that such an approach yields on-par results in the machine translation domain on the WMT’14 corpus.

The following parts are divided into two sections. At the beginning, we will provide some rationale on the mathematical foundation of the so-called seq2seq model group. This encompasses a brief discussion of Attention [Bahdanau, 2014], a technique that greatly helped to advance the state-of-the-art in deep learning. Having read the paper, I initially found it to be difficult to come up with a waterproof implementation. Thus, the other chapters will focus on how to avoid common pitfalls and cut complexity wherever possible.

A reduced variant of the model was trained on a Thinkpad T470p (GeForce 940MX) for roughly 10 hours and yields surprisingly good results despite its size. I have also defined settings for a more complex variant that can best be run on Amazon’s EC2 using a GPU instance (e.g. p2.xlarge). The code is available at GitHub and can be easily run from within a Docker container.

This treatise should not be read with the expectation of sticking 100% to the model specification. Every now and then I allowed myself to exchange something (such as pre-trained embeddings, cosine decay w. restarts etc.) but the general gist remains the same.

As this can be considered being a very TensorFlow-centric explanation I strongly recommend you to read these two excellent articles beforehand:

1. The Annotated Transformer by G. Klein, Y. Kim, Y. Deng, J. Senellart and A. M. Rush [Harvard NLP]
2. The Illustrated Transformer by Jay Alammar

Especially the latter article covers a great set of illustrations that can help you to improve your understanding of the topic.

Title photo by Natalia Y on Unsplash

# Problem definition

We are interested in transducing a sequence $x = (x_1, \ldots, x_T)$ of variable length T into another sequence $y = (y_1, \ldots, y_T')$ of length T’ where (in general) $T \neq T'$. Vectors within $x, y$ are assumed to be of dimensionality $d_X$ resp. $d_Y$.

## Recap: Seq2Seq using RNNs

The first sequence $x = (x_1, \ldots, x_T)$ is first consumed by a so-called encoder RNN which yields a final state $h_T$ using the following equation: $h_{t+1} = \sigma(W_E h_{t} + Q_E x_{t + 1}) \in \mathbb{R}^{d_H}$ where $h_0$ is typically zero initialized and $\sigma(.)$ denotes some suitable non-linearity. A second decoder RNN takes the $d_H$ dimensional output $h_T$ of the encoder and produces a variably-sized sequence: $g_{t+1} = \sigma(W_D g_{t} + Q_D y_t) \in \mathbb{R}^{d_G}$ $y_{t+1}' = \textrm{softmax}(W_Y g_{t+1}) \in \mathbb{R}^{d_V}$ where $y_0' = h_T$ and $g_0 = \mathbf{0}$. Notice that the last generated output token is used as an input in the next step. In practice, an end token («END») often serves as a termination symbol so that the model doesn’t produce sequences of infinite length in prediction mode.

Generally we are interested in increasing the throughput by training on batches instead of single examples. This, however, requires to use a padding scheme since individual sequences generally don’t have the same length. In order to avoid computing the loss also on padded values, we can construct a mask that is zero if the associated position belongs to the padding and one everywhere else.

def masked_cross_entropy(loss, labels, sequence_lengths):
"""
:param logits: A Tensor of shape [B, T, C]
:param labels: A Tensor of shape [B, T]
:param sequence_lengths: A Tensor of shape [B]
"""
T = tf.shape(sequence_lengths)[0]
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits) # [B, T]

# Reduce loss to single scalar


Albeit more tedious to write, the above code guarantees that the loss of each individual example within the batch is on the same scale.

# Attention

Despite its conceptual simplicity, the above model still has one very profound shortcoming - it needs to pack all of the information of the input sequence into a single vector $h_T$. Since the input sequences’ lengths are unbounded, we have to put an unbounded amount of information into this vector. As the vector’s capacity is bounded, it becomes impossible for the decoder to produce sensible output sequences of arbitrary length.

Attention [Bahdanau, 2014] rids us of this requirement. Content-based attention mechanisms use the entirety of intermediate encoder states $h = (h_1, \ldots, h_T)$ (the so-called keys) as a memory which can be queried using a so-called query vector.

For this, we need the so-called attention weights which can be derived in the following fashion: $\alpha_{t} = h_t^T q\ \in \ \mathbb{R}$ for $t=1,\ldots,T$. Intuitively, these can be interpreted as how “related” certain time steps (in our case words) are for some given query $q$ (also a word). Often, another component is defined: The so-called value vectors $v = (v_1, \ldots, v_T)$ can be conveived as being (possibly) non-linear transformations of $h = (h_1, \ldots, h_T)$. For the sake of simplicity, we assume that $v_t = h_t$ holds for all encoder timesteps.

Putting all together we arrive at: $w=\sum_{t=1}^T \alpha_{t}' v_t=\sum_{t=1}^T \alpha_{k}' h_t\ \in \ \mathbb{R}^{d_H}$ where $\alpha_{t}' =\frac{\alpha_{t}}{\sum_{k=1}^T \alpha_{k}}$ is the normalized attention weight. In the best case, this summary $w$ contains salient information which aids us in generating the next symbol by supplying it as an input to the decoder. In the machine translation scenario this could, for instance, mean that the attention mechanism assigns a large weight to corresponding words such as “stool”/”stuhl” or “president”/”präsident”.

Apart from the added interpretability, breaking the dependence of relying on a fixed-length vector has enabled researchers to improve their models by wide margins. Various examples of this can be found in the fields of classification [Parikh, 2016], handwriting recognition [Bluche, 2016] and text-to-speech [Tachibana, 2017].

Figure 1: In the machine translation domain attention enables us to craft an intuitive alignment between the words of the two languages under consideration. This can also become more complex in that it is possible that a word (e.g. “zusammenkommen”) is being associated to two words (“come together”)

# The Dataset

We decided to use the EuroParl [Koehn, 2005] dataset instead of the more frequently used WMT’14 dataset. This corpus contains the proceedings of the European Parliament and offers sentence-aligned translations in each language of the twenty-one member countries. The German-English language pair contains 1,920,209 sentence pairs and comes in an easy-to-use format. However, it should be stated that the choice of using this language pair has no clear motivation other than personal preference and can easily be replaced by some other pair that covers less data.

Unfortunately, reading large data sets is still a cumbersome task in TensorFlow. Simple solutions such as tf.data.Dataset.from_tensor_slices suffer from the inherent 2 GiB limit imposed by TensorFlow’s graph buffers. Another solution, the so-called TFRecords solve this problem at the cost of an (almost) non-existing documentation.

In order to keep the IO-overhead at a minimum we nonetheless decide to use this approach combined with TensorFlow’s Dataset & Estimator API. The latter has the added benefit of further improving the execution speed as we don’t have to constantly switch back to Python to re-feed the session with a new batch.

## Conversion

The data comes in the two files europarl-v7.de-en.de and europarl-v7.de-en.en that contain the German and English sentences respectively. Each sentence occupies exactly one line. In conjunction to this mere transformation to TFRecords we also use spaCy to tokenize the words and also to strip unwanted special symbols. Lemmatization should not be performed in this scenario as we also want the model to learn how to discern multiple variations of a word.

We also have to keep track of the used vocabulary in each language as this will greatly simplify the use of the text in TensorFlow.

Note: The astute reader might notice that this could also be avoided by using a hash-based approach (e.g. by IdTableWithHashBuckets in tf.contrib.lookup). However this comes at the cost of not being able to map back the integer indices to actual words which is needed later on.

Naively instantiating spaCy leads to a rather heavy-weight configuration. As we only need a tiny subset of the features we are better off disabling all the others, resulting in a huge performance increase.

import spacy
disable = ["ner", "tagger", "parser", "textcat"]


Let us first create a generator that yields a tuple of Doc objects (spaCy’s abstract representation of text) for each sentence pair.

def line_generator(de_fname, en_fname):
with open(de_fname, "r") as f_de, open(en_fname, "r") as f_en:
for de_doc, en_doc in zip(nlp_de.pipe(f_de), nlp_en.pipe(f_en)):
yield de_doc, en_doc


Now we can define how the tokens should be transformed. We choose to only keep punctuation marks and further to neglect capitalization:

def _token_filter(token):

def parse_doc(de_doc, en_doc):
de_tokens = [token.text.lower() for token in de_doc if _token_filter(token)]
en_tokens = [token.text.lower() for token in en_doc if _token_filter(token)]
return de_tokens, en_tokens


The following code iteratively builds up the language-specific vocabulary and writes the sentence-pairs into the TFRecords file.

def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

# Open file
writer = tf.python_io.TFRecordWriter(args.output)

# Vocabulary
de_vocab = set()
en_vocab = set()
for de_doc, en_doc in line_generator(args.de, args.en):

# Filter
de_tokens, en_tokens = parse_doc(de_doc, en_doc)

# Update vocabulary
de_vocab.update(de_tokens)
en_vocab.update(en_tokens)

# Define feature
feature = {
"en_text": _bytes_feature(map(lambda s: s.encode("UTF-8"), en_tokens)),
"de_text": _bytes_feature(map(lambda s: s.encode("UTF-8"), de_tokens))
}

# Create an example protocol buffer
example = tf.train.Example(features=tf.train.Features(feature=feature))

# Serialize to string and write out
writer.write(example.SerializeToString())

# Close file
writer.close()


The first function _bytes_feature might seem to be a bit cryptical at first glance so let’s explain the process in details. At first we encode each token into its UTF-8 counterpart, this list of encoded tokens can be further wrapped by a tf.train.Feature. As we are dealing with sentence pairs we have to do this twice. Finally, we can wrap the combined representation into a tf.train.Example which is serialized before being written out.

After having written the TFRecords file the vocabulary can be dumped into a text file where each line corresponds to exactly one token. This representation can be easily digested by tf.contrib.lookup.index_table_from_file which maps tokens to their associated integer ID as defined by the vocabulary.

Note: It is of utmost importance to also add two special symbols (a start token «START» and an end token «END») to both language-specific vocabularies. A proper explanation will be given later on.

## Constructing the input pipeline

We use the Dataset API to read batches from the TFRecords file. For this to work, we need to first parse the individual examples. Afterwards we can apply arbitrary transformations on the examples.

### Parsing

Each feature consists of two (German & English) list of tokens of dynamic size. Each token is a string, so we arrive at the following specification:

def _parse(record):

# Define data types
features = {
"en_text": tf.VarLenFeature(dtype=tf.string),
"de_text": tf.VarLenFeature(dtype=tf.string)
}

return tf.parse_single_example(record, features=features)


Using tf.VarLenFeature yields a tf.SparseTensor which we can easily convert to a normal (dense) tf.Tensor:

def _densify(record, field):
record[field] = tf.sparse_tensor_to_dense(record[field], default_value="")
return record


### Example-specific transformations

Notice that until now we have neither used the start symbol «START» nor the end symbol «END». Let’s add them to each sentence:

def _prepend_start_token(record, field):
constant_values="<<START>>")
return record

def _append_end_token(record, field):
constant_values="<<END>>")
return record


Example: Applying the above transformation yields new sentences like: [”«START»”, “I”, “like”, “the”, “beaches”, “in”, “California”, “«END»”]

Owed to the fact that we would like to combine multiple sentence pair in a batch we face the same problem that we earlier had with Seq2Seq. In general the number of words per sentence is not constant so we should better keep track of the individual lengths to know where the padding starts. We can use the following function to add two new features de_text_length and en_text_length. This also counts our newly-introduced special symbols.

def _add_length(record, field):
record["{}_length".format(field)] = tf.shape(record[field])[0]
return record


### Creating epochs

Epochs can be efficiently created using shuffle_and_repeat from tf.contrib.data. It also shuffles the data which is indispensable because proceedings typically revolve around a common topic whose words appear more often than others. The shuffling helps to make the batches more diverse in this setting.

def create_epochs(data, shuffle_buffer_size, num_epochs):
data = data.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=shuffle_buffer_size,
count=num_epochs))
return data


### Batch-specific transformations

Let’s ponder a little bit about how the padding will be applied to the sentences. Assume that we have a batch size of 4 and the individual lengths of the sentences (could be either German or English) are 1, 2, 3 and 50 respectively. This entails that we have to pad the first example 49 times, the second 48 times and the third 47 times which implies a lot of unnecessary memory overhead. It would be wiser to minimize the padding by grouping together sentences of similar lengths.

To obtain this behavior we can use the function bucket_by_sequence_length in tf.contrib.training. which roughly performs these tasks. Given a function that determines the length of a sentence and a definition of the bucket boundaries it does the following:

1. Build buckets according to bucket boundaries
2. Assign examples to buckets
3. Pad each example to the bucket’s assigned length (using “” for tf.strings)

The individual buckets can also produce different batch sizes, a feature that we omit for the sake of simplicity.

Note: We can only specify one length function. This should not concern us too much as two sentences of similar length in German are also expected to have two English translations of similar length (at least in most of the cases)

def _extract_length(record, field):
return record["{}_length".format(field)]

def group_by_sequence_length(data, batch_size, bucket_boundaries):
bucket_batch_sizes = [batch_size] * (len(bucket_boundaries) + 1)
f = tf.contrib.data.bucket_by_sequence_length(element_length_func=partial(_extract_length,
field="de_text"),
bucket_boundaries=bucket_boundaries,
bucket_batch_sizes=bucket_batch_sizes,
"en_text": tf.TensorShape([None]),
"de_text": tf.TensorShape([None]),
"en_text_length": tf.TensorShape([]),
"de_text_length": tf.TensorShape([])
})
data = data.apply(f)
return data


### Input function

We can put everything together by using something along these lines:

from functools import partial
def get_input_fn(fname, shuffle_buffer_size, num_epochs, bucket_boundaries, batch_size):

# Create dataset
data = tf.data.TFRecordDataset(filenames=[fname])

# Parse single records
data = data.map(_parse)

# Densify
data = data.map(partial(_densify, field="en_text"))
data = data.map(partial(_densify, field="de_text"))

# Prepend start tokens
data = data.map(partial(_prepend_start_token, field="en_text"))
data = data.map(partial(_prepend_start_token, field="de_text"))

# Append end tokens
data = data.map(partial(_append_end_token, field="en_text"))
data = data.map(partial(_append_end_token, field="de_text"))

# Epochs
data = create_epochs(data, shuffle_buffer_size, num_epochs)

# Bucket by sequence length
data = group_by_sequence_length(data, batch_size, bucket_boundaries)

return data


# Embedding

The paper doesn’t say anything about the use of any pre-trained embeddings which is why we assume that the word embeddings have been learned from scratch. Nevertheless, having the objective of keeping the resource requirements at a bare minimum, we decide to use the fastText embeddings [Bojanowski et. al, 2016] that have been trained on Wikipedia data.

These embeddings are easily consumable as they are sorted by their corpus frequency which, in turn, makes it possible to efficiently extract the $n$ most frequent words. We further decide to fix these which greatly reduces the number of trainable weights.

Unfortunately, it is quite cumbersome to load pre-trained embeddings using the Estimator API. One way to do this reads as follows:

def get_scaffold(W_init):
"""
:param W_init: A np.ndarray of shape [num_words, embedding_size]
that stores the pre-trained embeddings
"""

# Extract dimensions
num_words, embedding_size = W_init.shape

# Define embedding tensor
W = tf.get_variable(
name="W_embedding",
shape=[num_words + 1, embedding_size],
initializer=None,
trainable=False,
dtype=tf.float32)

# Define scaffold
def _init_fn(_, sess):
sess.run(W.initializer, feed_dict={W.initial_value: W_init})

return tf.train.Scaffold(init_fn=_init_fn)


The fact that we are actually using one token more stems from the fact that we are using an out-of-vocabulary token.

This scaffold can afterwards be passed as an argument to tf.estimator.EstimatorSpec:

...
return tf.estimator.EstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN,
loss=..,
train_op=..,
scaffold=scaffold)


Note: We actually need to perform this procedure for both languages. For the sake of clarity I abstained from showing both initializations. The interested reader can take a look at the code to get some more detail.

# Time-aware embedding

The great benefit of using RNN-based architectures is that they have an innate sense of sequence order due to the recurrence equation. This, however, comes at the cost of introducing strong sequential dependencies that make it impossible to fully leverage the capabilities of modern hardware.

Wouldn’t it be a great thing to give the model some sense of ordering while abstaining from the aforementioned shortcoming? As a matter of fact, the idea to achieve this is surprisingly simple in that we only need to design a so-called “positional” embedding which is added on top of our actual embedding.

Ideally it should satisfy the following criteria:

1. dependent on time
2. dependent on embedding dimension (e.g. the first feature dimension could be treated differently than the second one)
3. broadcastable along batch dimension (for correctness purposes)
4. deterministic

The authors suggest a simple yet powerful scheme to realize this: $f(t, i) = \begin{cases} \sin\left(\frac{t}{10000^{\frac{i}{d}}}\right), & \text{if } i\ \equiv_2\ 0 \\ \cos\left(\frac{t}{10000^{\frac{i-1}{d}}}\right), & \text{if } i\ \equiv_2\ 1 \end{cases}$ where $\equiv_2$ is the modulo 2 relation, $t$ denotes the time and $i$ denotes the index of the feature dimension.

Let’s take a look at this TensorFlow implementation:

def add_positional_embedding(embedding, lang):
"""
:param embedding: A tf.Tensor of shape [B, T, E]
:param lang: A language code (either "de" or "en")
"""
T = tf.shape(embedding)[1]
E = embedding.get_shape().as_list()[2]

# Create grid
pos = tf.cast(tf.tile(tf.expand_dims(tf.range(T), axis=0),
multiples=[d, 1]), tf.float32)  # [d, T]
i = tf.cast(tf.tile(tf.expand_dims(tf.range(d_model), axis=1),
multiples=[1, T]), tf.float32)    # [d, T]

# Sine waves
sine = tf.sin(tf.divide(pos, tf.pow(float(10**4), tf.divide(i, E))))     # [E, T]
cosine = tf.cos(tf.divide(pos, tf.pow(float(10**4), tf.divide(i, E))))   # [E, T]

# Shift cosine by one position
cosine = tf.manip.roll(cosine, shift=1, axis=0)

# Alternate between waves depending on parity
even_mask = tf.equal(tf.mod(tf.range(E), 2), 0)   # [E]
joint_pos = tf.where(condition=even_mask, x=sine, y=cosine)     # [E, T]
joint_pos = tf.transpose(joint_pos)     # [T, E]

# Magnitude of positional embedding
gamma = tf.get_variable(name="gamma_{}".format(lang),
shape=[],
initializer=tf.initializers.ones,   # initially the same as specified by the paper
trainable=True,
dtype=tf.float32)

# Apply positional encoding
return tf.add(embedding, gamma * joint_pos, name="composed_embedding")   # [B, T, E]



The first couple of lines create a 2D grid that helps us to evaluate the $f(t, i)$ at each $t \in \lbrace 1, \ldots, T \rbrace$ resp. $i \in \lbrace 1, \ldots, d \rbrace$ (feel free to use tf.meshgrid as an alternative). Afterwards, we blindly evaluate both the sine and cosine functions on all values. Finally, a binary mask is constructed that alternates between True and False in order to select the appropriate function.

In the best case we make no a-priori assumption on the scale of our pre-trained embeddings. In this fashion, we add a trainable paramater $\gamma$ that determines how high the contribution of the positional embeddings should be.

Figure 2: The x-axis denotes time whereas the y-axis denotes the feature dimension. The juxtaposed variant is computed by alternating between the sine and the cosine representation over the feature dimension.

# Attention

There are two types of attention being used in the architectured: The “self-attention” and the “encoder-decoder” attention. The former is used both in the encoder and the decoder while the latter is solely performed by the decoder. Let us shed some light on these mechanisms:

## Self-attention

Assume that we have some input sequence $s = (s_1, \ldots, s_T) \in\ \mathbb{R}^d$ and are interested in producing a new (output) sequence $s' = (s_1', \ldots, s_T') \in\ \mathbb{R}^d$ of the same dimensionality. The process of deriving a single $s_k'$ can best be explained by the following (iterative) analogy:

1. Compute $w_t = \frac{s_k^T s_t}{\sqrt{d}}$ for $t=1,\ldots, T$
2. Normalize the weights $\hat{w}_t = \frac{\exp(w_t)}{\sum_{i=1}^T \exp(w_i)}$ using softmax over time
3. Scale each $s_t$ with its associated weight $w_t$ and sum everything up: $s_k' = \sum_{i=1}^T w_i s_i$

However, this is usually expressed in a vectorized (and thus much more GPU-friendly form) which we also see in the paper: $S' = \textrm{softmax}(\frac{S S^T}{\sqrt{d}}) S$ where $S \in \mathbb{R}^{T \times d}$ is the concatenation of all time steps.

### Batching

The above formulation assumes a single example (batch size equal to one). To enable a higher GPU throughput and to decrease the variance of the gradients, we should generalize this to arbitrary batch sizes. TensorFlow offers a range of different functions that deal with this task (e.g. tf.einsum, tf.tensordot or tf.matmul).

Using the latter we can derive the attention weights for a batch of input sequences $S \in\ \mathbb{R}^{B \times T \times d}$ like this:

W = tf.nn.softmax(tf.matmul(S, tf.transpose(S, perm=[0, 2, 1])), axis=2)


where $W \in \mathbb{R}^{B \times T \times T}$.

These weights can then be applied in the following manner to get new sequence representations:

S = tf.matmul(W, S)


### The “dynamic length” problem

Implementing soft attention this way looks easier than it actually is. This observation stems from the fact that in order to create a batch of sequences of different lengths we have to introduce some kind of padding. If we are not careful in our approach we happen to assign non-zero weight to padded values which can deteriorate the model’s performance.

Let’s define a quick example: Assume that we are dealing with a batch size of 4 where the individual sequence lengths are given by 1, 2, 3 and 4. Furthermore, we assume that the number of features is $d = 10$ (which is totally arbitrary):

import numpy as np
import tensorflow as tf; tf.enable_eager_execution()

B = 4
T = 5
d = 10
t = [1, 2, 3, 4]
S = tf.constant(np.random.randn(B, T, d))
W = tf.nn.softmax(tf.matmul(S, tf.transpose(S, perm=[0, 2, 1])), axis=2)    # [B, T, T]


Figure 3 This series of images depicts the output of the attention weight tensor $W$. Each representation was created by chopping off the first element (w.r.t the batch dimension) from the preceding representation. Values corresponding to padding are highlighted in red.

Let us now explain the problem with the aid of the code and the depiction of its result (Figure 3). Due to the presence of the softmax function each value in the output tensor is guaranteed (neglecting any numerical underflow) to be strictly positive. This, in turn, also means that the attention mechanism takes the padding into account. At best, this can confuse our model in training mode. In the worst case, however, this could even be more fatal in prediction mode as we typically don’t use any form of padding there.

But how do we correctly compute the softmax function over a subset of indices? This innocent question does not have a simple answer in the TensorFlow universe. One mathematically-motivated way would be to set the values pertaining to the padding to $-\infty$ which would drive the exponentials to zero. Unfortunately, this provokes NaN gradients which will quickly set an end to our training efforts.

The other way would be to write a new softmax function having in mind that TensorFlow’s implementation does a lot more than a naive computation - it also protects us from numerical under- and overflows.

Let’s take a look at the classical formulation of the function: $g_j(x_1, \ldots, x_N) = \frac{\exp(x_j)}{\sum_{k=1}^N \exp(x_k)}$

Now we subtract some arbitrary constant $c$ from each argument: \begin{aligned} g_j(x_1 - c, \ldots, x_N - c) &= \frac{\exp(x_j - c)}{\sum_{k=1}^N \exp(x_k - c)} \\ &= \frac{\exp(x_j)\ \exp(-c)}{\sum_{k=1}^N \exp(x_k)\ \exp(-c)} \\ &= \frac{\exp(-c)\ \exp(x_j)}{\exp(-c) \sum_{k=1}^N \exp(x_k)} \\ &= \frac{\exp(x_j)}{\sum_{k=1}^N \exp(x_k)} \\ &= g_j(x_1, \ldots, x_N) \end{aligned}

As we can pick $c$ arbitrarily, we choose $c = \max(x_1, \ldots, x_N)$ which upper bounds the numerator by 1 as all shifted arguments are either negative or zero (attained for $x_i$ w. $i = \textrm{argmax}(x_1, \ldots, x_N)$). This makes us immune to numerical overflows!

Unfortunately, by using this fix we actually increased our vulnerability to numerical underflows as some arguments can be pushed to very negative values after shifting. Feeding these values into the exponential function causes the result to become so negligibly small that it is inevitably rounded down to zero.

Why is this a problem? Assume that the maximum $c$ is attained at some index $i$ that belongs to the padding (i.e. $i \geq l$) where $l \leq T$ is the actual length. Because all arguments are shifted by $c$ we observe that $\exp(x_k - c)$ can easily become extremely small for each $k \neq i$ as the exponential function quickly decays for negative arguments. In the worst case all values that are not associated to the padding can become zero which elicits NaNs after normalizing. Hence, we want to calculate the maximum solely over the defined values.

One well-defined way to do this goes as follows:

def get_safe_shift(logits, mask):
"""
:param logits: A tf.Tensor of shape [B, TQ, TK] of dtype tf.float32
:param mask: A tf.Tensor of shape [B, TQ, TK] of dtype tf.float32
where TQ, TK are the maximum lengths of the queries resp. the keys in the batch
"""

# Determine minimum
logits_min = tf.reduce_min(logits, axis=2, keepdims=True)      # [B, TQ, 1]
logits_min = tf.tile(logits_min, multiples=[1, 1, TK])  # [B, TQ, TK]
logits =  tf.where(condition=mask > .5, x=logits, y=logits_min)

# Determine maximum
logits_max = tf.reduce_max(logits, axis=2, keepdims=True, name="logits_max")      # [B, TQ, 1]
logits_shifted = tf.subtract(logits, logits_max, name="logits_shifted")    # [B, TQ, TK]

return logits_shifted


Conceptually speaking, we calculate the minimum over all keys. Afterwards, we replicate this tensor to the same shape that it had before the reduction. Using the mask, we can achieve that all values pertaining to the padding are set to the corresponding minimum, ensuring that the calculation of the maximum is independent of the padding. This way, at least one defined value has a numerator equal to one which ultimately makes a division by zero impossible.

### Normalizing the logits

Let us use the above logits to derive the attention weights using a variant of softmax:

def padding_aware_softmax(logits_shifted, mask):

# Apply exponential
weights_unscaled = tf.exp(logits_shifted)

weights_unscaled = tf.multiply(mask, weights_unscaled)     # [B, TQ, TK]

# Derive total mass
weights_total_mass = tf.reduce_sum(weights_unscaled, axis=2, keepdims=True)     # [B, TQ, 1]

# Avoid division by zero
x=weights_total_mass,
y=tf.ones_like(weights_total_mass))

# Normalize weights
weights = tf.divide(weights_unscaled, weights_total_mass)   # [B, TQ, TK]

return weights


The general idea is to apply the exponential function on all elements and multiplying the resulting representation with the mask to filter out the contributions of the padding. To (properly) normalize these values it is not sufficient to compute the mean naively (via tf.reduce_mean) as the sequence length is dynamic. Fortunately, we can can come up with a mathematically sound way of doing this by summing up the mask and dividing by the result.

Note: Due to the dynamic sequence length it is possible that weights_total_mass contains zero values. This is absolutely correct and stems from the fact that all previously non-zero values in weights_unscaled have been set to zero by the mask. As these attention weights are irrelevant, we can avoid a “division by zero” error by setting the divisor to one wherever necessary.

To increase the model’s attention capabilities the authors also use the concept of having multiple attention heads.

Mathematically speaking, assume that some matrices $Q, K, V \in \mathbb{R}^{T \times d}$ are given which denote the queries, keys and values. For each attention head we use a different set of weights $W_Q, W_K, W_V \in \mathbb{R}^{d \times d'}$ to derive: \begin{aligned} Q' &= Q W_Q \in \mathbb{R}^{T \times d'} \\ K' &= K W_K \in \mathbb{R}^{T \times d'} \\ V' &= V W_V \in \mathbb{R}^{T \times d'} \end{aligned} Applying the conventional (previously described) attention mechanism on $Q', K', V'$ thus yields different attention weights. In theory, this lets each attention head focus on different aspects. The results of each attention head are afterwards concatenated over the feature dimension.

In order to make the output tensor’s shape identical to the input tensor’s shape (which is necessary as we would like to use many layers), we use another linear mapping at the end.

Note: It is possible that the attention heads are not learning sufficiently diverse strategies. This phenomenon becomes apparent if the assigned attention weights are more or less the same over all attention heads. As each of these define a proper probability distribution, this unwanted effect can be mitigated by imposing a penalty to increase their distributional dissimilarity (e.g. by leveraging the Kullback-Leibler divergence).

def attention(query, key, value,
query_length, key_length,
"""
:param query: A tf.Tensor of shape [B, TQ, E]
:param key: A tf.Tensor of shape [B, TK, E]
:param value: A tf.Tensor of shape [B, TK, E]
:param num_heads: How many sets of weights are desired
"""
with tf.name_scope("attention"):

# Make shapes compatible for tf.Dense
B = tf.shape(query)[0]
embedding_size = query.get_shape().as_list()[2]

# Number of times steps (dynamic)
TQ = tf.shape(query)[1]
TK = tf.shape(key)[1]

query = tf.reshape(query, shape=[B * TQ, embedding_size], name="query")   # [B * TQ, E]
key = tf.reshape(key, shape=[B * TK, embedding_size], name="key")   # [B * TK, E]
value = tf.reshape(value, shape=[B * TK, embedding_size], name="value")   # [B * TK, E]

# Linear projections & single attention

# Apply three different linear projections
activation=None,
use_bias=False)(query)  # [B * TQ, num_head_dims]
activation=None,
use_bias=False)(key)  # [B * TK, num_head_dims]
activation=None,
use_bias=False)(value)  # [B * TK, num_head_dims]
# Reshape

# Derive attention logits

# Rescale

# Normalize weights
query_length=query_length,
key_length=key_length)     # [B, TQ, TK]

# Apply weights to values

# Contract list of Tensors

# Project back to input embedding size (E)
output = tf.layers.Dense(units=embedding_size,
activation=None,
output = tf.reshape(output, shape=[B, TQ, embedding_size], name="attention")   # [B, TQ, E]

return output


## Encoder-decoder attention

The encoder-decoder attention obeys the same principles as the self-attention previously described with the difference of the data coming from different sources. The queries $Q$ come from the decoder whereas the keys $K$ and values $V$ come from the encoder.

# Encoder

The encoder aims to construct a semantic representation of our input sentence which the decoder can eventually use to derive the output sentence. The input sequence is hereby given by the sum of the original embedding (in our case the fastText embeddings) and the positional embedding.

One layer consists of:

1. Self-attention, combines global information according to their attention scores (i.e. their importance)
2. Dense, applied independently over all positions

Typically many of these layers are stacked on top of each other to increase the model’s expressiveness. For this to be possible, it is required that the output shape is identical to the input shape.

def encoder_layer(x, x_length,
num_dense_dims, dense_activation,
dropout_rate, is_training):
"""
Encoder layer
:param x: A tf.Tensor of shape [B, T, E]
:param x_length:  A tf.Tensor of shape [B]
:param num_dense_dims: Dimension of linear projection between dense layers
:param dense_activation: Activation function of the dense layers
:param dropout_rate: How many neurons should be deactivated (between 0 and 1)
:param is_training: Whether we are in training or prediction mode
:return: A tf.Tensor of shape [B, T, E]
"""

B = tf.shape(x)[0]
T = tf.shape(x)[1]
E = x.get_shape().as_list()[2]

""" First sub-layer (self attention) """
layer_norm_opts = {
"begin_norm_axis": 2,
"begin_params_axis": 2,
"scale": False,
"center": True
}

# Residual
residual = x    # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)      # [B, T, E]

# Self attention
x = attention(query=x,
key=x,
value=x,
query_length=x_length,
key_length=x_length,

# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[B, 1, E])(x, training=is_training)     # [B, T, E]

# Residual connection
x = x + residual    # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)      # [B, T, E]

""" Second sub-layer (dense) """

# Residual
residual = x    # [B, T, E]

# Reshape to make output compatible with tf.layers.Dense
x = tf.reshape(x, shape=[B * T, E])    # [B * T, E]

# Dense
x = tf.layers.Dense(units=num_dense_dims,
use_bias=True,
activation=dense_activation)(x)    # [B * T, num_dense_dims]
# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[1, num_dense_dims])(x, training=is_training)     # [B * T, num_dense_dims]

# Dense
x = tf.layers.Dense(units=E,
use_bias=True,
activation=dense_activation)(x)     # [B * T, E]

# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[1, E])(x, training=is_training)     # [B * T, E]

# Reshape again
x = tf.reshape(x, shape=[B, T, E])     # [B, T, E]

# Residual connection
x = x + residual     # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)     # [B, T, E]

return x


## Dropout

We have decided to use a dropout mask that is constant over all timesteps, a pattern that is also often used to train LSTMs. This can be achieved by specifying the noise_shape argument of tf.layers.Dropout to be [B, 1, E] where the first dimension is the batch dimension, the second one the time dimension and the third one denotes the feature dimension. This will afterwards be internally broadcasted to randomly mask out some portion of the input tensor.

Note: Dropout needs to behave differently depending on whether we are currently training the model or using it in prediction mode. Unfortunately, simply using tf.layers.Dropout as described by the documentation doesn’t apply dropout by default (it is in prediction mode by default). To enable it, we have to manually specify the current mode by using the training argument of its call method.

## Layer normalization

We use tf.contrib.layers.layer_norm to realize the layer normalization. However, we need to be very careful to use it in the right way.

Let me provide some reasoning on how we chose the actual arguments:

1. center = True: Shifts the result by some trainable variable.
2. scale = False: Scales the result by some trainable variable. Since we are using transformations that can apply their own scaling, this becomes unnecessary and we disable it to cut down on the number of weights
3. begin_norm_axis = 2: Setting this to 2 means that we are calculating the moments solely over the feature dimension. The default (1) would also calculate the moments over the time dimension which is undesired as the dynamic padding would skew these statistics.
4. begin_params_axis = 2: Setting this to 2 (the same as the default of -1) means that the centering variable has a shape identical to the feature dimension (denoted as E). Other choices are not possible here as the dimension needs to be known at graph construction time.

## Residual connections

This makes up the easiest part in the code. Before each stage we retain the input using a variable named residual which is afterwards added back to the output of the stage. This scheme is advantageous as the model can drive the output of the stage to zero to fully retain the input. If we were not to have these residual connections, the model could, in theory, learn to mimic the identity function. In practice, however, this is perceived to be more difficult hence justifying the popularity of residual connections.

# Decoder

The decoder assembles the output sentence by peeking over the output of the encoder. To come up with meaningful sentences, it needs to perform two important tasks:

1. Keeping track of the tokens it already generated
2. Extracting the appropriate information from the encoder for each token to be generated

Especially the first task exposes a common pitfall. To impose this constraint we need to design another mask that prohibits the decoder to see the “future” while allowing it to glance over the previous tokens. Let’s take a look at Figure 4 to see why this is necessary. Assume that we were allowed to look into the future. Obviously the decoder could now just learn to “cheat” by copying the token from the next time step into the prediction of the current time step. Masking out these future contributions is thus an effective means to rule out this behavior.

Figure 4 The incremental formulation conditions the decoder on the tokens it has already generated. For example, given the incomplete sentence “«START» I like the beaches in” and a German input sentence “«START» Ich mag die Strände in Kalifornien «END»”, the decoder is expected to assign a high probability to the token “California”.

From a structural point of view one decoding layer consists of:

1. Self-attention
2. Attention w.r.t encoder output
3. Dense, applied independently over all positions
def decoder_layer(x, x_length,
encoder_out, encoder_out_length,
dense_activation, dropout_rate, is_training):
"""
Decoder layer
:param x: A tf.Tensor of shape [B, T, E]
:param x_length:  A tf.Tensor of shape [B]
:param encoder_out: A tf.Tensor of shape [B, T', E]
:param encoder_out_length: A tf.Tensor of shape [B]
:param num_dense_dims: Dimension of linear projection between dense layers
:param dense_activation: Activation function of the dense layers
:param dropout_rate: How many neurons should be deactivated (between 0 and 1)
:param is_training: Whether we are in training or prediction mode
:return: A tf.Tensor of shape [B, T, E]
"""

B = tf.shape(x)[0]
T = tf.shape(x)[1]
E = x.get_shape().as_list()[2]

""" First sub-layer (self attention) """
layer_norm_opts = {
"begin_norm_axis": 2,
"begin_params_axis": 2,
"scale": False,
"center": True
}

# Residual
residual = x    # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)      # [B, T, E]

# Self attention
x = attention(query=x,
key=x,
value=x,
query_length=x_length,
key_length=x_length,

# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[B, 1, E])(x, training=is_training)     # [B, T, E]

# Residual connection
x = x + residual    # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)      # [B, T, E]

""" Second sub-layer (attention over encoder output) """

# Residual
residual = x    # [B, T, E]

# Attention
x = attention(query=x,
key=encoder_out,
value=encoder_out,
query_length=x_length,
key_length=encoder_out_length,

# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[B, 1, E])(x, training=is_training)  # [B, T, E]

# Residual connection
x = x + residual    # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)

""" Third sub-layer (dense) """

# Residual
residual = x    # [B, T, E]

# Reshape to make output compatible with tf.layers.Dense
x = tf.reshape(x, shape=[B * T, E])    # [B * T, E]

# Dense
x = tf.layers.Dense(units=num_dense_dims,
use_bias=True,
activation=dense_activation)(x)    # [B * T, num_dense_dims]
# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[1, num_dense_dims])(x, training=is_training)     # [B * T, num_dense_dims]

# Dense
x = tf.layers.Dense(units=E,
use_bias=True,
activation=dense_activation)(x)     # [B * T, E]

# Dropout
x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[1, E])(x, training=is_training)     # [B * T, E]

# Reshape again
x = tf.reshape(x, shape=[B, T, E])     # [B, T, E]

# Residual connection
x = x + residual     # [B, T, E]

# Layer normalization
x = tf.contrib.layers.layer_norm(inputs=x, **layer_norm_opts)     # [B, T, E]

return x


The settings for dropout and layer normalization follow the same guidelines that we already defined for the encoder.

# Loss

Figure 4 tells us that we need to need suitably align the decoder’s output with our ground truth. This stems from the fact that the last prediction for which we have a corresponding ground-truth token conditions the decoder on “«START» I like the beaches in California” and expects the decoder to predict the end token.

We can achieve this by using the following code snippet:

decoder_out = decoder_out[:, :-1, :]    # [B, T - 1, num_words]
reference = en_token_ids[:, 1:]     # [B, T - 1]


The variant tf.nn.sparse_softmax_cross_entropy_with_logits of cross entropy does not expect the labels to be one-hot encoded and thus abstains from wasting precious GPU memory.

We derive a sequence mask using tf.sequence_mask whose arguments take into account that we had to truncate the decoder’s output for the sake of comparability.

Furthermore, we abstain from using a mean reduction on the loss to make its scale invariant w.r.t the appended padding.

# Cross entropy
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=decoder_out, labels=reference)     # [B, T - 1]

T = tf.shape(en_text)[1]
maxlen=T - 1,
dtype=tf.float32)   # [B, T - 1]

loss = tf.multiply(mask, loss)  # [B, T - 1]

loss = tf.reduce_sum(loss, axis=1)   # [B]
loss = tf.divide(loss, mask_mass)   # [B]
loss = tf.reduce_mean(loss)     # []


# Learning rate schedule + optimizer

We use the Nesterov [Nesterov, 1983] optimizer together with a restarting cosine decay [Loshchilov & Hutter, 2016]. Both concepts help us to escape from local minima: On the one hand, Nesterov uses a momentum term to increase the gradient’s magnitude once it gets close to zero. On the other hand, each warm restart cycle ramps up the base learning rate at the beginning and smoothly decays it afterwards until it reaches some (possibly cycle-dependent) minimum.

def get_train_op(loss,
initial_learning_rate,
first_decay_steps,
t_mul,
m_mul,
alpha):

# Use cosine decay with warm restarts
learning_rate = tf.train.cosine_decay_restarts(learning_rate=initial_learning_rate,
global_step=tf.train.get_global_step(),
first_decay_steps=first_decay_steps,
t_mul=t_mul,
m_mul=m_mul,
alpha=alpha)

# Optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
momentum=momentum,
use_nesterov=True)
train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

return train_op


## Parameters

We utilize tf.train.cosine_decay_restarts which expects these parameters:

1. learning_rate ($l_0$): The initial learning rate ($t = 0$)
2. global_step ($t$): Use tf.train.get_global_step() here
3. first_decay_steps ($T_0$): How many time steps the first cycle endures
4. $t_{\textrm{mul}}$: This factor governs how much longer the next cycle should be in terms of time steps (e.g. $t_{\textrm{mul}} = 2$ would imply doubling the cycle length after the first cycle switch)
5. $m_{\textrm{mul}}$: The same concept for the cycle-dependent initial learning rate (e.g. $m_{\textrm{mul}} = \frac{1}{2}$ halves the initial learning rate after the first cycle switch)
6. $\alpha$: This defines the minimum learning rate that is attained at the end of a cycle. Having a learning rate $l$ at the beginning of some cycle $c$ implies a decrease until $\alpha \cdot l$ is reached

Figure 5 Exemplary evolution of the learning rate schedule for $l_0 = 0.05,\ \alpha = 0.1,\ t_\textrm{mul} = 0.9,\ m_\textrm{mul} = 1.0,\ T_0 = 1000$

## Deployment

The model was tested on Amazon’s Deep Learning Ubuntu Base AMI but can be run in any environment that supports Docker. I have written a couple of scripts that automatically install the dependencies.

Warning: Even though I have carefully tested all the scripts it is possible that their execution harms the sanity of your system and can lead to loss of data or other undesired effects. I do not take any responsibility for potential damages. Please read the scripts thoroughly before you use them!

Warning: The following scripts fetch data from third-party sources. Upon execution you implicitly acknowledge their terms of agreements and any restrictions they impose.

### Training

cd AttentionIsAllYouNeed

# Install docker, nvidia-docker, all python packages + spaCy corpora
chmod +x ec2_dl_ami_setup.sh
sh ./ec2_dl_ami_setup.sh

# Start training process
chmod +x dispatch_docker.sh
sh ./dispatch_docker.sh


The last script generates a new TensorFlow checkpoint directory in the MT_Data folder. You can use this to let TensorBoard visualize the process in real-time.

### Prediction

cd AttentionIsAllYouNeed

sudo nvidia-docker run -v realpath MT_Data:/data/ \
--rm attention_is_all_you_need \
--mode predict \
--prediction_scheme beam \
--de_vocab /data/wiki.de_tokens.txt \
--en_vocab /data/wiki.en_tokens.txt \
--pre_trained_embedding_de /data/wiki.de_embeddings.npy \
--pre_trained_embedding_en /data/wiki.en_embeddings.npy \
--config config.yml \
--model_dir /data/TF_<<CHECKPOINT_ID>>


### Example run

I have defined a lightweight version (config_lite.yml) that only leverages two encoder- and two decoder layers with four attention heads per layer. The embedding size $d = 300$ is given by the pre-trained fastText embeddings. The attention heads’ projection layers contain 50 units whereas the dense layers use 600 units to derive the intermediate representations. I have trained this model for roughly 10 hours on my dated notebook that has a GeForce 940MX at its disposal. The training process was aborted once it consistently surpassed a cross entropy of 2.

I used a beam search to decode the following examples:

Input: "Portugal, Spanien und Deutschland sind europäische Länder"
Output: "portugal , spain and germany are european countries"
[0.898 0.937 0.727 0.789 0.98  0.796 0.871 0.966]
Score: -0.128

Input: "Portugal, Spanien und Deutschland sind Länder in Europa"
Output: "portugal , spain , spain and germany are countries in europe"
[0.892 0.94  0.744 0.24  0.705 0.934 0.981 0.712 0.902 0.844 0.98 ]
Score: -0.247

Input: "Im Jahr 2014 wurde Deutschland Fußballweltmeister"
Output: "in , germany was <UNK>""
[0.615 0.821 0.959 0.504 0.134]
Score: -0.571


The last example shows that out-of-vocabulary words such as 2014 or Fußballweltmeister (world in champion in soccer) are difficult to grasp. This can be resolved (to some extent) by increasing the vocabulary size or by identifying numbers beforehand.

# Final words

Feel free to give me some hints on how to further improve the quality of this article.