May 18, 2019

May Papers

Lately I’ve been interested in how machine learning can help artists create animations for characters and critters. So, you’ll find that many of the papers I read in May are connected to this theme. I try to keep the descriptions of the papers brief. Those that particularly inspire me might end up posts of their own in the future.

Phase-Functioned Neural Networks for Character Control

This is the first paper I read in ML generated animation space. It sets up the animation network as this neat conditional neural network. First, they have a function that takes in a phase state (think of this as a value indicating what point in the cycle of the animation the character currently is in) and produces the weights for a second neural network that converts the data from last frame + user input into the motion for the current frame + a change in phase value. The phase function is modeled using a Catmull-Rom Spline that interpolates between 4 sets of expert weights. The cool thing about the Catmull-Rom spline is that it can create a loop, and the authors leverage this capability so that the phase function is cyclic. This is a natural choice for this domain, since the walking animation for characters are looped.

Mode-Adaptive Neural Networks for Quadruped Motion Control

This paper is kind of an extension of the phase-functioned paper before it. Instead of using a spline to interplate between experts, the authors use a neural network that predicts mixing values α1,α2,...,αn_1, _2, …, nα1,α2,...,αn for the expert weights given the motion from the previous frame. These mixing weights are then used to compute the parameters of the neural network θ=i=1nαiθi= {i=1}^{n} _i _iθ=i=1nαiθi where θi_iθi are the ith expert’s parameters.

The authors analyze the behavior of their network by inspecting the values of the expert weights over time. There is some interesting periodicity in the behavior of the weights. They also ablate certain experts and observe the resulting effect on the animation of the character. They found certain experts were tailored towards moving left vs right, and some experts focused on high-frequency components of the motion that contribute to the motion’s natural feel. I love this kind of analysis, and I think the author’s did an amazing job.

I feel like this technique could be extended pretty naturally to arbitrary time-series data. Especially time-series data that has a periodic or seasonal component to it.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

After reading the mode adaptive paper, I was wondering how common it was to mix the weights of the experts rather than the output of the experts. This lead me to looking at this paper, where the authors try to train very large mixture of experts neural networks with gating functions. In this case the gating functions gate the output of the experts, rather than their weights. There is also a big focus on enforcing sparsity — they truncate the gating values to the kkk largest weights.

This is quite different than the Mode-Adaptive approach, which didn’t care about sparsity nor gating the outputs. I wonder if the mode-adaptive network authors tried this?

Learning feed-forward one-shot learners

This paper was referenced in the Mode-adaptive NN paper in the discussion section as a possible technique one could use to generate walking animations for quadrupeds for which you don’t have motion capture data. To do so, you assume you already have some diverse database of motion capture data for various quadrupeds. You then learn a one-shot learner on this training data — a model that learns to take an exemplar for a type of unknown quadruped and produce the weights of a NN that would produce the walking animation.


Where WW’W are the parameters of the meta network, WWW are the parameters used by the walking animation network, zzz is an exemplar, and xxx is frame data used to produce the animation.

The challenge here, however, is that learning a NN that produces weights for another NN is intractable when done naively, since the output space WWW of the meta-network is massive. The author’s, instead of directly building WWW off the exemplar, modify the model such that it uses the exemplar to generate the singular values of a factored matrix:


Where M1M_1M1 and M2M_2M2 are learned matrices independent of the exemplar, and fff is a NN that takes as input the exemplar and produces the diagonal values of matrix.

I thought this paper was neat, but it requires one already having a dataset that is pretty expensive to curate. I also wonder if there are better ways to make the meta-learning problem tractible.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This one is a bit different from the overall theme of the paper’s I read this month. But, it was hard to pass this paper without reading it, since it has been making such waves in the ML community. The authors set up a clever training procedure that lets them use bi-directional transformers rather than limiting the model to be auto-regressive. The end result is a model that can take in short sequences (~512 tokens) and produce context-aware embeddings for the tokens for the sequence. It also can produce whole-sequence embeddings via the clever use of a [CLS] token.

Overall, their results are a very impressive demonstration of what is possible in the unsupervised learning space when you can train massive models on an absurd amount of data. I suppose technically this model isn’t unsupervised since they train against supervision signals like predicting sequential sentences and predicting masked words. But these signals are pretty weak, and the resulting learned representations are transferable to other tasks.

Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

I ran into this paper after reading the StyleGAN paper last month. The StyleGAN uses AdaIn (Adaptive instance normalization) layers in its architecture, so I figured it’d be a good idea to learn about where AdaIn came from.

The inspiration for AdaIn seems to come from the instance normalization layer”, which re-scale’s the input features such that their spatial mean and standard deviation match two learnable parameters γγ and ββ.


Empirically the style transfer literature has established that the style” of an image is partially captured by the spatial variation of convolutional features, and that this variation can be summarized with low-order moments like the mean and standard deviation. So, by re-normalizing the features to match a specified mean and standard deviation, you are re-styling the image.

Adaptive instance normalization takes this idea one step further by removing the learnable parameters and having instance normalization match the moments coming from the the features of the style image itself.


There is quite a bit more to this paper, especially in the definition of the loss function and the overall model architecture. But, I’ll stop here, otherwise I’d risk writing a whole post on this.

Machine learning Computer graphics Monthly papers
May 13, 2019

Multi-headed attention as matrix multiplication

Today I’ll walk through how to implement multi-headed attention with a series of matrix multiplications. Computing attention in this way is more efficient than iteratively visiting each input vector and computing its associated output. Matrix multiplication operations are highly optimized and are designed to be parallelzed across threads on the GPU.

To help us, I’ll also introduce two interpretations of matrix multiplication, both of which are super useful even outside the context of computing multi-headed attention.

Computing the attention weights

Suppose you had two sets of vectors of identical dimension a1,a2,...,an_1, _2, …, _na1,a2,...,an and b1,b2,...,bm_1, _2, …, _mb1,b2,...,bm and you construct a matrix AAA by stacking the ai_iai vectors as rows, and a matrix BBB by stacking the bi_ibi vectors as columns. You can then interpret the multiplication ABABAB as a pairwise similarity matrix:


We can use this to help us compute our attention weights. Recall that in computing a self-attention layer, we iterate through each query vector, and compute the scaled dot product against all of the key vectors. For the ith query vector, we compute the weights with:


Note the (WQvi)(WKvj)(W^{Q}_i) (W^{K}_j)(WQvi)(WKvj) term. If we were able to compute all pairwise dot-product similarities, we would be well on our way to computing the attention weights. Lets compute the similarities with a few matrix multiplications:


SSS is a scaled pairwise dot-product similarity matrix (ddd being the dimension of the query/key embedding space), III is a matrix constructed by taking our input vi_ivi vectors as rows, and the matrices QQQ and KKK are the input vectors embedded into query and key space, respectively. That is, the ith row of IWQIW^QIWQ is WQviW^Q_iWQvi, and the ith row of IWKIW^KIWK is WKviW^{K}_iWKvi. Note that we need to take the transpose of KKK in computing SSS in order to stack the key vectors along columns rather than the rows.

Computing the similarity matrix isn’t enough; we need to turn this similarity matrix into attention vectors for each query exponentiating all of the entries and then normalizing each row to sum up to 1. This is the same as taking the softmax of the rows of the matrix.


After taking this softmax, we have the property that the ith row of AAA is the attention weight vector for the ith input vector. Concretely, Ai,jA_{i,j}Ai,j tells us, when computing the output for input vector vi_ivi, how much weight wjw_jwj to assign input vector vj_jvj.

Computing the outputs

Now, we will leverage a second useful interpretation of matrix multiplication in order to use our attention matrix to compute output vectors. Suppose we had vectors a1,a2,...,an_1, _2, …, _na1,a2,...,an and b1,b2,...,bm_1, _2, …, _mb1,b2,...,bm, with the dimension of the aa vectors being mmm. Construct a matrix AAA by stacking ai_iai vectors as rows, and a matrix BBB by stacking the bi_ibi vectors also as rows. Note that this is different than the set up we had for the other matrix multiplication interpretation! The key difference is that we are stacking the bb vectors as rows rather than columns.

Now, we can interpret each row of the product ABABAB as the weighted sum of the rows of BBB, with the weights coming from the corresponding row in AAA:


This means that we can use a single matrix multiplication in order to compute many weighted sums of vectors in parallel, as long as we pack those vectors as rows of a matrix. You can find an excellent visualization of this here.

We can use this trick to compute our self-attention output vectors. Recall that our output vectors are computed as the weighted sum of the value-embedded input vectors:


Using the trick we just learned, we can represent this as the following


Where the matrix VVV is the value-embedded input vectors stacked as rows, the matrix AAA is our attention weight matrix from the previous section with the attention weight distributions for each input vector stacked as rows. The matrix OOO contains our output vectors, with the ith row of OOO being the output for the ith input vector.

Finishing thoughts

We can summarize the whole computation as follows:


This finally gets us to how multi-headed attention is expressed in the Attention is all you need paper. It took three posts to unpack the intuition and reasoning behind the equation!

Machine learning Transformer models Attention
May 5, 2019

Multi-headed attention

Let’s build off of the vanilla self-attention layer I described in my last post to construct a multi-headed attention layer. The intuition behind multi-headed attention is that different input vectors might relate to each other semantically in multiple ways. Consider the sentence I am going to deposit my money at the bank”. When computing an output vector for deposit”, it is likely important to attend to bank” as the other side of the connecting preposition at”. Similarly, we’d also likely need to attend to I”, since I” is the actor that is performing the action deposit”.

But, here’s the rub. The relationship between deposit” vs bank” and deposit” vs I” are quite different. One is a prepositional relationship while the other is a subject-action relationship. Therefore, we might like to produce two separate output vectors that capture these two different senses, and use distinct attention distributions in computing them. The heads of a multi-attention layer are designed to compute these distinct output vectors and attention distributions.

A single head, attempt 1

To start off simply, let us first construct a single head by re-using the machinery we already have from self-attention. Recall that this head wants to construct attention distributions that capture a specific kind of semantic relationship between words. This means, intuitively, that before running self-attention on the vectors, we could embed them in a new space in which closeness represents that semantic relationship. Let’s represent that embedding operation with a matrix WWW. Then, we could embed all the input vi_ivis using the matrix WWW, and then run self-attention on those intermediate embeddings. That would look something like this:


Since we don’t know which semantic embeddings are useful for solving the underlying task, the parameters of WWW would be trainable. Also, note ddd now changes to the dimension of the embedding space, to remain consistent with its original purpose.

A single head, attempt 2

This gets us close to how multi-headed attention is computed in the literature. But, we aren’t quite there yet. Notice that many relationships between words are not symmetric. For example, in our example sentence, the tokens I” and deposit” are subject-verb related. But, I” plays the distinct role of subject”, and deposit” plays the role of verb”. Flipping those roles would create a sentence with a very different meaning. Unfortunately, our current embedding enforces these semantic relationships to be symmetric since both the query vector and the non-query vectors are all embedded with the same matrix, and the dot-product operation is symmetric:


If we wanted to break this symmetry, we can replace the single WWW matrix with two: one matrix WQW^{Q}WQ that transforms only the query vectors, and a second matrix WKW^{K}WK that transforms the non-query vectors. In the literature, you’ll find that these non-query vectors are referred to as keys — hence the K superscript for the matrix.

This gives us the following computation for our single attention head:


A single head, final attempt

Now that we have broken the symmetry between queries and keys, our last step will take a closer look at the computation of the output vector:


As the simple weighted sum of the input vectors, this output doesn’t have much flexibility for expression. What if the vectors need to participate in the output in different ways? From our example sentence, the tokens I”, deposit”, and money” all participate in a subject-object-verb” relationship. If we imagine the embedding for I” as the query vector, it might be the case that deposit” and money” are all equally important in terms of attention. However, we may need to push forward into the output the fact that we are using the token deposit” with its verb sense rather than, say, its noun sense, and that we are using money” as an object rather than the subject.

A solution to this is yet another transformation matrix WVW^{V}WV that takes the input vectors and transforms them into an embedding space suitable for combination in the final weighted sum. In the literature, you’ll find that these vectors are referred to as values, giving us the V” superscript for the matrix. This leaves us with the following final set of equations for computing a single attention head:


We will define this entire computation as:


To summarize, we introduced three new learnable matrices WKW^{K}WK, WQW^{Q}WQ, and WVW^{V}WV that all transform the input vectors into key, query, and value embedding spaces. Closeness between vectors in the query and keys spaces is used to compute attention weights, while the value embedding space is used in computing the final output vector.

Putting our heads together

Now that we’ve described how to construct a single head, bringing this back to multi-headed attention is fairly straightforward. The idea is to define KKK single headed attention layers, with their own uniquely defined query-key-value matrices, then execute them in parallel against the same set of input vectors. Finally, we concatenate the output vectors coming from the different heads together into a single output vector. We’ll introduce a subscript to the matrices indicating the index of the head to which they belong. We’ll also introduce a superscript to the output vectors indicating the head from which they were computed.

o1(i),o2(i),...,on(i)=Attention(v1,v2,...,vn;[WiQ,WiK,WiV])for i = 1…Ko1(i),o2(i),...,on(i)=Attention(v1,v2,...,vn;[WiQ,WiK,WiV])for i = 1…K



That is all for multi-headed attention! In a future post I discuss how to pack the attention computations into matrix multiplications for GPU optimization. I’ll likely also write a post showing how to actually build these attention layers in tensorflow. Thank you for reading!

Machine learning Transformer models Attention
April 27, 2019

A vanilla self-attention layer

Transformer architectures have become a fairly hot topic in machine learning since the Attention Is All You Need” paper was published in 2017. Since then, they have been applied to a variety of domains like image generation, music generation and language representation models. If there is a problem involving processing sequences, I bet someone has thrown the transformer model at it.

In this post I’m going to dig into a small piece of the transformer architecture — the self-attention layer. I find that this layer often confuses folks a great deal. I suspect part of the problem is the way it is presented in relation to other attention mechanisms, which often adopt the query, key, value” terminology to describe their various components. However, self-attention ends up being, in a way, a degenerate case of attention where the query, keys, and values are all practically the same! So, we are going to throw away most of that language for now, leaving only the term query”, and walk through the exact computations of the self-attention layer. In a follow-up post I’ll likely reintroduce the terminology to describe concepts that build on top of self-attention, like multi-headed attention.


A self-attention layer takes as input a sequence of vectors v1,v2,...,vn_1, _2, …, _nv1,v2,...,vn and produces as output a one-to-one corresponding sequence o1,o2,...,on_1, _2, …, _no1,o2,...,on.

We will visit each input vector vi vi in turn, and compute its output vector oi oi. When we visit an input vector, we will call it the query vector, and all other input vectors the non-query vectors.

The output for a query vector is computed by taking the weighted sum of all the non-query input vectors, where the weight for each is proportional to the exponentiated scaled dot-product of that vector and the query. The scale for the dot-product is the square-root of the dimensionality of the vectors, denoted d d d in the equation. The authors of the original paper report this scaling is to combat vanishing gradients in the weights for dot products between high dimensional vectors.


The dot-product measures the similarity between two vectors. This gives us an important property — vectors in the input sequence similar to the query receive high weight, while vectors dissimilar to the query receive low weight. Note that the weights can be interpreted as the scaled softmax over the dot-products of all the input vectors to the query.

And…that is it. In summary, to compute the output sequence, we visit each input vector, declare it the query, compute the weights of the non-query vectors, and compute the weighted sum.


When computing the output for a query vector, the weights associated with the input vectors, all together, is called an attention vector. The magnitude of the weight for an input vector is a measure of how salient the information in that vector will be in the output, with the saliency being defined, by the usage of the dot-product, as the degree of similarity between the query and the input vector. This can be counter intuitive to some, since there doesn’t seem to be any learned parameters here. In other words, how does the model learn what to pay attention to? Folks often mention the case of embeddings for words — just because two words might be semantically similar (are close in the embedding space), doesn’t mean that they should have high attention scores for one another.

I think if the self-attention layer is computed in isolation, without additional context around it, then this concern is valid. However, in most uses of self-attention, the embeddings are typically trainable, which means that the model will learn to craft word embeddings that are useful for the self-attention layer to perform well on the target task. Also, the self-attention layer rarely processes the raw word vectors themselves. Often self-attention layers are a component in a larger multi-headed attention layer, which I cover in a future post.

Another thing worth mentioning is that the way I outlined computing the self-attention layer is very inefficient, and doesn’t take advantage of the parallelism possible in modern GPUs. This was done to make the concept of self-attention as clear as possible, without obscuring it with optimizations. Typically, what you’d do instead is package the computation as a large matrix multiplication. I cover this in another post as well.

Machine learning Transformer models Attention