This article presents a new notation for the deep learning era. The notation simplifies the manipulation of so-called contractive expressions, which are ubiquitous in tensor processing. The backpropagation equations are derived for concrete examples that are known to be hard to derive manually, such as strided and tiled convolution. Potential use cases are discussed, such as raising the abstraction level of deep learning software and reducing code bloat.


$$ \newcommand\grad[2]{\frac{\partial #1}{\partial #2}} $$

Introduction

When teaching deep learning, you reach a climax when deriving the backpropagation (a.k.a. backprop) algorithm. It is such a simple algorithm, yet it is what drives all the magic. Still, there is a superficial difficulty to the derivations in every textbook I have read. The rules of matrix calculus are hard enough to remember, and it does not get better as you add more axes and move up from vectors and matrices to higher order tensors. I have also seen typefaces used to distinguish different kinds of variables: one for scalars, one for vectors, one for matrices, and another for higher order tensors. This is just tedious, it makes “napkin calculations” almost impossible, and it widens the gap between theory and implementation on a computer (so far I have not seen any programming language where the font changes the semantics of your program, and I hope I never will).

Anyway, it does not have to be like this; I argue that the difficulty arises from a lack of notation alone. This post introduces a semi-new notation, which I hope will feel refreshing to anyone who has learned it “the old way”, and intuitive to those who have never derived backprop before.

The notation aims to simplify the manipulation of contractive expressions, i.e. the extension of matrix and vector operations to the higher order structures known as tensors. Deep learning is basically a combination of contractive expressions interleaved with non-linear elementwise function applications and structural operations such as reshaping, permutation, concatenation and stacking. When deriving backprop, it is the former that is the hardest to understand.

Einstein notation

The reason I call the notation “semi-new” is that it is based on a fairly old notation introduced by Einstein in his work on the theory of relativity. The notation I will derive here is not the same, as it aims to solve different problems. When Einstein worked on the theory of relativity, he naturally dealt a lot with differential geometry, where you work a lot with tensors. Yes, tensors are a very old concept in physics, similar to the tensors used in deep learning today in the sense that they are multi-dimensional arrays. But physical tensors are also different, because there are laws (related to change of basis) that they need to satisfy. Naturally, physical tensors tend to be of quite small dimension, as they deal with physical concepts such as spacetime, and not the abstract hyper-dimensional vector spaces that we deal with in deep learning. Einstein’s notation aimed to simplify the writing of repeated summations and the bookkeeping of something called co- and contravariance of a tensor along different dimensions. I will not explain what co- and contravariance mean, except that they relate to how coordinates change with respect to a change of basis, which is important in physics. This property was encoded by using sub- and superscripts for dimensions of the respective type.

The last thing to note about Einstein notation, and probably the most important for our purposes, is that it does not distinguish between the types of the variables. Be they scalars, vectors, matrices or higher order tensors, they are all just tensors of a particular order; there is nothing more to it. The notation focuses on the individual elements of the tensors, and not on the tensors as abstract entities in their own right, which narrows the gap between the napkin calculation and the implementation in source code.

Deep learning notation (for lack of a better name)

Inspired by Einstein, we should aim at removing redundant information from our equations and simplifying bookkeeping for the domain we are working in. As in general relativity, one such redundancy is the summation symbol. Consider an equation like this

$$ y_i =\sum_j x_{ij} $$

How can you compress this notation? Remove the summation symbol!

$$ y_i = x_{ij} 1^j $$

Initially this looks nonsensical. But as long as we know that a repeated index means summation, we can recover the full equation with the summation symbol. In differential geometry, this is known as a contraction between the $x$ tensor and the $1$ tensor (the tensor filled with ones, with dimensions inferred from context). In Einstein notation, $x$ is said to be covariant in $j$, and $1$ is said to be contravariant in $j$.
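As a quick sanity check, this is exactly what a `numpy.einsum` call with an explicit ones vector computes (a minimal sketch with arbitrary shapes of my own choosing):

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)

# y_i = x_{ij} 1^j: contracting against a ones vector sums over j
y = np.einsum("ij,j->i", x, np.ones(3))

assert np.allclose(y, x.sum(axis=1))
```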

We do not care about tensor variance; we only use sub- and superscripts to encode implicit summation. The indices of an expression only make sense in the context of an equation, that is, an expression with a left and a right hand side. An index is implicitly summed over if it does not occur in the left hand side. This means that we can encode the Hadamard product between two vectors $x$ and $y$ as

$$ z_i = x_i y_i $$

which does not perform a summation, as the index variable $i$ occurs in both the left and right hand side. We can describe the inner product as

$$ z = x_i y_i $$

which does perform implicit summation, since $i$ does not occur in the left hand side. Note that in Einstein notation, the expression $x_i x^i$ always denotes a contraction and does not depend on context.

Another operation that is not valid in Einstein notation is contraction over more than two tensors at once. Again, we do not care about that, so the following operation is permitted and makes sense in deep learning notation

$$ w_i = x_i y_i z_i $$

so is this

$$ w_i = x_i y_j z_j = x_i \sum_j y_j z_j $$

and this

$$ w = x_i y_i z_i = \sum_i x_i y_i z_i $$
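These rules map almost one-to-one onto `numpy.einsum`, where the index string after `->` plays the role of the left hand side. A small illustrative sketch (the variable names and shapes are my own):

```python
import numpy as np

x, y, z = np.random.rand(4), np.random.rand(4), np.random.rand(4)

# w_i = x_i y_i z_i: i appears on the left, so no summation (elementwise product)
w1 = np.einsum("i,i,i->i", x, y, z)

# w_i = x_i y_j z_j: j is summed, i is kept
w2 = np.einsum("i,j,j->i", x, y, z)

# w = x_i y_i z_i: i does not appear on the left, so it is summed
w3 = np.einsum("i,i,i->", x, y, z)

assert np.allclose(w1, x * y * z)
assert np.allclose(w2, x * np.sum(y * z))
assert np.allclose(w3, np.sum(x * y * z))
```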

Renaming

When dealing with derivatives of tensorial expressions, you will often end up taking the derivative of a tensor w.r.t. itself at different indices. For example

$$ \grad{A_i}{A_j} = \delta_{ij} $$

where $\delta_{ij}$ is zero everywhere except for indices where $i = j$, also known as the Kronecker delta. When a Kronecker delta appears in a contractive expression, we are allowed to use it to rename symbols. For example

$$ A_i = B_i C_j \delta_{ij} = B_i \sum_j C_j \delta_{ij} = B_i C_i $$

where we have used the definition of a repeated index in the right hand side and deleted all zero terms. Hence, the Kronecker delta can be interpreted as a renaming of index symbols. Note however that symbol renaming cannot be done if both indices occur in the left hand side! For example, the equation

$$ A_{ij} = B_i C_j \delta_{ij} $$

does not represent an implicit summation. This is in contrast to Einstein notation, where the appearance of a delta term always renames symbols, so extra care needs to be taken if you are used to that!
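In numpy, the Kronecker delta can be materialized as an identity matrix, which makes the difference between the two cases explicit (a small illustrative sketch, shapes chosen arbitrarily):

```python
import numpy as np

B, C = np.random.rand(5), np.random.rand(5)
delta = np.eye(5)  # Kronecker delta as an explicit matrix

# A_i = B_i C_j delta_{ij}: j is summed, so the delta renames j -> i
A1 = np.einsum("i,j,ij->i", B, C, delta)
assert np.allclose(A1, B * C)

# A_{ij} = B_i C_j delta_{ij}: both indices are kept, so no renaming happens;
# the result is the outer product masked to its diagonal
A2 = np.einsum("i,j,ij->ij", B, C, delta)
assert np.allclose(A2, np.diag(B * C))
```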

Substitution

Given the equations

$$ y_i = A_{ij} x_j $$

and

$$ x_i = B_{ij} z_j $$

in order to substitute the expression for $x_i$ into the expression for $y_i$, one has to rename the index variable $i$ in the equation for $x_i$ to $j$, but one also has to synthesize a fresh index variable for the $j$ in that equation. This is because the $j$ in the equation for $x_i$ is local to that equation and would clash with the local index variable $j$ in the equation for $y_i$. When synthesizing index variables, you are free to pick any unused name in your local context. For example, substituting $x_i$ into $y_i$, we synthesize the index variable $k$ and perform the renamings $x_i \rightarrow x_j$ and $x_j \rightarrow x_k$

$$ y_i = A_{ij} B_{jk} z_k $$

This is the trickiest part of the notation, and it is no different from Einstein notation, or regular mathematical notation for that matter, but one quickly gets used to it. Of course, it is often easiest to carefully select index variables such that no name collisions occur during a calculation to begin with!
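In code, substitution is just composition of contractions; a small numpy check (the shapes are my own, hypothetical choices):

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
z = np.random.rand(5)

# x_i = B_{ij} z_j
x = np.einsum("ij,j->i", B, z)
# y_i = A_{ij} x_j
y = np.einsum("ij,j->i", A, x)

# After substitution and renaming: y_i = A_{ij} B_{jk} z_k
y_sub = np.einsum("ij,jk,k->i", A, B, z)

assert np.allclose(y, y_sub)
```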

Backpropagation equations

Let us derive backprop for a multilayer perceptron (MLP) with $N$ layers, a special case of a feed-forward network without any branching, operating on vectors. The MLP (without bias terms) can be defined recursively as

$$ Y^L_i = \sigma(Z^L_i) = \sigma(W^L_{ij} X_j^{L}) = \sigma(W^L_{ij} Y_j^{L-1}) $$

where $Y^L_i$ is the output at layer $L$ for some non-linear elementwise function $\sigma$ and weight matrices $W^L$. The pre-activations $Z^L$ are introduced in order to create a computational graph that we can analyze. It is easier to analyze backprop if only a single operation is applied in each step.

Let us focus on one layer $L$ so we can remove that from the notation. This is aligned with the object oriented approach used in deep learning frameworks, where a layer can be programmed in isolation from others. Tensors in upstream layers are denoted by a $+$ superscript, and tensors in downstream layers are denoted with a $-$.

To train such a network, a loss function $E$ is defined. To compute the derivative of $E$ w.r.t. an element of a weight matrix $W_{ij}$ we use the chain rule of calculus

$$ \grad{E}{W_{ij}} = \grad{E}{Z_a} \grad{Z_a}{W_{ij}} = G_a S_{aij} $$

where the upstream gradient $G$ and the local sensitivity $S$ are introduced.

Let us start with the upstream gradient. Several intermediate tensors will be introduced during the calculation, but they will be substituted back and eliminated in the end

$$ \begin{split} G_a &= \grad{E}{Z_a} = \grad{E}{Y_b} \grad{Y_b}{Z_a} = C_b D_{ba} \newline C_a &= \grad{E}{Y_a} = \grad{E}{Z_b^+} \grad{Z_b^+}{Y_a} = G_b^+ F_{ba}^+ \newline D_{ab} &= \grad{Y_a}{Z_b} = \grad{\sigma(Z_a)}{Z_b} = \sigma^\prime(Z_a) \grad{Z_a}{Z_b} = \sigma^\prime(Z_a) \delta_{ab} = H_a \delta_{ab} \newline F_{ab} &= \grad{Z_a}{X_b} = \grad{W_{ac} X_c}{X_b} = W_{ac} \grad{X_c}{X_b} = W_{ac} \delta_{cb} = W_{ab} \end{split} $$

Note that the intermediate tensor $D_{ab}$ contains an unreduced constraint $\delta_{ab}$ (no reduction is performed, since both $a$ and $b$ occur on the left hand side). While these can be implemented as actual tensors in software, they are better treated as symbolic operators that should be “consumed” in contractive expressions for an efficient implementation. We also introduced the backwards propagator $F$, which is key to providing a recursive formulation of backprop. Substituting back into the equation for $G$ gives

$$ G_a = C_b D_{ba} = G_c^+ W_{cb}^+ H_b \delta_{ba} = G_c^+ W_{ca}^+ H_a $$

In index-free matrix-vector notation, this becomes

$$ G = (W_+^T G_+) \odot H $$

which should hopefully be familiar to the reader if you have derived the equation before.

Let us continue with the local sensitivity

$$ S_{aij} = \grad{Z_a}{W_{ij}} = \grad{W_{ab} X_b}{W_{ij}} = X_b \grad{W_{ab}}{W_{ij}} = X_b \delta_{ai} \delta_{bj} = X_j \delta_{ai} $$

which, just like the intermediate tensor $D$, is an unreduced operator. Substituting the expression for $S$ back into the expression for the gradient w.r.t. the weights, we get

$$ \grad{E}{W_{ij}} = G_a S_{aij} = G_a X_j \delta_{ai} = G_i X_j $$

In index-free notation, this becomes the outer product, which is a classical result for the MLP

$$ G X^T $$

which should also be familiar to the reader if you recognize the result for $G$.

Thus, the recursive backprop equations (note the usage of the term, in contrast to algorithm) are

$$ \begin{split} \grad{E}{W_{ij}} &= G_i X_j \newline G_i &= G_j^+ W_{ji}^+ H_i \end{split} $$

The process of deducing the equations is quite mechanical: it is easy to do by hand, and it is also easy to implement on a computer to do automatically!
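As a sanity check, here is a minimal numpy sketch of these equations for a single output layer. The choices of $\sigma = \tanh$, a squared-error loss and the shapes are my own assumptions, and the result is verified against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal(4)            # X = Y^{L-1}, the input to this layer
T = rng.standard_normal(3)            # target, defining E = 0.5 * ||Y - T||^2

def forward(W):
    Z = np.einsum("ij,j->i", W, X)    # Z_i = W_{ij} X_j
    return Z, np.tanh(Z)              # Y_i = sigma(Z_i)

Z, Y = forward(W)
H = 1.0 - np.tanh(Z) ** 2             # H_i = sigma'(Z_i)

# For the last layer, the upstream gradient comes directly from the loss:
# G_i = dE/dZ_i = (Y_i - T_i) H_i. A hidden layer would instead use the
# recursion G_i = G^+_j W^+_{ji} H_i.
G = (Y - T) * H

# dE/dW_{ij} = G_i X_j, the outer product
dW = np.einsum("i,j->ij", G, X)

# Finite-difference check of a single entry
eps = 1e-6
Wp = W.copy()
Wp[1, 2] += eps
_, Yp = forward(Wp)
E, Ep = 0.5 * np.sum((Y - T) ** 2), 0.5 * np.sum((Yp - T) ** 2)
assert abs((Ep - E) / eps - dW[1, 2]) < 1e-3
```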

Strided convolution

Let us step up the game a bit to really convince ourselves of the power of deep learning notation. Consider a convolutional network working with image data. The forward equation for the convolution operator with stride $s$, kernel $K$ and input $X$ is

$$ Z_{ijk} = K_{iabc} X_{a,sj+b,sk+c} = K_{iabc} X_{ap^{jb}q^{kc}} = K_{iabc} X_{apq} $$

where $i$ and $a$ are channel indices and $j$, $k$, $b$ and $c$ are spatial indices. This introduces the indirect indices $p = p^{jb}$ and $q = q^{kc}$, that is, indices that depend on other indices. The notation allows for “hiding” indirect indices after initially recording them, for ease of notation, and then “unhiding” them later when performing reductions.

There are implementation details, such as the padding policy (valid, same, full) and the indexing policy (implicitly 0-based here), that I choose to ignore in order not to lose focus on the notation (but which need to be addressed when implementing the algorithms in software).

Let us focus on this operation only (ignoring any non-linearity applied after the convolution) and derive the local sensitivity and backwards propagator equations

$$ S_{defg}^{ijk} = \grad{Z_{ijk}}{K_{defg}} = \grad{K_{iabc} X_{apq}}{K_{defg}} = X_{apq} \grad{K_{iabc} }{K_{defg}} = X_{ap^{jb}q^{kc}} \delta_{id} \delta_{ae} \delta_{bf} \delta_{cg} = X_{ep^{jf}q^{kg}} \delta_{id} $$

$$ F_{def}^{ijk} = \grad{Z_{ijk}}{X_{def}} = \grad{K_{iabc} X_{a p q }}{X_{def}} = K_{iabc} \grad{X_{apq}} {X_{def}} = K_{iabc} \delta_{ad} \delta_{pe} \delta_{qf} = K_{idbc} \delta_{p^{jb} e} \delta_{q^{kc} f} $$

Now the gradient w.r.t. weights is trivially computed through a contraction against the upstream gradient

$$ \grad{E}{K_{defg}} = G_{ijk} S_{defg}^{ijk} = G_{ijk} X_{ep^{jf}q^{kg}} \delta_{id} = G_{djk} X_{ep^{jf}q^{kg}} = \sum_{j,k} G_{djk} X_{e,sj+f, sk+g} $$

and the gradient is propagated as

$$ \grad{E}{X_{def}} = G_{ijk} F_{def}^{ijk} = G_{ijk} K_{idbc} \delta_{p^{jb} e} \delta_{q^{kc} f} = \sum_{i} \Bigg( \sum_{sj + b = e} \Bigg( \sum_{sk + c = f} G_{ijk} K_{idbc} \Bigg) \Bigg) $$

The last operation is known as “transposed convolution”, since it is effectively the transpose of the convolution operator in matrix form. Observe that Kronecker deltas with indirect indices introduce constrained summations in the backwards pass.
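To connect the formulas to code, here is a loop-based numpy sketch of the forward pass and of the kernel gradient $\grad{E}{K_{defg}} = \sum_{j,k} G_{djk} X_{e,sj+f,sk+g}$, under my own assumptions of 'valid' padding, 0-based indexing and small hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 2                                    # stride
C_out, C_in, R = 2, 3, 3                 # output channels, input channels, kernel size
H = 9                                    # spatial size of the (square) input
K = rng.standard_normal((C_out, C_in, R, R))
X = rng.standard_normal((C_in, H, H))
H_out = (H - R) // s + 1                 # 'valid' padding

# Forward pass: Z_{ijk} = K_{iabc} X_{a, sj+b, sk+c}
Z = np.zeros((C_out, H_out, H_out))
for j in range(H_out):
    for k in range(H_out):
        patch = X[:, s*j:s*j+R, s*k:s*k+R]        # X_{a, sj+b, sk+c}
        Z[:, j, k] = np.einsum("iabc,abc->i", K, patch)

# Kernel gradient for an upstream gradient G_{ijk}:
# dE/dK_{defg} = sum_{j,k} G_{djk} X_{e, sj+f, sk+g}
G = rng.standard_normal(Z.shape)
dK = np.zeros_like(K)
for j in range(H_out):
    for k in range(H_out):
        patch = X[:, s*j:s*j+R, s*k:s*k+R]
        dK += np.einsum("d,efg->defg", G[:, j, k], patch)
```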

Tiled convolution

Tiled convolution is a convolution that “cycles through” different kernels as it moves across the input tensor. It is a compromise between the hard weight sharing induced by the convolution operation and locally (almost fully) connected layers (a.k.a. unshared convolution). The operation is interesting not just as a notational stress test, but also because standard libraries do not have built-in support for it.

Let us begin with the definition of the forward equation

$$ Z_{ijk} = K_{iabcp^jp^k} X_{a q^{jb} q^{kc}} $$

where

$$ \begin{split} p^a &= a \mod T \newline q^{ab} &= a + b \end{split} $$

for some fixed $T$ (i.e. the number of kernel groups to cycle through). Note that the kernel $K$ has 6 dimensions (usually libraries do not support more than 4, excluding the batch dimension). That is a beefy tensor!

First, let us derive the local sensitivity

$$ S_{defghl}^{ijk} = \grad{Z_{ijk}}{K_{defghl}} = X_{a q^{jb} q^{kc}} \delta_{id} \delta_{ae} \delta_{bf} \delta_{cg} \delta_{p^j h} \delta_{p^k l} = X_{e q^{jf} q^{kg}} \delta_{id} \delta_{p^j h} \delta_{p^k l} $$

Contracting with the upstream gradient gives the gradient w.r.t. kernel

$$ \grad{E}{K_{defghl}} = G_{ijk} S_{defghl}^{ijk} = G_{ijk} X_{e q^{jf} q^{kg}} \delta_{id} \delta_{p^j h} \delta_{p^k l} = G_{djk} X_{e q^{jf} q^{kg}} \delta_{p^j h} \delta_{p^k l} = \Bigg( \sum_{j = h \mod T} \Bigg( \sum_{k = l \mod T} G_{d j k} X_{e,j+f,k+g} \Bigg) \Bigg) $$

Secondly, let us derive the backwards propagator

$$ F_{def}^{ijk} = \grad{Z_{ijk}}{X_{def}} = K_{iabcp^jp^k} \grad{X_{a q^{jb} q^{kc}}}{X_{def}} = K_{iabcp^jp^k} \delta_{ad} \delta_{q^{jb} e} \delta_{q^{kc} f} = K_{i d b c p^j p^k} \delta_{q^{jb} e} \delta_{q^{kc} f} $$

which contracted with the upstream gradient gives the propagated gradient

$$ \grad{E}{X_{def}} = G_{ijk} F_{def}^{ijk} = G_{ijk} K_{i d b c p^j p^k} \delta_{q^{jb} e} \delta_{q^{kc} f} = \sum_{i} \Bigg( \sum_{j + b = e} \Bigg( \sum_{k + c = f} G_{ijk} K_{i,d,b,c,(j \mod T),(k \mod T)} \Bigg) \Bigg) $$

Just like with strided convolution, this yields highly interesting patterns in the indexing and summation constraints in the backwards pass.
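Since standard libraries lack the operation, a small loop-based numpy sketch of the tiled forward pass may help make the indirect indices concrete (the shapes, stride 1, 'valid' padding and the value of $T$ are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2                                    # number of kernel groups to cycle through
C_out, C_in, R = 2, 3, 3
H = 8
K = rng.standard_normal((C_out, C_in, R, R, T, T))   # the 6-dimensional kernel
X = rng.standard_normal((C_in, H, H))
H_out = H - R + 1                        # 'valid' padding, stride 1

# Forward pass: Z_{ijk} = K_{i a b c, j mod T, k mod T} X_{a, j+b, k+c}
Z = np.zeros((C_out, H_out, H_out))
for j in range(H_out):
    for k in range(H_out):
        patch = X[:, j:j+R, k:k+R]                    # X_{a, j+b, k+c}
        Z[:, j, k] = np.einsum("iabc,abc->i", K[:, :, :, :, j % T, k % T], patch)
```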

Attention!

As a crescendo to this little notational symphony, let us have a stab at the layer type that has received the most attention recently. The forward equation, with $\sigma$ again an elementwise non-linearity, is

$$ Y_i = A_{ij} V_j = \sigma(Z_{ij}) V_j = \sigma \Bigg( \frac{Q_{ik} K_{jk} }{\sqrt{N}} \Bigg) V_j $$

The backward equations are derived below

$$ \begin{split} \grad{Y_i}{Q_{ab}} &= \grad{A_{ij}}{Q_{ab}} V_j = A_{ij}^\prime V_j \grad{Z_{ij}}{Q_{ab}} = A_{ij}^\prime V_j \frac{K_{jk}}{\sqrt{N}} \delta_{ia} \delta_{kb} = A_{ij}^\prime V_j \frac{K_{jb}}{\sqrt{N}} \delta_{ia} \newline \grad{Y_i}{K_{ab}} &= \grad{A_{ij}}{K_{ab}} V_j = A_{ij}^\prime V_j \grad{Z_{ij}}{K_{ab}} = A_{ij}^\prime V_j \frac{Q_{ik}}{\sqrt{N}} \delta_{ja} \delta_{kb} = A_{ia}^\prime V_a \frac{Q_{ib}}{\sqrt{N}} \newline \grad{Y_i}{V_a} &= A_{ij} \delta_{aj} = A_{ia} \end{split} $$

and we are done. Anticlimactic? Yes. Trivial to derive? Indeed.
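For completeness, here is a minimal numpy sketch of the forward pass together with a finite-difference check of $\grad{Y_i}{V_a} = A_{ia}$; the shapes and the choice of an elementwise sigmoid for $\sigma$ are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 4, 5                                  # sequence length, key dimension
Q = rng.standard_normal((L, N))
Kmat = rng.standard_normal((L, N))           # the key matrix K_{jk}
V = rng.standard_normal(L)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # elementwise sigma, as in the derivation

# Z_{ij} = Q_{ik} K_{jk} / sqrt(N),  A_{ij} = sigma(Z_{ij}),  Y_i = A_{ij} V_j
Z = np.einsum("ik,jk->ij", Q, Kmat) / np.sqrt(N)
A = sigma(Z)
Y = np.einsum("ij,j->i", A, V)

# dY_i/dV_a = A_{ia}: check one column by finite differences
eps = 1e-6
Vp = V.copy()
Vp[2] += eps
Yp = np.einsum("ij,j->i", A, Vp)
assert np.allclose((Yp - Y) / eps, A[:, 2], atol=1e-5)
```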

Discussion

Hopefully you are now convinced that deep learning notation is useful when deriving gradients involving contractive expressions. What else can it be used for? Deep learning software is heavily object oriented today. Layers are programmed in isolation against a common interface (usually something like implementing the forward and backward operations). This is great, and it is what has enabled the fast development and adoption of the software.

However, in order to add a new layer, one needs to implement it with custom C++/CUDA code in one of the major code bases. If one wants to use an optimized layer, one has to rely on proprietary binary code supplied by the hardware manufacturers. In fact, the binaries seem to be getting bigger by the year; recently, when putting together a micro-service, my Docker image blew up to 10 GB. This is a lack of abstraction, and if the human implementation effort needed to maintain these libraries is not motivation enough, maybe the code bloat is.

Imagine if we instead implemented a general-purpose compiler that receives an abstract modeling language based on a common notation and outputs optimized binary code. The language would be declarative instead of procedural, in line with a more general trend in software: you describe what you want to be computed, not how it should be carried out, which is the responsibility of the compiler! New and interesting layers could be quickly prototyped and tested, and the gap between research and engineering would narrow further. The notation laid out in this article might be exactly the common denominator that ties modeling together with implementation and allows for experimentation with new architectures.