This lecture is about diffusion, a technique for building generative models.
Diffusion models are behind most of the “AI image generators” whose output populates social media these days…
… and occasionally, the news.
We will first discuss a very simple version of the diffusion process, with all the math stripped away. This will help to illustrate how simple the basic idea is: it can be explained with (almost) no math at all.
Next, to help us build an efficient and robust diffusion system, we will need to use Gaussian noise. This is noise from a Gaussian or normal distribution. To help us build diffusion on top of Gaussian noise, we will need to understand Gaussians very well, so we will spend a little time developing our intuition, and setting up the properties that we need.
Then we will develop Gaussian diffusion, the basic principle behind modern diffusion systems like DALL·E, Midjourney and Stable Diffusion.
This is the basic task of generative modeling: We are given a set of instances. In this case images, but they could also be sentences, bits of music or anything else we can think of. Given this dataset, our job is to produce a model that functions like a black box with a single button: we push the button and out pops an instance that looks like it belongs in the dataset.
It shouldn’t literally come from the data (that would be overfitting), but it should seem to follow the same distribution as the one we got the data from.
Why does generative modeling require all these special tricks to apply deep learning? What makes it so difficult?
The first reason is that it doesn’t come with explicit input/output pairs. Neural networks are functions from an input to an output. We can only start training them when we have an example of what a given output should look like for a given input. In generative modeling we don’t have that. We just want a sample from the data distribution, and we want a new sample every time. We don’t even know what the sample should be, so long as it looks like the data.
To make the sample different each time, we need randomness. In some way, the model should consume random bits, and then it should translate these to an output sample.
If we do this naively, for instance by randomly pairing up one noisy input with an instance in the data and making the model predict one from the other, we get what is called mode collapse. Instead of producing different samples with high variety (for instance, a different digit every time), the model produces the same sample every time: the point that minimizes the average distance to all data points. Instead of producing variety, the model collapses to a single output. The big challenge in generative modeling is avoiding this kind of behavior.
Here is the the basic idea behind diffusion. We start with an image from the data, and we gradually, step by step, add a little noise to it. In this case we do that by picking a few random pixels and setting them to black or white at random. As we do this, the information in the original image gradually disappears, until we are left with an image that consists entirely of noise: each pixel is black or white with 50% probability, and there is no further structure.
We can then train a model to reverse this process. This is essentially a denoising model, something that neural networks have been able to do for decades. Note that we now have input/output pairs for our neural network: for each timestep t, we can give it the image at t+1 and train it to make its output as close as possible to the image at timestep t.
It’s not possible to do this task perfectly, but for many of the noisy pixels, especially at the start of the diffusion process, you can see pretty well what is behind them, so we can at least make an educated guess at most stages of the diffusion process.
We call the image from our data x, and we call the noisy images z1 through zT. We use T to refer to the number of noising steps we apply, so zT is our final image. We set T to some high value, at which we are sure that every pixel has been corrupted at least once and no information from the original image remains.
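To make this concrete, here is a minimal sketch (in NumPy, with made-up image sizes and step counts) of the noising process just described: at every step we pick a few random pixels and set each one to black or white at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_step(img, num_pixels, rng):
    """Pick num_pixels random pixels and set each to black (0) or white (1) at random."""
    out = img.copy()
    h, w = out.shape
    rows = rng.integers(0, h, size=num_pixels)
    cols = rng.integers(0, w, size=num_pixels)
    out[rows, cols] = rng.integers(0, 2, size=num_pixels)
    return out

x = rng.integers(0, 2, size=(28, 28))   # stand-in for one binary image from the data
T = 300                                 # number of noising steps

zs, z = [], x
for t in range(1, T + 1):
    z = noise_step(z, num_pixels=10, rng=rng)
    zs.append(z)                        # z_1 ... z_T
# by z_T, essentially every pixel has been resampled and no structure remains
```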
Then, when the time comes, we essentially fake the process of denoising. For zT we know exactly what the distribution is. We can sample a new zT from this distribution very easily: we just create a new image where we set every pixel to black or white at random. This is the same kind of noisy image we would get if we took a sample from our data and added noise to it bit by bit, except that in this case, there is no actual image “behind” the noise.
However, we still ask our model to “denoise” it. As the model starts to remove parts of the noise, it will begin to hallucinate small details to take the place of this noise. In the next step, these details are treated as if they were really there, and the model hallucinates a little more to replace the next bit of noise it removes, and so on. If everything works, at the end of the denoising process, the model has hallucinated an entire image that looks like it came from our dataset.
Here is the training and sampling process.
During training, we follow the diffusion process, and add noise step by step. At each stage, we then immediately train the model for one step of gradient descent, to predict the previous image from the next one. We do this by running the model and comparing its output ot-1 to the true value for zt-1. For now, we will simply use the squared Euclidean distance between the two.
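A sketch of this training loop under the same binary-noise setup (PyTorch, hypothetical names); the two-layer convolutional `model` is only a placeholder for the image-to-image network discussed next.

```python
import torch
import torch.nn as nn

# a tiny stand-in for the image-to-image network (the real model is the UNet below)
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def noise_step(z, k=10):
    """Set k random pixels to 0 or 1, as in the binary noising process above."""
    z = z.clone()
    idx = torch.randint(0, z.numel(), (k,))
    z.view(-1)[idx] = torch.randint(0, 2, (k,)).float()
    return z

T = 300
x = torch.rand(1, 1, 28, 28).round()             # stand-in for one binary training image

z_prev = x
for t in range(1, T + 1):                        # follow the diffusion process step by step
    z_t = noise_step(z_prev)
    o = model(z_t)                               # prediction o_{t-1} of z_{t-1}
    loss = ((o - z_prev) ** 2).sum()             # squared Euclidean distance
    opt.zero_grad(); loss.backward(); opt.step()
    z_prev = z_t
```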
So, how do we set up the model? We need to map one image to another. This means that we cannot go with fully connected layers if we want to work on high-res images.
For that reason, we want the kind of hourglass shape that we also saw in the variational autoencoder. We use a series of convolutions and pooling operations to gradually work down to a low-resolution, high-dimensional representation, and then we reverse the process with transpose convolutions and upsampling.
This allows the model to learn powerful image-to-image functions. However, it also makes it difficult to learn certain very simple functions. For instance, in this case, the model needs to leave most of the image intact, and only remove a few noisy pixels. To output all the pixels that should stay the same, the model should encode them through all the convolutions and then decode them again. This is a complex function to learn, when all you want is to reproduce the input.
For this purpose, we add residual connections. Before we send the input through the convolutions, we remember it, and then at the end, concatenate it to the input along the channel dimension, and project down to 3 channels with a simple 1x1 convolution.
Note that UNets existed long before diffusion models. This is just the first time we’ve come across them in the course, so we introduce them here.
It helps to tell the model at which point in time it is denoising, so we give it an extra argument t. We usually don’t show this in the notation, but you can assume it’s always there.
It can also help a lot to tell the model where in the image it’s working. That is, the coordinates of the pixel to which a convolution is being applied. We’ve discussed before that convolutions are translation equivariant, which means they necessarily do the same thing regardless of where in the image they are applied. However, many things are very dependent on which part of the image you’re in. For example, if you’re generating portrait photos, you’re much more likely to generate a nose in the center of the image than in the top right.
For both t and the pixel coordinates, we create embeddings. These are then added to the pixels before every convolution and transpose convolution.
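Putting the last few slides together, here is a minimal UNet-style sketch in PyTorch; all names and sizes are made up, and real models are much deeper. It shows the hourglass shape, a skip connection plus a residual connection that concatenates the input back in before a 1x1 convolution, and a learned embedding of t added at the bottleneck. Coordinate embeddings, and injecting the embeddings before every convolution, are omitted for brevity but would work the same way.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal UNet: downsample with convolutions, upsample with transpose
    convolutions, concatenate the input back in (residual connection) before
    a 1x1 convolution, and add a learned embedding of the timestep t."""
    def __init__(self, channels=1, hidden=32, T=1000):
        super().__init__()
        self.t_emb = nn.Embedding(T + 1, hidden)                            # timestep embedding
        self.down1 = nn.Conv2d(channels, hidden, 3, stride=2, padding=1)    # 28 -> 14
        self.down2 = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)      # 14 -> 7
        self.up1   = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)  # 7 -> 14
        self.up2   = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)  # 14 -> 28
        self.out   = nn.Conv2d(hidden + channels, channels, 1)              # 1x1 conv over the concat
        self.act   = nn.ReLU()

    def forward(self, z, t):
        emb = self.t_emb(t)[:, :, None, None]        # (batch, hidden, 1, 1)
        h1 = self.act(self.down1(z))
        h2 = self.act(self.down2(h1) + emb)          # inject t at the bottleneck
        h3 = self.act(self.up1(h2) + h1)             # skip connection from h1
        h4 = self.act(self.up2(h3))
        return self.out(torch.cat([h4, z], dim=1))   # residual connection to the input

model = TinyUNet()
z = torch.randn(4, 1, 28, 28)
t = torch.randint(1, 1000, (4,))
print(model(z, t).shape)                             # torch.Size([4, 1, 28, 28])
```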
Here is the result of a fairly quick experiment without too much tuning.
The final step doesn’t exactly show recognizable digits, but it’s enough to indicate that the method has promise.
What can we improve about this approach?
One issue is that we are feeding examples to our model in sequence. It will first see a bunch of very unnoisy images from the start of the diffusion process, and then a bunch of noisy images. This causes learning to become unstable. For instance, the network might forget what it’s learned from the unnoisy images while it’s learning about the noisy images.
For this reason, it would be better to pick a random t and train the neural network on the images at that point in time. However, in our current setup that would be expensive, since we would have to run the whole diffusion process up to point t, just for one sample.
A second downside is that the model needs to identify which pixels are noise and then do something very different for those than for the other pixels. This is not what convolutions are good at. They are much better at doing the same thing to all pixels in parallel.
Even worse, only some of the corrupted pixels were added in this noising step. Ideally, the model removes those, and leaves the rest as they are. But the model has no way to tell which is which. The best it can do is denoise all pixels a bit.
Before we move on to alternative types of noise, here is one approach to diffusion that solves some of these problems.
Instead of training the network to predict the next step in the denoising process, we train it to fully denoise the image. That is, we always predict x from zt, no matter how far along the noising process we are.
For high t, the image will be so noisy that there isn’t much to predict, and the network will mode-collapse to predicting mostly the data mean, possibly with a very slight amount of variation. For low t, with little noise, it will be very easy for the model to make an almost perfect prediction.
One thing we can now do is add noise one pixel at a time. We’ll keep track of which pixels we’ve already corrupted, and only corrupt pixels we haven’t corrupted before. This means we diffuse for exactly as many steps as we have pixels. This seems inefficient, but that turns out not to be the case.
First, because during training, we can always easily sample zt directly. We just pick a random t between 0 and T, and corrupt t pixels of x chosen randomly (without replacement).
Second, because our sampling algorithm works by denoising and renoising, we don’t need to sample with the same step size we did before. We can just denoise from level t=1000 and renoise at level t=900. This is a way to trade off sampling speed for quality.
The idea is that during sampling, instead of denoising a little bit (from level t to level t-1), we fully denoise from level t, and then add some noise back in at level t-1. This way, the model doesn’t need to worry about modeling the noise (which, in the case of binomial noise, is hard for a UNet to do). It can just predict the fully cleaned image, and we add some noise back using the noising operator we also used during training.
The result is that we start with a very fuzzy prediction: if we feed the model a fully noisy image with no signal, all it can do is predict the data mean. Then, we renoise, hiding most of the image, except some very small details of the fuzzy image. These are different every time, so they may steer the model slightly towards a different outcome. The more noise we remove, the more the predictions serve to add detail, and the more the prediction moves away from the data mean and towards a specific sample.
With that, here is our second diffusion algorithm in full.
The parameter K lets us trade off sampling speed for image quality.
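Here is a sketch of this denoise/renoise sampler for the pixel-at-a-time noise described above (NumPy; `model` is a stand-in for a trained network that predicts the clean image x from zt). The `corrupt` function is also the direct sampler for zt from the earlier slide: corrupt t pixels chosen without replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 28
T = H * W                                        # one noising step per pixel

def corrupt(x, t, rng):
    """Sample z_t directly: corrupt t randomly chosen pixels (without replacement)."""
    z = x.astype(float).reshape(-1)
    idx = rng.choice(z.size, size=t, replace=False)
    z[idx] = rng.integers(0, 2, size=t)
    return z.reshape(H, W)

def sample(model, K, rng):
    """Denoise/renoise sampling: fully denoise, then add back a bit less noise."""
    z = rng.integers(0, 2, size=(H, W)).astype(float)   # pure noise, no image behind it
    for t in range(T, 0, -K):
        x_hat = model(z, t)                      # the model's guess at the clean image x
        z = corrupt(x_hat, t - K, rng) if t - K > 0 else x_hat
    return z

# identity "model", so that the sketch runs end to end
sample(lambda z, t: z, K=50, rng=rng)
```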
This kind of no-frills approach to diffusion is not just a teaching tool. It can actually be used to generate high-quality images. In this paper, the authors show that a simple noise function and a squared error loss (like we used) are often all you need. In fact, you don’t even need noise: you can gradually degrade the image by any means you can think of; even slowly overlaying a picture of a random animal allows the diffusion process to work.
That’s about as far as we can go without mathematics. In the remainder of the lecture, we will show how using Gaussian noise can solve many more of our problems.
Most diffusion models work on the basis of Gaussian noise, that is, noise from a Gaussian or normal distribution. These distributions have many nice properties, which is why they are used a lot in machine learning. It can, however, be a little daunting to work with Gaussians at first.
A big reason for this is that the function for the probability density is complex. This is usually how Gaussians are defined and how their basic properties are proved. We can make things a lot simpler by forgetting about this formula, and defining Gaussians a slightly different way.
We start with the 1D standard normal distribution. That is, a single distribution which has its mean at zero, and which has a very simple shape.
One of the main properties of the Gaussian is that it has a definite scale: it assigns some nonzero probability density to every point between negative infinity and positive infinity, but we can still say with strong certainty that a sample will be in the range [-1, 1], and with extreme certainty that a sample will be in the range [-10, 10]. For instance, because people’s heights are roughly normally distributed, you may see the occasional person who is 2m tall, and one person in history may have come close to 3m, but we will never see a 4m tall person. In short, the normal distribution is why doors are feasible.
To get a distribution like this, we can use exponential decay: for every step of 1 we take away from 0, we multiply the density by a fixed number smaller than 1. The function exp(-|x|) is an example. The normal distribution adds even more decay by squaring x: exp(-x²).
There are various reasons for this square: One is that with the square in there the curve has two inflection points. This is where the derivative flattens out and the curve goes from growing more quickly to growing more slowly (on the left) and from decaying more slowly to decaying more quickly (on the right). Because the peak is in between the two inflection points and that’s where the rate of change slows down, we get a “bulk” of probability mass in that range. In short this is a natural range to consider the scale of the normal distribution.
We multiply the square by 1/2 so that the inflection points land precisely on -1 and 1, which means that for the standard Gaussian, the mean is at 0 and the bulk of the mass is in [-1, 1]. The variance also coincides with the positive inflection point, so it is 1.
To make this a proper probability distribution, the density over the whole domain should integrate to one. This is achieved by integrating the raw function exp(-½x²) and dividing the function by the result. However, to keep the formula simple, we can also just say that p(x) is proportional to exp(-½x²). Whatever we multiply by to normalize is a constant with respect to x, so we won’t need to worry about it. The function becomes even simpler if we only look at the log-probability density: in that case, it’s just a parabola (the normalization constant becomes an additive constant c).
Note that we’re not introducing a mean or a variance (that would just complicate the formula). We’ll just stick with the single standard normal distribution and build on that for the time being.
Next, we define a Gaussian in d dimensions. That is, a distribution on vectors of d elements. Imagine that we sample d numbers xi independently from Ns1 and stick them in a vector x. The resulting distribution on x, we call Nsd (or just Ns if the dimension is clear from context).
That is, the standard Gaussian in d dimensions just consists of d independent standard Gaussians, one along each axis.
What does the density of the standard Gaussian look like? When we sample x, each of its elements is sampled independently, so p(x) = p(x1) p(x2) … p(xd). We’ll look at the log density to make things simpler. The log density of xi under Ns1 is -½xi² plus some constant, so we get a sum of these over all xi. If we collect the squares into a sum, we see that this sum is equal to the squared length of x.
What this tells us is that all x that have the same length have the same (log) density. This means that the contour lines of Ns2 form circles, and in higher dimensions, the contour surfaces form hyperspheres.
For this reason we call this type of distribution spherical (the word isotropic is also used).
With a little more reasoning (which we’ll skip), you can also work out that the smaller circles have higher density, and that the density decays with the diameter of the circle in the same way as it does in the 1D case.
This means that the distribution in 2D roughly looks like a rotated version of the 1D curve.
Next, we define the family of Gaussians. We will do this by assuming that we have a vector s which is drawn from a standard normal Gaussian in d dimensions. We then apply an affine transformation to s (that is, a matrix multiplication and a translation), resulting in the vector x.
Obviously, if we know s, then x is fully determined. But what if I told you I had done this, but didn’t tell you which s I had drawn? What would the distribution p(x) be? We call this distribution a Gaussian. The matrix A and the vector t function (for the time being) as the parameters of this Gaussian.
For our definition, it is not necessary that A is square. However, any Gaussian defined by a non-square A, can also be defined by a square A (possibly with s of a different dimension), so we will limit ourselves mostly to square As in this lecture.
How should we visualize these Gaussians? A simple way is to think of all the points inside a circle under Ns. This circle will be mapped to an ellipse by A and t.
If, say, 68% of the points s fall inside the inner circle, then they will all be mapped to the inner ellipse on the right. This means that 68% of the points x will fall inside the inner ellipse.
In short, the general Gaussian is not spherical.
question: What would A and t need to look like for the distribution on the right to be spherical as well?
Next, we can ask about some of the properties of these Gaussians. For example, the mean.
The mean is defined as the expected value of the distribution. That is, what does the average of a sample converge to for an increasing sample size?
We use two properties in this derivation. First, that the expected value is a linear function. That means that any matrix multiplication and vector addition can be taken out of the expectation in the same way we would take it out of a set of brackets.
The second is that the mean of the standard Gaussian is 0 (that is, the 0-vector). This follows from the fact that the elements of s are independently sampled, so the expected value of s is the vector made up of the expected values of its elements. And the expected value of Ns1 is 0.
The end result is that the mean of a Gaussian (A, t) is equal to t.
Next, the covariance. Again, think of this as a property of any multivariate distribution not (necessarily) a parameter.
The covariance matrix of p(x) collects the variance of all the elements of x along the diagonal and the covariance of any two elements on the off-diagonal elements. This tells us what the covariance matrix of s is: the variances are all 1, because all elements are distributed according to Ns1, and the covariances are all 0, because all elements are independently drawn. In short, the covariance matrix of s is the identity matrix.
We can now work out what the covariance matrix of a Gaussian (A, t) is. We use the definition that the covariance matrix of p(x) is the expected outer product of (x - μ), where μ is the mean of p(x).
In the previous slide, we worked out that μ = t, so if we fill in the definition of x in terms of s, the translation term t and the μ cancel out. Next, we are left with two matrix multiplications. Working these out of the expectation, we are left with E[ssT], which is the covariance of the standard Gaussian, which is I, so it disappears.
The end result is that the covariance of the Gaussian (A, t) is AAT.
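A quick empirical check of both results (a NumPy sketch with arbitrary A and t): sample s from the standard Gaussian, apply x = As + t, and compare the sample mean and covariance against t and AAT.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))                 # the parameters (A, t) of the Gaussian
t = rng.standard_normal(d)

s = rng.standard_normal((1_000_000, d))         # many samples from the standard Gaussian
x = s @ A.T + t                                 # x = A s + t, applied to every sample

print(np.abs(x.mean(axis=0) - t).max())                    # ~ 0: the mean is t
print(np.abs(np.cov(x, rowvar=False) - A @ A.T).max())     # ~ 0: the covariance is A A^T
```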
Because A and t translate so simply into the mean and the covariance (and back), we can now use the mean and covariance as a name for the Gaussian as well (we have a simple method for translating one type of name into the other).
The mean and covariance are, of course, the traditional way of parametrizing the Gaussians. This parametrization has one benefit over the affine transformation: the mean and covariance are a unique name. The affine transformation naming scheme has many names for the same Gaussian.
One thing that is going to be important when we build diffusion models is the subset of Spherical Gaussians. These are Gaussians that retain the property of the standard Gaussian that all the contour lines are circles, spheres or hyper-spheres.
We achieve this if A scales each dimension uniformly (that is by the same amount). In other words, if A is the identity matrix I times some scalar σ.
Following what we derived earlier, this shows that the covariance matrix is I times σ2.
By seeing what happens in the 1D case, we can see that these are natural analogues of the variance and the standard deviation, and we will name them as such. That is, even though a d-dimensional distribution doesn’t have a single variance or standard deviation, in a spherical distribution the variance and standard deviation are the same in all directions, so in this specific case we can talk about a single scalar variance and standard deviation.
To finish up our discussion, we will look at 4 properties of Gaussians that will help us to define the Gaussian diffusion process.
We start with the simplest one. We already know that, by definition, an affine transformation of the standard Gaussian is a Gaussian. If we combine this with the fact that the composition of two affine transformations is another affine transformation, we can show that an affine transformation of any Gaussian (standard or not) yields another Gaussian.
Say that we have a Gaussian G and an affine transformation (A, t). What we want to show is that sampling from G and applying (A, t) results in a Gaussian.
Sampling from G is, by definition, equivalent to sampling from Ns and then transforming by some affine transformation (B,t in the slides). All we are doing then, is applying another affine transformation (A, t) after that. The composition of two affine transformations is another affine transformation, so what we are doing is equivalent to sampling from Ns and then applying a single affine transformation. By our definition, this is a Gaussian.
Here’s a slightly more involved one to prove: if we take samples x and y from two separate Gaussians, their sum is also distributed as a Gaussian.
The way to show this, and to work out what the parameters of the resulting Gaussian are, is to frame this process as an affine transformation. If we fill in the definitions of x and y, we get an affine transformation in two variables s1 and s2. We can rephrase this as an affine transformation in a single variable, by simply concatenating A and B horizontally and s1 and s2 vertically.
Finally, we can see that the concatenation of s1 and s2 is a standard Gaussian vector. This is because we defined a standard Gaussian vector as one whose elements were sampled independently from Ns1. The elements of s1 and s2 were each sampled independently, so their concatenation s must also be standard Gaussian.
The result is that z is an affine transformation of a standard Gaussian, so it must be a Gaussian.
Note, however, that this doesn’t give us as nice a definition as we have found so far: the problem is that we are defining z as an affine transformation of Gaussian noise s, but s has twice as many dimensions as z. In our previous definitions s always had the same number of dimensions as the resulting variable.
The solution is to work out the covariance matrix. This is a unique parametrization of this Gaussian on z, so it must be square.
To get the covariance matrix, we need to work out what we get if we multiply (A B) by its transpose.
The result is AAT + BBT, the sum of the two covariances of the original two distributions.
If we want to translate this back to an affine transformation, we will need to find a square matrix C such that CCT is equal to this expression. Happily, this is always possible (usually in a few ways), using operations like an eigendecomposition or a Cholesky decomposition (note that AAT + BBT is symmetric).
This operation, summing two Gaussian variables, will become especially important in the context of spherical Gaussians.
Applying the rule we derived in the previous slide, we can see how to work out the sum of two Gaussians in the affine transformation notation: we take the two standard deviations (the sigmas without a square), square them, sum them, and then take the square root again.
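In code, both versions of the rule look like this (a NumPy sketch; means are left out of the general case for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# general case: the sum of samples from (A, 0) and (B, 0) has covariance A A^T + B B^T;
# a square C with C C^T equal to that covariance comes from a Cholesky decomposition
cov_sum = A @ A.T + B @ B.T
C = np.linalg.cholesky(cov_sum)
print(np.allclose(C @ C.T, cov_sum))            # True

# spherical case: standard deviations combine as sqrt(sigma1^2 + sigma2^2)
sigma1, sigma2 = 0.5, 1.2
sigma_sum = np.sqrt(sigma1 ** 2 + sigma2 ** 2)  # 1.3
```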
The third property, we’ll call chaining Gaussians. This is when we take a sample from one Gaussian, and make it the mean of another Gaussian.
This happens, for instance, if you take one noisy measurement, and use it in another noisy process. For example, if I measure the length of one plank of wood, and then try to cut another plank of wood to that length, the length of the resulting plank will be a combination of how poorly I measured the first plank and how poorly I made my cut. If both processes are Gaussian, what we’d like to show is that the resulting distribution on the length of the second plank has an acceptable variance, and that the mean of the whole process is equal to the length of the first plank, so that at least on average the planks are the same length.
We can think of this as a convolution of two Gaussians. Each possible measurement x contributes some probability density to the final point y, proportional to how likely x is to be measured. Adding up (or rather, integrating) all these bits of probability density gives us the answer.
Of course, this is not an easy integral to work out, especially if we want to move to higher dimensions. Instead, we will take a geometric perspective again.
Here’s what the picture looks like in N dimensions. Writing the two Gaussians as transformations of standard Gaussian noise, shows how we can solve this puzzle: we just plug the definition of x into the definition of y.
Doing this gives us y as a function of two vectors of standard Gaussian noise s1 and s2. The trick here, is that we can interpret this as the sum of two Gaussians (one with mean t, and one with mean 0).
We’ve seen already how to combine these into a single Gaussian. We sum the means, and the covariances.
The result is that our distribution on y has t as a mean, and the sum of the two covariances for its covariance matrix.
As before, the story becomes much simpler in the spherical case. Here, we just square the two standard deviations, sum them, and take the square root.
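A small numerical check of the chaining rule in the spherical case (NumPy sketch, arbitrary numbers): sampling x from a Gaussian around t and then y from a Gaussian around x gives the same distribution as sampling y directly around t with the combined variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000_000, 2
t = np.array([1.0, -2.0])
sigma1, sigma2 = 0.5, 1.2

x = t + sigma1 * rng.standard_normal((n, d))    # x ~ N(t, sigma1^2 I)
y = x + sigma2 * rng.standard_normal((n, d))    # y ~ N(x, sigma2^2 I), chained

print(y.mean(axis=0))                           # ~ t: the mean of the whole process is t
print(y.std(axis=0))                            # ~ sqrt(sigma1^2 + sigma2^2) = 1.3
```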
The KL divergence, written here as an expectation, expresses how “similar” two probability distributions are. If they are exactly the same, the KL divergence is 0, and the more they differ, the bigger it gets.
For this to lead to a loss function we can use, we need a closed-form expression for the KL divergence. The formula here is no good to us, because the expectation isn’t necessarily computable. We need an expression that we can not only compute quickly, but also backpropagate through (remember, this will be our loss function).
This is our final reason for liking Gaussians: the KL divergence between two Gaussians has a simple closed-form expression.
We will make use of the following useful property: the expected squared distance between a random vector x and a constant vector y, if x is distributed according to a spherical Gaussian, is the squared distance between y and the mean of the Gaussian, plus d times the variance.
Now, to the KL divergence.
We won’t work it out for general Gaussians (although it exists), we will just show a simple case: the KL loss for two spherical Gaussians, in terms of the means. We will assume that our model only controls the mean of one of the Gaussians, so that we can ignore any terms that are constant with respect to that mean.
In the first line, we expand the logarithm over the two Gaussians, and remove any terms that don’t contain μ, which is the parameter we assume we’re optimizing. Then we work the expectations inside as shown.
To line 2, we can apply the property from the last slide, resulting in line 3 as shown. The second term reduces to a constant, and the first to a constant (in μ) plus a constant times the squared distance between μ and ν, which is the quantity we should be minimizing.
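As a takeaway, here is the resulting loss in code (a sketch, assuming both Gaussians are spherical with the same variance σ²): the KL divergence reduces to the squared distance between the means, scaled by 1/(2σ²), which we can compute quickly and backpropagate through.

```python
import torch

def kl_spherical(mu, nu, sigma2):
    """KL( N(mu, sigma2 I) || N(nu, sigma2 I) ): when only the means differ, this is
    the squared distance between them, scaled by 1 / (2 sigma2)."""
    return ((mu - nu) ** 2).sum() / (2 * sigma2)

mu = torch.randn(10, requires_grad=True)        # e.g. the mean our model outputs
nu = torch.randn(10)                            # the target mean
loss = kl_spherical(mu, nu, sigma2=0.1)
loss.backward()                                 # gradients flow into mu
```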
At last, we are ready to start building Gaussian diffusion models.
There are many different ways to do this, and many state-of-the-art models add their own little tweaks and touches to the derivation.
We will focus on this specific approach: the first to show that diffusion models could be trained to do the things we’d seen models like GANs do, like generating realistic human faces. Besides being very powerful, this model is also remarkable for its simplicity. More powerful models have been published since, but this is probably the best in terms of its tradeoff between simplicity and performance, so it’s a good model to use as a foundation for your understanding of diffusion models.
Here is the picture of diffusion models that we’ve built up so far. We add noise to an image, slowly erasing the information in it step by step, and then we train a model to reverse this process.
The key feature that made this work is that the noise we converge to in the diffusion is a known distribution that we can easily sample from.
We also saw that it is a nice property if we can sample the noisy image zt at timestep t directly, rather than explicitly having to go through t steps of diffusion. We will require both these properties from our Gaussian noise process.
Here is what Gaussian noise looks like.
Here is what that looks like in 2 dimensions. We start with a Gaussian that is centered on x (that is, x is its mean). We then sample z1 from that distribution. This is an image that will look like a noisy version of x. We then repeat the process: we create a Gaussian that is centered on z1, sample from that, and so on.
This is, of course, the Gaussian chaining that we talked about in the last part. So we know that our distribution on zt is a Gaussian.
Currently, the variance of this Gaussian grows bigger with every step we take, and the mean stays at x, so we’re not really removing information from x very effectively. We’ll fix that by slightly tweaking the process.
Here’s how we can write down this diffusion process.
What we need is for our process to converge to a single, known Gaussian that we can sample from. This can be any Gaussian, but it would be neatest if it were the standard Gaussian. We’ll see what we need to do to achieve that.
The first thing we need is for the mean to go to 0. We can achieve this with a simple trick. Instead of putting the next Gaussian right on top of the previous sample, we first shrink the sample a bit, by multiplying it by some value, γ, between 0 and 1.
This shrinking creates a kind of force that steadily pulls the mean to 0. It’s reasonable to assume that this will cause the process to converge to a zero-mean distribution, but we should really prove that. Additionally, we should work out what to set the variance to at each step, so that the distribution converges to variance 1.
To work this out, we will write down the first couple of zt, and rewrite them in terms of x. To do this, we can apply the same principle that we used to show the chaining property: we write each z as an affine transformation of the previous z and some Gaussian noise s. We then plug in the definition of the previous z, and work this into a function of x and a single standard noise vector s.
By z3, the pattern (hopefully) becomes clear: the scalar in front of x is just γ to the power of t. The variance is a sum of t terms, where each term is our chosen variance σ² times γ raised to an even power 2i, with i running from 0 to t-1.
With this, we can work out what to set σ to so that the variance of zt converges to 1.
The key is to recognize that the sum on the right-hand side is a standard geometric series in γ², for which a closed-form expression is known. With that, we can rewrite and figure out what σ² should be.
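Spelled out, assuming a single fixed γ (this is a sketch of the algebra, not necessarily the slide’s exact notation):

```latex
\mathrm{Var}(z_t) \;=\; \sigma^2\left(1 + \gamma^2 + \gamma^4 + \dots + \gamma^{2(t-1)}\right)
\;=\; \sigma^2\,\frac{1-\gamma^{2t}}{1-\gamma^2},
\qquad
\text{so choosing } \sigma^2 = 1-\gamma^2 \text{ gives } \mathrm{Var}(z_t) = 1-\gamma^{2t} \;\to\; 1.
```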
In practice, it turns out that it’s better to take big steps at the start of the diffusion process, and smaller steps at the end. This is because denoising is easier if you can see more of the image. That means that at the start of the process, we can ask the model to remove more noise, which we achieve by taking bigger steps there. This means we can take smaller steps at the end, when it’s more difficult for the model to figure out what is “behind” the noise.
We achieve this by setting a noise schedule: we use a different value γt for each timestep, and ensure that these decrease with t.
If all of these are smaller than 1, we can be sure that the process still converges to mean 0, but what should we set the variance to at each step to ensure that we get a standard Gaussian at the end? A lucky guess might be the same value we got earlier: one minus the square of γ, except now we use, at each timestep, the specific γt for that step.
Let’s see if that guess pays off.
We proceed as before, by writing out the definitions of the first few zt, and plugging the previous zt into the definition of the next.
The result is more complex, but not by much.
To get the mean of zt, x is scaled by the product of all the γt’s so far. The variance is once again a sum of t terms: each term i consists of the variance used at time i, times the squares of all the γt’s used after time i.
Finally, we need to show that if we set each variance to one minus the square of the γt used at that step, we get the desired result (a variance of zt that goes to 1).
The key is to realize that if we multiply out the brackets, we get a telescoping sequence. We show it here for t=4, but it works for all t. Each term, before we multiply out the brackets, yields two terms afterwards: one with the squared γt’s from i to t and one with the squared γt’s from i+1 to t. Since i increments by 1 with each term, the first term we create by multiplying out one set of brackets will be equal to the second term we generate by multiplying out the next set of brackets. This is true for every set of brackets, so we end up with only the final 1 and the full sequence of squared γt’s from the first term.
This tells us that the variance of zt can be expressed as one minus the square of the product of all the γt’s so far. That product goes to 0 as t grows, so the variance goes to 1.
So there we have it: a noise step for our diffusion process that generally affects all pixels to some extent, that can be easily sampled at time t and that converges to standard noise as t increases.
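In code, direct sampling of zt is one line (a sketch with a hypothetical, decreasing schedule for the γt’s; it assumes the data x has been scaled so that the standard Gaussian is a sensible end point):

```python
import torch

T = 1000
gammas = torch.linspace(0.9999, 0.98, T)        # hypothetical schedule, decreasing with t
gamma_bar = torch.cumprod(gammas, dim=0)        # product gamma_1 * ... * gamma_t

def sample_zt(x, t):
    """Sample z_t directly from x: mean (gamma_1 ... gamma_t) x,
    variance 1 - (gamma_1 ... gamma_t)^2. Returns z_t and the noise used."""
    g = gamma_bar[t - 1]
    s = torch.randn_like(x)                     # standard Gaussian noise
    return g * x + torch.sqrt(1 - g ** 2) * s, s

x = torch.randn(1, 1, 28, 28)                   # stand-in for one (normalized) image
z_500, _ = sample_zt(x, 500)                    # halfway through the process
z_T, _ = sample_zt(x, T)                        # essentially pure standard noise
```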
The next thing we want is a principled loss. Instead of just appealing to intuition, and minimizing the squared distance between two fairly arbitrarily chosen images, we are going to start with a well-chosen principle, and derive from this a loss that we can optimize.
The principle we choose, just as we did with the VAE, is maximum likelihood: we should try to set the parameters of our UNet so that the probability of our data is as high as possible.
To keep things simple, we will only state the maximum likelihood objective for one instance x. It would be more correct to maximize the likelihood of the whole dataset, or better still, the expected likelihood of one instance sampled from the data distribution, but here we will just do what we always do in deep learning: loop over the individual instances (or batches of instances) and follow the gradient of the loss for these with stochastic gradient descent. In short, this part of the approach is exactly the same for generative models as it is for classification or regression, so we will start with the maximum likelihood objective for a single instance.
Take the negative log-likelihood as a starting point. As we noted before, we put a logarithm in front of a probability (density) to help with the math and the numerical stability, and we put a minus in front because we are looking for a quantity to minimize.
The next question is, what is the probability of x under our model? In this view, the entire denoising chain, from sampling zT all the way down to the final x, is a process that we call p. p is controlled by the parameters of our UNet, which we collectively call θ. Our job is to set θ so that this whole sampling chain is as likely as possible to generate our training, validation and test data.
Note the subtle, but fundamental change in perspective. Before, we started by assuming that there was a diffusion process. Now, we never refer to it in our objective. If we could optimize this objective function directly, we would never need to worry about the existence of a diffusion process at all. All we would need would be the denoising process.
As it is, we cannot easily optimize this function directly, so we bring in the diffusion process, and call it q.
To fit this picture we need to slightly adapt our UNet, or at least our perspective on it. The UNet should now compute the function p(zt-1|zt). That is, instead of giving us a single guess for what picture preceded zt, it gives us a probability distribution on the whole space of possible images.
Practically, this isn’t as involved as it sounds. The simplest way to achieve this is to treat the output of the UNet as the mean of a probability distribution. We will combine this with a variance τt to get a spherical Gaussian on zt-1. This variance τt is a hyperparameter that we schedule manually.
So, here’s the situation: we want to rewrite the negative log probability density of the data into something we can compute and that we can backpropagate through. The functions that we can easily compute are
under p (our model):
the density of zt conditioned on zt+1
the density of zT (the last step in our diffusion process) which we assume to be standard Gaussian noise.
under q (the diffusion process which we will bring in)
the density of zt conditioned on zt-1
the density of zt conditioned on x
In our derivation, we will make use of the properties of continuous expectation, and of Jensen’s inequality. The latter is a general rule about what happens when you apply a concave function to a sum of points, versus summing first and then applying the function. In our case, we need a specific consequence of Jensen’s inequality: for a concave function like the logarithm, moving the logarithm inside an expectation (a sum or integral) can only make the value smaller or equal. Since our loss has a minus sign in front of the logarithm, the quantity we end up with is larger: an upper bound on the loss.
Here is our full derivation. Step by step:
1: We can get to a particular x by many different sequences of denoising steps. If we call this sequence of intermediate pictures ẑ, then the total probability is the probability of x given a particular ẑ, integrated over all possible ẑ. In probability theory jargon: we start with the joint probability of x and the ẑ that generated it, and then we marginalize out ẑ to get our loss.
2: We use the chain rule of probability to decompose the joint probability into the product of all probabilities of zt given its successor. This is a very common approach in sequence modeling. If you’ve never seen it before, have a look at these slides.
3: We feed in the diffusion process. There is not much intuition here, except that we’re allowed to do this because the factor we add in is equal to 1. Note that this probability is slightly different from the one under p, because here x appears only in the conditional. The reason is that we want q to correspond to our diffusion process, in which x is given and we sample ẑ.
4: We move the numerator out in front, and the denominator we decompose in the same way as we did with p. This gives us a factor q(zt | zt-1) for every factor p(zt-1|zt). I recommend trying this with a short sequence with pen and paper, to see that it works out this way.
5: The probability next to the integral shows that we have created an expectation. Specifically, the expected value of some probabilities multiplied together, under our diffusion process. This is relevant, because we can estimate expectations by sampling, and we can easily sample from our diffusion process. In short, this is starting to look like something we can estimate.
6: Just as with the VAE, we use Jensen’s inequality to move the logarithm inside the expectation. This turns the quantity we’re working out into an upper bound of the quantity we actually want to minimize. This is called the evidence lower bound or ELBO (it’s a lower bound if you take the perspective that you’re maximizing the probability of the data).
7, 8: Next, we work out all the log factors separately. In our case, only the terms containing p(zt-1|zt) are affected by the parameters. The rest are constants, so we can ignore them without affecting the gradients we ultimately get.
9: We open up the formula for the Gaussian that our model outputs. Because of the logarithm and our assumption that the variance τt is constant, the only term that is influenced by the parameters is the squared euclidean distance. The multiplicative constant should be included if we’re strict (since it’s different for different t), but it’s often found that the algorithm performs better if we ignore it and tune the learning rate carefully.
If we move the sum outside the expectation, a very simple algorithm presents itself. We will take this sum of expectations and optimize one term at a time. That is, we pick a random t and optimize only the term corresponding to t.
In that case, the expectation over the whole sequence ẑ reduces to the marginal distribution on zt-1 and zt. These we can sample easily to estimate the expectation.
Finally, we just compute the squared Euclidean distance between zt-1 and the model output and backpropagate that.
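A sketch of the resulting training step, reusing the hypothetical `sample_zt` and `gammas` defined in the earlier sketch; `model` is a placeholder for a UNet that predicts zt-1 from zt (and t).

```python
import torch

def training_step(model, optimizer, x, T):
    """One term of the loss: pick a random t, sample (z_{t-1}, z_t) from the
    diffusion process, and minimize the squared distance between the model's
    prediction and z_{t-1}."""
    t = torch.randint(2, T + 1, (1,)).item()          # random timestep
    z_prev, _ = sample_zt(x, t - 1)                   # z_{t-1}, sampled directly from x
    g = gammas[t - 1]                                 # gamma_t for this step
    z_t = g * z_prev + torch.sqrt(1 - g ** 2) * torch.randn_like(z_prev)  # one more noise step
    loss = ((model(z_t, t) - z_prev) ** 2).sum()      # squared Euclidean distance
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```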
This is, of course, what we were doing already. You may ask what the point of all this is, if we just end up where we started, when the original idea was already obvious. One benefit of this derivation, is that it tells us how to adapt the model in a rigorous way. We can, for instance, decide not to ignore the variance of the distribution that the model outputs. Or, we can make that variance an output of the model instead of a hyperparameter. In such cases, all we have to do is follow the derivation and see how the loss function changes as a result.
The version of our algorithm in the last slide was used in early diffusion models. It worked, but it suffered from high variance: that is, the loss differed a lot from one sample to the next. This isn’t surprising: we do a lot of sampling, and everything we sample to estimate some expectation that we want to compute, increases the variance. There is one place where we can replace a sample with a more direct computation.
In the algorithm we have now, we sample zt-1 and then we maximize the probability density of zt-1 under p (given zt). What if, instead, we could work out the probability distribution that q puts on zt-1 given zt and just tell our model to get as close to that distribution as possible? Comparing two probability distributions to each other should lead to less variance than comparing one probability distribution to a sample from another distribution.
The problem is that q(zt-1|zt) is the opposite direction of how we can easily compute q. We can try to flip the conditional around with Bayes’ rule, but it turns out this doesn’t lead to a tractable distribution. To see why, imagine that the data can come from two points x and x’ as shown in the image. What we want to know is the distribution on zt-1 given zt. If we started with x, then this zt-1 is very likely, since the diffusion process mostly follows a straight line towards the origin. However, if we start with x’, then all the probability mass is on the line between x’ and the origin, and this zt-1 is very unlikely.
The takeaway is that the distribution q(zt-1|zt) is highly dependent on the shape of the distribution on x. This distribution is first of all highly complex, and second of all unknown (it’s what we’re trying to learn).
However, what we also see is that if we condition on which x we started with, the resulting distribution q(zt-1|zt, x) is actually quite simple. In fact, with a little analysis, it turns out that this distribution is a Gaussian.
The derivation itself is a little complex so we’ll skip it here, but these are the parameters of the resulting Gaussian.
Note that the mean looks like a random value itself, but that isn’t what’s happening here. What we have done is to remove x from the formula by remembering the specific Gaussian noise st that we sampled when we generated zt. So, what we are looking at is just a function of known values. In other words, this formula doesn’t work for general zt and x, but if we generated zt from x (as we do when sampling from our diffusion process), then we can use it, provided we remember the specific Gaussian noise we used.
So now we have a distribution to compare our model’s output distribution to, not just a single point. But how do we work this into the loss? We wanted to work from first principles, so we can’t just start comparing these two distributions arbitrarily. We will have to go back to our derivation, and take a slightly different route.
Here is the derivation.
1: In the first line, we have already followed some steps from the original derivation, including marginalizing in ẑ, working in q, and turning the integral into an expectation. Note that we have not applied Jensen’s inequality yet (the logarithm is still outside the expectation).
2: We note that q(zt|zt-1) is equal to q(zt|zt-1, x). This is because zt is independent of x when conditioned on zt-1. Put simply, if we already know zt-1, then also knowing x doesn’t give us any extra information, so adding x to the conditional doesn’t change anything. It will, however, change things when we use Bayes’ rule to flip the conditional around.
3: We separate the first factor from our product. This will help us later on.
4: Next, we flip the conditional in the denominator of our product using Bayes’ rule. We flip around zt and zt-1, and keep x in the conditional for all factors.
5: Note that this extra factor that we got from Bayes’ rule “telescopes” over the factors. That is, every z that appears in the numerator as zt-1 will appear in the denominator as zt in the next factor. All of these cancel out against each other, except the first and the last, which we take out of the product.
6: We cancel the two q(z1|x)’s.
7: We apply Jensen’s and separate out three terms for our loss function as shown.
8: We now recognize that almost all terms are KL divergences. That is, they naturally express the similarity between two Gaussian distributions. Only the first term is a log-probability of x given the final image in our denoising process z1.
Looking at this loss function, let’s see what we have. The middle term doesn’t have any parameters, so we can ignore it. The first term is an odd duck. What Ho et al do is to train a separate model for this part, which tells us how to generate x given z1. Note that z1 is almost completely denoised, so this model has an easy job.
For the final term (or rather T-2 terms), we can fill in what we worked out earlier about the KL divergence for two spherical Gaussians: as a function of the means the KL divergence is just the (scaled) squared euclidean distance plus some constants that don’t affect our gradients.
Ho et al find that they get better results by ignoring the scaling factor, but other authors prefer to include it. We’ll ignore it for the sake of simplicity and just minimize the squared Euclidean distance as we did already for the naive diffusion.
What has changed from the previous algorithm? There we also took the output of the model and computed a squared distance. The difference is in what we compute the distance to. Previously, that was the sample zt-1. Now, we compute the distance to the mean of the distribution on zt-1. By not sampling zt-1, but looking instead at the mean of its distribution, we reduce the variance of our loss.
We are almost finished, but Ho et al have one last trick up their sleeve. By rewriting a little bit further we can have the model predict the noise that has been added to our image instead of the partially denoised image.
First we fill in our formula for computing the mean on zt-1.
Next, we change our model output from a prediction of this mean, to a prediction of the noise st used to generate this mean. From that, we compute the mean by the same formula as we used in the other term.
The factor in front of the noises is the same on both sides, so we can take it out of the loss, and absorb it into the learning rate.
With that, here is our second Gaussian diffusion algorithm, which is exactly equal to the one used by Ho et al.
During training, we pick a random point t and sample zt using some random standard noise st. We then have the model predict st.
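In code, this training step is very compact (again a sketch, reusing the hypothetical `sample_zt` from before; the model now outputs a prediction of the noise st):

```python
import torch

def noise_prediction_step(model, optimizer, x, T):
    """Ho et al.-style training step: sample z_t directly, then train the model
    to predict the standard noise s_t that was used to generate it."""
    t = torch.randint(1, T + 1, (1,)).item()
    z_t, s_t = sample_zt(x, t)                        # z_t, plus the noise that produced it
    loss = ((model(z_t, t) - s_t) ** 2).sum()         # predict the noise, squared error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```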
During sampling, we sample zt-1 from the distribution q(zt-1 | zt, x) that we worked out earlier. You might think that it’s a problem that we don’t have x in this case, but that’s where our model comes in. We removed the dependence on x by rewriting the mean as a function of st, and this is a value that our model can predict.
The result is a kind of combination of the two algorithms we saw in the first part. There we had two options: we either predict zt-1 from zt directly, or we predict x from zt, and we denoise and renoise during sampling. Here, we do neither. We instead predict the noise vector st that describes zt as a function of x. From this we can compute either our best bet for the fully denoised x, or for zt-1. In short, the first and second algorithms from the first part of the lecture become the same algorithm, but only if we use Gaussian noise and we predict st.
That should give you a basic idea of how Gaussian diffusion works. There are many variations on this idea, and all of them require you to follow this derivation and change something here or there, or rewrite something in a slightly different way. Understanding all these models requires understanding this derivation in detail.
Some important things you may want to look into if you want to continue with diffusion models are:
Ways to make the sampling more efficient. Ho et al required 1000 steps during sampling, which means 1000 forward passes to generate one batch of images. This puts diffusion models at a substantial disadvantage to VAEs and GANs which can generate images with a single forward pass. Several methods exist which allow us to reduce the number of sampling steps required (possibly at the cost of a slight drop in quality).
If your sampling algorithm doesn’t require you to visit all time steps in the diffusion process, then you can make those steps as small as possible. We saw this idea in part 1 in the denoising/renoising algorithm. In the limit of infinitely small time steps, we get a continuous value t. Theoretically, the model starts to look like a differential equation, but practically, not much changes.
Finally, the most impressive uses of diffusion models are those that condition the generation on something, in particular on a text description of what the image should contain. Practically, this is as simple as feeding the text description to an LLM to get an embedding and then to add that embedding to the inputs of the model. However, it turns out that it’s good to have a way to control how much the model tries to make the image fit the data distribution, and how much it tries to make it fit the description. This is usually done with a method called classifier-free guidance.