notes on diffusion models

consider these resources i found through will’s book :

some questions i have before reading those diffusion papers :

what does diffusion actually solve that transformers can’t? what do these models actually predict? how do they generalize so well to images from just text prompts? how is this diffusion related to the physical diffusion process? what does it mean that they are inspired by non-equilibrium thermodynamics? why so much focus on markov chains, quasi-static tricks, Z, and conditional / posterior probabilities? will add more to this.

in this one we go into deep unsupervised learning using non-equilibrium thermodynamics. idk where to begin so let’s begin by initializing some params.

q(x): probability distribution of real data (e.g. all possible cat images). the goal is a model p_θ(x) that matches it.
θ: parameters of the neural network.

overview: start with a clean image $x_0$. add noise in $T$ steps until it becomes $x_T$, basically pure noise with no structure left.

the model does not jump directly from $x_T$ to $x_0$. instead it predicts the immediately preceding step $x_{t-1}$.

adding random noise does not destroy structure immediately. add 10% noise to a dog image and the dog-ness is still there. noise is patternless; data has structure.

the model learns to separate random noise from structured signal (eyes come in pairs, edges align, textures repeat).

the model takes $x_t$ and timestep $t$ and predicts a gaussian distribution for $x_{t-1}$. since we constructed the forward process ourselves, we know the true $x_{t-1}$, and the forward posterior $q(x_{t-1} \mid x_t, x_0)$ is itself a known gaussian. we compare the predicted and true distributions using KL divergence.
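
since both sides of that comparison are gaussians, the KL has a closed form. a minimal sketch in plain python, assuming diagonal covariances (one variance per pixel); the name `gaussian_kl` is just made up for this note:

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal gaussians, summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

# identical distributions give zero; shifting the mean makes the KL grow
same = gaussian_kl([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

here `q` would be the true step and `p` the predicted one; the loss goes to zero only when the predicted mean and variance match the true ones.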

thermodynamics intuition: in physics, equilibrium means max entropy. the forward process intentionally destroys structure until we reach a simple Gaussian.

diffusion models learn to run the process backward, pulling the system from high entropy back into structure. impossible in the real world, learnable with gradients.

markov chains: in the forward chain $q$, each state depends only on the previous one. the model learns a second markov chain $p_\theta$ that reverses each step.

forward never comes back; reverse explicitly learns how to undo each transition.

forward diffusion process: $q(x_0)$ is the real data distribution. we sample $x_0$ from it and apply a gaussian diffusion process.

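each step adds small gaussian noise. if steps are small enough (quasi-static), the reverse step has the same functional form as the forward one, so it is (approximately) gaussian too. the approximation gets better as the steps get smaller.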

but the true reverse distribution $q(x_{t-1} \mid x_t)$ depends on the whole data distribution $q(x_0)$, which is intractable. instead, the network predicts only the mean and covariance of each reverse step.

why is it still complex if it’s gaussian? because mean and covariance depend on the input image. dog-like noise should point toward dogs, house-like noise toward houses.

learning that function is the hard part.

the network is a function of $x_t$ and $t$. early timesteps have low noise; later ones have high noise.

$$ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}) $$

the full trajectory probability is just the product of each step.
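
written out (this is just the markov property applied to the step above, plus the matching factorization for the learned reverse chain):

$$ q(\mathbf{x}_{0:T}) = q(\mathbf{x}_0) \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) $$

$$ p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) $$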

intuition for $\beta_t$: step size of randomness. $1-\beta_t$ is signal preserved.
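
a quick check of why the scaling is $\sqrt{1-\beta_t}$ and not something else, assuming each pixel of $x_{t-1}$ has been normalized to unit variance (my assumption, not stated above):

$$ \mathrm{Var}(x_t) = (1-\beta_t)\,\mathrm{Var}(x_{t-1}) + \beta_t = (1-\beta_t) + \beta_t = 1 $$

so the total variance stays fixed: whatever signal is removed is replaced by exactly that much noise.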

$q(x_0)$: complex multi-modal data distribution
$q(x_t)$: noisier but still structured distribution
$q(x_t \mid x_{t-1})$: simple gaussian rule

we tell the network: look at $x_t$ and learn the inverse of the gaussian step that created it.

each pixel is scaled by $\sqrt{1-\beta_t}$, noise is sampled per-pixel and scaled by $\sqrt{\beta_t}$.
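
one forward step can be sketched like this. plain python on a flattened list of pixels, no numpy, just to show the arithmetic; `forward_step` and the toy pixel values are made up for illustration:

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), pixel by pixel."""
    keep = math.sqrt(1.0 - beta_t)   # how much of the signal survives
    spread = math.sqrt(beta_t)       # standard deviation of the fresh noise
    return [keep * px + spread * rng.gauss(0.0, 1.0) for px in x_prev]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # stand-in for a flattened image scaled to [-1, 1]
x1 = forward_step(x0, beta_t=0.02, rng=rng)
```

with `beta_t=0.0` the step returns the input unchanged, and with `beta_t=1.0` the input is wiped out and only noise remains.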

tractable means solvable. quasi-static diffusion turns an impossible integral into predicting two quantities per step: a mean and a covariance.

$$ \alpha_t = 1 - \beta_t \quad ; \quad \bar{\alpha}_t = \prod_{i=1}^t \alpha_i $$

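$$ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon} $$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, so we can jump to any $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without simulating every step in between.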
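
where the closed form comes from, sketched for two steps (the $\boldsymbol{\epsilon}$'s are independent standard gaussians, and $\bar{\boldsymbol{\epsilon}}$ is my name for the merged noise):

$$ \mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\epsilon}_t $$

$$ \mathbf{x}_t = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\boldsymbol{\epsilon}_{t-1}\right) + \sqrt{1-\alpha_t}\boldsymbol{\epsilon}_t $$

$$ \mathbf{x}_t = \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\bar{\boldsymbol{\epsilon}} $$

the last line works because independent gaussian variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$. repeating this all the way down to $\mathbf{x}_0$ gives the product $\bar{\alpha}_t$.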

usually:

$$ \beta_1 < \beta_2 < \dots < \beta_T \quad ; \quad \bar{\alpha}_1 > \dots > \bar{\alpha}_T $$
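
a small sketch of a schedule and the one-shot noising that goes with it. the linear schedule and the numbers (`T=1000`, betas from `1e-4` to `0.02`) are illustrative choices, not something fixed by these notes, and the function names are made up:

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (an assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = product of (1 - beta_i) for all i up to t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def q_sample(x0, alpha_bar_t, rng):
    """Jump straight to x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * px + math.sqrt(1.0 - alpha_bar_t) * e
           for px, e in zip(x0, eps)]
    return x_t, eps

betas = linear_betas(T=1000)
alpha_bars = cumulative_alpha_bar(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                     # toy flattened image
x_mid, _ = q_sample(x0, alpha_bars[499], rng)   # halfway: signal plus noise
x_end, _ = q_sample(x0, alpha_bars[-1], rng)    # end: almost all noise
```

because `alpha_bars` shrinks toward zero, the last sample has almost no trace of `x0` left, which is what lets the reverse process start from plain gaussian noise.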

stochastic gradient langevin dynamics — need to read more 💀

reverse process starts from pure gaussian noise, $x_T \sim \mathcal{N}(0, I)$. for reference, the 1d gaussian density is:

$$ p(x;\mu,\sigma^2)= \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
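
a rough sketch of what "start from noise and walk back" looks like as a loop. there is no trained network here: `toy_reverse_params` is a hypothetical stand-in that uses the one case where the reverse step is known exactly, namely when the data itself is $\mathcal{N}(0, 1)$, so that $q(x_{t-1} \mid x_t) = \mathcal{N}(\sqrt{1-\beta_t}\,x_t, \beta_t)$. a real model would replace it with a learned function of $x_t$ and $t$.

```python
import math
import random

def toy_reverse_params(x_t, beta_t):
    """Stand-in for the network: exact reverse mean/variance for N(0, 1) data."""
    mean = [math.sqrt(1.0 - beta_t) * v for v in x_t]
    var = [beta_t for _ in x_t]
    return mean, var

def reverse_sample(betas, dim, rng, predict=toy_reverse_params):
    """Draw x_T ~ N(0, I), then sample x_{t-1} from the predicted gaussian each step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for beta_t in reversed(betas):            # t = T, T-1, ..., 1
        mean, var = predict(x, beta_t)
        x = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]
    return x

betas = [0.01 * (i + 1) for i in range(10)]   # small increasing toy schedule
sample = reverse_sample(betas, dim=4, rng=random.Random(0))
```

the loop structure is the point: every step only needs a mean and a variance, and all the hard work hides inside whatever plays the role of `predict`.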