When we generalize the idea behind PCA as a generative model, we call the projection and reconstruction procedures encoding and decoding, respectively.

The encoder will be a function (neural net) $E$ that takes a data point $x$ as input and outputs a low-dimensional code $z = E(x)$ that is the intermediate representation of $x$.

The decoder will be a function (neural net) $D$ that takes the code $z$ and returns an output $\tilde{x} = D(z)$ that is as similar as possible to the $x$ that generated $z$.

An architecture like this, in which the encoder and decoder put together produce an output that is as similar as possible to the input, is called an autoencoder. (If this doesn't happen, then it's just an encoder-decoder architecture.)
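As a concrete reference, here is a minimal sketch of such an architecture in PyTorch; the layer sizes and the names `input_dim` and `code_dim` are arbitrary choices for illustration, not part of the original notes:

```python
import torch.nn as nn

input_dim, code_dim = 784, 32          # example sizes (e.g. flattened 28x28 images)

# encoder E: data point x -> low-dimensional code z
encoder = nn.Sequential(
    nn.Linear(input_dim, 128),
    nn.ReLU(),
    nn.Linear(128, code_dim),
)

# decoder D: code z -> reconstruction that should resemble x
decoder = nn.Sequential(
    nn.Linear(code_dim, 128),
    nn.ReLU(),
    nn.Linear(128, input_dim),
)

def autoencode(x):
    """Composition D(E(x)): should reproduce the input."""
    return decoder(encoder(x))
```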


In order to force the autoencoder to learn an encoder and a decoder such that the output vector is very similar to the input, we have to enforce this in the loss function:

$$\mathcal{L} \;=\; \sum_{i} \big\| x_i - D(E(x_i)) \big\|$$

The type of norm depends on the data: if the data lives in Euclidean space, then we can use the $\ell_2$ norm.
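For Euclidean data this amounts to a squared-error reconstruction loss. A minimal sketch, reusing the `encoder`/`decoder` modules from the sketch above:

```python
import torch

def reconstruction_loss(x, x_rec):
    # squared L2 norm between input and reconstruction, averaged over the batch
    return ((x - x_rec) ** 2).sum(dim=1).mean()

# usage with the encoder/decoder sketched earlier:
# x = torch.randn(64, input_dim)
# loss = reconstruction_loss(x, decoder(encoder(x)))
# loss.backward()
```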

Importance of the bottleneck

The bottleneck in the architecture, meaning the fact that the code dimension $k$ is smaller than the input dimension $d$ (i.e. $z \in \mathbb{R}^k$ and $x \in \mathbb{R}^d$ with $k < d$), is important in order to avoid trivial solutions:

  • If $k = d$, then the code can simply be a copy of the input, and so we risk that both the encoder and the decoder learn the identity function (meaning that $D(E(x)) = x$ without learning any useful representation; see the toy illustration after this list).
  • If $k > d$, then we risk that the encoder will just create the code as $x$ concatenated with as many extra values as needed to reach the $k$ dimensions, which the decoder can simply discard.
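A toy illustration of the trivial solution from the first point, purely for intuition; the identity map here stands in for what a sufficiently large encoder/decoder pair could learn when $k \geq d$:

```python
import torch

x = torch.randn(8, 16)            # batch of 8 inputs with d = 16
encode = decode = lambda t: t     # identity maps: possible whenever k >= d

x_rec = decode(encode(x))
print(torch.norm(x - x_rec))      # tensor(0.): perfect reconstruction, nothing learned
```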

Variational Autoencoders (VAE)

When dealing with simple autoencoders, some regions of the space in which the codes are immersed can have blanks, and so when we sample from such a blank and then decode the result, we won't obtain an image that looks natural. This is because there aren't enough training codes near that point.


The idea to resolve this problem is to force the codes to be packed within a smaller area, and so to be dense, in order to avoid blanks.

Variational autoencoders are a type of autoencoder that allows us to do that; in particular, they enforce a distribution on the latent space, meaning that the encoder will produce codes that can be seen as samples from that probability distribution. In other words, if we sample a code from the latent space, the sampling should respect the probability distribution that we enforced.

The distribution is chosen a priori, and most of the time it’ll be a Gaussian.
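This is what makes a trained VAE usable as a generative model: since the codes are forced to follow the chosen prior, we can sample directly from that prior and decode. A hypothetical sketch of the mechanics (the decoder here is untrained and its sizes are arbitrary):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784      # example sizes

# stand-in decoder; in practice its weights come from training the VAE
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)    # 16 samples from the Gaussian prior N(0, I)
generated = decoder(z)             # decoded samples, shape (16, 784)
```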

## Variational inference

In our scenario, $x$ is a given data point and $z$ is a latent code.

In a variational autoencoder architecture, the encoder takes a data point $x$ as input and returns $p(z \mid x)$, the probability distribution over all possible latent codes.

The problem here is that we cannot compute that exactly, because by Bayes' rule

$$p(z \mid x) \;=\; \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) \;=\; \int p(x \mid z)\, p(z)\, dz,$$

and computing $p(x)$ requires integrating over the entire latent space, which we cannot do. So the problem is intractable.

What we can do is compute an approximation:

$$q_\phi(z \mid x) \;\approx\; p(z \mid x),$$

where $q_\phi$ is a neural net with weights $\phi$. In order to let the neural net learn the best parameters $\phi$, we have to express in the loss function the fact that we want the two distributions to be as close as possible, by minimizing the KL divergence $D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$.

This KL divergence contains the intractable term $p(x)$, so there is still a problem. It is solvable by noticing that the divergence can be rearranged into

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),$$

where the left part is always greater than or equal to the right part (the gap between them is exactly the intractable KL term, which is non-negative), and so maximizing only the right part, which is tractable, approximates maximizing the whole thing.

The right part takes the name of Evidence Lower BOund, or ELBO.

We are interested in finding:

$$\theta^*, \phi^* \;=\; \arg\max_{\theta, \phi} \;\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

where $\theta$ are the weights of the decoder.

The first part of this equation is simply the reconstruction loss, and we can also substitute it with the loss we’ve seen before.

The second term is the KL divergence $D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$, and it is needed in order to force the encoder distribution $q_\phi(z \mid x)$ to be as similar as possible to the prior $p(z)$; this is the regularizer.

Since we want to maximise that, it's the same as minimizing its negative, so the loss is just:

$$\mathcal{L}(\theta, \phi) \;=\; -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;+\; D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

Remember that the prior distribution is chosen a priori, so for example we choose $p(z) = \mathcal{N}(0, I)$, and so we can forget about its parameters, because this distribution has no free parameters.

The probabilistic encoder also generates a Gaussian distribution (if we choose the Normal distribution) with some mean $\mu(x)$ and standard deviation $\sigma(x)$, different for each data point $x$ (of course the output also changes according to the neural net parameters $\phi$).

The difference between a probabilistic encoder and a simple encoder is that, instead of outputting a code $z$ directly, it outputs the mean $\mu(x)$ and the standard deviation $\sigma(x)$, from which we can sample a different $z$ each time.
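A minimal sketch of such a probabilistic encoder; outputting the log-variance instead of $\sigma$ directly is a common choice that keeps $\sigma$ positive, and the layer sizes here are arbitrary:

```python
import torch
import torch.nn as nn

class ProbabilisticEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)        # mean of q(z|x)
        self.log_var = nn.Linear(128, latent_dim)   # log sigma^2 of q(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

# for each x we get a distribution, from which a different z can be sampled each time
enc = ProbabilisticEncoder()
mu, log_var = enc(torch.randn(4, 784))
sigma = torch.exp(0.5 * log_var)
z = torch.distributions.Normal(mu, sigma).sample()   # shape (4, 2)
```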

The Gaussian is most of the time the best choice, also because with it the $D_{KL}$ term has a closed form.
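For $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and $p(z) = \mathcal{N}(0, I)$ that closed form is $D_{KL} = -\tfrac{1}{2}\sum_j \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$, so the full loss can be sketched as follows (using the squared-error reconstruction term from before as a stand-in for the negative log-likelihood):

```python
import torch

def vae_loss(x, x_rec, mu, log_var):
    # reconstruction term: squared L2 norm, summed over dimensions
    rec = ((x - x_rec) ** 2).sum(dim=1)
    # closed-form KL between N(mu, sigma^2 I) and the prior N(0, I)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)
    return (rec + kl).mean()
```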

Regularizing effect

Reparametrization trick