Variational Autoencoder: Intuition and Implementation
There are two generative models facing neck to neck in the data generation business right now: Generative Adversarial Nets (GAN) and Variational Autoencoder (VAE). These two models have different take on how the models are trained. GAN is rooted in game theory, its objective is to find the Nash Equilibrium between discriminator net and generator net. On the other hand, VAE is rooted in bayesian inference, i.e. it wants to model the underlying probability distribution of data so that it could sample new data from that distribution.
In this post, we will look at the intuition of VAE model and its implementation in Keras.
VAE: Formulation and Intuition
Suppose we want to generate a data. Good way to do it is first to decide what kind of data we want to generate, then actually generate the data. For example, say, we want to generate an animal. First, we imagine the animal: it must have four legs, and it must be able to swim. Having those criteria, we could then actually generate the animal by sampling from the animal kingdom. Lo and behold, we get Platypus!
From the story above, our imagination is analogous to latent variable. It is often useful to decide the latent variable first in generative models, as latent variable could describe our data. Without latent variable, it is as if we just generate data blindly. And this is the difference between GAN and VAE: VAE uses latent variable, hence it’s an expressive model.
Alright, that fable is great and all, but how do we model that? Well, let’s talk about probability distribution.
Let’s define some notions:
- XX: data that we want to model a.k.a the animal
- zz: latent variable a.k.a our imagination
- P(X)P(X): probability distribution of the data, i.e. that animal kingdom
- P(z)P(z): probability distribution of latent variable, i.e. our brain, the source of our imagination
- P(X|z)P(X|z): distribution of generating data given latent variable, e.g. turning imagination into real animal
Our objective here is to model the data, hence we want to find P(X)P(X). Using the law of probability, we could find it in relation with zz as follows:
P(X)=∫P(X|z)P(z)dzP(X)=∫P(X|z)P(z)dz
that is, we marginalize out zz from the joint probability distribution P(X,z)P(X,z).
Now if only we know P(X,z)P(X,z), or equivalently, P(X|z)P(X|z) and P(z)P(z)…
The idea of VAE is to infer P(z)P(z) using P(z|X)P(z|X). This is make a lot of sense if we think about it: we want to make our latent variable likely under our data. Talking in term of our fable example, we want to limit our imagination only on animal kingdom domain, so we shouldn’t imagine about things like root, leaf, tyre, glass, GPU, refrigerator, doormat, … as it’s unlikely that those things have anything to do with things that come from the animal kingdom. Right?
But the problem is, we have to infer that distribution P(z|X)P(z|X), as we don’t know it yet. In VAE, as it name suggests, we infer P(z|X)P(z|X) using a method called Variational Inference (VI). VI is one of the popular choice of method in bayesian inference, the other one being MCMC method. The main idea of VI is to pose the inference by approach it as an optimization problem. How? By modeling the true distribution P(z|X)P(z|X) using simpler distribution that is easy to evaluate, e.g. Gaussian, and minimize the difference between those two distribution using KL divergence metric, which tells us how difference it is PP and QQ.
Alright, now let’s say we want to infer P(z|X)P(z|X) using Q(z|X)Q(z|X). The KL divergence then formulated as follows:
DKL[Q(z|X)∥P(z|X)]=∑zQ(z|X)logQ(z|X)P(z|X)=E[logQ(z|X)P(z|X)]=E[logQ(z|X)?logP(z|X)]DKL[Q(z|X)‖P(z|X)]=∑zQ(z|X)log?Q(z|X)P(z|X)=E[log?Q(z|X)P(z|X)]=E[log?Q(z|X)?log?P(z|X)]Recall the notations above, there are two things that we haven’t use, namely P(X)P(X), P(X|z)P(X|z), and P(z)P(z). But, with Bayes’ rule, we could make it appear in the equation:
DKL[