Moment generating function--an example with Gaussian distribution
I came across situations where I had to find the distribution of certain variables based on those of others. The approach of derivation through the probability density function (PDF) did not work, but moment generating functions come to the rescue. Here, I present such a situation and show how to apply moment generating functions to draw conclusion on about the distribution of a random variable.
Let’s examine the following two statements:
- $Beta(1, N-1)$ converges to $Exponential(\frac{1}{N})$ when $N\rightarrow \infty$.
- Let $Y_{n \cdot d}$ be a matrix of data that is generated such that $Y_{n \cdot d}=X_{n \cdot k} \cdot W_{k \cdot d} + \epsilon_{n \cdot d}$, and each row of $X_{n \cdot k}$, $\mathbf{x_i} \sim N(\mathbf{0}, \sigma_x^2I_{k})$ is drawn from $N(0,1)$, and each row of $\epsilon$, $\epsilon_i\sim N(0, \sigma_{\epsilon}^2I_{d})$. Then, $Y_{n \cdot d}$ is drawn from $N(0,)$.
In general, my first instinct in proving statements about the equivalence between two distributions is to derive the PDF or CDF of the random variable in question. However, I had a hard time proving them through that route this time. Instead, we have to take an alternative approach: Method of moments. Let’s dive in.
The moment of a random variable
If we have a random variable $X$, the $n$-th moment of $X$ is defined as $E[X^n]$. The first moment is the mean, and the second moment is the variance, those are the ones I care to remember, beyond that, we can call them $n$-th moment. In proving that two distributions are equivalent ($Beta(1, N-1)$ and $Exponential(\frac{1}{N})$ in the first opening statement), or that a particular random variable is from a particular distribution $Y\sim Normal(0, \Sigma_{W,\sigma_{\epsilon}, \sigma_{x}})$ in the second opening statement, we can instead show that their moments are equal. If $E[X^n]=E[Y^n]$ for all $n$, then $X$ and $Y$ are identically distributed, or asympotically equally distributed.
Moment generating function
The moment generating function (MGF) of a random variable $X$ is defined as $M_X(t)=E[e^{tX}]$. Here is why MGF takes this particular definition:
- Back in Calculus II (which, imho, is the most omnipresent form of math, backed here), we learned about the Taylor series to estimate the value of a function $f(x)$ for any value of $x$: $f(x)=f(0)+f’(0)x+\frac{f''(0)}{2!}x^2+\cdots$. At the same time, the beauty in $e^x$ function is that its derivative is itself, $e^x$, and $e^{0}=1$ (so when $f(x)=e^x$, $f’(0)= f''(0)= …= 1$). The Taylor series of $e^x$, therefore, is $e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}\cdots$. It’s beautiful!
- Replacing $x$ with $tX$, we get $e^{tX}=1+tX+\frac{(tX)^2}{2!}+\frac{(tX)^3}{3!}\cdots$. Taking the expectation of this with respect to the random variable $X$, we get $M_X(t)=E[e^{tX}]=1+tE[X]+\frac{t^2E[X^2]}{2!}+\frac{t^3E[X^3]}{3!}\cdots$. This is, again, the Taylor expansion of the definition of MGF for random variable $X$, $M_X(t)$.
- The moment generating function is aptly named because if we take the $n-$the derivative of $M_X(t)$ and plug in $t=0$ to the resulting function, we will get $E[X^n]$. For example, $M_X’(0)=E[X]$, $M_X''(0)=E[X^2]$, and so on.
- Not all distributions have a closed-form MGF. However, for most of the distributions that we usually do not avoid, MGF is conveniently written in its Wikipedia page. And if you are ever curious about how the MGF for your distribution of interest is derived but cannot derive it yourself, peek at the solutions manual.
Example of using MGF for a sensible proof
Let’s prove the second opening statement. Here it is in bullet points:
- $\mathbf{Y_{n \cdot d}}$: a matrix of observed data, each row represents a data point, each column corresponds to a feature.
- $n$ is the number of data points, $d$ is the number of features, and $k$ is the number of latent features.
- Generative model assumption: $\mathbf{Y_{n \cdot d}}=\mathbf{X_{n \cdot k}} \cdot \mathbf{W_{k \cdot d}} + \mathbf{\epsilon_{n \cdot d}}$, where $\mathbf{X_{n \cdot k}}$ is the latent feature matrix, $\mathbf{W_{k \cdot d}}$ is the weight matrix (another name is ‘loadings’), and $\epsilon_{n \cdot d}$ represents noise.
\(\begin{aligned} \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_n \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{pmatrix} \cdot \mathbf{W} + \begin{pmatrix} \mathbf{\epsilon}_1 \\ \mathbf{\epsilon}_2 \\ \vdots \\ \mathbf{\epsilon}_n \end{pmatrix} \end{aligned}\) - Further, we assume that: $\mathbf{x_i} \sim N(\mathbf{0}, \sigma_x^2I_{k})$, and $\mathbf{\epsilon_i} \sim N(\mathbf{0}, \sigma_{\epsilon}^2I_{d})$.
- This, by the way, is the generative model for probabilistic Principal Component Analysis (PPCA), if we have space for another jargon.
The question is: What is the marginal distribution of $\mathbf{Y}$? We can try: $\mathbf{y_i} = \int_{\mathbf{x_i}} \mathbf{x_i}\cdot \mathbf{W}+ \mathbf{\epsilon_i} d{\mathbf{x_i}}$. I was not that good at this maneuver, so instead, we try:
- For the sake of notation simplicity, let’s work through one row of $\mathbf{Y}$ and $\mathbf{X}$ and $\mathbf{\epsilon}$, denoted as $y$ and $x$ $\mathbf{\epsilon}$, respectively, representing one of $N$ independent datapoints. We have $\mathbf{y=x\cdot W + \epsilon}$.
- Now, in a separate scenario for a random $\mathbf{x}$, if $\mathbf{x} \sim Normal(\mathbf{\mu}, \mathbf{\Sigma})$, then $M_{\mathbf{x}}(\mathbf{t}) = exp(\mathbf{t}^T\mathbf{\mu} + \frac{1}{2}\mathbf{t}^T \mathbf{\Sigma} \mathbf{t})$. This is the analytical solution to the MGF of a multivariate normal distribution for $\mathbf{x}$.
- So, when $\mathbf{y=x\cdot W + \epsilon}$, based on the definition of MGF:
Because of the final form of the MGF of $\mathbf{y}$, we can conclude that $\mathbf{y} \sim N(\mathbf{0} , \mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W}) $.
Extending marginal distribution to inference in PPCA
As mentioned above, our assumption $\mathbf{y=x\cdot W + \epsilon}$ is the generative model for PPCA. In PPCA, we care about infering values of $\mathbf{x}$ and $\mathbf{W}$ given $\mathbf{y}$ as observed data. In theory, based on the above proof, we write the data likelihood $P(\mathbf{y})$ as a function of model parameters $\mathbf{W}, \sigma_x^2, \sigma_{\epsilon}^2$. We can, in theory, take the derivative of the data likelihood with respect to parameters to infer the values of parameters, i.e., setting $\frac{\partial P(\mathbf{Y})}{\partial \mathbf{W}}=\mathbf{0}$ and solve for $\mathbf{W}$. In (Lawrence, 2005), the introduction mentions that ‘PPCA and other latent variables, such as factor analysis or independent component analysis, requires a marginalization of the latent variables and optimization of the parameters’. This turns out to be a lot more challenging, I dont think we can actually find a closed form solution for $\mathbf{W}$ this way (I therefore found what is stated in (Lawrence, 2005) misleading). Instead, we use the Expectation-Maximization (EM) algorithm to iteratively infer the values of $\mathbf{X}$ and $\mathbf{W}, \sigma_x^2, \sigma_{\epsilon}^2$ that maximize the likelihood of the observed data. I know of (Chiu, 2020) that shows an explanation and implementation of the EM algorithm for PPCA, specfically applied to genetics data in which the input matrix only contains ${0,1,2}$.
- Lawrence, N. D. (2005). Gaussian process latent variable models for visualisation of high dimensional data.
- Chiu, et al. (2020). Scalable probabilistic PCA for large-scale genetic variation data. Plos Genetics.