Moment generating function: an example with the Gaussian distribution

I came across situations where I had to find the distribution of certain variables based on those of others. The approach of derivation through the probability density function (PDF) did not work, but moment generating functions came to the rescue. Here, I present such a situation and show how to apply moment generating functions to draw conclusions about the distribution of a random variable.

Let’s examine the following two statements:

In general, my first instinct in proving statements about the equivalence between two distributions is to derive the PDF or CDF of the random variable in question. However, I had a hard time proving them through that route this time. Instead, we take an alternative approach: comparing moments by way of the moment generating function. Let’s dive in.

The moment of a random variable

If we have a random variable $X$, the $n$-th moment of $X$ is defined as $E[X^n]$. The first moment is the mean, and the second central moment, $E[(X-E[X])^2]$, is the variance (it coincides with the second moment $E[X^2]$ only when the mean is zero). Those are the ones I care to remember; beyond that, we can simply call them the $n$-th moment. In proving that two distributions are equivalent ($Beta(1, N-1)$ and $Exponential(\frac{1}{N})$ in the first opening statement), or that a particular random variable is from a particular distribution ($Y\sim Normal(0, \Sigma_{W,\sigma_{\epsilon}, \sigma_{x}})$ in the second opening statement), we can instead show that their moments are equal. If $E[X^n]=E[Y^n]$ for all $n$, and the moment generating functions exist in a neighborhood of $0$, then $X$ and $Y$ are identically distributed (or asymptotically so, when the moments only agree in the limit).
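To make the difference between the second moment and the variance concrete, here is a minimal sketch using sample averages. The helper `raw_moment` and the four data points are made up for illustration and are not part of the original argument.

```python
def raw_moment(samples, n):
    """Sample estimate of the n-th raw moment E[X^n]."""
    return sum(x ** n for x in samples) / len(samples)


# Made-up data, purely for illustration.
samples = [1.0, 2.0, 3.0, 4.0]

mean = raw_moment(samples, 1)      # first raw moment, E[X]
second = raw_moment(samples, 2)    # second raw moment, E[X^2]
variance = second - mean ** 2      # second central moment, E[X^2] - E[X]^2

print(mean, second, variance)      # prints: 2.5 7.5 1.25
```

The second raw moment (7.5) and the variance (1.25) differ because the mean is not zero; for the zero-mean Gaussians used later in this post the two coincide.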

Moment generating function

The moment generating function (MGF) of a random variable $X$ is defined as $M_X(t)=E[e^{tX}]$. Here is why MGF takes this particular definition:
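A short sketch of the usual argument, assuming the expectation and the infinite sum can be exchanged for $t$ near $0$: expanding the exponential as a power series gives

\[\begin{aligned} M_X(t) &= E[e^{tX}] = E\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n], \\ \frac{d^n}{dt^n} M_X(t)\Big|_{t=0} &= E[X^n]. \end{aligned}\]

In other words, the $n$-th derivative of the MGF at $t=0$ "generates" the $n$-th moment, so a single function packages all the moments of $X$.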

Example of using MGF for a sensible proof

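Let’s prove the second opening statement. In the notation of the derivation below, the setup is: $\mathbf{x} \sim N(\mathbf{0}, \sigma_x^2\mathbf{I}_k)$ is a $1\times k$ latent vector, $\mathbf{\epsilon} \sim N(\mathbf{0}, \sigma_{\epsilon}^2\mathbf{I}_d)$ is a $1\times d$ noise vector independent of $\mathbf{x}$, $\mathbf{W}$ is a $k\times d$ matrix, and $\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{\epsilon}$.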

The question is: What is the marginal distribution of $\mathbf{Y}$? We can try to integrate out the latent variable directly: $p(\mathbf{y}_i) = \int p(\mathbf{y}_i \mid \mathbf{x}_i)\, p(\mathbf{x}_i)\, d\mathbf{x}_i$. I was not that good at this maneuver, so instead, we try the MGF:

\[\begin{aligned} M_{\mathbf{y}}(\mathbf{t}) &= E_{\mathbf{x}, \mathbf{\epsilon}}[\exp(\mathbf{y}\mathbf{t})] \\ &= E_{\mathbf{x}, \mathbf{\epsilon}}[\exp(\mathbf{x}\mathbf{W}\mathbf{t} + \mathbf{\epsilon}\mathbf{t})]\\ &= E_{\mathbf{x}}[\exp(\underbrace{(\mathbf{x}\mathbf{W})}_{(1\times k)(k\times d)}\underbrace{\mathbf{t}}_{d\times 1})]\, \underbrace{E_{\mathbf{\epsilon}}[\exp(\mathbf{\epsilon}\mathbf{t})]}_{M_\mathbf{\epsilon}(\mathbf{t}), \text{ def.}}\\ &= \underbrace{E_{\mathbf{x}}[\exp(\mathbf{x}\mathbf{W}\mathbf{t})]}_{M_\mathbf{x}(\mathbf{W}\mathbf{t})} M_\mathbf{\epsilon}(\mathbf{t})\\ &\overset{\mathbf{x}\sim N, \mathbf{\epsilon}\sim N}{=} \exp(\mathbf{t}^T\mathbf{W}^T\mathbf{0} + \frac{1}{2}\mathbf{t}^T\mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W}\mathbf{t}) \cdot \exp(\mathbf{t}^T\mathbf{0} + \frac{1}{2}\mathbf{t}^T \sigma_{\epsilon}^2\mathbf{I}_d \mathbf{t})\\ &= \exp(\mathbf{t}^T\mathbf{0}+ \frac{1}{2}\mathbf{t}^T(\mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W} + \sigma_{\epsilon}^2\mathbf{I}_d) \mathbf{t}) \end{aligned}\]

Here $\mathbf{x}$, $\mathbf{\epsilon}$, and $\mathbf{y}$ are row vectors and $\mathbf{t}$ is a $d\times 1$ column vector, so $\mathbf{y}\mathbf{t}$ is a scalar. The third line factors the expectation because $\mathbf{x}$ and $\mathbf{\epsilon}$ are independent.

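Because the final form is the MGF of a zero-mean Gaussian, $\exp(\frac{1}{2}\mathbf{t}^T\mathbf{\Sigma}\mathbf{t})$, and the MGF determines the distribution, we can conclude that $\mathbf{y} \sim N(\mathbf{0}, \mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W} + \sigma_{\epsilon}^2\mathbf{I}_d)$.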
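As a sanity check on the covariance read off from the final line of the MGF derivation, $\mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W} + \sigma_{\epsilon}^2\mathbf{I}_d$, here is a small simulation sketch using only the standard library. The dimensions, the matrix `W`, the standard deviations, and the sample size are arbitrary values chosen for illustration, not taken from any data set.

```python
import random

random.seed(0)

k, d = 2, 3                       # latent and observed dimensions (illustrative)
sigma_x, sigma_eps = 1.5, 0.5     # illustrative standard deviations
W = [[1.0, 0.0, 2.0],
     [0.5, -1.0, 1.0]]            # arbitrary k x d matrix


def sample_y():
    """Draw one y = x W + eps with x ~ N(0, sigma_x^2 I_k), eps ~ N(0, sigma_eps^2 I_d)."""
    x = [random.gauss(0.0, sigma_x) for _ in range(k)]
    eps = [random.gauss(0.0, sigma_eps) for _ in range(d)]
    return [sum(x[i] * W[i][j] for i in range(k)) + eps[j] for j in range(d)]


n = 100_000
ys = [sample_y() for _ in range(n)]

# y has mean zero, so the average outer product estimates its covariance.
empirical = [[sum(y[a] * y[b] for y in ys) / n for b in range(d)]
             for a in range(d)]

# Covariance from the MGF: sigma_x^2 W^T W + sigma_eps^2 I_d.
theoretical = [[sigma_x ** 2 * sum(W[i][a] * W[i][b] for i in range(k))
                + (sigma_eps ** 2 if a == b else 0.0)
                for b in range(d)]
               for a in range(d)]

max_gap = max(abs(empirical[a][b] - theoretical[a][b])
              for a in range(d) for b in range(d))
print("largest entrywise gap:", max_gap)
```

With this many samples the largest gap should be small relative to the entries of the covariance. Leaving the $\sigma_{\epsilon}^2\mathbf{I}_d$ term out of `theoretical` would shift every diagonal entry by $0.25$ in this setup.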

Extending marginal distribution to inference in PPCA

As mentioned above, our assumption $\mathbf{y=x\cdot W + \epsilon}$ is the generative model for PPCA. In PPCA, we care about inferring the values of $\mathbf{x}$ and $\mathbf{W}$ given $\mathbf{y}$ as observed data. In theory, based on the above proof, we can write the data likelihood $P(\mathbf{y})$ as a function of the model parameters $\mathbf{W}, \sigma_x^2, \sigma_{\epsilon}^2$. We could then take the derivative of the data likelihood with respect to the parameters to infer their values, i.e., set $\frac{\partial P(\mathbf{Y})}{\partial \mathbf{W}}=\mathbf{0}$ and solve for $\mathbf{W}$. In (Lawrence, 2005), the introduction mentions that ‘PPCA and other latent variables, such as factor analysis or independent component analysis, requires a marginalization of the latent variables and optimization of the parameters’. This turns out to be a lot more challenging: I don’t think we can actually find a closed-form solution for $\mathbf{W}$ this way (I therefore found what is stated in (Lawrence, 2005) misleading). Instead, we use the Expectation-Maximization (EM) algorithm to iteratively infer the values of $\mathbf{X}$ and $\mathbf{W}, \sigma_x^2, \sigma_{\epsilon}^2$ that maximize the likelihood of the observed data. I know of (Chiu, 2020), which gives an explanation and implementation of the EM algorithm for PPCA, specifically applied to genetics data in which the input matrix only contains $\{0,1,2\}$.
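To make "the data likelihood as a function of the model parameters" concrete, here is a sketch that assumes $n$ independent observations $\mathbf{y}_1,\dots,\mathbf{y}_n$ (row vectors), each drawn from the zero-mean Gaussian whose covariance appears in the last line of the MGF derivation. The symbols $n$, $\mathbf{C}$, and $\mathbf{S}$ are introduced here only for illustration:

\[\begin{aligned} \mathbf{C} &= \mathbf{W}^T \sigma_x^2\mathbf{I}_k \mathbf{W} + \sigma_{\epsilon}^2\mathbf{I}_d, \qquad \mathbf{S} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i^T\mathbf{y}_i, \\ \log P(\mathbf{Y}) &= -\frac{n}{2}\left[d\log(2\pi) + \log|\mathbf{C}| + \mathrm{tr}(\mathbf{C}^{-1}\mathbf{S})\right]. \end{aligned}\]

The parameters enter only through $\mathbf{C}$, inside both a log-determinant and an inverse, which gives a sense of why differentiating with respect to $\mathbf{W}$ and solving directly is not a simple exercise.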

  1. Lawrence, N. D. (2005). Gaussian process latent variable models for visualisation of high dimensional data.
  2. Chiu, et al. (2020). Scalable probabilistic PCA for large-scale genetic variation data. PLOS Genetics.