Characteristic function: an example use case

Why is it that for heavy-tailed distributions, the higher moments do not exist or do not converge? What is the characteristic function of a distribution? Why is it useful when we already have the PDF and MGF to define a distribution?

The starting point: A very practical question

The inspiration for this blog post is the paper Tam and Engelhardt, 2025 (Tam, 2025), from my postdoc lab under Dr. Barbara Engelhardt. The paper addresses a highly relevant question in current machine learning research: given the rise of foundation models that generate texts, images, etc., are there theoretically sound ways to evaluate whether data generated by these models are distributed similarly to real data? In plainer language: are the texts/images generated by GPT-/DALL-E-type models similar to the texts/images that we have been producing for years? In more technical language: are the distributions of the data generated by the foundation models similar to the distributions of the real data?

Now, if our data are simply numerical values drawn from well-defined distributions such as $Normal(\mu, \sigma)$, $Beta(\alpha, \beta)$, etc., the two distributions can be compared through, most popularly, the KL divergence, for which there are analytical formulas if we make assumptions about the distributions. However, when the data become high-dimensional, or even variable-dimensional as in the case of text data, KL divergence, or any other traditional measure of distance between distributions, becomes unreliable.
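As a small illustration of such an analytical formula, here is a minimal sketch of the closed-form KL divergence between two univariate normal distributions. The function name `kl_normal` is hypothetical, and the sketch assumes that $\sigma$ denotes the standard deviation (not the variance).

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for P = Normal(mu_p, sigma_p) and Q = Normal(mu_q, sigma_q).

    Assumption: sigma_p and sigma_q are standard deviations.
    """
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q) ** 2) / (2 * sigma_q**2)
        - 0.5
    )


# Identical distributions have zero divergence.
print(kl_normal(0.0, 1.0, 0.0, 1.0))

# KL divergence is not symmetric: swapping P and Q changes the value.
print(kl_normal(0.0, 1.0, 1.0, 2.0), kl_normal(1.0, 2.0, 0.0, 1.0))
```

The formula only applies because both distributions are assumed to be normal. For images or text there is no such parametric form to plug in, which is the difficulty the paper is responding to.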

The paper introduces the concept of the embedded characteristic score to measure the discrepancy between two distributions of data (real vs. AI-generated).

The paper also explains the theoretical rationale behind how the embedded characteristic score is defined: that it is a legitimate measure of distance (it satisfies the triangle inequality), that its sample-mean approximation converges to the formal definition in terms of an expectation, and that it is bounded.

Figure 1. Definition of embedded characteristic score.

The paper contains statements that I am not clear on, so this blog post is dedicated to clearing up those concepts for myself. Below are the statements I want to understand better:

When the data distribution is heavy-tailed, the higher moments do not exist or do not converge

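When a distribution is ‘heavy-tailed’, the probability that the variable $\mathbf{X}$ lands far from the center of the distribution is higher than under, say, a normal distribution. For a continuous variable with density $f$, the $k$-th moment is defined as $E[\mathbf{X}^k] = \int x^k f(x)\,dx$, and it exists only if $E[|\mathbf{X}|^k] = \int |x|^k f(x)\,dx$ is finite. When $f$ decays too slowly in the tails, $|x|^k f(x)$ is not integrable for large enough $k$, so the integral diverges and the $k$-th moment does not exist. Note that this is not because $\mathbf{X}^k$ equals $\infty$ with positive probability: every realization is finite, but extreme values are not rare enough for the average to settle. For example, the standard Cauchy density $f(x) = \frac{1}{\pi(1 + x^2)}$ decays like $1/x^2$, so even its first moment (the mean) does not exist.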

In practice, moments of a distribution are estimated from data: $E[\mathbf{X}^k] \approx \frac{1}{n}\sum_{i=1}^{n}x_i^k$. The same procedure can be repeated multiple times to obtain multiple estimates of the $k$-th moment of $\mathbf{X}$. When the data are heavy-tailed, a few sampled $x_i^k$ can take very extreme values and dominate the sum, so the estimated $k$-th moment may vary greatly across rounds of sampling. Hence, we can say that the higher moments do not “converge”, i.e. the estimated $k$-th moments are unreliable when $k$ is high and $\mathbf{X}$ follows a heavy-tailed distribution.
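The repeated-sampling experiment described above can be sketched in a few lines. This is only an illustration under my own assumptions: I use the standard Cauchy distribution as the heavy-tailed example, the second moment ($k = 2$), and arbitrary choices for the sample size and number of repeats. The helper `moment_estimates` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible


def moment_estimates(sampler, k, n=10_000, repeats=20):
    """Estimate E[X^k] by the sample mean of x_i^k, repeated `repeats` times."""
    return np.array([np.mean(sampler(n) ** k) for _ in range(repeats)])


# Light-tailed case: the true second moment of Normal(0, 1) is 1.
normal_est = moment_estimates(rng.standard_normal, k=2)

# Heavy-tailed case: the second moment of the standard Cauchy does not exist.
cauchy_est = moment_estimates(rng.standard_cauchy, k=2)

print("normal estimates range:", normal_est.min(), normal_est.max())
print("cauchy estimates range:", cauchy_est.min(), cauchy_est.max())
```

The normal estimates cluster tightly around 1, while the Cauchy estimates jump around by orders of magnitude from one round to the next. Increasing `n` does not help, because there is no finite value for the estimates to settle on.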

The value of the characteristic function of a distribution always exists and is always bounded

But first, moment generating functions

Before diving into the concept of the characteristic function, we can revisit moment generating functions (MGFs) to understand why the characteristic function is different and necessary. I previously wrote about MGFs in a separate blog post, but will outline the key idea here for completeness:
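For reference, this is the standard textbook definition rather than a copy of the earlier post. The MGF of a random variable $X$ is

$$M_X(t) = E[e^{tX}] = \sum_{k=0}^{\infty} \frac{t^k}{k!} E[X^k],$$

where the series expansion holds whenever $M_X(t)$ is finite in a neighborhood of $t = 0$. In that case the moments can be read off by differentiating at zero:

$$E[X^k] = \left. \frac{d^k}{dt^k} M_X(t) \right|_{t=0}.$$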

The moment generating function is very useful, but a limitation is that it does not always exist: like the higher moments discussed above, $E[e^{tX}]$ can diverge for heavy-tailed distributions. Tam and Engelhardt, 2025 (Tam, 2025) propose that instead of comparing the moments of two distributions, we can use a related concept called the characteristic function.

Characteristic function definition

The characteristic function of a random variable $X$ is defined as $\phi_X(t) = E[e^{itX}]$, where $i$ is the imaginary unit. The characteristic function is a function of $t$. Similarly to the MGF, the characteristic function is a way to define the distribution of $X$, i.e. if two random variables $X$ and $Y$ have the same characteristic function, then they are identically distributed. This is proven in (Lukacs, 1970).
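Just as moments are estimated by sample averages, the characteristic function can be estimated by averaging $e^{itx_j}$ over the data. Below is a minimal sketch of this empirical characteristic function, checked against the known closed form $e^{-t^2/2}$ for a standard normal. The function name `empirical_cf`, the sample size, and the grid of $t$ values are my own choices for illustration, not anything taken from the paper.

```python
import numpy as np


def empirical_cf(samples, t):
    """Estimate phi_X(t) = E[exp(i t X)] by averaging exp(i t x_j) over the samples."""
    t = np.atleast_1d(t)
    # Rows index values of t, columns index samples.
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)
t_grid = np.array([0.0, 0.5, 1.0, 2.0])

phi_hat = empirical_cf(x, t_grid)
phi_exact = np.exp(-t_grid**2 / 2)  # characteristic function of Normal(0, 1)

print("largest absolute error:", np.abs(phi_hat - phi_exact).max())
```

The result is complex-valued in general. For a distribution that is symmetric around zero, such as this one, the imaginary part is close to zero.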

Why is the characteristic function always bounded?

Figure 2. Imaginary number plane.

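By Euler's formula, $e^{itX} = \cos(tX) + i\sin(tX)$. Hence, $|e^{itX}| = \sqrt{\cos^2(tX) + \sin^2(tX)} = 1$, so the value of $e^{itX}$ always lies on the unit circle in the complex plane. Averaging points on the unit circle cannot take us outside the unit disk: $|\phi_X(t)| = |E[e^{itX}]| \leq E[|e^{itX}|] = 1$. The expectation therefore always exists, and both the real and imaginary parts of the characteristic function lie in $[-1, 1]$, no matter how heavy the tails of $X$ are.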
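To see the contrast with moments numerically, here is a small sketch that again uses the standard Cauchy distribution as my own choice of heavy-tailed example. Its characteristic function is known to be $e^{-|t|}$, even though it has no mean or variance. The seed, sample size, and grid of $t$ values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(50_000)  # heavy-tailed: no mean, no variance

t_grid = np.linspace(-3.0, 3.0, 13)
points = np.exp(1j * np.outer(t_grid, x))  # every entry has modulus 1
phi_hat = points.mean(axis=1)              # empirical characteristic function

print("largest |x| in the sample:", np.abs(x).max())
print("sample second moment:", np.mean(x**2))
print("largest |phi_hat(t)|:", np.abs(phi_hat).max())
print("largest error vs exp(-|t|):", np.abs(phi_hat - np.exp(-np.abs(t_grid))).max())
```

The sample contains enormous values and the sample second moment is correspondingly huge, yet every term being averaged has modulus 1, so the estimate stays inside the unit disk and tracks $e^{-|t|}$ closely.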

Concluding remarks

  1. Tam, E. (2025). A Distributional Evaluation of Generative Image Models.
  2. Lukacs, E. (1970). Characteristic Functions.