Chapter 2 Inequalities involving expectations

This chapter discusses and proves two inequalities that Wooldridge highlights - Jensen’s and Chebyshev’s. Both involve expectations (and the theorems derived in the previous chapter).

2.1 Jensen’s Inequality

Jensen’s Inequality is a statement about the relative size of the expectation of a function compared with the function over that expectation (with respect to some random variable). To understand the mechanics, I first define convex functions and then walkthrough the logic behind the inequality itself.

2.1.1 Convex functions

A function \(f\) is convex (in two dimensions) if all points on a straight line connecting any two points on the graph of \(f\) is above or on that graph. More formally, \(f\) is convex if for \(\forall x_1, x_2 \in \mathbb{R}\), and \(\forall t \in [0,1]\):

\[ f(tx_1, (1-t)x_2) \leq tf(x_1) + (1-t)f(x_2). \]

Here, \(t\) is a weighting parameter that allows us to range over the full interval between points \(x_1\) and \(x_2\).

Note also that concave functions are defined as the opposite of convex functions i.e. a function \(h\) is concave if and only if \(-h\) is convex.

2.1.2 The Inequality

Jensen’s Inequality (JI) states that, for a convex function \(g\) and random variable \(X\):

\[ \mathbb{E}[g(X)] \geq g(E[X]) \]

This inequality is exceptionally general – it holds for any convex function. Moreover, given that concave functions are defined as negative convex functions, it is easy to see that JI also implies that if \(h\) is a function, \(h(\mathbb{E}[X]) \geq \mathbb{E}[h(X)]\).2

Interestingly, note the similarity between this inequality and the definition of variance in terms of expectations:

\[ var(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \]

and since \(var(X)\) is always positive:

\[ \begin{aligned} \mathbb{E}[X^2] - (\mathbb{E}[X])^2 &\geq 0 \\ \mathbb{E}[X^2] &\geq (\mathbb{E}[X])^2). \end{aligned} \]

We can therefore define \(g(X) = X^2\) (a convex function), and see that variance itself is an instance of Jensen’s Inequality.

2.1.3 Proof

Assume \(g(X)\) is a convex function, and \(L(X)= a + bX\) is a linear function tangential to \(g(X)\) at point \(\mathbb{E}[X]\).

Hence, since \(g\) is convex and \(L\) is tangential to \(g\), we know by definition that:

\[\begin{equation} g(x) \geq L(x), \forall x \in X. \end{equation}\]

So, therefore:

\[\begin{align} \mathbb{E}[g(X)] &\geq \mathbb{E}[L(X)] \\ &\geq \mathbb{E}[A + bX] \label{eq:def_of_L} \\ &\geq a +b\mathbb{E}[X] \label{eq:loe}\\ &\geq L(\mathbb{E}[X]) \label{eq:def_of_L2}\\ &\geq g(\mathbb{E}[X]) \; \; \; \square \label{eq:tang} \end{align}\]

The majority of this proof is straightforward. If one function is always greater than or equal to another function, then the unconditional expectation of the first function must be at least as big as that of the second. The interior lines of the proof follow from the definition of \(L\), the linearity of expectations, and another application of the definition of \(L\) respectively.

The final line then follows because, by the definition of the straight line \(L\), we know that \(L[\mathbb{E}[X]]\) is tangential with \(g\) at \(\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X] = g(\mathbb{E}[X])\).3

2.1.4 Application

In Chapter 2 of Agnostic Statistics (2019), the authors note (almost in passing) that the standard error of the mean is not unbiased, i.e. that \(\mathbb{E}[\hat{\sigma}] \neq \sigma\), even though it is consistent i.e. that \(\hat{\sigma}\xrightarrow{p} \sigma.\) The bias of the mean’s standard error is somewhat interesting (if not surprising), given how frequently we deploy the standard error (and, in a more general sense, highlights how important asymptotics are not just for the estimation of parameters, but also those parameters’ uncertainty). The proof of why \(\hat{\sigma}\) is biased also, conveniently for this chapter, uses Jensen’s Inequality.

The standard error of the mean is denoted as

\[\sigma = \sqrt{V(\bar{X})}\],

where \(V(\bar{X}) = \frac{V(X)}{n}\).

Our best estimate of this quantity \(\hat{\sigma} = \sqrt{\hat{V}(\bar{X})}\) is simply the square root of the sample variance estimator. We know that the variance estimator itself is unbiased and a consistent estimator of the sampling variance (see Agnostic Statistics Theorem 2.1.9).

The bias in the estimate of the sample mean’s standard error originates from the square root function Note that the square root is a strictly concave function. This means we can make two claims about the estimator. First, as with any concave function we can use the inverse version of Jensen’s Inequality, i.e. that \(\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])\). Second, since the square root is a strictly concave function, we can use the weaker “less than or equal to” operator with the strict “less than” inequality. Hence, the proof is reasonably easy:

\[ \begin{aligned} \mathbb{E}\left[\hat{\sigma}\right] = \mathbb{E}\left[\sqrt{\hat{V}(\bar{X})}\right] &< \sqrt{\mathbb{E}[\hat{V}(\bar{X})]} \;\;\; \text{(by Jensen's Inequality)}\\ & < \sqrt{V(\bar{X})} \;\;\; \text{(since the sampling variance is unbiased)} \\ & < \sigma. \;\;\; \square \end{aligned} \]

The first line follows by first defining the conditional expectation of the sample mean’s standard error, and then applying the noted variant of Jensen’s inequality. Then, since we know that the standard error estimator of the variance is unbiased, we can replace the expectation with the true sampling variance, and note finally that the square root of the true sampling variance is, by definition, the true standard error of the sample mean. Hence, we see that our estimator of the sampling mean’s standard error is strictly less than the true value and therefore is biased.

2.2 Chebyshev’s Inequality

The other inequality Wooldridge highlights is the Chebyshev Inequality. This inequality states that for a set of probability distributions, no more than a specific proportion of that distribution is more than a set distance from the mean.

More formally, if \(\mu = \mathbb{E}[X]\) and \(\sigma^2 = var(X)\), then:

\[\begin{equation} %P(|X-\mu|\geq t) \leq \frac{\sigma^2}{t^2} \text{ and } P(|Z| \geq k) \leq \frac{1}{k^2}, \end{equation}\]

where \(Z = (X-\mu)/\sigma\) (Wasserman 2004, 64) and \(k\) indicates the number of standard deviations.

2.2.1 Proof

First, let us define the variance (\(\sigma^2\)) as:

\[\begin{equation} \sigma^2 = \mathbb{E}[(X-\mu)^2]. \end{equation}\]

By expectation theory, we know that we can express any unconditional expectation as the weighted sum of its conditional components i.e. \(\mathbb{E}[A] = \sum_i\mathbb{E}[A|c_i]P(c_i),\) where \(\sum_iP(c_i)=1\). Hence:

\[\begin{equation} ... = \mathbb{E}[(X-\mu)^2 | k\sigma \leq |X - \mu|]P(k\sigma \leq |X - \mu|) + \mathbb{E}[(X-\mu)^2 | k\sigma > |X - \mu|]P(k\sigma > |X - \mu|) \end{equation}\]

Since any probability is bounded between 0 and 1, and variance must be greater than or equal to zero, the second term must be non-negative. If we remove this term, therefore, the right-hand side is necessarily either the same size or smaller. Therefore we can alter the equality to the following inequality:

\[\begin{equation} \sigma^2 \geq \mathbb{E}[(X-\mu)^2 | k\sigma \leq X - \mu]P(k\sigma \leq |X - \mu|) \end{equation}\]

This then simplifies:

\[ \begin{aligned} \sigma^2 &\geq (k\sigma)^2P(k\sigma \leq |X - \mu|) \\ &\geq k^2\sigma^2P(k\sigma \leq |X - \mu|) \\ \frac{1}{k^2} &\geq P(|Z| \geq k) \; \; \; \square \end{aligned} \]

Conditional on \(k\sigma \leq |X-\mu|\), \((k\sigma)^2 \leq (X-\mu)^2\), and therefore \(\mathbb{E}[(k\sigma)^2] \leq \mathbb{E}[(X-\mu)^2]\). Then, the last step simply rearranges the terms within the probability function.4

2.2.2 Applications

Wasserman (2004) notes that this inequality is useful when we want to know the probable bounds of an unknown quantity, and where direct computation would be difficult. It can also be used to prove the Weak Law of Large Numbers (point 5 in Wooldridge’s list!), which I demonstrate here.

It is worth noting, however, that the inequality is really powerful – it guarantees that a certain amount of a probability distribution is within a certain region – irrespective of the shape of that distribution (so long as we can estimate the mean and variance)!

For some well-defined distributions, this theorem is weaker than what we know by dint of their form. For example, we know that for a normal distribution, approximately 95 percent of values lie within 2 standard deviations of the mean. Chebyshev’s Inequality only guarantees that 75 percent of values lie within two standard deviations of the mean (since \(P(|Z| \geq k) \leq \frac{1}{2^2}\)). Crucially, however, even if we didn’t know whether a given distribution was normal, so long as it is a well-behaved probability distribution (i.e. the unrestricted integral sums to 1) we can guarantee that 75 percent will lie within two standard deviations of the mean.


Aronow, P. M., and B. T. Miller. 2019. Foundations of Agnostic Statistics. Cambridge University Press.

Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York, NY: Springer New York.

  1. Since \(-h(x)\) is convex, \(\mathbb{E}[-h(X)] \geq -h(\mathbb{E}[X])\) by JI. Hence, \(h(\mathbb{E}[X]) - \mathbb{E}[h(X)] \geq 0\) and so \(h(\mathbb{E}[X]) \geq \mathbb{E}[h(X)]\).↩︎

  2. Based on by Larry Wasserman.↩︎

  3. \(k\sigma \leq |X-\mu| \equiv k \leq |X-\mu|/\sigma \equiv |Z| \geq k,\) since \(\sigma\) is strictly non-negative.↩︎