Glosar

Selectați unul dintre cuvintele cheie din stânga ...

StatisticsEmpirical CDF Convergence

Timp de citit: ~10 min

Let's revisit the observation from the first section that the CDF of the empirical distribution of an independent list of observations from a distribution tends to be close to the CDF of the distribution itself. The Glivenko-Cantelli theorem is a mathematical formulation of the idea that these two functions are indeed close.

Theorem (Glivenko-Cantelli)
If F is the CDF of a distribution \nu and \widehat{F}_n is the CDF of the empirical distribution \widehat{\nu}_n of n observations from \nu, then F_n converges to F along the whole number line:

\begin{align*}\max_{x\in \mathbb{R}} |F(x) - \widehat{F}_n(x)| \to 0 \quad \text{as }n \to \infty,\end{align*}

The Dvoretzky-Kiefer-Wolfowitz inequality quantifies this result by providing a confidence band.

Theorem (Dvoretzky-Kiefer-Wolfowitz inequality)
If X_1, X_2, \ldots are independent random variables with common CDF F, then for all \epsilon > 0, we have

\begin{align*}\mathbb{P}\left(\max_{x}|F(x) - \widehat{F}_n(x)|\geq \epsilon\right) \leq 2 \operatorname{e}^{-2n\epsilon^2}.\end{align*}

In other words, the probability that the graph of \widehat{F}_n lies in the \epsilon-band around F (or vice versa) is at least 1 - 2 \operatorname{e}^{-2n\epsilon^2}.

Exercise
Use the code cell below to experiment with various values of n, and confirm that the graph usually falls inside the DKW confidence band.

Hint: after running the cell once, you can comment out the first plot command and run the cell repeatedly to get lots of empirical CDFs to show up on the same graph.

using Plots, Distributions, Roots
n = 50
ϵ = find_zero(ϵ -> 2exp(-2n*ϵ^2) - 0.1, 0.05)
xs = range(0, 8, length=100)
plot(xs, x-> 1-exp(-x) + ϵ, fillrange = x-> 1-exp(-x) - ϵ,
     label = "true CDF", legend = :bottomright,
     fillalpha = 0.2, linewidth = 0)
plot!(sort(rand(Exponential(1),n)), (1:n)/n,
      seriestype = :steppre, label = "empirical CDF")

As n increases, the confidence band . Yet the proportion of CDFs that lie in the band stays .

Exercise
Show that if \epsilon_n = \sqrt{\frac{1}{2n}\log(\frac{2}{\alpha})}, then with probability at least 1 - \alpha, we have |F(x) - F_n(x)| \leq \epsilon for all x \in \mathbb{R}.

Solution. If we substitute the given value of \epsilon, the right-hand side reduces to \epsilon. So the desired equation follows from the DKW inequality.

Bruno
Bruno Bruno