
Application: The Universal Approximation Theorem

We give a clean application of the Hahn-Banach theorem to neural networks: Cybenko’s proof Cybenko (1989) that single-hidden-layer networks are dense in $C(\Omega)$.

Single-hidden-layer networks

Fix $\Omega = [0,1]^d$ and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$. A single-hidden-layer network with $M$ neurons is a function of the form

$$f(x) = \sum_{j=1}^{M} a_j \, \sigma(w_j \cdot x + b_j), \quad x \in \mathbb{R}^d$$

where $w_j \in \mathbb{R}^d$ are weights, $b_j \in \mathbb{R}$ are biases, and $a_j \in \mathbb{R}$ are output coefficients. The set of all such functions (for all $M$) forms a linear subspace of $C(\Omega)$:

$$\mathcal{N}_\sigma = \operatorname{span}\{\sigma(w \cdot x + b) : w \in \mathbb{R}^d,\; b \in \mathbb{R}\} \subset C(\Omega).$$
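Concretely, an element of $\mathcal{N}_\sigma$ is cheap to evaluate. A minimal sketch (the sigmoid choice and all parameter values here are illustrative, not from the text):

```python
import numpy as np

def shallow_net(x, W, b, a, sigma=lambda t: 1 / (1 + np.exp(-t))):
    """Evaluate f(x) = sum_j a_j * sigma(w_j . x + b_j).

    x : (N, d) inputs, W : (M, d) weights, b : (M,) biases,
    a : (M,) output coefficients.
    """
    return sigma(x @ W.T + b) @ a

# A tiny network with M = 3 neurons on Omega = [0, 1]^2
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a = rng.normal(size=3)
x = rng.uniform(size=(5, 2))
print(shallow_net(x, W, b, a).shape)
```

Note that $\mathcal{N}_\sigma$ is a linear space in the coefficients $a_j$ only; the weights and biases enter nonlinearly.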

Definition 1 (Universal approximator)

An activation function $\sigma$ is a universal approximator if $\mathcal{N}_\sigma$ is dense in $C(\Omega)$, i.e. $\overline{\mathcal{N}_\sigma} = C(\Omega)$.

The question: Which activations $\sigma$ are universal approximators?
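Before answering, it helps to see density at work numerically. In the sketch below (an illustration only, not part of the theory: the target, the sigmoid, and the random frozen weights are all assumptions), we fit only the linear coefficients $a_j$ by least squares and watch the sup error on a grid shrink as the width $M$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda t: 1 / (1 + np.exp(-t))
target = lambda x: np.sin(2 * np.pi * x)  # hypothetical target in C([0, 1])

x = np.linspace(0.0, 1.0, 200)
errs = []
for M in (5, 20, 80):
    w = rng.normal(scale=10.0, size=M)      # random frozen weights
    b = rng.normal(scale=10.0, size=M)      # random frozen biases
    Phi = sigma(np.outer(x, w) + b)         # (200, M) feature matrix
    a, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
    errs.append(np.max(np.abs(Phi @ a - target(x))))
print(errs)  # sup error on the grid, decreasing with M
```

This is only evidence on a grid, of course; the theorem below is what guarantees density in the sup norm.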

Discriminatory activations

Definition 2 (Discriminatory function)

A continuous function $\sigma : \mathbb{R} \to \mathbb{R}$ is discriminatory if, for any finite signed Borel measure $\mu$ on $\Omega$,

$$\int_\Omega \sigma(w \cdot x + b) \, d\mu(x) = 0 \quad \text{for all } w \in \mathbb{R}^d,\; b \in \mathbb{R}$$

implies $\mu = 0$.

In other words, no nonzero measure can be invisible to every neuron $\sigma(w \cdot x + b)$.

Cybenko’s theorem

Theorem 1 (Universal Approximation (Cybenko, 1989))

Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous and discriminatory. Then

$$\overline{\mathcal{N}_\sigma} = C(\Omega).$$

Proof 1

Suppose for contradiction that $\overline{\mathcal{N}_\sigma} \neq C(\Omega)$.

Step 1 (Hahn-Banach). Since $\overline{\mathcal{N}_\sigma}$ is a proper closed subspace, a standard corollary of the Hahn-Banach theorem (a proper closed subspace is annihilated by some nonzero functional) gives a nonzero $F \in C(\Omega)^*$ with $F = 0$ on $\mathcal{N}_\sigma$.

Step 2 (Riesz-Markov). By the Riesz-Markov-Kakutani representation theorem, there is a unique nonzero finite signed regular Borel measure $\mu$ on $\Omega$ with

$$F(\phi) = \int_\Omega \phi \, d\mu \quad \text{for all } \phi \in C(\Omega).$$

Step 3 (Contradiction). Since $F$ vanishes on $\mathcal{N}_\sigma$, we have

$$\int_\Omega \sigma(w \cdot x + b) \, d\mu(x) = 0 \quad \text{for all } w,\; b.$$

Because $\sigma$ is discriminatory, $\mu = 0$, contradicting $F \neq 0$.

Which activations are discriminatory?

The definition asks that no nonzero measure hides from every neuron. The key observation is that steep rescalings of $\sigma$ approximate indicator functions of half-spaces.

Proposition 1 (Sigmoidal activations are discriminatory)

Any bounded measurable function $\sigma$ with

$$\sigma(t) \to \begin{cases} 1 & t \to +\infty \\ 0 & t \to -\infty \end{cases}$$

is discriminatory.

Proof 2

For $\lambda > 0$, the rescaled neuron $\sigma(\lambda(w \cdot x + b))$ converges pointwise, for $x$ off the hyperplane $w \cdot x + b = 0$, as $\lambda \to \infty$ to the half-space indicator

$$\mathbf{1}_{H_{w,b}}(x), \qquad H_{w,b} = \{x : w \cdot x + b > 0\}.$$

If $\int \sigma(w \cdot x + b) \, d\mu = 0$ for all $w, b$, then the same holds for $\sigma(\lambda(w \cdot x + b))$ (replace $w$ by $\lambda w$ and $b$ by $\lambda b$). Dominated convergence gives

$$\mu(H_{w,b}) = \int \mathbf{1}_{H_{w,b}} \, d\mu = 0 \quad \text{for all half-spaces } H_{w,b}.$$

Half-spaces generate the Borel $\sigma$-algebra on $\Omega$, so $\mu = 0$. (On the hyperplane $w \cdot x + b = 0$ itself the pointwise limit is $\sigma(0)$ rather than $0$ or $1$; Cybenko (1989) handles this boundary contribution carefully, and we gloss over it here.)
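The pointwise convergence driving this proof is easy to see numerically. A small sketch (the neuron $(w, b) = (2, -1)$, giving the half-space $(1/2, \infty)$, is an assumed example):

```python
import numpy as np

sigma = lambda t: 1 / (1 + np.exp(-t))  # a sigmoidal activation

w, b = 2.0, -1.0  # half-space H = {x : 2x - 1 > 0}
for x in (0.2, 0.8):  # one point outside H, one inside
    vals = [sigma(lam * (w * x + b)) for lam in (1, 10, 100)]
    print(x, [round(v, 4) for v in vals])
```

As $\lambda$ grows, the values head to $0$ at $x = 0.2$ (outside $H$) and to $1$ at $x = 0.8$ (inside $H$), matching $\mathbf{1}_{H_{w,b}}$.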

Proposition 2 (ReLU is discriminatory)

The ReLU activation $\sigma(t) = \max(0, t)$ is discriminatory.

Proof 3

For any $w, b$ and $\lambda > 0$, note that

$$\operatorname{ReLU}(\lambda(w \cdot x + b)) - \operatorname{ReLU}(\lambda(w \cdot x + b) - 1) \to \mathbf{1}_{H_{w,b}}(x)$$

pointwise as $\lambda \to \infty$, and the difference is bounded by 1. Both terms are ReLU neurons (with weight $\lambda w$ and biases $\lambda b$, $\lambda b - 1$), so since $\int \operatorname{ReLU}(w \cdot x + b) \, d\mu = 0$ for all $w, b$, the integral of the difference vanishes as well. The same dominated convergence argument gives $\mu(H_{w,b}) = 0$ for all half-spaces, hence $\mu = 0$.
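The difference of two ReLU neurons is a ramp of width $1/\lambda$, which sharpens into the half-space indicator. A quick numerical check (the neuron $(w, b) = (2, -1)$ is an assumed example):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

# ReLU(lam*s) - ReLU(lam*s - 1), with s = w.x + b, ramps from 0 to 1
# over s in [0, 1/lam] and converges pointwise to 1_{s > 0}.
w, b = 2.0, -1.0
for x in (0.2, 0.5, 0.8):
    s = w * x + b
    vals = [relu(lam * s) - relu(lam * s - 1) for lam in (1, 10, 1000)]
    print(x, vals)
```

For large $\lambda$ the difference is exactly $1$ wherever $w \cdot x + b > 1/\lambda$ and exactly $0$ wherever $w \cdot x + b \le 0$.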

Compare this with the Weierstrass theorem from our discussion of Density and Approximation: polynomials are dense in $C([a,b])$, and Cybenko’s theorem says the same for neural networks in $C([0,1]^d)$. Both are density results, but there is an important difference. Weierstrass approximation suffers from the curse of dimensionality: the number of monomials of degree $\leq n$ in $d$ variables grows as $\binom{n+d}{d}$, which blows up combinatorially (exponentially once $n$ scales with $d$). Neural networks can avoid this: the approximation is built from neurons $\sigma(w \cdot x + b)$ that each cut across all $d$ dimensions simultaneously, so the number of terms needed can scale much more favourably with $d$.
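The combinatorial growth of the monomial basis is easy to tabulate:

```python
from math import comb

# Number of monomials of degree <= n in d variables: C(n + d, d).
n = 10
for d in (1, 2, 5, 10, 20):
    print(d, comb(n + d, d))
```

Already at $n = 10$, $d = 20$ the polynomial basis has tens of millions of terms, while a shallow network of that degree of expressiveness needs nothing like as many neurons for many targets.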

On the other hand, the proof is non-constructive (it rests on Hahn-Banach, which uses Zorn’s lemma), so it says nothing about how many neurons are needed for a given accuracy $\varepsilon$ or how to find the weights.

A second strategy for density proofs

In the Density and Approximation chapter, every density proof was constructive: we built an explicit approximating sequence (Bernstein polynomials for Weierstrass, mollifiers for $C_c^\infty \subset L^p$) and estimated the error directly. Cybenko’s proof is fundamentally different. Instead of constructing an approximation, it argues by contradiction via duality:

  1. Assume $\overline{M} \subsetneq X$.

  2. Hahn-Banach produces a nonzero functional $F \in X^*$ vanishing on $M$.

  3. A representation theorem identifies $F$ concretely (here, as a measure via Riesz-Markov).

  4. Problem-specific information (here, the discriminatory property) forces $F = 0$, a contradiction.

This is a general-purpose machine: the same scheme proves density of test functions in $L^p$, completeness of orthonormal systems, and many other approximation results. Only step 4 changes between applications. The price is that the proof is non-constructive and tells us nothing about rates of convergence, but when a direct construction is unavailable, the duality approach is often the only way in.

References
  1. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. doi:10.1007/BF02551274