
Application: The Universal Approximation Theorem

We give a clean application of the Hahn-Banach theorem to neural networks: Cybenko’s proof Cybenko (1989) that single-hidden-layer networks are dense in $C(\Omega)$.

Single-hidden-layer networks

Fix $\Omega = [0,1]^d$ and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$. A single-hidden-layer network with $M$ neurons is a function of the form

$$f(x) = \sum_{j=1}^{M} a_j \, \sigma(w_j \cdot x + b_j), \quad x \in \mathbb{R}^d$$

where $w_j \in \mathbb{R}^d$ are weights, $b_j \in \mathbb{R}$ are biases, and $a_j \in \mathbb{R}$ are output coefficients. The set of all such functions (for all $M$) forms a linear subspace of $C(\Omega)$:

$$\mathcal{N}_\sigma = \operatorname{span}\{\sigma(w \cdot x + b) : w \in \mathbb{R}^d,\; b \in \mathbb{R}\} \subset C(\Omega).$$
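Concretely, an element of $\mathcal{N}_\sigma$ is cheap to evaluate. A minimal sketch (the sigmoid choice and all parameter values here are illustrative, not from the text):

```python
import numpy as np

def shallow_net(x, W, b, a, sigma=lambda t: 1 / (1 + np.exp(-t))):
    """Evaluate f(x) = sum_j a_j * sigma(w_j . x + b_j).

    x : (N, d) inputs, W : (M, d) weights, b : (M,) biases,
    a : (M,) output coefficients.
    """
    return sigma(x @ W.T + b) @ a

# A tiny network with M = 3 neurons on Omega = [0, 1]^2
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a = rng.normal(size=3)
x = rng.uniform(size=(5, 2))
print(shallow_net(x, W, b, a).shape)
```

Note that $\mathcal{N}_\sigma$ is a linear space in the coefficients $a_j$ only; the weights and biases enter nonlinearly.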

Definition 1 (Universal approximator)

An activation function $\sigma$ is a universal approximator if $\mathcal{N}_\sigma$ is dense in $C(\Omega)$, i.e. $\overline{\mathcal{N}_\sigma} = C(\Omega)$.

The question: Which activations $\sigma$ are universal approximators?
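Before answering, it helps to see density at work numerically. In the sketch below (an illustration only, not part of the theory: the target, the sigmoid, and the random frozen weights are all assumptions), we fit only the linear coefficients $a_j$ by least squares and watch the sup error on a grid shrink as the width $M$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda t: 1 / (1 + np.exp(-t))
target = lambda x: np.sin(2 * np.pi * x)  # hypothetical target in C([0, 1])

x = np.linspace(0.0, 1.0, 200)
errs = []
for M in (5, 20, 80):
    w = rng.normal(scale=10.0, size=M)      # random frozen weights
    b = rng.normal(scale=10.0, size=M)      # random frozen biases
    Phi = sigma(np.outer(x, w) + b)         # (200, M) feature matrix
    a, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
    errs.append(np.max(np.abs(Phi @ a - target(x))))
print(errs)  # sup error on the grid, decreasing with M
```

This is only evidence on a grid, of course; the theorem below is what guarantees density in the sup norm.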

Discriminatory activations

Definition 2 (Discriminatory function)

A continuous function $\sigma : \mathbb{R} \to \mathbb{R}$ is discriminatory if, for any finite signed Borel measure $\mu$ on $\Omega$,

$$\int_\Omega \sigma(w \cdot x + b) \, d\mu(x) = 0 \quad \text{for all } w \in \mathbb{R}^d,\; b \in \mathbb{R}$$

implies $\mu = 0$.

In other words, no nonzero measure can be invisible to every neuron $\sigma(w \cdot x + b)$.

Cybenko’s theorem

Theorem 1 (Universal Approximation (Cybenko, 1989))

Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous and discriminatory. Then

$$\overline{\mathcal{N}_\sigma} = C(\Omega).$$

Proof 1

Suppose for contradiction that $\overline{\mathcal{N}_\sigma} \neq C(\Omega)$.

Step 1 (Hahn-Banach). Since $\overline{\mathcal{N}_\sigma}$ is a proper closed subspace, a standard corollary of the Hahn-Banach theorem (a proper closed subspace is annihilated by some nonzero functional) gives a nonzero $F \in C(\Omega)^*$ with $F = 0$ on $\mathcal{N}_\sigma$.

Step 2 (Riesz-Markov). By the Riesz-Markov-Kakutani representation theorem, there is a unique nonzero finite signed regular Borel measure $\mu$ on $\Omega$ with

$$F(\phi) = \int_\Omega \phi \, d\mu \quad \text{for all } \phi \in C(\Omega).$$

Step 3 (Contradiction). Since $F$ vanishes on $\mathcal{N}_\sigma$, we have

$$\int_\Omega \sigma(w \cdot x + b) \, d\mu(x) = 0 \quad \text{for all } w,\; b.$$

Because $\sigma$ is discriminatory, $\mu = 0$, contradicting $F \neq 0$.

Which activations are discriminatory?

The definition asks that no nonzero measure hides from every neuron. The key observation is that steep rescalings of $\sigma$ approximate indicator functions of half-spaces.

Proposition 1 (Sigmoidal activations are discriminatory)

Any bounded measurable function $\sigma$ with

$$\sigma(t) \to \begin{cases} 1 & t \to +\infty \\ 0 & t \to -\infty \end{cases}$$

is discriminatory.

Proof 2

For $\lambda > 0$, the rescaled neuron $\sigma(\lambda(w \cdot x + b))$ converges pointwise, for $x$ off the hyperplane $w \cdot x + b = 0$, as $\lambda \to \infty$ to the half-space indicator

$$\mathbf{1}_{H_{w,b}}(x), \qquad H_{w,b} = \{x : w \cdot x + b > 0\}.$$

If $\int \sigma(w \cdot x + b) \, d\mu = 0$ for all $w, b$, then the same holds for $\sigma(\lambda(w \cdot x + b))$ (replace $w$ by $\lambda w$ and $b$ by $\lambda b$). Dominated convergence gives

$$\mu(H_{w,b}) = \int \mathbf{1}_{H_{w,b}} \, d\mu = 0 \quad \text{for all half-spaces } H_{w,b}.$$

Half-spaces generate the Borel $\sigma$-algebra on $\Omega$, so $\mu = 0$. (On the hyperplane $w \cdot x + b = 0$ itself the pointwise limit is $\sigma(0)$ rather than $0$ or $1$; Cybenko (1989) handles this boundary contribution carefully, and we gloss over it here.)
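The pointwise convergence driving this proof is easy to see numerically. A small sketch (the neuron $(w, b) = (2, -1)$, giving the half-space $(1/2, \infty)$, is an assumed example):

```python
import numpy as np

sigma = lambda t: 1 / (1 + np.exp(-t))  # a sigmoidal activation

w, b = 2.0, -1.0  # half-space H = {x : 2x - 1 > 0}
for x in (0.2, 0.8):  # one point outside H, one inside
    vals = [sigma(lam * (w * x + b)) for lam in (1, 10, 100)]
    print(x, [round(v, 4) for v in vals])
```

As $\lambda$ grows, the values head to $0$ at $x = 0.2$ (outside $H$) and to $1$ at $x = 0.8$ (inside $H$), matching $\mathbf{1}_{H_{w,b}}$.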

Proposition 2 (ReLU is discriminatory)

The ReLU activation $\sigma(t) = \max(0, t)$ is discriminatory.

Proof 3

For any $w, b$ and $\lambda > 0$, note that

$$\operatorname{ReLU}(\lambda(w \cdot x + b)) - \operatorname{ReLU}(\lambda(w \cdot x + b) - 1) \to \mathbf{1}_{H_{w,b}}(x)$$

pointwise as $\lambda \to \infty$, and the difference is bounded by 1. Both terms are ReLU neurons (with weight $\lambda w$ and biases $\lambda b$, $\lambda b - 1$), so since $\int \operatorname{ReLU}(w \cdot x + b) \, d\mu = 0$ for all $w, b$, the integral of the difference vanishes as well. The same dominated convergence argument gives $\mu(H_{w,b}) = 0$ for all half-spaces, hence $\mu = 0$.
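The difference of two ReLU neurons is a ramp of width $1/\lambda$, which sharpens into the half-space indicator. A quick numerical check (the neuron $(w, b) = (2, -1)$ is an assumed example):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

# ReLU(lam*s) - ReLU(lam*s - 1), with s = w.x + b, ramps from 0 to 1
# over s in [0, 1/lam] and converges pointwise to 1_{s > 0}.
w, b = 2.0, -1.0
for x in (0.2, 0.5, 0.8):
    s = w * x + b
    vals = [relu(lam * s) - relu(lam * s - 1) for lam in (1, 10, 1000)]
    print(x, vals)
```

For large $\lambda$ the difference is exactly $1$ wherever $w \cdot x + b > 1/\lambda$ and exactly $0$ wherever $w \cdot x + b \le 0$.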

Compare this with the Weierstrass theorem from our discussion of Density and Approximation: polynomials are dense in $C([a,b])$, and Cybenko’s theorem says the same for neural networks in $C([0,1]^d)$. Both are density results, but there is an important difference. Weierstrass approximation suffers from the curse of dimensionality: the number of monomials of degree $\leq n$ in $d$ variables grows as $\binom{n+d}{d}$, which blows up combinatorially (exponentially once $n$ scales with $d$). Neural networks can avoid this: the approximation is built from neurons $\sigma(w \cdot x + b)$ that each cut across all $d$ dimensions simultaneously, so the number of terms needed can scale much more favourably with $d$.
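The combinatorial growth of the monomial basis is easy to tabulate:

```python
from math import comb

# Number of monomials of degree <= n in d variables: C(n + d, d).
n = 10
for d in (1, 2, 5, 10, 20):
    print(d, comb(n + d, d))
```

Already at $n = 10$, $d = 20$ the polynomial basis has tens of millions of terms, while a shallow network of that degree of expressiveness needs nothing like as many neurons for many targets.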

On the other hand, the proof is non-constructive (it rests on Hahn-Banach, which uses Zorn’s lemma), so it says nothing about how many neurons are needed for a given accuracy $\varepsilon$ or how to find the weights.

A second strategy for density proofs

In the Density and Approximation chapter, every density proof was constructive: we built an explicit approximating sequence (Bernstein polynomials for Weierstrass, mollifiers for $C_c^\infty \subset L^p$) and estimated the error directly. Cybenko’s proof is fundamentally different. Instead of constructing an approximation, it argues by contradiction via duality:

  1. Assume $\overline{M} \subsetneq X$.

  2. Hahn-Banach produces a nonzero functional $F \in X^*$ vanishing on $M$.

  3. A representation theorem identifies $F$ concretely (here, as a measure via Riesz-Markov).

  4. Problem-specific information (here, the discriminatory property) forces $F = 0$, a contradiction.

This is a general-purpose machine: the same scheme proves density of test functions in $L^p$, completeness of orthonormal systems, and many other approximation results. Only step 4 changes between applications. The price is that the proof is non-constructive and tells us nothing about rates of convergence, but when a direct construction is unavailable, the duality approach is often the only way in.

References
  1. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. doi:10.1007/BF02551274