We give a clean application of the Hahn-Banach theorem to neural networks: Cybenko's proof (Cybenko, 1989) that single-hidden-layer networks are dense in $C([0,1]^n)$.
Single-hidden-layer networks
Fix $n \ge 1$ and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$. A single-hidden-layer network with $N$ neurons is a function of the form

$$f(x) = \sum_{j=1}^{N} c_j\, \sigma(w_j^\top x + b_j), \qquad x \in [0,1]^n,$$

where $w_j \in \mathbb{R}^n$ are weights, $b_j \in \mathbb{R}$ are biases, and $c_j \in \mathbb{R}$ are output coefficients. The set of all such functions (for all $N$) forms a linear subspace of $C([0,1]^n)$:

$$\mathcal{N}(\sigma) = \operatorname{span}\{\, x \mapsto \sigma(w^\top x + b) \;:\; w \in \mathbb{R}^n,\ b \in \mathbb{R} \,\}.$$
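As a concrete sketch (the function name, the choice of `tanh`, and the parameter shapes are illustrative, not fixed by the text), such a network can be evaluated with a single matrix product; note that concatenating the neurons of two networks represents their sum, which is why the set of all such functions is a linear subspace:

```python
import numpy as np

def shallow_net(x, W, b, c, sigma=np.tanh):
    """Evaluate f(x) = sum_j c_j * sigma(w_j . x + b_j) for each row of x.

    x: (m, n) inputs, W: (N, n) weights, b: (N,) biases, c: (N,) coefficients.
    """
    return sigma(x @ W.T + b) @ c

# Illustrative parameters for a network with N = 3 neurons on [0, 1]^2.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
x = rng.uniform(size=(5, 2))
print(shallow_net(x, W, b, c).shape)  # one output value per input row
```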
Definition 1 (Universal approximator)
An activation function $\sigma$ is a universal approximator if $\mathcal{N}(\sigma)$ is dense in $C([0,1]^n)$, i.e. $\overline{\mathcal{N}(\sigma)} = C([0,1]^n)$.
The question: Which activations are universal approximators?
Discriminatory activations
Definition 2 (Discriminatory function)
An activation $\sigma$ is discriminatory if, for every finite signed Borel measure $\mu$ on $[0,1]^n$,

$$\int_{[0,1]^n} \sigma(w^\top x + b)\, d\mu(x) = 0 \quad \text{for all } w \in \mathbb{R}^n,\ b \in \mathbb{R} \qquad \Longrightarrow \qquad \mu = 0.$$

In other words, no nonzero measure $\mu$ can be invisible to every neuron $x \mapsto \sigma(w^\top x + b)$.
Cybenko’s theorem
Theorem 1 (Universal Approximation (Cybenko, 1989))
Let $\sigma$ be continuous and discriminatory. Then $\mathcal{N}(\sigma)$ is dense in $C([0,1]^n)$: for every $f \in C([0,1]^n)$ and every $\varepsilon > 0$ there is a network $g \in \mathcal{N}(\sigma)$ with $\|f - g\|_\infty < \varepsilon$.
Proof 1
Suppose for contradiction that $\overline{\mathcal{N}(\sigma)} \neq C([0,1]^n)$.
Step 1 (Hahn-Banach). Since $\overline{\mathcal{N}(\sigma)}$ is a proper closed subspace of $C([0,1]^n)$, the classification of closures (a proper closed subspace lies in the kernel of some nonzero bounded functional) gives a nonzero $L \in C([0,1]^n)^*$ with $L = 0$ on $\mathcal{N}(\sigma)$.
Step 2 (Riesz-Markov). By the Riesz-Markov-Kakutani theorem, there is a unique nonzero finite signed Borel measure $\mu$ on $[0,1]^n$ with

$$L(f) = \int_{[0,1]^n} f \, d\mu \qquad \text{for all } f \in C([0,1]^n).$$
Step 3 (Contradiction). Since $L$ vanishes on $\mathcal{N}(\sigma)$, we have

$$\int_{[0,1]^n} \sigma(w^\top x + b)\, d\mu(x) = 0 \qquad \text{for all } w \in \mathbb{R}^n,\ b \in \mathbb{R}.$$

Because $\sigma$ is discriminatory, $\mu = 0$, hence $L = 0$, contradicting the fact that $L$ is nonzero.
Which activations are discriminatory?
The definition asks that no nonzero measure hides from every neuron. The key observation is that steep rescalings $\sigma(\lambda(w^\top x + b))$ approximate indicator functions of half-spaces $\{x : w^\top x + b \ge 0\}$.
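A quick numeric check of this observation (the particular $w$, $b$, and the logistic sigmoid are illustrative choices, not from the text): the mean deviation between the rescaled neuron and the half-space indicator shrinks as the rescaling factor grows.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# As lam -> infinity, sigmoid(lam * (w.x + b)) tends pointwise to the
# indicator of the half-space {x : w.x + b >= 0} (it stays at 1/2 on
# the boundary hyperplane, a mu-negligible set for most measures).
rng = np.random.default_rng(1)
w, b = np.array([1.0, -1.0]), 0.25
x = rng.uniform(size=(1000, 2))          # sample points in [0, 1]^2
t = x @ w + b
indicator = (t >= 0).astype(float)

errs = []
for lam in [1, 10, 100, 1000]:
    errs.append(np.mean(np.abs(sigmoid(lam * t) - indicator)))
    print(lam, errs[-1])                 # mean error decreases with lam
```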
Proposition 1 (Sigmoidal activations are discriminatory)
Proposition 2 (ReLU is discriminatory)
The ReLU activation $\mathrm{ReLU}(t) = \max(t, 0)$ is discriminatory.
Proof 3
For any $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$, note that

$$\mathrm{ReLU}\big(\lambda(w^\top x + b) + 1\big) - \mathrm{ReLU}\big(\lambda(w^\top x + b)\big) \longrightarrow \mathbf{1}_{\{w^\top x + b \ge 0\}}(x)$$

pointwise as $\lambda \to \infty$. Since $\int \sigma(w^\top x + b)\, d\mu(x) = 0$ for all $w, b$, the difference above integrates to zero against $\mu$ for every $\lambda$, and the same dominated convergence argument gives $\int \mathbf{1}_{\{w^\top x + b \ge 0\}}\, d\mu = 0$ for all half-spaces, hence $\mu = 0$.
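The two-ReLU ramp in the proof can be inspected numerically (the sampled values of $w^\top x + b$ below are an arbitrary illustration):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

# relu(lam*t + 1) - relu(lam*t) equals 0 for t <= -1/lam, 1 for t >= 0,
# and is linear in between: a sigmoidal ramp built from two ReLU neurons
# that converges pointwise to the indicator 1_{t >= 0}.
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # sample values of w.x + b
for lam in [1.0, 10.0, 100.0]:
    ramp = relu(lam * t + 1.0) - relu(lam * t)
    print(lam, ramp)
```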
Compare this with the Weierstrass theorem from our discussion of Density and Approximation: polynomials are dense in $C([a,b])$, and Cybenko's theorem says the same for neural networks in $C([0,1]^n)$. Both are density results, but there is an important difference. Weierstrass approximation suffers from the curse of dimensionality: the number of monomials of degree at most $d$ in $n$ variables grows as $\binom{n+d}{d}$, which is exponential in $\min(n, d)$. Neural networks avoid this; the approximation is built from neurons that each cut across all dimensions simultaneously, so the number of terms needed can scale much more favourably with $n$.
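A quick count makes the growth concrete (the fixed degree $d = 10$ is an arbitrary illustration):

```python
from math import comb

# Number of monomials of degree <= d in n variables: comb(n + d, d).
# Even at fixed degree, the polynomial basis explodes as n grows.
d = 10
for n in [2, 10, 100]:
    print(n, comb(n + d, d))
```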
On the other hand, the proof is non-constructive (it rests on Hahn-Banach, which uses Zorn’s lemma), so it says nothing about how many neurons are needed for a given accuracy or how to find the weights.
A second strategy for density proofs
In the Density and Approximation chapter, every density proof was constructive: we built an explicit approximating sequence (Bernstein polynomials for Weierstrass, mollifiers for density in $L^p$) and estimated the error directly. Cybenko's proof is fundamentally different. Instead of constructing an approximation, it argues by contradiction via duality:
1. Assume $\overline{M} \neq X$ for the subspace $M$ of the Banach space $X$ in question.
2. Hahn-Banach produces a nonzero functional $L \in X^*$ vanishing on $M$.
3. A representation theorem identifies $L$ concretely (here, as a measure $\mu$ via Riesz-Markov).
4. Problem-specific information (here, the discriminatory property) forces $L = 0$, a contradiction.
This is a general-purpose machine: the same scheme proves density of test functions in $L^p$, completeness of orthonormal systems, and many other approximation results. Only step 4 changes between applications. The price is that the proof is non-constructive and tells us nothing about rates of convergence, but when a direct construction is unavailable, the duality approach is often the only way in.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. *Mathematics of Control, Signals and Systems*, 2(4), 303–314. doi:10.1007/BF02551274