What is a Neural Network?¶
A single hidden layer neural network computes

$$
f_N(x) = \sum_{j=1}^{N} c_j \,\sigma(w_j \cdot x + b_j),
$$

where:

- $x \in \mathbb{R}^d$ is the input
- $w_j \in \mathbb{R}^d$ are weight vectors
- $b_j \in \mathbb{R}$ are biases
- $c_j \in \mathbb{R}$ are output weights
- $\sigma : \mathbb{R} \to \mathbb{R}$ is an activation function
- $N$ is the number of neurons (width)
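As a concrete reference, here is a minimal NumPy sketch of this forward pass (the name `shallow_net` and the random weights are illustrative choices, not from the original text):

```python
import numpy as np

def shallow_net(x, W, b, c, sigma=np.tanh):
    """Single hidden layer network: f(x) = sum_j c_j * sigma(w_j . x + b_j)."""
    # x: (d,), W: (N, d), b: (N,), c: (N,)  ->  scalar output
    return c @ sigma(W @ x + b)

rng = np.random.default_rng(0)
d, N = 3, 16                              # input dimension and width
x = rng.normal(size=d)
W, b, c = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=N)
print(shallow_net(x, W, b, c))            # a single scalar output
```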
Common Activation Functions¶
| Name | Formula | Properties |
|---|---|---|
| Sigmoid | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | Smooth, bounded |
| Tanh | $\sigma(x) = \tanh(x)$ | Smooth, bounded, zero-centered |
| ReLU | $\sigma(x) = \max(0, x)$ | Simple, unbounded |
| Softplus | $\sigma(x) = \log(1 + e^{x})$ | Smooth approximation to ReLU |
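In code these are one-liners; a NumPy version (an illustrative sketch, not part of the original text) might look like:

```python
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(x, 0.0)
def softplus(x): return np.logaddexp(0.0, x)   # log(1 + e^x), computed stably
```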
The Theorem¶
**Universal Approximation Theorem (Cybenko, 1989).** Let $\sigma$ be a continuous sigmoidal activation and let $K \subset \mathbb{R}^d$ be compact. Then for every $f \in C(K)$ and every $\varepsilon > 0$ there exist a width $N$ and parameters $w_j$, $b_j$, $c_j$ such that

$$
\sup_{x \in K} \left| f(x) - \sum_{j=1}^{N} c_j \,\sigma(w_j \cdot x + b_j) \right| < \varepsilon.
$$

In words: single hidden layer neural networks are dense in $C(K)$—just like polynomials (Weierstrass) and trigonometric polynomials (Fourier).
Connection to the Weierstrass Theorem¶
The Universal Approximation Theorem is a modern descendant of one of the most beautiful results in classical analysis.
The structural parallel is striking:
| Weierstrass (1885) | Universal Approximation (1989) |
|---|---|
| Polynomials | Neural networks |
| Monomials as basis | Ridge functions as basis |
| Dense in $C([a,b])$ | Dense in $C(K)$, $K \subset \mathbb{R}^d$ compact |
| Existence only—no degree bound | Existence only—no width bound |
Both theorems answer the same fundamental question: Can this family of functions approximate anything? Both say “yes” without telling us how many terms we need.
The Crucial Caveat: Existence vs. Convergence of a Fixed Basis¶
As Trefethen emphasizes, there’s an important subtlety here. The Weierstrass theorem says: for each $f$ and each $\varepsilon > 0$, there exists some polynomial $p$ that works. But this polynomial depends on $f$—it’s not saying that expanding $f$ in a fixed basis (like Chebyshev or Fourier) will converge.
For a fixed orthogonal basis to converge, we need more regularity:
| Basis | Convergence requirement |
|---|---|
| Fourier series | Bounded variation for pointwise convergence; smoothness for rates |
| Chebyshev expansion | Bounded variation; Lipschitz or better for good rates |
| Legendre expansion | Similar regularity requirements |
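To see this regularity dependence numerically, here is a small NumPy sketch (the helper `cheb_max_error` and the two test functions are my illustrative choices, not from the text) comparing Chebyshev fits of an analytic function and of $|x|$:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_max_error(f, deg, n_test=2000):
    # Least-squares Chebyshev fit on a Chebyshev grid, then measure the sup-norm error.
    x_fit = np.cos(np.pi * np.arange(4 * deg + 1) / (4 * deg))
    coeffs = C.chebfit(x_fit, f(x_fit), deg)
    x_test = np.linspace(-1.0, 1.0, n_test)
    return np.max(np.abs(f(x_test) - C.chebval(x_test, coeffs)))

smooth = lambda x: np.exp(np.sin(3 * x))   # analytic: error decays geometrically
kink   = lambda x: np.abs(x)               # Lipschitz only: error decays slowly

for deg in [4, 8, 16, 32, 64]:
    print(f"deg {deg:2d}:  smooth {cheb_max_error(smooth, deg):.2e}   |x| {cheb_max_error(kink, deg):.2e}")
```

Under these assumptions, the error for the analytic function drops geometrically with the degree, while the error for $|x|$ decays only algebraically.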
The Weierstrass theorem allows us to tailor the polynomial to the function. This is analogous to how the Universal Approximation Theorem lets us choose the weights to match the target function.
Key difference in high dimensions: In $d$ dimensions, polynomials of degree $n$ have $\binom{n+d}{d}$ terms—a count that explodes as $d$ grows. Neural networks with the right activation can sometimes escape this curse for functions with appropriate spectral structure, which is why the Universal Approximation Theorem opened a new chapter in approximation theory.
Proof Sketch¶
The key insight: ridge functions $x \mapsto \sigma(w \cdot x + b)$ form a “rich enough” set.
For Sigmoid Activation¶
1. As the weight magnitude $\|w\|$ increases, $\sigma(w \cdot x + b)$ approaches a step function across the hyperplane $w \cdot x + b = 0$ (see the sketch below).
2. Step functions can approximate indicator functions of half-spaces.
3. Linear combinations of indicator functions of half-spaces can approximate any continuous function (by chopping the domain into pieces).
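A quick numerical check of the first step (purely illustrative): as the weight $k$ grows, $\sigma(k(x - 0.25))$ hardens into the indicator of the half-line $x > 0.25$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 5)                    # [-1, -0.5, 0, 0.5, 1]
for k in [1, 10, 100, 500]:
    # Larger weight magnitude k  ->  sigmoid(k * (x - 0.25)) -> the step 1{x > 0.25}
    print(k, np.round(sigmoid(k * (x - 0.25)), 3))
```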
For General Activations¶
The proof uses:

- The Hahn–Banach theorem (functional analysis)
- The fact that $\sigma$ is non-polynomial (or bounded and non-constant)
- Density arguments in $C(K)$

Roughly: if the span of the ridge functions $\sigma(w \cdot x + b)$ were not dense in $C(K)$, Hahn–Banach would give a nonzero signed measure $\mu$ on $K$ annihilating every ridge function; one then shows such a $\mu$ must be zero, a contradiction.
What the Theorem Does NOT Say¶
The theorem is purely an existence statement. It does not tell us:

- how many neurons $N$ are needed for a given accuracy $\varepsilon$,
- how to find the weights $w_j$, biases $b_j$, and output coefficients $c_j$, or
- how fast the error decreases as $N$ grows.

This is analogous to:

- Weierstrass says polynomials are dense, but doesn’t say which degree you need
- Stone–Weierstrass is qualitative, not quantitative
The Curse of Dimensionality¶
For polynomials of degree at most $n$ in $d$ dimensions, we have $\binom{n+d}{d}$ terms.
For moderate $n$ and $d$, this already reaches approximately $10^6$ terms!
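A quick Python check of this count (the values of $n$ and $d$ below are my illustrative choices, not necessarily those in the original text):

```python
from math import comb

# Number of monomials of total degree at most n in d variables: C(n + d, d)
def num_terms(n: int, d: int) -> int:
    return comb(n + d, d)

for n, d in [(5, 2), (10, 5), (10, 10), (10, 20)]:
    print(f"degree n={n:2d}, dimension d={d:2d}: {num_terms(n, d):,} terms")
```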
The universal approximation theorem doesn’t save us from this. Generic continuous functions still require exponentially many neurons as dimension grows.
The key question: Are there function classes where neural networks beat this exponential barrier?
Answer: Yes! These are the Barron spaces.
Extensions¶
ReLU Networks¶
The original theorem required a bounded (sigmoidal) activation. Extensions (e.g., Leshno et al., 1993) show that density holds for any continuous activation that is not a polynomial—in particular, ReLU networks are also universal approximators.
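In one dimension this can even be made constructive: any continuous function is close to its piecewise linear interpolant, and a piecewise linear function with $N$ kinks is exactly a one-hidden-layer ReLU network with $N$ units. A NumPy sketch (the name `shallow_relu_interpolant` is mine, not from the text):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_relu_interpolant(knots, values):
    """One-hidden-layer ReLU net g(x) = y_0 + sum_j a_j * relu(x - b_j) that reproduces
    the piecewise linear interpolant of (knots, values) on [knots[0], knots[-1]]."""
    knots = np.asarray(knots, dtype=float)
    values = np.asarray(values, dtype=float)
    slopes = np.diff(values) / np.diff(knots)                 # slope on each interval
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))   # slope changes = ReLU coefficients
    biases = knots[:-1]                                       # one hidden unit per knot except the last
    def g(x):
        x = np.asarray(x, dtype=float)
        return values[0] + relu(x[..., None] - biases) @ coeffs
    return g

# Approximate f(x) = sin(2*pi*x) on [0, 1] with N hidden ReLU units.
f = lambda x: np.sin(2.0 * np.pi * x)
N = 20
knots = np.linspace(0.0, 1.0, N + 1)
g = shallow_relu_interpolant(knots, f(knots))

xs = np.linspace(0.0, 1.0, 1000)
print("max |f - g| with", N, "ReLU units:", np.max(np.abs(f(xs) - g(xs))))
```

In higher dimensions there is no such explicit recipe, which is exactly why the density proofs above go through functional analysis.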
Deep Networks¶
The universal approximation theorem applies to single hidden layer networks. What about deeper networks?
Depth helps expressivity: some functions can be represented exactly with only a few neurons per layer in a deep network, but require exponentially many neurons in a shallow network.
Example: highly oscillatory functions, such as the $n$-fold composition of the tent map $g(x) = \min(2x,\, 2 - 2x)$ on $[0, 1]$ (a sawtooth with $2^{n-1}$ teeth), can be computed by (as the sketch below makes concrete):

- Shallow ReLU network: needs $\Omega(2^n)$ neurons
- Deep ReLU network with $O(n)$ layers: needs only $O(n)$ neurons
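A tiny NumPy sketch (illustrative, not from the original) of the deep construction: each layer spends two ReLU units on one tent map, yet the number of linear pieces doubles with every layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # One layer: the tent map g(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1],
    # written with two ReLU units.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Composing the tent map `depth` times costs 2 * depth ReLU units in total
    # but produces a piecewise linear function with 2**depth pieces.
    for _ in range(depth):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 9)
print(deep_sawtooth(xs, depth=3))   # alternates 0, 1, 0, 1, ...: 2**2 = 4 full teeth on [0, 1]
```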
Comparison with Classical Approximation¶
| Aspect | Polynomials | Fourier | Neural Networks |
|---|---|---|---|
| Universal in $C(K)$? | Yes | Yes (periodic) | Yes |
| Constructive? | Yes (Lagrange, etc.) | Yes (FFT) | Sort of (training) |
| Convergence rate? | Known | Known | Depends on function class |
| High-$d$? | Curse | Curse | Can escape for Barron functions |
Summary¶
The universal approximation theorem tells us:
- ✅ Neural networks can approximate any continuous function
- ❌ It doesn’t say how or how efficiently
- 🔑 The real power comes from Barron’s theorem (next section)
The theorem is like knowing “there exists a polynomial approximation”—useful, but we need quantitative bounds to do practical computation.