
What is a Neural Network?

A single hidden layer neural network computes:

$$
f_n(x) = \sum_{k=1}^{n} a_k \sigma(w_k \cdot x + b_k)
$$

where:

- $x \in \mathbb{R}^d$ is the input
- $w_k \in \mathbb{R}^d$ and $b_k \in \mathbb{R}$ are the input weights and bias of the $k$-th hidden neuron
- $a_k \in \mathbb{R}$ are the output weights
- $\sigma: \mathbb{R} \to \mathbb{R}$ is the activation function
- $n$ is the number of hidden neurons (the width)
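As a concrete reference point, here is a minimal NumPy sketch of this formula; the function name `f_n`, the default `tanh` activation, and the array shapes are illustrative choices, not anything fixed by the definition above.

```python
import numpy as np

def f_n(x, a, w, b, sigma=np.tanh):
    """Single hidden layer network: sum_k a_k * sigma(w_k . x + b_k).

    x : (d,)   input point
    a : (n,)   output weights
    w : (n, d) input weights, one row w_k per hidden neuron
    b : (n,)   biases
    """
    return a @ sigma(w @ x + b)

# Tiny usage example: n = 3 hidden neurons acting on a d = 2 dimensional input.
rng = np.random.default_rng(0)
print(f_n(np.array([0.5, -1.0]),
          a=rng.normal(size=3), w=rng.normal(size=(3, 2)), b=rng.normal(size=3)))
```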

Common Activation Functions

| Name | Formula | Properties |
|---|---|---|
| Sigmoid | $\sigma(t) = \frac{1}{1 + e^{-t}}$ | Smooth, bounded |
| Tanh | $\sigma(t) = \tanh(t)$ | Smooth, bounded, zero-centered |
| ReLU | $\sigma(t) = \max(0, t)$ | Simple, unbounded |
| Softplus | $\sigma(t) = \log(1 + e^t)$ | Smooth approximation to ReLU |
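For reference, the four activations in the table can be written as short NumPy functions (a sketch; using `np.logaddexp` for softplus is just a numerically stable way to evaluate $\log(1 + e^t)$):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))     # smooth, bounded in (0, 1)

def tanh(t):
    return np.tanh(t)                   # smooth, bounded, zero-centered

def relu(t):
    return np.maximum(0.0, t)           # piecewise linear, unbounded

def softplus(t):
    return np.logaddexp(0.0, t)         # log(1 + e^t), smooth approximation to ReLU
```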

The Theorem

Theorem (Universal Approximation; Cybenko 1989, Hornik, Stinchcombe & White 1989). Let $\sigma$ be a continuous sigmoidal activation function and let $K \subset \mathbb{R}^d$ be compact. Then functions of the form $f_n(x) = \sum_{k=1}^{n} a_k \sigma(w_k \cdot x + b_k)$ are dense in $C(K)$: for every $f \in C(K)$ and every $\epsilon > 0$ there exist $n$ and parameters $a_k, w_k, b_k$ such that $\sup_{x \in K} |f(x) - f_n(x)| < \epsilon$.

In words: Neural networks are dense in $C(K)$, just like polynomials (Weierstrass) and trigonometric functions.
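A minimal numerical sketch of the statement, assuming NumPy: draw the inner parameters $w_k, b_k$ at random and solve for the outer weights $a_k$ by least squares. The target function, the random-weight scales, and $n = 200$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a generic continuous function on the compact set K = [-1, 1].
f = lambda x: np.sin(3.0 * x) + 0.5 * np.abs(x)
x = np.linspace(-1.0, 1.0, 400)

# f_n(x) = sum_k a_k * sigma(w_k * x + b_k) with random (w_k, b_k)
# and the a_k chosen by linear least squares.
n = 200
w = rng.normal(scale=10.0, size=n)
b = rng.uniform(-10.0, 10.0, size=n)
features = np.tanh(np.outer(x, w) + b)              # shape (400, n)
a, *_ = np.linalg.lstsq(features, f(x), rcond=None)

err = np.max(np.abs(features @ a - f(x)))
print(f"sup-norm error on the grid with n = {n} tanh neurons: {err:.2e}")
```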

Connection to the Weierstrass Theorem

The Universal Approximation Theorem is a modern descendant of one of the most beautiful results in classical analysis.

The structural parallel is striking:

| Weierstrass (1885) | Universal Approximation (1989) |
|---|---|
| Polynomials $\sum a_k x^k$ | Neural networks $\sum a_k \sigma(w_k x + b_k)$ |
| Monomials $\{1, x, x^2, \ldots\}$ as basis | Ridge functions $\{\sigma(w \cdot x + b)\}$ as basis |
| Dense in $C([a,b])$ | Dense in $C(K)$ |
| Existence only (no degree bound) | Existence only (no width bound) |

Both theorems answer the same fundamental question: Can this family of functions approximate anything? Both say “yes” without telling us how many terms we need.
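To see the Weierstrass side of the parallel numerically (an illustrative sketch, assuming NumPy; the target and the degrees are arbitrary choices): fit polynomials of increasing degree to a continuous but non-smooth function and watch the uniform error shrink.

```python
import numpy as np

f = lambda x: np.abs(x)                  # continuous on [-1, 1], not differentiable at 0
x = np.linspace(-1.0, 1.0, 2001)

for deg in (4, 16, 64):
    p = np.polynomial.Chebyshev.fit(x, f(x), deg)   # least-squares polynomial of given degree
    err = np.max(np.abs(p(x) - f(x)))
    print(f"degree {deg:2d}: sup-norm error on the grid {err:.3e}")
# The error decreases with the degree, but how large a degree is needed depends
# on f: the "existence only, no degree bound" row in the table above.
```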

The Crucial Caveat: Existence vs. Convergence of a Fixed Basis

As Trefethen emphasizes, there's an important subtlety here. The Weierstrass theorem says: for each $f$ and $\epsilon$, there exists some polynomial that works. But this polynomial depends on $f$; it's not saying that expanding $f$ in a fixed basis (like Chebyshev or Fourier) will converge.

For a fixed orthogonal basis to converge, we need more regularity:

| Basis | Convergence requirement |
|---|---|
| Fourier series | Bounded variation for pointwise convergence; smoothness for rates |
| Chebyshev expansion | Bounded variation; Lipschitz or better for good rates |
| Legendre expansion | Similar regularity requirements |
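A quick way to see why regularity matters, sketched with NumPy's FFT (the two targets are illustrative choices): the Fourier coefficients of a smooth periodic function die off after a handful of modes, while those of a function with a corner decay only like $1/k^2$, so far more terms are needed.

```python
import numpy as np

N = 2 ** 12
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)

smooth = np.exp(np.sin(x))          # analytic and 2*pi-periodic
kinked = np.abs(x - np.pi)          # continuous and of bounded variation, but has corners

for name, vals in [("smooth", smooth), ("kinked", kinked)]:
    c = np.abs(np.fft.rfft(vals)) / N            # Fourier coefficient magnitudes
    print(f"{name}: {np.sum(c > 1e-8)} coefficients larger than 1e-8")
# The smooth function is captured by a handful of modes; the kinked one needs
# orders of magnitude more, reflecting its slower coefficient decay.
```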

The Weierstrass theorem allows us to tailor the polynomial to the function. This is analogous to how the Universal Approximation Theorem lets us choose the weights $w_k, b_k, a_k$ to match the target function.

Key difference in high dimensions: In $d$ dimensions, polynomials of degree $n$ have $\binom{n+d}{d}$ terms, which is exponential in $d$. Neural networks with the right activation can sometimes escape this curse for functions with appropriate spectral structure, which is why the Universal Approximation Theorem opened a new chapter in approximation theory.

Proof Sketch

The key insight: ridge functions $\sigma(w \cdot x + b)$ form a “rich enough” set.

For Sigmoid Activation

  1. As the weight magnitude $\|w\|$ increases, $\sigma(w \cdot x + b)$ approaches a step function

  2. Step functions can approximate indicator functions of half-spaces

  3. Indicator functions of half-spaces can approximate any continuous function (by chopping the domain into pieces)
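The sketch below runs through these three steps with NumPy (the target, the partition into 50 pieces, and the steepness $w = 500$ are illustrative choices): steep sigmoids act as steps, and a weighted sum of them forms a staircase approximation of a continuous function.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: np.sin(2.0 * np.pi * x)              # continuous target on [0, 1]

x = np.linspace(0.0, 1.0, 1001)
cuts = np.linspace(0.0, 1.0, 50, endpoint=False)   # left endpoints of 50 subintervals
jumps = np.diff(f(cuts), prepend=0.0)              # jump needed at each cut point
w = 500.0                                          # large weight: sigmoid(w*(x-c)) ~ step at c

# Staircase built from near-step sigmoids; refining the partition and increasing
# w drives the uniform error to zero.
staircase = sum(j * sigmoid(w * (x - c)) for j, c in zip(jumps, cuts))
print("sup-norm error of the 50-step sigmoid staircase:", np.max(np.abs(staircase - f(x))))
```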

For General Activations

The proof uses:

  1. The Hahn-Banach theorem: if the span of the ridge functions were not dense in $C(K)$, some nonzero bounded linear functional would vanish on all of them
  2. The Riesz representation theorem: that functional corresponds to a signed measure $\mu$ on $K$ with $\int_K \sigma(w \cdot x + b)\, d\mu(x) = 0$ for all $w, b$
  3. A Fourier/measure-theoretic argument showing that this forces $\mu = 0$ whenever $\sigma$ is sigmoidal (or, more generally, not a polynomial), a contradiction

What the Theorem Does NOT Say

The theorem does not tell us:

  1. How many neurons $n$ are needed to reach a given accuracy $\epsilon$
  2. How to find the weights $a_k, w_k, b_k$ (training is a separate optimization problem)
  3. How fast the approximation error decreases as $n$ grows

This is analogous to the Weierstrass theorem: it guarantees that some polynomial approximation exists, but gives no degree bound and no recipe for constructing it.

The Curse of Dimensionality

For polynomials of degree $n$ in $d$ dimensions, we have $\binom{n+d}{d}$ terms.

For $d = 10$ and $n = 10$: $\binom{20}{10} = 184{,}756 \approx 2 \times 10^5$ terms!
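A quick check of this count with Python's standard library (the chosen dimensions are arbitrary):

```python
from math import comb

# Number of monomials of total degree <= n in d variables: C(n + d, d).
n = 10
for d in (1, 2, 5, 10, 20):
    print(f"n = {n}, d = {d:2d}: {comb(n + d, d):,} terms")
```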

The universal approximation theorem doesn’t save us from this. Generic continuous functions still require exponentially many neurons as dimension grows.

The key question: Are there function classes where neural networks beat this exponential barrier?

Answer: Yes! These are the Barron spaces.

Extensions

ReLU Networks

The original theorem required bounded activation. Extensions show:

  1. ReLU networks are also universal approximators in $C(K)$
  2. More generally, any continuous activation function works as long as it is not a polynomial (Leshno et al., 1993)
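As an illustration of the ReLU case (a sketch, assuming NumPy; the target and the 41 interpolation nodes are arbitrary choices): three ReLUs combine into a triangular “hat” bump, and a weighted sum of hats reproduces a piecewise-linear interpolant of any continuous target on an interval.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def hat(x, c, h):
    """Triangular bump of height 1 centered at c with half-width h, built from 3 ReLUs."""
    return (relu(x - (c - h)) - 2.0 * relu(x - c) + relu(x - (c + h))) / h

f = lambda x: np.exp(-3.0 * x) * np.sin(5.0 * x)   # some continuous target on [0, 1]
x = np.linspace(0.0, 1.0, 1000)
nodes = np.linspace(0.0, 1.0, 41)
h = nodes[1] - nodes[0]

# Piecewise-linear interpolant of f, written as a shallow ReLU network.
approx = sum(f(c) * hat(x, c, h) for c in nodes)
print("sup-norm error with", 3 * len(nodes), "ReLUs:", np.max(np.abs(approx - f(x))))
```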

Deep Networks

The universal approximation theorem applies to single hidden layer networks. What about deeper networks?

Depth helps expressivity: Some functions can be represented exactly with $O(\log n)$ neurons in a deep network, but require exponentially many neurons in a shallow network.

Example: The function $f(x) = x^{2^L}$ can be computed by:

  1. A deep network of depth $O(L)$ that simply squares its input $L$ times
  2. A shallow (single hidden layer) network only at the cost of a number of neurons that grows exponentially in $L$
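A toy sketch of the depth advantage in plain Python (this mirrors only the composition structure; a real network would additionally need its activation to implement the squaring step): repeated squaring reaches degree $2^L$ after $L$ layers.

```python
def deep_power(x, L):
    """Compute x ** (2 ** L) by composing the squaring map L times (one 'layer' per step)."""
    for _ in range(L):
        x = x * x
    return x

assert deep_power(3.0, 4) == 3.0 ** 16   # L = 4 layers reach degree 2**4 = 16
```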

Comparison with Classical Approximation

| Aspect | Polynomials | Fourier | Neural Networks |
|---|---|---|---|
| Universal in $C(K)$? | Yes | Yes (periodic) | Yes |
| Constructive? | Yes (Lagrange, etc.) | Yes (FFT) | Sort of (training) |
| Convergence rate? | Known | Known | Depends on function class |
| High-$d$? | Curse | Curse | Can escape for Barron |

Summary

The universal approximation theorem tells us:

  1. Single hidden layer networks are dense in $C(K)$: they can approximate any continuous function on a compact set to any accuracy
  2. The guarantee is pure existence: no bound on the width $n$, no convergence rate, and no algorithm for finding the weights
  3. Quantitative rates require restricting the function class, which is where Barron spaces enter

The theorem is like knowing “there exists a polynomial approximation”—useful, but we need quantitative bounds to do practical computation.