Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Optional Section

This section and the next (§Barron’s Theorem) cover material beyond the core MATH 551 syllabus. They extend the deterministic 1D approximation theory to the high-dimensional setting, where neural networks are the practitioner’s tool. The content is included for interested students; nothing later in the course depends on it.

Big Idea

A one-hidden-layer neural network is a basis expansion in ridge functions σ(wx+b)\sigma(w \cdot x + b). Cybenko (1989) and Hornik (1991) showed this basis is dense in C(K)C(K), just like polynomials: every continuous target can be approximated to any tolerance, in any dimension. Density is the qualitative result. It says nothing about how many ridges are needed; that is a separate question, and the one where dimension actually bites. We answer it on the next two pages.

Why we care: functions on Rd\mathbb{R}^d in practice

The input xRdx \in \mathbb{R}^d is just a list of dd numbers, and that covers most of the real-world objects we want to compute with:

The function f:RdRf: \mathbb{R}^d \to \mathbb{R} (or Rm\mathbb{R}^m) we want is then “image \to class label”, “sentence \to translation”, “molecule \to binding energy”, “board state \to best move”, “model parameters \to option price”. None of these are available in closed form, and the input dimensions are firmly outside what tensor-product Chebyshev or any other deterministic basis can handle. The next page makes that obstacle quantitative (the curse of dimensionality) and shows under what hypothesis a neural network gets around it (Barron’s theorem). The current page builds the basis (ridge functions), shows it can represent any continuous target (universal approximation), and demonstrates a 1D failure mode that flips when dd is large.

A neural network as a basis expansion

A one-hidden-layer network is

fn(x)  =  k=1nakσ(wkx+bk),f_n(x) \;=\; \sum_{k=1}^{n} a_k\, \sigma(w_k \cdot x + b_k),

with xRdx \in \mathbb{R}^d, weights wkRdw_k \in \mathbb{R}^d, biases bkb_k, output coefficients aka_k, and a fixed non-linearity σ\sigma. Stacking the wkw_k^\top as rows of a matrix WRn×dW \in \mathbb{R}^{n \times d} gives the matrix form

z=Wx+b,fn(x)=a ⁣σ(z),z = W x + b, \qquad f_n(x) = a^{\!\top}\sigma(z),

with σ\sigma acting elementwise.

Each term σ(wkx+bk)\sigma(w_k \cdot x + b_k) is a ridge function: a 1D function σ\sigma depending on xx only through wkx+bkw_k \cdot x + b_k, extruded along the (d1)(d-1) directions perpendicular to wkw_k. A network is a sum of nn such ridges; wkw_k controls the direction, bkb_k shifts it, aka_k scales it.

Compare with a Chebyshev expansion fn(x)=ckTk(x)f_n(x) = \sum c_k T_k(x). Both are linear combinations of nn basis functions. The Chebyshev basis {Tk}\{T_k\} is fixed; the ridge basis {σ(wkx+bk)}\{\sigma(w_k \cdot x + b_k)\} is parameterised by (wk,bk)(w_k, b_k) that we can choose.

A 1D ridge network: evaluation, differentiation, integration

In one variable a ridge is just a shifted, scaled copy of σ\sigma: ϕk(x)=σ(wkx+bk)\phi_k(x) = \sigma(w_k x + b_k) with wk,bkRw_k, b_k \in \mathbb{R}. The basis is parameterised but each basis function is something we can draw on a single axis. Before doing any algebra, look at it.

What a ridge basis looks like

Three panels. The left shows three single ridges with different (w,b)(w, b). The slope at the inflection point is w/4w/4 for tanh\tanh, so w|w| controls how fast the ridge transitions from one saturation level to the other; the sign of ww flips left and right; bb shifts the inflection point to x=b/wx = -b/w. The middle panel overlays a few akσ(wkx+bk)a_k\,\sigma(w_k x + b_k), dashed unscaled and solid scaled, to show how aka_k rescales each individual ridge. The right panel adds them up and overlays a target.

Source
<Figure size 1100x320 with 3 Axes>

The takeaway: each ridge contributes one bend, localised to a region of width 1/wk\sim 1/|w_k| around x=bk/wkx = -b_k/w_k. Outside that region the ridge is essentially constant and adds nothing. Contrast with Chebyshev, where every TkT_k oscillates across the entire interval [1,1][-1, 1]. Ridges are local, polynomials are global. This single fact drives the difference in behaviour in high dimensions: a dd-variable ridge is a 1D bump pulled along d1d-1 flat directions, so its cost does not multiply with dd, while a tensor-product basis of global polynomials does.

Random features: the linear case

The simplest way to fix (wk,bk)(w_k, b_k) is to draw them randomly from a probability distribution, for instance wkN(0,σ2I)w_k \sim \mathcal{N}(0, \sigma^2 I) and bkU(c,c)b_k \sim \mathcal{U}(-c, c). Once drawn, the weights are frozen and only {ak}\{a_k\} varies; every operation becomes linear in the coefficient vector aa, and the ridge basis is a fixed basis like {Tk}\{T_k\}. This is the random-features view of a one-hidden-layer network: no training, no gradient descent, just a random draw followed by a single linear solve.

The random draw raises two questions we defer to the next page: which distribution to sample from (Fourier inversion picks out the right one) and what does random sampling buy quantitatively (the draw is a Monte Carlo sample, with the dimension-free V/n\sqrt V/\sqrt n rate). Both are answered in §Barron. For now any concrete distribution suffices to demonstrate the construction.

We come back to the trained case at the end, where the random draw is replaced by gradient descent on (w,b)(w, b).

Evaluation

fn(x)=a ⁣ϕ(x)f_n(x) = a^{\!\top} \phi(x) at a point. For a batch x1,,xNx_1, \ldots, x_N, this is fn=Vaf_n = V a with Vjk=ϕk(xj)V_{jk} = \phi_k(x_j). Cost O(Nn)O(Nn), the same shape as Chebyshev’s value-from-coefficient map.

Differentiation

ϕk(x)=wkσ(wkx+bk)\phi_k'(x) = w_k\, \sigma'(w_k x + b_k), so

fn(x)  =  k(wkak)σ(wkx+bk).f_n'(x) \;=\; \sum_k (w_k a_k)\, \sigma'(w_k x + b_k).

The coefficient map is aWaa \mapsto W a with W=diag(wk)W = \mathrm{diag}(w_k), expressed in the related basis σ\sigma'. Like Chebyshev: differentiation maps TkUk1T_k \to U_{k-1} to a different family, and a recurrence converts back. For ridges we evaluate in the σ\sigma'-basis directly.

Integration

If Σ=σ\Sigma' = \sigma then ϕk=Σ(wkx+bk)/wk\int \phi_k = \Sigma(w_k x + b_k) / w_k. For σ=tanh\sigma = \tanh, Σ=logcosh\Sigma = \log\cosh. The definite integral is

abfn(x)dx  =  m ⁣a,mk=Σ(wkb+bk)Σ(wka+bk)wk.\int_a^b f_n(x)\,dx \;=\; m^{\!\top} a, \qquad m_k = \frac{\Sigma(w_k b + b_k) - \Sigma(w_k a + b_k)}{w_k}.

A linear functional with precomputed moments, identical in shape to the Chebyshev integration construction.

Fitting

Given data {(xj,f(xj))}j=1N\{(x_j, f(x_j))\}_{j=1}^N and the frozen weights from above, the coefficients aa minimise the empirical loss

L^(a)  =  1NVaf22  =  1Nj=1N(f(xj)fn(xj))2,\hat L(a) \;=\; \frac{1}{N}\,\|V a - f\|_2^2 \;=\; \frac{1}{N}\sum_{j=1}^N \bigl(f(x_j) - f_n(x_j)\bigr)^2,

a linear least squares problem in aa. With Chebyshev nodes for {xj}\{x_j\} and the TT-basis the matrix VV is structured and a DCT solves it in O(NlogN)O(N \log N). With random ridges no such structure, but the standard QR-based least-squares solver from §Least Squares is still cheap, and there is no non-convex search.

The empirical loss is itself a Monte Carlo estimator of the population loss

L(a)  =  K(f(x)fn(x))2dμ(x),L(a) \;=\; \int_K \bigl(f(x) - f_n(x)\bigr)^2\, d\mu(x),

provided the data points {xj}\{x_j\} are drawn iid from μ\mu; L^\hat L is unbiased for LL and its standard deviation around LL decays as V/N\sqrt V/\sqrt N (see the Monte Carlo notebook for the standard rate). So the random-features network is doing two things at once: sampling weights, and estimating an integral (the loss) from samples. The first piece is what §Barron makes principled.

The trained case

If we let (wk,bk)(w_k, b_k) vary too, the loss

L^(θ)  =  1Nj=1N(f(xj)fn(xj;θ))2,θ=(a,w,b),\hat L(\theta) \;=\; \frac{1}{N}\sum_{j=1}^N \bigl(f(x_j) - f_n(x_j; \theta)\bigr)^2, \qquad \theta = (a, w, b),

is non-convex. To see why, recall the basis matrix VRN×nV \in \mathbb{R}^{N \times n} from the fitting section above: Vjk=σ(wkxj+bk)V_{jk} = \sigma(w_k \cdot x_j + b_k) is the kk-th basis function evaluated at the jj-th data point. With (w,b)(w, b) frozen, VV is a fixed matrix and L^=1NVaf22\hat L = \tfrac{1}{N}\|Va - f\|_2^2 is quadratic (hence convex) in aa. Once we let (w,b)(w, b) vary, VV depends nonlinearly on ww through σ(wkxj+bk)\sigma(w_k \cdot x_j + b_k), the loss loses its quadratic structure, and many local minima appear. There is no closed-form solver. We descend.

Stochastic gradient descent

Vanilla gradient descent updates θt+1=θtηL^(θt)\theta_{t+1} = \theta_t - \eta\,\nabla \hat L(\theta_t) with L^=(1/N)jj\nabla \hat L = (1/N)\sum_j \nabla \ell_j and j=(f(xj)fn(xj;θ))2\ell_j = (f(x_j) - f_n(x_j;\theta))^2. Each step costs O(N)O(N) work; for NN in the millions this is unaffordable.

Stochastic gradient descent (SGD) replaces the full sum with a random mini-batch Bt{1,,N}\mathcal{B}_t \subset \{1, \ldots, N\} at each step (sizes 32 to 128 are typical):

g^t  =  1BtjBtj(θt),θt+1  =  θtηg^t.\hat g_t \;=\; \frac{1}{|\mathcal{B}_t|}\sum_{j \in \mathcal{B}_t}\nabla \ell_j(\theta_t), \qquad \theta_{t+1} \;=\; \theta_t - \eta\, \hat g_t.

Per-step cost drops to O(B)O(|\mathcal{B}|). The mini-batch gradient g^t\hat g_t is a Monte Carlo estimator of the full-batch gradient: unbiased, with variance O(1/B)O(1/|\mathcal{B}|). So in expectation each step does what gradient descent would; in any individual step the direction is noisy. In practice the step size η\eta is adapted per parameter using moving averages of g^t\hat g_t and g^t2\hat g_t^{\,2}; this is Adam, the standard default.

The full training loop:

Algorithm 1 (SGD training of a one-hidden-layer ridge network)

Inputs: initial parameters θ0=(a0,w0,b0)\theta_0 = (a_0, w_0, b_0), learning rate η\eta, mini-batch size B|\mathcal{B}|, dataset {(xj,f(xj))}j=1N\{(x_j, f(x_j))\}_{j=1}^N.

For t=0,1,2,t = 0, 1, 2, \ldots:

  1. Sample mini-batch Bt{1,,N}\mathcal{B}_t \subset \{1, \ldots, N\} uniformly at random (without replacement within an epoch).

  2. Forward pass: evaluate fn(xj;θt)=at ⁣σ(Wtxj+bt)f_n(x_j; \theta_t) = a_t^{\!\top}\sigma(W_t x_j + b_t) for jBtj \in \mathcal{B}_t.

  3. Loss: L^Bt(θt)=Bt1jBt(f(xj)fn(xj;θt))2\hat L_{\mathcal{B}_t}(\theta_t) = |\mathcal{B}_t|^{-1}\sum_{j \in \mathcal{B}_t}\bigl(f(x_j) - f_n(x_j; \theta_t)\bigr)^2.

  4. Backward pass: g^t=θL^Bt(θt)\hat g_t = \nabla_\theta \hat L_{\mathcal{B}_t}(\theta_t) by chain rule.

  5. Step: θt+1=θtηg^t\theta_{t+1} = \theta_t - \eta\, \hat g_t.

Stop when validation loss plateaus.

The “backward pass” step is just calculus: differentiate the loss with respect to each parameter. For a one-hidden-layer ridge network everything is fully explicit, which makes the training loop entirely concrete. For deeper networks the same chain-rule pattern is what backpropagation automates.

Let δj=f(xj)fn(xj;θ)\delta_j = f(x_j) - f_n(x_j; \theta) be the residual at sample jj, and j=δj2\ell_j = \delta_j^2. Differentiating:

jak  =  2δjσ(wkxj+bk),\frac{\partial \ell_j}{\partial a_k} \;=\; -2\,\delta_j\, \sigma(w_k \cdot x_j + b_k),
jwk  =  2δjakσ(wkxj+bk)xj,\frac{\partial \ell_j}{\partial w_k} \;=\; -2\,\delta_j\, a_k\, \sigma'(w_k \cdot x_j + b_k)\, x_j,
jbk  =  2δjakσ(wkxj+bk).\frac{\partial \ell_j}{\partial b_k} \;=\; -2\,\delta_j\, a_k\, \sigma'(w_k \cdot x_j + b_k).

The mini-batch gradient is the average of these over jBtj \in \mathcal{B}_t. Libraries (PyTorch, JAX) compute these automatically by reverse-mode automatic differentiation, but for a one-layer net you can write them by hand and verify.

The universal approximation theorem

The 1D story above shows ridges are a basis we can compute with: we can evaluate, differentiate, integrate, and fit, exactly as we did with {Tk}\{T_k\}. The first question to settle, before any quantitative rate, is whether this basis is flexible enough to represent every continuous function. That is the density question, and is what universal approximation answers.

Stone–Weierstrass already gives polynomial density in C(K)C(K) for compact KRdK \subset \mathbb{R}^d, so “any continuous function in any dimension is approximable by polynomials” was a solved problem. Cybenko (1989) and Hornik (1991) added that ridge functions are also dense, under a hypothesis on σ\sigma called discriminatory.

Definition 1 (Discriminatory function)

A function σ:RR\sigma: \mathbb{R} \to \mathbb{R} is discriminatory on a compact KRdK \subset \mathbb{R}^d if the only signed regular Borel measure μ\mu on KK satisfying

Kσ(wx+b)dμ(x)=0for every (w,b)Rd×R\int_K \sigma(w \cdot x + b)\, d\mu(x) = 0 \quad\text{for every } (w, b) \in \mathbb{R}^d \times \mathbb{R}

is the zero measure μ=0\mu = 0.

The condition says: no signed measure can be invisible to every ridge function with activation σ\sigma. Equivalently, the closed linear span of {σ(wx+b)}(w,b)\{\sigma(w \cdot x + b)\}_{(w,b)} in C(K)C(K) has no non-trivial annihilator, which by Hahn–Banach means the span is dense.

Theorem 1 (Universal Approximation (Cybenko 1989, Hornik 1991))

Let σ\sigma be continuous and discriminatory on a compact KRdK \subset \mathbb{R}^d. For every continuous f:KRf: K \to \mathbb{R} and every ε>0\varepsilon > 0 there is a width nn and parameters {(ak,wk,bk)}k=1n\{(a_k, w_k, b_k)\}_{k=1}^n with

f(x)fn(x)  <  εfor every xK,|f(x) - f_n(x)| \;<\; \varepsilon \qquad\text{for every } x \in K,

where fn(x)=kakσ(wkx+bk)f_n(x) = \sum_k a_k\,\sigma(w_k \cdot x + b_k).

A self-contained proof (Hahn–Banach plus a Fourier argument that continuous sigmoidal activations are discriminatory) is on this companion site for MATH 725. The two activations that dominate in practice both satisfy the hypothesis:

A first experiment: fitting sin(2πx)\sin(2\pi x)

Train a width-20 network with tanh\tanh on sin(2πx)\sin(2\pi x) over [1,1][-1, 1]. After 2000 Adam steps the network reaches 102\sim 10^{-2} pointwise error. The adaptive Chebyshev chebfit_adaptive from A Chebyshev Toolbox: Doing Linear Algebra with Functions reaches machine precision on the same function with about 20 coefficients, in one DCT. Both get arbitrarily close to sin(2πx)\sin(2\pi x); the network needs thousands of gradient steps and still plateaus at 10-2. In 1D the network is strictly worse than what we already have. The picture flips in high dd. Demo 1 of the companion notebook.

See Also