
Big Idea

A one-hidden-layer neural network is a basis expansion in ridge functions $\sigma(w \cdot x + b)$. Cybenko (1989) and Hornik (1991) showed this basis is dense in $C(K)$, just like polynomials. Density is not the headline; the rate is. Barron’s theorem says that for $f$ with bounded Barron norm $C_f$, a width-$n$ network reaches $L^2$ accuracy $C_f / \sqrt{n}$, with no $d$ in the rate.

A neural network as a basis expansion

A one-hidden-layer network is

$$f_n(x) = \sum_{k=1}^{n} a_k\, \sigma(w_k \cdot x + b_k),$$

with $x \in \mathbb{R}^d$, weights $w_k \in \mathbb{R}^d$, biases $b_k$, output coefficients $a_k$, and a fixed non-linearity $\sigma$. Stacking the $w_k^\top$ as rows of a matrix $W \in \mathbb{R}^{n \times d}$ gives the matrix form

$$z = W x + b, \qquad f_n(x) = a^{\top}\sigma(z),$$

with $\sigma$ acting elementwise.
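
A minimal NumPy sketch of this matrix form, with $\tanh$ as $\sigma$ and placeholder sizes; everything here is illustrative, not code from the companion notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 3, 5, 100                        # input dim, width, batch size (placeholders)

W = rng.normal(size=(n, d))                # rows are the ridge directions w_k
b = rng.normal(size=n)                     # biases b_k
a = rng.normal(size=n)                     # output coefficients a_k
X = rng.uniform(-1.0, 1.0, size=(N, d))    # a batch of N inputs, one point per row

Z = X @ W.T + b                            # Z[j, k] = w_k . x_j + b_k, shape (N, n)
f = np.tanh(Z) @ a                         # f_n(x_j) = a^T sigma(z_j), shape (N,)
```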

Each term $\sigma(w_k \cdot x + b_k)$ is a ridge function: a 1D function $\sigma$ depending on $x$ only through $w_k \cdot x + b_k$, extruded along the $(d-1)$ directions perpendicular to $w_k$. A network is a sum of $n$ such ridges; $w_k$ controls the direction, $b_k$ shifts it, $a_k$ scales it.

Compare with a Chebyshev expansion $f_n(x) = \sum c_k T_k(x)$. Both are linear combinations of $n$ basis functions. The Chebyshev basis $\{T_k\}$ is fixed; the ridge basis $\{\sigma(w_k \cdot x + b_k)\}$ is parameterised by $(w_k, b_k)$ that we can choose.

A 1D ridge network: evaluation, differentiation, integration

In one variable a ridge is just a shifted, scaled copy of $\sigma$: $\phi_k(x) = \sigma(w_k x + b_k)$ with $w_k, b_k \in \mathbb{R}$. The basis is parameterised but each basis function is something we can draw on a single axis. Before doing any algebra, look at it.

What a ridge basis looks like

Three panels. The left shows three single ridges with different $(w, b)$. The slope at the inflection point is $w$ for $\tanh$, so $|w|$ controls how fast the ridge transitions from one saturation level to the other; the sign of $w$ flips left and right; $b$ shifts the inflection point to $x = -b/w$. The middle panel overlays a few $a_k\,\sigma(w_k x + b_k)$, dashed unscaled and solid scaled, to show how $a_k$ rescales each individual ridge. The right panel adds them up and overlays a target.

[Figure: three panels showing single ridges, scaled ridges, and their sum against a target.]
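
A sketch that reproduces the three panels; the specific $(a_k, w_k, b_k)$ values and the target are illustrative choices, not the ones used in the notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 400)
fig, axes = plt.subplots(1, 3, figsize=(11, 3.2))

# Left: three single ridges with different (w, b).
for w, b in [(4, 0.0), (-8, 2.0), (12, -6.0)]:
    axes[0].plot(x, np.tanh(w * x + b), label=f"w={w}, b={b}")
axes[0].set_title("single ridges"); axes[0].legend(fontsize=7)

# Middle: a_k * sigma(w_k x + b_k); dashed = unscaled, solid = scaled.
params = [(0.8, 6, -2.0), (-1.2, 5, 1.0), (0.5, 10, 3.0)]
for a, w, b in params:
    axes[1].plot(x, np.tanh(w * x + b), "--", alpha=0.4)
    axes[1].plot(x, a * np.tanh(w * x + b))
axes[1].set_title("scaled ridges")

# Right: the sum of the scaled ridges against an illustrative target.
total = sum(a * np.tanh(w * x + b) for a, w, b in params)
axes[2].plot(x, total, label="sum of ridges")
axes[2].plot(x, np.sin(2 * np.pi * x), label="target")
axes[2].set_title("sum vs. target"); axes[2].legend(fontsize=7)

plt.tight_layout()
plt.show()
```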

The takeaway: each ridge contributes one bend, localised to a region of width $\sim 1/|w_k|$ around $x = -b_k/w_k$. Outside that region the ridge is essentially constant and adds nothing. Contrast with Chebyshev, where every $T_k$ oscillates across the entire interval $[-1, 1]$. Ridges are local, polynomials are global. This single fact drives the difference in behaviour in high dimensions: a $d$-variable ridge is a 1D bump pulled along $d-1$ flat directions, so its cost does not multiply with $d$, while a tensor-product basis of global polynomials does.

Random features: the linear case

Treat $(w_k, b_k)$ as fixed up front and let only $\{a_k\}$ vary: this is the random-features view. The ridge basis is then a fixed basis like $\{T_k\}$, and every operation is linear in the coefficient vector $a$. We come back to the trained case at the end.

Evaluation

$f_n(x) = a^{\top} \phi(x)$ at a point. For a batch $x_1, \ldots, x_N$, this is $f_n = V a$ with $V_{jk} = \phi_k(x_j)$. Cost $O(Nn)$, the same shape as Chebyshev’s value-from-coefficient map.
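
A sketch of the value-from-coefficient map for $\tanh$ ridges with arbitrary $(w_k, b_k, a_k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 8, 200
w = rng.normal(size=n) * 5.0             # slopes w_k (arbitrary)
b = rng.uniform(-3.0, 3.0, size=n)       # shifts b_k
a = rng.normal(size=n)                   # coefficients a_k
x = np.linspace(-1.0, 1.0, N)

V = np.tanh(np.outer(x, w) + b)          # V[j, k] = phi_k(x_j), shape (N, n)
f_vals = V @ a                           # f_n at all N points: one O(Nn) product
```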

Differentiation

$\phi_k'(x) = w_k\, \sigma'(w_k x + b_k)$, so

$$f_n'(x) = \sum_k (w_k a_k)\, \sigma'(w_k x + b_k).$$

The coefficient map is $a \mapsto W a$ with $W = \mathrm{diag}(w_k)$, expressed in the related basis $\sigma'$. Like Chebyshev: differentiation maps $T_k \mapsto k\,U_{k-1}$, a different family, and a recurrence converts back. For ridges we evaluate in the $\sigma'$-basis directly.
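
In code the derivative reuses the same coefficients with $\sigma' = 1 - \tanh^2$; the finite-difference check at the end is my own addition, and all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
w = rng.normal(size=n) * 5.0
b = rng.uniform(-3.0, 3.0, size=n)
a = rng.normal(size=n)
x = np.linspace(-0.9, 0.9, 200)

Z = np.outer(x, w) + b
f = np.tanh(Z) @ a                           # values in the sigma-basis
df = (1.0 - np.tanh(Z) ** 2) @ (w * a)       # derivative: sigma'-basis, coefficients w_k a_k

fd = np.gradient(f, x)                       # crude finite-difference check
print(np.max(np.abs(df[5:-5] - fd[5:-5])))   # small, limited by the FD step
```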

Integration

If $\Sigma' = \sigma$ then $\int \phi_k = \Sigma(w_k x + b_k) / w_k$. For $\sigma = \tanh$, $\Sigma = \log\cosh$. The definite integral is

$$\int_a^b f_n(x)\,dx = m^{\top} a, \qquad m_k = \frac{\Sigma(w_k b + b_k) - \Sigma(w_k a + b_k)}{w_k}.$$

A linear functional with precomputed moments, identical in shape to the Chebyshev integration construction.
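
A sketch of the moment construction with $\Sigma = \log\cosh$; the quadrature comparison is only a sanity check, and the parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
w = rng.normal(size=n) * 5.0
b = rng.uniform(-3.0, 3.0, size=n)
a = rng.normal(size=n)
lo, hi = -1.0, 1.0

Sigma = lambda t: np.log(np.cosh(t))              # antiderivative of tanh
m = (Sigma(w * hi + b) - Sigma(w * lo + b)) / w   # moment vector m_k
integral = m @ a                                  # definite integral of f_n over [lo, hi]

x = np.linspace(lo, hi, 4001)                     # quadrature check
f = np.tanh(np.outer(x, w) + b) @ a
print(integral, np.trapz(f, x))                   # the two numbers should agree closely
```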

Fitting

The coefficients $a$ minimise $\frac{1}{N}\|V a - f\|_2^2$, a linear least-squares problem. With Chebyshev nodes for $\{x_j\}$ and the $T$-basis the matrix $V$ is structured and a DCT solves it in $O(N \log N)$. With random ridges there is no such structure, but the linear solve via the pseudoinverse is still cheap, and there is no non-convex search.

This is the random-features network. Drawing $(w_k, b_k)$ from a fixed distribution and solving the linear least-squares problem for $a$ is exactly what Barron’s existence proof produces below: random samples from an integral representation of $f$.
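
A minimal random-features fit in 1D: fixed random $(w_k, b_k)$, a single least-squares solve for $a$. The target and the sampling distribution here are illustrative, not the notebook’s Demo 1 setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 64, 400
target = lambda x: np.sin(2 * np.pi * x)    # illustrative target

w = rng.normal(size=n) * 10.0               # fixed random slopes
b = rng.uniform(-10.0, 10.0, size=n)        # fixed random shifts
x = np.linspace(-1.0, 1.0, N)

V = np.tanh(np.outer(x, w) + b)             # design matrix V[j, k] = phi_k(x_j)
a, *_ = np.linalg.lstsq(V, target(x), rcond=None)   # one convex linear solve for a

rmse = np.sqrt(np.mean((V @ a - target(x)) ** 2))
print(rmse)                                 # training RMSE, no gradient descent involved
```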

The trained case

If we let $(w_k, b_k)$ vary too, the loss

$$\min_{a, w, b}\; \frac{1}{N}\,\|V(w, b)\, a - f\|_2^2$$

is non-convex. The standard tool is gradient descent (Adam in practice), used in Demo 1 of the companion notebook. The trade is linearity for adaptivity: random features keeps the basis fixed and the solve linear; a trained network places ridge directions where the structure of $f$ lives. In 1D the polynomial story already wins; the trade flips in high $d$.
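
A sketch of what training the basis means, using plain full-batch gradient descent with hand-written gradients (the notebook uses Adam; learning rate, width, and the $\sin(2\pi x)$ target of the experiment below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, lr, steps = 20, 400, 0.05, 5000
x = np.linspace(-1.0, 1.0, N)
y = np.sin(2 * np.pi * x)                     # target of the experiment below

w = rng.normal(size=n)
b = rng.normal(size=n)
a = rng.normal(size=n) * 0.1

for _ in range(steps):
    Z = np.outer(x, w) + b                    # (N, n)
    S = np.tanh(Z)
    r = S @ a - y                             # residuals
    D = 1.0 - S ** 2                          # sigma'(Z) for tanh
    grad_a = (2.0 / N) * (S.T @ r)
    grad_w = (2.0 / N) * a * ((D * x[:, None]).T @ r)
    grad_b = (2.0 / N) * a * (D.T @ r)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

print(np.sqrt(np.mean((np.tanh(np.outer(x, w) + b) @ a - y) ** 2)))  # final training RMSE
```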

From construction to guarantees

The 1D story above shows ridges are a basis we can compute with: we can evaluate, differentiate, integrate, and fit, exactly as we did with $\{T_k\}$. Two questions remain:

  1. Can ridges represent every continuous function? Density is the property that lets us call this a basis at all, and is what universal approximation answers.

  2. How efficiently? Density allows arbitrary error in the limit; it says nothing about how the required width $n$ scales with the target function or the input dimension. Barron’s theorem answers this.

These two questions have separate answers and the second is where the dimension-dependence (or lack of it) actually lives.

The universal approximation theorem

Stone-Weierstrass already gives polynomial density in $C(K)$ for compact $K \subset \mathbb{R}^d$, so “any continuous function in any dimension is approximable by polynomials” was a solved problem. Cybenko (1989) and Hornik (1991) added that ridge functions are also dense.

Theorem 1 (Universal Approximation (Cybenko 1989, Hornik 1991))

Let $\sigma$ be continuous and non-polynomial. Let $f$ be continuous on a bounded $K \subset \mathbb{R}^d$, and $\varepsilon > 0$. There is a width $n$ and parameters $\{(a_k, w_k, b_k)\}_{k=1}^n$ with

$$|f(x) - f_n(x)| < \varepsilon \qquad\text{for every } x \in K,$$

where $f_n(x) = \sum_k a_k\,\sigma(w_k \cdot x + b_k)$.

The non-polynomial requirement is structural: a sum of polynomial ridges is itself a polynomial of bounded degree, which is not dense.

Two activations dominate in practice, both non-polynomial: the sigmoidal family ($\tanh$ and the logistic function), bounded and monotone and the $\sigma$ used throughout this page, and the ReLU $\max(0, x)$, unbounded and piecewise linear.

Density alone is not news. The reason to care about ridges is the rate, which depends on the function class.

What “function on $\mathbb{R}^d$” actually covers

The phrase $f: \mathbb{R}^d \to \mathbb{R}$ is more general than it sounds. The input is a list of $d$ numbers, and almost any object we can encode reduces to that: an image becomes a vector of pixel intensities, a sentence a vector of token or embedding coordinates, a molecule a vector of structural descriptors.

Universal approximation says a wide-enough one-hidden-layer network can approximate “image $\to$ class label”, “sentence $\to$ translation”, “molecule $\to$ binding energy”, and so on. That is a strong representability statement, but it is silent on how large “wide-enough” is, and the worst-case answer is exponential in $d$. Real-world high-dimensional learning works because the targets people care about live in classes where the required width is manageable. Barron’s theorem identifies one such class.

A first experiment: fitting $\sin(2\pi x)$

Train a width-20 network with $\tanh$ on $\sin(2\pi x)$ over $[-1, 1]$. After 2000 Adam steps the network reaches $\sim 10^{-2}$ pointwise error. The adaptive Chebyshev chebfit_adaptive from A Chebyshev Toolbox: Doing Linear Algebra with Functions reaches machine precision on the same function with about 20 coefficients, in one DCT. Both get arbitrarily close to $\sin(2\pi x)$; the network needs thousands of gradient steps and still plateaus at $10^{-2}$. In 1D the network is strictly worse than what we already have. The picture flips in high $d$. See Demo 1 of the companion notebook.

Integral representation

The bridge between $f$ and a one-hidden-layer network is a Fourier-side identity. For nice enough $f: \mathbb{R}^d \to \mathbb{R}$ with Fourier transform $\hat f$,

$$f(x) = \int_{\mathbb{R}^d} \hat f(\omega)\, e^{i\omega \cdot x}\,d\omega.$$

Splitting $e^{i\omega \cdot x}$ into real ridge components and recasting $\sin/\cos$ as differences of sigmoidal ridges (Barron 1993, Theorem 2),

$$f(x) - f(0) = \int_{\mathbb{R}^d} a(\omega)\, \sigma(\omega \cdot x + b(\omega))\, d\mu(\omega),$$

for some weight $a$, phase $b$, and probability measure $\mu$.

Where does the $|\omega|$ factor in the Barron norm come from? A sinusoid $\sin(\omega \cdot x)$ is a ridge, but not a sigmoidal one (the activation $\sigma$ is bounded and monotone; $\sin$ oscillates). It is, however, a difference of sigmoidal ridges: imagine forming one period of $\sin(\omega \cdot x)$ by adding rescaled saturating sigmoids that turn on and off in the right places. To build a wave of frequency $|\omega|$ from sigmoids that saturate at $\pm 1$, the amplitudes $a(\omega)$ have to scale with $|\omega|$ so the steep up-and-down of the wave can be matched by sigmoid steps that rise and fall on a length scale $1/|\omega|$. That is the only place $|\omega|$ enters, and it is exactly the weight $|\omega| \cdot |\hat f(\omega)|$ that appears under the Barron integral. High-frequency content costs more ridge amplitude, in proportion to the frequency, so the Barron norm weights it accordingly.

Sampling $\omega_1, \ldots, \omega_n \sim \mu$ and averaging gives a width-$n$ ridge network as a Monte Carlo estimator. By Monte Carlo Methods, the $L^2$ error of an MC estimator of an integral with variance $V$ is $\sqrt V / \sqrt n$, regardless of dimension. The variance is bounded by the Barron norm.
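
A small self-contained check of the dimension-free Monte Carlo rate, unrelated to any particular network (the integrand and the choice $d = 50$ are arbitrary; the exact value comes from the Gaussian characteristic function):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                    # dimension of the integration variable
v = rng.normal(size=d)
v /= np.linalg.norm(v)                    # |v| = 1
exact = np.exp(-0.5)                      # E[cos(v . Z)] = exp(-|v|^2 / 2) for Z ~ N(0, I_d)

for n in (100, 400, 1600, 6400):
    errs = [np.cos(rng.normal(size=(n, d)) @ v).mean() - exact for _ in range(200)]
    print(n, np.sqrt(np.mean(np.square(errs))))   # roughly halves as n quadruples; no d anywhere
```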

The Barron norm

Definition 1 (Barron norm and Barron space)

For $f: \mathbb{R}^d \to \mathbb{R}$ with Fourier transform $\hat f$, the Barron norm is

$$C_f = \int_{\mathbb{R}^d} |\omega|\, |\hat f(\omega)|\,d\omega.$$

The Barron space $\mathcal{B}(\mathbb{R}^d)$ consists of $f$ with $C_f < \infty$.

Why an $L^1$ moment of the spectrum and not an $L^2$ Sobolev norm? Take a single ridge $f(x) = \sigma_0(w \cdot x + b)$ in $\mathbb{R}^d$. Its Fourier transform is concentrated on the 1D line through the origin parallel to $w$. The Barron integral collapses to something that depends only on $\sigma_0$ and $|w|$: no $d$ appears, even though the function lives in $\mathbb{R}^d$. The Sobolev norm $\|f\|_{H^k}$ tries to integrate $|\hat f|^2$ over all of $\mathbb{R}^d$ with a high-frequency penalty, and is in fact infinite for a strict ridge when $d > 1$.

The Barron norm registers concentration of the spectrum; the Sobolev norm assumes spread. Real-world high-dimensional functions tend to have concentrated spectra (decision boundaries given by ridges, densities with light tails), and Barron is the right norm for them.
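
A quick worked example, using the same Fourier convention as the integral representation above: for the Gaussian $f(x) = e^{-|x|^2/2}$ one has $\hat f(\omega) = (2\pi)^{-d/2} e^{-|\omega|^2/2}$, so

$$C_f = \int_{\mathbb{R}^d} |\omega|\,(2\pi)^{-d/2} e^{-|\omega|^2/2}\,d\omega = \mathbb{E}\,|Z|, \qquad Z \sim \mathcal{N}(0, I_d),$$

and $\mathbb{E}\,|Z| \le \sqrt{\mathbb{E}\,|Z|^2} = \sqrt d$. This is the $\sqrt d$ scaling quoted for the Gaussian below.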

Barron’s theorem

Theorem 2 (Barron 1993)

Let $f: \mathbb{R}^d \to \mathbb{R}$ have $C_f < \infty$ and let $K \subset \mathbb{R}^d$ have $r_K = \sup_{x \in K} |x|$. For every $n \ge 1$ and every probability measure $\nu$ on $K$, there is a sigmoidal one-hidden-layer network $f_n$ of width $n$ with

$$\|f - f_n\|_{L^2(\nu)} \le \frac{2 C_f r_K}{\sqrt n}.$$

The headline

The dimension $d$ enters through $C_f$ and $r_K$, not through the exponent of $n$. For ridge functions $C_f$ is dimension-free, so the network needs $O(C_f^2 / \varepsilon^2)$ neurons regardless of $d$. This is the entire content of the page: a $1/\sqrt n$ rate, in any dimension, for functions whose spectrum is concentrated.

Proof 1

Use the integral representation $f(x) - f(0) = \int a(\omega)\, \sigma(\omega \cdot x + b(\omega))\, d\mu(\omega)$ with $\|a\|_{L^\infty(\mu)} \le 2 C_f r_K$ and $|\sigma| \le 1$. Then $\mathrm{Var}_\omega\, g_\omega(x) \le 4 C_f^2 r_K^2$ uniformly in $x \in K$, where $g_\omega(x) = a(\omega)\, \sigma(\omega \cdot x + b(\omega))$. Drawing $\omega_1, \ldots, \omega_n \sim \mu$ and averaging,

$$\mathbb{E}_\omega \|f - f_n\|_{L^2(\nu)}^2 = \frac{1}{n}\, \int_K \mathrm{Var}_\omega\, g_\omega\, d\nu \le \frac{4 C_f^2 r_K^2}{n}.$$

Some realisation achieves $\|f - f_n\|_{L^2(\nu)} \le 2 C_f r_K / \sqrt n$.

The proof is non-constructive: weights exist by averaging, but finding a specific good draw is a separate problem, which in practice we solve by gradient descent.

The rate, in code

A single ridge is too easy (a width-1 network solves it). For a non-trivial test, fit a sum of $K = 16$ ridges in $d = 20$ at widths $\{8, 32, 128\}$. Error drops sharply between 8 and 32 (the network catches the dominant ridge directions), then plateaus on a training-induced floor. The $1/\sqrt n$ Barron bound is plotted as a guide; trained networks beat it because gradient descent finds parameters more efficient than typical Monte Carlo samples. See Demo 2 of the companion notebook.
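
A minimal stand-in for Demo 2, using random features and a least-squares solve instead of the notebook’s Adam training, so the numbers will not match the trained results described above; the target construction and sampling ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 20, 16                               # input dimension, ridges in the target
N_tr, N_te = 4000, 2000

# Target: a fixed sum of K tanh ridges with unit-norm directions.
Wt = rng.normal(size=(K, d))
Wt /= np.linalg.norm(Wt, axis=1, keepdims=True)
bt = rng.uniform(-0.5, 0.5, size=K)
at = rng.normal(size=K)
target = lambda X: np.tanh(X @ Wt.T + bt) @ at

X_tr = rng.uniform(-1, 1, size=(N_tr, d)); y_tr = target(X_tr)
X_te = rng.uniform(-1, 1, size=(N_te, d)); y_te = target(X_te)

def rmse_at_width(n):
    """Random features of width n: fix (w_k, b_k), least-squares solve for a."""
    W = rng.normal(size=(n, d)) / np.sqrt(d)
    b = rng.uniform(-1, 1, size=n)
    a, *_ = np.linalg.lstsq(np.tanh(X_tr @ W.T + b), y_tr, rcond=None)
    return np.sqrt(np.mean((np.tanh(X_te @ W.T + b) @ a - y_te) ** 2))

for n in (8, 32, 128):
    print(n, rmse_at_width(n))              # compare against a C / sqrt(n) guide line
```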

What sits in and out of Barron space

A single smooth ridge $\sigma_0(w \cdot x + b)$ has $C_f \le 2|w|\,\|\sigma_0\|_{\text{BV}}$, no $d$-dependence. Sums accumulate $C_f$ linearly in the number of summands. The Gaussian $e^{-|x|^2/2}$ on $[-1, 1]^d$ has $C_f \sim \sqrt d$. Polynomials of low total degree are in $\mathcal{B}$.

Outside $\mathcal{B}$: anything with a heavy high-frequency tail. A half-space indicator has a jump and $C_f = \infty$. Generic Lipschitz functions in $d$ dimensions usually have $C_f$ infinite. A tensor product of $d$ smooth bumps is formally Barron but the constant grows fast enough in $d$ that the rate is useless.

The Barron rate is conditional on a concentrated spectrum. For ridge-type targets, smooth densities, and certain compositions it holds with dimension-free constants; for generic continuous functions in high $d$ it does not.

Dimension dependence, in code

Holding the network width fixed at $n = 64$, sweep $d \in \{1, 10, 30, 50\}$ on a sum-of-8-ridges target. The test RMSE stays in the $10^{-3}$ to $10^{-2}$ range across the whole sweep. A tensor-product polynomial would need $\binom{n+d}{d}$ coefficients for the same target, $\approx 10^7$ at $d = 20$ and exceeding any computer’s memory by $d = 50$. The network handles it with 64 hidden units throughout. See Demo 3 of the companion notebook.

Summary

Three caveats.

  1. Optimisation. Finding good weights is non-convex. Gradient descent works empirically, but there is no general guarantee that it reaches the network Barron’s theorem promises. Barron is approximation, not training.

  2. Generalisation. From data $(x_i, f(x_i))_{i=1}^N$, the empirical-risk minimiser has its own statistical error governed by Rademacher-style bounds. The $1/\sqrt n$ in Barron is approximation (network width), not generalisation (sample size).

  3. Beyond Barron. Many practical functions are not in $\mathcal{B}$. Active research extends to deep networks (deep Barron spaces, neural-ODE flow-induced spaces) and Banach-space variation norms. Each enlargement relaxes the regularity assumption at the cost of weaker constants.