Optional Section

This section, together with §Ridge Functions and Universal Approximation, covers material beyond the core MATH 551 syllabus. The two pages develop the rate-of-convergence question for neural networks; the content is included for interested students.

Big Idea

The previous page showed ridge functions are dense: any continuous $f$ on $K \subset \mathbb{R}^d$ can be approximated by a wide-enough network. The natural question is how wide? For a fixed basis like tensor-product Chebyshev the answer is $\varepsilon^{-d/k}$ basis functions for $H^k$ targets in $d$ dimensions, exponential in $d$: the curse of dimensionality. A one-hidden-layer ridge network reaches $L^2$ accuracy $2 C_f r_K / \sqrt n$ with no $d$ in the exponent (Barron 1993), provided $f$ has finite Barron norm $C_f$. The proof is one-line Monte Carlo on an integral representation of $f$.

What rate do neural networks give?

The previous page settled the density question: ridges are a basis, and Cybenko / Hornik tell us they can approximate any continuous target. What we do not yet have is a quantitative rate. For Chebyshev in 1D we know $\rho^{-n}$ for analytic $f$ and $n^{-k}$ for $C^k$ functions. Are there comparable statements for neural networks, and if so, in which dimensions $d$?

Before we can appreciate what neural networks buy us, we need to know what a fixed basis costs. The benchmark is tensor-product Chebyshev in $\mathbb{R}^d$; we work it out in 2D first and then extract the general pattern.

2D Chebyshev series

A function $f: [-1, 1] \to \mathbb{R}$ has the 1D Chebyshev expansion $f(x) = \sum_j c_j T_j(x)$. For a function $f: [-1, 1]^2 \to \mathbb{R}$, the natural extension is the tensor product: use $T_j(x)$ in the first variable and $T_k(y)$ in the second, and form all pairwise products,

$$f(x, y) \;=\; \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} c_{jk}\, T_j(x)\, T_k(y).$$

Truncating to $j < n$ and $k < n$ gives an approximation built from $n \times n = n^2$ basis functions $T_j(x)\,T_k(y)$. The coefficient $c_{jk}$ is the inner product of $f$ against $T_j(x)T_k(y)$, computed by applying the 1D DCT along each axis: first along the $x$ axis on every $y$-row, then along the $y$ axis on every column. The result is a 2D coefficient matrix $C \in \mathbb{R}^{n \times n}$ whose $(j, k)$ entry is $c_{jk}$.

The same recipe extends to $d$ dimensions. A tensor product with $n$ terms per axis in $\mathbb{R}^d$ has $n^d$ basis functions and $n^d$ coefficients, computed by applying the 1D DCT along each of the $d$ axes in turn.
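As a rough illustration of the axis-by-axis recipe, the sketch below computes the 2D coefficient matrix for the target $f(x, y) = \exp(xy - x^2/2)$ used on this page. It is a minimal sketch, not the notebook's code: instead of a fast DCT it applies the equivalent dense transform built from `numpy.polynomial.chebyshev.chebvander` at Chebyshev-Gauss nodes, and the helper name `cheb2d_coeffs` is hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb2d_coeffs(f, n):
    """Coefficients c[j, k] of the tensor-product Chebyshev interpolant (n per axis)."""
    # Chebyshev-Gauss nodes: the n roots of T_n
    x = np.cos(np.pi * (np.arange(n) + 0.5) / n)
    # V[m, j] = T_j(x_m); discrete orthogonality gives V.T @ V = diag(n, n/2, ..., n/2)
    V = C.chebvander(x, n - 1)
    scale = np.full(n, 2.0 / n)
    scale[0] = 1.0 / n
    P = scale[:, None] * V.T           # maps samples along one axis to coefficients
    F = f(x[:, None], x[None, :])      # F[m, l] = f(x_m, y_l)
    return P @ F @ P.T                 # transform along x, then along y

f = lambda x, y: np.exp(x * y - x**2 / 2)
coef = cheb2d_coeffs(f, 32)

# Evaluate the truncated series on a small test grid and compare with f
xt = np.linspace(-1, 1, 7)
approx = C.chebgrid2d(xt, xt, coef)
exact = f(xt[:, None], xt[None, :])
max_err = np.max(np.abs(approx - exact))
print("max error on test grid:", max_err)
print("coefficients above 1e-12:", int(np.sum(np.abs(coef) > 1e-12)))
```

The dense transform costs $O(n^3)$ in 2D, which is fine for $n = 32$; a DCT-based version does the same job in $O(n^2 \log n)$.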

A 2D example

For a concrete look, fit a smooth non-separable function on $[-1, 1]^2$ by tensor-product Chebyshev with $n = 32$ per axis, then plot the $32 \times 32$ matrix of coefficients on a log scale.

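[Figure: three panels showing the Chebyshev interpolant at $n = 32$ per axis, the residual, and the magnitudes of the $32 \times 32$ coefficients on a log scale.]

Coefficients above 1e-12: 87 of 1024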

The left panel shows the Chebyshev interpolant at the full $n = 32$ per axis: visually indistinguishable from the target. The middle panel shows the residual at machine-precision scale (RMSE $\sim 10^{-15}$), so tensor-product Chebyshev resolves this analytic target to machine precision with $n^2 = 1024$ coefficients. The right panel shows where the coefficient mass lives: it concentrates in the low-frequency corner (small $j$ and $k$) and decays fast in both directions. Reaching $10^{-12}$ accuracy keeps about $n = 9$ coefficients per axis, roughly $n^2 \approx 90$ in 2D. A 1D Chebyshev expansion of a comparably smooth function settles down in about 13 coefficients. The tensor product multiplies the per-axis cost: the total number of coefficients scales as $n^d$ with $n \approx 9$, giving $9^3 \approx 700$ coefficients in 3D, $9^{10} \approx 3.5 \times 10^9$ in 10D, and $9^{20} \approx 1.2 \times 10^{19}$ in 20D. The per-axis convergence is excellent; the $d$-dimensional cost is unaffordable.

The curse of dimensionality

Definition 1 (Curse of dimensionality)

A numerical method for approximating a function $f: \mathbb{R}^d \to \mathbb{R}$ to accuracy $\varepsilon$ suffers from the curse of dimensionality if the number of basis functions, parameters, or function evaluations $n$ it needs grows exponentially in the dimension $d$, i.e.

$$n \;=\; \Omega\!\left(C^{\,d}\right) \quad \text{or} \quad n \;=\; \Omega\!\left(\varepsilon^{-d/k}\right)$$

for some constant $C > 1$ and per-axis smoothness $k$. The cost remains finite for any fixed $d$, but blows up so fast with $d$ that already moderate dimensions ($d \approx 10$ to 20) make the method unaffordable. The phrase is due to Bellman (1957) in the context of dynamic programming.

Tensor-product Chebyshev is a textbook example. The next subsection explains why this is not a quirk of polynomials but a structural property of any deterministic basis approximating Sobolev-class functions.

Where does $d$ really come from? A coefficient-decay count

The exponential cost is not specific to Chebyshev. It is built into how the Sobolev space $H^k([-1,1]^d)$, the space of functions on the cube whose first $k$ derivatives (in the weak sense) are square-integrable, sees the Chebyshev coefficients of $f$. Here $k \ge 1$ is the smoothness index: $k = 1$ means one $L^2$-derivative, $k = 2$ two, and so on. (The same picture works for any reasonable Fourier basis on a bounded domain; a self-contained graduate-level treatment is on the MATH 725 Sobolev page.)

Sobolev norms measure frequency. The classical $L^p$ norms see size (height and width) of a function but are blind to oscillation: a function and a fast wiggle of it have the same $L^p$ norm. For $f \in L^2([-1,1]^d)$ with tensor-Chebyshev coefficients $c_j$ indexed by a multi-index $j = (j_1, \ldots, j_d) \in \mathbb{N}^d$, the Sobolev $H^k$ norm is equivalent to

$$\|f\|_{H^k}^2 = \sum_{j \in \mathbb{N}^d} (1 + |j|^2)^k\, |c_j|^2,$$

with a frequency penalty $(1 + |j|^2)^k$: high-frequency coefficients $c_j$ (large $|j|$) cost more than low-frequency ones. $f \in H^k$ is the assertion that the $|c_j|$ decay fast enough in $|j|$ for this sum to be finite. The integer $k$ is the smoothness index, and it is the same $k$ that appears in the rate $\varepsilon^{-d/k}$ below.

Approximating $f$ in $L^2$ to accuracy $\varepsilon$ means approximating its Chebyshev coefficients. Truncating to multi-indices with $|j|_\infty \le R$, the tail satisfies

$$\sum_{|j|_\infty > R} |c_j|^2 \;\le\; R^{-2k}\, \|f\|_{H^k}^2,$$

so to push the $L^2$ tail below $\varepsilon$ we need $R \sim \varepsilon^{-1/k}$. This is a Fourier uncertainty principle in disguise: $L^2$ accuracy in $f$ forces us to capture all Chebyshev coefficients with $|j|_\infty \le R$, no exceptions. The smaller $\varepsilon$, the larger $R$; the smoother $f$ (the larger $k$), the smaller the $R$ we need.

The curse comes from counting multi-indices in the box. The number of multi-indices $j \in \mathbb{N}^d$ with $|j|_\infty \le R$ is $(R + 1)^d \sim R^d$, so the cost of resolving $f$ to accuracy $\varepsilon$ is

$$n \;\sim\; (R+1)^d \;\sim\; R^d \;\sim\; \varepsilon^{-d/k}.$$

The $d$ in the exponent is the volume of the multi-index box $|j|_\infty \le R$ in $\mathbb{N}^d$, not a quirk of polynomials. Any deterministic basis has to cover that box.
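To make the count concrete, here is a small sketch that evaluates $(R+1)^d$ with $R = \lceil \varepsilon^{-1/k} \rceil$ for a few dimensions. The helper name `box_count` and the sample values $\varepsilon = 10^{-2}$, $k = 3$ are illustrative assumptions, and the constant $\|f\|_{H^k}$ is taken to be 1.

```python
import math

def box_count(eps, d, k):
    """Number of multi-indices j in N^d with |j|_inf <= R, R = ceil(eps**(-1/k))."""
    R = math.ceil(eps ** (-1.0 / k))
    return (R + 1) ** d

# Same accuracy and smoothness, growing dimension
for d in (1, 2, 5, 10, 20):
    print(f"d = {d:2d}: {box_count(1e-2, d, k=3):.3e} coefficients")
```

The per-axis resolution $R$ stays fixed across the loop; only the exponent $d$ changes, which is exactly where the blow-up comes from.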

For analytic $f$ (Bernstein ellipse with parameter $\rho$), the per-axis count drops to $n \sim \log_\rho(1/\varepsilon)$, and the $d$-dimensional cost is $\log^d(1/\varepsilon)$ instead of $\varepsilon^{-d/k}$. Better, but still exponential in $d$.

How neural networks handle this

The trick is to write $f$ itself as an integral against a parametric basis, then discretise the integral by Monte Carlo. The integral representation comes from the Fourier representation of $f$.

From Fourier to ridges

For $f \in L^1(\mathbb{R}^d)$ with $\hat f \in L^1$ (or for $f \in L^2$ via Plancherel, which is the setting we mostly need), Fourier inversion writes $f$ as a continuous superposition of complex exponentials,

$$f(x) \;=\; \int_{\mathbb{R}^d} \hat f(\omega)\, e^{i\omega \cdot x}\,d\omega.$$

Each exponential $e^{i\omega \cdot x}$ depends on $x$ only through $\omega \cdot x$, like a ridge, but it oscillates forever instead of levelling off like a sigmoid, and it is complex, so it does not live in the ridge basis $\{\sigma(w \cdot x + b)\}$. The bridge is to rewrite each oscillating exponential as a continuous superposition of bounded sigmoid ridges. We work this out for a single Fourier mode first, then assemble.

A worked example: $\cos(\omega x)$ on $[-1, 1]$

We work out how $\cos(\omega x)$ on $[-1, 1]$, with $\omega > 0$ fixed, becomes a continuous superposition of sigmoid ridges. The construction makes the $|\omega|$-factor in the Barron norm explicit and shows that the ridges combine in pairs.

Step 1: a difference of two ridges is a bump. A single sigmoid ridge $\sigma(\omega x + b)$ is a smooth step. The difference of two ridges with the same slope $\omega$ but opposite shifts $\pm t$,

$$\Delta_t(x) \;:=\; \sigma(\omega x + t) - \sigma(\omega x - t),$$

is a localised bump on the interval $|x| \le t/\omega$. Its width in $x$ is $2t/\omega$ and its peak height is close to 2 for $t \gtrsim 1$. As $t$ ranges over $[0, \omega]$, $\Delta_t$ sweeps through a family of bumps of increasing width: at $t = 0$ the bump is degenerate, and at $t = \omega$ it covers the whole interval $[-1, 1]$. The middle panel of the figure below shows one such $\Delta_t$.

Step 2: $\cos(\omega x)$ is an integral of these bumps. The layer-cake identity for $\cos$ on $[-\omega, \omega]$ reads

$$\cos(z) \;-\; \cos(\omega) \;=\; \int_0^{\omega} \sin(t)\, \bigl[H(z + t) - H(z - t)\bigr]\, dt,$$

where $H$ is the Heaviside step (a sharp ridge). The integrand at fixed $t$ is exactly a difference of two Heaviside steps with shifts $\pm t$, the sharp version of the bump $\Delta_t$ from Step 1. Replacing the Heaviside by a sharp sigmoid, $H(z) \approx \tfrac{1}{2}(1 + \sigma(\alpha z))$ for large $\alpha$, and setting $z = \omega x$ puts the formula in pure-ridge form:

$$\cos(\omega x) \;\approx\; \cos(\omega) \;+\; \tfrac{1}{2}\int_0^{\omega} \sin(t)\, \bigl[\sigma(\alpha(\omega x + t)) - \sigma(\alpha(\omega x - t))\bigr]\, dt.$$

So $\cos(\omega x)$ is the integral over $t \in [0, \omega]$ of ridge-difference bumps, weighted by $\sin(t)/2$. The formula is exact in the sharp limit $\alpha \to \infty$.

Step 3: where the $|\omega|$ in the Barron norm comes from. The integration range in $t$ has length $\omega$. To approximate the integral by a Riemann sum at spacing $\Delta t$, we need $\sim \omega/\Delta t$ ridge pairs. The number of ridge pairs needed to build $\cos(\omega x)$ scales linearly with the frequency $\omega$. That linear scaling is the source of the $|\omega|$ factor in the Barron norm.
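The three steps can be checked numerically with a short sketch. It discretises the layer-cake integral by the midpoint rule with $m$ ridge pairs, once with exact Heaviside steps and once with sharp sigmoids. Two assumptions are made here: $\sigma = \tanh$ (a sigmoid with range $(-1, 1)$, as the relation $H \approx \tfrac12(1 + \sigma)$ suggests), and the sample frequency $\omega = 6$; the function name `cos_from_ridges` is hypothetical.

```python
import numpy as np

def cos_from_ridges(x, omega, alpha=None, m=400):
    """Midpoint-rule discretisation of the layer-cake formula with m ridge pairs.

    alpha=None uses exact Heaviside steps; otherwise sigma = tanh (an assumption)
    with sharpness alpha.
    """
    dt = omega / m
    t = (np.arange(m) + 0.5) * dt            # midpoints of [0, omega]
    z = omega * x[:, None]                   # shape (len(x), 1), broadcasts against t
    if alpha is None:
        step = lambda u: np.heaviside(u, 0.5)
        bump = step(z + t) - step(z - t)
    else:
        bump = 0.5 * (np.tanh(alpha * (z + t)) - np.tanh(alpha * (z - t)))
    return np.cos(omega) + (np.sin(t) * bump).sum(axis=1) * dt

x = np.linspace(-1, 1, 201)
omega = 6.0
err_sharp = np.max(np.abs(cos_from_ridges(x, omega) - np.cos(omega * x)))
err_smooth = np.max(np.abs(cos_from_ridges(x, omega, alpha=8.0) - np.cos(omega * x)))
print("Heaviside steps:", err_sharp)
print("tanh, alpha = 8:", err_smooth)
```

Keeping the spacing `dt` fixed while increasing `omega` requires proportionally more ridge pairs `m`, which is the linear-in-$\omega$ count from Step 3.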

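[Figure: three panels; the middle panel shows one ridge-difference bump $\Delta_t$, the right panel the reconstruction of $\cos(\omega x)$ from ridge pairs.]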

The right panel reconstructs $\cos(\omega x)$ via the layer-cake formula using 400 ridge pairs and sharp sigmoids ($\alpha = 8$). The two curves match once the integral is discretised finely enough and the sigmoids are sharp enough; the formula is exact in the limit $\alpha \to \infty$, $\Delta t \to 0$. The mechanism, not the numerical accuracy, is the takeaway: a single Fourier mode $\cos(\omega x)$ of frequency $\omega$ is built from a continuous integral of differences of sigmoid ridges, with the count of ridge pairs scaling linearly in $\omega$.

The general statement

Going from one Fourier mode to the full Fourier integral is an exercise in exchanging the order of integration. Carrying this out (Barron 1993, Theorem 2) gives

$$f(x) - f(0) \;=\; \int_{\mathbb{R}^d} a(\omega)\, \sigma(\omega \cdot x + b(\omega))\, d\mu(\omega),$$

for some weight function $a(\omega)$, phase $b(\omega)$, and probability measure $\mu$ on $\mathbb{R}^d$. This is $f$ as a continuous neural network: an integral of bounded sigmoid ridges over the parameter $\omega$, with $-f(0)$ playing the role of the constant of integration in a fundamental-theorem-of-calculus identity for ridges. The $|\omega|$-factor from the worked example above appears under the integral and is what produces the $|\omega|$ in the Barron norm in §3.

From the continuous integral to a finite-width network

The integral representation is a “neural network with infinitely many neurons”, which we cannot evaluate. Approximate the continuous integral by Monte Carlo: sample $\omega_1, \ldots, \omega_n \sim \mu$ iid and replace the integral by its sample average,

$$f_n(x) \;=\; f(0) \;+\; \frac{1}{n}\sum_{k=1}^n a(\omega_k)\, \sigma(\omega_k \cdot x + b(\omega_k)).$$

This is exactly a width-$n$ one-hidden-layer ridge network with weights $\omega_k$ drawn from $\mu$ and the constant $f(0)$ as output bias. The continuous neural network becomes a finite-dimensional one through the same Monte Carlo discretisation we used for ordinary integration in the Monte Carlo notebook.

This is the probabilistic move: we replace classical (deterministic) integration, which places $n$ quadrature points on a grid, with Monte Carlo (random) sampling, which draws $n$ points from a probability distribution. A grid in $\mathbb{R}^d$ carries the curse of dimensionality (its size scales as $h^{-d}$ to reach spacing $h$); $n$ random samples cost $n$ regardless of $d$. The trade is the convergence rate: deterministic methods can converge exponentially in $n$ (Chebyshev for analytic $f$), while MC converges only as $1/\sqrt n$. We accept the slower rate to escape the curse.

Why does this avoid the curse? The root-mean-square Monte Carlo error for any square-integrable integrand is $\sqrt{V/n}$, where $V$ is the variance of the integrand under $\mu$, independent of the dimension of the integration domain. The remaining question is whether the variance $V$ of the Barron integrand can be bounded as $d$ grows. The next section identifies the condition under which it can: $f$ has finite Barron norm $C_f$.
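A quick experiment illustrates the dimension-independence of the Monte Carlo error. The integrand below is a toy choice made for this sketch, not the Barron integrand: $g(\omega) = \cos\bigl(\sum_i \omega_i / \sqrt d\bigr)$ with $\omega \sim N(0, I_d)$, whose mean $e^{-1/2}$ and variance $V = (1 + e^{-2})/2 - e^{-1}$ are the same in every dimension.

```python
import numpy as np

def mc_rms_error(d, n, trials=200, seed=0):
    """RMS error of an n-sample Monte Carlo mean of g over `trials` repetitions."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((trials, n, d))
    g = np.cos(omega.sum(axis=2) / np.sqrt(d))   # toy integrand, variance independent of d
    estimates = g.mean(axis=1)                   # one MC estimate per trial
    return np.sqrt(np.mean((estimates - np.exp(-0.5)) ** 2))

n = 400
V = (1 + np.exp(-2.0)) / 2 - np.exp(-1.0)        # exact variance of cos(Z), Z ~ N(0, 1)
predicted = np.sqrt(V / n)
for d in (1, 10, 50):
    print(f"d = {d:2d}: observed {mc_rms_error(d, n):.4f}, predicted {predicted:.4f}")
```

The observed error tracks $\sqrt{V/n}$ at every $d$; a tensor grid with the same number of points would have fewer than two points per axis at $d = 10$.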

Barron space and the convergence theorem

The Barron norm

What controls the variance of the integrand $g_\omega(x) = a(\omega)\,\sigma(\omega \cdot x + b(\omega))$? Two ingredients: how much amplitude $|a(\omega)|$ the integral representation needs, and how that amplitude couples to the spectrum of $f$. Both turn out to be captured by a single quantity.

Definition 2 (Barron norm and Barron space)

For $f: \mathbb{R}^d \to \mathbb{R}$ with Fourier transform $\hat f$, the Barron norm is

$$C_f \;=\; \int_{\mathbb{R}^d} |\omega|\, |\hat f(\omega)|\,d\omega.$$

The Barron space $\mathcal{B}(\mathbb{R}^d)$ consists of $f$ with $C_f < \infty$.
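As a worked instance of the definition, consider the Gaussian $f(x) = e^{-|x|^2/2}$ on all of $\mathbb{R}^d$. With the inversion convention $f(x) = \int \hat f(\omega) e^{i\omega \cdot x}\,d\omega$ used on this page, $\hat f$ is the standard normal density, so $C_f = \mathbb{E}|Z|$ for $Z \sim N(0, I_d)$, which has the closed form $\sqrt 2\,\Gamma\bigl(\tfrac{d+1}{2}\bigr)/\Gamma\bigl(\tfrac d2\bigr) \le \sqrt d$. The sketch below compares that closed form with a Monte Carlo estimate; the function names are hypothetical.

```python
import math
import numpy as np

def barron_norm_gaussian(d):
    """Closed form of E|Z|, Z ~ N(0, I_d), i.e. C_f for exp(-|x|^2 / 2) on R^d."""
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))

def barron_norm_gaussian_mc(d, samples=50_000, seed=0):
    """Monte Carlo estimate of the same integral: average |omega| under f-hat."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((samples, d))
    return np.linalg.norm(z, axis=1).mean()

for d in (1, 5, 20, 50):
    exact = barron_norm_gaussian(d)
    print(f"d = {d:2d}: C_f = {exact:.4f}, MC = {barron_norm_gaussian_mc(d):.4f}, "
          f"sqrt(d) = {math.sqrt(d):.4f}")
```

The norm grows like $\sqrt d$, polynomially rather than exponentially in the dimension.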

Remark 1 (Examples in and out of Barron space)

A single smooth ridge $\sigma_0(w \cdot x + b)$ has $C_f \le 2|w|\,\|\sigma_0\|_{\text{BV}}$, no $d$-dependence. Sums of ridges accumulate $C_f$ linearly in the number of summands. The Gaussian $e^{-|x|^2/2}$ on $[-1, 1]^d$ has $C_f \sim \sqrt d$. Polynomials of low total degree on bounded $K$ are in $\mathcal{B}$.

Outside $\mathcal{B}$: anything with a heavy high-frequency tail. A half-space indicator has a jump and $C_f = \infty$. Generic Lipschitz functions in $d$ dimensions usually have $C_f = \infty$. A tensor product of $d$ smooth bumps is formally Barron but $C_f$ grows fast enough in $d$ that the rate is useless.

The Barron rate is conditional on $f$ having concentrated Fourier support. For ridge-type targets, smooth densities with light tails, and certain compositions, $C_f$ is dimension-free; for generic continuous functions in high $d$, $C_f = \infty$ and the rate is meaningless.

Why this norm and not Sobolev

The Sobolev $H^k$ norm,

$$\|f\|_{H^k}^2 \;=\; \int_{\mathbb{R}^d} (1 + |\omega|^2)^k\, |\hat f(\omega)|^2\,d\omega,$$

is an $L^2$ norm of $\hat f$ with weight $(1 + |\omega|^2)^k$. It penalises high frequency, but it integrates over all of $\mathbb{R}^d$, treating frequency vectors $\omega$ in every direction equally. The Barron norm,

$$C_f \;=\; \int_{\mathbb{R}^d} |\omega|\, |\hat f(\omega)|\,d\omega,$$

is an $L^1$ norm of $|\omega|\,|\hat f(\omega)|$. The exponent difference matters: $L^1$ counts only where $\hat f$ is nonzero, while $L^2$ also penalises spread.

The contrast is sharpest on a single ridge $f(x) = \sigma_0(w \cdot x + b)$ in $\mathbb{R}^d$. The Fourier transform of a function that depends on $x$ only through $w \cdot x$ is supported on the 1D line through the origin parallel to $w$:

$$\mathrm{supp}\,\hat f \;=\; \{\lambda w : \lambda \in \mathbb{R}\}.$$

Two consequences:

A function whose Fourier transform is concentrated on a low-dimensional set is invisible to Sobolev’s $L^2$-over-the-ball machinery, but legible to Barron’s $L^1$-on-the-support measure.

This is the structural explanation for how neural networks escape the curse on Barron-class targets. The Monte Carlo discretisation $\omega_k \sim \mu$ samples where the Fourier mass actually lives. For a ridge, the Fourier mass lives on a 1D line, so the samples concentrate there and ignore the rest of $\mathbb{R}^d$. A deterministic basis (Chebyshev) cannot do this; it pre-decides where to place basis functions before seeing $f$, and ends up wasting most of them on directions where $\hat f = 0$. Random sampling of the parameter $\omega$ adapts to the support of $\hat f$: the basis is built where the function lives, not over the whole ball.

Barron’s theorem

Theorem 1 (Barron 1993)

Let $f: \mathbb{R}^d \to \mathbb{R}$ have $C_f < \infty$ and let $K \subset \mathbb{R}^d$ have $r_K = \sup_{x \in K} |x|$. For every $n \ge 1$ and every probability measure $\nu$ on $K$, there is a sigmoidal one-hidden-layer network $f_n$ of width $n$ with

$$\|f - f_n\|_{L^2(\nu)} \;\le\; \frac{2 C_f r_K}{\sqrt n}.$$

The headline

The dimension $d$ enters through $C_f$ and $r_K$, not through the exponent of $n$. For ridge-type targets $C_f$ is dimension-free, so a width $n = O(C_f^2 r_K^2 / \varepsilon^2)$ suffices for $L^2$ accuracy $\varepsilon$ in any dimension. This is the entire content of the page: a $1/\sqrt n$ rate, independent of $d$, for functions with concentrated spectrum.

Proof 1

Use the integral representation $f(x) - f(0) = \int a(\omega)\, \sigma(\omega \cdot x + b(\omega))\, d\mu(\omega)$ with $\|a\|_{L^\infty(\mu)} \le 2 C_f r_K$ and $|\sigma| \le 1$. Then $\mathrm{Var}_\omega\, g_\omega(x) \le 4 C_f^2 r_K^2$ uniformly in $x \in K$, where $g_\omega(x) = a(\omega)\, \sigma(\omega \cdot x + b(\omega))$. Drawing $\omega_1, \ldots, \omega_n \sim \mu$ iid and averaging,

$$\mathbb{E}_\omega \|f - f_n\|_{L^2(\nu)}^2 \;=\; \frac{1}{n}\, \int_K \mathrm{Var}_\omega\, g_\omega\, d\nu \;\le\; \frac{4 C_f^2 r_K^2}{n}.$$

The squared error averaged over draws is at most $4 C_f^2 r_K^2 / n$, and at least one draw must do no worse than the average, so some realisation achieves $\|f - f_n\|_{L^2(\nu)} \le 2 C_f r_K / \sqrt n$.

The proof is non-constructive: a good draw exists by averaging, but finding a specific good draw is a separate problem. In practice we solve it by gradient descent (which beats the typical MC draw by finding a better-than-typical configuration), but the existence guarantee comes from this one-line MC argument.

Examples in code

A 2D example, NN version

The 2D Chebyshev demo in §1 fit $f(x, y) = \exp(xy - x^2/2)$ on $[-1, 1]^2$ with $32^2 = 1024$ basis functions, keeping about 90 above $10^{-12}$. The Barron analogue is to fit the same target with $n$ random ridges and watch the error decay as the width grows.

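[Figure: random-feature fit of the target at $n = 64$, its error map, and a convergence panel comparing the ridge network with 2D Chebyshev.]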

Random features at $n = 64$ already capture the shape of the target (the contours line up); the error map shows residuals well below $10^{-2}$. Quadrupling the width roughly halves the RMSE, which is what the $1/\sqrt n$ Barron guide predicts. The convergence panel overlays the 2D Chebyshev curve, with $n$ counted as the total number of coefficients ($n = n_a^2$ where $n_a$ is per-axis): Chebyshev’s exponential rate in $n_a$ collapses the error to machine precision at modest totals, while the network’s slow $1/\sqrt n$ rate traces a straight line of slope $-1/2$ on the log-log plot. In 2D, on this analytic target, Chebyshev wins decisively; the curse of dimensionality has not yet bitten. The next two demos show what happens when $d$ grows.
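The page's own plotting code is not reproduced here; the following is a minimal sketch of one way to run such a random-ridge fit. Everything about the sampling is an assumption made for illustration: ridge directions drawn from $N(0, 2^2 I)$, shifts uniform on $[-2, 2]$, $\sigma = \tanh$, and output weights fitted by least squares rather than fixed to $a(\omega_k)/n$.

```python
import numpy as np

def target(X):
    return np.exp(X[:, 0] * X[:, 1] - X[:, 0] ** 2 / 2)

rng = np.random.default_rng(0)
n_max = 256
W = rng.normal(scale=2.0, size=(2, n_max))   # assumed sampling law for ridge directions
b = rng.uniform(-2.0, 2.0, size=n_max)       # assumed sampling law for shifts
X_train = rng.uniform(-1, 1, size=(4000, 2))
X_test = rng.uniform(-1, 1, size=(2000, 2))

def features(X, n):
    # Constant column plays the role of the f(0) term; the rest are tanh ridges.
    return np.hstack([np.ones((len(X), 1)), np.tanh(X @ W[:, :n] + b[:n])])

def rmse_at_width(n):
    # Fit output weights only; hidden weights stay at their random draws.
    a, *_ = np.linalg.lstsq(features(X_train, n), target(X_train), rcond=None)
    resid = features(X_test, n) @ a - target(X_test)
    return np.sqrt(np.mean(resid ** 2))

rmse = {n: rmse_at_width(n) for n in (16, 64, 256)}
for n, e in rmse.items():
    print(f"width {n:3d}: test RMSE {e:.2e}")
```

Because the output weights are optimised, this fit can decay faster than the $1/\sqrt n$ guarantee, which only describes plain Monte Carlo averaging; the bound is an upper envelope, not a prediction of this curve.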

Rate vs width in $d = 20$ (Demo 2)

Fit a sum of $K = 16$ ridges in $d = 20$ at widths $n \in \{8, 32, 128\}$. Error drops sharply between $n = 8$ and $n = 32$ (the network catches the dominant ridge directions), then plateaus on a training-induced floor. The $1/\sqrt n$ Barron bound is plotted as a guide; trained networks beat it because gradient descent finds parameters more efficient than typical Monte Carlo samples. Demo 2 of the companion notebook.

Dimension sweep (Demo 3)

Holding $n = 64$ fixed, sweep $d \in \{1, 10, 30, 50\}$ on a sum-of-8-ridges target. Test RMSE stays in the $10^{-3}$ to $10^{-2}$ range across the entire sweep. A polynomial basis of total degree $p$ in $d$ variables needs $\binom{p+d}{d}$ coefficients for the same target, $\approx 10^7$ at $d = 20$ for $p = 9$; a full tensor product with $m$ terms per axis needs $m^d$, which is beyond any computer’s memory well before $d = 50$ even for $m = 9$. The network handles it with $n = 64$ hidden units throughout. Demo 3 of the companion notebook.
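The size comparison can be reproduced with a few lines of counting. The degree $p = 9$ and the 9 terms per axis are illustrative choices (an assumption, picked to match the per-axis count from the 2D demo), and `network_params` counts the parameters of a one-hidden-layer network with scalar output.

```python
import math

def total_degree_count(p, d):
    """Number of monomials of total degree <= p in d variables."""
    return math.comb(p + d, d)

def tensor_count(m, d):
    """Number of tensor-product basis functions with m terms per axis."""
    return m ** d

def network_params(n, d):
    """n ridge directions in R^d, n shifts, n output weights, one output bias."""
    return n * (d + 2) + 1

for d in (1, 10, 20, 50):
    print(f"d = {d:2d}: total-degree {total_degree_count(9, d):.2e}, "
          f"tensor {tensor_count(9, d):.2e}, network {network_params(64, d)}")
```

The network's parameter count grows linearly in $d$, while both polynomial counts grow combinatorially or exponentially.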

Caveats

Three things to keep in mind.

  1. Optimisation. Finding good weights is non-convex. Gradient descent works empirically, but there is no general guarantee that it reaches the network Barron’s theorem promises. Barron is approximation, not training.

  2. Generalisation. From data $(x_i, f(x_i))_{i=1}^N$, the empirical-risk minimiser has its own statistical error governed by Rademacher-style bounds. The $1/\sqrt n$ in Barron is approximation (network width), not generalisation (sample size).

  3. Beyond Barron. Many practical functions are not in $\mathcal{B}$. Active research extends to deep networks (deep Barron spaces, neural-ODE flow-induced spaces) and Banach-space variation norms. Each enlargement weakens the regularity assumption and the constants.

See Also