Tighter Sparse Approximation Bounds for ReLU Neural Networks
Abstract
A well-known line of work [Barron, 1993, Breiman, 1993, Klusowski and Barron, 2018] provides bounds on the width of a ReLU two-layer neural network needed to approximate a function over a ball up to a given error, when a certain Fourier-based quantity of the function is finite. More recently, Ongie et al. [2019] used the Radon transform as a tool for the analysis of infinite-width ReLU two-layer networks. In particular, they introduce the concept of Radon-based $\mathcal{R}$-norms and show that a function defined on $\mathbb{R}^d$ can be represented as an infinite-width two-layer neural network if and only if its $\mathcal{R}$-norm is finite. In this work, we extend the framework of [Ongie et al., 2019] and define similar Radon-based semi-norms ($\mathcal{R},\mathcal{U}$-norms) such that a function admits an infinite-width neural network representation on a bounded open set $\mathcal{U} \subseteq \mathbb{R}^d$ when its $\mathcal{R},\mathcal{U}$-norm is finite. Building on this, we derive sparse (finite-width) neural network approximation bounds that refine those of Breiman [1993] and Klusowski and Barron [2018]. Finally, we show that infinite-width neural network representations on bounded open sets are not unique and study their structure, providing a functional view of mode connectivity.
1 Introduction
Extensive work has shown that for a neural network to generalize, the size or magnitude of the parameters is more important than the size of the network, when the latter is large enough [Bartlett, 1997, Neyshabur et al., 2015, Zhang et al., 2016]. In certain regimes, the size of the neural networks used in practice is so large that the training data is fit perfectly and an infinite-width approximation is appropriate. In this setting, what matters for good generalization is to fit the data using the right inductive bias, which is specified by how the network parameters are controlled [Wei et al., 2020] together with the training algorithm used [Lyu and Li, 2020].
The infinite-width two-layer neural network model has been studied from several perspectives due to its simplicity. One can replace the finite-width ReLU network by an integral over the parameter space with respect to a signed Radon measure. Thus, controlling the magnitude of the neural network parameters is akin to controlling the measure according to a certain norm. Bach [2017] introduced the space $\mathcal{F}_1$, which is the infinite-width neural network space equipped with a norm derived from a finite-width regularizer penalizing both weights and biases (the infimum is over all the measures which represent the function at hand). A different line of work [Savarese et al., 2019, Ongie et al., 2019] considers infinite-width spaces with a norm derived from the finite-width regularizer that omits the bias term. Both of these works seek to find expressions for this norm, leading to characterizations of the functions that are representable by infinite-width networks. Savarese et al. [2019] solve the problem in the one-dimensional case: they show that for a function $f$ on $\mathbb{R}$, this norm takes the value $\max\{\int_{\mathbb{R}} |f''(x)|\, dx,\ |f'(-\infty) + f'(+\infty)|\}$. Ongie et al. [2019] give an expression for this norm (the $\mathcal{R}$-norm) for functions on $\mathbb{R}^d$, making use of Radon transforms (see Subsec. 2.3).
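To make the measure-theoretic viewpoint concrete, here is a minimal numerical sketch (all weights below are illustrative choices of ours, not taken from the paper): a finite-width ReLU network is the special case in which the representing measure is a finite sum of weighted Dirac masses on parameter space, and the total variation norm of that measure is the $\ell^1$ norm of the outer weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# A width-3 two-layer ReLU network corresponds to the discrete measure
# alpha = sum_i a_i * delta_{(w_i, b_i)} on the parameter space.
a = np.array([1.0, -0.5, 2.0])                       # outer weights (signed masses)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # inner weights w_i (rows)
b = np.array([0.0, 0.5, -1.0])                       # biases

def f(x):
    # f(x) = integral of (<w, x> - b)_+ against alpha, here a finite sum
    return float(np.sum(a * relu(W @ x - b)))

# TV norm of the discrete measure = l1 norm of the outer weights
tv_norm = float(np.abs(a).sum())
```

Penalizing `tv_norm` for a finite network is the discrete analogue of controlling the representing measure in total variation.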
Although, as mentioned in the first paragraph, on many occasions the network size is large enough that the specific number of neurons is irrelevant, when the target function is hard to approximate it is interesting to have an idea of how many neurons one needs to approximate it. The first contributions in this direction were by Cybenko [1989] and Hornik et al. [1989], who showed that two-layer neural networks with enough neurons can approximate any reasonable function on bounded sets in the uniform convergence topology. Later on, Barron [1993] and Breiman [1993] provided sparse approximation bounds stating that if a function is such that a certain quantity constructed from its Fourier transform is finite, then there exists a neural network of width $n$ such that the approximation error with respect to a distribution of bounded support decays as $O(1/\sqrt{n})$. More recently, Klusowski and Barron [2018] provided alternatives to the sparse approximation bounds of Breiman [1993] by restricting to networks with bounded weights, obtaining a slightly better dependency on $n$ at the expense of a constant factor increasing with the dimension $d$ (see Subsec. 2.2).
Contributions. In our work, we seek to characterize the functions that coincide with an infinite-width two-layer neural network on a fixed bounded open set. This endeavor is interesting in itself because in practice we want to learn target functions for which we know samples on a bounded set, and we are typically unconcerned with the values that the learned functions take at infinity. Moreover, the tools that we develop allow us to derive state-of-the-art sparse approximation bounds. Our main contributions are the following:


In the spirit of the $\mathcal{R}$-norm introduced by Ongie et al. [2019], for any bounded open set $\mathcal{U} \subseteq \mathbb{R}^d$ we define the $\mathcal{R},\mathcal{U}$-norm of a function on $\mathbb{R}^d$, and show that when this norm is finite, the function admits a representation of the form $\int_{\mathbb{S}^{d-1} \times \mathbb{R}} (\langle \omega, x \rangle - b)_{+}\, d\alpha(\omega, b) + \langle v, x \rangle + c$ for $x \in \mathcal{U}$, where $v \in \mathbb{R}^d$, $c \in \mathbb{R}$, and $\alpha$ is an even signed Radon measure.

Using the $\mathcal{R},\mathcal{U}$-norm, we derive function approximation bounds for neural networks of a fixed finite width. We compute the $\mathcal{R},\mathcal{U}$-norm of a function in terms of its Fourier representation, and show that it is upper-bounded by the Fourier-based quantity used by Breiman [1993]. This shows that our approximation bound is tighter than the previous bound by Breiman [1993], and meaningful in more instances (e.g., for finite-width neural networks). We also show $\mathcal{R},\mathcal{U}$-norm-based bounds analogous to the ones of Klusowski and Barron [2018].

Setting $\mathcal{U}$ to be an open ball, we show that neural network representations of a function on $\mathcal{U}$ hold for multiple even Radon measures, which contrasts with the uniqueness result provided by Ongie et al. [2019] for the case of $\mathbb{R}^d$. We study the structure of the sets of Radon measures which give rise to the same function on $\mathcal{U}$. The non-uniqueness of the measure representing a function could be linked to the phenomenon of mode connectivity.
Additional related work. There have been other recent works which have used the Radon transform to study neural networks in settings different from ours [Parhi and Nowak, 2021a, Bartolucci et al., 2021]. These two works consider the $\mathcal{R}$-norm as a regularizer for an inverse problem, and proceed to prove representer theorems: there exists a solution of the regularized problem which is a two-layer neural network with width equal to the number of datapoints. Regarding infinite-width network spaces, E and Wojtowytsch [2020] present several equivalent definitions and provide a review. A well-known line of work [Mei et al., 2018, Chizat and Bach, 2018, Rotskoff and Vanden-Eijnden, 2018] studies the convergence of gradient descent for infinite-width two-layer neural networks.
2 Framework
2.1 Notation
$\mathbb{S}^{d-1}$ denotes the $(d-1)$-dimensional hypersphere (as a submanifold of $\mathbb{R}^d$), and $B_R(\mathbb{R}^d)$ is the Euclidean open ball of radius $R$. For a measurable set $X \subseteq \mathbb{R}^d$, the space $C_0(X)$ of functions vanishing at infinity contains the continuous functions $f$ such that for any $\epsilon > 0$ there exists a compact set $K \subseteq X$, depending on $\epsilon$, with $|f(x)| < \epsilon$ for all $x \in X \setminus K$. $\mathcal{P}(X)$ is the set of Borel probability measures on $X$, and $\mathcal{M}(X)$ is the space of finite signed Radon measures, which may be seen as the dual of $C_0(X)$. Throughout the paper, the term Radon measure refers to a finite signed Radon measure for shortness. If $\alpha \in \mathcal{M}(X)$, then $\|\alpha\|_{TV}$ is the total variation (TV) norm of $\alpha$. $\mathcal{M}_{\mathbb{C}}(X)$ denotes the space of complex-valued finite signed Radon measures, defined as the dual space of $C_0(X, \mathbb{C})$ (the space of complex-valued continuous functions vanishing at infinity). We denote by $\mathcal{S}(\mathbb{R}^d)$ the space of Schwartz functions, which contains the functions in $C^{\infty}(\mathbb{R}^d)$ whose derivatives of any order decay faster than polynomials of all orders. For $f \in L^1(\mathbb{R}^d)$, we use $\hat{f}$ to denote the unitary Fourier transform with angular frequency, defined as $\hat{f}(\xi) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} f(x)\, e^{-i \langle \xi, x \rangle}\, dx$. If $\hat{f} \in L^1(\mathbb{R}^d)$ as well, we have the inversion formula $f(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \hat{f}(\xi)\, e^{i \langle \xi, x \rangle}\, d\xi$. The Fourier transform is a continuous automorphism on $\mathcal{S}(\mathbb{R}^d)$.
2.2 Existing sparse approximation bounds
One of the classical results of the theory of two-layer neural networks (Breiman [1993], building on [Barron, 1993]) states the following: given a probability measure $\mu \in \mathcal{P}(B_R(\mathbb{R}^d))$ and a function $f$ admitting a Fourier representation of the form $f(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i \langle \xi, x \rangle}\, d\hat{f}(\xi)$, where $\hat{f} \in \mathcal{M}_{\mathbb{C}}(\mathbb{R}^d)$ is a complex-valued Radon measure such that $C_f = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \|\xi\|_2^2\, d|\hat{f}|(\xi) < +\infty$, there exists a two-layer neural network $\tilde{f}(x) = \sum_{i=1}^{n} a_i (\langle \omega_i, x \rangle - b_i)_{+}$ such that
(1) $\int_{B_R(\mathbb{R}^d)} \big(f(x) - \tilde{f}(x)\big)^2\, d\mu(x) \le \frac{(2 R\, C_f)^2}{n}.$
These classical results do not provide bounds on the magnitude of the neural network weights. More recently, Klusowski and Barron [2018] showed similar approximation bounds for two-layer ReLU networks under additional $\ell^1$ and $\ell^{\infty}$ bounds on the weights. Namely, if $\tilde{C}_f = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \|\xi\|_1^2\, d|\hat{f}|(\xi) < +\infty$, there exists a two-layer neural network $\tilde{f}(x) = \sum_{i=1}^{n} a_i (\langle \omega_i, x \rangle - b_i)_{+}$ with inner weights normalized in $\ell^1$ norm, bounded biases, and outer weights whose magnitudes are bounded in terms of $\tilde{C}_f$, such that
(2) $\|f - \tilde{f}\|_{\infty} \le c\, \tilde{C}_f\, \max\{\sqrt{\log n}, \sqrt{d}\}\, n^{-\frac{1}{2} - \frac{1}{d}},$
where $c$ is a universal constant.
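As a quick numerical illustration of the Fourier-based quantities in these bounds (a sketch of ours, using the unitary angular-frequency convention and a Gaussian example of our own choosing, up to the paper's normalization constants): for the one-dimensional Gaussian $f(x) = e^{-x^2/2}$ one has $\hat{f}(\xi) = e^{-\xi^2/2}$, so the integral of $\xi^2 |\hat{f}(\xi)|$ equals $\sqrt{2\pi}$.

```python
import numpy as np

# Riemann-sum check that  integral of xi^2 * |fhat(xi)|  equals sqrt(2*pi)
# for fhat(xi) = exp(-xi^2 / 2), the Fourier transform of the 1-D Gaussian
# under the unitary, angular-frequency convention.
xi = np.linspace(-12.0, 12.0, 200001)
fhat = np.exp(-0.5 * xi**2)
C = float(np.sum(xi**2 * fhat) * (xi[1] - xi[0]))
# C is close to sqrt(2*pi) ~ 2.5066
```

The sum converges very quickly because the integrand is smooth and decays rapidly, so the truncation to $[-12, 12]$ and the fine grid introduce negligible error.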
2.3 Representation results on $\mathbb{R}^d$ based on the Radon transform
$\mathbb{P}^d$ denotes the space of hyperplanes in $\mathbb{R}^d$, whose elements may be represented by points in $\mathbb{S}^{d-1} \times \mathbb{R}$ by identifying the hyperplane $\{x : \langle \omega, x \rangle = b\}$ with both $(\omega, b)$ and $(-\omega, -b)$. Thus, functions on $\mathbb{P}^d$ are even functions on $\mathbb{S}^{d-1} \times \mathbb{R}$, and we will use both notions interchangeably. (Similarly, the space $\mathcal{M}(\mathbb{P}^d)$ of Radon measures over $\mathbb{P}^d$ contains the even measures in $\mathcal{M}(\mathbb{S}^{d-1} \times \mathbb{R})$. If $\alpha \in \mathcal{M}(\mathbb{S}^{d-1} \times \mathbb{R})$, $\int \varphi\, d\alpha$ is well defined for any measurable function $\varphi$ on $\mathbb{S}^{d-1} \times \mathbb{R}$, but for $\alpha \in \mathcal{M}(\mathbb{P}^d)$ it is only defined for even $\varphi$.)
The Radon transform and the dual Radon transform.
If $f : \mathbb{R}^d \to \mathbb{R}$ is a function which is integrable over all the hyperplanes of $\mathbb{R}^d$, we may define its Radon transform $\mathcal{R}f$ as
(3) $\mathcal{R}f(\omega, b) = \int_{\{x \,:\, \langle \omega, x \rangle = b\}} f(x)\, dx, \qquad (\omega, b) \in \mathbb{S}^{d-1} \times \mathbb{R}.$
That is, one integrates the function $f$ over the hyperplane $\{x : \langle \omega, x \rangle = b\}$. If $\Phi : \mathbb{S}^{d-1} \times \mathbb{R} \to \mathbb{R}$ is a continuous function, the dual Radon transform $\mathcal{R}^{*}\Phi$ is defined as
(4) $\mathcal{R}^{*}\Phi(x) = \int_{\mathbb{S}^{d-1}} \Phi(\omega, \langle \omega, x \rangle)\, d\omega,$
where the integral is with respect to the Hausdorff measure over $\mathbb{S}^{d-1}$. $\mathcal{R}$ and $\mathcal{R}^{*}$ are adjoint operators in the appropriate domains (see 13).
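A hedged numerical sanity check of the Radon transform in two dimensions (the discretization parameters below are arbitrary choices of ours): by rotational invariance, integrating the standard 2-D Gaussian density over the line $\langle \omega, x \rangle = b$ gives the 1-D Gaussian density evaluated at $b$, for any unit vector $\omega$.

```python
import numpy as np

def radon_gaussian_2d(omega, b, half_width=8.0, n=40001):
    """Numerically integrate the standard 2-D Gaussian density over the
    line {x : <omega, x> = b}, parametrized as b*omega + s*omega_perp."""
    s = np.linspace(-half_width, half_width, n)
    perp = np.array([-omega[1], omega[0]])          # unit vector spanning the line
    pts = b * omega[None, :] + s[:, None] * perp[None, :]
    vals = np.exp(-0.5 * np.sum(pts**2, axis=1)) / (2.0 * np.pi)
    return float(np.sum(vals) * (s[1] - s[0]))      # Riemann sum along the line

theta, b = 0.3, 0.7
omega = np.array([np.cos(theta), np.sin(theta)])
approx = radon_gaussian_2d(omega, b)
exact = np.exp(-0.5 * b**2) / np.sqrt(2.0 * np.pi)  # 1-D Gaussian density at b
```

The agreement holds for every direction $\omega$, which is exactly the rotational symmetry that makes the Gaussian a convenient test case for Radon-transform identities.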
The Radon inversion formula.
When $f \in \mathcal{S}(\mathbb{R}^d)$, one has that (Theorem 3.1, Helgason [2011])
(5) $f = c_d\, (-\Delta)^{(d-1)/2} \mathcal{R}^{*} \mathcal{R} f,$
where $c_d = \frac{1}{2 (2\pi)^{d-1}}$ and $(-\Delta)^{s/2}$ denotes the (negative) fractional Laplacian, defined via its Fourier transform as $\widehat{(-\Delta)^{s/2} f}(\xi) = \|\xi\|^{s} \hat{f}(\xi)$.
The $\mathcal{R}$-norm.
Given a function $f : \mathbb{R}^d \to \mathbb{R}$, Ongie et al. [2019] introduce the quantity
(6) $\|f\|_{\mathcal{R}} = \begin{cases} \sup \big\{ -c_d \int_{\mathbb{R}^d} f(x)\, (-\Delta)^{(d+1)/2} \mathcal{R}^{*}\psi(x)\, dx \,:\, \psi \in \mathcal{S}(\mathbb{S}^{d-1} \times \mathbb{R}) \text{ even},\ \|\psi\|_{\infty} \le 1 \big\} & \text{if } f \text{ is Lipschitz}, \\ +\infty & \text{otherwise}. \end{cases}$
They call it the $\mathcal{R}$-norm of $f$, although it is formally a semi-norm. Here, the space $\mathcal{S}(\mathbb{S}^{d-1} \times \mathbb{R})$ of Schwartz functions on $\mathbb{S}^{d-1} \times \mathbb{R}$ is defined, in analogy with $\mathcal{S}(\mathbb{R}^d)$, as the space of $C^{\infty}$ functions $\psi$ on $\mathbb{S}^{d-1} \times \mathbb{R}$ which for any integers $k, l \ge 0$ and any differential operator $D$ on $\mathbb{S}^{d-1}$ satisfy $\sup_{\omega, b} \big| (1 + |b|^{k})\, \tfrac{\partial^{l}}{\partial b^{l}} (D\psi)(\omega, b) \big| < +\infty$ (Helgason [2011], p. 5). Moreover, $\mathcal{S}(\mathbb{P}^d)$ denotes the even functions in $\mathcal{S}(\mathbb{S}^{d-1} \times \mathbb{R})$, which means the conditions on $\psi$ in (6) can be written as $\psi \in \mathcal{S}(\mathbb{P}^d)$, $\|\psi\|_{\infty} \le 1$.
The finiteness of the $\mathcal{R}$-norm indicates whether a function on $\mathbb{R}^d$ admits an exact representation as an infinitely wide neural network. Namely, Ongie et al. [2019] show in their Lemma 10 that $\|f\|_{\mathcal{R}}$ is finite if and only if there exist a (unique) even measure $\alpha \in \mathcal{M}(\mathbb{S}^{d-1} \times \mathbb{R})$ and (unique) $v \in \mathbb{R}^d$, $c \in \mathbb{R}$ such that for any $x \in \mathbb{R}^d$,
(7) $f(x) = \int_{\mathbb{S}^{d-1} \times \mathbb{R}} \big(\langle \omega, x \rangle - b\big)_{+}\, d\alpha(\omega, b) + \langle v, x \rangle + c,$
in which case $\|f\|_{\mathcal{R}} = \|\alpha\|_{TV}$.
Note the following differences between this result and the bounds of Breiman [1993] and Klusowski and Barron [2018] shown in equations (1) and (2):


in (7) we have an exact representation with infinite-width neural networks instead of an approximation result with finite-width networks,

in (7) the representation holds on all of $\mathbb{R}^d$ instead of on a bounded domain.
In our work, we derive representation results similar to the ones of Ongie et al. [2019] for functions defined on bounded open sets, which naturally give rise to sparse approximation results that refine those of [Breiman, 1993, Klusowski and Barron, 2018].
One property that makes the Radon transform and its dual useful for analyzing neural networks can be understood at a very high level via the following argument: applying the operator $(-\Delta)^{(d+1)/2}$ and then the Radon transform to a function of the form (7) recovers the underlying measure $\alpha$, up to a constant factor and in a distributional sense. This property relates evaluations of the measure $\alpha$ to the function $f$ via the Radon transform, and it is the main ingredient in the proof of Lemma 10 of Ongie et al. [2019]. While we also rely on it, we need many additional tools to deal with the case of bounded open sets.
3 Representation results on bounded open sets
Schwartz functions on open sets.
Let $\mathcal{U} \subseteq \mathbb{R}^d$ be an open subset. The space of Schwartz functions on $\mathcal{U}$ may be defined as the set of Schwartz functions on $\mathbb{R}^d$ such that the derivatives of all orders vanish outside of $\mathcal{U}$ (cf. Def. 3.2, Shaviv [2020]). The structure of $\mathcal{S}(\mathcal{U})$ is similar to that of $\mathcal{S}(\mathbb{R}^d)$ in that its topology is given by a family of semi-norms. Similarly, if $\tilde{\mathcal{V}} \subseteq \mathbb{S}^{d-1} \times \mathbb{R}$ is open, we define the Schwartz space on $\tilde{\mathcal{V}}$ analogously, with the derivatives along $\mathbb{S}^{d-1}$ taken with respect to the spherical Laplacian.
The $\mathcal{R},\mathcal{U}$-norm.
Let $\mathcal{U} \subseteq \mathbb{R}^d$ be a bounded open set. For any function $f : \mathbb{R}^d \to \mathbb{R}$, we define the $\mathcal{R},\mathcal{U}$-norm of $f$ as
(8) $\|f\|_{\mathcal{R},\mathcal{U}} = \sup \big\{ -c_d \int_{\mathbb{R}^d} f(x)\, (-\Delta)^{(d+1)/2} \mathcal{R}^{*}\psi(x)\, dx \,:\, \psi \in \mathcal{S}(\mathbb{S}^{d-1} \times \mathbb{R}) \text{ even, supported on the hyperplanes intersecting } \mathcal{U},\ \|\psi\|_{\infty} \le 1 \big\}.$
Note the similarity between this quantity and the $\mathcal{R}$-norm defined in (6); the main differences are that the supremum here is taken over the even Schwartz functions supported on the hyperplanes intersecting $\mathcal{U}$ instead of over $\mathcal{S}(\mathbb{P}^d)$, and that the non-Lipschitz case does not need a separate treatment. Remark that $\|f\|_{\mathcal{R},\mathcal{U}} \le \|f\|_{\mathcal{R}}$. If $f$ has enough regularity, we may write $-c_d \int_{\mathbb{R}^d} f(x)\, (-\Delta)^{(d+1)/2} \mathcal{R}^{*}\psi(x)\, dx = -c_d \int_{\mathbb{S}^{d-1} \times \mathbb{R}} \mathcal{R}\big((-\Delta)^{(d+1)/2} f\big)(\omega, b)\, \psi(\omega, b)\, d\omega\, db$, using that the fractional Laplacian is self-adjoint and $\mathcal{R}^{*}$ is the adjoint of $\mathcal{R}$.
Define $\tilde{\mathcal{U}} := \{(\omega, b) \in \mathbb{S}^{d-1} \times \mathbb{R} : \{x \in \mathbb{R}^d : \langle \omega, x \rangle = b\} \cap \mathcal{U} \neq \emptyset\}$ to be the bounded open set of hyperplanes of $\mathbb{R}^d$ that intersect $\mathcal{U}$, which, in analogy with Subsec. 2.3, is equal to an open subset of $\mathbb{P}^d$ up to the identification of $(\omega, b)$ with $(-\omega, -b)$. Similarly, note that even Schwartz functions supported in $\tilde{\mathcal{U}}$ may be regarded as Schwartz functions on this subset of $\mathbb{P}^d$, which allows one to rewrite the conditions in (8) as $\psi$ even with $\operatorname{supp}(\psi) \subseteq \tilde{\mathcal{U}}$ and $\|\psi\|_{\infty} \le 1$.
The following proposition, which is based on the Riesz–Markov–Kakutani representation theorem, shows that when the $\mathcal{R},\mathcal{U}$-norm is finite, it can be associated to a unique Radon measure over $\tilde{\mathcal{U}}$.
Proposition 1.
If $\|f\|_{\mathcal{R},\mathcal{U}} < +\infty$, there exists a unique Radon measure $\alpha \in \mathcal{M}(\tilde{\mathcal{U}})$ such that $-c_d \int_{\mathbb{R}^d} f(x)\, (-\Delta)^{(d+1)/2} \mathcal{R}^{*}\psi(x)\, dx = \int_{\tilde{\mathcal{U}}} \psi\, d\alpha$ for any admissible $\psi$. Moreover, $\|f\|_{\mathcal{R},\mathcal{U}} = \|\alpha\|_{TV}$.
Building on this, we see that a neural network representation on the bounded set $\mathcal{U}$ holds whenever the $\mathcal{R},\mathcal{U}$-norm is finite:
Theorem 1.
Let $\mathcal{U}$ be an open, bounded subset of $\mathbb{R}^d$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be such that $\|f\|_{\mathcal{R},\mathcal{U}} < +\infty$. Let $\alpha \in \mathcal{M}(\tilde{\mathcal{U}})$ be given by Proposition 1. Then there exist unique $v \in \mathbb{R}^d$ and $c \in \mathbb{R}$ such that
(9) $f(x) = \int_{\mathbb{S}^{d-1} \times \mathbb{R}} \big(\langle \omega, x \rangle - b\big)_{+}\, d\alpha(\omega, b) + \langle v, x \rangle + c$
for a.e. (almost every) $x$ in $\mathcal{U}$. If $f$ is continuous, the equality holds for all $x \in \mathcal{U}$.
Remark that this theorem does not claim that the representation given by (9) is unique, unlike Lemma 10 of Ongie et al. [2019] concerning analogous representations on $\mathbb{R}^d$. In Sec. 5 we see that such representations are in fact not unique, for particular choices of the set $\mathcal{U}$. We want to underline that the proof of Theorem 1 uses completely different tools from the ones of Lemma 10 of Ongie et al. [2019]: their result relies critically on the fact that the only harmonic Lipschitz functions on $\mathbb{R}^d$ are affine functions, which does not hold for functions on bounded subsets in our setting.
4 Sparse approximation for functions with bounded $\mathcal{R},\mathcal{U}$-norm
In this section, we show how to obtain approximation bounds for a function $f$ on a bounded open set $\mathcal{U}$ using a fixed-width neural network with bounded coefficients, in terms of the $\mathcal{R},\mathcal{U}$-norm introduced in the previous section.
Theorem 2.
The proof of Theorem 2 (in App. B) uses the neural network representation (9) and a probabilistic argument. If one samples the atoms of a finite-width network from a probability distribution proportional to $|\alpha|$, a Rademacher complexity bound upper-bounds the expectation of the supremum norm of the difference between $f$ and the sampled network, which yields the result.
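The sampling step of this argument can be sketched numerically as follows (a toy stand-in of ours, not the paper's construction: a wide network plays the role of the measure $\alpha$, and the width, dimension, and weights are arbitrary). Each retained neuron receives outer weight $\pm \|\alpha\|_{TV} / n$, carrying the sign of the atom it was drawn from, which makes the estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A "wide" network stands in for the measure alpha (toy data).
M, d = 5000, 3
a = rng.normal(size=M) / M                         # signed masses
W = rng.normal(size=(M, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # unit inner weights
b = rng.uniform(-1.0, 1.0, size=M)

def f(x):
    return float(a @ relu(W @ x - b))

# Sparsification: draw n atoms with probability proportional to |a_i|,
# then reweight so that the sparse network is unbiased for f(x).
tv = float(np.abs(a).sum())                        # TV norm of the discrete measure
n = 100
idx = rng.choice(M, size=n, p=np.abs(a) / tv)
a_sparse = np.sign(a[idx]) * tv / n

def f_n(x):
    return float(a_sparse @ relu(W[idx] @ x - b[idx]))
```

Because $\|\omega_i\| = 1$ and $|b_i| \le 1$ here, each neuron's output at $x$ is at most $\|x\| + 1$, so both $f$ and $f_n$ are bounded by $\|\alpha\|_{TV}(\|x\| + 1)$; the concentration step then gives the $1/\sqrt{n}$ rate in expectation.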
Note the resemblance of (11) with the bound (1): the $\mathcal{R},\mathcal{U}$-norm of $f$ replaces the quantity $C_f$. We can also use the $\mathcal{R},\mathcal{U}$-norm to obtain a bound analogous to (2), that is, with a slightly better dependency on $n$ in the exponent at the expense of a constant factor growing with the dimension.
Proposition 2.
Let $f : \mathbb{R}^d \to \mathbb{R}$ and $\mathcal{U} \subseteq \mathbb{R}^d$ open and bounded such that $\|f\|_{\mathcal{R},\mathcal{U}} < +\infty$. Then there exist $\{(a_i, \omega_i, b_i)\}_{i=1}^{n} \subseteq \mathbb{R} \times \mathbb{S}^{d-1} \times \mathbb{R}$, $v \in \mathbb{R}^d$, and $c' \in \mathbb{R}$ such that the function
(12) $\tilde{f}(x) = \sum_{i=1}^{n} a_i \big(\langle \omega_i, x \rangle - b_i\big)_{+} + \langle v, x \rangle + c'$
fulfills, for a.e. $x$ in $\mathcal{U}$ and some universal constant $c$,
(13) $\big| f(x) - \tilde{f}(x) \big| \le c\, \|f\|_{\mathcal{R},\mathcal{U}}\, \max\{\sqrt{\log n}, \sqrt{d}\}\, n^{-\frac{1}{2} - \frac{1}{d}}.$
The proof of this result (in App. B) follows readily from the representation (9) and Theorem 1 of Klusowski and Barron [2018].
4.1 Links with the Fourier sparse approximation bounds
The following result shows that, setting $\mathcal{U} = B_R(\mathbb{R}^d)$, the $\mathcal{R},\mathcal{U}$-norm can be bounded by the Fourier-based quantities introduced in Subsec. 2.2.
Theorem 3.
Assume that the function $f : \mathbb{R}^d \to \mathbb{R}$ admits a Fourier representation of the form $f(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i \langle \xi, x \rangle}\, d\hat{f}(\xi)$, with $\hat{f} \in \mathcal{M}_{\mathbb{C}}(\mathbb{R}^d)$ a complex-valued Radon measure. Let $C_f$ be the quantity used in the sparse approximation bound by Breiman [1993] (see Subsec. 2.2). Then, one has that
(14) $\|f\|_{\mathcal{R}, B_R(\mathbb{R}^d)} \le 2 R\, C_f.$
As a direct consequence of Theorem 3, when $C_f < +\infty$ the right-hand side of (11) can be upper-bounded in terms of $2 R\, C_f$. This allows us to refine the bound (1) from Breiman [1993] to a bound in the supremum norm over $B_R(\mathbb{R}^d)$, in which the approximating neural network has inner weights, biases, and outer weights of controlled magnitude.
While we are able to prove the bound (14), the Fourier representation of $f$ does not allow for a manageable expression for the measure $\alpha$ described in Proposition 1. For that, the following theorem starts from a slightly modified Fourier representation of $f$, under which one can describe the measure $\alpha$ and provide an exact formula for the $\mathcal{R},\mathcal{U}$-norm.
Theorem 4.
Let $f : \mathbb{R}^d \to \mathbb{R}$ admit the representation
(15) 
for complex-valued Radon measures satisfying the stated symmetry and moment conditions. Choosing $\mathcal{U} = B_R(\mathbb{R}^d)$, the unique measure $\alpha$ specified by Proposition 1 takes the following form:
(16) 
Note that the resulting measure is real-valued because of the conjugate symmetry of the underlying complex measure. Consequently, the $\mathcal{R}, B_R(\mathbb{R}^d)$-norm of $f$ is
(17) 
Remark that $\alpha$ is a pushforward of the measure appearing in the Fourier representation of $f$. When the Fourier transform of $f$ admits a density, one may obtain the density of $\alpha$ via a change from Euclidean to spherical coordinates. Hence, Theorem 4 provides an operative way to compute the $\mathcal{R}, B_R(\mathbb{R}^d)$-norm of $f$ if one has access to its Fourier transform.
Theorems 3 and 4 are proven jointly in App. B. Note that from the expression (17) one can easily see that $\|f\|_{\mathcal{R}, B_R(\mathbb{R}^d)}$ is upper-bounded by $2 R\, C_f$:
(18) 
where the equality holds since $\alpha$ is the pushforward measure described above. Equation (18) makes apparent that the $\mathcal{R}, B_R(\mathbb{R}^d)$-norm is sometimes much smaller than the quantities $C_f$ and $\tilde{C}_f$, as is showcased by the following one-dimensional example (see the proof in App. B). In these situations, the sparse approximation bounds that we provide in Theorem 2 and Proposition 2 are much better than the ones in (1)-(2).
Example 1.
Take the function defined as , with . admits the Fourier representation
An interesting class of functions for which the $\mathcal{R},\mathcal{U}$-norm is finite but $C_f$ and $\tilde{C}_f$ are infinite is the class of functions that can be written as a finite-width neural network on $\mathbb{R}^d$, as shown in the following proposition.
Proposition 3.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be defined as a finite-width two-layer ReLU network, $f(x) = \sum_{i=1}^{n} a_i (\langle \omega_i, x \rangle - b_i)_{+}$ for all $x \in \mathbb{R}^d$, with $a_i, b_i \in \mathbb{R}$ and $\omega_i \in \mathbb{S}^{d-1}$. Then, for any bounded open set $\mathcal{U}$, we have $\|f\|_{\mathcal{R},\mathcal{U}} < +\infty$, while $C_f = +\infty$ if $f$ is not an affine function.
Proposition 3 makes use of the fact that the $\mathcal{R},\mathcal{U}$-norm is always upper-bounded by the $\mathcal{R}$-norm, which also means that all the bounds developed in Ongie et al. [2019] apply to the $\mathcal{R},\mathcal{U}$-norm. The fact that finite-width neural networks have infinite $C_f$ was stated by E and Wojtowytsch [2020], who used them to show the gap between the functions with finite $C_f$ and the functions representable by infinite-width neural networks (belonging to the Barron space, in their terminology). It remains to be seen whether the gap is closed when considering functions with finite $\mathcal{R},\mathcal{U}$-norm, i.e., whether any function admitting an infinite-width representation (9) on $\mathcal{U}$ has a finite $\mathcal{R},\mathcal{U}$-norm.
Moving to the nonlinear Radon transform. In many applications the function of interest may be better represented as acting on $\phi(x)$, where $\phi$ is a fixed finite-dimensional, nonlinear and bounded feature map. Our results trivially extend to this case, where in the Radon transform hyperplanes are replaced by hyperplanes in the feature space. This can be seen as the “kernel trick” applied to the Radon transform. The corresponding norm captures the sparsity of the decomposition in the feature space, and we obtain better approximation bounds when it is small. This gives a simple condition for when transfer learning is successful, and explains the success of using random Fourier features as a preprocessing in implicit neural representations of images and surfaces [Tancik et al., 2020]. In order to go beyond fixed feature maps and tackle deeper ReLU networks, we think that the nonlinear Radon transform [Ehrenpreis, 2003] is an interesting tool to explore. We note that Parhi and Nowak [2021b] recently introduced a representer theorem for deep ReLU networks using Radon transforms as a regularizer.
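A minimal sketch of this feature-space viewpoint (the map $\phi$, its width, and all hyperparameters below are illustrative choices of ours, not a construction from the paper): inputs are first lifted by a fixed, bounded random Fourier feature map, and the two-layer ReLU network then acts on $\phi(x)$, so the hyperplanes in the Radon-transform picture live in the feature space.

```python
import numpy as np

rng = np.random.default_rng(1)

d, D, m = 2, 64, 16                         # input dim, feature dim, width (arbitrary)
Omega = rng.normal(size=(D, d))             # random frequencies
phase = rng.uniform(0.0, 2.0 * np.pi, D)    # random phases

def phi(x):
    """Fixed, bounded random Fourier feature map: |phi(x)|_inf <= sqrt(2/D)."""
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + phase)

# Two-layer ReLU network acting in feature space:
# g(x) = sum_i a_i * (<w_i, phi(x)> - b_i)_+
W = rng.normal(size=(m, D))
a = rng.normal(size=m)
b = rng.normal(size=m)

def g(x):
    return float(a @ np.maximum(W @ phi(x) - b, 0.0))
```

Only the feature map is fixed; all learning happens in the (linear) feature space, which is what makes the Radon-transform machinery carry over unchanged.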
5 Infinite-width representations are not unique on bounded sets
Ongie et al. [2019] show that when the $\mathcal{R}$-norm of $f$ is finite, there is a unique measure $\alpha \in \mathcal{M}(\mathbb{P}^d)$ such that the representation (7) holds for all $x \in \mathbb{R}^d$. In this section we show that when we only ask the representation to hold for $x$ in a bounded open set, there exist several measures that do the job; in fact, they span an infinite-dimensional space.
Let $B_R(\mathbb{R}^d)$ be the open ball of radius $R$ in $\mathbb{R}^d$; the corresponding set of hyperplanes intersecting it is $\{(\omega, b) : \omega \in \mathbb{S}^{d-1},\ b \in (-R, R)\}$, which we denote by $\mathbb{S}^{d-1} \times (-R, R)$ for simplicity. In the following, we will construct a space of Radon measures on $\mathbb{S}^{d-1} \times (-R, R)$ whose neural network representations (7) coincide for all $x \in B_R(\mathbb{R}^d)$. Note that since any bounded subset of $\mathbb{R}^d$ is included in some open ball, our results imply that such representations are non-unique on any bounded set.
Remark 1.
When one considers representations on $B_R(\mathbb{R}^d)$ of the sort (7) with the measure lying in the larger space $\mathcal{M}(\mathbb{S}^{d-1} \times \mathbb{R})$, the non-uniqueness is apparent because there are two ‘trivial’ kinds of symmetry at play:


Related to parity: when the measure $\alpha$ is odd, i.e. $d\alpha(-\omega, -b) = -\,d\alpha(\omega, b)$, we have $\int_{\mathbb{S}^{d-1} \times \mathbb{R}} (\langle \omega, x \rangle - b)_{+}\, d\alpha(\omega, b) = \frac{1}{2} \int_{\mathbb{S}^{d-1} \times \mathbb{R}} (\langle \omega, x \rangle - b)\, d\alpha(\omega, b)$, which is an affine function of $x$ (using $(z)_{+} - (-z)_{+} = z$).

Related to boundedness: if $|b| \ge R$, the function $x \mapsto (\langle \omega, x \rangle - b)_{+}$ restricted to $B_R(\mathbb{R}^d)$ is an affine function of $x$: it vanishes identically when $b \ge R$ and equals $\langle \omega, x \rangle - b$ when $b \le -R$. Hence, if $\alpha$ is supported on $\mathbb{S}^{d-1} \times (\mathbb{R} \setminus (-R, R))$, the corresponding function is affine when restricted to $B_R(\mathbb{R}^d)$.
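Symmetry (i) can be verified numerically on a discrete odd measure (the specific atoms below are arbitrary choices of ours): pairing the atom $(\omega, b)$ with mass $a$ against the atom $(-\omega, -b)$ with mass $-a$, and using $(z)_{+} - (-z)_{+} = z$, leaves exactly the affine function $a(\langle \omega, x \rangle - b)$.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Discrete odd measure: atom (w, b) with mass a, atom (-w, -b) with mass -a.
w = np.array([0.6, 0.8])   # unit vector (arbitrary direction)
b, a = 0.3, 1.7

def network(x):
    # the two paired neurons of the odd measure
    return a * relu(w @ x - b) - a * relu(-(w @ x) + b)

def affine(x):
    # (z)_+ - (-z)_+ = z  ==>  the pair sums to the affine map a*(<w,x> - b)
    return a * (w @ x - b)
```

The two functions agree everywhere, so an odd measure only ever contributes an affine term to the representation.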
Since in Sec. 3 we restrict our scope to even measures supported on the set of hyperplanes intersecting $\mathcal{U}$, these two kinds of symmetries are already quotiented out in our analysis. The third kind of non-uniqueness that we discuss in this section is conceptually deeper, taking place within $\mathcal{M}(\mathbb{S}^{d-1} \times (-R, R))$.
Let $\{Y_{k,j}\}$ be an orthonormal basis of spherical harmonics of the space $L^2(\mathbb{S}^{d-1})$ [Atkinson and Han, 2012]. It is well known that for any $k$, the functions $\{Y_{k,j}\}_j$ are the restrictions to $\mathbb{S}^{d-1}$ of homogeneous harmonic polynomials of degree $k$, and in fact their number equals the dimension of the space of homogeneous harmonic polynomials of degree $k$. Consider the following subset of even functions in $C^{\infty}(\mathbb{S}^{d-1} \times (-R, R))$:
(19) 
where $t^{k'}$ denotes the monomial of degree $k'$ on $(-R, R)$. We have the following result regarding the non-uniqueness of neural network representations:
Theorem 5.
If $\alpha \in \overline{\operatorname{span}(A)} \subseteq \mathcal{M}(\mathbb{S}^{d-1} \times (-R, R))$, then for any $x \in B_R(\mathbb{R}^d)$ we have $\int_{\mathbb{S}^{d-1} \times (-R, R)} (\langle \omega, x \rangle - b)_{+}\, d\alpha(\omega, b) = 0$. That is, $\alpha$ yields a neural network representation of the zero function on $B_R(\mathbb{R}^d)$. Here, we consider $\operatorname{span}(A)$ as a subset of $\mathcal{M}(\mathbb{S}^{d-1} \times (-R, R))$ by the Riesz–Markov–Kakutani representation theorem, via the action $\psi \mapsto \int_{\mathbb{S}^{d-1} \times (-R, R)} \psi(\omega, b)\, g(\omega, b)\, d\omega\, db$ for $g \in \operatorname{span}(A)$ and any $\psi \in C_0(\mathbb{S}^{d-1} \times (-R, R))$, and $\overline{\operatorname{span}(A)}$ denotes the closure in the topology of weak convergence of $\mathcal{M}(\mathbb{S}^{d-1} \times (-R, R))$.
In particular, any measure whose density lies in the span of $A$ will yield a function which is equal to zero when restricted to $B_R(\mathbb{R}^d)$. As an example of this result, we next exhibit a simple measure in $\mathcal{M}(\mathbb{S}^{d-1} \times \mathbb{R})$ which represents the zero function on $B_R(\mathbb{R}^d)$.
Example 2 (Non-zero measure representing the zero function on $B_R(\mathbb{R}^d)$).
We define the even Radon measure $\alpha$ with density given by an explicit element of $\operatorname{span}(A)$. Then, for any $x \in B_R(\mathbb{R}^d)$, $\int_{\mathbb{S}^{d-1} \times \mathbb{R}} (\langle \omega, x \rangle - b)_{+}\, d\alpha(\omega, b) = 0$.
On the one hand, Proposition 1 states that there exists a unique measure $\alpha \in \mathcal{M}(\tilde{\mathcal{U}})$ representing $f$ in the sense of the duality pairing when $\|f\|_{\mathcal{R},\mathcal{U}}$ is finite. On the other hand, Theorem 5 claims that functions admit distinct representations by measures in $\mathcal{M}(\mathbb{S}^{d-1} \times (-R, R))$. The following theorem clarifies these two seemingly contradictory statements. Consider the following subset of even functions in $C^{\infty}(\mathbb{S}^{d-1} \times (-R, R))$, which contains $A$:
(20) 