Quantifying inductive bias with Bayesian priors
The prior over functions, P(f), is the probability that a DNN \(\mathcal{N}(\Theta )\) expresses f upon random sampling of parameters over a parameter initialization distribution Ppar(Θ):
$$P(f)=\int \mathbb{1}[\mathcal{N}(\Theta )=f]\,P_{\rm par}(\Theta )\,d\Theta ,$$
(1)
where \(\mathbb{1}\) is an indicator function (1 if its argument is true, and 0 otherwise). Explicitly, this term is 1 if the neural network \(\mathcal{N}(\Theta )\) expresses f with parameters Θ, else 0. It was shown in ref. 24 that, for ReLU activation functions, P(f) for the Boolean system was insensitive to different choices of Ppar(Θ), and that it exhibits an exponential bias of the form \(P(f)\lesssim 2^{-a\tilde{K}(f)+b}\) towards simple functions with low descriptional complexity \(\tilde{K}(f)\), which is a proxy for the true (but uncomputable) Kolmogorov complexity. We will, as in ref. 24, calculate \(\tilde{K}(f)\) using C_LZ, a Lempel-Ziv (LZ) based complexity measure from ref. 25, applied to the 2^n-long bitstring that describes the function on an ordered list of inputs. Other complexity measures give similar results24,26, so there is nothing fundamental about this particular choice. To simplify notation, we will use K(f) instead of \(\tilde{K}(f)\). The exponential drop of P(f) with K(f) in the map from parameters to functions is consistent with a simplicity bias bound25 inspired by the algorithmic information theory (AIT) coding theorem27, which holds for a much wider set of input-output maps. It was argued in ref. 24 that if this inductive bias in the priors matches the simplicity of structured data then it would help explain why DNNs generalize so well. However, the weakness of that work, and of related works arguing for such a bias towards simplicity24,26,28,29,30,31,32,33,34,35, is that it is typically not possible to significantly change this inductive bias towards simplicity, making it hard to conclusively show that it is not some other property of the network that instead generates the good performance. Here we exploit a particularity of \(\tanh\) activation functions that enables us to significantly vary the inductive bias of DNNs. In particular, for a Gaussian Ppar(Θ) with standard deviation σw, it was shown36,37 that, as σw increases, there is a transition to a chaotic regime. Moreover, it was recently demonstrated that the simplicity bias in P(f) becomes weaker in the chaotic regime38 (see also Supplementary Note 3). We will exploit this behavior to systematically vary the inductive bias over functions in the prior.
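As an illustration, the following minimal sketch computes an LZ76-based complexity proxy on the 2^n-bit string describing a Boolean function. The phrase-counting routine follows the standard Kaspar–Schuster scheme, and the normalization (log2 of the string length times the mean phrase count of the string and its reverse) follows the spirit of refs. 24,25; the exact convention used in the paper may differ.

```python
import math

def lz76_phrases(s: str) -> int:
    """Count LZ76 phrases with the standard Kaspar-Schuster scheme."""
    n = len(s)
    i, k, l, k_max, c = 0, 1, 1, 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            if k > k_max:
                k_max = k
            i += 1
            if i == l:           # all earlier starting points exhausted: new phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def K_LZ(bits: str) -> float:
    """LZ-based complexity proxy for the 2^n-bit truth table of a Boolean function.
    Trivial (all-0 or all-1) strings are assigned the minimal value log2(len)."""
    n = len(bits)
    if bits in ('0' * n, '1' * n):
        return math.log2(n)
    return math.log2(n) * (lz76_phrases(bits) + lz76_phrases(bits[::-1])) / 2

# Simple strings score low, more irregular strings score higher
print(K_LZ('0' * 128), K_LZ('01' * 64), K_LZ('0110100110010110' * 8))
```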
In Fig. 1a, b we depict prior probabilities P(f) for functions f defined on all 128 inputs of an n = 7 Boolean system upon random sampling of parameters of an FCN with 10 layers and hidden width 40 (which is provably fully expressive for this system31), and \(\tanh\) activation functions. The simplicity bias in P(f) becomes weaker as the standard deviation σw of the Gaussian Ppar(Θ) increases. By contrast, for ReLU activations, the bias in P(f) barely changes with σw (see Fig. S3a). The effect of the decrease in simplicity bias on DNN generalization performance is demonstrated in Fig. 1c for a DNN trained to zero error on a training set S of size m = 64 using advSGD (an SGD variant taken from ref. 24), and tested on the other 64 inputs xi ∈ T. The generalization error (the fraction of incorrect predictions on T) varies as a function of the complexity of the target function. Although all these DNNs exhibit simplicity bias, weaker forms of the bias correspond to significantly worse generalization on the simpler targets (see also Supplementary Note 10). For very complex targets, both networks perform poorly. For reference, we also show an unbiased learner, where functions f are chosen uniformly at random with the proviso that they exactly fit the training set S. Not surprisingly, given the 2^64 ≈ 2 × 10^19 functions that can fit S, the performance of this unbiased learner is no better than random chance.
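The sketch below illustrates how such a prior can be estimated empirically: sample Gaussian parameters for a tanh FCN, read off the induced Boolean function on all 2^7 inputs, and count how often each function recurs. The ±1 input encoding, the 1/√fan-in weight scaling, and the bias scale are assumptions made for illustration; the paper's exact sampling conventions may differ.

```python
import numpy as np
from collections import Counter

def sample_boolean_function(sigma_w, n=7, depth=10, width=40, rng=None):
    """Return the 2^n-bit string of the Boolean function expressed by a randomly
    initialized tanh FCN (weights ~ N(0, sigma_w^2/fan_in); an assumed convention)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array([[(i >> j) & 1 for j in range(n)] for i in range(2 ** n)], dtype=float)
    h = 2 * X - 1                                   # +/-1 input encoding (assumption)
    dims = [n] + [width] * depth + [1]
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_in, d_out))
        b = rng.normal(0.0, 0.1, size=d_out)        # small bias scale (assumption)
        h = np.tanh(h @ W + b)
    return ''.join('1' if y > 0 else '0' for y in h.ravel())

# Crude estimate of the prior P(f): frequency of each sampled function
rng = np.random.default_rng(0)
for sigma_w in (1.0, 8.0):
    counts = Counter(sample_boolean_function(sigma_w, rng=rng) for _ in range(5000))
    top = counts.most_common(3)
    print(f"sigma_w={sigma_w}: top function counts out of 5000:", [c for _, c in top])
```

With small σw the same simple functions recur very often, while in the chaotic regime almost every sample yields a distinct, complex function.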

a Prior P(f) that an Nl-layer FCN with \(\tanh\) activations generates n = 7 Boolean functions f, ranked by probability of individual functions, generated from 10^8 random samples of parameters Θ over a Gaussian Ppar(Θ) with standard deviations σw = 1…8. Also compared is a ReLU-activated DNN. The dotted blue line denotes a Zipf's law prior24 \(P(f)=1/((128\ln 2){\rm Rank}(f))\). b P(f) versus LZ complexity K for the networks from (a). c Generalization error versus K of the target function for an unbiased learner (green), and \(\sigma _w=1,8\,\tanh\) networks trained to zero error with advSGD24 on cross-entropy loss with a training set S of size m = 64, for 1000 random initializations. The error is calculated on the remaining ∣T∣ = 64 inputs. Error bars are one standard deviation (see Fig. S17 for PAC-Bayes bounds on this data). d, e, f Scatterplots of generalization error versus learned function LZ complexity, from 1000 random initializations for three target functions from subfigure (c). The dashed vertical line denotes the target function complexity. The black cross represents the mode function. The histograms at the top (side) of the plots show the posterior probability upon training as a function of complexity, PSGD(K∣S) (error, PSGD(ϵG∣S)). g The prior probability P(K) to obtain a function of LZ complexity K for uniform random sampling of 10^8 functions, compared to a theoretical perfect compressor. 90% of the probability mass lies to the right of the vertical dotted lines, and the dash-dot line denotes an extrapolation to low K. h P(K) is relatively uniform in K for the σw = 1 system, while it is highly biased towards complex functions for the σw = 8 networks. The large difference in these priors helps explain the significant variation in DNN performance. i Generalization error for the K-learning restriction for the σw = 1, 8 DNNs and for an unbiased learner, all for ∣S∣ = 100. ϵS is the training error and ϵG is the generalization error on the test set. The vertical dashed line is the complexity Kt of the target. Also compared are the standard realizable PAC and marginal-likelihood PAC-Bayes bounds for the unbiased learner. In 10^4 samples, no solutions were found with K ≲ 70 for the σw = 8 DNN, and with K ≳ 70 for the σw = 1 DNN.
The scatter plots of Fig. 1d–f depict a more fine-grained picture of the behavior of the SGD-trained networks for three different target functions. For each target, 1000 independent initializations of the SGD optimizer, with initial parameters taken from Ppar(σw), are used. The generalization error and complexity of each function found when the DNN first reaches zero training error are plotted. Since there are 2^64 possible functions that give zero error on the training set S, it is not surprising that the DNN converges to many different functions upon different random initializations. For the σw = 1 network (where P(f) resembles that of ReLU networks) the most common function is typically simpler than the target. By contrast, the less biased network converges on functions that are typically more complex than the target. As the target itself becomes more complex, the relative difference between the two generalization errors decreases, because the strong inductive bias towards simple functions of the first network becomes less useful. No free lunch theorems for supervised learning tell us that when averaged over all target functions, the three learners above will perform equally badly39,40 (see also Supplementary Note 43).
Priors over complexity
To understand why relatively modest changes in the inductive bias towards simplicity lead to such significant differences in generalization performance, we need another important ingredient, namely how the number of functions varies with complexity. Basic counting arguments imply that the number of strings of a fixed length that have complexity K scales exponentially as 2^K (ref. 27). Therefore, the vast majority of functions picked at random will have high complexity. This exponential growth of the number of functions with complexity can be captured in a more coarse-grained prior, the probability P(K) that the DNN expresses a function of complexity K upon random sampling of parameters over a parameter initialization distribution Ppar(Θ), which can also be written in terms of functions as \(P(K^{\prime} )=\sum _{f\in {\mathcal{H}}_{K^{\prime} }}P(f)\), the sum of the prior over the set \({\mathcal{H}}_{K^{\prime} }\) of all functions with complexity \(\tilde{K}(f)=K^{\prime} \). In Fig. 1g P(K) is shown for uniform random sampling of functions with 10^8 samples using the LZ measure, and also for the theoretical ideal compressor with \(P(K)=2^{K-K_{\max }-1}\) over all 2^128 ≈ 3 × 10^38 functions (see also Supplementary Note 9). In (h) we display P(K) for functions sampled not at random, but rather from the two networks. There is a dramatic difference between random sampling of functions (as in (g)) and the network with σw = 1, for which P(K) is nearly flat. This behavior follows from the interesting fact that the AIT coding theorem-like scaling24,25 of the prior over functions \(P(f) \sim 2^{-\tilde{K}(f)}\) counters the 2^K growth in the number of functions.
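A schematic version of this cancellation, under the idealization that roughly 2^K functions have complexity K, and writing the simplicity-bias bound quoted above as \(P(f)\approx 2^{-aK(f)+b}\), is

$$P_{\rm uniform}(K)\approx \frac{2^{K}}{\sum _{K^{\prime} \le K_{\max }}2^{K^{\prime} }}\approx 2^{K-K_{\max }-1},\qquad P_{\rm DNN}(K)\approx \underbrace{2^{K}}_{\#\,{\rm functions\ at\ }K}\times \underbrace{2^{-aK+b}}_{{\rm prior\ per\ function}}=2^{(1-a)K+b}.$$

For a ≈ 1 the two exponentials approximately cancel and P(K) is roughly flat in K, whereas uniform sampling (a = 0) concentrates essentially all of the probability mass at the highest complexities.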
By contrast, even though, relative to the 38 or so orders of magnitude scale on which P(f) varies, the more artefactual σw = 8 system has strong simplicity bias (we estimate that for the simplest functions, P(f) is about 10^25 times higher than the mean probability \(\langle P(f)\rangle =2^{-128}\approx 3\times 10^{-39}\)), this is not enough to counter the 2^K growth in the number of functions with complexity. Therefore, this DNN is exponentially more likely to throw up complex functions, an effect that SGD is unable to overcome.
More generally, the fact that the number of complex functions grows exponentially with complexity K lies at the heart of the classical explanation of why an insufficiently biased agent suffers from variance: it can too easily find many different functions that all fit the data. The marked differences in generalization performance between the two networks observed in Fig. 1c–f can therefore be traced to differences in the inductive bias of the networks, as measured by the differences in their priors.
Artificially restricting model capacity
To further illustrate the effect of inductive bias we create a K-learner that only allows functions with complexity ≤ KM to be learned and discards all others. As can be seen in Fig. 1i, the learners typically cannot reach zero training error on the training set if KM is less than the target function complexity Kt. For KM ≥ Kt, zero training error can be reached and, not surprisingly, the lowest generalization error occurs when KM = Kt. As the upper limit KM is increased, all three learning agents are more likely to make errors in predictions due to variance. The random learner has an error that grows linearly with KM. This behavior can be understood with a classic Probably Approximately Correct (PAC) bound6 where the generalization error (with confidence 0 ≤ (1 − δ) ≤ 1) scales as \(\epsilon _G\le (\ln | {\mathcal{H}}_{\le K_M}| -\ln \delta )/m\), where \(| {\mathcal{H}}_{\le K_M}| \) is the size of the hypothesis class of all functions with K ≤ KM. Since this hypothesis class grows exponentially with KM, the bound scales linearly in KM, as the error does (see Supplementary Note 7 for further discussion, including the more sophisticated PAC-Bayes bound41,42). The generalization error for the σw = 1 DNN does not change much with KM for KM > Kt because the strong inductive bias towards simple solutions means that access to higher complexity solutions does not significantly change what the DNN converges on.
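A minimal numerical sketch of this bound, assuming the counting-argument idealization \(| {\mathcal{H}}_{\le K_M}| \approx 2^{K_M+1}\) (the exact hypothesis-class count used for Fig. 1i may differ):

```python
import math

def pac_bound(K_M, m=100, delta=0.1):
    """Realizable PAC bound eps_G <= (ln|H| - ln delta)/m, with the hypothesis
    class of all functions of complexity <= K_M idealized as |H| ~ 2^(K_M + 1)."""
    ln_H = (K_M + 1) * math.log(2)
    return (ln_H - math.log(delta)) / m

for K_M in (20, 40, 60, 80, 100, 128):
    print(K_M, round(pac_bound(K_M), 2))   # grows linearly in K_M, like the error
```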
Finally, we show data for DNNs in the ordered regime with σw ≪ 1, and for other optimizers, loss functions, and activation functions in Figs. S6–S11. These results broadly exhibit the same behavior we describe here.
Calculating the Bayesian posterior and likelihood
To better understand the generalization behavior observed in Fig. 1 we apply Bayes' rule, P(f∣S) = P(S∣f)P(f)/P(S), to calculate the Bayesian posterior P(f∣S) from the prior P(f), the likelihood P(S∣f), and the marginal likelihood P(S). Since we condition on zero training error, the likelihood takes on a simple form: P(S∣f) = 1 if ∀ xi ∈ S, f(xi) = yi, while P(S∣f) = 0 otherwise. For a fixed training set, all the variation in P(f∣S) for f ∈ U(S), the set of all functions compatible with S, comes from the prior P(f), since P(S) is constant. Therefore, in this Bayesian picture, the bias in the prior is translated over to the posterior.
The marginal likelihood also takes a relatively simple form for discrete functions, since P(S) = ∑f P(S∣f)P(f) = ∑f∈U(S) P(f). It is equivalent to the probability that the DNN obtains zero error on the training set S upon random sampling of parameters, and so can be interpreted as a measure of the inductive bias towards the data. The marginal-likelihood PAC-Bayes bound42 makes a direct link \(P(S)\lesssim e^{-m\epsilon _G}\) to the generalization error ϵG, which captures the intuition that, for a given m, a better inductive bias towards the data (larger P(S)) implies better performance (lower ϵG).
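To make the 0/1-likelihood bookkeeping concrete, here is a toy sketch on an n = 3 system (8 inputs, 256 functions) with a hypothetical simplicity-biased prior. The crude complexity proxy and the target are placeholders, and the final line only illustrates the scaling ϵG ≲ −ln P(S)/m implied by the marginal-likelihood bound, not the full PAC-Bayes expression (which carries extra confidence terms).

```python
import itertools, math

def crude_K(bits):                       # toy stand-in for an LZ complexity measure
    return 1 + sum(a != b for a, b in zip(bits, bits[1:]))

functions = [''.join(b) for b in itertools.product('01', repeat=8)]
prior = {f: 2.0 ** (-crude_K(f)) for f in functions}        # hypothetical 2^-K prior
Z = sum(prior.values())
prior = {f: p / Z for f, p in prior.items()}

target, m = '00110011', 4
S = target[:m]                                              # labels of the first m inputs
U_S = [f for f in functions if f[:m] == S]                  # zero training error set U(S)
P_S = sum(prior[f] for f in U_S)                            # marginal likelihood P(S)
posterior = {f: prior[f] / P_S for f in U_S}                # Bayes rule, 0/1 likelihood

print('P(S) =', round(P_S, 4))
print('most probable function given S:', max(posterior, key=posterior.get))
print('error scale -ln P(S)/m =', round(-math.log(P_S) / m, 3))
```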
One can also define the posterior probability PSGD(f∣S), that a network trained with SGD (or another optimizer) on training set S, when initialized with Ppar(Θ), converges on function f. For simplicity, we take this probability at the epoch where the system first reaches zero training error. Note that in Fig. 1d–f it is this SGD-based posterior that we plot in the histograms at the top and sides of the plots, with functions grouped either by complexity, which we will call PSGD(K∣S), or by generalization error ϵG, which we will call PSGD(ϵG∣S).
DNNs are typically trained by some form of SGD, and not by randomly sampling over parameters which is much less efficient. However, a recent study43 which carefully compared the two posteriors has shown that to first order, PB(f∣S) ≈ PSGD(f∣S), for many different data sets and DNN architectures. We demonstrate this close similarity in Fig. S15 explicitly for our n = 7 Boolean system. This evidence suggests that Bayesian posteriors calculated by random sampling of parameters, which are much simpler to analyze, can be used to understand the dominant behavior of an SGD-trained DNN, even if, for example, hyperparameter tuning can lead to 2nd-order deviations between the two methods (see also Supplementary Note 1).
To test the predictive power of our Bayesian picture, we first define the function error ϵ(f) as the fraction of incorrect labels f produces on the full set of inputs. Next, we average Bayes’ rule over all training sets S of size m:
$$\langle P(f| S)\rangle _m=P(f){\left\langle \frac{P(S| f)}{P(S)}\right\rangle }_m\approx \frac{P(f){\left(1-\epsilon (f)\right)}^{m}}{{\langle P(S)\rangle }_m},$$
(2)
where the mean likelihood 〈P(S∣f)〉m = (1−ϵ(f))^m is the probability of a function f obtaining zero error on a training set of size m. In the second step, we approximate the average of the ratio by the ratio of the averages, which should be accurate if P(S) is highly concentrated, as is expected if the training set is not too small.
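A quick Monte Carlo check of this mean-likelihood form, under the assumption that the m training inputs are drawn i.i.d. (uniformly, with replacement) from the full input set, which is what the (1−ϵ(f))^m expression presumes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, trials = 128, 64, 20000
for n_err in (2, 4, 8):              # inputs on which f disagrees with the labels
    eps = n_err / N
    # fraction of random training sets on which f has zero training error
    hits = sum(not np.isin(rng.integers(0, N, size=m), np.arange(n_err)).any()
               for _ in range(trials))
    print(f"eps={eps:.4f}  MC={hits / trials:.4f}  (1-eps)^m={(1 - eps) ** m:.4f}")
```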
Equation (2) is hard to calculate, so we coarse-grain it by grouping together functions by their complexity:
$$\langle P(K| S)\rangle _m=\sum _{C_{\rm LZ}(f)=K}\langle P(f| S)\rangle _m\propto P(K){\left\langle {\left(1-\epsilon _G(K)\right)}^{m}\right\rangle }_l,$$
(3)
and in the second step make a decoupling approximation where we average the likelihood term over a small number l of functions of complexity K with the lowest generalization error ϵG(K), since the smallest errors dominate the sum exponentially, given that \((1-\epsilon _G)\approx e^{-\epsilon _G}\) for ϵG ≪ 1. We then multiply by P(K), which takes into account the value of the prior and the multiplicity of functions at that K, and normalize so that ∑K P(K∣S) = 1. For a given target, we make the ansatz that this decoupling approximation provides an estimate that scales as the true (averaged) posterior.
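A sketch of the resulting recipe, with hypothetical values of P(K) and per-complexity error lists standing in for the measured quantities:

```python
import numpy as np

def coarse_posterior(P_K, eps_by_K, m, l=5):
    """Decoupling approximation of Eq. (3): <P(K|S)>_m ~ P(K) * <(1 - eps_G)^m>_l,
    averaging over the l lowest-error functions available at each complexity K,
    then normalizing over K."""
    post = {}
    for K, pK in P_K.items():
        lows = sorted(eps_by_K.get(K, []))[:l]
        if lows:
            post[K] = pK * np.mean([(1.0 - e) ** m for e in lows])
    Z = sum(post.values())
    return {K: v / Z for K, v in post.items()}

# Hypothetical inputs, for illustration only
P_K = {20: 0.30, 40: 0.30, 60: 0.20, 80: 0.20}
eps_by_K = {20: [0.20, 0.22], 40: [0.05, 0.06], 60: [0.08, 0.10], 80: [0.15, 0.18]}
print(coarse_posterior(P_K, eps_by_K, m=64, l=2))
```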
To test our approximations, we first plot, in Fig. 2a–c, the likelihood term in Equation (3) for three different target functions. To obtain these curves, we considered a large number of functions (including all functions with up to 5 errors w.r.t. the target, with further functions sampled). For each complexity, we average this term over the l = 5 functions with smallest ϵG. Not surprisingly, functions close to the complexity of the target have the smallest error. These graphs help illustrate how the DNN interacts with data. As the training set size m increases, functions that are most likely to be found upon training to zero training error are increasingly concentrated close to the complexity of the target function.

a–c depict the mean likelihood \(\langle (1-\epsilon _G(K))^m\rangle _5\) from Equation (3), averaged over training sets, and over the 5 lowest error functions at each K. This term depends on the data and is independent of the DNN architecture. With increasing m it peaks more sharply around the complexity of the target. In (d–f) we compare the posteriors over complexity, \({\langle P_{\rm SGD}(K| S)\rangle }_m\), for SGD (darker blue and red), averaged over training sets of size m, to the prediction of 〈P(K∣S)〉m from Equation (3) (lighter blue and orange), calculated by multiplying the Bayesian likelihood curves in (a–c) by the prior P(K) shown in Fig. 1h. The light (Bayes) and dark (DNN) blue histograms are from the σw = 1 system, and the orange (Bayes) and red (DNN) histograms are from the σw = 8 system, which has less bias towards simple functions. The Bayesian decoupling approximation (Equation (3)) captures the dominant trends in the behavior of the SGD-trained networks as a function of data complexity and training set size. Quantitative measures of the similarity between the posteriors can be found in Fig. S22.
To test the decoupling approximation from Eq. (3), we compare in Fig. 2d–f the posterior 〈P(K∣S)〉m, calculated by multiplying the Bayesian likelihood curves from Fig. 2a–c with the two Bayesian priors P(K) from Fig. 1h, to the posteriors \({\langle P_{\rm SGD}(K| S)\rangle }_m\) calculated with advSGD24 over 1000 different parameter initializations and training sets. It is remarkable to see how well the simple decoupling approximation performs across target functions and training set sizes. In Figs. S13 and S14 we demonstrate the robustness of our approach by showing that using l = 1 or l = 50 functions does not change the predictions much. This success suggests that our simple approach captures the essence of the interaction between the data (measured by the likelihood, which is independent of the learning algorithm) and the DNN architecture (measured by the prior, which is independent of the data).
We have therefore separated out two of the three parts of the tripartite scheme, which leaves the training algorithm. In the figures above our Bayesian approximation captures the dominant behavior of an SGD-trained network. This correspondence is consistent with the results and arguments of ref. 43. We checked this further in Fig. S15 for a similar set-up using MSE loss, where Bayesian posteriors can be exactly calculated using Gaussian processes (GPs). The direct Bayesian GP calculation closely matches SGD-based results for our much smaller network. Note that, in the spirit of model calculations, as called for in refs. 3,4, we mainly used a much smaller DNN. But the agreement of its posteriors with the GP-based posteriors, calculated in the infinite-width limit, shows that, at the level of the 1st-order generalization question we are addressing here, the size of the DNN is not an important factor. The width of a DNN can, of course, be a factor for 2nd-order generalization questions.
Beyond the Boolean model: MNIST & CIFAR-10
Can the principles worked out for the Boolean system be observed in larger systems that are closer to the standard practice of DNNs? To this end, we show, in Fig. 3a, b, how the generalization error for the popular image datasets MNIST and CIFAR-10 changes as a function of the standard deviation σw of the initial parameter distribution and the number of layers Nl for a standard FCN trained with SGD on cross-entropy loss with \(\tanh\) activation functions. Larger σw and larger Nl push the system deeper into the chaotic regime36,37 and result in decreasing generalization performance, similar to what we observe for the Boolean system on relatively simple targets. In Fig. 3c, we plot the prior over complexity P(K) for a complexity measure called the critical sample ratio (CSR)28, an estimate of the density of decision boundaries that should be appropriate for this problem. Again, increasing σw greatly increases the prior probability that the DNN produces more complex functions upon random sampling of parameters. Thus the decrease in generalization performance is consistent with the inductive bias of the network becoming less biased towards simplicity, and therefore less well aligned with structure in the data. Indeed, datasets such as MNIST and CIFAR-10 are thought to be relatively simple44,45.
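For reference, here is a crude sketch of how a CSR-style estimate can be obtained for a trained classifier: probe each input with random bounded perturbations and count the fraction whose predicted class flips. The original CSR definition in ref. 28 uses a box-constrained adversarial search; random probing and the radius r here are simplifying assumptions, and `predict` is an assumed interface returning integer class labels for a batch of inputs.

```python
import numpy as np

def critical_sample_ratio(predict, X, r=0.3, n_trials=50, rng=None):
    """Monte Carlo estimate of the fraction of inputs lying within an
    infinity-norm distance r of a decision boundary of `predict`."""
    rng = np.random.default_rng() if rng is None else rng
    base = predict(X)                               # predicted classes on clean inputs
    critical = np.zeros(len(X), dtype=bool)
    for _ in range(n_trials):
        noise = rng.uniform(-r, r, size=X.shape)    # random perturbation in the r-box
        critical |= (predict(X + noise) != base)    # did any probe flip the prediction?
    return critical.mean()
```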

a MNIST generalization error for FCNs trained on a 1000-image training set versus σw for three depths. b CIFAR-10 generalization error for FCNs trained on a 5000-image training set versus σw for three depths. The FCNs, made of multiple hidden layers of width 200, were trained with SGD with batch size 32 and learning rate 10^−3 until 100% accuracy was first achieved on the training set. Error bars are one standard deviation. c Complexity prior P(K), for CSR complexity, for 1000 MNIST images, for randomly initialized networks of 10 layers and σw = 1, 2. Probabilities are estimated from 2 × 10^4 parameter samples. d, e, f Scatterplots of generalization error versus the CSR for 1000 networks trained to 100% accuracy on a training set of 1000 MNIST images and tested on 1000 different images. In (d) the training labels are uncorrupted; in (e, f) 25% and 50% of the training labels are corrupted, respectively. Note the qualitative similarity to the scatter plots in Fig. 1d–f.
These patterns are further illustrated in Fig. 3d–f, where we show scatterplots of generalization error versus CSR complexity for three target functions that vary in complexity (here obtained by corrupting labels). The qualitative behavior is similar to that observed for the Boolean system in Fig. 1. The more simplicity-biased networks perform significantly better on the simpler targets, but the difference from the less simplicity-biased network decreases for more complex targets. While we are unable to directly calculate the likelihoods because these systems are too big, we argue that the strong similarities to our simpler model system suggest that the same basic principles of inductive bias are at work here.