Network properties determine neural network performance

Dimension reduction of networked systems

Real-world complex systems, such as plant-pollinator interactions63 and the spread of COVID-1964, are commonly modeled using networks65,66. Consider a network G = (V, E) with nodes V and edges E. Let n = |V| be the number of nodes in the network. The interactions between nodes can be formulated as a set of differential equations

$$\dot{x}_i=f(x_i)+\sum_{j\in V}P_{ij}\,g(x_i,x_j),\quad \forall i\in V,$$

(1)

where xi is the state of node i in the system. For instance, in an ecological network, xi could represent the abundance of a particular species of plant, while in an epidemic network, it could represent the infection rate of a person. The adjacency matrix P encodes the interaction strength between nodes, where Pij is the entry in row i and column j. The functions f( ⋅ ) and g( ⋅ , ⋅ ) capture the internal and external impacts on node i, respectively. Typically, these functions are nonlinear. Let x = (x1, x2, …, xn). For a small network, given an initial state, one can run a forward simulation to an equilibrium state x*, such that \(\dot{x}_i^*=f(x_i^*)+\sum_{j\in V}P_{ij}g(x_i^*,x_j^*)=0\).
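As a concrete illustration, the following minimal sketch integrates Eq. (1) to an equilibrium for a small random network with SciPy. The choices f(xi) = 1 − xi and g(xi, xj) = xj (linear diffusive coupling), as well as the network size and density, are illustrative assumptions, not the specific system studied here.

```python
# Minimal sketch: forward simulation of Eq. (1) to an equilibrium x*.
# f(x_i) = 1 - x_i and g(x_i, x_j) = x_j are assumed toy choices.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n = 10
# sparse, weighted adjacency matrix P (scaled small so the system is stable)
P = 0.1 * rng.random((n, n)) * (rng.random((n, n)) < 0.3)

def dynamics(t, x):
    # dx_i/dt = f(x_i) + sum_j P_ij g(x_i, x_j)
    return (1.0 - x) + P @ x

sol = solve_ivp(dynamics, (0.0, 50.0), rng.random(n), rtol=1e-8)
x_star = sol.y[:, -1]                                       # approximate equilibrium x*
print("residual at x*:", np.linalg.norm(dynamics(0.0, x_star)))  # ~ 0
```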

However, when the size of the system grows to millions or even billions of nodes, solving the coupled differential equations becomes a major challenge. The problem can be efficiently addressed by employing a mean-field technique23,24, where a linear operator \(\mathcal{L}_P(\cdot)\) is introduced to decouple the system. Specifically, \(\mathcal{L}_P\) depends on the adjacency matrix P and is defined as \(\mathcal{L}_P(\boldsymbol{z})=\frac{\boldsymbol{1}^TP\boldsymbol{z}}{\boldsymbol{1}^TP\boldsymbol{1}}\), where \(\boldsymbol{z}\in \mathcal{R}^n\). Let δin = P1 and δout = 1TP be the in- and out-degrees of the nodes. For a weighted G, the degrees are weighted as well. Applying \(\mathcal{L}_P(\cdot)\) to δin gives

$$\beta_{\rm eff}=\mathcal{L}_P(\boldsymbol{\delta}_{\rm in})=\frac{\boldsymbol{1}^TP\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}}=\frac{\boldsymbol{\delta}_{\rm out}^T\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}},$$

(2)

which proves to be a powerful metric of network resilience, and has been applied to make reliable inferences from incomplete networks67,68. We use it to measure the predictive ability of a neural network, whose training is in essence a dynamical system. For an overview of the related techniques, see Supplementary Note 6.
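For reference, a minimal numpy sketch of Eq. (2); the small eps in the denominator anticipates the 0/0 guard noted after Theorem 1 below.

```python
# Minimal sketch: beta_eff of a weighted adjacency matrix P per Eq. (2).
import numpy as np

def beta_eff(P, eps=1e-12):
    ones = np.ones(P.shape[0])
    delta_in = P @ ones                # weighted in-degrees, P 1
    delta_out = P.T @ ones             # weighted out-degrees, (1^T P)^T
    # beta_eff = (delta_out^T delta_in) / (1^T delta_in); eps avoids 0/0
    return delta_out @ delta_in / (ones @ delta_in + eps)

P = np.random.default_rng(1).random((5, 5))
print(beta_eff(P))
```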

Neural network training is a dynamical system

Conventionally, training a neural network is a nonlinear optimization problem. Because of the hierarchical structure of neural networks, the training procedure is implemented by two alternating procedures: forward-propagation (FP) and back-propagation (BP), as described in Fig. 1a. During FP, data flows from the input layer through the hidden layers to the output layer, which produces the predictions for the input data. The differences between the outputs and the labels of the input data define an objective function \(\mathcal{C}\), a.k.a. the training error function. BP proceeds to minimize \(\mathcal{C}\) in the reverse direction of FP, propagating the error from the output layer down to the input layer. The trainable weights of the synaptic connections are updated accordingly.

Let GA be a neural network, w be the flattened weight vector of GA, and z be the activation values. As a whole, the training of a neural network GA can be described with two coupled dynamics: \(\mathcal{A}\) on GA, and \(\mathcal{B}\) on GB, where nodes in GA are neurons, and nodes in GB are the synaptic connections. The coupling relation arises from the strong inter-dependency between z and w: the states z (activation values or activation gradients) of GA are the parameters of \(\mathcal{B}\), and the states w of GB are the trainable parameters of GA. If we put the whole training process in the context of networked systems, \(\mathcal{A}\) denotes a node dynamics because the states of nodes evolve during FP, and \(\mathcal{B}\) expresses an edge dynamics because of the updates of edge weights during BP13,69,70. Mathematically, we formulate the node and edge dynamics based on the gradients of \(\mathcal{C}\):

$$(\mathcal{A})\quad d\boldsymbol{z}/dt \,\approx\, h_{\mathcal{A}}(\boldsymbol{z},t;\boldsymbol{w})=-\nabla_{\boldsymbol{z}}\mathcal{C}(\boldsymbol{z}(t)),$$

(3)

$$(\mathcal{B})\quad d\boldsymbol{w}/dt \,\approx\, h_{\mathcal{B}}(\boldsymbol{w},t;\boldsymbol{z})=-\nabla_{\boldsymbol{w}}\mathcal{C}(\boldsymbol{w}(t)),$$

(4)

where t denotes the training step. Let \(a_i^{(\ell)}\) be the pre-activation of node i on layer ℓ, and σℓ( ⋅ ) be the activation function of layer ℓ. Usually, the output activation function is a softmax. The hierarchical structure of GA exerts constraints over z for neighboring layers, i.e., \(z_i^{(\ell)}=\sigma_\ell(a_i^{(\ell)}),\ 1\le i\le n_\ell,\ \forall\, 1\le \ell < L\) and \(z_k^{(L)}=\exp\{a_k^{(L)}\}/\sum_j\exp\{a_j^{(L)}\},\ 1\le k\le n_L\), where nℓ is the total number of neurons on layer ℓ, and GA has L + 1 layers. It also presents a dependency between z and w, e.g., when GA is an MLP without bias, \(a_i^{(\ell)}={\boldsymbol{w}_i^{(\ell)}}^T\boldsymbol{z}^{(\ell-1)}\), which builds a connection from GA to GB. It is obvious that, given w, an activation z satisfying all these constraints is also a fixed point of \(\mathcal{A}\). Meanwhile, an equilibrium state of \(\mathcal{B}\) provides a set of optimal weights for GA.
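The layer-wise constraints can be made concrete with a short sketch of an MLP forward pass without bias, with ReLU hidden layers and a softmax output; the layer sizes are illustrative assumptions.

```python
# Minimal sketch: the constraints z^(l) = sigma_l(a^(l)), a^(l) = W^(l) z^(l-1),
# with ReLU hidden layers and a softmax output layer.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 6, 3]                         # n_0, ..., n_L with L = 3
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    z = x
    for Wl in W[:-1]:
        z = np.maximum(Wl @ z, 0.0)          # hidden layers: z^(l) = ReLU(W^(l) z^(l-1))
    a_L = W[-1] @ z
    e = np.exp(a_L - a_L.max())              # output layer: softmax
    return e / e.sum()

# given w, the resulting activations z satisfy all constraints,
# hence form a fixed point of the node dynamics A
print(forward(rng.normal(size=4)))
```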

The metric βeff is a universal metric to characterize different types of networks, including biological neural networks71. Because of this generality, we analyze how βeff behaves on artificial neural networks, which are designed to mimic their biological counterparts for general intelligence. We therefore set up an analog system for the trainable weights. To this end, we build a line graph over the trainable weights and reformulate the training dynamics in the same form as the general dynamics (Eq. (1)). The reformulated dynamics reveals a simple yet powerful property of βeff, which we utilize to predict the final accuracy of GA from a few observations during the early phase of training.

Quantify the interaction strengths of edges

In SGD, at each step a batch of samples is chosen to update w, i.e., \(\boldsymbol{w}\leftarrow \boldsymbol{w}-\alpha \nabla_{\boldsymbol{w}}\mathcal{C}\), where α > 0 is the learning rate. When the desired conditions are met, training is terminated. Let \(\boldsymbol{\delta}^{(\ell)}={[\partial \mathcal{C}/\partial z_1^{(\ell)},\cdots,\partial \mathcal{C}/\partial z_{n_\ell}^{(\ell)}]}^T\in \mathcal{R}^{n_\ell}\) be the activation gradients (in some literature δ(ℓ) is defined as the gradients with respect to a(ℓ), which does not affect our analysis), and \(\boldsymbol{\sigma}_{\ell}^{\prime}={[\sigma_{\ell,1}^{\prime},\cdots,\sigma_{\ell,n_\ell}^{\prime}]}^T\in \mathcal{R}^{n_\ell}\) be the derivatives of the activation function σℓ of layer ℓ, with \(\sigma_{\ell,k}^{\prime}=\sigma_{\ell}^{\prime}(a_k^{(\ell)}),\ 1\le k\le n_\ell,\ 1\le \ell \le L\). To understand how the weights W(ℓ) affect each other, we explicitly expand δ(ℓ) and have \(\boldsymbol{\delta}^{(\ell)}=W^{(\ell+1)T}\left(W^{(\ell+2)T}\left(\cdots W^{(L-1)T}\left(W^{(L)T}(\boldsymbol{z}^{(L)}-\boldsymbol{y})\odot \boldsymbol{\sigma}_{L-1}^{\prime}\right)\cdots \odot \boldsymbol{\sigma}_{\ell+2}^{\prime}\right)\odot \boldsymbol{\sigma}_{\ell+1}^{\prime}\right)\), where ⊙ is the Hadamard product. We find that W(ℓ) is associated with all accessible parameters on the downstream layers, and the recursive relation defines a high-order hyper-network interaction72 between any W(ℓ) and the other parameters. Using the fact that x ⊙ y = Λ(y)x, where Λ(y) is a diagonal matrix with the entries of y on the diagonal, we have \(\boldsymbol{\delta}^{(\ell)}=W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\boldsymbol{\delta}^{(\ell+1)}=W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})W^{(\ell+2)T}\Lambda(\boldsymbol{\sigma}_{\ell+2}^{\prime})\cdots W^{(L-1)T}\Lambda(\boldsymbol{\sigma}_{L-1}^{\prime})W^{(L)T}(\boldsymbol{z}^{(L)}-\boldsymbol{y})\). For a ReLU σℓ( ⋅ ), \(\boldsymbol{\sigma}_{\ell}^{\prime}\) is binary, depending on the signs of the pre-activation values a(ℓ) of layer ℓ. If \(a_i^{(\ell)}\le 0\), then \(\sigma_{\ell}^{\prime}(a_i^{(\ell)})=0\), blocking a BP propagation route of the prediction deviations z(L) − y and giving rise to vanishing gradients.
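A short numpy sketch of the recursion δ(ℓ) = W(ℓ+1)TΛ(σ′ℓ+1)δ(ℓ+1) on a toy ReLU MLP (sizes and data are illustrative assumptions); the binary mask makes the blocked BP routes explicit.

```python
# Minimal sketch: backward recursion for the activation gradients delta^(l)
# on a tiny ReLU MLP with softmax + cross-entropy output.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4)); W2 = rng.normal(size=(6, 8)); W3 = rng.normal(size=(3, 6))

x = rng.normal(size=4)
a1 = W1 @ x;  z1 = np.maximum(a1, 0.0)       # layer 1 (ReLU)
a2 = W2 @ z1; z2 = np.maximum(a2, 0.0)       # layer 2 (ReLU)
a3 = W3 @ z2
e = np.exp(a3 - a3.max()); z3 = e / e.sum()  # layer 3 (softmax output)
y = np.eye(3)[0]                             # one-hot label

delta2 = W3.T @ (z3 - y)                     # delta^(2) = W^(3)T (z^(3) - y)
mask2 = (a2 > 0).astype(float)               # sigma'_2: binary for ReLU
# x (Hadamard) y = Lambda(y) x, so the mask acts as a diagonal matrix
delta1 = W2.T @ (mask2 * delta2)             # delta^(1) = W^(2)T Lambda(sigma'_2) delta^(2)
# units with a_i^(2) <= 0 contribute nothing: their BP routes are blocked
print(delta1)
```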

We intend to build direct interactions between synaptic connections. This can be done by identifying which units provide direct physical interactions to a given unit, i.e., which appear on the right-hand side of its differential equation \(\mathcal{B}\) in Eq. (4), and how strongly such interactions come into play. There are multiple routes to build a direct interaction between any pair of network weights from different layers, as presented by the product terms in δ(ℓ). However, the coupled interactions make an exact attribution impossible, a difficulty well known as the credit assignment problem51,73. We propose a remedy: the impact of all the other units on W(ℓ) is approximated by the direct, local impact from W(ℓ+1), and the contribution of the others as a whole is encoded in the activation gradient δ(ℓ+1). Moreover, we have the weight gradient (Supplementary Note 1)

$$\boldsymbol{\nabla}_W^{(\ell)}=\Lambda(\boldsymbol{\sigma}_{\ell}^{\prime})\boldsymbol{\delta}^{(\ell)}\boldsymbol{z}^{(\ell-1)T}=\Lambda(\boldsymbol{\sigma}_{\ell}^{\prime})W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\boldsymbol{\delta}^{(\ell+1)}\boldsymbol{z}^{(\ell-1)T},$$

(5)

which shows the dependency of W(ℓ) on W(ℓ+1), and can itself be viewed as an explicit description of the dynamical system \(\mathcal{B}\) in Eq. (4). Written as a differential equation, we have

$$\frac{dW^{(\ell)}}{dt}=-\Lambda(\boldsymbol{\sigma}_{\ell}^{\prime})W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\boldsymbol{\delta}^{(\ell+1)}\boldsymbol{z}^{(\ell-1)T} \,\triangleq\, F(W^{(\ell+1)}).$$

(6)
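As a sanity check of Eq. (5), the following sketch compares the analytic gradient Λ(σ′ℓ)δ(ℓ)z(ℓ−1)T with TensorFlow's automatic differentiation on a tiny random MLP; the sizes, data, and layer choices are illustrative assumptions.

```python
# Minimal sketch: verify Eq. (5) against tf.GradientTape on a toy MLP.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
W1 = tf.constant(rng.normal(size=(8, 4)))
W2 = tf.Variable(rng.normal(size=(6, 8)))   # layer l = 2, the gradient we check
W3 = tf.constant(rng.normal(size=(3, 6)))
x = tf.constant(rng.normal(size=4)); y = tf.constant(np.eye(3)[0])

with tf.GradientTape() as tape:
    a1 = tf.linalg.matvec(W1, x);  z1 = tf.nn.relu(a1)
    a2 = tf.linalg.matvec(W2, z1); z2 = tf.nn.relu(a2)
    z3 = tf.nn.softmax(tf.linalg.matvec(W3, z2))
    C = -tf.reduce_sum(y * tf.math.log(z3))                # cross-entropy

autodiff = tape.gradient(C, W2)
delta2 = tf.linalg.matvec(W3, z3 - y, transpose_a=True)    # delta^(2) = W^(3)T (z^(3) - y)
sigma2p = tf.cast(a2 > 0, tf.float64)                      # ReLU derivative
analytic = tf.tensordot(sigma2p * delta2, z1, axes=0)      # Lambda(sigma'_2) delta^(2) z^(1)T
print(np.allclose(autodiff.numpy(), analytic.numpy()))     # True
```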

Because of the mutual dependency of the weights and the activation values, it is hard to make an exact decomposition of the impacts of the different parameters on W(ℓ). But in the gradient \(\boldsymbol{\nabla}_W^{(\ell)}\), W(ℓ+1) appears as an explicit term and contributes the direct impact on W(ℓ). To capture this direct impact and derive the adjacency matrix P for GB, we apply a Taylor expansion to \(\boldsymbol{\nabla}_W^{(\ell)}\) and have

$$P^{(\ell,\ell+1)}=\partial^2\mathcal{C}/\partial W^{(\ell)}\partial W^{(\ell+1)},$$

(7)

which defines the interaction strength between each pair of weights from layer ℓ + 1 to layer ℓ. For a detailed derivation of P on MLPs and general neural networks, see Supplementary Notes 2 and 3. Let w = (w1, w2, …) be a flattened vector of all trainable weights of GA. Given a pair of weights wi and wj, one from layer ℓ1 and the other from layer ℓ2: if ℓ2 = ℓ1 + 1, the entry Pij is defined according to Eq. (7); otherwise Pij = 0. Considering the scale of the trainable parameters in GA, P is very sparse. Let W(ℓ+1)* be the equilibrium states (Supplementary Note 3); the training dynamics Eq. (6) is then reformulated into the form of Eq. (1), and gives the edge dynamics \(\mathcal{B}\) for GB:

$$\dot{w}_i=f(w_i)+\sum_j P_{ij}\,g(w_i,w_j),$$

(8)

with \(f(w_i)=F(w_i^*)\) and \(g(w_i,w_j)=w_j-w_j^*\). The values of the weights at an equilibrium state \(\{w_j^*\}\) are unknown, but they are constants and do not affect the computation of βeff.
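Under these definitions, only the blocks P(ℓ,ℓ+1) coupling adjacent layers are nonzero. A minimal sketch of one such block via TensorFlow's nested gradient tapes; the MLP, its sizes, and the data are illustrative assumptions.

```python
# Minimal sketch: one block P^(l,l+1) = d^2 C / dW^(l) dW^(l+1) of Eq. (7),
# computed with nested gradient tapes on a toy MLP.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
W2 = tf.Variable(rng.normal(size=(6, 8)))   # layer l
W3 = tf.Variable(rng.normal(size=(3, 6)))   # layer l + 1
z1 = tf.constant(rng.normal(size=8)); y = tf.constant(np.eye(3)[0])

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        z2 = tf.nn.relu(tf.linalg.matvec(W2, z1))
        z3 = tf.nn.softmax(tf.linalg.matvec(W3, z2))
        C = -tf.reduce_sum(y * tf.math.log(z3))
    g2 = inner.gradient(C, W2)              # dC/dW^(l), recorded by the outer tape
P_block = outer.jacobian(g2, W3)            # d^2 C / dW^(l) dW^(l+1), shape (6, 8, 3, 6)
P_block = tf.reshape(P_block, (6 * 8, 3 * 6))  # rows: w_i on layer l; cols: w_j on layer l+1
print(P_block.shape)
```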

Property of the neural capacitance

According to Eq. (7), we have the weighted adjacency matrix P of GB in place. The matrix P encodes rich information about the network, such as the topology, the weights, the gradients, and, indirectly, the training labels. Now we quantify the total impact that a trainable parameter (or synaptic connection) receives from itself and the others, corresponding to the weighted in-degrees δin = P1. Applying \(\mathcal{L}_P(\cdot)\) to δin, we get a “counterpart” metric \(\beta_{\rm eff}=\mathcal{L}_P(\boldsymbol{\delta}_{\rm in})\) to measure the predictive ability of a neural network GA, just as the resilience metric (Eq. (2)) does for a general network G. If GA is an MLP, we can explicitly write the entries of P and βeff; for the details of deriving P and βeff of an MLP, see Supplementary Note 2. Moreover, we prove in Theorem 1 below that as GA converges, \(\boldsymbol{\nabla}_W^{(\ell)}\) vanishes and βeff approaches zero (see Supplementary Note 4).

Theorem 1

Let ReLU be the activation function of GA. If GA converges, then βeff = 0.

Note that a small value is added to the denominator of Eq. (2) to avoid a possible 0/0.

Algorithm 1

Implement NCP and compute βeff(t)

Input: Pre-trained source model \(\mathcal{F}_s=\{\mathcal{F}_s^{(1)},\mathcal{F}_s^{(2)}\}\) with bottom \(\mathcal{F}_s^{(1)}\) and output layer \(\mathcal{F}_s^{(2)}\), target dataset Dt, maximum epoch T

1: Remove \(\mathcal{F}_s^{(2)}\) from \(\mathcal{F}_s\) and add on top of \(\mathcal{F}_s^{(1)}\) an NCP unit \(\mathcal{U}\) with multiple layers (Fig. 1b)

2: Randomly initialize and freeze \(\mathcal{U}\)

3: Train the target model \(\mathcal{F}_t=\{\mathcal{F}_s^{(1)},\mathcal{U}\}\) by fine-tuning \(\mathcal{F}_s^{(1)}\) on Dt for T epochs

4: Obtain P from \(\mathcal{U}\) according to Eq.(7)

5: Compute βeff with P according to Eq.(2)
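A minimal tf.keras sketch of Algorithm 1 follows; the MobileNetV2 backbone, the NCP unit's width and depth, and the commented-out helper names are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Minimal sketch of Algorithm 1 (NCP) with tf.keras.
import tensorflow as tf

num_classes, T = 10, 5

# Step 1: beheaded source model F_s^(1) plus a multi-layer NCP unit U on top
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet")  # F_s^(1) (assumed backbone)
ncp = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
], name="ncp_unit")                                        # U (assumed width/depth)

# Step 2: U is randomly initialized (by construction) and frozen
ncp.trainable = False
backbone.trainable = True

# Step 3: fine-tune F_s^(1) on the target dataset D_t for T epochs
model = tf.keras.Sequential([backbone, ncp])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
# model.fit(x_target, y_target, epochs=T)                  # D_t not shown here

# Steps 4-5: P from U via Eq. (7) (nested gradient tapes, as sketched above),
# then beta_eff from P via Eq. (2)
# P = mixed_second_derivative_blocks(model, ncp.weights)   # hypothetical helper
# beta = beta_eff(P)                                       # see the Eq. (2) sketch
```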

For an MLP GA, it is possible to derive an analytical form of βeff. However, this becomes extremely complicated for a deep neural network with multiple convolutional layers. To realize βeff for deep neural networks of any form, we take advantage of the automatic differentiation implemented in TensorFlow74. Considering the number of parameters, it is still computationally prohibitive to calculate βeff for the entire GA.

Therefore, we seek to derive a surrogate from a part of GA. Specifically, we insert a neural capacitance probe (NCP) unit, i.e., we put additional layers on top of the beheaded GA (excluding the original output layer), and estimate the predictive ability of the entire GA using the βeff of the NCP unit. Accordingly, we call βeff a neural capacitance.

Bayesian ridge regression

Ridge regression introduces an ℓ2-regularization to linear regression, and solves the problem

$$\arg\min_{\boldsymbol{\theta}}\,(\boldsymbol{y}-X\boldsymbol{\theta})^T(\boldsymbol{y}-X\boldsymbol{\theta})+\lambda\parallel\boldsymbol{\theta}\parallel_2^2,$$

(9)

where \(X\in\mathcal{R}^{n\times d}\), \(\boldsymbol{y}\in\mathcal{R}^n\), \(\boldsymbol{\theta}\in\mathcal{R}^d\) is the associated set of coefficients, and the hyper-parameter λ > 0 controls the impact of the penalty term \(\parallel\boldsymbol{\theta}\parallel_2^2\). Bayesian ridge regression introduces uninformative priors over the hyper-parameters, and estimates a probabilistic model of the problem in Eq. (9). Usually, the ordinary least squares method posits the conditional distribution of y to be a Gaussian, i.e., \(p(\boldsymbol{y}|X,\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{y}|X\boldsymbol{\theta},\sigma^2I_n)\), where σ > 0 is a hyper-parameter to be tuned and In is the n × n identity matrix. Moreover, we assume a spherical Gaussian prior over θ, i.e., \(p(\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0},\tau^2I_d)\), where τ > 0 is another hyper-parameter to be estimated from the data at hand and Id is the d × d identity matrix. According to Bayes’ theorem, p(θ∣X, y) ∝ p(θ)p(y∣X, θ), so the estimates of the model are made by maximizing the posterior distribution p(θ∣X, y), i.e., \(\arg\max_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}|X,\boldsymbol{y})=\arg\max_{\boldsymbol{\theta}}\log\mathcal{N}(\boldsymbol{y}|X\boldsymbol{\theta},\sigma^2I_n)+\log\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0},\tau^2I_d)\), which is a maximum-a-posteriori (MAP) estimation of the ridge regression with λ = σ2/τ2. All of θ, λ, and τ are estimated jointly during model fitting, and \(\sigma=\tau\sqrt{\lambda}\). Based on the approach proposed by Tipping29 and MacKay75 to update the parameters λ and τ, we estimate I = h(βeff; θ) with scikit-learn76. We can summarize the application of Bayesian ridge regression to our framework as follows (a code sketch follows the list):

  • Inputs: \(\{(\beta_{{\rm eff},k},I_k)\}_{k=1,2,\ldots,K}\) is a set of observations, where βeff,k is the proposed metric calculated from the training set, Ik represents the validation accuracy, and K is the total number of observations collected during the early stage of model training.

  • Output: I − h(βeff; θ) = 0, where θ denotes the fitted parameters of the Bayesian ridge regression.

  • Prediction: I* = h(0; θ), as per Theorem 1.
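Putting the pieces together, a minimal scikit-learn sketch of the fit-and-extrapolate step; the observation values below are synthetic stand-ins for early-training measurements, not real data.

```python
# Minimal sketch: fit I = h(beta_eff; theta) with Bayesian ridge regression
# and predict the final accuracy I* = h(0; theta) per Theorem 1.
import numpy as np
from sklearn.linear_model import BayesianRidge

# K synthetic early-training observations: beta_eff decays while accuracy rises
beta_eff_k = np.array([0.95, 0.71, 0.52, 0.38, 0.27, 0.19]).reshape(-1, 1)
I_k = np.array([0.31, 0.45, 0.55, 0.62, 0.67, 0.71])   # validation accuracies

reg = BayesianRidge()          # lambda and tau are estimated from the data
reg.fit(beta_eff_k, I_k)

I_star, I_std = reg.predict([[0.0]], return_std=True)  # evaluate h at beta_eff = 0
print(f"predicted final accuracy: {I_star[0]:.3f} +/- {I_std[0]:.3f}")
```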

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
