Dimension reduction of networked systems
Real-world complex systems, such as plant-pollinator interactions^{63} and the spread of COVID-19^{64}, are commonly modeled using networks^{65,66}. Consider a network G = (V, E) with nodes V and edges E, and let n = ∣V∣ be the number of nodes. The interactions between nodes can be formulated as a set of differential equations
$$\dot{x}_i=f(x_i)+\sum_{j\in V}P_{ij}\,g(x_i, x_j),\quad \forall i\in V,$$
(1)
where x_{i} is the state of node i in the system. For instance, in an ecological network, x_{i} could represent the abundance of a particular plant species, while in an epidemic network, it could represent the infection rate of a person. The adjacency matrix P encodes the interaction strength between nodes, where P_{ij} is the entry in row i and column j. The functions f( ⋅ ) and g( ⋅ , ⋅ ) capture the internal and external impacts on node i, respectively. Typically, these functions are nonlinear. Let x = (x_{1}, x_{2}, …, x_{n}). For a small network, given an initial state, one can run a forward simulation to reach an equilibrium state x^{*} such that \(\dot{x}_i^*=f(x_i^*)+\sum_{j\in V}P_{ij}\,g(x_i^*,x_j^*)=0\).
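As an illustration, the following minimal sketch integrates Eq. (1) to an approximate equilibrium for a small synthetic network. The particular choices here are assumptions for illustration only: a random sparse adjacency matrix P, logistic self-dynamics f, and diffusive coupling g, not the specific models analyzed in this work.

```python
# Minimal sketch: forward simulation of Eq. (1) to an equilibrium state x*.
# The network and the functional forms of f and g are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n = 10
P = rng.uniform(0.0, 1.0, size=(n, n)) * (rng.random((n, n)) < 0.3)  # sparse weighted adjacency

def f(x):             # internal dynamics (assumed logistic form)
    return x * (1.0 - x)

def rhs(t, x):
    # dx_i/dt = f(x_i) + sum_j P_ij g(x_i, x_j), with g(x_i, x_j) = x_j - x_i (assumed)
    return f(x) + (P * (x[None, :] - x[:, None])).sum(axis=1)

sol = solve_ivp(rhs, (0.0, 200.0), rng.random(n), rtol=1e-8, atol=1e-10)
x_star = sol.y[:, -1]                                 # approximate equilibrium x*
print("max |dx/dt| at x*:", np.abs(rhs(0.0, x_star)).max())
```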
However, when the system scales to millions or even billions of nodes, solving the coupled differential equations becomes a major challenge. The problem can be efficiently addressed with a mean-field technique^{23,24}, in which a linear operator \(\mathcal{L}_P(\cdot)\) is introduced to decouple the system. Specifically, \(\mathcal{L}_P\) depends on the adjacency matrix P and is defined as \(\mathcal{L}_P(\boldsymbol{z})=\frac{\boldsymbol{1}^T P\boldsymbol{z}}{\boldsymbol{1}^T P\boldsymbol{1}}\), where \(\boldsymbol{z}\in\mathcal{R}^n\). Let δ_{in} = P1 and δ_{out} = 1^{T}P be the in- and out-degrees of the nodes. For a weighted G, the degrees are weighted as well. Applying \(\mathcal{L}_P(\cdot)\) to δ_{in} gives
$$\beta_{\rm eff}=\mathcal{L}_P(\boldsymbol{\delta}_{\rm in})=\frac{\boldsymbol{1}^T P\,\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}}=\frac{\boldsymbol{\delta}_{\rm out}\,\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}},$$
(2)
which proves to be a powerful metric for measuring the resilience of networks, and has been applied to make reliable inferences from incomplete networks^{67,68}. We use it to measure the predictive ability of a neural network, whose training is, in essence, a dynamical system. For an overview of the related techniques, see Supplementary Note 6.
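For concreteness, a minimal sketch of Eq. (2) is given below. It assumes only that the (possibly weighted) adjacency matrix P is available as a NumPy array, and it adds a small constant to the denominator to avoid division by zero (as also noted after Theorem 1 below).

```python
# Minimal sketch: beta_eff of Eq. (2) from a weighted adjacency matrix P.
import numpy as np

def beta_eff(P, eps=1e-12):
    one = np.ones(P.shape[0])
    delta_in = P @ one                           # weighted in-degrees, P 1
    # L_P(delta_in) = 1^T P delta_in / 1^T delta_in = delta_out delta_in / 1^T delta_in
    return float(one @ P @ delta_in) / (float(one @ delta_in) + eps)

# usage on a random weighted network (illustrative)
P = np.abs(np.random.default_rng(1).standard_normal((5, 5)))
print(beta_eff(P))
```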
Neural network training is a dynamical system
Conventionally, training a neural network is a nonlinear optimization problem. Because of the hierarchical structure of neural networks, the training procedure alternates between two phases: forward-propagation (FP) and backpropagation (BP), as described in Fig. 1a. During FP, data pass through the input layer and hidden layers up to the output layer, which produces predictions for the input data. The differences between the outputs and the labels of the input data define an objective function \(\mathcal{C}\), a.k.a. the training error function. BP then minimizes \(\mathcal{C}\) by propagating the error from the output layer back to the input layer, in the reverse order of FP. The trainable weights of the synaptic connections are updated accordingly.
Let G_{A} be a neural network, w be the flattened weight vector of G_{A}, and z be the activation values. As a whole, the training of a neural network G_{A} can be described with two coupled dynamics: \(\mathcal{A}\) on G_{A} and \(\mathcal{B}\) on G_{B}, where nodes in G_{A} are neurons and nodes in G_{B} are the synaptic connections. The coupling relation arises from the strong interdependency between z and w: the states z (activation values or activation gradients) of G_{A} are the parameters of \(\mathcal{B}\), and the states w of G_{B} are the trainable parameters of G_{A}. If we put the whole training process in the context of networked systems, \(\mathcal{A}\) denotes a node dynamics because the states of nodes evolve during FP, and \(\mathcal{B}\) denotes an edge dynamics because the edge weights are updated during BP^{13,69,70}. Mathematically, we formulate the node and edge dynamics based on the gradients of \(\mathcal{C}\):
$$(\mathcal{A}) \quad d\boldsymbol{z}/dt \,\approx\, h_{\mathcal{A}}(\boldsymbol{z},t;\boldsymbol{w})=\nabla_{\boldsymbol{z}}\,\mathcal{C}(\boldsymbol{z}(t)),$$
(3)
$$(\mathcal{B}) \quad d\boldsymbol{w}/dt \,\approx\, h_{\mathcal{B}}(\boldsymbol{w},t;\boldsymbol{z})=\nabla_{\boldsymbol{w}}\,\mathcal{C}(\boldsymbol{w}(t)),$$
(4)
where t denotes the training step. Let \(a_i^{(\ell)}\) be the pre-activation of node i on layer ℓ, and σ_{ℓ}( ⋅ ) be the activation function of layer ℓ. Usually, the output activation function is a softmax. The hierarchical structure of G_{A} exerts constraints over z for neighboring layers, i.e., \(z_i^{(\ell)}=\sigma_\ell(a_i^{(\ell)}),\ 1\le i\le n_\ell,\ \forall\, 1\le\ell < L\) and \(z_k^{(L)}=\exp\{a_k^{(L)}\}/\sum_j\exp\{a_j^{(L)}\},\ 1\le k\le n_L\), where n_{ℓ} is the total number of neurons on layer ℓ, and G_{A} has L + 1 layers. The structure also introduces a dependency between z and w, e.g., when G_{A} is an MLP without bias, \(a_i^{(\ell)}={\boldsymbol{w}_i^{(\ell)}}^T\boldsymbol{z}^{(\ell-1)}\), which builds a connection from G_{A} to G_{B}. Clearly, given w, an activation z satisfying all these constraints is also a fixed point of \(\mathcal{A}\). Meanwhile, an equilibrium state of \(\mathcal{B}\) provides a set of optimal weights for G_{A}.
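To make these constraints concrete, the following minimal sketch evaluates the forward-propagation constraints \(a^{(\ell)}=W^{(\ell)}\boldsymbol{z}^{(\ell-1)}\), \(z^{(\ell)}=\sigma_\ell(a^{(\ell)})\) with a softmax on the output layer. The network here is an assumed small ReLU MLP without bias, with arbitrary layer sizes and random weights, used only for illustration.

```python
# Minimal sketch: FP constraints for an assumed small ReLU MLP without bias.
import numpy as np

rng = np.random.default_rng(1)
sizes = [8, 16, 16, 3]                        # n_0 (input), n_1, n_2, n_L (illustrative)
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws):
    z, acts = x, [x]
    for l, W in enumerate(Ws, start=1):
        a = W @ z                             # pre-activation a^(l) = W^(l) z^(l-1)
        z = softmax(a) if l == len(Ws) else np.maximum(a, 0.0)  # ReLU hidden, softmax output
        acts.append(z)
    return acts                               # [z^(0), z^(1), ..., z^(L)]

acts = forward(rng.standard_normal(sizes[0]), Ws)
print([z.shape for z in acts], acts[-1].sum())   # output probabilities sum to 1
```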
The metric β_{eff} is a universal metric to characterize different types of networks, including biological neural networks^{71}. Because of this generality, we analyze how β_{eff} behaves on artificial neural networks, which are designed to mimic their biological counterparts for general intelligence. To this end, we set up an analogous system for the trainable weights: we build a line graph over the trainable weights and reformulate the training dynamics in the same form as the general dynamics (Eq. (1)). The reformulated dynamics reveal a simple yet powerful property of β_{eff}, which is used to predict the final accuracy of G_{A} from a few observations during the early phase of training.
Quantifying the interaction strengths of edges
In SGD, a batch of samples is chosen at each step to update w, i.e., \(\boldsymbol{w}\leftarrow\boldsymbol{w}-\alpha\nabla_{\boldsymbol{w}}\mathcal{C}\), where α > 0 is the learning rate. When the desired conditions are met, training is terminated. Let \(\boldsymbol{\delta}^{(\ell)}=[\partial\mathcal{C}/\partial z_1^{(\ell)},\cdots,\partial\mathcal{C}/\partial z_{n_\ell}^{(\ell)}]^T\in\mathcal{R}^{n_\ell}\) be the activation gradients (in some literature, δ^{(ℓ)} is defined as the gradients with respect to a^{(ℓ)}, which does not affect our analysis), and \(\boldsymbol{\sigma}_\ell^{\prime}=[\sigma_{\ell,1}^{\prime},\cdots,\sigma_{\ell,n_\ell}^{\prime}]^T\in\mathcal{R}^{n_\ell}\) be the derivatives of the activation function σ for layer ℓ, with \(\sigma_{\ell,k}^{\prime}=\sigma_\ell^{\prime}(a_k^{(\ell)}),\ 1\le k\le n_\ell,\ 1\le\ell\le L\). To understand how the weights W^{(ℓ)} affect each other, we explicitly expand δ^{(ℓ)} and obtain \(\boldsymbol{\delta}^{(\ell)}=W^{(\ell+1)T}\big(W^{(\ell+2)T}\big(\cdots W^{(L-1)T}\big(W^{(L)T}(\boldsymbol{z}^{(L)}-\boldsymbol{y})\odot\boldsymbol{\sigma}_{L-1}^{\prime}\big)\cdots\odot\boldsymbol{\sigma}_{\ell+2}^{\prime}\big)\odot\boldsymbol{\sigma}_{\ell+1}^{\prime}\big)\), where ⊙ is the Hadamard product. We find that W^{(ℓ)} is associated with all accessible parameters on downstream layers, and the recursive relation defines a high-order hypernetwork interaction^{72} between W^{(ℓ)} and the other parameters. Using the fact that x ⊙ y = Λ(y)x, where Λ(y) is a diagonal matrix with the entries of y on the diagonal, we have \(\boldsymbol{\delta}^{(\ell)}=W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\boldsymbol{\delta}^{(\ell+1)}=W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})W^{(\ell+2)T}\Lambda(\boldsymbol{\sigma}_{\ell+2}^{\prime})\cdots W^{(L-1)T}\Lambda(\boldsymbol{\sigma}_{L-1}^{\prime})W^{(L)T}(\boldsymbol{z}^{(L)}-\boldsymbol{y})\). For a ReLU σ_{ℓ}( ⋅ ), \(\boldsymbol{\sigma}_\ell^{\prime}\) is binary, depending on the sign of the pre-activation values a^{(ℓ)} of layer ℓ. If \(a_i^{(\ell)}\le 0\), then \(\sigma_\ell^{\prime}(a_i^{(\ell)})=0\), blocking one BP propagation route of the prediction deviations z^{(L)} − y and giving rise to vanishing gradients.
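This recursion can be verified numerically. The sketch below again uses an assumed small ReLU MLP with a softmax output and cross-entropy loss, so that the gradient carried from the output layer is z^{(L)} − y; it computes δ^{(ℓ)} layer by layer and counts the ReLU units that block a propagation route.

```python
# Minimal sketch: BP recursion delta^(l) = W^(l+1)^T Lambda(sigma'_{l+1}) delta^(l+1)
# for an assumed small ReLU MLP with softmax output and cross-entropy loss.
import numpy as np

rng = np.random.default_rng(2)
sizes = [8, 16, 16, 3]
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])
y = np.eye(sizes[-1])[0]                               # one-hot label (illustrative)

# forward pass, caching pre-activations a^(l)
zs, As = [x], []
for l, W in enumerate(Ws, start=1):
    a = W @ zs[-1]
    As.append(a)                                       # As[l-1] == a^(l)
    zs.append(np.exp(a - a.max()) / np.exp(a - a.max()).sum()
              if l == len(Ws) else np.maximum(a, 0.0))

# backward pass: delta^(L-1) = W^(L)^T (z^(L) - y), then the recursion above
L = len(Ws)
deltas = {L - 1: Ws[-1].T @ (zs[-1] - y)}
for l in range(L - 2, 0, -1):
    relu_prime = (As[l] > 0).astype(float)             # sigma'_{l+1}, binary for ReLU
    deltas[l] = Ws[l].T @ (relu_prime * deltas[l + 1])

blocked = {l + 1: int((As[l] <= 0).sum()) for l in range(1, L)}  # blocked routes per layer
print({l: d.shape for l, d in deltas.items()}, blocked)
```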
We intend to build direct interactions between synaptic connections. This requires identifying which units exert a direct physical influence on a given unit, i.e., which units appear on the right-hand side of its differential equation \(\mathcal{B}\) in Eq. (4), and how strong such interactions are. There are multiple routes to build a direct interaction between any pair of network weights from different layers, as shown by the product terms in δ^{(ℓ)}. However, the coupled interactions make an exact attribution impossible, which is well known as the credit assignment problem^{51,73}. We propose a remedy: the impact of all the other units on W^{(ℓ)} is approximated by the direct, local impact from W^{(ℓ+1)}, and the contribution of the others as a whole is encoded in the activation gradient δ^{(ℓ+1)}. Moreover, we have the weight gradient (Supplementary Note 1)
$$\boldsymbol{\nabla}_W^{(\ell)}=\Lambda(\boldsymbol{\sigma}_\ell^{\prime})\,\boldsymbol{\delta}^{(\ell)}\boldsymbol{z}^{(\ell-1)T}=\Lambda(\boldsymbol{\sigma}_\ell^{\prime})W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\,\boldsymbol{\delta}^{(\ell+1)}\boldsymbol{z}^{(\ell-1)T},$$
(5)
which shows the dependency of W^{(ℓ)} on W^{(ℓ+1)}, and can itself be viewed as an explicit description of the dynamical system \(\mathcal{B}\) in Eq. (4). Written as a differential equation, we have
$$\frac{dW^{(\ell)}}{dt}=\Lambda(\boldsymbol{\sigma}_\ell^{\prime})W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}^{\prime})\,\boldsymbol{\delta}^{(\ell+1)}\boldsymbol{z}^{(\ell-1)T} \,\triangleq\, F(W^{(\ell+1)}).$$
(6)
Because of the mutual dependency between the weights and the activation values, it is hard to make an exact decomposition of the impacts of different parameters on W^{(ℓ)}. However, in the gradient \(\boldsymbol{\nabla}_W^{(\ell)}\), W^{(ℓ+1)} appears as an explicit term and contributes the direct impact on W^{(ℓ)}. To capture this direct impact and derive the adjacency matrix P for G_{B}, we apply a Taylor expansion to \(\boldsymbol{\nabla}_W^{(\ell)}\) and obtain
$$P^{(\ell,\ell+1)}=\partial^2\mathcal{C}/\partial W^{(\ell)}\partial W^{(\ell+1)},$$
(7)
which defines the interaction strength between each pair of weights from layer ℓ + 1 to layer ℓ. For a detailed derivation of P for MLPs and general neural networks, see Supplementary Notes 2 and 3. Let w = (w_{1}, w_{2}, …) be a flattened vector of all trainable weights of G_{A}. Given a pair of weights w_{i} and w_{j}, one from layer ℓ_{1} and the other from layer ℓ_{2}, the entry P_{ij} is defined according to Eq. (7) if ℓ_{2} = ℓ_{1} + 1, and P_{ij} = 0 otherwise. Considering the scale of the trainable parameters in G_{A}, P is very sparse. Let W^{(ℓ+1)*} be the equilibrium states (Supplementary Note 3); the training dynamics in Eq. (6) is then reformulated into the form of Eq. (1), giving the edge dynamics \(\mathcal{B}\) for G_{B}:
$$\dot{w}_i=f(w_i)+\sum_j P_{ij}\,g(w_i,w_j),$$
(8)
with \(f(w_i)=F(w_i^*)\) and \(g(w_i,w_j)=w_j-w_j^*\). The values of the weights at an equilibrium state \(\{w_j^*\}\) are unknown, but they are constants and do not affect the computation of β_{eff}.
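In practice, the inter-layer blocks of Eq. (7) are obtained by automatic differentiation rather than by hand. The TensorFlow sketch below is a hypothetical two-layer (dense, ReLU) block with a cross-entropy loss: the mixed second derivative ∂²C/∂W^{(ℓ)}∂W^{(ℓ+1)} is computed with nested gradient tapes and flattened into one inter-layer block of P; all entries outside such blocks are zero, so P is assembled as a sparse block matrix.

```python
# Minimal sketch: one inter-layer block of P per Eq. (7) via nested gradient tapes.
# The toy data, layer sizes, and loss are illustrative assumptions.
import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal((32, 8))                                   # small assumed batch
y = tf.one_hot(tf.random.uniform((32,), 0, 3, dtype=tf.int32), 3)

W1 = tf.Variable(tf.random.normal((8, 16)) * 0.1)               # W^(l)
W2 = tf.Variable(tf.random.normal((16, 3)) * 0.1)               # W^(l+1)

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        z1 = tf.nn.relu(x @ W1)
        logits = z1 @ W2
        C = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    g1 = inner.gradient(C, W1)                                  # dC/dW^(l)
block = outer.jacobian(g1, W2)                                  # d^2 C / dW^(l) dW^(l+1)

# flatten to a (|W^(l)| x |W^(l+1)|) block of the sparse adjacency matrix P
P_block = tf.reshape(block, (W1.shape.num_elements(), W2.shape.num_elements()))
print(P_block.shape)                                            # (128, 48)
```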
Property of the neural capacitance
According to Eq. (7), we have the weighted adjacency matrix P of G_{B} in place. The matrix P encodes rich information about the network, including the topology, the weights, the gradients, and, indirectly, the training labels. We now quantify the total impact that a trainable parameter (or synaptic connection) receives from itself and the others, corresponding to the weighted in-degrees δ_{in} = P1. Applying \(\mathcal{L}_P(\cdot)\) to δ_{in}, we get a "counterpart" metric \(\beta_{\rm eff}=\mathcal{L}_P(\boldsymbol{\delta}_{\rm in})\) that measures the predictive ability of a neural network G_{A}, just as the resilience metric (Eq. (2)) does for a general network G. If G_{A} is an MLP, we can write the entries of P and β_{eff} explicitly; for details of how to derive P and β_{eff} of an MLP, see Supplementary Note 2. Moreover, we prove in Theorem 1 below that as G_{A} converges, \(\boldsymbol{\nabla}_W^{(\ell)}\) vanishes and β_{eff} approaches zero (see Supplementary Note 4).
Theorem 1
Let ReLU be the activation function of G_{A}. If G_{A} converges, then β_{eff} = 0.
Note that a small value is added to the denominator of Eq. (2) to avoid a possible 0/0.
Algorithm 1
Implement NCP and compute β_{eff}(t)
Input: Pretrained source model \(\mathcal{F}_s=\{\mathcal{F}_s^{(1)},\mathcal{F}_s^{(2)}\}\) with bottom \(\mathcal{F}_s^{(1)}\) and output layer \(\mathcal{F}_s^{(2)}\), target dataset D_{t}, maximum epoch T
1: Remove \(\mathcal{F}_s^{(2)}\) from \(\mathcal{F}_s\) and add on top of \(\mathcal{F}_s^{(1)}\) an NCP unit \(\mathcal{U}\) with multiple layers (Fig. 1b)
2: Randomly initialize and freeze \(\mathcal{U}\)
3: Train the target model \(\mathcal{F}_t=\{\mathcal{F}_s^{(1)},\mathcal{U}\}\) by fine-tuning \(\mathcal{F}_s^{(1)}\) on D_{t} for T epochs
4: Obtain P from \(\mathcal{U}\) according to Eq. (7)
5: Compute β_{eff} with P according to Eq. (2)
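A minimal Keras sketch of Algorithm 1 follows. The model surgery (removing the source output layer, adding a frozen NCP unit) is shown explicitly; the two-dense-layer NCP architecture and the helpers compute_P and beta_eff, which would implement Eq. (7) (e.g., with nested gradient tapes as sketched above) and Eq. (2), are assumptions for illustration, not the exact implementation.

```python
# Minimal sketch of Algorithm 1 with assumed names and a hypothetical NCP unit.
import tensorflow as tf

def build_target_model(source_model, num_classes, ncp_width=32):
    # Step 1: remove the source output layer F_s^(2), keep the bottom F_s^(1)
    bottom = tf.keras.Model(source_model.input, source_model.layers[-2].output)
    # Steps 1-2: add a small NCP unit U on top, randomly initialized and frozen
    ncp = tf.keras.Sequential(
        [tf.keras.layers.Dense(ncp_width, activation="relu", name="ncp_hidden"),
         tf.keras.layers.Dense(num_classes, activation="softmax", name="ncp_out")],
        name="ncp_unit")
    ncp.trainable = False
    target = tf.keras.Sequential([bottom, ncp])
    target.compile(optimizer="sgd",
                   loss="categorical_crossentropy", metrics=["accuracy"])
    return target, ncp

# Steps 3-5: fine-tune F_s^(1) on the target data, then obtain P from the frozen
# NCP unit (Eq. (7)) and beta_eff(t) from P (Eq. (2)) after each epoch.
# compute_P and beta_eff are the hypothetical helpers described above.
# target, ncp = build_target_model(source_model, num_classes=10)
# for t in range(T):
#     target.fit(x_train, y_train, epochs=1, batch_size=64, verbose=0)
#     P = compute_P(ncp, x_train, y_train)
#     print(t, beta_eff(P))
```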
For an MLP G_{A}, it is possible to derive an analytical form of β_{eff}. However, this becomes extremely complicated for a deep neural network with multiple convolutional layers. To compute β_{eff} for deep neural networks of any architecture, we take advantage of the automatic differentiation implemented in TensorFlow^{74}. Given the number of parameters, it is still computationally prohibitive to calculate β_{eff} for the entire G_{A}.
Therefore, we derive a surrogate from part of G_{A}. Specifically, we insert a neural capacitance probe (NCP) unit, i.e., a few additional layers placed on top of the beheaded G_{A} (G_{A} with its original output layer removed), and estimate the predictive ability of the entire G_{A} using β_{eff} of the NCP unit. Accordingly, we call β_{eff} a neural capacitance.
Bayesian ridge regression
Ridge regression introduces an ℓ_{2}-regularization to linear regression and solves the problem
$$\arg\min_{\boldsymbol{\theta}}\,(\boldsymbol{y}-X\boldsymbol{\theta})^T(\boldsymbol{y}-X\boldsymbol{\theta})+\lambda\parallel\boldsymbol{\theta}\parallel_2^2,$$
(9)
where \(X\in\mathcal{R}^{n\times d}\), \(\boldsymbol{y}\in\mathcal{R}^n\), \(\boldsymbol{\theta}\in\mathcal{R}^d\) is the associated set of coefficients, and the hyperparameter λ > 0 controls the impact of the penalty term \(\parallel\boldsymbol{\theta}\parallel_2^2\). Bayesian ridge regression introduces uninformative priors over the hyperparameters and estimates a probabilistic model of the problem in Eq. (9). The ordinary least squares method posits the conditional distribution of y to be Gaussian, i.e., \(p(\boldsymbol{y}\mid X,\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{y}\mid X\boldsymbol{\theta},\sigma^2 I_n)\), where σ > 0 is a hyperparameter to be tuned and I_{m} denotes the m × m identity matrix. Moreover, we assume a spherical Gaussian prior over θ, i.e., \(p(\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{\theta}\mid\boldsymbol{0},\tau^2 I_d)\), where τ > 0 is another hyperparameter to be estimated from the data at hand. According to Bayes' theorem, p(θ∣X, y) ∝ p(θ)p(y∣X, θ), and the estimates of the model are made by maximizing the posterior distribution p(θ∣X, y), i.e., \(\arg\max_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}\mid X,\boldsymbol{y})=\arg\max_{\boldsymbol{\theta}}\left[\log\mathcal{N}(\boldsymbol{y}\mid X\boldsymbol{\theta},\sigma^2 I_n)+\log\mathcal{N}(\boldsymbol{\theta}\mid\boldsymbol{0},\tau^2 I_d)\right]\), which is a maximum-a-posteriori (MAP) estimation of the ridge regression when λ = σ^{2}/τ^{2}. All of θ, λ, and τ are estimated jointly during the model fitting, and \(\sigma=\tau\sqrt{\lambda}\). Based on the approach proposed by Tipping^{29} and MacKay^{75} to update the parameters λ and τ, we estimate I = h(β_{eff}; θ) with scikit-learn^{76}. The application of Bayesian ridge regression to our framework is summarized as follows:

Inputs: {(β_{eff,k}, I_{k}) ∣ k = 1, 2, …, K} is a set of observations, where β_{eff,k} is the proposed metric calculated from the training set, I_{k} is the corresponding validation accuracy, and K is the total number of observations collected during the early stage of model training.

Output: the fitted model I = h(β_{eff}; θ), where θ denotes the fitted parameters of the Bayesian ridge regression.

Prediction: I^{*} = h(0; θ), as per Theorem 1.
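A minimal scikit-learn sketch of this fit-and-extrapolate step is given below. The observation values are placeholders, and the one-dimensional linear form of h (β_{eff} as the single regressor) is the simplest choice consistent with the summary above.

```python
# Minimal sketch: fit I = h(beta_eff; theta) with Bayesian ridge regression and
# extrapolate to beta_eff = 0 (Theorem 1) to predict the final accuracy.
import numpy as np
from sklearn.linear_model import BayesianRidge

beta_obs = np.array([0.82, 0.61, 0.44, 0.30, 0.21])   # beta_eff at early epochs (placeholder)
acc_obs = np.array([0.35, 0.48, 0.57, 0.63, 0.68])    # validation accuracy I_k (placeholder)

reg = BayesianRidge()
reg.fit(beta_obs.reshape(-1, 1), acc_obs)             # theta, lambda, tau estimated jointly
I_star, I_std = reg.predict(np.array([[0.0]]), return_std=True)
print(f"predicted final accuracy: {I_star[0]:.3f} +/- {I_std[0]:.3f}")
```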
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.