Probabilistic alignment of multiple networks

Table of Contents

Probabilistic formulation of the problem of aligning multiple networks

Consider K network observations, with N nodes each and adjacency matrices {A^k; , k = 1, …, K }. We consider networks that are directed and with binary edges (that is, we just consider the presence or absence of connection between two nodes). Our hypothesis is that the observed networks are topologically similar, which allows us to map each node in a network to another node in each of the other networks. The goal is then to find the most plausible mapping of nodes across networks.

We follow a probabilistic approach and surmise that there is a latent underlying blueprint L, such that each network observation has been generated from that blueprint. Then, the network alignment problem becomes a problem of finding the blueprint L, and the permutations {π ^k, k = 1, …, K } that map nodes in each network to nodes in the blueprint. This formulation naturally allows the simultaneous alignment of several networks onto a single blueprint (Fig. 1).

**Fig. 1: Probabilistic approach to multiple network alignment.**

Formally, we consider a blueprint L with binary edges (L_ij ∈ {0, 1}) from which networks are copied with errors. Each entry A_ij in an observed network is assumed to be independently generated from L_ij with a copying error probability that depends on L_ij. Edges in the blueprint (L_ij = 1) are copied with error probability q, whereas non-edges (L_ij = 0) are copied with error probability p (see Supplementary Material for the model with uniform copy error probability q = p). This approach shares some features with the approach in Ref. ²², where the goal is to align a noisy network copy with the original network (the blueprint in our case) using a similar probabilistic framework. The crucial difference lies in that, in our case, the blueprint is unknown, so that we need to infer it. This difference is precisely what makes our approach symmetric (the alignment of two networks does not require that we choose one as reference) and what allows us to align multiple networks at the same time without the need to choose an arbitrary blueprint from the set of observed networks.

For simplicity, we consider first the case in which we have a single observation. Given L_ij, q, and p, the probability that we observe an edge (A_ij = 1) or a non-edge (A_ij = 0) is then:

$$p\, ({A}_{ij}=1| {L}_{ij},q,p)={p}^{(1-{L}_{ij})}{\left(1-q\right)}^{{L}_{ij}}$$

(1)

$$p\, ({A}_{ij}=0| {L}_{ij},q,p)={q}^{{L}_{ij}}{\left(1-p\right)}^{(1-{L}_{ij})}.$$

(2)

Since the probability of observing each edge is conditionally independent on the others, the total likelihood of observing an adjacency matrix A given a latent blueprint L, q, and p factorizes. Additionally, note that the likelihood depends on the unknown mapping of nodes in the observed network into nodes of the blueprint, which is given by the permutation π of the nodes in the observed network. With this, the likelihood is

$$p({{{\bf{A}}}}| {{{\bf{L}}}},q,p,\pi )={\prod}_{ij}p({A}_{ij}| {{{\bf{L}}}},q,p,\pi )={q}^{{o}_{10}}{p}^{{o}_{01}}{\left(1-q\right)}^{{o}_{11}}{\left(1-p\right)}^{{o}_{00}},$$

(3)

where the edge overlaps between the blueprint and the observations are ${o}_{XY}=\sum {\delta }_{{L}_{\pi (i)\pi (j)},X}{\delta }_{{A}_{ij},Y}$. For example, o₀₁ is the number of entries that are 0 in L and 1 in A, for mapping π.

To obtain the posterior distribution over all model parameters, we use Bayes rule

$$p({{{\bf{L}}}},\pi,q,p| {{{\bf{A}}}})=\frac{p({{{\bf{A}}}}| {{{\bf{L}}}},q,p,\pi )\,p({{{\bf{L}}}},q,p,\pi )}{p({{{\bf{A}}}})},$$

(4)

where p(L, q, p, π) is the prior over model parameters. To obtain our desired posterior p(L, π∣A), we consider a beta prior over p and q, and integrate over these variables, which yields (“Methods”)

$$p({{{\bf{L}}}},\pi | {{{\bf{A}}}})\propto \frac{\Gamma ({o}_{11}+{\beta }_{q})\Gamma \left({o}_{10}+{\alpha }_{q}\right)}{\Gamma \left({n}_{1}+{\alpha }_{q}+{\beta }_{q}\right)}\frac{\Gamma ({o}_{00}+{\beta }_{p})\Gamma \left({o}_{01}+{\alpha }_{p}\right)}{\Gamma \left({n}_{0}+{\alpha }_{p}+{\beta }_{p}\right)},$$

(5)

where ${n}_{1}={\sum }_{i}{\delta }_{{L}_{ij},1}$, ${n}_{0}={\sum }_{i}{\delta }_{{L}_{ij},0}$ are the number of edges and non-edges in the latent blueprint L, respectively, and (α_q, β_q) and (α_p, β_p) are the hyper-parameters of the prior distributions for q and p, respectively (Methods). Note that one expects β_q/p > α_q/p ≥ 1 so that priors favor small copy error probabilities, although the effect of the prior becomes negligible for even modest edge overlaps.

In the case in which our observation comprises K networks, Eq. (5) generalizes to (“Methods”)

$$p({{{\bf{L}}}},\{{\pi }^{k}\}| \{{{{{\bf{A}}}}}^{k}\})\propto \frac{\Gamma ({O}_{11}+{\beta }_{q})\Gamma \left({O}_{10}+{\alpha }_{q}\right)}{\Gamma \left(K{n}_{1}+{\alpha }_{q}+{\beta }_{q}\right)}\frac{\Gamma ({O}_{00}+{\beta }_{p})\Gamma \left({O}_{01}+{\alpha }_{p}\right)}{\Gamma \left(K{n}_{0}+{\alpha }_{p}+{\beta }_{p}\right)},$$

(6)

where ${O}_{XY}={\sum }_{k}{o}_{XY}^{k}$, so that the posterior depends on the overall overlap of edges and non-edges of all the networks with the blueprint. Note that because β > α, the (L, {π^k}) that maximize the posterior are the ones that maximize O₁₁ and O₀₀, and minimize O₁₀ and O₀₁. Also note that for a fixed permutation choice {π^k}, the L^⋆ that maximizes the posterior is such that ${L}_{ij}^{\star }$ is equal to the majority of all ${A}_{{\tilde{\pi }}^{k}(i),{\tilde{\pi }}^{k}(j)}^{k}$, where ${\tilde{\pi }}^{k}(i)$ is the inverse of π^k(i), so that ${\pi }^{k}({\tilde{\pi }}^{k}(i))=i$ (“Methods”).

Note that, in the limit p → 0 or, equivalently, β_p → ∞, our generative model is equivalent to generating networks from the blueprint by just sampling on the edges in the blueprint (that is L_ij = 1). The limits of recovery of node identities and group labels of these nodes in terms of the sampling probability have been extensively studied for correlated pairs of random graphs^31,32,33 and for correlated pairs of graphs with group structure³⁴. In this probabilistic framework, the best alignment corresponds to the blueprint-permutation pair that maximizes the posterior probability p(L, {π^k}∣{A^k}). In the limit p → 0, the posterior reduces to p(L, {π^k}∣{A^k}) ∝ Γ(O₁₁ + β_q)Γ(O₁₀ + α_q), so that the most plausible alignment corresponds to the blueprint-permutation pair that maximizes O₁₁, that is, the total number of edges in the graph that overlap with the blueprint. Formally, for the alignment of a pair of networks with binary adjacency matrices (A, B), if we assume that one of the two networks is equal to the blueprint as in Ref. ²², then maximizing O₁₁ is equivalent to maximizing the product ∑_ij A_ij B_π(i)π(j) as in the Koopmans-Beckmann QAP formulation of the graph matching problem¹¹, and, in general, to minimizing the Frobenius norm between two adjacency matrices with binary entries. When p ≠ 0, our approach considers the contribution of both aligned edges and aligned non-edges. Therefore, the most plausible alignment in our case does not necessarily coincide with that in the Koopmans-Beckmann QAP formulation of the network alignment problem.

This observation also highlights the fact that current structural pairwise approaches cannot be easily modified to allow for the alignment of multiple networks, since, in order to do that, we would have to decide which of the networks is the generating blueprint against which all other networks need to be aligned. Unfortunately, if all copies are subject to the same copy error, there is no reason to expect that one of the observations is closer to the blueprint than another a priori. In fact, even if we had that prior information, pairwise approaches would only consider independent alignments of each network against the selected blueprint, so that the information coming from the overlap between all other pairs of networks other than the blueprint would be lost, and we would have to resort to adding a method to assess consistency of pair alignments^29,30. Our probabilistic approach naturally circumvents these issues.

Sampling the space of plausible alignments

By analogy to statistical mechanics, we can associate an “energy” ${{{\mathcal{H}}}}=-\log \left(p({{{\bf{L}}}},\{{\pi }^{k}\},\{{{{{\bf{A}}}}}^{k}\})\right)$ to each blueprint-permutation pair (L, {π^k}), so that their posterior is written

$$p({{{\bf{L}}}},\{{\pi }^{k}\}| \{{{{{\bf{A}}}}}^{k}\})=\frac{\exp \left(-{{{\mathcal{H}}}}\right)}{p(\{{{{{\bf{A}}}}}^{k}\})},$$

(7)

and the energy ${{{\mathcal{H}}}}$ can be obtained directly from Eq. (6). In this interpretation, the best alignment is the blueprint-permutation pair that minimizes ${{{\mathcal{H}}}}$. Importantly, this equivalence enables us to sample over the space of alignments (L, {π^k}) using Markov chain Monte Carlo³⁵ (“Methods”; see Supplementary Material and Fig. fig:time_size for the algorithmic complexity of our tool). The sampling of equilibrium alignments from this space allows us to approach the network alignment problem in ways other than just finding the single best alignment.

In particular, by sampling alignments from the posterior distribution, we can estimate the probability p(π^k(i_k) = i_L∣{A^k}) that node i_k in network A^k is mapped to node i_L in the blueprint as the fraction of sampled permutations in which π^k(i_k) = i_L. Then, we can estimate the most probable mapping for each node π^k,⋆(i_k) as the one that maximizes p(π(i_k) = i_L∣{A^k}) (Methods). As we show below, in noisy observations, the most plausible alignment does not necessarily recover the ground truth mapping of nodes. This is because for nodes with few connections, copy error can introduce a degree of ambiguity and degeneration in the mappings of node identities. However, we find that the most probable mapping of each individual node recovers the ground truth for their mapping more reliably (Fig. 2). This is, in fact, an expected result from using a probabilistic approach: averages over the ensemble of possible alignments are more accurate at predicting hidden information (in this case, node mappings to the blueprint) than single point estimates (that is, the best alignment) (see for instance Ref. ³⁶ for a discussion in the context of link prediction).

**Fig. 2: Alignment of synthetic connectomes.**

Additionally, MCMC sampling allows us to leverage relevant information about the nodes. For instance, if we have access to node attributes such as group labels, we can constrain permutations to those in which only nodes with the same group labels can be mapped to the same node in the blueprint (Methods); or if we have information about the precise identity of some of the nodes (often called ‘anchors’ or ‘seeds’¹²; see Methods for details), our algorithm can be forced to only sample permutations that have a fixed mapping for these nodes.

Validation of the probabilistic approach on synthetic networks

We start by validating our approach on synthetic, but realistic, settings of increasing difficulty. Because network alignment is particularly relevant in the context of connectome analysis, we consider the connectome of the nematode C. elegans^25,37 as our benchmark, and perform two initial sets of experiments: (i) the alignment of multiple identical networks in which node identities have been shuffled; (ii) the alignment of several noisy copies of the same network.

In the first experiment, we use multiple copies of the connectome network provided in ref. ¹¹ (279 neurons), with node identities randomly shuffled in each copy. Because the networks are identical, the ground state of the energy in Eq. (7) coincides with the ground truth alignment, that is, with the desired mapping of node identities. However, the ground state is degenerate because any permutation of the nodes in the latent blueprint leaves the energy invariant. To break this symmetry, which greatly complicates the search in the space of possible alignments, we find that it is more efficient to first align two networks (chosen at random), and then add the remaining networks one by one to eventually obtain the desired global alignment of all the networks (Supplementary Fig. S2). We find that this approach enables us to perfectly align as many networks as desired, always converging to the ground state.

Except for the fact that we consider an arbitrary number of networks, as opposed to just two, the previous experiment is the standard validation test in the literature on network alignment and graph matching. In the second experiment, we move to a more realistic and challenging scenario. In particular, we use the adult A2 connectome of C. elega ns from ref. ²⁵ as a blueprint (224 neurons and 2,186 connections), and generate four noisy synthetic connectomes by swapping 327 edges with non-edges from the original adjacency matrix.

Using a maximum likelihood estimate, this process would be equivalent to generating connectomes using our copying mechanism with q ≈ 0.15 and p ≈ 0.007.

Since the four observations are not identical, now the global ground state may not correspond to that of the ground truth. Additionally, if due to observational noise there are neurons with few connections, new degeneracies or quasi-degeneracies may appear. Therefore, the energy landscape is likely to become rougher in this realistic scenario. Indeed, we find that, whereas alignment is still possible and most nodes are correctly matched, the MCMC finds alignments with lower energy than the ground truth, indicating that the exact ground truth has become undiscoverable. (Since in this experiment the synthetic networks are generated by the generative model underlying our approach, our approach is Bayes optimal and no other method would, in general, be able to identify the ground truth alignment in this case.)

Given the added difficulty of this scenario, we take advantage of the fact that network datasets often contain useful information about nodes, such as group labels; in the specific case of connectomes, one could typically have information about neuron types (depending on their neuroblast lineage). Therefore, we can restrict our Markov chain to permutations where each neuron is mapped to a neuron of the same type in the blueprint. This drastically reduces the space of possible alignment permutations, and allows us to sample the alignment space more thoroughly and efficiently (Fig. 2). (Alternatively, we can apply the same successive alignment strategy as before, wherein we start by aligning two networks add layers successively; Supplementary Fig. S3.) Our results confirm that there exist multiple alignment permutations (and the corresponding blueprints) that have energies lower than that of the ground truth alignment (Fig. 2a, c).

This observation is key for real-world scenarios, and shows that the ground-state alignment should not be taken as the best estimate for the identity of individual nodes. Rather, as previously discussed, for individual nodes one should rather use the mapping π^k,⋆(i_k) that maximizes the probability p(π^k(i_k) = i_L∣{A^k}) that node i_k in network A^k is mapped to node i_L in the blueprint (Methods). As we show in Fig. 2d, the most probable mapping for each node is more accurate at recovering the ground truth node identities. In fact, we find that some neurons are easy to align, so that the majority of sampled alignments have the same node mappings for those neurons; but there are other neurons, typically with few connections (see Supplementary Fig. S4), for which there are several possible alignments and only through averaging can we recover the underlying node identity.

To benchmark the performance of our approach, we compare it to other methods for network alignment: the Fast QAP^11,38 and the multi-way Kernel Graph Matching (KerGM)²³. Fast QAP is a method to align pairs of networks and offers the possibility to use anchors in the alignment but does not allow to use node attributes such as group labels; KerGM can align simultaneously multiple networks and also incorporate node attributes in the alignment process (see Methods for details). For a systematic comparison, we generate sets of K ∈ {2, 3, 4} noisy copies of the A2 connectome by swapping a fraction σ of the edges in the network with non-edges, with σ ∈ [0, 0.6]. For K = 2 we compare our approach without using group labels to the Fast QAP and KerGM, and using group labels against KerGM (Fig. 2g; Supplementary Table S1). For K = 3, 4, we compare our approach to KerGM, using group labels as node attributes in both cases (Fig. 2h, i; Supplementary Table S1; Methods). Our approach outperforms both QAP and KerGM, since it is able to recover accurate alignments for larger fractions of errors. Importantly, our approach has superior accuracy even when we reduce the number of sampled alignments with run-times comparable to those of KerGM (Supplementary Table S1).

Finally, to further showcase the potential uses of our approach, we note that, in some real-world scenarios, node group labels are only known for some of the nodes. For instance, in large neuronal connectomes^27,28,39,40 neuron type is not always straightforward to establish, and some neurons are left without annotation. In such cases, our probabilistic approach can be used to infer the labels of unannotated nodes. Specifically, for each alignment sampled from the posterior, each node i_k in network k is mapped to a labeled node in the blueprint i_L = π^k(i_k); and a node i_k in network k with unknown group label g(i_k) is automatically assigned the label of the node in the blueprint to which it is mapped g(i_k) = g(i_L). Therefore, by sampling alignments from the posterior, we can estimate the probability p(g(i_k) = g) that an unannotated node has label g as the fraction of sampled alignments in which g(i_k) = g. To assess the performance of our approach in this task, we consider the same noisy connectome networks as in Fig. 2a–d, and we assume that one of the networks is fully annotated, but the others have a fraction of unlabeled nodes. In Fig. 2e, f, we show that our approach is able to recover ground truth annotation for all nodes with unknown labels, both when unlabeled nodes are randomly chosen in each observed connectome, and in the harder case in which unlabeled neurons are the same across connectomes (Supplementary Figs. S5, S6).

The probabilistic approach correctly aligns real networks

Having validated our method on synthetic networks for which the underlying assumptions are fulfilled by construction, we now explore its performance on sets of real networks, for which the assumptions may not be fulfilled. We consider three real network datasets (see Data for details): (i) four neural connectomes of the C. elegans nematode corresponding to four different developmental stages²⁵; (ii) the left and right hemispheres of the brain of the larva of D. melanogaster²⁶; (iii) the yearly e-mail communication networks within an academic institution for four consecutive years⁴¹. In each of these cases, we make use of whatever information is available from the nodes, and incorporate it to to the MCMC sampling to make it more efficient; we compare our results to those of Fast QAP and KerGM under the same conditions. In Table 1, we show a summary of the comparison, and in what follows we discuss each experiment in detail.

Table 1 Comparison of accuracy and run-times for real-world networks

C. elegans connectome at different stages of development

We consider four connectomes of C. elegans at different stages of development²⁵: two late larval samples (L2, L3) and two adult samples (A1, A2). In this dataset, each connectome comprises 224 neurons and each neuron is associated to one of six different neuron types, so we can use this information to constrain the search of alignment permutations. Yet, the alignment of these connectomes is challenging because the density of neuron-to-neuron connections increases throughout development, so that L2 and L3 connectomes are sparser than A1 and A2 connectomes (Supplementary Fig. S4). This implies that the network observations are not fully consistent with our assumption that they have been generated from the blueprint with the same copy error probability. Nonetheless, we find that the most probable mapping of each neuron (the mapping that maximizes the probability p(π^k(i_k) = i_L∣{A^k}), obtained by averaging over alignments sampled in the equilibrium zone) recovers the ground truth identity for 94% of the neurons (Fig. 3a–b), compared to 56% identities for KerGM (Table 1). By contrast, as for the synthetic connectomes, the ground-state alignment is significantly further from the ground truth, as it misaligns 33.5% of the neurons due to the left-right symmetry in the data (an issue that is also apparent in the typical alignments obtained using KerGM; Supplementary Fig. S7). Therefore, in this case, it is critical to average over the ensemble of alignments in which different nodes in different connectomes are misaligned, to recover an alignment closer to the biological ground truth. Importantly, the advantage of sampling is already apparent even if we substantially reduce the number of samples and run times (Supplementary Fig. S7; Table 1).

**Fig. 3: Alignment of four real *C.elegans* connectomes corresponding to different developmental stages.**

To further benchmark the performance of our approach, we also looked at the alignment of two adult connectomes (A1, A2), in which case we can also compare to Fast QAP. In general, aligning pairs of networks happens to be a harder problem (because less information is available about nodes and their connections) but, again, the alignment obtained using our approach is better than those obtained using any of the other methods (Table 1; Supplementary Fig. S8). However, in contrast to the alignment of four connectomes, the most probable mapping does not improve over the ground state (Supplementary Fig. S8). This is because for only a pair of networks the alignment landscape is dominated by one deep minimum.

Finally, we investigate the ability of our approach to infer unknown group labels (that is, neuron types), assuming that A2 is fully annotated and that L2, L3, and A1 have 15% of the nodes without annotation (Fig. 3c–e). This is a harder problem, since unannotated nodes are unconstrained by group label and the number of possible permutations is, thus, larger. Nonetheless, we find that, again by sampling from the posterior distribution of alignments, the most probable mapping of each neuron recovers the ground truth identity in 92.8% of cases, and that our approach is able to fully recover unknown labels. Qualitatively, these results are robust to changes in the selection of the fully annotated connectome and the unannotated nodes, and to other variations in the experiment (Supplementary Figs. S9–S11).

Left and right brain hemispheres for the Drosophila melanogaster larva

Next, we consider the connectome for the full brain of the larva of the Drosophila melanogaster fly²⁶. Neurons in the fly brain can be classified into hemispheres that are assumed to be mirror copies of one another, which is consistent with our hypothesis of an underlying blueprint. Our goal is thus to use our probabilistic approach to align the two hemispheres and show that our method is valid to align connectomes comprising over a thousand neurons. For this task, we constrain the search space of permutations by using two pieces of information (Data): (i) the neuroblast lineage of each neuron; and (ii) a set of 292 anchors, that is, pairs of neurons for which the precise mapping is well established (also called “seeds“ in the literature¹²). In Fig. 4, we show that the most probable mapping of each neuron recovers the ground truth mapping in 79.2% of the cases (or 78.6% if we consider a smaller set of alignments). This is higher than just looking at the ground state (77.1%), the best alignment we obtained using a seeded QAP approach¹² (76.9%), and also the mean accuracy obtained using KerGM (69.5 %) (Table 1 and Supplementary Fig. S12).

**Fig. 4: Alignment of the right and left hemispheres of the brain of the *Drosophila melanogaster* larva.**

Note also that neither anchors nor mismatched neurons are homogeneously distributed. For example, some groups are hard to align because the topological overlap between connections of ground truth pairs and those with other neurons of the same type is often large, such as for Kenyon cells (KC; neurons in the mushroom body of the brain) or gustatory external neurons (Supplementary Figs. S13, S14). For KC neurons misalignment between neurons is to be expected, since these neurons establish many connections following a random pattern with well defined pre-synaptic and post-synaptic partners, and therefore they are topologically hard to distinguish⁴². Nonetheless, this is not necessarily the case for other neuron types such as gustatory neurons, for which there are clear topological partners. For some of these misaligned neuron pairs, we have checked that our tool finds the topologically optimal solution that does not correspond to the biological ground truth, suggesting that either the annotation is incorrect or that there is further biological information the we would need to incorporate into the tool to be able to recover the biological ground truth mapping.

E-mail communication networks

As a final case study, we consider a completely different network representing the emails exchanged between individuals within an academic institution during four consecutive years (2007-2010)⁴¹. For each year, we define a directed network where the nodes are individuals and links represent the existence of stable e-mail communication between them during the years we consider (Data). In particular, we consider two cases of stable yearly communication between individuals that result in two networks of 170 nodes (A_ij = 1 if i sends at least 25 emails to j in a year) and 356 nodes (A_ij = 1 if i sends at least 12 emails to j in a year), respectively. E-mail communication between two individuals can vary significantly from year to year⁴¹, which makes this a harder problem and serves as a benchmark for networks with large copy error.

The data includes the organizational unit to which each individual is affiliated, so we use this information to constrain the possible alignment permutations. In Fig. 5, we show that for both networks most of the alignment permutations we sample have an energy which is lower than that of the ground truth alignment. As in the case of C. elegans, we find that sampling over alignments increases significantly the ability to recover ground truth alignment. Using sampling we recover the ground truth identity for 95.9% and 96.1% of the nodes for the 170 and 356-node network, respectively (95.1% and 94% if we use a reduced sampling strategy), which is substantially higher than looking at the ground state alignment (92.8% and 89.3%) or using KerGM (average 80% and 59%) (Table 1 and Supplementary Fig. S15).

**Fig. 5: Alignment of university email networks corresponding to four consecutive years⁴¹.**

link

Facebook baixar gratis

Probabilistic alignment of multiple networks

Probabilistic formulation of the problem of aligning multiple networks

Sampling the space of plausible alignments

Validation of the probabilistic approach on synthetic networks

The probabilistic approach correctly aligns real networks

C. elegans connectome at different stages of development

Left and right brain hemispheres for the Drosophila melanogaster larva

E-mail communication networks

Scaling Medical Digital Transformation in Global Pharma While Ensuring Local Fit

Amazon, Etsy, e-commerce stocks climb

Elevating Distribution Excellence Through Full-Service Partnership

The 23+ Most Common Types of Fabrics Explained