Blog

Universal multilayer network exploration by random walk with restart

11:31 01 julio in Artículos por Website
0


Random walk with restart (RWR)

Let us consider an irreducible and aperiodic Markov chain, for instance a network composed of a giant component with undirected edges, G = (V, E), where V is the set of vertices and E (V × V) is the set of edges. In the case of irreducible and aperiodic Markov chains, a stationary probability p* exists and satisfies the following properties:

$$\left\{\begin{array}{ll}{{{{{{{{\bf{p}}}}}}}}}^{* }(i)\ > \ 0;\forall i\in V \hfill \\ {\sum }_{i\in V}{{{{{{{{\bf{p}}}}}}}}}^{* }(i)=1\hfill\end{array}\right.$$

(1)

We next introduce the probability defining the walk from one node to another. Let us define x, a particle that explores the network, xt its position at time t and xt+1 its position at time t + 1. Considering two nodes i and j:

$${\mathbb{P}}({x}_{t+1}=j\,| \,{x}_{t}=i)=\left\{\begin{array}{l}\frac{1}{{d}_{i}}\,\,\,\,\,{{{{{{{\rm{if}}}}}}}}\ (i,j)\in E\\ 0\,\,\,\,\,\,{{{{{{{\rm{Otherwise}}}}}}}}\end{array}\right.$$

(2)

with di being the degree of the node i. All the normalized possible transitions can be included in the Transition rate matrix. This Transition rate matrix, noted M, can be seen as the matrix of the degrees of freedom of the particle in the system. It is useful to note that the Transition rate matrix is equal to the column-normalized Adjacency matrix. The distribution denoted by \({{{{{{{{\bf{p}}}}}}}}}_{t}={({{{{{{{{\bf{p}}}}}}}}}_{t}(i))}_{i\in V}\) describes the probability of being in the node i at time t, and the stationary distribution p* is obtained thanks to the homogeneous linear difference equation [3]18,23:

$${{{{{{{{\bf{p}}}}}}}}}_{t+1}^{T}=M{{{{{{{{\bf{p}}}}}}}}}_{t}^{T}$$

(3)

with \({{{{{{{{\bf{p}}}}}}}}}_{t}^{T}\) denoting the transpose of the vector pt. Moreover, we can introduce a non-homogeneous linear difference equation [4]23 to take into account the restart on the seed(s). When the Transition rate matrix is a Stochastic matrix, the stationary distribution is reached18 (Supplementary Note 1.A.1 for elements of proof of convergence) and this distribution can be seen as a measure of proximity of all the network nodes with respect to the seed(s).

$${{{{{{{{\bf{p}}}}}}}}}_{t+1}^{T}=(1-r)M{{{{{{{{\bf{p}}}}}}}}}_{t}^{T}+r{{{{{{{{\bf{p}}}}}}}}}_{0}^{T}$$

(4)

The distribution p0 corresponds to the initial probability distribution, where only the seed(s) have non-zero values; r represents the restart probability.

RWR on multiplex networks

The RWR method has been extended to multiplex networks, i.e., multilayer networks with a one-to-one mapping between the (replica) nodes of the different layers (Fig. 1)1,13,14. Multiplex networks can be represented by Supra-adjacency matrices, which correspond to a generalization of the standard Adjacency matrix. In the following, we will use several multiplex networks, indexed by k. We denoted by \({{{{{{{{\mathcal{A}}}}}}}}}_{k}\) the Supra-adjacency matrix of the multiplex network indexed by k. The Adjacency matrix of the layer l of the multiplex network k is denoted by \({A}_{k}^{[l]}\). The element of this adjacency matrix from node i to node j is defined as \({({A}_{k}^{[l]})}_{i,j}\ge 0\). The dimension of the Supra-adjacency matrix \({{{{{{{{\mathcal{A}}}}}}}}}_{k}\) of the multiplex network k is equal to (Lk*nk)*(Lk*nk), with nk the number of nodes in each layer of the multiplex network k and Lk the number of layers in the multiplex network k. The Supra-adjacency matrix \({{{{{{{{\mathcal{A}}}}}}}}}_{k}\) is defined as follows:

$${({{{{{{{{\mathcal{A}}}}}}}}}_{k})}_{{i}_{l},{j}_{m}}=\left\{\begin{array}{ll}{\left({A}_{k}^{[l]}\right)}_{i,j}&{{{{{{{\rm{if}}}}}}}}\,l=m\\ {\delta }_{i,j}\hfill&{{{{{{{\rm{if}}}}}}}}\,l\ \ne\ m\hfill\end{array}\right.$$

(5)

where δ defines the Kronecker delta (i.e., 1 if i equal j and 0 otherwise), and l and m represent the layers of the multiplex network k. We can also define a multiplex network as a set of nodes, \({V}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}}\) and a set of edges, \({E}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}}\):

$$\left\{\begin{array}{l}{G}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}}=({V}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}},{E}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}})\hfill \\ {V}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}}=\{{v}_{i}^{l},i=1,\ldots ,{n}_{k},l=1,\ldots ,{L}_{k}\}\hfill\\ {E}_{{{{{{{{{\mathcal{A}}}}}}}}}_{k}}=\{{e}_{i,j}^{ll},i,j=1,\ldots ,{n}_{k},l=1,\ldots ,{L}_{k},{({A}_{k}^{[l]})}_{i,j}\ \ne\ 0\}\\ \cup \{{e}_{i,i}^{lm},i=1,\ldots ,{n}_{k},l\ \ne\ m\}\hfill\end{array}\right.$$

(6)

Importantly, we need to column-normalize the Supra-adjacency matrix defined in the equations [56] in order to converge to the steady-state, as defined in15. This normalization requires including the parameters δk related to the jumps from one layer to another inside the matrix representation, as described in13 (Fig. 2). In the next section, we need to index by k all the parameters that are dedicated to the multiplex network k. The Supra-adjacency matrix representing the multiplex network k can be written as described in equation [7]. The matrix Ik represents the Identity matrix of size nk.

$${{{{{{{{\mathcal{A}}}}}}}}}_{k}=\left[\begin{array}{cccc}(1-{\delta }_{k}){A}_{k}^{[1]}&\frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}&\ldots &\frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}\\ \frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}&(1-{\delta }_{k}){A}_{k}^{[2]}&\ldots &\frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}\hfill\\ \vdots &\vdots &\ddots &\vdots \\ \frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}&\frac{{\delta }_{k}}{({L}_{k}-1)}{I}_{k}&\ldots &(1-{\delta }_{k}){A}_{k}^{[{L}_{k}]}\end{array}\right]$$

(7)

Fig. 2: MultiXrank Random Walk with Restart parameters.
figure 2

Parameters of the Random Walk with Restart allowing to explore universal multilayer networks composed of N multiplex networks (each composed of several layers containing the same set of (replica) nodes but different edges). The parameters δ are associated with the probability to jump from one layer to another in a given multiplex network, λ with the probability to jump from one multiplex network to another multiplex network, τ with the probability to restart in a given layer of a given multiplex network, and η with the probability to restart in a given multiplex network.

RWR on universal multilayer networks

We here define a RWR method that can be applied to universal multilayer networks. Universal multilayer networks are composed of any combination of multiplex networks, linked by any combination of bipartite networks (Fig. 1). All network edges can also be weighted and/or directed. The formalism for the application of RWR on multiplex networks is described in the previous section. We will now detail the Bipartite network matrices, and how to combine intra- and inter- multiplex networks information to obtain the Supra-heterogeneous adjacency matrix. The Supra-heterogeneous adjacency matrix will embed all the possible transitions in a universal multilayer network.

Bipartite networks connect heterogeneous nodes

The Bipartite network matrices contain the transitions between different types of nodes present in different networks. If the network α has nα nodes, and the network β has nβ nodes, the Bipartite network matrix denoted bα,β has a size equal to nα*nβ. Now, let us define \({{{{{{{{\mathcal{A}}}}}}}}}_{\alpha }\) and \({{{{{{{{\mathcal{A}}}}}}}}}_{\beta }\), two Supra-adjacency matrices representing the multiplex networks α and β. The Bipartite network matrix Bα,β represents the transitions from the nodes of the multiplex network α to the nodes of the multiplex network β. The size of the Bipartite network matrix Bα,β is equal to (Lα*nα)*(Lβ*nβ). The Bipartite network matrices are composed of (Lα*Lβ) times the Bipartite network matrix bα,β (equation [8]). The matrix bα,β is composed of all the transitions from one layer of the multiplex network α to one layer of the multiplex network β. We extended the formalism used in15 in order to consider more than two different multiplex networks.

$${B}_{\alpha ,\beta } = \underbrace{\left[\begin{array}{cccc}{b}_{\alpha ,\beta }&{b}_{\alpha ,\beta }&\ldots &{b}_{\alpha ,\beta }\\ {b}_{\alpha ,\beta }&{b}_{\alpha ,\beta }&\ldots &{b}_{\alpha ,\beta }\\ \vdots &\vdots &\ddots & \vdots \\ {b}_{\alpha ,\beta }&{b}_{\alpha ,\beta }&\ldots &{b}_{\alpha ,\beta } \end{array}\right]}_{{L}_{\beta } \, {{{{{{{\rm{times}}}}}}}}}\left.\vphantom{\begin{array}{c}1\\ 1\\ 1\\ 1\\ 1\\ 1\end{array}}\right\}\scriptstyle{{L}_{\alpha }} \, {{{{{{{\rm{times}}}}}}}}$$

(8)

The representation of the bipartite networks as a set of nodes \({V}_{{{{{{{{\mathcal{B}}}}}}}}}\) and a set of edges \({E}_{{{{{{{{\mathcal{B}}}}}}}}}\) can be written as:

$$\left\{\begin{array}{l}{G}_{{{{{{{{\mathcal{B}}}}}}}}}=({V}_{{{{{{{{\mathcal{B}}}}}}}}},{E}_{{{{{{{{\mathcal{B}}}}}}}}})\hfill\\ {V}_{{{{{{{{\mathcal{B}}}}}}}}}=\{{v}_{k}^{\alpha },k=1,\ldots ,{n}_{\alpha }\}\cup \{{v}_{l}^{\beta },l=1,\ldots ,{n}_{\beta }\}\hfill\\ {E}_{{{{{{{{\mathcal{B}}}}}}}}}=\{{e}_{k,l}^{\alpha \beta }\,k=1,\ldots ,{n}_{\alpha }\,,\,l=1,\ldots ,{n}_{\beta }\,;\,{({b}_{\alpha ,\beta })}_{k,l}\ \ne\ 0\}\end{array}\right.$$

(9)

It is to note that if the bipartite networks are undirected, \({b}_{\beta ,\alpha }^{T}={b}_{\alpha ,\beta }\) and \({B}_{\beta ,\alpha }^{T}={B}_{\alpha ,\beta }\).

Universal multilayer networks unify the representation of heterogeneous multiplex networks

We previously defined the Supra-adjacency matrices of each multiplex network and the Bipartite network matrices connecting the different multiplex networks. We now introduce the Supra-heterogeneous adjacency matrix, denoted by \({{{{{{{\mathcal{S}}}}}}}}\). This matrix, defined in equation [10], collects the N Supra-adjacency matrices representing each multiplex network, \({{{{{{{{\mathcal{A}}}}}}}}}_{1},{{{{{{{{\mathcal{A}}}}}}}}}_{2},\ldots ,{{{{{{{{\mathcal{A}}}}}}}}}_{N}\), and the N*(N − 1) Bipartite network matrices connecting each multiplex network, B1,2, B1,3, …, B1,N, B2,1, …, BN,N−1.

$${{{{{{{\mathcal{S}}}}}}}}=\left[\begin{array}{cccc}{{{{{{{{\mathcal{A}}}}}}}}}_{1}&{B}_{1,2}&\ldots &{B}_{1,N}\\ {B}_{2,1}&{{{{{{{{\mathcal{A}}}}}}}}}_{2}&\ldots &{B}_{2,N}\\ \vdots &\vdots &\ddots &\vdots \\ {B}_{N,1}&{B}_{N,2}&\ldots &{{{{{{{{\mathcal{A}}}}}}}}}_{N}\end{array}\right]$$

(10)

We can also define the Supra-heterogeneous adjacency matrix as a set of nodes and edges:

$$\left\{\begin{array}{rcl}&{G}_{{{{{{{{\mathcal{S}}}}}}}}}=\left({V}_{{{{{{{{\mathcal{S}}}}}}}}},{E}_{{{{{{{{\mathcal{S}}}}}}}}}\right)\hfill\\ &{V}_{{{{{{{{\mathcal{S}}}}}}}}}=\mathop{\bigcup }\limits_{k=1}^{N}\{{v}_{k,i}^{{\alpha }_{k}},i=1,\ldots ,{n}_{k},{\alpha }_{k}=1,\ldots ,{L}_{k}\}\hfill\\ &{E}_{{{{{{{{\mathcal{S}}}}}}}}}=\mathop{\bigcup }\limits_{k=1}^{N}\left(\{{e}_{i,j}^{{\alpha }_{k},{\alpha }_{k}},i,j=1,\ldots ,{n}_{k},{\left({A}_{k}^{[{\alpha }_{k}]}\right)}_{i,j}\ \ne\ 0\}\right.\hfill\\ &\quad\quad\ \cup \left.\{{e}_{i,i}^{{\alpha }_{k},{\beta }_{k}},i=1,\ldots ,{n}_{k},{\alpha }_{k}\ \ne\ {\beta }_{k}\,,\,{\alpha }_{k},{\beta }_{k}=1,\ldots ,{L}_{k}\}\right)\hfill\\ &\quad\quad\ \ \cup \mathop{\bigcup }\limits_{k,l=1;k\ne l}^{N}\{{e}_{i,j}^{{\alpha }_{k},{\alpha }_{l}},i=1,\ldots ,{n}_{k},j=1,\ldots ,{n}_{l},{\left({B}_{k,l}\right)}_{i,j}\ \ne\ 0\}\end{array}\right.$$

(11)

The normalization of the Supra-heterogeneous adjacency matrix ensures the convergence of the RWR to the steady-state

The most complex issue is the normalization of the Supra-heterogeneous adjacency matrix into a Transition rate matrix that can be used in equation [4]. The normalization allows obtaining a Stochastic matrix that guarantees the convergence of the RWR to the steady-state18 (see elements of proof in Supplementary Note 1.A.1). It is important to note that we have chosen a column normalization. The resulting normalized matrix, denoted by \(\widehat{{{{{{{{\mathcal{S}}}}}}}}}\) is defined in equation [12]. We generalized the formalism of Li and Patra12 established for two heterogeneous monoplex networks (Supplementary Note 1.D). This generalization to universal multilayer networks is done thanks to the intra- and inter- multiplex network normalizations defined in equations [1314], with α [[1, N]], β [[1, N]]. In addition, \({c}_{{i}_{\alpha }}\) is the number of bipartite networks in which the node iα appears as source of the multiplex network α denoted by Mα.

$$\widehat{{{{{{{{\mathcal{S}}}}}}}}}=\left[\begin{array}{cccc}{\widehat{S}}_{11}&{\widehat{S}}_{12}&\ldots &{\widehat{S}}_{1N}\\ {\widehat{S}}_{21}&{\widehat{S}}_{22}&\ldots &{\widehat{S}}_{2N}\\ \vdots &\vdots &\ddots &\vdots \\ {\widehat{S}}_{N2}&{\widehat{S}}_{N2}&\ldots &{\widehat{S}}_{NN}\end{array}\right]$$

(12)

In equation [13], \({\widehat{S}}_{\alpha \alpha }\) defines the transition probabilities inside a given multiplex network. In the case of a multiplex network, if a node has no bipartite interactions with nodes from another multiplex networks, we can use the standard normalization. If bipartite interactions exist, then the normalization takes into account the probability that the walker can stay in the multiplex network \((1-\mathop{\sum }\nolimits_{\beta = 1}^{{c}_{{i}_{\alpha }}}{\lambda }_{\alpha \beta })\). In equation [14], \({\widehat{S}}_{\alpha \beta }\) defines the transition probability between two different multiplex networks. There are here three possibilities. If the node has no bipartite interactions, the transition probability is equal to zero. If the node has bipartite interactions, the transition probability is equal to the standard normalization weighted by the jump probability (λαβ). Finally, if the node exists only in the bipartite network, the normalization corresponds to the standard normalization weighted by a modified jump probability. This normalization takes into account all the bipartite interactions of the considered node.

$${\widehat{S}}_{\alpha \alpha }({i}_{\alpha },{j}_{\alpha })=\left\{\begin{array}{l}\frac{{A}_{\alpha }({i}_{\alpha },{j}_{\alpha })}{\mathop{\sum }\limits_{{k}_{\alpha } = 1}^{{n}_{\alpha }}{A}_{\alpha }({i}_{\alpha },{k}_{\alpha })} \; \; \; {{{{{{{\rm{if}}}}}}}}\ \forall \beta \,:\,\mathop{\sum }\limits_{{k}_{\beta }=1}^{{n}_{\beta }}{B}_{\alpha ,\beta }({i}_{\alpha },{k}_{\beta })=0\\ \frac{\left(1-\mathop{\sum }\limits_{\beta = 1}^{{c}_{{i}_{\alpha }}}{\lambda }_{\alpha \beta }\right)* {A}_{\alpha }({i}_{\alpha },{j}_{\alpha })}{\mathop{\sum }\limits_{{k}_{\alpha } = 1}^{{n}_{\alpha }}{A}_{\alpha }({i}_{\alpha },{k}_{\alpha })} \; \; \; {{{{{{{\rm{Otherwise}}}}}}}}\hfill\end{array}\right.$$

(13)

$${\widehat{S}}_{\alpha \beta }({i}_{\alpha },{j}_{\beta })=\left\{\begin{array}{l}\frac{{\lambda }_{\alpha \beta }{B}_{\alpha ,\beta }({i}_{\alpha },{j}_{\beta })}{\mathop{\sum }\limits_{{k}_{\beta } = 1}^{{n}_{\beta }}{B}_{\alpha ,\beta }({i}_{\alpha },{k}_{\beta })} \; \; \; {{{{{{{\rm{if}}}}}}}}\mathop{\sum }\limits_{{k}_{\beta }=1}^{{n}_{\beta }}{B}_{\alpha ,\beta }({i}_{\alpha },{k}_{\beta })\ \ne\ 0\\ \frac{\frac{{\lambda }_{\alpha \beta }}{\mathop{\sum }\limits_{\beta = 1}^{c}{\lambda }_{\alpha \beta }}\mathop{\sum }\limits_{{i}_{\alpha }=1}^{c}{B}_{\alpha ,\beta }({i}_{\alpha },{j}_{\beta })}{\mathop{\sum }\limits_{{i}_{\alpha }=1}^{c}\mathop{\sum }\limits_{{k}_{\beta }=1}^{{n}_{\beta }}{B}_{\alpha ,\beta }({i}_{\alpha },{k}_{\beta })} \; \; \;{{{{{{{\rm{if}}}}}}}}\,{i}_{\alpha }\,{{{{{{{\rm{not}}}}}}}}\,{{{{{{{\rm{in}}}}}}}}\,{M}_{\alpha }\\ 0 \; \; \; {{{{{{{\rm{Otherwise}}}}}}}}\end{array}\right.$$

(14)

The normalization allows including the parameters λαβ to jump between the multiplex networks (Fig. 2). In other words, these parameters weight the jumps from one multiplex network α to another multiplex network β, if the bipartite interaction exists. Moreover, the standard probability condition of normalization imposes that \(\mathop{\sum }\nolimits_{\alpha = 1}^{N}{\lambda }_{\alpha \beta }=1,\forall \,\beta\), where N represents the number of multiplex networks. Finally, the RWR equation on universal multilayer networks is defined as:

$${{{{{{{{\bf{p}}}}}}}}}_{t+1}^{T}=(1-r)\widehat{S}{{{{{{{{\bf{p}}}}}}}}}_{t}^{T}+r{{{{{{{{\bf{p}}}}}}}}}_{0}^{T}.$$

(15)

RWR initial probability distribution in universal multilayer networks

The initial probability distribution p0 from equation [15], which contains the probabilities to restart on the seed(s), can be written in its general form as follows:

$${{{{{{{{\bf{p}}}}}}}}}_{0}^{T}=\left[\begin{array}{c}{\eta }_{1}{\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{1}\\ {\eta }_{2}{\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{2}\\ \ldots \\ {\eta }_{N}{\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{N}\end{array}\right]$$

(16)

where ηk is the probability to restart in one of the layers of the multiplex network k, and \({\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{k}\) is the initial probability distribution of the multiplex network k. The size of \({\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{k}\) is equal to (Lk*nk), where Lk is the number of layers in the multiplex network k and nk is the number of nodes in the multiplex network k. We constraint the parameter η with the standard condition of normalization of the probability that imposes \(\mathop{\sum }\nolimits_{k = 1}^{N}{\eta }_{k}=1\). We defined another parameter, τ, to take into account the probability of restarting in the different layers of a given multiplex network. This parameter includes τkj, where k corresponds to the index of the multiplex network, and j to the index of the layer of the multiplex network k (Fig. 2). In other words, τkj corresponds to the probability to restart in the jth layer of the multiplex network k. Finally, \({\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{k}\) is defined as follows: \({\bar{{{{{{{{\bf{v}}}}}}}}}}_{0}^{k}={[{\tau }_{k1}{{{{{{{{\bf{v}}}}}}}}}_{0}^{k},{\tau }_{k2}{{{{{{{{\bf{v}}}}}}}}}_{0}^{k},\ldots , {\tau }_{k{L}_{k}}{{{{{{{{\bf{v}}}}}}}}}_{0}^{k}]}^{T}\), with \({{{{{{{{\bf{v}}}}}}}}}_{0}^{k}\) being a vector with 1/ωk in the position(s) of seed(s) and zeros elsewhere, and ωk being the number of seeds in the multiplex network k. The standard condition of normalization of the probability gives the constraint: \(\mathop{\sum }\nolimits_{j = 1}^{{L}_{k}}{\tau }_{kj}=1\),   k.

Numerical implementation: multiXrank

Our RWR on universal multilayer networks is implemented as a Python package called MultiXrank (Supplementary Note 2). MultiXrank has an optimized implementation. Default parameters allow exploring homogeneously the multilayer network (Supplementary Note 1.B). The running time of the package depends on the number of edges of the multilayer network (complexity analyses in Supplementary Note 2.A). The package is available on GitHub https://github.com/anthbapt/multixrank, and can be installed with standard pip installation command: https://pypi.org/project/MultiXrank.

Evaluations

We evaluated the performances of MultiXrank using two different multilayer networks. The first one is a large biological multilayer network composed of two multiplex networks and one monoplex network. It contains a gene multiplex network gathering gene physical and functional relationships, a drug multiplex network containing drug clinical and chemical relationships, and a disease monoplex network representing disease phenotypic similarities. Each monoplex/multiplex network is connected to the others thanks to bipartite networks containing gene-disease, drug-gene, and drug-disease interactions (Supplementary Note 3.B). The second multilayer network is composed of three multiplex networks. It contains a French airports multiplex network, a British airports multiplex network, and a German airports multiplex network. In each multiplex network, the nodes represent the airports of each country and the edges represent the national flight connections between these airports for three different airline companies. The three multiplex networks are linked with bipartite networks corresponding to transnational flight connections (Supplementary Note 3.A).

We designed a Leave-One-Out Cross-Validation (LOOCV) protocol inspired by F.Mordelet and J.P.Vert24 and A.Valdeolivas et al.15. In this protocol, we systematically leave-out some known associations and assess the reconstruction of this left-out data using the data remaining in the network (Supplementary Note 4.A and Fig. S9). In the case of the biological multilayer network,we systematically left-out known gene-disease associations. More specifically, for each disease associated with at least two genes, each gene is remove one-by-one and considered as the left-out gene. The remaining gene(s) associated with the same disease are used as seed(s). When the disease network is considered in the evaluation, the disease node is used as seed together with the gene node(s). The RWR algorithm is then applied, and all the network nodes are scored according to their proximity to the seed(s). The rank of the gene node that was left-out in the ongoing run is recorded. The perfect ranking for the left-out gene is 1; the closer the rank is to 1, the better the prediction. The gene left-out process is repeated iteratively for all the genes. Finally, the Cumulative Distribution Function (CDF) of the ranks of the left-out genes is plotted (Fig. 3). The CDF displays the ratio of left-out genes that are ranked by the RWR within the top-K ranked gene nodes. The CDFs are used to evaluate and compare the performance of the RWR applied to different combinations of biological networks: the protein-protein interactions (PPI) network alone, the gene multiplex network, the multilayer network composed of the gene multiplex and the disease monoplex networks, and the multilayer network composed of the gene and drug multiplex networks and the disease monoplex network (Fig. 3a).

Fig. 3: Evaluation and comparison of multiXrank performances on different combinations of multilayer networks.
figure 3

a, b Cumulative Distribution Functions (CDFs) representing the ranks of the left-out nodes in the Leave-One-Out Cross-Validation (LOOCV) protocol. a: focus on different combinations of biological networks: protein-protein interactions network alone (PPI), gene multiplex network (multi-1), multilayer network composed of the gene multiplex network and the disease monoplex network (multi-2), and multilayer network composed of the gene and drug multiplex networks and the disease monoplex network, for two different sets of parameters (multi-3, multi-3 bis). The multilayer networks are connected by the bipartite networks described in the Evaluations section. b: focus on different combinations of airports networks: French multiplex network (multi-1), multilayer network composed of the French and British airports multiplex networks (multi-2), and multilayer network composed of the French, British, and German airports multiplex networks, for two different sets of parameters (multi-3, multi-3 bis). These multilayer networks are connected by the bipartite networks described in the Evaluations section. c, d CDFs representing the ranks of the left-out nodes in the LOOCV protocol for the multi-3 multilayer networks described previously with artificially increased connectivity in the gene-drug and disease-drug bipartite networks. c The connectivity is artificially increased thanks to the addition of 1 (multi3+1), 2 (multi3+2) or 5 (multi3+5) transit drug nodes for each gene-disease association. d: In the airport multilayer network, the connectivity is artificially increased in the French-German and British-German bipartite networks thanks to the addition of 1 (multi3+1), 2 (multi3+2) or 5 (multi3+5) transit German nodes for each French-British airports association. The parameters of the Random Walk with Restart (RWR) are detailed in Supplementary Tables S5S6.

We observed that considering multiple sources of network data is always better than considering the PPI alone. In addition, considering multilayer information is better than considering only the gene multiplex network. However, the increased performances in the LOOCV seem to arise only from combining the gene multiplex network with the disease monoplex network (and associated gene-disease bipartite network). Indeed, the addition of the drug multiplex network (and associated drug-gene and drug-disease bipartite networks) to the multilayer system does not increase the performances (Fig. 3a).

We repeated the same LOOCV protocol for the airports multilayer network, in which the left-out nodes are French airport nodes associated with a given British airport node. Here, the behavior is different, as adding the third multiplex network containing German airports connections (and associated French-German and British-German bipartite networks) increases the performances of the RWR to predict the associations between French and British airports (Fig. 3b).

To better understand these different behaviors, we examined in detail the amount of common nodes (called overlaps) existing between the nodes of the different bipartite networks. We observed that only 23% of the genes from the gene-disease bipartite network are present in the drug-gene bipartite network. Similarly, only 5% of the diseases from the gene-disease bipartite network are present in the disease-drug bipartite network (Fig. S10). Given these low overlaps, the drug multiplex network might not contribute significantly to connecting gene and disease nodes during the random walks. This might explain why adding the drug multiplex network does not improve the performances of the LOOCV. Contrarily, the bipartite networks of the airport multilayer network displays high overlaps (Fig. S10). These high overlaps might explain why the addition of the third multiplex network in this case increases the predictive power (Fig. 3b).

To validate the proposed central role of bipartite networks in the RWR performances, we artificially increased the connectivity of the gene-drug and disease-drug bipartite networks before applying the same LOOCV protocol. To this goal, we added artificial transit drug nodes linking existing gene-disease associations (strategy described in Supplementary Note 4C and Fig. S12). We observed that these artificially added transit nodes increased drastically the performances of the LOOCV (Fig. 3c). The same phenomenon is observed for the airports multilayer network (Fig. 3d). In addition, we checked if random perturbations in these artificially enhanced bipartite networks would decrease the performances of the LOOCV. To do so, we progressively randomized the edges in the bipartite networks with artificially increased connectivity, until obtaining completely random bipartite networks. We observed that the progressive randomization of the bipartite networks continuously decreases the predictive power of the RWR up to obtaining the same performances as with only two multiplex networks (Fig. S13.A for the airport multilayer networks and S13.B for the biological multilayer networks).

Finally, we repeated all these evaluations using a standard Link Prediction (LP) protocol (Supplementary Note 4.B). LP has already been used to measure the predictive power of RWR methods25. In the LP protocol, we systematically removed gene-disease edges from the gene-disease bipartite network, and predicted the rank of the removed gene using the disease as seed in the RWR. The LP protocol is applied on the airport multilayer network by removing a French-British edge from the French-British bipartite network, and predicting the rank of the French airport using the British airport node as a seed in the RWR. We overall observed similar behaviors as in the LOOCV (Fig. S11 and S14).

Importantly, the LOOCV and LP protocols can be used to evaluate the pertinence of adding new multiplex networks in a multilayer network or new network layers in a multiplex network. Both evaluation protocols are available within the MultiXrank package.

Parameter space exploration

We next evaluated the stability of MultiXrank output scores upon variations of the input parameters. We illustrate this exploration of the parameter space with the biological multilayer network composed of the gene multiplex network and the disease monoplex network. We first compared the top-5 and top-100 gene and disease nodes prioritized by MultiXrank using 125 different sets of parameters (see Supplementary Note 5 for the definition of the sets of parameters). We observed that the top-ranked gene nodes vary more depending on the input parameters than the top-ranked disease nodes (Fig. 4a).

Fig. 4: Exploration of multiXrank parameter space.
figure 4

a Comparison of the top-5 and top-100 nodes ranked by MultiXrank using a biological multilayer networks composed of the gene multiplex network and the disease monoplex network for 125 different sets of parameters. The top-5 or top-100 ranked nodes for each set of parameters are merged, and the number of occurrences of each node are counted. The nodes are represented in bars colored in red when the node is found in all top-5 or top-100 scores, and in blue otherwise. b Clustering in the Principal Component Analysis (PCA) space of the output scores obtained with MultiXrank on the biological multilayer network composed of the gene multiplex network and the disease monoplex network using 125 different sets of parameters. The zoom-in emphasizes the clusters number 4 and 6. c Comparison of the top-100 nodes retrieved for the sets of parameters belonging to clusters 4 and 6 defined in (b). The bar is colored in red when a node is found in all top-100 scores, and in blue otherwise. The parameters of the Random Walk with Restart (RWR) are detailed in Supplementary Table S7.

To better understand the stability of the output scores upon variations of the input parameters, we proposed a protocol based on 5 successive steps: (i) definition of the sets of parameters, (ii) construction of a matrix containing the similarities of the RWR output scores obtained with each set of input parameters, using a the similarity measure defined in equation [17]. The similarities are computed for each type of node independently (i.e., for gene and disease nodes independently).

$${{{\Theta }}}_{\gamma \sigma }^{k}=\mathop{\sum }\limits_{j=1}^{{n}_{k}}\frac{\sqrt{{\left(\frac{{1}}{\left[{\left({{{{{{{{\bf{r}}}}}}}}}_{\gamma }^{k}\right)}_{j}-{\left({{{{{{{{\bf{r}}}}}}}}}_{\gamma \sigma }^{k}\right)}_{j}\right]}\right)}^{2}+{\left(\frac{{1}}{\left[{\left({{{{{{{{\bf{r}}}}}}}}}_{\sigma }^{k}\right)}_{j}-{\left({{{{{{{{\bf{r}}}}}}}}}_{\sigma \gamma }^{k}\right)}_{j}\right]}\right)}^{2}}}{{\left(\frac{{\left({{{{{{{{\bf{r}}}}}}}}}_{\gamma }^{k}\right)}_{j}\ +\ {\left({{{{{{{{\bf{r}}}}}}}}}_{\sigma }^{k}\right)}_{j}}{{2}}\right)}^{2}}$$

(17)

where γ and σ define two sets of parameters, nk is the number of nodes associated with the multiplex network k. In addition, \({{{{{{{{\bf{r}}}}}}}}}_{\gamma }^{k}\) (resp. \({{{{{{{{\bf{r}}}}}}}}}_{\sigma }^{k}\)) is the rank output scores distribution that associates with each node its rank given by the RWR with the set of parameters γ (resp. σ) for the multiplex network k. Finally, \({{{{{{{{\bf{r}}}}}}}}}_{\gamma \sigma }^{k}\) (resp. \({{{{{{{{\bf{r}}}}}}}}}_{\sigma \gamma }^{k}\)) gives to each node of the output scores distribution obtained by the set of parameters γ (resp. σ) (in the multiplex network k) their rank in the distribution σ (resp. γ).

We next computed a consensus Similarity matrix with a normalized euclidean norm of each individual Similarity matrix (equation [18]).

$${{{\Theta }}}_{\gamma \sigma }=\sqrt{\mathop{\sum }\limits_{k=1}^{N}\frac{{({{{\Theta }}}_{\gamma \sigma }^{k})}^{2}}{{n}_{k}}}$$

(18)

where N is the number of multiplex networks.

The next step is (iii) projection of the consensus Similarity matrix into a Principal Component Analysis (PCA) space (Fig. 4b). In this PCA space, each dot represents the output scores resulting from a set of parameters. Then, (iv) clustering (using k-means on the two first principal components) to identify sub-regions containing similar RWR output scores. Finally, (v) comparing the top-ranked nodes obtained with the set of parameters belonging to each cluster (Fig 4c, Supplementary Note 5).

We applied this protocol to evaluate the output scores obtained by MultiXrank on the previously defined biological multilayer network composed of the gene multiplex network and the disease monoplex network, using 125 different combinations of parameters (Fig. 4, supplementary Fig. S16). We projected the consensus Similarity matrix into a PCA space and identified 8 clusters (Fig. 4b). To illustrate the behavior inside clusters, we concentrated our analyses on the two clusters defined in the bottom left subspace (clusters number 4 and 6, zoom-in Fig. 4b). The top-100 ranked gene and disease nodes inside each of the two clusters are overall similar (Fig. 4c). This means that, even if the node prioritization can be sensitive to input parameters, we can identify regions of stability in the parameter space. Moreover, the protocol allows identifying the monoplex/multiplex networks that generate most variability in the output scores upon changes in the input parameters.

We applied the parameter space exploration protocol to other multilayer networks and observed diverse behaviors, from highly variable top-rankings and scattered projections in the PCA space for the airport multilayer network (Supplementary Fig. S15) to robust top-rankings with well-clustered projections in the PCA space for the biological multilayer network composed of 3 types of nodes (genes, diseases and drugs, Supplementary Fig. S16). Overall, our parameter space study reveals different sensitivities to input parameters depending on the multilayer network explored. The protocol is available within the MultiXrank package and can be used to characterize in-depth the sensitivity to input parameters of any multilayer network.



Source link