dmfind: a network medicine tool for the analysis of omics data

Introduction

This vignette describes how to use the R package dmfind, which implements network diffusion-based analysis of omics data for the identification of differentially enriched modules. The tool implements functions for:

performing network diffusion (also known as network propagation) of one or more vectors of gene-weights throughout an interactome of gene-gene interactions;
defining ranked gene lists, where genes are prioritized according to the Network Smoothing Index (NSI), which is proportional to gene-weights and the network proximity of high gene-weights;
assessing the characteristics of the (top) networks defined by the top ranked genes, based on proximity of high scoring genes, enrichment in high scoring genes, modularity and number of communities;
obtaining the functional cartography of the top networks;
comparing multiple networks by topology;
visualizing the top networks and the relations between communities.

If you use this package please cite:

Bersanelli*, Mosca*, et al., Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules. Sci Rep 6, 34841 (2016). https://doi.org/10.1038/srep34841

Getting started

Installation

To install this package, start R and enter:

if (!requireNamespace("devtools", quietly = TRUE)){
  install.packages(devtools)
}
devtools::install_github("emosca-cnr/NPATools", build_vignettes=T)
devtools::install_github("emosca-cnr/dmfind", build_vignettes=T)

Overview of the workflow

The typical steps are the followings:

Definition of the input matrices;
Calculation of the network-smoothing index;
Definition of the top networks;
Topological analysis.

We will consider a toy modular network of 500 “genes” obtained using igraph::barabasi.game(), where a community is enriched in higher scores compared to the others.

Definition of the input matrices

The analysis requires the following inputs:

$\mathbf{W}$, an $N \times N$ normalized adjacency matrix, which represents the interactions between genes, e.g.:

#> 6 x 6 sparse Matrix of class "dgCMatrix"
#>            G1         G2         G3 G4         G5         G6
#> G1 .          0.06900656 0.04800154  . .          .         
#> G2 0.06900656 .          0.04637389  . .          .         
#> G3 0.04800154 0.04637389 .           . 0.04356068 0.05184758
#> G4 .          .          .           . .          .         
#> G5 .          .          0.04356068  . .          0.07001400
#> G6 .          .          0.05184758  . 0.07001400 .

$\mathbf{X}_0$, a $N \times m$ matrix with column vectors $\mathbf{x_0}$ that contain gene-related quantities;
a vector to assign each column of $\mathbf{X}_0$ to any of two classes, only for differential NSI.

IMPORTANT:

$\mathbf{W}$ and $\mathbf{X}_0$ must be defined over the same identifiers;
the normalization of the adjacency matrix must be done by means of the function normalize_adj_mat(), which requires a binary adjacency matrix;
the relevance of the genes corresponding to the rows of $\mathbf{X}_0$ is proportional to their values, that is, the higher the better; negative values are not allowed.

The network smoothing index (NSI)

The network smoothing index (NSI) compares the initial state $\mathbf{x}_0$ with the steady-state $\mathbf{x}_s$ reached after network diffusion:

\[S_j(\mathbf{x_0}) = \frac{x_{sj}}{x_{0j}+\epsilon}\] where the parameter $\epsilon$ tunes the relevance of the initial state. The impact of $\epsilon$ can be assessed by means of the function eval_eps(), which finds an optimal value of $\epsilon$ that takes into account the network proximity of $\mathbf{x_0}$ values and $\mathbf{x_s}$ values in the networks composed of the top $n$ genes ranked by $S$. The network proximity is quantified by $\Omega$ function (Bersanelli et al. 2016):

\[\Omega (n) = \mathbf{y}^T(n) \cdot \mathbf{A}(n) \cdot \mathbf{y}(n)\] calculated over $\mathbf{y} = \mathbf{x_0}$ or $\mathbf{y} = \mathbf{x_s}$. Let $O_{\mathbf{x}_0}(\epsilon)$ and $O_{\mathbf{x}_s}(\epsilon)$ be sums of $\Omega$ values over the top ranking genes $\{1, 2, ..., n\}$, calculated using $\mathbf{x_0}$ and $\mathbf{x_s}$ respectively. The optimal $\epsilon$ value is found as follows: \[\text{arg}\,\max\limits_{\epsilon}\, \Bigg( \frac{O_{\mathbf{x}_0}(\epsilon)}{\text{max} (O_{\mathbf{x}_0}(\epsilon))} + \frac{O_{\mathbf{x}_s}(\epsilon)}{\text{max}(O_{\mathbf{x}_s}(\epsilon)}\Bigg)\]

we will see below specific examples for the permutation-adjusted $S$ and differential $S$.

Permutation-adjusted $S$;

The Permutation-adjusted $S$ is defined as: \[Sp_j = S_j \cdot -log_{10}(p_j)\] where $p$ is a statistical assessment of $S$ based o vertex permutations applied to $\mathbf{X}_0$ and $\mathbf{W}$. The input matrix $\mathbf{X}_0$ contains one (typically) or more columns of any type of gene weights (e.g., $−\text{log}_{10}(p)$):

#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>             M
#> G1 .         
#> G2 0.02777778
#> G3 0.01388889
#> G4 .         
#> G5 0.19444444
#> G6 .

As mentioned above, we must find a value for$\epsilon$. The default behavior of eval_eps() consists in testing $\epsilon$ values that lie from two orders of magnitude below the minimum of strictly positive $\mathbf{X}_0$ values to two orders of magnitude above the maximum of $\mathbf{X}_0$:

ee <- eval_eps(X0 = X0, W = W)

Generating epsilon based on X0 values...
# orders of magnitude:  6 
# values considered:  24
eps: 0.0001388889 0.0002496515 0.0004487462 0.000806617 0.001449887 0.002606157 0.004684544 0.008420423 0.01513563 0.02720616 0.04890281 0.08790234 0.1580036 0.28401 0.5105054 0.9176286 1.649429 2.964833 5.329259 9.579293 17.21869 30.95043 55.63311 100 
Performing ND...
optimal eps: 0.5105054

In our case study we obtain a value of about 0.51, which is stored in the element ee$opt_eps. It is possible to perform a visual inspection of the effect of $\epsilon$ over $\Omega$ via plot_omega_eps()

plot_omega_eps(ee)

which shows the values of $\Omega$ by rank and the values of $O_{\mathbf{x}_0}(\epsilon))$ and $O_{\mathbf{x}_s}(\epsilon))$

The permutation-based NSI is calculated via calc_adjND(), where we use the optimal $\epsilon$ value obtained above and consider 99 permutations between vertices matched by degree:

resSp <- calc_adjND(X0 = X0, W = W, eps = ee$opt_eps, bin_type = "number", method = "d", cut_par = 2, k=99)
BPPARAM
class: SerialParam
  bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
  bpexportglobals: TRUE; bpforceGC: FALSE
  bplogdir: NA
  bpresultdir: NA
Permutation type: degree 
Total permutations: 100 
Minimum possible FDR: 0.01 
cut_par: 2 
bin_type: number 
vertex sets and bins: 307 144 
Min possible k: 5.550294e+249 
ND over X0 permutations
ND over W permutations
calculation of S
calculation of p-values

The result is a list that contains input and output quantities.

#>  [1] "X0"   "Xs"   "eps"  "S"    "pX0"  "pW"   "SpX0" "SpW"  "p"    "Sp"

The main stuff that one should consider tuning is the permutation approach. Please read the documentation ?calc_adjND().

Differential NSI (cases vs controls)

The differential NSI is calculated comparing the NSI of two classes of subjects, based on their average “profiles” $\bar{\mathbf{x}_{o1}$ and $\bar{\mathbf{x}_{o2}}$:

\[\Delta S_j = S_j (\bar{\mathbf{x}_{o2}}) - S_j (\bar{\mathbf{x}_{o1}})\] In this case each olumn of the input matrix $\mathbf{X}_0$ carries sample-level information (e.g., mutation profiles of patients)

#> 6 x 10 sparse Matrix of class "dgCMatrix"
#>                       
#> G1 . . . . . . 1 . . .
#> G2 . . . . . . . . . .
#> G3 1 1 1 1 1 . . . . .
#> G4 . . . . . . . 1 . .
#> G5 1 1 1 1 1 . . . . .
#> G6 1 1 1 1 . . . . . .

which will be averaged to obtain $\bar{\mathbf{x}_{o1}}$ and $\mathbf{x}_{o2}$ according to a given partition of subjects in two classes. To find an appropriate $\epsilon$ values for each of the two classes, we run eval_eps() on $\bar{\mathbf{x}_{o1}}$ and $\bar{\mathbf{x}_{o2}}$ :

X0ds_means <- calc_X0_mean(X0 = X0ds, classes = classes)
eeds <- eval_eps(X0 = X0ds_means, W = W, top = 100)
enerating epsilon based on X0 values...
# orders of magnitude:  5 
# values considered:  20 
eps: 0.002 0.003534632 0.006246812 0.01104009 0.01951133 0.03448268 0.0609418 0.1077034 0.190346 0.3364015 0.5945277 1.050718 1.856951 3.28182 5.800013 10.25046 18.11579 32.01633 56.58298 100 
Performing ND...
optimal eps: 1.050718 10.25046

In our case study we obtain a value of about 1 and 10 for respectively class 1 and 2. We can now run the analysis by:

resdS <- calc_dS(X0 = X0ds, W = W, classes = classes, eps = eeds$opt_eps)
network diffusion
calculation of dS

The result will contain the following quantities:

#> [1] "X0"  "Xs"  "eps" "S"   "dS"

Visualization of the NSI values

From now on, we will focus only on the results of $Sp$ analysis, because the functions that will be described below work analogously for $Sp$ and $\Delta S$.

The results NSI calculation can be visually explored throug the function plot_NSI(). In the case of $Sp$ two interesting plots are the distribution of $S$ values and p-value (that is “effect size” vs significance):

plot_NSI(resSp, x="S", y="p")

and the relation between $S_p$ and the initial state:

plot_NSI(resSp, x = "X0", y="Sp")

Note that genes with a high initial value gets high $S_p$. In addition some genes with low initial value get a high $S_p$, which implies a network proximity of these genes to those with high $\mathbf{x}_0$.

Definition of the top networks

The package dmfind provides two types of analysis to guide the choice of the “top” networks, that is the networks that contain the interactions between genes with high $\mathbf{X}_0$:

network resampling (NR), which scores the presence of genes with high NSI values in network proximity;
network enrichment (NE), which scores network enrichment in genes with high $\mathbf{X}_0$;
modularity analysis, which scores network modularity

These analysis are applied over a number of top genes by $S_p$ or $\Delta S$. In human genome-wide analysis, a typical number could be 500.

Network resampling

Network resampling can be used to asses the presence of significantly connected modules

nr_res <- NR(G = bg, rankedVector = sort(resSp$Sp[, 1], decreasing = T)[1:50], k = 99)
plot_NR(nr_res)

The element nr_res$NRsummary contains the NR results by rank

#>        id rank rankingScore    omega p_val
#> G5     G5    1    1.0000000 0.000000  1.00
#> G61   G61    2    0.6668873 0.000000  1.00
#> G286 G286    3    0.6622229 0.441628  1.00
#> G379 G379    4    0.6576295 1.099258  1.00
#> G284 G284    5    0.6263519 1.516964  0.52
#> G455 G455    6    0.5458936 1.878467  0.46

The visual representation is obtained plot_NR():

plot_NR(nr_res)

We observe that the real $\Omega$ values begins to be clearly higher than permuted values after approximately 20 genes, which indicates the presence of a high scoring connected gene module:

Network Enrichment

The enrichment analysis assess to which extent the top networks are enriched in high $\mathbf{x}_0$ values. Depending on how $\mathbf{x}_0$ is defined, this task can be done by means of Over Representation Analysis (ORA) or Gene Set Enrichment Analysis (GSEA). In the following example, we explore the enrichment of the networks formed by the top 50 genes by $S_p$, using GSEA:

ne_res <- assess_enrichment(G = bg, topList = names(sort(resSp$Sp[, 1], decreasing = T)[1:50]), X0Vector = setNames(X0[, 1], rownames(X0)), type = "gsea", minNetSize = 5, ranks = 1:50)

The table ne_res$en_summary cotains the results, which, as expected, show that the top networks are highly enriched in top scoring genes:

#>    rank size        es      nes nperm p_val  adj_p_val q_val       FDRq tags tags_perc list_top
#> 9    13   13 0.9868501 1.647037   100  0.01 0.01043084    NA 0.01002616    9 0.6923077       11
#> 20   24   24 0.9624399 1.645404   100  0.01 0.01043084    NA 0.01002616   11 0.4583333       14
#> 8    12   12 0.9886723 1.640034   100  0.01 0.01043084    NA 0.01002616    8 0.6666667        9
#> 14   18   18 0.9705192 1.638376   100  0.01 0.01043084    NA 0.01002616   10 0.5555556       14
#> 19   23   23 0.9586542 1.635739   100  0.01 0.01043084    NA 0.01002616   10 0.4347826       14
#> 23   27   27 0.9568215 1.633151   100  0.01 0.01043084    NA 0.01002616   13 0.4814815       15
#>    list_top_perc lead_edge
#> 9     0.02439024 0.6954689
#> 20    0.03104213 0.4690671
#> 8     0.01995565 0.6712225
#> 14    0.03104213 0.5606877
#> 19    0.03104213 0.4439252
#> 23    0.03325942 0.4951083

Comparison of NR and NE

The results of NR and NE can be combined to find possible networks that are optimal based on the two analyses. In our case study we found that the network composed of the top 24 genes is associated with the most significant p-values from NR and NE:

nr_ne_cmp <- cmp_top_net_scores(NRRes = nr_res, netEnrRes = ae_res)
Smallest highest scoring network at rank:  24 
Overlapping points detected, adding some jitter

The result is a table the reports the statistics rank by rank:

#>   rank p_val_NR   p_val_NE    score
#> 1    5     0.52 0.02105263 1.219172
#> 2    6     0.46 0.01020408 1.524909
#> 3    7     0.32 0.01020408 1.658422
#> 4    8     0.17 0.01020408 1.894081
#> 5    9     0.17 0.01010101 1.897891
#> 6   10     0.15 0.01000000 1.948715

The function also produces the figure NR_vs_NE.png.

Modularity assessment

The function assess_modularity() quantifies the community structure and the modularity among the networks composed of the top ranking genes. This is another piece of information that can be useful to select the top networks. In this example, we study the modularity between the top 50 genes sorted by $Sp$:

assess_mod_res <- assess_modularity(G = bg, vertices = rownames(resSp$Sp)[order(-resSp$Sp[, 1])[1:50]], ranks = 1:50)

The output is a table with the trend of the modularity and number of communities at various ranks between the top genes considered.

#>   rank n_vertex n_edge  algorithm modularity n_community n_vertex_max n_edge_max algorithm_max
#> 1    1        0      0 fastgreedy        NaN        -Inf            1          0    fastgreedy
#> 2    2        0      0 fastgreedy        NaN        -Inf            1          0    fastgreedy
#> 3    3        2      1   multilev  0.0000000           1            2          1      multilev
#> 4    4        4      2 fastgreedy  0.5000000           2            2          1      multilev
#> 5    5        5      3 fastgreedy  0.4444444           2            3          2      multilev
#> 6    6        6      4 fastgreedy  0.4062500           3            4          3    fastgreedy
#>   modularity_max n_community_max
#> 1            NaN               1
#> 2            NaN               1
#> 3      0.0000000               1
#> 4      0.0000000               1
#> 5      0.0000000               1
#> 6      0.1666667               2

These results can be plotted through plot_modu_trend():

plot_modu_trend(assess_mod_res)

Extraction of the top networks

Once the rank that is associated to the top networks that one considers optimal, these can be extracted by means of extract_module(). Here we consider the top 120 genes by $S_p$:

top_nets <- extract_module(graph = bg, selectedVertices = rownames(resSp$Sp)[order(-resSp$Sp[, 1])[1:20]], NSIRes = resSp, minSubnetSize = 2)

#> IGRAPH 9e6700b UN-- 20 21 -- Barabasi graph
#> + attr: name (g/c), power (g/n), m (g/n), zero.appeal (g/n), algorithm (g/c), name (v/c),
#> | comm_id (v/n), X0 (v/n), Sp (v/n), S (v/n), p (v/n), subnetId (v/n), subnetSize (v/n)
#> + edges from 9e6700b (vertex names):
#>  [1] G3  --G5   G3  --G6   G5  --G6   G6  --G29  G5  --G58  G6  --G150 G150--G230 G58 --G284
#>  [9] G61 --G284 G61 --G286 G5  --G298 G5  --G379 G3  --G388 G286--G398 G388--G413 G5  --G417
#> [17] G286--G417 G29 --G445 G152--G455 G286--G455 G388--G476

Topological charcaterization of top networks

Community structure

The function find_communities() runs a series of algorithms to detect the presence of communities:

res_comm <- find_communities(top_nets)

The element res_comm$info contains the modularity score $Q$ and the number of communities:

#>    algorithm modularity n
#> 1 fastgreedy  0.5079365 5
#> 2   multilev  0.4682540 5

In this case the fastgreedy algorithm found a better partition with the same number of community, and therefore we assign its results to the igraph object containing the top networks:

top_nets <- assign_communities(top_nets, res_comm$comm$fastgreedy)

Functional cartography

The function functional_cartography() performs the functional cartography analysis (Guimera and Nunes Amaral 2005), where each gene is scored by its participation coefficient $P$ and the within module degree score, which is a z-score (Guimera and Nunes Amaral 2005) or a probability of interaction (Joyce et al. 2010). These values are used to classify genes in 7 classes:

ultra peripheral nodes (R1): nodes with all their links within their module;
peripheral nodes (R2): nodes with most links within their module;
non-hubs connector nodes (R3): nodes with many links to other modules;
non-hubs kinless nodes (R4): nodes with links homogeneously distributed among all modules.
provincial nodes (R5): hub nodes with the vast majority of links within their module (pc <= 0.3);
connector hubs (R6): hubs with many links to most of the other modules (0.30 < pc <= 0.75);
kinless hubs (R7): hubs with links homogeneously distributed among all modules (pc > 0.75).

top_nets <- functional_cartography(top_nets)

The output is the igraph object given in input with the addition of vertex attributes R, P, and wmd_score.

#> IGRAPH 9e6700b UN-- 20 21 -- Barabasi graph
#> + attr: name (g/c), power (g/n), m (g/n), zero.appeal (g/n), algorithm (g/c), name (v/c),
#> | comm_id (v/n), X0 (v/n), Sp (v/n), S (v/n), p (v/n), subnetId (v/n), subnetSize (v/n),
#> | R (v/c), P (v/n), wmd_score (v/n)
#> + edges from 9e6700b (vertex names):
#>  [1] G3  --G5   G3  --G6   G5  --G6   G6  --G29  G5  --G58  G6  --G150 G150--G230 G58 --G284
#>  [9] G61 --G284 G61 --G286 G5  --G298 G5  --G379 G3  --G388 G286--G398 G388--G413 G5  --G417
#> [17] G286--G417 G29 --G445 G152--G455 G286--G455 G388--G476

A dedicated function exist to obtain the functional cartography:

plot_fc(top_nets, labelBy = "name")

Visualization

The function plot_network() provides a means to produce network visualizations. Please read its documentation (?plot_network) for further details. For example, here we plot the top network producing a layout that takes into account the community structure and color by community:

lo_tn <- plot_network(top_nets, vertex.size=5, colorQuant = T, colorBy = "comm_id", community = res_comm$comm$fastgreedy, pal=pals::alphabet(max(res_comm$comm$fastgreedy$membership)), labelBy = "name",  vertex.label.degree=pi/2, vertex.label.dist=1)

while in this second example we color by $\mathbf{X_0}$

plot_network(top_nets, vertex.size=5, colorQuant = T, colorBy = "X0", pal=pals::brewer.purples(3), lo = lo_tn, labelBy = "name", vertex.label.degree=pi/2, vertex.label.dist=1)

and lastly by $S_p$:

plot_network(top_nets, vertex.size=5, colorQuant = T, colorBy = "Sp", pal=pals::brewer.purples(3), lo = lo_tn, labelBy = "name", vertex.label.degree=pi/2, vertex.label.dist=1)

Interactome derived from STRING v12

This package contains an interactome derived from STRING db v12.0 (Szklarczyk et al. 2022), consisting of 17288 genes and 174962 interactions. Only interactions with confidence >=700 and the top 3 interactions with confidence >=400 & < 700 were considered. Confidence score was calculated without text mining evidence, as described at the URL https://string-db.org. Original protein identifiers were mapped to Entrez Gene identifiers using mappings provided by STRING (https://string-db.org) and NCBI (https://ftp.ncbi.nlm.nih.gov, download date 2023-09-19). The interactome is available as an igraph object:

data(dmfind)
string.v12.Entrez.ntm.700.400.k3

#> IGRAPH 6f011b4 UN-- 17288 174962 -- 
#> + attr: name (v/c), symbol (v/c), neighborhood (e/n), neighborhood_transferred (e/n),
#> | fusion (e/n), cooccurence (e/n), homology (e/n), coexpression (e/n),
#> | coexpression_transferred (e/n), experiments (e/n), experiments_transferred (e/n),
#> | database (e/n), database_transferred (e/n), combined_score (e/n)
#> + edges from 6f011b4 (vertex names):
#>  [1] 1066--1571   1066--4353   1066--9      1066--978    1066--7365   1066--2990   1066--8824  
#>  [8] 1066--7363   1066--1576   1066--10     1066--79799  1544--1548   1544--1571   1544--7498  
#> [15] 1544--9      1544--1558   1544--56603  1544--1551   1544--8854   1544--7364   1544--316   
#> [22] 1544--412    1544--220    1544--8644   1544--1559   1544--1562   1544--1555   1544--216   
#> [29] 1544--1579   1544--3284   1544--6820   1544--29785  1544--284541 1544--340665 1544--1586  
#> + ... omitted several edges

The matrix $W$ can be derived as follows:

A <- get.adjacency(string.v12.Entrez.ntm.700.400.k3, type = "both")
W <- normalize_adj_mat(A)

References

Bersanelli, Matteo, Ettore Mosca, Daniel Remondini, Gastone Castellani, and Luciano Milanesi. 2016. “Network Diffusion-Based Analysis of High-Throughput Data for the Detection of Differentially Enriched Modules.” Scientific Reports 6 (1): 34841.

Guimera, Roger, and Luı́s A Nunes Amaral. 2005. “Functional Cartography of Complex Metabolic Networks.” Nature 433 (7028): 895–900.

Joyce, Karen E, Paul J Laurienti, Jonathan H Burdette, and Satoru Hayasaka. 2010. “A New Measure of Centrality for Brain Networks.” PloS One 5 (8): e12200.

Szklarczyk, Damian, Rebecca Kirsch, Mikaela Koutrouli, Katerina Nastou, Farrokh Mehryary, Radja Hachilif, Annika L Gable, et al. 2022. “The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest.” Nucleic Acids Research 51 (D1): D638–46. https://doi.org/10.1093/nar/gkac1000.