margheRita.Rmd
Untargeted metabolomics studies by mass spectrometry technologies generate huge numbers of metabolite signals, requiring computational analyses for post-acquisition processing and dedicated databases for metabolite identification. Web-based data processing solutions frequently include only a part of the entire workflow thus implying the use of different platforms. The R package margheRita addresses the complete workflow for metabolomic profiling in untargeted studies based on liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS). This pipeline is specially advantageous in the case of data-independent acquisition (DIA), where all MS/MS spectra are acquired with high resolution. Interestingly, the R package margheRita enhances fragment matching accuracy by providing an original metabolite spectral library acquired in both polarities using different chromatographic columns.
The package provides:
Source code: https://github.com/emosca-cnr/margheRita
Citation: Ettore Mosca, Marynka Ulaszewska, Zahrasadat Alavikakhki, Edoardo Niccolò Bellini, Valeria Mannella, Gianfranco Frigerio, Denise Drago, Annapaola Andolfo. MargheRita: an R package for LC-MS/MS SWATH metabolomics data analysis and confident metabolite identification based on a spectral library of reference standards. bioRxiv 2024.06.20.599545; doi: https://doi.org/10.1101/2024.06.20.599545
Contacts:
The package requires a series of other R packages, which are available in CRAN, Bioconductor or github, namely:
## Biobase, ComplexHeatmap, LSD, car, circlize, clusterProfiler, devtools, grDevices, graphics, openxlsx, pals, pcaMethods, stats, stringr, utils
In most of the cases, the following instructions guarantee that all such dependencies are installed:
install.packages("devtools")
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install(c("Biobase", "ComplexHeatmap", "clusterProfiler", "pcaMethods"))
devtools::install_github("emosca-cnr/margheRita", dependencies = T)The functions as.metaboset() and
as.PomaSummarizedExperiment enable margheRita data
structure export in formats that are compatible with, respectively,
packages notame and
POMA.
If interested in these functions the user must install such
packages:
BiocManager::install("POMA")
devtools::install_github("antonvsdata/notame")margheRita is intended to be used after having done a number of data acquisition steps through MS-Dial (Tsugawa et al. 2015). It requires two text files in tab-delimited format:
A couple of remarks:
short sample ids yield better figures;
the collapse of technical replicates (see below) takes place over all samples that share the same concatenation of “class” and “biological_rep” values.
The function read_input_file() read the two files and
creates the “mRList” object, which is used by most of the margheRita
functions as main input/output:
mL <- read_input_file(feature_file = "MS-Dial_file.txt", sample_file = "sample_info.txt")Please, also consider that, to properly split QC samples, you must indicate exactly “QC” as class of a sample in the sample_file.
The initial mRList object contains the following elements:
| Element of mRList | Description |
|---|---|
| data | matrix containing metabolite abundances |
| metab_ann | metabolite annotation |
| sample_ann | sample annotation |
| QC | matrix containing metabolite abundances of QC samples |
| QC_ann | annotation of QC samples |
## DD_mealC_t7_1 MM_mealB_t0_1 AA_mealA_t0_1
## F506 108.1681 141.9220 314.8479
## F507 462.1838 466.3308 810.1950
## F511 52674.7600 79656.2100 24772.5900
## F513 2300.2920 9686.2540 696.9471
## F515 52686.7100 79673.6700 24777.8200
## F576 685.8444 2175.1690 2713.2550
## Feature_ID MSDialName
## F506 F506 Unknown
## F507 F507 Unknown
## F511 F511 Trimethylamine N-oxide; CE30; UYPYRKYUKCHHIB-UHFFFAOYSA-N
## F513 F513 Unknown
## F515 F515 Trimethylamine??N-oxide
## F576 F576 Unknown
## MSDialSMILES rt mz
## F506 null 1.260 76.03783
## F507 null 0.868 76.03820
## F511 CN(=O)(C)C 0.888 76.07437
## F513 null 1.148 76.07460
## F515 O=N(C)(C)C 0.855 76.07492
## F576 null 4.846 81.06902
## id injection_order batch class technical_rep
## DD_mealC_t7_1 DD_mealC_t7_1 14 1 DD 1
## MM_mealB_t0_1 MM_mealB_t0_1 15 1 MM 1
## AA_mealA_t0_1 AA_mealA_t0_1 16 1 AA 1
## AA_mealC_t1_1 AA_mealC_t1_1 17 1 AA 1
## DD_mealA_t6_1 DD_mealA_t6_1 18 1 DD 1
## MM_mealC_t8_1 MM_mealC_t8_1 19 1 MM 1
## biological_rep subj_meal_time meal time subj_meal
## DD_mealC_t7_1 mealC_t07 DD_mealC_t07 mealC t07 DD_mealC
## MM_mealB_t0_1 mealB_t00 MM_mealB_t00 mealB t00 MM_mealB
## AA_mealA_t0_1 mealA_t00 AA_mealA_t00 mealA t00 AA_mealA
## AA_mealC_t1_1 mealC_t01 AA_mealC_t01 mealC t01 AA_mealC
## DD_mealA_t6_1 mealA_t06 DD_mealA_t06 mealA t06 DD_mealA
## MM_mealC_t8_1 mealC_t08 MM_mealC_t08 mealC t08 MM_mealC
## QC01 QC07 QC23
## F506 347.2336 162.1097 216.9347
## F507 857.4278 481.7928 968.1245
## F511 71377.8900 38373.6100 61439.0300
## F513 6684.7820 2336.0220 4889.8950
## F515 71396.5500 38378.4600 60577.0500
## F576 1095.8860 699.0059 994.5408
## id injection_order batch class technical_rep biological_rep
## QC01 QC01 9 1 QC 1 QC
## QC07 QC07 25 1 QC 7 QC
## QC23 QC23 131 1 QC 23 QC
## QC26 QC26 144 1 QC 26 QC
## QC29 QC29 167 1 QC 29 QC
## QC30 QC30 168 1 QC 30 QC
## subj_meal_time meal time subj_meal
## QC01 QC_QC QC QC QC_QC
## QC07 QC_QC QC QC QC_QC
## QC23 QC_QC QC QC QC_QC
## QC26 QC_QC QC QC QC_QC
## QC29 QC_QC QC QC QC_QC
## QC30 QC_QC QC QC QC_QC
The full version of the “Urine” dataset, which was used for margheRita assessment and to generate this documentation, is available at https://doi.org/10.5281/zenodo.11243781, files “Urine_RP_NEG_norm.txt” and “Urine_RP_POS_norm.txt”. The corresponding sample information files can be accessed as follows:
sample_file_NEG <- system.file("extdata", "Urine_RP_NEG_norm_metadata.txt", package = "margheRita")
sample_file_POS <- system.file("extdata", "Urine_RP_POS_norm_metadata.txt", package = "margheRita")To support interoperability with other packages with a focus on metabolomics, the margheRitaList can be reorganized as a “MetaboSet” object, used by package “notame” (Klåvus et al. 2020) (see above the installation instruction to enable this opportunity):
ms <- as.metaboset(mRList)## MetaboSet object with 303 features and 253 samples.
## 10 QC samples included
## 303 non-flagged features, 0 flagged features.
##
## class:
## DD: 81, MM: 81, AA: 81, QC: 10
##
## The object has the following parts (splits):
## FALSE: features
or as “PomaSummarizedExperiment” object, used by package “POMA” (Castellano-Escuder et al. 2021):
se <- as.PomaSummarizedExperiment(mRList)## class: SummarizedExperiment
## dim: 303 253
## metadata(0):
## assays(1): ''
## rownames(303): F506 F513 ... F42170 F42216
## rowData names(0):
## colnames(253): DD_mealC_t7_1 MM_mealB_t0_1 ... QC45 QC55
## colData names(10): class biological_rep ... injection_order
## technical_rep
The function filtering() runs filters
to exclude features/sample with many missing values, features with wrong
m/z values and, lastly, performs imputation of missing values:
mL <- filtering(mL)
# Samples with >= 100 metabolites 243 / 243
# Features occurring in >= 3 samples 604 / 604
# Features with appropriate m/z values: 548
# Features without appropriate m/z values: 56
No NAs: imputation not performed.These three steps can be called independently through the function
filter_NA(), m_z_filtering() and
imputation(), respectively. In particular,
m_z_filtering() remove features with m/z that have decimal
value within [4, 8] (by default), while the imputation is performed
replacing NA values with a random number, calculated between 10%-25% of
the minimum value of the feature.


The function heatscatter_chromatography() creates a
graphic overview of the mz and rt values in the dataset:

margheRita provides three ways for normalizing metabolite profiles:
For “reference” and “pqn” methods, the column reference
must be present in mRList$metab_ann. If missing, the
function calc_reference() sets up such column using average
metabolite values and medians of QC samples. For example, here’s a call
to normalize_profiles() using pqn:
mL_norm <- normalize_profiles(mL, method = "pqn")
PQN normalization
No reference profile found, using calc_reference() function...
Using QC...The comparison of the coefficient of variation of a metabolite in relation to QC samples provides a means to exclude low quality features. In particular, only features that have a CV ratio between no-QC samples and QC sample higher than a given threshold (by default 1) are kept:
mL_norm <- CV_ratio(mRList = mL_norm)
Summary of CV ratio (samples / QC):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3593 0.8025 1.1032 1.4645 1.6516 13.4896
# Metabolites with appropriate CV 303 / 539 The distributions of metabolite relative log-abundances can be calculated and visualized by means of:
mL <- RLA(mRList = mL)Typically, after normalization, the various samples should have
similar distributions of relative log-abundance, a powerful
tool to visualize unwanted variation in high-dimensional data (Gandolfo 2018).

margheRita performs Principal Component Analysis (PCA) using the
function mR_pca(), which relies on the package pca_methods
(Stacklies et al.
2007). Besides choosing the scaling method (argument
scaling) and number of PCs (nPcS), it allows
to include/exclude quality control samples by means of argument
include_QC:
mL_norm <- mR_pca(mRList = mL_norm, nPcs=5, scaling="uv", include_QC=FALSE)The results are added to the mRList in the element pca.
It also provides some plots, like the visualization of distribution of
loadings for all-pairs of the top 5 PCs. The plots are directly saved in
the current working directory (or in the sub directory created with the
argument dirout). The results of PCA can be plotted using
Plot2DPCA() function. The argument col_by
enables the choice of the mRList$sample_ann column to be
used to color samples:
Plot2DPCA(mRList = mL_norm, pcx=1, pcy=2, col_by="class", include_QC=TRUE)
It is common that the inspection of the similarity among samples
(e.g. distribution over the top PCs, RLA) rises concerns about the
quality of some sample. The function remove_samples()
allows the user to remove one or more samples from the
mRList. Here, for example we remove all “Blank”
samples:
mL <- remove_samples(mRList = mL, ids = "Blank", column = "class")In this case, the function removes all samples with value “Blank” in the column “class” of sample annotation.
The definition of mean metabolite abundance for every biological
replicate is performed by means of collapse_tech_rep()
function:
mL_norm_bio <- collapse_tech_rep(mRList = mL_norm)## AA_mealA_t00 AA_mealA_t01 AA_mealA_t02 AA_mealA_t03 AA_mealA_t04
## F506 372.5212 314.9164 641.2731 328.3019 183.8177
## F513 686.5940 583.5344 575.2296 517.7881 438.2448
## F576 1859.9782 3005.7498 2431.0017 1601.8178 2010.3795
## F848 384.4542 407.5416 431.8409 297.6953 204.5562
## F958 806.3609 718.5354 677.2341 629.7116 648.8627
## F1016 3728.8124 3083.9761 4084.9936 2558.5845 2531.7107
## class_biorep class biological_rep
## AA_mealA_t00 AA_mealA_t00 AA mealA_t00
## AA_mealA_t01 AA_mealA_t01 AA mealA_t01
## AA_mealA_t02 AA_mealA_t02 AA mealA_t02
## AA_mealA_t03 AA_mealA_t03 AA mealA_t03
## AA_mealA_t04 AA_mealA_t04 AA mealA_t04
## AA_mealA_t05 AA_mealA_t05 AA mealA_t05
MargheRita provides some functions to calculate mean and variability, fold changes and to test for metabolite variations.
The function mean_median_stdev_samples() calculates
mean, median and standard deviation of metabolite abundance according to
the sample classes specified in the column “class” of sample
annotation:
mean_median_stdev_samples(mL_norm_bio)
According to dataset size, this might take a few minutes.
Calculating means...
Calculating medians...
Calculating standard deviations...The function univariate() performs dataset-wide
statistical tests (Student t-tests, Wilcoxon test, Anova and
Kruskal-Wallis test) between levels of a particular factor defined in
the sample annotation:
mL_norm_bio <- univariate(mL_norm_bio, test_method="anova", exp.levels = c("AA", "DD", "MM"), exp.factor = "class")## F p q DD-AA MM-AA
## F506 56.603887 6.500895e-16 9.379863e-15 0.000000e+00 0.000000e+00
## F513 11.464627 4.315428e-05 1.379828e-04 6.623790e-02 2.340794e-05
## F576 27.699625 8.141555e-10 6.167228e-09 8.774088e-06 6.781009e-10
## F848 1.299677 2.784574e-01 3.486471e-01 3.061504e-01 9.812010e-01
## F958 144.803482 5.517268e-27 2.786220e-25 0.000000e+00 3.929312e-01
## F1016 17.463922 5.400686e-07 2.517551e-06 3.059389e-02 4.030582e-03
## MM-DD
## F506 8.275582e-01
## F513 3.658063e-02
## F576 7.333725e-02
## F848 4.027080e-01
## F958 0.000000e+00
## F1016 2.703868e-07
The full results of the analysis are saved to the text file
We found useful providing a function to retrieve the list of significant features:
significant_features <- select_sign_features(mL_norm_bio, test_method="anova", test_value = "q", cutoff_value = 0.05)## [1] "F3957" "F18426" "F19199" "F10248" "F9507" "F958"
Metabolite identification in margheRita is performed by means of the
function metabolite_identification(), which requires an
mRList object and a reference library with MS and MS/MS metabolite
information. The identification is possible up to level-1, provided that
the required information are available in the reference library. The
identification is based on the quantification of the following
quantities:
Such quantities are used to score the similarity among precursors of features and metabolites, as well as their MS/MS spectra.
The function select_library() provides a means to select
any of two sources:
margheRita, which contains MS and MS/MS information for about 800 metabolites spanning several biological functions; these libraries provideup to level 1 identifications in positive and negative modalities for “HILIC”, “LipC8”, “pZIC”, “RPLong” and “RPShort” chromatographic columns, that are acquired following the methods reported in the supplementary material of Mosca et al. (2024).
MS-Dial, which covers a much larger set of metabolites (\(10^5\)), but is limited to level 2 identifications in positive and negative modalities.
In this example, we load the margheRita library in positive modality with retention times of RPShort columns and we discard all peaks with relative intensity less than 10:
mR_library <- select_library(column = "RPShort", mode = "POS", accept_RI=10)The resulting mR_library is a list that contains
metadata and MS information about precursors
## ID CAS Name rt mz PubChemCID
## L10 L10 485-80-3 S-(5'-Adenosyl)-L-methionine 0.80 399.14452 34756
## L14 L14 61-19-8 Adenosine monophosphate 1.40 348.07037 6083
## L17 L17 979-92-0 S-Adenosylhomocysteine 1.28 385.12887 439155
## L20 L20 56-86-0 L-Glutamic acid 0.90 148.06044 33032
## L30 L30 56-40-6 Glycine 0.85 76.03931 750
## L33 L33 56-41-7 L-Alanine 0.88 90.05496 5950
## SMILES
## L10 C[S+](CC[C@H](N)C(O)=O)C[C@H]1O[C@H]([C@H](O)[C@@H]1O)N1C=NC2=C1N=CN=C2N
## L14 NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(O)=O)[C@@H](O)[C@H]1O
## L17 N[C@@H](CCSC[C@H]1O[C@H]([C@H](O)[C@@H]1O)N1C=NC2=C(N)N=CN=C12)C(O)=O
## L20 N[C@@H](CCC(O)=O)C(O)=O
## L30 NCC(O)=O
## L33 C[C@H](N)C(O)=O
and MS/MS peaks
## $M1
## [,1] [,2]
## [1,] 45.03236 10.34602
## [2,] 70.02762 12.46120
## [3,] 96.00721 13.50730
## [4,] 99.04249 17.68019
## [5,] 113.03337 100.00000
## [6,] 117.05313 20.43913
##
## $M2
## [,1] [,2]
## [1,] 57.03256 11.18379
## [2,] 60.07970 58.46973
## [3,] 85.02673 100.00000
## [4,] 95.08432 12.06547
## [5,] 109.09995 11.77031
## [6,] 144.10080 38.14543
## [7,] 183.17293 27.40629
## [8,] 285.20408 26.57983
## [9,] 344.27686 33.27335
##
## $M3
## [,1] [,2]
## [1,] 55.01692 100.00000
## [2,] 56.04866 58.49046
## [3,] 70.06436 54.29821
## [4,] 98.05859 90.19182
## [5,] 116.06901 16.49188
Once the library is selected, metabolite identification can be
performed by the homonymous function, where the argument
features specifies the features to be considered (all
features if it is left features=NULL, as in the following
example):
mL_norm_bio <- metabolite_identification(mL_norm_bio, library_list = mR_library)The function metabolite_identification() has a series of
parameters that can be adjusted to optimize the identification process
(see its documentation). When metabolite identification is applied on a
large number of features (e.g., \(10^4\)), it is common to obtain
many-to-many associations between features and metabolite. By default,
this redundancy is addressed following an algorithm that considers the
classification (Level 1, Level 2, Level 3a and Level 3b), the errors
(see above) and a series of quantitative and qualitative scores (see
below and Mosca et al. (2024) for further details). This
function adds/modifies the following items in the mRList and saves each
of them to a sheet of the xlsx file
“metabolite_identification.xlsx”:
mL_norm_bio$metabolite_identification$associations: all
the resulting associations (sheet “associations”)## Feature_ID rt mz RT_err RT_class RT_flag ppm_error mass_flag
## 684 F14798 2.931 282.1190 0.431 super TRUE 2.3543199 TRUE
## 165 F16399 5.698 299.1265 1.362 unacceptable FALSE 4.4438533 TRUE
## 731 F9130 1.708 218.1370 0.388 super TRUE 7.6480152 TRUE
## 296 F10165 4.346 231.1129 0.776 acceptable TRUE 0.1823525 TRUE
## 317 F17972 7.157 316.2458 0.293 super TRUE 7.6487003 TRUE
## mass_status ID_peaks peaks_found_ppm_RI matched_peaks_ratio
## 684 super M228 1 1.0000000
## 165 super M1110 5 0.7142857
## 731 acceptable M505 2 0.6666667
## 296 super M1028 9 0.6000000
## 317 acceptable M1988 3 0.6000000
## precursor_in_MSMS ID Name rt_lib
## 684 FALSE L800 1-Methyladenosine 2.50
## 165 FALSE L1351 Enterolactone 7.06
## 731 FALSE L905 Propionyl-L-carnitine 1.32
## 296 FALSE L1659 1,2,3,4-Tetrahydroharmane-3-carboxylic acid 3.57
## 317 FALSE L1706 Decanoyl-L-carnitine 7.45
## mz_lib SMILES Level
## 684 282.1197 CN1C=NC2=C(N=CN2[C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)C1=N 1
## 165 299.1278 C1C(C(C(=O)O1)CC2=CC(=CC=C2)O)CC3=CC(=CC=C3)O 2
## 731 218.1387 CCC(=O)OC(CC([O-])=O)C[N+](C)(C)C 1
## 296 231.1128 CC1NC(CC2=C1NC1=CC=CC=C21)C(O)=O 1
## 317 316.2482 CCCCCCCCCC(=O)OC(CC([O-])=O)C[N+](C)(C)C 1
## Level_note
## 684
## 165
## 731
## 296
## 317
mL_norm_bio$metabolite_identification$associations_summary:
a summary of the associations (“summary”):## Feature_ID ID Name Level
## 684 F14798 L800 1-Methyladenosine 1
## 165 F16399 L1351 Enterolactone 2
## 731 F9130 L905 Propionyl-L-carnitine 1
## 296 F10165 L1659 1,2,3,4-Tetrahydroharmane-3-carboxylic acid 1
## 317 F17972 L1706 Decanoyl-L-carnitine 1
## Level_note RT_err ppm_error peaks_found_ppm_RI matched_peaks_ratio
## 684 0.431 2.3543199 1 1.0000000
## 165 1.362 4.4438533 5 0.7142857
## 731 0.388 7.6480152 2 0.6666667
## 296 0.776 0.1823525 9 0.6000000
## 317 0.293 7.6487003 3 0.6000000
mL_norm_bio$metab_ann: columns are added with main
annotation info for each feature (here below we omit MS/MS spectra
column for clarity) (“metab_ann”):## Feature_ID MSDialName
## 94 F9130 Unknown
## 101 F10165 Unknown
## 111 F14798 Unknown
## 118 F16399 w/o MS2:Enterolactone
## 123 F17972 Unknown
## MSDialSMILES
## 94 null
## 101 null
## 111 null
## 118 OC1=CC=CC(C[C@@H]2COC(=O)[C@H]2CC2=CC(O)=CC=C2)=C1
## 123 null
## Ontology rt mz quality reference ID
## 94 null 1.708 218.1370 0.13702 158.9981 L905
## 101 null 4.346 231.1129 0.11285 1751.5100 L1659
## 111 null 2.931 282.1190 0.11902 1903.7580 L800
## 118 Dibenzylbutyrolactone lignans 5.698 299.1265 0.12646 5182.2360 L1351
## 123 null 7.157 316.2458 0.24582 6871.8990 L1706
## Name Level Level_note RT_err
## 94 Propionyl-L-carnitine 1 0.388
## 101 1,2,3,4-Tetrahydroharmane-3-carboxylic acid 1 0.776
## 111 1-Methyladenosine 1 0.431
## 118 Enterolactone 2 1.362
## 123 Decanoyl-L-carnitine 1 0.293
## ppm_error peaks_found_ppm_RI matched_peaks_ratio
## 94 7.64801518144863 2 0.666666666666667
## 101 0.182352507340912 9 0.6
## 111 2.3543199472254 1 1
## 118 4.44385325480016 5 0.714285714285714
## 123 7.64870030115516 3 0.6
mL_norm_bio$data_ann (“data_ann”): the column “Name” is
added :## Feature_ID Name AA_mealA_t00 AA_mealA_t01 AA_mealA_t02
## 1 F850 Sarcosine;L-Alanine 2742.611 4025.7484 5473.4923
## 2 F1013 2-Hydroxypyridine 1764.002 1591.1996 748.3421
## 3 F1433 Malonic acid 118.468 100.5956 101.8788
## 4 F1514 Benzaldehyde 5348.574 6054.4923 4316.2360
## 5 F1686 2-Aminophenol 3393.074 1973.3114 2113.1814
Importantly, the feature-metabolite association filtering can
be disabled (e.g. for exploratory analysis) setting
filter=FALSE.
The spectra from all the features that match a metabolite can be inspected creating the following plot:
visualize_associated_spectra(mRList = mL_norm_bio, mR_library = mR_library, metabolite_id = "L1660")
The function h_map_MSMS_comparison() draws heatmaps to
visually compare ppm errors and RI differences between feature and
metabolite spectra:
h_map_MSMS_comparison(mL_norm_bio, metab_id = "L1660", feature_id = "F10165")

The function metab_boxplot() draws boxplots of feature
abundances grouped by the levels of a given factor:
metab_boxplot(mRList = mL_norm_bio, col_by="class", group="class", features = "F3957")
The function h_map() provides heatmaps based on package
ComplexHeatmap (Gu, Eils,
and Schlesner 2016). Here we show the abundance of the most
significant metabolites according to ANOVA test:
significant_features <- select_sign_features(mL_norm_bio, test_method="anova", test_value = "q", cutoff_value = 10e-10, feature_id = "Name")
h_map(mL_norm_bio, scale_features=TRUE, column_ann="class", data.use = "data_ann", col_ann= pals::trubetskoy(3), features = significant_features, top = Inf, use_top_annoteted_metab = TRUE, test_method="anova", test_value = "q", show_column_names=FALSE)
The function merge_stats_w_features() joins the results
of metabolite annotation and statistical analysis:
metab_stats <- merge_stats_w_features(mRList = mL_norm_bio, test_method = "anova", test_value = "q", cutoff_value = 0.05)It produces a list of data frames and an xlsx file creating with the following itmes/sheets:
metab_stat: full table with feature annotation and
statistical analysis results;
metab_stat_feat: same as metab_stat,
with the addition of feature intensities;
metab_stat_feat_known: same as
metab_stat_feat, but restricted to features with an
assigned compound;
metab_stat_feat_sig: same as
metab_stat_feat, but only for statistically significant
features;
metab_stat_feat_known_sig: same as
metab_stat_feat, but only for statistically significant
features with an assigned compound.
margheRita implements both Over Representation Analysis (ORA) and
Metabolite Set Enrichment Analysis (MSEA), based on clusterProfiler
(Wu et al. 2021)
over BioCyc, KEGG and Reactome pathway databases. These analyses can be
run by means of function pathway_analysis(). In case of ORA
the minimum requirements are the vector of PubChemCID to be tested and
the reference universe of all PubChemCIDs in the dataset. In the
following example we extract the PubChemCID of the most significat
features according to anova and consider all the PubChemCID found in the
dataset as reference universe:
significant_CIDs <- unique(metab_stat_w_features_known_only_sign$PubChemCID)
significant_CIDs <- sub(";.*", "", significant_CIDs)
all_PubChemCID <- unique(metab_stat_w_features$PubChemCID)[which(!is.na(unique(metab_stat_w_features$PubChemCID)))]
all_PubChemCID <- sub(";.*", "", all_PubChemCID)
pa_res <- pathway_analysis(in_list = significant_CIDs, universe = all_PubChemCID, type = "ora")In case of MSEA, a named ranked vector of scores for all PubChemCIDs in the dataset, in decreasing order of importance:
ranked_vector <- select_sign_features(mRList = mL_norm_bio, test_method="anova", test_value = "q", cutoff_value = Inf, feature_id = "Feature_ID", values = TRUE)
names(ranked_vector) <- vapply(names(ranked_vector), function(x) metab_stat_w_features$PubChemCID[which(metab_stat_w_features$Feature_ID==x)][1], character(1))
names(ranked_vector) <- sub(";.*", "", names(ranked_vector))
ranked_vector <- sort(-log10(ranked_vector), decreasing = T)
msea_res <- pathway_analysis(in_list = ranked_vector, type = "msea")The result is a list that contains:
Here’s an example of resulting tables for ORA:
## ID Description GeneRatio BgRatio
## 83035 83035 ABC transporters 6/16 11/88
## 1270158 1270158 Metabolism of amino acids and derivatives 7/16 19/88
## 790012 790012 Biosynthesis of amino acids 5/16 11/88
## 1270189 1270189 Biological oxidations 4/16 11/88
## 1269956 1269956 Metabolism 11/16 47/88
## pvalue p.adjust qvalue
## 83035 0.003893364 0.05061373 0.04098278
## 1270158 0.024860995 0.10926921 0.08847709
## 790012 0.025215971 0.10926921 0.08847709
## 1270189 0.108983357 0.35419591 0.28679831
## 1269956 0.139224418 0.36198349 0.29310404
## geneID Count
## 83035 750/6287/247/1123/6262/5962 6
## 1270158 247/1123/439227/586/6262/439258/60961 7
## 790012 750/6287/6262/5962/439258 5
## 1270189 2519/79034/60961/5753 4
## 1269956 247/1123/439227/586/6262/439216/2519/79034/439258/60961/5753 11
and MSEA:
## ID Description setSize
## 1270158 1270158 Metabolism of amino acids and derivatives 22
## 790012 790012 Biosynthesis of amino acids 13
## 83035 83035 ABC transporters 12
## 132956 132956 Metabolic pathways 73
## 172847 172847 Protein digestion and absorption 11
## enrichmentScore NES pvalue p.adjust qvalues rank
## 1270158 0.7412484 1.473106 0.000999001 0.007568113 0.004779861 36
## 790012 0.8204099 1.571580 0.001009082 0.007568113 0.004779861 22
## 83035 0.7754533 1.471869 0.011122346 0.055611729 0.035123197 22
## 132956 0.6058509 1.276509 0.015984016 0.057750760 0.036474164 43
## 172847 0.7679051 1.443829 0.019250253 0.057750760 0.036474164 20
## leading_edge
## 1270158 tags=45%, list=20%, signal=42%
## 790012 tags=46%, list=12%, signal=44%
## 83035 tags=50%, list=12%, signal=47%
## 132956 tags=33%, list=23%, signal=42%
## 172847 tags=36%, list=11%, signal=34%
## core_enrichment
## 1270158 439258/439227/247/586/1123/6262/60961/86/5816/936
## 790012 5962/439258/6287/750/6267/6262
## 83035 5962/247/6287/1123/750/6262
## 132956 5962/439258/439227/247/6287/586/1123/750/6844/111/6267/6262/5753/60961/439216/2519/444539/86/5816/936/6037/4687/126/33032
## 172847 5962/6287/750/6267
A plot for the pathways analysis generated with ORA can be generated using the following function:
barplot_ora(pa_res, p = "p.adjust", filename = "pathway_analysis_ora")
See the documentation of clusterProfiler for further information.