scMuffin package vignettes • scMuffin

Installation

scMuffin requires R >= 4.0.0, due to some of its dependencies, like Seurat (Hao et al. 2021). R can be installed from CRAN at the URL https://cran.r-project.org/index.html.

To succesfully install scMuffin you need some packages from Bioconductor (https://bioconductor.org) and github (https://github.com/). These packages can be installed using the following commands:

if (!require("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
BiocManager::install(c("BiocStyle", "ComplexHeatmap", "DESeq2", "org.Hs.eg.db"))

if (!require("devtools", quietly = TRUE)){
  install.packages("devtools")
}
devtools::install_github("theislab/destiny")

The other dependencies, if missing, should be automatically installed using the following command:

devtools::install_github("emosca-cnr/scMuffin", build_vignettes = TRUE)

To load the package:

library(scMuffin)

Input

scMuffin is intended to be used downstream general purpose tasks like quality control, normalization, cell clustering and dataset integration, for which there are dedicated tools, such as Seurat (Hao et al. 2021). scMuffin requires three inputs:

genes-by-cells raw counts matrix;
genes-by-cells normalized expression matrix;
a partition of cells (cell clusters).

The rownames and colnames of the two matrices must be, respectively, gene identifiers and cell identifiers. Currently, we recommend to use official gene symbols. Some analysis, like CNV inference, works only with gene symbols.

Typically, the two genes-by-cells matrices have already been filtered to exclude low quality cells and genes that could negatively affect the analyses. However, the characterization of cells that can be achieved with scMuffin offers insights that can be used to (further) filter a dataset and/or to decide on which cells apply particular analyses (e.g. biomarker identification). In general, according to research questions and experimental design strong or mild filters should be applied before using scMuffin.

Obtaining the data used in this vignette

Here’s an example of how to obtain a publicly available single cell dataset from Gene Expression Omnibus (GEO). In particular, we will consider a sample from a study on High-Grade Glioma (J. Yuan et al. 2018) (GSE103224):

download the filtered genes-by-cells mtrices from the URLs:

sample PJ016: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2758471;

read the text files into R and use the Seurat package (Hao et al. 2021) to perform a series of typical tasks (e.g. normalization, clustering); here’s a concise example using sample PJ016:

library(Seurat)
gbc_counts <- read.table("GSM2758471_PJ016.filtered.matrix.txt", stringsAsFactors = F)
sym <- gbc_counts[, "V2"]
gbc_counts[, c("V1", "V2")] <- NULL
gbc_counts <- keep_strongest_representative(genes_by_cells = gbc_counts, sym)
    
seu_obj_1 <- CreateSeuratObject(counts = gbc_counts, project = "PJ016",
                                min.cells = 100, min.features = 1000)
seu_obj_1[["pMito"]] <- PercentageFeatureSet(seu_obj_1, pattern = "^MT-")
seu_obj_1 <-  subset(seu_obj_1,  subset = nFeature_RNA > 200 &
                       nFeature_RNA < 8000 & nCount_RNA < 30000 & pMito < 10)

seu_obj_1 <- NormalizeData(seu_obj_1)
seu_obj_1 <- FindVariableFeatures(seu_obj_1, selection.method = "vst",
                                  nfeatures = 2000)
seu_obj_1 <- ScaleData(seu_obj_1, features = rownames(seu_obj_1))
seu_obj_1 <- RunPCA(seu_obj_1, npcs = 10,
                    features = VariableFeatures(seu_obj_1))
seu_obj_1 <- FindNeighbors(seu_obj_1, dims = 1:10)
seu_obj_1 <- FindClusters(seu_obj_1)
seu_obj_1 <- RunUMAP(seu_obj_1, dims = 1:10)
seu_obj_1 <- CellCycleScoring(object = seu_obj_1,
                              g2m.features = cc.genes.updated.2019$g2m.genes,
                              s.features = cc.genes.updated.2019$s.genes)

The genes-by-cells count matrix available at NCBI GEO is provided with Ensembl identifiers and gene symbols, which are not unique. So, after having read the table, we used the function keep_strongest_representative, which defines a genes-by-cells matrix with symbols as row names. In case of multiple Ensembl ids mapped to the same symbol, the row with the highest average count is kept as representative of the gene.

In a real scenario, the functions listed above should be run using appropriate parameter values that take into account the specificities of the dataset under consideration and the objectives of the analysis to perform. Here we just wanted to provide a means to obtain a dataset to follow this vignette. Please read the Seurat documentation for further details about the functions listed above.

The scMuffinList data structure

All data and results of scMuffin are stored in the scMuffinList, so that every function takes in input such structure and returns a modified version of it. The scMuffinList can be created as follows:

scML <- create_scMuffinList(counts = GetAssayData(seu_obj_1, assay = "RNA", slot = "counts"), 
                            normalized = GetAssayData(seu_obj_1, assay = "RNA", slot = "data"))

Many analyses require cell clusters. Any partition can be added as follows:

scML <- add_partitions(scML, 
                       clusters = seu_obj_1$seurat_clusters, 
                       partition_id = "global_expr")

where seu_obj_1$seurat_clusters is a named vector that contains cells clusters and whose names are cell labels. The element partitions of scML will contain cell clusters:

head(scML$partitions[, 1, drop = F])

##    global_expr
## V3           2
## V4           2
## V5           6
## V6           2
## V7           6
## V8           6

The function add_features() can be used to add custom results to the scMuffinList. In the following example we add the cell cycle phase:

scML <- add_features(scML, name = "CC_Phase", 
                     summary = cbind(CC_Phase = seu_obj_1$Phase))

head(scML$CC_Phase$summary)

##    CC_Phase
## V3 "G2M"   
## V4 "S"     
## V5 "G1"    
## V6 "G2M"   
## V7 "G1"    
## V8 "G1"

The summary element of the scMuffinList is a data.frame intended to contain data about the feature, while element full can be used to store other supporting feature data (see other examples below).

Gene set scoring

scMuffin provides functions to set up one or more collections of gene sets and perform cell-level estimation of gene set expression in relation to an empirical null model. This can be applied to any gene set and can therefore be used to estimate various cells’ phenotypes, like pathway activities or markers expression.

Assembling input gene sets

The function prepare_gsls retrieves gene sets from CellMarker (Zhang et al. 2019), PanglaoDB (Franzén, Gan, and Björkegren 2019), CancerSEA (H. Yuan et al. 2019) and MSigDB (Subramanian et al. 2005), and accepts custom gene set as well. The full list of gene sets available whitin CellMarker, PanglaoDB and CancerSEA collections can be listed using:

data("gsls_EntrezID")
data("gsls_Symbol")

while in the case of MSigDB we can use its dedicated functions:

msigdbr::msigdbr_collections()
msigdbr::msigdbr_species()

The gene sets of interest can be selected acting on the corresponding arguments CM_tissues, PNDB_tissues, and msigdb_hs_cat_subcat. In the case of CellMarker and PanglaoDB we can specify a list of desired tissues. The full set can be listed by means of show_tissues. Here we show just the first part of the output:

lapply(show_tissues(), head)

## $PNDB
## [1] "Adrenal_glands"    "Blood"             "Bone"             
## [4] "Brain"             "Connective_tissue" "Embryo"           
## 
## $CM_normal
## [1] "Abdominal_adipose_tissue" "Adipose_tissue"          
## [3] "Adrenal_gland"            "Adventitia"              
## [5] "Airway_epithelium"        "Amniotic_fluid"          
## 
## $CM_cancer
## [1] "Adipose_tissue" "Ascites"        "Bladder"        "Blood"         
## [5] "Blood_vessel"   "Bone"

Here is an example:

gsc <- prepare_gsls(gs_sources = c("CancerSEA", "PNDB"), 
                    PNDB_tissues = c("Brain"), 
                    scMuffinList = scML, 
                    genes_min = 3)

Note that in the case of MSigDB we have to set up a data.frame to specify species, category and sub_category of the gene set collections.

Calculate Gene set scores at cell and cluster level

In the following example, we estimate the scores for CancerSEA gene sets:

scML <- calculate_gs_scores(scMuffinList = scML, 
                            gs_list = gsc$CancerSEA)

The results of any analysis are stored as elements of the scML. The gene set scoring engine stores its results in scML$gene_set_scoring, where summary is a cells-by-features data.frame with cell scores, and full contains a series of additional details for every gene set:

head(scML$gene_set_scoring$summary)
head(scML$gene_set_scoring$full$Angiogenesis)

##    Angiogenesis    Apoptosis  Cell_Cycle Differentiation DNA_damage
## V3  -0.10918589 -0.027672315  0.14448128       0.2177449 0.10625520
## V4  -0.02872848  0.065879018  0.28261124       0.2546351 0.28273733
## V5   0.08863655 -0.003487231 -0.04785548       0.1568680 0.14740373
## V6  -0.10802785 -0.010414814  0.36535484       0.2201350 0.16137969
## V7   0.09552221  0.214845193 -0.13056901       0.1659778 0.01677083
## V8   0.14137853  0.267301590 -0.10286473       0.1054560 0.09103363
##      DNA_repair        EMT     Hypoxia Inflammation  Invasion Metastasis
## V3  0.148348585 0.05265334  0.09262748  0.009633798 0.3169777  0.2093022
## V4  0.181192508 0.13265908  0.12957070  0.050349840 0.5034789  0.2555932
## V5 -0.011344590 0.22409580  0.41523704  0.111271235 0.2040137  0.3528524
## V6  0.163828298 0.26049885 -0.03046142  0.005814202 0.5683223  0.1564824
## V7 -0.071774957 0.35638003  0.39819922  0.004184621 0.3586581  0.3512004
## V8  0.001900824 0.23221609  0.32485036 -0.015495107 0.2552769  0.4519555
##    Proliferation  Quiescence  Stemness
## V3    0.19624494 -0.10718702 0.4275265
## V4    0.31782667  0.14023972 0.6338283
## V5    0.13657056  0.00557050 0.2942666
## V6    0.16174960  0.06201775 0.5979552
## V7    0.17320576  0.03903302 0.2387523
## V8   -0.03581073  0.07383093 0.2218311

##         case case.N case.AV nmark_min avg_control control.AV null_ok
## V3 0.6209460     40      21      TRUE   0.7301319        100    TRUE
## V4 0.7282336     40      20      TRUE   0.7569621        100    TRUE
## V5 0.8418314     40      23      TRUE   0.7531949        100    TRUE
## V6 0.6632416     40      16      TRUE   0.7712694        100    TRUE
## V7 0.8363020     40      23      TRUE   0.7407798        100    TRUE
## V8 0.8964932     40      19      TRUE   0.7551147        100    TRUE
##    avg_delta_score delta_score
## V3     -0.10918589 -0.10918589
## V4     -0.02872848 -0.02872848
## V5      0.08863655  0.08863655
## V6     -0.10802785 -0.10802785
## V7      0.09552221  0.09552221
## V8      0.14137853  0.14137853

The values of cell-level scores can be used to color UMAP visualizations, which are automatically generated for every gene set using:

plot_umap_colored_features(scMuffinList = scML, 
                           Seu_obj = seu_obj_1, 
                           feature_name = "gene_set_scoring")

Here below, an example of UMAP visualization where cells are colored by CancerSEA “Hypoxia” gene set score.

caption

The function calculate_gs_scores_in_clusters defines the median values of gene set scores in every cluster of a given partition id:

scML <- calculate_gs_scores_in_clusters(scMuffinList = scML, 
                                        partition_id = "global_expr")

scML$cluster_data$global_expr$gene_set_scoring$summary

##   Angiogenesis    Apoptosis  Cell_Cycle Differentiation DNA_damage   DNA_repair
## 0  0.005206236  0.019249002 -0.06447346      0.10079093 0.08386659 -0.015634375
## 1  0.022524297  0.052477813 -0.02173267      0.16058731 0.15353017 -0.016262480
## 2  0.016285304 -0.077046033  0.09287226      0.16931117 0.14490958  0.104084184
## 3 -0.011020301 -0.004544935 -0.06073645      0.09458853 0.11932908 -0.001509004
## 4  0.022337656  0.021403192 -0.04759109      0.14014722 0.12942893 -0.024110561
## 5 -0.031611017 -0.042869916  0.02839297      0.15244780 0.22201017  0.067711722
## 6  0.132965431  0.028446423 -0.05078115      0.17077833 0.11716633 -0.035208636
## 7  0.083809559 -0.027496365  0.01100746      0.15875134 0.20662126  0.027456269
## 8  0.009743315 -0.072038549  0.28456623      0.24816174 0.16453499  0.118506158
##         EMT      Hypoxia Inflammation   Invasion Metastasis Proliferation
## 0 0.1958425  0.152775367  0.007683201 0.02659866 0.24715672    0.01822001
## 1 0.2445520  0.300890417  0.129862284 0.15558955 0.31501115    0.02251154
## 2 0.1556018 -0.011922956 -0.003003220 0.38545997 0.10925866    0.16892359
## 3 0.2323454  0.267548254  0.086860333 0.09265066 0.24611511    0.12064441
## 4 0.2101481  0.262389267  0.056573420 0.17563272 0.29081642    0.03197286
## 5 0.1895798 -0.046625305  0.028563842 0.20228425 0.06788279    0.03974853
## 6 0.3314966  0.287708428  0.080396645 0.20701318 0.37491821   -0.02531225
## 7 0.1290949 -0.044477324 -0.003007760 0.19486268 0.10988028    0.05104020
## 8 0.1226556  0.004651646  0.035257815 0.49988275 0.14005943    0.39134694
##     Quiescence  Stemness
## 0 -0.014092199 0.2370518
## 1  0.086151276 0.2942516
## 2  0.029009746 0.3854771
## 3  0.148241530 0.3829096
## 4 -0.003897101 0.2378097
## 5 -0.054270405 0.3662737
## 6  0.073751645 0.2388921
## 7  0.040777466 0.4215733
## 8  0.006650415 0.4348696

These mean values are useful to obtain a concise visualization of gene set expression throughout the dataset, using the function plot_heatmap_features_by_clusters:

plot_heatmap_features_by_clusters(scMuffinList = scML, 
                                  feature_source = "gss")

caption

CNV inference

CNV calculation

CNV inference is performed by the function CNV_analysis.

scML <- CNV_analysis(scML)

The results are stored in scML$CNV, where:

summary contains the CNV score;
scML$CNV$full$CNV contains the regions-by-cells matrix of CNV profiles;
scML$CNV$full$regions2genes is important to map the original data into the CNV regions;
scML$CNV$full$detected_cnv_regions is a data.frame that list the CNV regions detected in each chromosome and cell cluster:

head(scML$CNV$summary)
head(scML$CNV$full$CNV)
head(scML$CNV$full$regions2genes)
head(scML$CNV$full$detected_cnv_regions$chr1)

##    CNV_score
## V3  42.61237
## V4  49.29867
## V5  40.00239
## V6  45.42658
## V7  37.62340
## V8  45.69161

##                                                            V3         V4
## chr1__826205__LINC00115__chr1__15807161__UQCRHL   -0.09701447 -0.1335597
## chr1__827590__LINC01128__chr1__15834214__FLJ37453 -0.09701447 -0.1335597
## chr1__923922__SAMD11__chr1__15847706__SPEN        -0.10029463 -0.1368559
## chr1__944202__NOC2L__chr1__15941868__ZBTB17       -0.10029463 -0.1291166
## chr1__998963__HES4__chr1__16246839__FBXO42        -0.10357480 -0.1233517
## chr1__1013496__ISG15__chr1__16367241__SZRD1       -0.10357480 -0.1208830
##                                                            V5          V6
## chr1__826205__LINC00115__chr1__15807161__UQCRHL   -0.08801707 -0.08020141
## chr1__827590__LINC01128__chr1__15834214__FLJ37453 -0.08801707 -0.08020141
## chr1__923922__SAMD11__chr1__15847706__SPEN        -0.08801707 -0.08020141
## chr1__944202__NOC2L__chr1__15941868__ZBTB17       -0.08801707 -0.07690028
## chr1__998963__HES4__chr1__16246839__FBXO42        -0.08801707 -0.08267282
## chr1__1013496__ISG15__chr1__16367241__SZRD1       -0.08801707 -0.07937168
##                                                            V7         V8
## chr1__826205__LINC00115__chr1__15807161__UQCRHL   -0.09343743 -0.1312566
## chr1__827590__LINC01128__chr1__15834214__FLJ37453 -0.09013095 -0.1312566
## chr1__923922__SAMD11__chr1__15847706__SPEN        -0.09013095 -0.1254576
## chr1__944202__NOC2L__chr1__15941868__ZBTB17       -0.09013095 -0.1221395
## chr1__998963__HES4__chr1__16246839__FBXO42        -0.09343743 -0.1188214
## chr1__1013496__ISG15__chr1__16367241__SZRD1       -0.09343743 -0.1155033
##                                                           V9        V10
## chr1__826205__LINC00115__chr1__15807161__UQCRHL   -0.1363821 -0.1140620
## chr1__827590__LINC01128__chr1__15834214__FLJ37453 -0.1397023 -0.1140620
## chr1__923922__SAMD11__chr1__15847706__SPEN        -0.1363821 -0.1106926
## chr1__944202__NOC2L__chr1__15941868__ZBTB17       -0.1363821 -0.1140620
## chr1__998963__HES4__chr1__16246839__FBXO42        -0.1363821 -0.1140620
## chr1__1013496__ISG15__chr1__16367241__SZRD1       -0.1363821 -0.1106926
##                                                          V11         V12
## chr1__826205__LINC00115__chr1__15807161__UQCRHL   0.03868385 -0.08410502
## chr1__827590__LINC01128__chr1__15834214__FLJ37453 0.04206464 -0.08410502
## chr1__923922__SAMD11__chr1__15847706__SPEN        0.04796102 -0.08410502
## chr1__944202__NOC2L__chr1__15941868__ZBTB17       0.05385741 -0.08067802
## chr1__998963__HES4__chr1__16246839__FBXO42        0.05637300 -0.08067802
## chr1__1013496__ISG15__chr1__16367241__SZRD1       0.05975379 -0.07471005

## $chr1__826205__LINC00115__chr1__15807161__UQCRHL
##         chr      pos    symbol
## X11.1  chr1   826205 LINC00115
## X12.1  chr1   827590 LINC01128
## X13.1  chr1   923922    SAMD11
## X14.1  chr1   944202     NOC2L
## X15.1  chr1   998963      HES4
## X16.1  chr1  1013496     ISG15
## X17.1  chr1  1216930      SDF4
## X18.1  chr1  1232236   B3GALT6
## X19.1  chr1  1253911    UBE2J2
## X110.1 chr1  1280435    SCNN1D
## X111.1 chr1  1292390     ACAP3
## X112.1 chr1  1308579     PUSL1
## X113.1 chr1  1324801      CPTP
## X114.1 chr1  1335277      DVL1
## X115.1 chr1  1352688     MXRA8
## X116.1 chr1  1373735  AURKAIP1
## X117.1 chr1  1385710     CCNL2
## X118.1 chr1  1401908    MRPL20
## X119.1 chr1  1449688    ATAD3C
## X120.1 chr1  1471764    ATAD3B
## X121.1 chr1  1512161    ATAD3A
## X122.1 chr1  1534777   TMEM240
## X123.1 chr1  1541672     SSU72
## X124.1 chr1  1615854      MIB2
## X125.1 chr1  1635224    CDK11B
## X126.1 chr1  1661477  SLC35E2B
## X127.1 chr1  1751231      NADK
## X128.1 chr1  1785284      GNB1
## X129.1 chr1  2189548    FAAP20
## X130.1 chr1  2228318       SKI
## X131.1 chr1  2391840      RER1
## X132.1 chr1  2403973     PEX10
## X133.1 chr1  2508536     PANK4
## X134.1 chr1  2528744      HES5
## X135.1 chr1  2556364  TNFRSF14
## X136.1 chr1  3625014    TPRG1L
## X137.1 chr1  3630769    WRAP73
## X138.1 chr1  3778558    LRRC47
## X139.1 chr1  3812085    CEP104
## X140.1 chr1  3889132  C1orf174
## X141.1 chr1  6185019     RPL22
## X142.1 chr1  6221192      ICMT
## X143.1 chr1  6247352    GPR153
## X144.1 chr1  6264271     ACOT7
## X145.1 chr1  6460785  TNFRSF25
## X146.1 chr1  6467121   PLEKHG5
## X147.1 chr1  6521346      NOL9
## X148.1 chr1  6579993    ZBTB48
## X149.1 chr1  6590723    KLHL21
## X150.1 chr1  6613730     PHF13
## X151.1 chr1  6625149     THAP3
## X152.1 chr1  6634169   DNAJC11
## X153.1 chr1  6785453    CAMTA1
## X154.1 chr1  7771295     VAMP3
## X155.1 chr1  7784428      PER3
## X156.1 chr1  7961710     PARK7
## X157.1 chr1  8011726    ERRFI1
## X158.1 chr1  8352403      RERE
## X159.1 chr1  8860999      ENO1
## X160.1 chr1  9148010  MIR34AHG
## X161.1 chr1  9234773      H6PD
## X162.1 chr1  9292893     SPSB1
## X163.1 chr1  9539464  SLC25A33
## X164.1 chr1  9588910   TMEM201
## X165.1 chr1  9728925    CLSTN1
## X166.1 chr1  9848275  CTNNBIP1
## X167.1 chr1  9922117      LZIC
## X168.1 chr1  9942922    NMNAT1
## X169.1 chr1 10032957     UBE4B
## X170.1 chr1 10210705     KIF1B
## X171.1 chr1 10399063       PGD
## X172.1 chr1 10460545      DFFA
## X173.1 chr1 10474949     PEX14
## X174.1 chr1 10647210     CASZ1
## X175.1 chr1 11012653    TARDBP
## X176.1 chr1 11054589       SRM
## X177.1 chr1 11066618   EXOSC10
## X178.1 chr1 11106534      MTOR
## X179.1 chr1 11273197    UBIAD1
## X180.1 chr1 11654406    FBXO44
## X181.1 chr1 11674479    MAD2L2
## X182.1 chr1 11691709    DRAXIN
## X183.1 chr1 11736084    AGTRAP
## X184.1 chr1 11785725     MTHFR
## X185.1 chr1 11806095     CLCN6
## X186.1 chr1 11845709      NPPA
## X187.1 chr1 11919590  KIAA2013
## X188.1 chr1 11934716     PLOD1
## X189.1 chr1 11980443      MFN2
## X190.1 chr1 12019497      MIIP
## X191.1 chr1 12230029    VPS13D
## X192.1 chr1 12567909     DHRS3
## X193.1 chr1 13749414     PRDM2
## X194.1 chr1 14945918      KAZN
## X195.1 chr1 15409887     EFHD2
## X196.1 chr1 15491400     CASP9
## X197.1 chr1 15526847   DNAJC16
## X198.1 chr1 15617457      DDI2
## X199.1 chr1 15684319   PLEKHM2
## X1100  chr1 15758794    FBLIM1
## X1101  chr1 15807161    UQCRHL

##     chr                                            start start.loc
## 1  chr1  chr1__826205__LINC00115__chr1__15807161__UQCRHL    826205
## 3  chr1   chr1__113929323__HIPK1__chr1__153633981__CHTOP 113929323
## 5  chr1   chr1__169132530__NME7__chr1__203127725__ADORA1 169132530
## 6  chr1  chr1__826205__LINC00115__chr1__15807161__UQCRHL    826205
## 35 chr1  chr1__826205__LINC00115__chr1__15807161__UQCRHL    826205
## 38 chr1 chr1__93180715__CCDC18__chr1__144887287__SRGAP2B  93180715
##                                               stop  stop.loc cluster    length
## 1    chr1__9588910__TMEM201__chr1__25884178__STMN1  25884178       0  25057973
## 3   chr1__153990761__RPS27__chr1__162497804__UHMK1 162497804       0  48568481
## 5  chr1__169792531__METTL18__chr1__203305518__BTG2 203305518       0  34172988
## 6   chr1__213988532__PROX1__chr1__244653125__DESI2 244653125       1 243826920
## 35     chr1__9234773__H6PD__chr1__25616790__MAN1C1  25616790       2  24790585
## 38    chr1__154161812__TPM3__chr1__162561898__UAP1 162561898       2  69381183

Importantly, CNV inference adds the “CNV” partition:

head(scML$partitions)

##    global_expr CNV
## V3           2   1
## V4           2   1
## V5           6   0
## V6           2   1
## V7           6   0
## V8           6   2

The calculation can be demanding. For example, it requires approximately 10 minutes on 2 cores (dual Intel(R) Xeon(R), 2.60GHz).

CNV visualization

scMuffin provides two visualizations for CNVs: an heatmap and the cluster average profile plot. The heatmap is based on ComnplexHeatmap package and enables the visualization of the genomic location of a series of given genes (specified in the argument genes), or, alternatively, the location of detected CNVs (argument mark.detected.cnv = T)

col_fun <- circlize::colorRamp2(seq(-0.2, 0.2, length.out = 11), 
                                rev(pals::brewer.rdylbu(11)))
heatmap_CNV(scMuffinList = scML, 
            genes = c("YBX1", "HNRNPM"), 
            genes.labels = T, 
            col = col_fun)

heatmap_CNV(scMuffinList = scML, mark.detected.cnv = T, col=col_fun)

caption

The function plot_profile_CNV plots the median CNV profile of a cluster:

plot_profile_CNV(scMuffinList = scML, cluster = 0, cex.points = 0.5)

Transcriptional complexity

The transcriptional complexity (TC) can be quantified by means of the function transcr_compl. The corresponding summary element contains a data.frame with the number of cell transcripts, the number of genes detected in every cell, the TC-ratio (C), the TC-LMR (linear model residual) and the TC-H (entropy):

scML <- transcr_compl(scML)

head(scML$transcr_compl$summary)

##    tot_counts n_genes         C        H          LM
## V3      16228    1051 0.8628900 6.522372 -0.02561130
## V4      14286    1168 1.0893060 6.736431  0.06164645
## V5      15857    1217 1.0225564 6.602594  0.04559174
## V6      15157    1208 1.0618701 6.763307  0.05703916
## V7      15609    1171 0.9995385 6.535546  0.03398037
## V8      15286    1218 1.0616250 6.685640  0.05786558

Cell proliferation rate

Cell proliferation is quantified considering the maximum between the two scores of G1/S and G2/M gene sets:

scML <- proliferation_analysis(scMuffinList = scML)

The proliferation score is stored in the summary element of scML$proliferation:

head(scML$proliferation$summary)

##    Proliferation_score
## V3          0.22333804
## V4          0.48095696
## V5         -0.21546692
## V6          0.33213053
## V7         -0.08829706
## V8         -0.11826046

Cell state trajectories

Diffusion maps identify differentiation trajectories. scMuffin relies on the diffusion pseudo time calculation available in the R package “destiny” (Angerer et al. 2016). Here we calculate the diffusion map over the first 50 PC (to speed up computation), using a random cell to obtain diffusion pseudotimes:

scML <- diff_map(scML, 
                 root_cell = "random",
                 n_pcs = 50)

A data.frame with the most important features of the analysis (the first two eigenvectors, pseudotime, branch information and whether a cell is a tip of the branch or not) are stored scML$diffusion_map_pseudo_t$summary:

scML$diffusion_map_pseudo_t$summary

##            DC1         DC2       dpt branch  tips
## V3  0.02440015 0.027054071 1.3328347     NA FALSE
## V4  0.03072120 0.038239911 1.4752607      2 FALSE
## V5 -0.01926577 0.005752026 0.3735439      1 FALSE
## V6  0.03191479 0.047083928 1.5085980      2 FALSE
## V7 -0.01855966 0.004971247 0.3738625      1 FALSE
## V8 -0.01856063 0.005047917 0.3926285      1 FALSE

The full DPT object from destiny is stored in the element scML$diffusion_map_pseudo_t$full. This allows the user to take advantage of destiny functions, such as plot(), e.g.

destiny::plot(scML$diffusion_map_pseudo_t$full, col_by = 'branch')

The function plot_diff_map() visualizes a diffusion map where cells are colored by any other set of cell-level values. In the following example we plot the diffusion map colored by CNV-based cell clusters:

plot_diff_map(scMuffinList = scML, 
              col_data = setNames(scML$partitions$CNV, rownames(scML$partitions)))

Cluster enrichment assessment

Cell clusters can be assessed for enrichment in quantitative and categorical values. The appropriate statistical test is automatically chosen according to feature type. In this example, we assess cluster enrichment for the feature gene_set_scoring (quantitative one) for the cluster defined by the partition global_expr:

scML <- assess_cluster_enrichment(scML, 
                                  feature_name = "gene_set_scoring", 
                                  partition_id = "global_expr")

In case of quantitative features the result is a list named CSEA, placed under the elements cluster_data$global_expr (partition id). The list contains the gene set table and the leading edge results. Here’s the result related to the gene set Angiogenesis:

scML$cluster_data$global_expr$CSEA$Angiogenesis

##   id size         es        nes nperm      p_val  adj_p_val q_val       FDRq
## 1  0  588 -0.1912822         NA     2         NA         NA    NA         NA
## 2  1  538  0.1957418  0.8894305    95 0.80000000 0.93333333    NA 0.80000000
## 3  2  500  0.2160336  0.9846048    94 0.56382979 0.78936170    NA 1.00000000
## 4  3  331 -0.3133212         NA     7         NA         NA    NA         NA
## 5  4  257  0.2206047  0.9748262    89 0.48314607 0.78936170    NA 0.78624535
## 6  5  186  0.2309628  0.9659241    81 0.53086420 0.78936170    NA 0.66022305
## 7  6  176  0.6179953  2.4796553    83 0.01204819 0.07954545    NA 0.01204819
## 8  7  140  0.4205973  1.6200568    88 0.02272727 0.07954545    NA 0.02272727
## 9  8  115 -0.1547180 -0.7373612    14 1.00000000 1.00000000    NA 1.00000000
##   tags tags_perc list_top list_top_perc lead_edge
## 1  123 0.2525667      505     0.2056189 0.2502578
## 2   95 0.1779026      477     0.1942182 0.1831786
## 3   62 0.1593830      334     0.1359935 0.1636240
## 4   95 0.3356890      503     0.2048046 0.3017030
## 5   85 0.3333333      768     0.3127036 0.2556414
## 6   23 0.1729323      387     0.1575733 0.1540237
## 7  107 0.6079545      633     0.2577362 0.4860970
## 8   37 0.3700000      676     0.2752443 0.2795416
## 9   31 0.3131313      662     0.2695440 0.2383358

In this this example, we assess cluster enrichment in relation to the categorical feature CC_Phase, namely the cell cycle phase calculated by means of Seurat function CellCycleScoring() and added to scML by means of add_features() (see above):

scML <- assess_cluster_enrichment(scML, 
                                  feature_name = "CC_Phase", 
                                  partition_id = "global_expr")

As for categorical features the result is a list named ORA, placed under the elements cluster_data$global_expr (partition id). The list contains ORA results for every categorical value. Here’s the enrichment of clusters in terms of cells with value “G1”:

scML$cluster_data$global_expr$ORA$CC_Phase$G1

##   id    N   wb  bb  bd wbd       exp       er            p        p_adj
## 0  0 2831 2056 775 588 547 427.03214 1.280934 4.898697e-43 2.204414e-42
## 1  1 2831 2056 775 538 516 390.71989 1.320639 4.370181e-53 3.933163e-52
## 2  2 2831 2056 775 500   0 363.12257 0.000000 1.000000e+00 1.000000e+00
## 3  3 2831 2056 775 331 299 240.38714 1.243827 4.638303e-17 8.348946e-17
## 4  4 2831 2056 775 257 255 186.64500 1.366230 1.095704e-34 3.287111e-34
## 5  5 2831 2056 775 186 148 135.08160 1.095634 1.546645e-02 1.988543e-02
## 6  6 2831 2056 775 176 174 127.81915 1.361298 1.102600e-22 2.480850e-22
## 7  7 2831 2056 775 140 117 101.67432 1.150733 1.320545e-03 1.980817e-03
## 8  8 2831 2056 775 115   0  83.51819 0.000000 1.000000e+00 1.000000e+00

The enrichment analysis results appearing in these tables can be easily extracted and organized in a clusters-by-values table by means of extract_cluster_enrichment_table. For instance, here we extract CSEA NES and FDRq values, and ORA er (enrichment ratio) values:

csea_fdr_table <- extract_cluster_enrichment_table(scML,  
                                                   partition_id = "global_expr", 
                                                   type = "CSEA", 
                                                   quantity = "FDRq")
csea_nes_table <- extract_cluster_enrichment_table(scML,  
                                                   partition_id = "global_expr", 
                                                   type = "CSEA",
                                                   quantity = "nes")
ora_p_table <- extract_cluster_enrichment_table(scML, 
                                                partition_id = "global_expr", 
                                                type = "ORA", 
                                                quantity = "p")
csea_fdr_table

##   Angiogenesis
## 0           NA
## 1   0.80000000
## 2   1.00000000
## 3           NA
## 4   0.78624535
## 5   0.66022305
## 6   0.01204819
## 7   0.02272727
## 8   1.00000000

These tables can be plotted with the function plot_heatmap_features_by_clusters. In the following example we plot NES values with asterisks according to their significance:

plot_heatmap_features_by_clusters(feature_source = csea_nes_table, 
                                  significance_matrix = csea_fdr_table, 
                                  sig_threshold = 0.05)

Similarly, it’s possible to extract the most significant “tags” of any clusters; for example, here we extract the top 3 tags by FDRq (CSEA) and p_adj (ORA):

scML <- extract_cluster_enrichment_tags(scML, 
                                        partition_id = "global_expr", 
                                        n_max_per_cluster = 3, 
                                        CSEA_selection_criterion = "FDRq", 
                                        only_pos_nes = TRUE, 
                                        CSEA_selection_threshold = 0.05, 
                                        ORA_selection_criterion = "p_adj", 
                                        ORA_selection_threshold = 0.25)

The results are placed under the cluster_tags element:

head(scML$cluster_data$global_expr$cluster_tags$CSEA)

## $`0`
## [1] "Cell_Cycle"
## 
## $`1`
## [1] "Hypoxia"      "Inflammation" "Metastasis"  
## 
## $`2`
## [1] "DNA_repair"    "Invasion"      "Proliferation"
## 
## $`3`
## [1] "Hypoxia"       "Proliferation" "Quiescence"   
## 
## $`4`
## [1] "Hypoxia"    "Metastasis" "Cell_Cycle"
## 
## $`5`
## [1] "DNA_damage" "Stemness"   "DNA_repair"

The results of cluster enrichment can be visualized by barplots and boxplots for, respectively, categorical values and quantitative values.

barplot_cluster(scML, 
                partition_id = "global_expr", 
                feature_name = "CC_Phase", 
                feature_id = "CC_Phase")
boxplot_cluster(scML, 
                feature_name = "gene_set_scoring", 
                partition_id = "global_expr")

caption caption

Comparison between clusters

Intra-dataset

The function overlap_matrix calculates the overlap coefficient between all-pairs of clusters of two or more partitions of the same cells (same dataset):

scML <- overlap_matrix(scML)

The results are stored under the element `cluster_comparison```:

head(scML$cluster_comparison$overlap_matrix)

##                   CNV_0       CNV_1     CNV_2      CNV_3 global_expr_0
## CNV_0         1.0000000 0.000000000 0.0000000 0.00000000   0.389455782
## CNV_1         0.0000000 1.000000000 0.0000000 0.00000000   0.005102041
## CNV_2         0.0000000 0.000000000 1.0000000 0.00000000   0.559523810
## CNV_3         0.0000000 0.000000000 0.0000000 1.00000000   0.082568807
## global_expr_0 0.3894558 0.005102041 0.5595238 0.08256881   1.000000000
## global_expr_1 0.6115242 0.000000000 0.3401487 0.07951070   0.000000000
##               global_expr_1 global_expr_2 global_expr_3 global_expr_4
## CNV_0             0.6115242    0.00400000    0.12688822   0.665369650
## CNV_1             0.0000000    0.97800000    0.05740181   0.000000000
## CNV_2             0.3401487    0.00600000    0.17824773   0.326848249
## CNV_3             0.0795107    0.01834862    0.64525994   0.007782101
## global_expr_0     0.0000000    0.00000000    0.00000000   0.000000000
## global_expr_1     1.0000000    0.00000000    0.00000000   0.000000000
##               global_expr_5 global_expr_6 global_expr_7 global_expr_8
## CNV_0            0.00000000     0.7386364   0.007142857   0.000000000
## CNV_1            0.79569892     0.0000000   0.850000000   0.973913043
## CNV_2            0.02150538     0.2613636   0.007142857   0.008695652
## CNV_3            0.18279570     0.0000000   0.135714286   0.017391304
## global_expr_0    0.00000000     0.0000000   0.000000000   0.000000000
## global_expr_1    0.00000000     0.0000000   0.000000000   0.000000000

Inter-dataset

Inter-dataset comparison is performed through the function inter_dataset_comparison, which require a list of Seurat objects and a gene set list. The activity of this gene set list will be assessed over the clusters of all datasets. In needed, the function prepare_cluster_markers_list provides the possibility to prepare a gene set list of cluster markers, starting

dataset_cmp_res <- inter_dataset_comparison(seu_obj_list = seu_obj_list, 
                                            gsl = cluster_markers_list, 
                                            genes_max = 500, genes_min = 5)
plot_heatmap_dataset_comparison(dataset_cmp_res, 
                                outfile = "heatmap_ds_cmp.png")
plot_heatmap_dataset_comparison(dataset_cmp_res, 
                                outfile = "heatmap_ds_cmp_p.png", 
                                type = "significance")

caption caption

The function can be used to analyze any gene set. In the following example we assess the expression of CancerSEA gene sets across datasets:

gsl <- prepare_gsls(gs_sources = "CancerSEA", 
                    genes = unlist(lapply(seu_obj_list, rownames)))
dataset_cmp_res_cancersea <- inter_dataset_comparison(seu_obj_list = seu_obj_list, 
                                                      gsl = gsl$CancerSEA, 
                                                      genes_max = 500, 
                                                      genes_min = 5)
plot_heatmap_dataset_comparison(dataset_cmp_res_cancersea, 
                                outfile = "heatmap_ds_cmp_cancer.png")

caption

References

Angerer, Philipp, Laleh Haghverdi, Maren Büttner, Fabian J. Theis, Carsten Marr, and Florian Buettner. 2016. “destiny : diffusion maps for large-scale single-cell data in R.” Bioinformatics 32 (8): 1241–43. https://doi.org/10.1093/bioinformatics/btv715.

Franzén, Oscar, Li-Ming Gan, and Johan L M Björkegren. 2019. “PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data.” Database 2019 (January). https://doi.org/10.1093/database/baz046.

Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated analysis of multimodal single-cell data.” Cell 184 (13): 3573–3587.e29. https://doi.org/10.1016/j.cell.2021.04.048.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Yuan, Huating, Min Yan, Guanxiong Zhang, Wei Liu, Chunyu Deng, Gaoming Liao, Liwen Xu, et al. 2019. “CancerSEA: a cancer single-cell state atlas.” Nucleic Acids Research 47 (D1): D900–908. https://doi.org/10.1093/nar/gky939.

Yuan, Jinzhou, Hanna Mendes Levitin, Veronique Frattini, Erin C. Bush, Deborah M. Boyett, Jorge Samanamud, Michele Ceccarelli, et al. 2018. “Single-cell transcriptome analysis of lineage diversity in high-grade glioma.” Genome Medicine 10 (1): 57. https://doi.org/10.1186/s13073-018-0567-9.

Zhang, Xinxin, Yujia Lan, Jinyuan Xu, Fei Quan, Erjie Zhao, Chunyu Deng, Tao Luo, et al. 2019. “CellMarker: a manually curated resource of cell markers in human and mouse.” Nucleic Acids Research 47 (D1): D721–28. https://doi.org/10.1093/nar/gky900.