distinct 1.0.0
distinct is a statistical method for differential testing between two or more groups of distributions; testing is performed via hierarchical non-parametric permutation tests on the cumulative distribution functions (cdfs) of each sample. While most methods for differential expression target differences in the mean abundance between conditions, distinct, by comparing full cdfs, identifies both differential patterns involving changes in the mean and more subtle variations that do not involve the mean (e.g., unimodal vs. bimodal distributions with the same mean). distinct is a general and flexible tool: because it is fully non-parametric and makes no assumptions about how the data were generated, it can be applied to a variety of datasets. It is particularly suitable for differential state analyses on single-cell data (i.e., differential analyses within sub-populations of cells), such as single-cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data.
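To make the idea of comparing full cdfs more concrete, the toy sketch below contrasts a permutation test on the difference in means with one on the distance between empirical cdfs. It is only an illustration in base R, not the hierarchical procedure implemented in distinct: the simulated data, the helper names (cdf_distance, mean_difference, permutation_p_value) and the plain label-shuffling scheme are all assumptions made for this example.

```r
# Toy illustration only: this is NOT the algorithm implemented in distinct.
set.seed(1)
n <- 500

# Two samples with approximately the same mean but different shapes
unimodal <- rnorm(n, mean = 0, sd = 1)
bimodal  <- c(rnorm(n / 2, mean = -2, sd = 0.5),
              rnorm(n / 2, mean =  2, sd = 0.5))

# Statistic 1: absolute difference in means
mean_difference <- function(x, y) abs(mean(x) - mean(y))

# Statistic 2: maximum vertical distance between the two empirical cdfs,
# evaluated on the pooled observations
cdf_distance <- function(x, y) {
  grid <- sort(c(x, y))
  max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
}

# Plain (non-hierarchical) permutation test: shuffle the group labels B times
permutation_p_value <- function(x, y, statistic, B = 500) {
  observed <- statistic(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  permuted <- replicate(B, {
    idx <- sample(length(pooled), n_x)
    statistic(pooled[idx], pooled[-idx])
  })
  # add 1 to numerator and denominator so the p-value is never exactly 0
  (sum(permuted >= observed) + 1) / (B + 1)
}

p_mean <- permutation_p_value(unimodal, bimodal, mean_difference)
p_cdf  <- permutation_p_value(unimodal, bimodal, cdf_distance)
c(mean = p_mean, cdf = p_cdf)
```

In this simulated setting the mean-based statistic has little to detect, while the cdf-based statistic is sensitive to the change in shape.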
At present, covariates are not allowed, and only 2-group comparisons are implemented. In future releases, we will allow for covariates and for differential testing between more than 2 groups.
A pre-print will follow in the coming months.
To access the R code used in the vignettes, type:
browseVignettes("distinct")
Questions about distinct should be either posted on the Bioconductor support site, tagging the question with “distinct”, or reported as a new issue at BugReports (preferred choice).
To cite distinct, type:
citation("distinct")
distinct is available on Bioconductor and can be installed with the command:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("distinct")
To install the package from GitHub, use devtools:
devtools::install_github("SimoneTiberi/distinct")
To install the package jointly with its vignette, remove --no-build-vignettes from build_opts:
devtools::install_github("SimoneTiberi/distinct",
build_opts = c("--no-resave-data", "--no-manual"))
Differential state analyses aim to investigate differential patterns between conditions in sub-populations of cells. To use distinct, one needs data from two or more groups of samples (i.e., experimental conditions), with at least 2 samples (i.e., biological replicates) per group. Given a single-cell RNA-sequencing (scRNA-seq) or a high-dimensional flow or mass cytometry (HDCyto) dataset, cells first need to be clustered into groups via some clustering algorithm; distinct is then applied to identify differential patterns between groups, within each cluster of cells.
Load the example dataset, consisting of a subset of 6 samples (3 individuals observed across 2 conditions) and 100 genes selected from the Kang18_8vs8() object of the muscData package.
data("Kang_subset", package = "distinct")
Kang_subset
## class: SingleCellExperiment
## dim: 100 9517
## metadata(1): experiment_info
## assays(2): counts logcounts
## rownames(100): ISG15 SYF2 ... MX2 PDXK
## rowData names(0):
## colnames: NULL
## colData names(4): ind stim cell sample_id
## reducedDimNames(1): TSNE
## altExpNames(0):
Columns ind and stim of the colData indicate the individual id and the experimental condition (control or stimulated) of each cell, while column sample_id shows the sample id, needed for the differential analyses.
Column cell represents the cell type, which defines the clustering structure of cells: differential testing between conditions is performed separately for each cluster of cells.
Note that, if the cell clustering labels were unknown, we would need to cluster cells into groups via some clustering algorithm.
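As a rough sketch of what obtaining cluster labels could look like when none are available, the example below runs k-means on simulated two-dimensional coordinates. The toy data, the choice of k-means and the number of clusters are assumptions for illustration only; in practice, dedicated single-cell clustering tools are normally used, and the vignette does not prescribe a specific method.

```r
# Toy data: 200 "cells" described by 2 reduced dimensions,
# simulated from 2 well separated groups (hypothetical, not Kang_subset)
set.seed(42)
coords <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),  # first 100 cells
  matrix(rnorm(200, mean = 6), ncol = 2)   # last 100 cells
)

# k-means with 2 centres; nstart > 1 reduces the dependence on initialisation
km <- kmeans(coords, centers = 2, nstart = 10)
cluster_label <- factor(km$cluster)
table(cluster_label)

# With a SingleCellExperiment `x` (hypothetical object), the labels could be
# stored in a colData column and passed to distinct via `name_cluster`:
# x$cluster <- cluster_label
```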
colData(Kang_subset)
## DataFrame with 9517 rows and 4 columns
## ind stim cell sample_id
## <integer> <factor> <factor> <factor>
## 1 107 ctrl CD4 T cells ctrl_107
## 2 1015 ctrl CD14+ Monocytes ctrl_1015
## 3 1015 ctrl NK cells ctrl_1015
## 4 107 ctrl CD4 T cells ctrl_107
## 5 1015 ctrl CD14+ Monocytes ctrl_1015
## ... ... ... ... ...
## 9513 1015 stim CD14+ Monocytes stim_1015
## 9514 101 stim CD4 T cells stim_101
## 9515 101 stim CD14+ Monocytes stim_101
## 9516 107 stim CD14+ Monocytes stim_107
## 9517 107 stim CD4 T cells stim_107
The experimental design compares two groups (stim vs ctrl) with 3 biological replicates each.
metadata(Kang_subset)$experiment_info
## sample_id stim
## 1 ctrl_107 ctrl
## 2 ctrl_1015 ctrl
## 3 ctrl_101 ctrl
## 4 stim_101 stim
## 5 stim_1015 stim
## 6 stim_107 stim
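Since distinct requires at least 2 samples per group, a quick sanity check of the design table can be useful before testing. The sketch below rebuilds the table printed above as a plain data.frame (so that it runs on its own) and counts samples per group; the object names experiment_info, samples_per_group and design_ok are chosen for this example.

```r
# Stand-alone copy of the experiment_info table shown above
experiment_info <- data.frame(
  sample_id = c("ctrl_107", "ctrl_1015", "ctrl_101",
                "stim_101", "stim_1015", "stim_107"),
  stim = c("ctrl", "ctrl", "ctrl", "stim", "stim", "stim")
)

# Number of samples (biological replicates) in each group
samples_per_group <- table(experiment_info$stim)
samples_per_group

# At least 2 groups, at least 2 samples per group, and unique sample ids
design_ok <- length(samples_per_group) >= 2 &&
  all(samples_per_group >= 2) &&
  !anyDuplicated(experiment_info$sample_id)
design_ok
```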
Visually inspect the data via a tSNE plot, coloured by cell type.
library(scater)
plotTSNE(Kang_subset, colour_by = "cell")
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): semi-
## transparency is not supported on this device: reported only once per page
Load distinct.
library(distinct)
Perform differential state testing between conditions.
Parameter name_assays_expression specifies the input data (counts) in assays(x), while name_cluster, name_group and name_sample define the column names of colData(x) containing the clustering of cells (cell), the grouping of samples (stim) and the id of individual samples (sample_id).
Note that, at present, distinct does not accept sparse matrices.
set.seed(61217)
res = distinct_test(
x = Kang_subset,
name_assays_expression = "counts",
name_cluster = "cell",
name_group = "stim",
name_sample = "sample_id",
P = 10^3,
min_non_zero_cells = 20)
## Data loaded, starting differential testing
## Differential testing completed, returning results
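Because sparse matrices are not accepted, counts stored in a sparse format would need to be converted to a dense matrix before testing. The sketch below shows one way to do this with the Matrix package on a small toy matrix; the toy data and object names are assumptions, and for large datasets the dense copy may require substantially more memory.

```r
library(Matrix)

# Toy sparse count matrix: 4 genes x 5 cells (hypothetical data)
sparse_counts <- sparseMatrix(
  i = c(1, 2, 4, 3), j = c(1, 3, 5, 2), x = c(5, 2, 7, 1),
  dims = c(4, 5),
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)
methods::is(sparse_counts, "sparseMatrix")

# Convert to an ordinary dense matrix
dense_counts <- as.matrix(sparse_counts)
is.matrix(dense_counts)

# With a SingleCellExperiment `x` (hypothetical object), the same conversion
# could be applied to the assay before testing:
# assay(x, "counts") <- as.matrix(assay(x, "counts"))
```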
Results are reported as a data.frame, where columns gene and cluster_id contain the gene and cell-cluster name, while p_val, p_adj.loc and p_adj.glb report the raw p-values and the locally and globally adjusted p-values, obtained via Benjamini and Hochberg (BH) correction.
In locally adjusted p-values (p_adj.loc), BH correction is applied in each cluster separately, while in globally adjusted p-values (p_adj.glb), BH correction is applied jointly to the results from all clusters.
head(res)
## gene cluster_id p_val p_adj.loc p_adj.glb
## 1 ISG15 B cells 0.001998002 0.01958042 0.0111414
## 2 SYF2 B cells 0.627372627 0.76975524 0.8054043
## 3 LAPTM5 B cells 0.056943057 0.21030821 0.1726660
## 4 EIF3I B cells 0.588411588 0.75270184 0.7743497
## 5 PRPF38A B cells 0.052947053 0.20755245 0.1643357
## 6 GTF2B B cells 0.676323676 0.80828927 0.8271765
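The difference between local and global adjustment can be reproduced on a small toy table with base R's p.adjust. The p-values below are made up for illustration, and the code is a sketch of the two adjustment schemes as described above, not the internal code of distinct.

```r
# Toy results table (hypothetical p-values, not output of distinct_test)
toy_res <- data.frame(
  gene       = rep(c("g1", "g2", "g3"), times = 2),
  cluster_id = rep(c("A", "B"), each = 3),
  p_val      = c(0.001, 0.02, 0.8, 0.03, 0.04, 0.5)
)

# Local adjustment: BH applied within each cluster separately
toy_res$p_adj.loc <- ave(toy_res$p_val, toy_res$cluster_id,
                         FUN = function(p) p.adjust(p, method = "BH"))

# Global adjustment: BH applied once, across all clusters
toy_res$p_adj.glb <- p.adjust(toy_res$p_val, method = "BH")

toy_res
```

The two adjusted values differ because the number of tests entering the correction (and the ranks of the p-values) differs: 3 tests per cluster locally, 6 tests globally.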
Visualize the concordance of differential results between cell clusters. We consider a gene significant if its globally adjusted p-value is below 0.01.
library(UpSetR)
res_by_cluster = split( ifelse(res$p_adj.glb < 0.01, 1, 0), res$cluster_id)
upset(data.frame(do.call(cbind, res_by_cluster)))
## Warning in grid.Call.graphics(C_rect, x$x, x$y, x$width, x$height,
## resolveHJust(x$just, : semi-transparency is not supported on this device:
## reported only once per page
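The split and cbind construction above relies on every cluster reporting the same genes in the same order. If that were not the case (for instance, if some genes were missing from some clusters), a membership matrix indexed by gene name could be built instead. The sketch below does this with xtabs on a made-up results table; the data and object names are assumptions for illustration.

```r
# Toy results table where gene g3 is missing from cluster B (hypothetical)
toy_res <- data.frame(
  gene       = c("g1", "g2", "g3", "g1", "g2"),
  cluster_id = c("A", "A", "A", "B", "B"),
  p_adj.glb  = c(0.001, 0.2, 0.005, 0.003, 0.004)
)

# 1 if significant at the 0.01 threshold, 0 otherwise
toy_res$significant <- as.integer(toy_res$p_adj.glb < 0.01)

# Genes x clusters table; gene/cluster combinations that are absent count as 0
membership <- as.data.frame.matrix(
  xtabs(significant ~ gene + cluster_id, data = toy_res)
)
membership

# The resulting 0/1 data.frame could then be passed to UpSetR:
# upset(membership)
```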
Violin plots of significant genes in the Dendritic cells cluster.
# select cluster of cells:
cluster = "Dendritic cells"
sel_cluster = res$cluster_id == cluster
sel_column = Kang_subset$cell == cluster
# select significant genes:
sel_genes = res$p_adj.glb < 0.01
genes = as.character(res$gene[sel_cluster & sel_genes])
# make violin plots:
library(scater)
plotExpression(Kang_subset[,sel_column],
features = genes, exprs_values = "logcounts",
log2_values = FALSE,
x = "sample_id", colour_by = "stim", ncol = 3) +
guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] UpSetR_1.4.0 distinct_1.0.0
## [3] scater_1.16.0 ggplot2_3.3.0
## [5] SingleCellExperiment_1.10.0 SummarizedExperiment_1.18.0
## [7] DelayedArray_0.14.0 matrixStats_0.56.0
## [9] Biobase_2.48.0 GenomicRanges_1.40.0
## [11] GenomeInfoDb_1.24.0 IRanges_2.22.0
## [13] S4Vectors_0.26.0 BiocGenerics_0.34.0
## [15] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 rsvd_1.0.3
## [3] lattice_0.20-41 assertthat_0.2.1
## [5] digest_0.6.25 plyr_1.8.6
## [7] R6_2.4.1 evaluate_0.14
## [9] pillar_1.4.3 zlibbioc_1.34.0
## [11] rlang_0.4.5 irlba_2.3.3
## [13] magick_2.3 Matrix_1.2-18
## [15] rmarkdown_2.1 labeling_0.3
## [17] BiocNeighbors_1.6.0 BiocParallel_1.22.0
## [19] stringr_1.4.0 RCurl_1.98-1.2
## [21] munsell_0.5.0 compiler_4.0.0
## [23] vipor_0.4.5 BiocSingular_1.4.0
## [25] xfun_0.13 pkgconfig_2.0.3
## [27] ggbeeswarm_0.6.0 htmltools_0.4.0
## [29] tidyselect_1.0.0 tibble_3.0.1
## [31] gridExtra_2.3 GenomeInfoDbData_1.2.3
## [33] bookdown_0.18 codetools_0.2-16
## [35] viridisLite_0.3.0 crayon_1.3.4
## [37] dplyr_0.8.5 withr_2.2.0
## [39] bitops_1.0-6 grid_4.0.0
## [41] gtable_0.3.0 lifecycle_0.2.0
## [43] magrittr_1.5 scales_1.1.0
## [45] stringi_1.4.6 farver_2.0.3
## [47] XVector_0.28.0 viridis_0.5.1
## [49] DelayedMatrixStats_1.10.0 ellipsis_0.3.0
## [51] vctrs_0.2.4 cowplot_1.0.0
## [53] tools_4.0.0 glue_1.4.0
## [55] beeswarm_0.2.3 purrr_0.3.4
## [57] yaml_2.2.1 colorspace_1.4-1
## [59] BiocManager_1.30.10 knitr_1.28