What CellDEEP does

CellDEEP reduces the sparsity of single-cell RNA-seq (scRNA-seq) count data by pooling cells into pseudocells before differential expression (DE) testing.
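
To make the idea concrete, here is a minimal base R sketch of how summing counts over small groups of cells lowers the fraction of zeros. It is a toy illustration on simulated Poisson counts, not CellDEEP's implementation; the names counts, pool_size, and pool_id are made up for this example.

```r
# Toy illustration only: base R, not CellDEEP's implementation.
set.seed(1)

n_genes <- 200
n_cells <- 60
pool_size <- 3  # hypothetical pool size, analogous in spirit to n_cells

# Simulate a sparse count matrix (genes x cells) with many zeros.
counts <- matrix(
  rpois(n_genes * n_cells, lambda = 0.3),
  nrow = n_genes,
  dimnames = list(
    paste0("Gene", seq_len(n_genes)),
    paste0("Cell", seq_len(n_cells))
  )
)

# Randomly assign each cell to one of n_cells / pool_size pools.
pool_id <- sample(rep(seq_len(n_cells / pool_size), each = pool_size))

# rowsum() sums rows by group, so transpose to cells x genes first,
# then transpose back to genes x pseudocells.
pseudocells <- t(rowsum(t(counts), group = pool_id))

# Fraction of zero entries before and after pooling.
sparsity_before <- mean(counts == 0)
sparsity_after <- mean(pseudocells == 0)
cat("zero fraction, single cells:", round(sparsity_before, 3), "\n")
cat("zero fraction, pseudocells: ", round(sparsity_after, 3), "\n")
```

A pseudocell entry is zero only when every cell in the pool has a zero for that gene, so the zero fraction can only stay the same or drop after pooling.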

Load package and example data

library(CellDEEP)
data("sim")

Step 1: Run DE directly with FindMarker.CellDEEP

FindMarker.CellDEEP includes metadata preparation internally. Key parameters to set:

- group_id, sample_id, cluster_id: metadata column names in your Seurat object
- ident.1, ident.2: the two groups to compare
- cell_selection: how to select cells for pooling ("kmean" or "random")
- readcounts: how to aggregate counts in pooled cells ("sum" or "mean")
- min_cells_per_subgroup: minimum number of cells required in each sample-cluster subgroup for pooling

de.test <- FindMarker.CellDEEP(
  sim,
  group_id = "Status",
  sample_id = "DonorID",
  cluster_id = "cluster_id",
  Pool = TRUE,
  test.use = "wilcox",
  n_cells = 3,
  min_cells_per_subgroup = 1,
  cell_selection = "random",
  readcounts = "sum",
  logfc.threshold = 0.25,
  ident.1 = "Case",
  ident.2 = "Control"
)
#> Start Pooling.....
#> Pooling...
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
#> FindMarker running.....
#> 1st ident is:
#> Case
#> 2nd ident is:
#> Control
#> group by:
#> group_id
#> Normalizing layer: counts
#> Finding variable features for layer counts
#> Centering and scaling data matrix
#> For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
#> (default method for FindMarkers) please install the presto package
#> --------------------------------------------
#> install.packages('devtools')
#> devtools::install_github('immunogenomics/presto')
#> --------------------------------------------
#> After installation of presto, Seurat will automatically use the more 
#> efficient implementation (no further action necessary).
#> This message will be shown once per session
#> 20
#> Gene1728Gene1992Gene1626Gene1864Gene1715Gene1807

Step 2: Pool cells only (optional)

Use these functions if you want pooled objects without running DE immediately.

min_cells_per_subgroup means the minimum number of cells required in each sample_id x cluster_id subgroup before pooling is performed.
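
To see which subgroups would meet a given threshold, you can tabulate cells per sample and cluster. The sketch below uses a toy data frame in place of a Seurat object's metadata; the values and the threshold of 5 are hypothetical, and it does not show what CellDEEP does with subgroups that fall below the threshold.

```r
# Toy metadata standing in for a Seurat object's meta.data (hypothetical values).
set.seed(42)
meta <- data.frame(
  sample_id = sample(c("D1", "D2", "D3"), 40, replace = TRUE),
  cluster_id = sample(c("c1", "c2"), 40, replace = TRUE)
)

# Hypothetical threshold chosen for illustration.
min_cells_per_subgroup <- 5

# Count cells in each sample_id x cluster_id subgroup.
sizes <- as.data.frame(
  table(sample_id = meta$sample_id, cluster_id = meta$cluster_id),
  responseName = "n_cells"
)

# Flag subgroups that reach the threshold.
sizes$meets_minimum <- sizes$n_cells >= min_cells_per_subgroup
print(sizes)
```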

The pooling functions expect standardized metadata fields (sample_id, group_id, cluster_id), so run prepare_data once before pooling:

pool_input <- prepare_data(
  sim,
  sample_id = "DonorID",
  group_id = "Status",
  cluster_id = "cluster_id"
)

K-means pooling

pooled_kmean <- CellDEEP.Kmean(
  pool_input,
  readcounts = "sum",
  n_cells = 3,
  min_cells_per_subgroup = 1,
  assay_name = "RNA"
)
#> Pooling...
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
#> Drop out cell number during kmean pooling is:
#> 24
pooled_kmean
#> An object of class Seurat 
#> 2000 features across 56 samples within 1 assay 
#> Active assay: RNA (2000 features, 0 variable features)
#>  1 layer present: counts
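
For intuition, the general idea of pooling similar cells with k-means can be sketched in base R as follows. This is an assumption-laden toy example, not CellDEEP.Kmean's actual algorithm: the log1p transform, the choice of k, and all variable names are made up here.

```r
# Sketch only: one plausible way to pool similar cells with base R k-means.
# This is not CellDEEP.Kmean's actual implementation.
set.seed(7)

n_genes <- 50
n_cells <- 30
pool_size <- 3  # hypothetical target pool size

counts <- matrix(
  rpois(n_genes * n_cells, lambda = 1),
  nrow = n_genes,
  dimnames = list(
    paste0("Gene", seq_len(n_genes)),
    paste0("Cell", seq_len(n_cells))
  )
)

# Cluster cells (rows of the transposed, log-transformed matrix)
# into about n_cells / pool_size groups.
k <- n_cells %/% pool_size
km <- kmeans(t(log1p(counts)), centers = k, nstart = 5)

# Sum raw counts within each k-means cluster to form pseudocells.
pseudocells <- t(rowsum(t(counts), group = km$cluster))

# k-means clusters are generally not equally sized, so pool sizes vary.
print(table(km$cluster))
print(dim(pseudocells))
```

Unlike random assignment, k-means does not produce equally sized groups, so any method built on it has to decide how to handle clusters that are larger or smaller than the target pool size.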

Random pooling

pooled_random <- CellDEEP.Random(
  pool_input,
  readcounts = "sum",
  n_cells = 5,
  min_cells_per_subgroup = 1,
  assay_name = "RNA"
)
#> Pooling...
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
pooled_random
#> An object of class Seurat 
#> 2000 features across 32 samples within 1 assay 
#> Active assay: RNA (2000 features, 0 variable features)
#>  1 layer present: counts
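
Both pooling calls above use readcounts = "sum". The arithmetic difference between summing and averaging counts within one pool can be shown on a small hand-made matrix. This is a base R illustration with invented values, not output from CellDEEP.

```r
# One hypothetical pool of three cells (genes x cells), values invented.
pool <- matrix(
  c(0, 2, 1,
    3, 0, 0,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("Cell1", "Cell2", "Cell3"))
)

# "sum"-style aggregation: integer counts, larger total depth per pseudocell.
pooled_sum <- rowSums(pool)

# "mean"-style aggregation: stays on a per-cell scale but can be non-integer.
pooled_mean <- rowMeans(pool)

print(rbind(sum = pooled_sum, mean = pooled_mean))
```

Summing preserves integer counts, which matters for tests that assume count data, while averaging keeps values on the scale of a single cell.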

Note on Step 1: if no genes pass the adjusted p-value filter in this small example dataset, try a larger dataset or set full_list = TRUE.