This function uses a combination of the cell UMI counts and the nuclear fraction score to assign each cell one of two values: "cell" or "damaged_cell". This is based on the idea that damaged cells have a lower UMI count and a higher nuclear fraction than whole cells.

The expected input is a data frame with four columns. The first three columns should contain the nuclear fraction score, the total UMIs and a character vector describing each cell as "cell" or "empty_droplet". This is the format output by the identify_empty_drops function. The fourth column should be a character vector with user-assigned cell types.

Internally, the provided data frame is split by cell type and a Gaussian mixture model with a maximum of two components is fit to the UMI counts and nuclear fraction scores. The parameters of the model are estimated using expectation maximisation (EM) with the mclust package. The best model is selected using the Bayesian Information Criterion (BIC). The two populations (cells and damaged cells) are assumed to have equal variance (mclust model name "EEI").
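The model fitting step described above can be sketched with mclust directly. This is a minimal illustration on simulated data for a single cell type, not the package's internal code: the simulated values, the log10 transform of the UMI counts and the variable names are all assumptions made for the example.

```r
library(mclust)  # attach the package so Mclust() can find its helper functions

set.seed(1)

# Simulated values for one cell type (hypothetical, for illustration only):
# 300 intact cells plus 100 damaged cells with fewer UMIs and a higher
# nuclear fraction
sim <- data.frame(
  nuclear_fraction = c(rnorm(300, mean = 0.30, sd = 0.04),
                       rnorm(100, mean = 0.60, sd = 0.04)),
  log10_umi        = c(rnorm(300, mean = 4.0, sd = 0.10),
                       rnorm(100, mean = 3.4, sd = 0.10))
)

# Fit Gaussian mixtures with one or two components, restricted to the "EEI"
# covariance structure. Mclust() estimates the parameters with EM and keeps
# the model with the best BIC.
fit <- Mclust(sim, G = 1:2, modelNames = "EEI", verbose = FALSE)

fit$G                      # number of components selected by BIC
fit$parameters$mean        # component means (rows: variables, columns: components)
table(fit$classification)  # number of cells assigned to each component
```

If only one component is selected, there is no second population to label, which matches the "maximum of two components" wording above.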

identify_damaged_cells(
  nf_umi_ed_ct,
  nf_sep = 0.15,
  umi_sep_perc = 50,
  output_plots = FALSE,
  verbose = TRUE
)

Arguments

nf_umi_ed_ct

data frame, with four columns. The first three columns should match the output from the identify_empty_drops function. The fourth column should contain cell type names.

nf_sep

numeric, the minimum separation of the nuclear fraction score required between the cell and damaged cell populations.

umi_sep_perc

numeric, the minimum percentage by which the mean UMI count of the damaged cell population must fall below that of the cell population. For example, if the mean of the distribution fit to the whole cell population is 10,000 UMIs and umi_sep_perc is 30 (%), the mean of the distribution fit to the damaged cell population must be less than 7,000 UMIs.

output_plots

logical, whether or not to return plots

verbose

logical, whether to print updates and progress while fitting with EM
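To make the two separation arguments concrete, the sketch below applies them to a pair of fitted component means. The helper passes_separation() and its argument names are hypothetical and are not part of the package. It assumes that nf_sep is compared against the difference in mean nuclear fraction and that umi_sep_perc is compared against the percentage drop in mean UMI count, as the argument descriptions suggest.

```r
# Hypothetical helper (not part of the package): checks whether a candidate
# damaged-cell population is separated far enough from the whole-cell
# population on both axes
passes_separation <- function(cell_nf, damaged_nf, cell_umi, damaged_umi,
                              nf_sep = 0.15, umi_sep_perc = 50) {
  # The damaged population must sit at least nf_sep higher in nuclear fraction
  nf_ok <- (damaged_nf - cell_nf) >= nf_sep
  # Upper limit on the mean UMI count of the damaged population
  umi_limit <- cell_umi * (1 - umi_sep_perc / 100)
  umi_ok <- damaged_umi < umi_limit
  nf_ok && umi_ok
}

# Worked example from the umi_sep_perc description: a cell mean of 10,000 UMIs
# and a 30% separation give the limit quoted there
10000 * (1 - 30 / 100)

# Damaged mean of 6,000 UMIs is below the limit, so both checks pass
passes_separation(cell_nf = 0.30, damaged_nf = 0.55,
                  cell_umi = 10000, damaged_umi = 6000,
                  nf_sep = 0.15, umi_sep_perc = 30)

# Damaged mean of 8,000 UMIs is above the limit, so the UMI check fails
passes_separation(cell_nf = 0.30, damaged_nf = 0.55,
                  cell_umi = 10000, damaged_umi = 8000,
                  nf_sep = 0.15, umi_sep_perc = 30)
```

Raising either argument makes the function more conservative, since fewer candidate populations would then qualify as damaged cells.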

Value

list, of length two. The first element in the list contains a data frame with the same dimensions as the data frame passed to the nf_umi_ed_ct argument, with "damaged_cell" now recorded in the third column. The second element is NULL unless output_plots=TRUE. If requested, three plots are returned for each cell type in a named list, combined using ggpubr::ggarrange.

For each cell type, the first plot illustrates the cell and damaged cell populations (if any) in a plot of nuclear fraction vs log10(UMI counts). Damaged cells are expected to be in the lower right portion of the plot (lower UMI counts and higher nuclear fraction). The second and third plots show the model fits to the nuclear fraction and UMI count distributions respectively. Solid lines indicate the distribution mean, while dashed lines indicate the positions of the thresholds controlled by the nf_sep and umi_sep_perc parameters.

Examples

#1
data("qc_examples")
gbm <- qc_examples[qc_examples$sample=="MB",]
gbm.ed <- gbm[,c("nuclear_fraction_droplet_qc","umi_count")]
gbm.ed <- identify_empty_drops(nf_umi = gbm.ed)
gbm.ed$cell_type <- gbm$cell_type
gbm.ed.dc <- identify_damaged_cells(gbm.ed, verbose=FALSE)
gbm.ed.dc <- gbm.ed.dc[[1]]
head(gbm.ed.dc)
#>                    nuclear_fraction_droplet_qc umi_count cell_status
#> AAACCCACAAGAATAC-1                   0.4176170     12557        cell
#> AAACCCACAATAAGGT-1                   0.3744720     10958        cell
#> AAACCCACAGCCCACA-1                   0.6492798      6095        cell
#> AAACCCACAGTAACGG-1                   0.3900973     12802        cell
#> AAACCCACATAAGATG-1                   0.1312698     13672        cell
#> AAACCCACATAATCGC-1                   0.4443030     26653        cell
#>                                cell_type
#> AAACCCACAAGAATAC-1   neuron_unresolved_2
#> AAACCCACAATAAGGT-1 migrating_interneuron
#> AAACCCACAGCCCACA-1    neuron_hippocampus
#> AAACCCACAGTAACGG-1 migrating_interneuron
#> AAACCCACATAAGATG-1   neuron_unresolved_1
#> AAACCCACATAATCGC-1 migrating_interneuron
table(gbm.ed.dc$cell_status)
#>
#>          cell  damaged_cell empty_droplet
#>          7389          1349           920
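
A common next step is to keep only the barcodes labelled "cell" and to check which cell types contribute the damaged cells. The sketch below shows one way to do this in base R on a small made-up data frame with the same four columns as the output above. The values, cell type labels and barcode names are invented for illustration.

```r
# Toy stand-in for the first element of the returned list (values are made up)
toy <- data.frame(
  nuclear_fraction_droplet_qc = c(0.42, 0.37, 0.65, 0.05, 0.71),
  umi_count   = c(12557, 10958, 2095, 310, 1800),
  cell_status = c("cell", "cell", "damaged_cell", "empty_droplet", "damaged_cell"),
  cell_type   = c("neuron", "interneuron", "neuron", "neuron", "interneuron"),
  row.names   = paste0("barcode_", 1:5)
)

# Keep only the barcodes classified as intact cells
kept <- toy[toy$cell_status == "cell", ]
rownames(kept)

# Cross-tabulate status by cell type to see where damaged cells concentrate
status_by_type <- table(toy$cell_type, toy$cell_status)
status_by_type
```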