Identify damaged cells — identify_damaged

This function uses a combination of the cell UMI counts and the nuclear fraction score to assign each cell one of two values; "cell" or "damaged_cell". This is based on the idea that damaged cells have a lower UMI count and higher nuclear fraction than whole cells. The expected input is a data frame with four columns. The first three columns should contain; the nuclear fraction score, total UMIs and a character vector describing each cell as "cell" or "empty_droplet". This is the format output by the identify_empty_drops function. The fourth column should be a character vector with user-assigned cell types. Internally, the provided data frame is split by cell type and a Gaussian mixture model with a maximum of two components is fit to the umi counts and nuclear fraction scores. The parameters of the model are estimated using expectation maximisation (EM) with the mclust package. The best model is selected using the Bayesian Information Criterion (BIC). The two populations (cells and damaged cells) are assumed to have equal variance (mclust model name "EEI").

identify_damaged_cells(
  nf_umi_ed_ct,
  nf_sep = 0.15,
  umi_sep_perc = 50,
  output_plots = FALSE,
  verbose = TRUE
)

Arguments

nf_umi_ed_ct	data frame, with four columns. The first three columns should match the output from the `identify_empty_drops` function. The fourth column should contain cell type names.
nf_sep	numeric, the minimum separation of the nuclear fraction score required between the cell and damage cell populations
umi_sep_perc	numeric, this is the minimum percentage of UMIs which the damaged cell population is required to have compared to the cell population. For example, if the mean UMI of the distribution fit to the whole cell population is 10,000 UMIs, the mean of the distribution fit to the damaged cell population must be at less than 7,000 UMIs if the umi_sep parameter is 30 (%)
output_plots	logical, whether or not to return plots
verbose	logical, whether to print updates and progress while fitting with EM

Value

list, of length two. The first element in the list contains a data frame with the same dimensions input to the nf_umi_ed_ct argument, with "damaged_cell" now recorded in the third column. The second element is NULL unless output_plots=TRUE. If requested, three plots are returned for each cell type in a named list, combined using ggpubr::ggarrange. For each cell type, the first plot illustrates the cell and damaged cell populations (if any) in a plot of nuclear fraction vs log10(UMI counts). Damaged cells are expected to be in the lower right portion of the plot(lower UMI counts and higher nuclear fraction). The second and third plots show the model fits to the nuclear fraction and UMI count distributions respectively. Solid lines indicate the distribution mean, while dashed lines indicate the positions of the thrsholds controlled by the nf_sep and umi_sep parameters.

Examples

#1
data("qc_examples")
gbm <- qc_examples[qc_examples$sample=="MB",]
gbm.ed <- gbm[,c("nuclear_fraction_droplet_qc","umi_count")]
gbm.ed <- identify_empty_drops(nf_umi = gbm.ed)
gbm.ed$cell_type <- gbm$cell_type
gbm.ed.dc <- identify_damaged_cells(gbm.ed, verbose=FALSE)
gbm.ed.dc <- gbm.ed.dc[[1]]
head(gbm.ed.dc)
#>                    nuclear_fraction_droplet_qc umi_count cell_status
#> AAACCCACAAGAATAC-1                   0.4176170     12557        cell
#> AAACCCACAATAAGGT-1                   0.3744720     10958        cell
#> AAACCCACAGCCCACA-1                   0.6492798      6095        cell
#> AAACCCACAGTAACGG-1                   0.3900973     12802        cell
#> AAACCCACATAAGATG-1                   0.1312698     13672        cell
#> AAACCCACATAATCGC-1                   0.4443030     26653        cell
#>                                cell_type
#> AAACCCACAAGAATAC-1   neuron_unresolved_2
#> AAACCCACAATAAGGT-1 migrating_interneuron
#> AAACCCACAGCCCACA-1    neuron_hippocampus
#> AAACCCACAGTAACGG-1 migrating_interneuron
#> AAACCCACATAAGATG-1   neuron_unresolved_1
#> AAACCCACATAATCGC-1 migrating_interneuron
table(gbm.ed.dc$cell_status)
#> 
#>          cell  damaged_cell empty_droplet 
#>          7389          1349           920