This function uses the region type tags in a supplied BAM file to calculate the nuclear fraction statistic for each input cell barcode. The statistic is simply the fraction of reads that are intronic:

nuclear fraction = # intronic reads / (# intronic reads + # exonic reads)
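As a worked illustration of the formula above, the following base R sketch computes the statistic from per-barcode read counts. The barcode names and counts are made up for illustration, and the sketch does not reproduce how the function itself tallies reads from the BAM file.

```r
# Hypothetical per-barcode read counts (illustrative values only,
# not taken from any real BAM file)
counts <- data.frame(
  intronic = c(90, 40, 0),
  exonic = c(10, 60, 25),
  row.names = c("BARCODE_A-1", "BARCODE_B-1", "BARCODE_C-1")
)

# nuclear fraction = intronic / (intronic + exonic)
result <- data.frame(
  nuclear_fraction = counts$intronic / (counts$intronic + counts$exonic),
  row.names = rownames(counts)
)

result
```

A barcode with only exonic reads scores 0 and a barcode with only intronic reads scores 1.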

The row names of the returned data frame match the order and names of the supplied barcodes. At a minimum, you can provide a directory containing Cell Ranger output (outs) as input.

nuclear_fraction_tags(
  outs = NULL,
  bam = NULL,
  bam_index = paste0(bam, ".bai"),
  barcodes = NULL,
  cores = future::availableCores() - 1,
  tiles = 100,
  cell_barcode_tag = "CB",
  region_type_tag = "RE",
  exon_tag = "E",
  intron_tag = "N",
  verbose = TRUE
)

Arguments

outs

character, the path to the 'outs' directory created by Cell Ranger. We assume outs is structured this way:

├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz

Note that there will probably be other files in the directory as well. These can be ignored: the function requires only three files, possorted_genome_bam.bam, possorted_genome_bam.bam.bai and filtered_feature_bc_matrix/barcodes.tsv.gz. This is the only argument you need to supply. If your directory structure doesn't match the one created by Cell Ranger, you can instead provide the file paths directly using the bam, bam_index and barcodes arguments.
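Before running the function on a large dataset, it can be useful to confirm that the three files are where they are expected to be. The sketch below uses base R only. `check_outs_files()` is a hypothetical helper written for this illustration and is not part of DropletQC; the demo directory is a throwaway created under `tempdir()`.

```r
# Hypothetical helper (not part of DropletQC): report which of the
# three files expected in an 'outs' directory are present
check_outs_files <- function(outs) {
  required <- c(
    "possorted_genome_bam.bam",
    "possorted_genome_bam.bam.bai",
    file.path("filtered_feature_bc_matrix", "barcodes.tsv.gz")
  )
  present <- file.exists(file.path(outs, required))
  names(present) <- required
  present
}

# Demo on a temporary directory that contains only an empty BAM placeholder
demo_outs <- file.path(tempdir(), "outs_demo")
dir.create(file.path(demo_outs, "filtered_feature_bc_matrix"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_outs, "possorted_genome_bam.bam"))

status <- check_outs_files(demo_outs)
status
```

Any entry that is FALSE points to a file that would need to be supplied through the bam, bam_index or barcodes arguments instead.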

bam

character, the path to the input BAM file. Not required if an 'outs' directory is provided.

bam_index

character, the path to the input BAM file index. Not required if an 'outs' directory is provided.

barcodes

character, either a vector of barcode names or the path to the barcodes.tsv.gz file output by Cell Ranger. If providing the cell barcodes as a vector, make sure that the format matches the one in the BAM file - e.g. be mindful if there are integers appended to the end of the barcode sequence. This argument isn't required if an 'outs' directory is provided - the function will just look for "barcodes.tsv.gz" in outs/filtered_feature_bc_matrix.
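When passing barcodes as a vector, a quick format check can catch mismatches with the BAM file early. The sketch below assumes the cell barcode tag in the BAM file carries a "-1" suffix, as in the barcodes shown in the Examples section; that suffix is an assumption about your data and should be confirmed against your own BAM file.

```r
# Barcodes as they might arrive from another tool: one lacks the suffix
barcodes <- c("AAAAGTCACTTACTTG", "AAAAGTGGATCTCTAA-1")

# Flag barcodes that already end in a hyphen followed by an integer
has_suffix <- grepl("-[0-9]+$", barcodes)

# Assumption: the BAM file uses "-1"; append it only where it is missing
barcodes[!has_suffix] <- paste0(barcodes[!has_suffix], "-1")

barcodes
```

The reverse fix, stripping the suffix with `sub("-[0-9]+$", "", barcodes)`, applies when the BAM tag stores the bare sequence.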

cores

numeric, the number of cores to use. The function runs in parallel using furrr::future_map() with the requested number of cores. Setting cores = 1 will cause future_map() to run sequentially.

tiles

numeric, the number of tiles to split the genome into. To speed up processing of the BAM file, the genome is split into tiles and reads are processed in chunks.

cell_barcode_tag

character, the BAM tag containing the cell barcode sequence

region_type_tag

character, the BAM tag containing the region type

exon_tag

character, the character string that defines a read as exonic

intron_tag

character, the character string that defines a read as intronic

verbose

logical, whether or not to print progress

Value

data.frame, the function returns a 1-column data frame containing the calculated nuclear fraction statistic for each input barcode. The order and names of the rows will match those of the input cell barcodes.

Examples

nf1 <- nuclear_fraction_tags(
  outs = system.file("extdata", "outs", package = "DropletQC"),
  tiles = 1,
  cores = 1,
  verbose = FALSE
)
head(nf1)
#>                    nuclear_fraction
#> AAAAGTCACTTACTTG-1        0.9032698
#> AAAAGTGGATCTCTAA-1        0.4032761
#> AAAGCAGTTACGAAGA-1        0.3957704
#> AACGACTTCAATATGT-1        0.4004525
#> AACGGCGTCATCTGGA-1        0.8845109
#> AAGCAGGGGTCGCGAA-1        0.3929376

nf2 <- nuclear_fraction_tags(
  bam = system.file("extdata", "outs", "possorted_genome_bam.bam",
                    package = "DropletQC"),
  barcodes = c("AAAAGTCACTTACTTG-1",
               "AAAAGTGGATCTCTAA-1",
               "AAACACGTTCTCATCG-1"),
  tiles = 1,
  cores = 1,
  verbose = FALSE
)
nf2
#>                    nuclear_fraction
#> AAAAGTCACTTACTTG-1        0.9032698
#> AAAAGTGGATCTCTAA-1        0.4032761
#> AAACACGTTCTCATCG-1        0.0000000