Calculate the nuclear fraction statistic using BAM tags — nuclear_fraction_tags

This function uses the region type tags in a provided BAM file to calculate the nuclear fraction statistic for each input cell barcode. The statistic is the fraction of a barcode's intronic and exonic reads that are intronic:

nuclear fraction = # intronic reads / (# intronic reads + # exonic reads)
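
As a minimal sketch of the arithmetic only (not the package's internal code), the statistic can be computed by hand for one barcode. The vector region_tags is hypothetical, made-up data; it uses the default exon_tag ("E") and intron_tag ("N") values, plus an assumed third value "I" for reads that are neither, which the formula ignores.

```r
# Hypothetical region-type tags for the reads of a single cell barcode.
# "E" = exonic and "N" = intronic (the default exon_tag and intron_tag);
# "I" is an assumed value for reads that are neither and are not counted.
region_tags <- c("E", "N", "N", "I", "E", "N", "N")

n_intronic <- sum(region_tags == "N")  # 4 intronic reads
n_exonic   <- sum(region_tags == "E")  # 2 exonic reads

# Fraction of the intronic + exonic reads that are intronic
nuclear_fraction <- n_intronic / (n_intronic + n_exonic)
nuclear_fraction
#> [1] 0.6666667
```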

The row names of the returned data frame match the order and names of the supplied barcodes. At a minimum, you can provide as input a directory containing Cell Ranger output (outs).

nuclear_fraction_tags(
  outs = NULL,
  bam = NULL,
  bam_index = paste0(bam, ".bai"),
  barcodes = NULL,
  cores = future::availableCores() - 1,
  tiles = 100,
  cell_barcode_tag = "CB",
  region_type_tag = "RE",
  exon_tag = "E",
  intron_tag = "N",
  verbose = TRUE
)

Arguments

outs	character, the path to the 'outs' directory created by Cell Ranger. We assume outs is structured this way:
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
The directory will probably contain other files as well, but the function only requires three: possorted_genome_bam.bam, possorted_genome_bam.bam.bai and filtered_feature_bc_matrix/barcodes.tsv.gz. This is the only required argument for the function. If your directory structure doesn't match the one created by Cell Ranger, you can provide the file paths directly using the bam, bam_index and barcodes arguments.
bam	character, the path to the input BAM file. Not required if an 'outs' directory is provided.
bam_index	character, the path to the input BAM file index. Not required if an 'outs' directory is provided.
barcodes	character, either a vector of barcode names or the path to the barcodes.tsv.gz file output by Cell Ranger. If providing the cell barcodes as a vector, make sure that their format matches the one in the BAM file, e.g. be mindful of any integers appended to the end of the barcode sequence. This argument isn't required if an 'outs' directory is provided, in which case the function looks for "barcodes.tsv.gz" in outs/filtered_feature_bc_matrix.
cores	numeric, runs the function in parallel using furrr::future_map() with the requested number of cores. Setting `cores=1` will cause future_map to run sequentially.
tiles	numeric, to speed up processing of the BAM file, the genome is split into tiles and the reads are processed in chunks.
cell_barcode_tag	character, the BAM tag containing the cell barcode sequence
region_type_tag	character, the BAM tag containing the region type
exon_tag	character, the character string that defines a read as exonic
intron_tag	character, the character string that defines a read as intronic
verbose	logical, whether or not to print progress
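
The barcodes argument asks that a supplied vector match the barcode format stored in the BAM file. The following is a rough sketch of one way to check and harmonise the format before calling the function. It assumes the BAM file stores barcodes with a "-1" suffix, as in the example data shown on this page; the barcodes vector itself is hypothetical, and you should confirm the suffix used in your own BAM file.

```r
# Hypothetical input: one barcode is missing the integer suffix.
barcodes <- c("AAAAGTCACTTACTTG", "AAAAGTGGATCTCTAA-1")

# Flag barcodes that already end in a hyphen followed by an integer
has_suffix <- grepl("-[0-9]+$", barcodes)

# Assumption: the BAM file uses the "-1" suffix, so append it where missing
barcodes[!has_suffix] <- paste0(barcodes[!has_suffix], "-1")
barcodes
#> [1] "AAAAGTCACTTACTTG-1" "AAAAGTGGATCTCTAA-1"
```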

Value

data.frame, the function returns a 1-column data frame containing the calculated nuclear fraction statistic for each input barcode. The order and names of the rows will match those of the input cell barcodes.
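
Because the row names of the result are the input barcodes, the statistic can be attached to other per-cell data by matching on row names. This is only an illustrative sketch: nf stands in for a returned data frame and meta is a hypothetical per-cell table, both filled with made-up values.

```r
# Stand-in for the returned data frame (made-up values)
nf <- data.frame(
  nuclear_fraction = c(0.90, 0.40, 0.00),
  row.names = c("AAAAGTCACTTACTTG-1", "AAAAGTGGATCTCTAA-1", "AAACACGTTCTCATCG-1")
)

# Hypothetical per-cell metadata with the barcodes in a different order
meta <- data.frame(
  umi_count = c(120, 3500, 4100),
  row.names = c("AAACACGTTCTCATCG-1", "AAAAGTGGATCTCTAA-1", "AAAAGTCACTTACTTG-1")
)

# Align by barcode rather than by position
meta$nuclear_fraction <- nf$nuclear_fraction[match(rownames(meta), rownames(nf))]
meta
```

Matching on row names rather than relying on position avoids silently misaligning cells when the two tables are ordered differently.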

Examples

nf1 <- nuclear_fraction_tags(
  outs = system.file("extdata", "outs", package = "DropletQC"),
  tiles = 1, cores = 1, verbose = FALSE
)
head(nf1)
#>                    nuclear_fraction
#> AAAAGTCACTTACTTG-1        0.9032698
#> AAAAGTGGATCTCTAA-1        0.4032761
#> AAAGCAGTTACGAAGA-1        0.3957704
#> AACGACTTCAATATGT-1        0.4004525
#> AACGGCGTCATCTGGA-1        0.8845109
#> AAGCAGGGGTCGCGAA-1        0.3929376

nf2 <- nuclear_fraction_tags(
  bam = system.file("extdata", "outs", "possorted_genome_bam.bam",
                    package = "DropletQC"),
  barcodes = c("AAAAGTCACTTACTTG-1",
               "AAAAGTGGATCTCTAA-1",
               "AAACACGTTCTCATCG-1"),
  tiles = 1, cores = 1, verbose = FALSE
)
nf2
#>                    nuclear_fraction
#> AAAAGTCACTTACTTG-1        0.9032698
#> AAAAGTGGATCTCTAA-1        0.4032761
#> AAACACGTTCTCATCG-1        0.0000000