4 Split data
For data integration and cell type classification purposes, it is recommended to split the data into smaller subsets that reflects a single homogeneous experimental unit. In the context of cell type classification we define an experimental unit as the group of cells sharing similar biological and technical variance components. A batch (e.g pool/well from a single cell experiment) represents the most common experimental unit.
Therefore, we recommend splitting the data for:
- Faster classification/integration. Splitting the data avoids extensive search for cell anchors and neighbors in the azimuth approach
- Parallelization. Each experimental unit can be processed and analyzed independently using multiple CPU workers
- Avoiding intra-batch effects in the query dataset. Cell differences driven by batch effects within the query dataset may influence the performance of the integration/alignment.
We recommend splitting the data by experimental units (e.g. batches). If the Seurat query object represents a single experimental unit, no splitting is necessary and this section can be skipped.
To split the data, we run the following code. First, let’s create an output directory
mkdir step1_split
Let’s assume we have a Seurat RDS object called query.RDS
with a metadata column name
called pool
containing the values 1
and 2
corresponding to two independent pools. We can split the data by pool using the split.R
module.
Let’s examine the parameter information:
singularity exec -B $PWD cell_classification.sif \
Rscript /split.R --help
Usage: split.R [options]
Options:
--file=CHARACTER
RDS object file name
--batch=CHARACTER
Batch column to split the data
--out=CHARACTER
Output file name [default= split_object]
--path=CHARACTER
Output path to store results [default= .]
-h, --help
Show this help message and exit
We can now split the data as follows
singularity exec -B $PWD cell_classification.sif \
\
Rscript /split.R --file query.RDS \
--batch pool \
--out query \
--path step1_split
split.R
takes two main arguments. --file
corresponds to the seurat object
filename in .RDS
format. --batch
expects a string corresponding to a column name
in the @metadata
slot. A group of RDS files following the *{out}_{batch levels}.RDS*
filename pattern are created. In this example, two files called query_1.RDS
and query_2.RDS
are stored in the step1_split
directory.
We strongly recommend storing results in different directories to avoid file overwriting.