scNavigator processing pipeline
This repository contains the Snakemake pipeline that we use to process public scRNA-seq data in scNavigator. The main branch covers processing of the raw data and basic Seurat analysis.
Read the full documentation at https://scn-pipeline.readthedocs.io/.
To install the pipeline, clone the main branch and install Snakemake:
$ git clone https://github.com/ctlab/scn-pipeline.git
$ cd scn-pipeline
$ conda install -n base -c conda-forge mamba
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake
For all the steps in this pipeline we have specified the minimum environment required to run the step, so please consider running this pipeline using:
Configuration
You first have to configure the project and provide paths to the relevant
files and folders. The configuration is stored in the configs/config.yaml file and
consists of several fields. Only two fields are required: out_dir and ncbi_dir.
out_dir is a directory that will be used to store results, preliminary results, resources, and logs.
ncbi_dir is a directory configured via vdb-config
(see README.vdb-config in https://github.com/ncbi/sra-tools);
the path to this directory is usually stored in ~/.ncbi/user-settings.mkfg
out_dir: '/path/to/out/dir' # output directory with all the results
ncbi_dir: '/path/to/configured/ncbi/folder' # directory from vdb-config
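Before launching Snakemake, it can help to check that the two required fields are set. The snippet below is a minimal sketch, not part of the pipeline: `check_required_fields` is a hypothetical helper, and it assumes the config has already been loaded into a dictionary (for example with a YAML parser). It also assumes that ncbi_dir must already exist because it was set up via vdb-config, while out_dir only needs to be filled in.

```python
from pathlib import Path

# The two fields this README marks as required.
REQUIRED_FIELDS = ("out_dir", "ncbi_dir")


def check_required_fields(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields look usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"{field} is missing or empty")
    # Assumption: the vdb-config directory has to exist before the pipeline runs.
    ncbi_dir = config.get("ncbi_dir")
    if ncbi_dir and not Path(ncbi_dir).is_dir():
        problems.append(f"ncbi_dir is not an existing directory: {ncbi_dir}")
    return problems


if __name__ == "__main__":
    # Placeholder paths copied from the example config; replace them with real ones.
    example = {
        "out_dir": "/path/to/out/dir",
        "ncbi_dir": "/path/to/configured/ncbi/folder",
    }
    for problem in check_required_fields(example):
        print(problem)
```

An empty result only means the required fields are present; it does not verify that vdb-config itself was set up correctly.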
Running pipeline
Once the pipeline is configured, list the datasets in ./config/dataset.yaml.
The file should contain just a list of dataset IDs (see example below).
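Because the file is just a flat list, it can also be generated from a script. The sketch below is only an illustration under stated assumptions: `render_dataset_list` is a hypothetical helper, the list is assumed to be written in YAML `- item` form, and the IDs are placeholders rather than real datasets.

```python
def render_dataset_list(dataset_ids: list[str]) -> str:
    """Render dataset IDs as a flat YAML list, one '- ID' entry per line."""
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [dataset_id.strip() for dataset_id in dataset_ids if dataset_id.strip()]
    return "".join(f"- {dataset_id}\n" for dataset_id in cleaned)


if __name__ == "__main__":
    # Placeholder IDs for illustration only; use the IDs of the datasets to process.
    print(render_dataset_list(["DATASET_ID_1", "DATASET_ID_2"]), end="")
```

The returned string can be written to the dataset file as-is.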
Once the datasets are specified, the pipeline consists of two main steps:
- Acquiring meta information (we use FFQ + custom scripts to detect single-cell technology and version of chemistry)
- Processing (we use STAR + Seurat for processing and further analysis of the dataset)
To acquire all the meta information run
To process the datasets for which you already have meta information simply run
Results of the pipeline can be found in the out_dir directory that you configured:
- /out_dir/resources - all the resources (whitelists and genome indexes) will be stored here
- /out_dir/logs - all the logs will be stored here
- /out_dir/meta/{dataset} - meta information for the datasets (results of get_all_meta)
- /out_dir/data/samples/{dataset}/{sample} - sample-level analysis and STAR results are stored here
- /out_dir/data/datasets/{dataset}/ - dataset-level integration analysis results are stored here
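When post-processing results in a script, the directories described above can be derived from out_dir, a dataset ID, and a sample ID. This is a minimal sketch: `output_layout` and its dictionary keys are hypothetical names chosen for illustration, and only the directory structure itself comes from the list above.

```python
from pathlib import Path


def output_layout(out_dir: str, dataset: str, sample: str) -> dict[str, Path]:
    """Map each kind of result to the directory where the pipeline stores it."""
    root = Path(out_dir)
    return {
        "resources": root / "resources",  # whitelists and genome indexes
        "logs": root / "logs",
        "meta": root / "meta" / dataset,  # meta information for the dataset
        "sample": root / "data" / "samples" / dataset / sample,  # sample-level results
        "dataset": root / "data" / "datasets" / dataset,  # dataset-level integration
    }


if __name__ == "__main__":
    # Placeholder values; substitute the configured out_dir and real IDs.
    for name, path in output_layout("/path/to/out/dir", "DATASET_ID", "SAMPLE_ID").items():
        print(f"{name}: {path}")
```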