Obtaining meta information


Above: Rule graph for obtaining meta information

Get dataset meta information

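To obtain meta information, we first run FFQ inside the get_meta_single rule. The rule writes the raw FFQ output as JSON and fails if that output contains the string error_msg: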

workflow/rules/preparation/get_dataset_meta.smk

rule get_meta_single:
    params: dataset="{dataset}"
    output: ffq_json=config['out_dir'] +"/meta/{dataset}/ffq_raw.json",
    log: config['logs_dir'] + "/{dataset}/ffq.log"
    benchmark: config['logs_dir'] + "/{dataset}/ffq.benchmark"
    resources:
        mem_mb=4000
    conda: "../../envs/ffq.yaml"
    shell: """
    ffq -o {output.ffq_json} {params.dataset} 2> {log}
    if grep "error_msg" {output.ffq_json}; then exit 1; fi
    """

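The error check at the end of the rule can also be expressed in Python. The sketch below is not part of the pipeline: the helper names contains_key and ffq_failed are hypothetical, and the only thing assumed about the FFQ output is that a failed lookup leaves an error_msg key somewhere in the JSON. Unlike grep, which matches the string anywhere in the file, this version only looks at keys.

```python
import json


def contains_key(node, key):
    """Recursively check whether `key` appears as a dict key anywhere in `node`."""
    if isinstance(node, dict):
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(v, key) for v in node)
    return False


def ffq_failed(json_text):
    """Return True if the parsed JSON contains an 'error_msg' key (assumed failure marker)."""
    return contains_key(json.loads(json_text), "error_msg")


if __name__ == "__main__":
    print(ffq_failed('{"GSE1": {"error_msg": "not found"}}'))
    print(ffq_failed('{"GSE1": {"samples": [{"accession": "SRS1"}]}}'))
```
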
10x whitelists

The get_whitelists rule downloads the 10x barcode whitelists. The raw data are later validated against these whitelists to identify the chemistry version.

workflow/rules/resources/get_whitelists.smk

rule get_whitelists:
    output:
        whitelist_10x_v1 = config["resources"] + "/10xv1_whitelist.txt",
        whitelist_10x_v2 = config["resources"] + "/10xv2_whitelist.txt",
        whitelist_10x_v3 = config["resources"] + "/10xv3_whitelist.txt"
    conda: "../../envs/git.yaml"
    log: config["logs_dir"] + "/resources/get_whitelists.log"
    benchmark: config["logs_dir"] + "/resources/get_whitelists.benchmark"
    shell: """
    # -o starts the log file; -a appends so later downloads do not overwrite it
    wget -o {log} -O {output.whitelist_10x_v1} https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/737K-april-2014_rc.txt
    wget -a {log} -O {output.whitelist_10x_v2} https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/737K-august-2016.txt
    wget -a {log} -O - https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz | zcat > {output.whitelist_10x_v3}
    """

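To illustrate how whitelists can be used to identify the chemistry version, here is a minimal sketch that scores a sample of observed barcodes against each whitelist and picks the best match. It is not the pipeline's actual implementation: the function names, the in-memory sets, and the 0.5 threshold are assumptions made for the example.

```python
def whitelist_match_fraction(barcodes, whitelist):
    """Fraction of observed barcodes that are present in the whitelist."""
    barcodes = list(barcodes)
    if not barcodes:
        return 0.0
    hits = sum(1 for bc in barcodes if bc in whitelist)
    return hits / len(barcodes)


def guess_10x_version(barcodes, whitelists, min_fraction=0.5):
    """Return the name of the best-matching whitelist, or None if no match is good enough.

    `whitelists` maps a version name to a set of barcodes.
    `min_fraction` is an arbitrary threshold chosen for this sketch.
    """
    if not whitelists:
        return None
    barcodes = list(barcodes)
    scores = {
        name: whitelist_match_fraction(barcodes, wl)
        for name, wl in whitelists.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_fraction else None


if __name__ == "__main__":
    toy_whitelists = {"10xv2": {"AAAA", "CCCC"}, "10xv3": {"GGGG", "TTTT"}}
    print(guess_10x_version(["AAAA", "CCCC", "GGGG"], toy_whitelists))
```

In practice each whitelist would be loaded from the downloaded text file into a set, and only a subsample of reads would be checked to keep the step fast.
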
Defining technology

Once we have obtained the meta information for a dataset, prefetched the files from SRA, and obtained the whitelists, we can try to identify the technology that was used to generate the scRNA-seq dataset.

Currently supported technologies are 10x and Dropseq. We plan to support more, since STAR makes it relatively easy to add new technologies.

This step relies on several in-house scripts; a brief description follows.

We first parse the XML from Entrez to extract the technology name from the descriptions and library preparation protocols. For this step, see parse_srs_for_tech in workflow/scripts/DefineTechnologyUtils.py.
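
As a rough illustration of this kind of parsing, the sketch below collects all text from an XML record and searches it for technology keywords. It is not the actual parse_srs_for_tech: the function name guess_tech_from_xml, the keyword patterns, and the example XML are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword patterns; the real script may use different rules.
TECH_PATTERNS = {
    "10x": re.compile(r"10x|chromium", re.IGNORECASE),
    "dropseq": re.compile(r"drop-?seq", re.IGNORECASE),
}


def guess_tech_from_xml(xml_text):
    """Return the names of all technologies whose keywords occur in the XML text."""
    root = ET.fromstring(xml_text)
    # Join the text of every element so descriptions and protocols are searched together.
    text = " ".join(t for t in root.itertext() if t.strip())
    return [name for name, pattern in TECH_PATTERNS.items() if pattern.search(text)]


if __name__ == "__main__":
    record = (
        "<SAMPLE><DESCRIPTION>Libraries were prepared with "
        "10x Chromium v2</DESCRIPTION></SAMPLE>"
    )
    print(guess_tech_from_xml(record))
```

Returning a list rather than a single name makes ambiguous records (no match, or more than one match) easy to detect and report.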

We then try to identify which raw files should be used as input for quantification.

Submitters could provide raw data either as fastq files or as processed bam files, and SRA later processed these files in different ways. As a result, fastq-dump and parallel-fastq-dump (the recommended ways to obtain raw data from SRA) might in some cases return incomplete data. For example, run SRR7425023 contains only biological reads, which is not enough to reconstruct the scRNA-seq dataset.

If we cannot identify barcode and cDNA read files in the parallel-fastq-dump results, we then look at the FTP files, which are also listed in the FFQ results.

We try the files in the following order:

files from parallel-fastq-dump
fastq files from FTP
bam files from FTP

If we cannot identify barcode/cDNA information from any of these files (in that order), or if the dataset was generated using an unsupported technology, define_tech must fail with an error that explains why the technology could not be identified.
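
The fallback order described above can be sketched as a loop over candidate sources that stops at the first one where barcode and cDNA reads are found, and otherwise raises an error listing why each source was rejected. This is a simplified illustration, not the define_tech code: pick_input_files, identify_by_suffix, and the file-name convention are all hypothetical.

```python
def identify_by_suffix(files):
    """Toy identifier: treat *_1.fastq as barcode reads and *_2.fastq as cDNA reads."""
    barcode = [f for f in files if f.endswith("_1.fastq")]
    cdna = [f for f in files if f.endswith("_2.fastq")]
    if barcode and cdna:
        return {"barcode": barcode[0], "cdna": cdna[0]}
    return None


def pick_input_files(candidates, identify):
    """Try each (source, files) pair in order and return the first usable one.

    Raises ValueError with per-source context if no source is usable.
    """
    failures = []
    for source, files in candidates:
        if not files:
            failures.append(f"{source}: no files available")
            continue
        result = identify(files)
        if result is not None:
            return source, result
        failures.append(f"{source}: could not identify barcode and cDNA reads")
    raise ValueError("Could not define technology; " + "; ".join(failures))


if __name__ == "__main__":
    candidates = [
        ("parallel-fastq-dump", ["SRR_3.fastq"]),
        ("ftp_fastq", ["a_1.fastq", "a_2.fastq"]),
        ("ftp_bam", []),
    ]
    print(pick_input_files(candidates, identify_by_suffix))
```

Collecting the reason for each rejected source keeps the final error message informative when every option fails.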

Combining the results

While define_tech works on a single dataset, our pipeline can process multiple datasets at the same time. The get_all_meta rule therefore concatenates the modified JSON files into one file, which the pipeline later uses to generate the evaluation DAG for processing.
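
A minimal sketch of such a concatenation step is shown below. The layout of the combined file is an assumption: here each per-dataset JSON document becomes one element of a list, which may differ from what get_all_meta actually writes. The function name combine_meta is hypothetical.

```python
import json


def combine_meta(json_paths, out_path):
    """Read each per-dataset JSON file and write them all into one JSON list."""
    combined = []
    for path in json_paths:
        with open(path) as handle:
            combined.append(json.load(handle))
    with open(out_path, "w") as handle:
        json.dump(combined, handle, indent=2)
    return combined
```

The order of the input paths is preserved in the output, so the combined file is deterministic for a given list of datasets.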