Pipeline setup and configuration¶

The main files that govern how the pipeline is executed are listed below:

Snakefile
common.smk
config.yaml
resources.yaml
profile/uppsala/config.yaml
samples.tsv and units.tsv

More general information about the content of these files can be found in the hydra-genetics documentation under code standards, config and Snakefile.

Snakefile¶

The Snakefile is located in workflow/. It imports hydra-genetics modules and rules, modifies these rules when needed, imports pipeline-specific rules and defines rule orders. Finally, this is where the rule all is defined.

common.smk¶

The common.smk is located under workflow/rules/. It is a general rule file that takes care of actions not directly connected to running a specific program: version checks, import of the config, resources and tsv files, and validation using schemas. Functions used by pipeline-specific rules are also defined here, as is the function compile_output_list, which programmatically generates the list of all output files that the rule all in the Snakefile should target. See further Result files.
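
To illustrate the idea behind compile_output_list, the sketch below expands a small output specification into concrete file paths, one per sample and unit type. It is not the pipeline's implementation: the function name compile_output_list_sketch, the layout of the output specification and the example paths are all assumptions made for this illustration.

```python
from typing import Dict, List


def compile_output_list_sketch(
    output_spec: Dict[str, str], units: List[Dict[str, str]]
) -> List[str]:
    """Expand each output template once per (sample, type) pair.

    output_spec maps a result path template to the intermediate file it
    would be copied from (hypothetical layout, for illustration only).
    """
    pairs = sorted({(u["sample"], u["type"]) for u in units})
    targets = []
    for template in output_spec:
        if "{sample}" not in template:
            # Run-level file: requested once, not per sample.
            targets.append(template)
            continue
        for sample, unit_type in pairs:
            targets.append(template.format(sample=sample, type=unit_type))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(targets))


# Hypothetical inputs, not taken from the pipeline's output_list.json.
units = [
    {"sample": "NA12878", "type": "N"},
    {"sample": "NA12911", "type": "N"},
]
output_spec = {
    "results/{sample}/{sample}_{type}.vcf.gz": "snv_indels/example/{sample}_{type}.vcf.gz",
    "results/multiqc_DNA.html": "qc/multiqc/multiqc_DNA.html",
}
targets = compile_output_list_sketch(output_spec, units)
print(targets)
```

A list built this way can be handed to the input of the rule all, so that Snakemake works backwards from the requested files to the rules that produce them.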

config.yaml¶

The config.yaml is located under config/. The file ties together file paths, other dependencies and parameters for the different rules. See further pipeline configuration.

Current config.yaml:

---


resources: "config/resources.yaml"
samples: "config/samples.tsv"
units: "config/units.tsv"

output: "config/output_list.json"

default_container: "docker://hydragenetics/common:1.11.1"

modules:
  alignment: "v0.6.0"
  annotation: "v0.3.0"
  compression: "v2.0.0" 
  cnv_sv: "v0.5.0"
  filtering: "v0.2.0"
  misc: "v0.2.0"
  parabricks: "v1.1.0"
  prealignment: "v1.2.0"
  qc: "v0.4.1"
  references: "e71ee62"
  snv_indels: "v1.0.0"

reference:
  coverage_bed: "/beegfs-storage/data/ref_data/refseq/hastings_coverage_20250109.bed"
  design_bed: "/data/ref_data/wp3/hastings/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.sorted.bed"
  design_intervals: "/data/ref_data/wp3/hastings/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.sorted.interval_list"
  fasta: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta"
  fai: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.fai"
  genepanels: "/beegfs-storage/projects/wp3/Reference_files/Manifest/Clinical_research_exome/Gene_panels/genepanels.list"
  sites: "/data/ref_data/wp3/hastings/Homo_sapiens_assembly38.known_indels.vcf.gz"
  skip_chrs:
    - chrM

trimmer_software: "fastp_pe"

aligner: "bwa_cpu" # or "bwa_gpu"
snp_caller: "deepvariant_cpu" # or "deepvariant_gpu"

automap:
  container: "docker://hydragenetics/automap:1.2"
  build: "hg38"
  extra: "--DP 10 --minsize 3 --chrX"
  outdir: "cnv_sv/automap"

bcftools_view_deepvariant:
  regions: "/data/ref_data/wp3/hastings/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.sorted_20bp_pad.bed"

bcftools_hardfilter_exomedepth:
  exclude: "'BF<=0'"

bwa_mem:
  container: "docker://hydragenetics/bwa:0.7.15"
  amb: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.amb"
  ann: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.ann"
  bwt: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.bwt"
  pac: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.pac"
  sa: "/data/ref_data/wp3/hastings/GCA_000001405.15_GRCh38_no_alt_analysis_set_masked_chr.fasta.sa"
  extra: "-K 100000000"

create_cov_excel:
  covLimits: "10 20 30"

deepvariant:
  container: "docker://google/deepvariant:1.6.1" 
  bed: "/data/ref_data/wp3/hastings/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.sorted_20bp_pad.bed"
  model_type: "WES"
  output_gvcf: True

exomedepth_call:
  container: "docker://hydragenetics/exomedepth:1.1.15"
  bedfile: "/data/ref_data/wp3/hastings/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.sorted.bed" 
  exonsfile: "/data/ref_data/wp3/hastings/exons_GRCh38_ensembl109.txt"
  genesfile: "/data/ref_data/wp3/hastings/genes_GRCh38_ensembl109.txt"
  genome_version: "hg38"

exomedepth_export:
  container: "docker://hydragenetics/exomedepth:1.1.15"

fastp_pe:
  container: "docker://hydragenetics/fastp:0.20.1"
  # Default enabled trimming parameters for fastp. Specified for clarity.
  extra: "--trim_poly_g --qualified_quality_phred 15 --unqualified_percent_limit 40 --n_base_limit 5 --length_required 15"

fastqc:
  container: "docker://hydragenetics/fastqc:0.11.9"

glnexus_peddy:
  container: "docker://ghcr.io/dnanexus-rnd/glnexus:v1.4.1"
  configfile: "DeepVariantWES"

glnexus_trio:
  container: "docker://ghcr.io/dnanexus-rnd/glnexus:v1.4.1"
  configfile: "DeepVariantWES"

mosdepth_bed:
  container: "docker://hydragenetics/mosdepth:0.3.2"
  extra: ""

multiqc:
  container: "docker://hydragenetics/multiqc:1.21"
  reports:
    DNA:
      config: "config/multiqc_config_DNA.yaml"
      included_unit_types: ["N"]
      qc_files:
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1_fastqc.zip"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2_fastqc.zip"
        - "qc/mosdepth_bed_design/{sample}_{type}.mosdepth.summary.txt"
        - "qc/mosdepth_bed_design/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/mosdepth_bed_design/{sample}_{type}.mosdepth.global.dist.txt"
        - "qc/peddy/peddy.peddy.ped"
        - "qc/peddy/peddy.background_pca.json"
        - "qc/peddy/peddy.ped_check.csv"
        - "qc/peddy/peddy.sex_check.csv"
        - "qc/peddy/peddy.het_check.csv"
        - "qc/peddy/peddy_sex_check_mqc.tsv"
        - "qc/peddy/peddy_rel_check_mqc.tsv"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
        - "qc/picard_collect_gc_bias_metrics/{sample}_{type}.gc_bias.summary_metrics"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/samtools_idxstats/{sample}_{type}.samtools-idxstats.txt"

pbrun_deepvariant:
  container: "docker://nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1"
  extra: "--use-wes-model --mode shortread \
   --disable-use-window-selector-model --gvcf " ## disable window selector model for consistency with deepvariant cpu. Also increases accuracy.

pbrun_fq2bam:
  container: "docker://nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1"
  extra: ""

peddy:
  container: "docker://hydragenetics/peddy:0.4.8"
  config: "config/peddy_mqc.yaml"
  extra: "--sites hg38 "

picard_collect_alignment_summary_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_duplication_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_gc_bias_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_hs_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_insert_size_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_multiple_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_collect_wgs_metrics:
  container: "docker://hydragenetics/picard:2.25.0"

picard_mark_duplicates:
  container: "docker://hydragenetics/picard:2.25.4"


spring:
  container: "docker://hydragenetics/spring:1.1.1"

upd:
  container: "docker://hydragenetics/upd:0.1.1"
  extra: "--vep "

vep_trio:
  container: "docker://ensemblorg/ensembl-vep:release_110.1"
  vep_cache: "/beegfs-storage/data/ref_genomes/VEP/"
  extra: "--assembly GRCh38 --check_existing --pick --max_af "

vt_decompose:
  container: "docker://hydragenetics/vt:2015.11.10"
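
Rules typically read their own section of config.yaml and fall back to a default when a key is missing. The sketch below shows that lookup pattern on a small excerpt of the configuration; the helper names rule_setting and rule_container are hypothetical and only serve this example.

```python
# Excerpt of the configuration above, written as the dict Snakemake would load.
config = {
    "default_container": "docker://hydragenetics/common:1.11.1",
    "bwa_mem": {
        "container": "docker://hydragenetics/bwa:0.7.15",
        "extra": "-K 100000000",
    },
    "vt_decompose": {"container": "docker://hydragenetics/vt:2015.11.10"},
}


def rule_setting(config, rule, key, default=""):
    """Return config[rule][key], or the default if the rule or key is absent."""
    return config.get(rule, {}).get(key, default)


def rule_container(config, rule):
    """Return the rule's container, falling back to default_container."""
    return rule_setting(config, rule, "container", config["default_container"])


print(rule_setting(config, "bwa_mem", "extra"))       # value set in the excerpt
print(rule_setting(config, "vt_decompose", "extra"))  # no extra set: empty string
print(rule_container(config, "samtools_sort"))        # no section: default container
```

With this pattern a rule needs an entry in config.yaml only when it deviates from the defaults.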

resources.yaml¶

The resources.yaml is located under config/. The file declares the default resources used by rules, as well as resources for specific rules that need more than the default allocation. See further pipeline configuration.

# ex, default resources
default_resources:
  threads: 1
  time: "4:00:00"
  mem_mb: 6144
  mem_per_cpu: 6144
  partition: "low"

# ex, rule override
vardict:
  time: "48:00:00"
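
The relation between default_resources and a rule override can be pictured as a simple merge: the rule starts from the defaults and replaces only the keys it lists. The sketch below mirrors the example above; resolve_resources is a hypothetical helper, not a function from the pipeline.

```python
# The example resources from above as a Python dict.
resources = {
    "default_resources": {
        "threads": 1,
        "time": "4:00:00",
        "mem_mb": 6144,
        "mem_per_cpu": 6144,
        "partition": "low",
    },
    "vardict": {"time": "48:00:00"},
}


def resolve_resources(resources, rule):
    """Merge a rule's overrides on top of the default resources."""
    resolved = dict(resources["default_resources"])
    resolved.update(resources.get(rule, {}))
    return resolved


vardict = resolve_resources(resources, "vardict")
print(vardict["time"], vardict["mem_mb"])  # overridden time, default memory
```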

Current resources.yaml:

---

default_resources:
  threads: 1
  time: "12:00:00"
  mem_mb: 6144
  mem_per_cpu: 6144
  partition: "core"

bwa_mem:
  mem_mb: 122880
  mem_per_cpu: 6144
  threads: 8

deepvariant:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"

fastp_pe:
  threads: 1
  mem_mb: 6144
  mem_per_cpu: 6144

fastqc:
  threads: 2
  mem_mb: 12288
  mem_per_cpu: 6144

glnexus:
  threads: 20

mosdepth_bed:
  mem_mb: 36864
  threads: 4

pbrun_fq2bam:
  gres: "--gres=gres:gpu:2"
  mem_mb: 327680
  mem_per_cpu: 16384
  partition: "GPU_hi"
  threads: 20

pbrun_deepvariant:
  gres: "--gres=gres:gpu:4"
  mem_mb: 655360
  mem_per_cpu: 16384
  partition: "GPU_hi"
  threads: 40

peddy:
  threads: 8

samtools_sort:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

samtools_view:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

spring:
  mem_mb: 49152
  mem_per_cpu: 6144
  threads: 8
  partition: "core_bkup"

vep:
  threads: 4

profile yaml¶

Profiles are saved in yaml files and control how Snakemake is executed: whether jobs are submitted to a cluster, whether Singularity is used, whether failed jobs are restarted, and so forth. The profile also forwards requested resources to drmaa through the drmaa variable.

# ex, snakemake settings
jobs: 100
keep-going: True
restart-times: 2
rerun-incomplete: True
use-singularity: True
configfile: "config/config.yaml"
singularity-args: "-e --cleanenv -B /projects -B /data -B /beegfs"

# ex, drmaa settings
drmaa: " -A wp1 -N 1-1 -t {resources.time} -n {resources.threads} --mem={resources.mem_mb} --mem-per-cpu={resources.mem_per_cpu} --partition={resources.partition} -J {rule} -e slurm_out/{rule}_%j.err -o slurm_out/{rule}_%j.out"
drmaa-log-dir: "slurm_out"
default-resources: [threads=1, time="04:00:00", partition="low", mem_mb="3074", mem_per_cpu="3074"]
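
The placeholders in the drmaa string are filled in per job with the rule name and the resources resolved for that rule. Snakemake performs this substitution itself; the sketch below only mimics it with Python's str.format to show what the resulting scheduler arguments look like. The helper render_drmaa is hypothetical, and the example values are taken from the bwa_mem entry and the defaults in resources.yaml above.

```python
from types import SimpleNamespace

# Same template as in the profile; %j is left untouched for Slurm to expand.
drmaa_template = (
    " -A wp1 -N 1-1 -t {resources.time} -n {resources.threads}"
    " --mem={resources.mem_mb} --mem-per-cpu={resources.mem_per_cpu}"
    " --partition={resources.partition} -J {rule}"
    " -e slurm_out/{rule}_%j.err -o slurm_out/{rule}_%j.out"
)


def render_drmaa(template, rule, resources):
    """Fill {rule} and {resources.<name>} placeholders for one job."""
    return template.format(rule=rule, resources=SimpleNamespace(**resources))


args = render_drmaa(
    drmaa_template,
    "bwa_mem",
    {
        "time": "12:00:00",
        "threads": 8,
        "mem_mb": 122880,
        "mem_per_cpu": 6144,
        "partition": "core",
    },
)
print(args)
```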

samples.tsv and units.tsv¶

The samples.tsv and units.tsv are input files that must be generated before running the pipeline. They should in general be located in the base folder of the analysis; their locations can be changed in config.yaml. See further running the pipeline and create input files.

Example samples.tsv¶

sample	tumor_content	sex	trioid	trio_member
NA12878	0.0	female	NA	NA
NA12911	0	male	NA	NA

Example units.tsv¶

sample	type	machine	platform	flowcell	lane	barcode	fastq1	fastq2	adapter
NA12878	N	NDX550407_RUO	NextSeq	HKTG2BGXG	L001	ACGGAACA+ACGAGAAC	fastq/NA12878_fastq1.fastq.gz	fastq/NA12878_fastq2.fastq.gz	ACGT,ACGT
NA12911	N	NDX550407_RUO	NextSeq	HKTG2BGXG	L001	TCGGAACT+TCGAGAAT	fastq/NA12911_fastq1.fastq.gz	fastq/NA12911_fastq2.fastq.gz	ACGT,ACGT
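
Before starting a run it can be useful to confirm that every row in units.tsv refers to a sample declared in samples.tsv. The sketch below does this with the standard library on the two example files above. It is an illustration only and does not replace the schema validation done in common.smk; the helper names read_tsv and units_without_sample are hypothetical.

```python
import csv
import io

# The example files from above, embedded as tab-separated text.
SAMPLES_TSV = "\n".join([
    "sample\ttumor_content\tsex\ttrioid\ttrio_member",
    "NA12878\t0.0\tfemale\tNA\tNA",
    "NA12911\t0\tmale\tNA\tNA",
])
UNITS_TSV = "\n".join([
    "sample\ttype\tmachine\tplatform\tflowcell\tlane\tbarcode\tfastq1\tfastq2\tadapter",
    "NA12878\tN\tNDX550407_RUO\tNextSeq\tHKTG2BGXG\tL001\tACGGAACA+ACGAGAAC"
    "\tfastq/NA12878_fastq1.fastq.gz\tfastq/NA12878_fastq2.fastq.gz\tACGT,ACGT",
    "NA12911\tN\tNDX550407_RUO\tNextSeq\tHKTG2BGXG\tL001\tTCGGAACT+TCGAGAAT"
    "\tfastq/NA12911_fastq1.fastq.gz\tfastq/NA12911_fastq2.fastq.gz\tACGT,ACGT",
])


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def units_without_sample(samples, units):
    """Return the unit rows whose sample is missing from the samples table."""
    known = {row["sample"] for row in samples}
    return [row for row in units if row["sample"] not in known]


samples = read_tsv(SAMPLES_TSV)
units = read_tsv(UNITS_TSV)
orphans = units_without_sample(samples, units)
print(f"{len(samples)} samples, {len(units)} units, {len(orphans)} orphan units")
```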

Coverage analysis¶

To get coverage analysis for gene panels, three things are needed: 1) a file listing the gene panels, 2) a list of the genes in each panel, and 3) a bed file with information about the genes.

Gene panels¶

There are several gene panels available at Poirot's config git.

1) example genepanel.list¶

BRCA

CADASIL

EBS

EDS

2) example gene panel list BRCA.list¶

APC

ATM

BAP1

BMPR1A

BRCA1

BRCA2

3) Bed-file with gene information¶

Example bed-file¶

| chr1 | 999056 | 999434 | HES4_NM_021170.4_[4] |
| chr1 | 999523 | 999615 | HES4_NM_021170.4_[3] |
| chr1 | 999689 | 999789 | HES4_NM_021170.4_[2] |
| chr1 | 999863 | 999975 | HES4_NM_021170.4_[1] |
| chr1 | 1013571 | 1013578 | ISG15_NM_005101.4_[1] |
| chr1 | 1013981 | 1014480 | ISG15_NM_005101.4_[2] |
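
The three inputs connect through the gene name: a panel lists genes, and each bed record carries the gene in its fourth column. The sketch below selects the bed records that belong to a panel and sums their target bases per gene. It assumes that the gene symbol is the part of the name field before the first underscore, which holds for the example rows but is an assumption about the naming scheme; the one-gene panel and the helper names are hypothetical as well.

```python
from collections import defaultdict

# The example bed records from above as tab-separated lines.
BED_LINES = [
    "chr1\t999056\t999434\tHES4_NM_021170.4_[4]",
    "chr1\t999523\t999615\tHES4_NM_021170.4_[3]",
    "chr1\t999689\t999789\tHES4_NM_021170.4_[2]",
    "chr1\t999863\t999975\tHES4_NM_021170.4_[1]",
    "chr1\t1013571\t1013578\tISG15_NM_005101.4_[1]",
    "chr1\t1013981\t1014480\tISG15_NM_005101.4_[2]",
]


def gene_from_name(name):
    """Extract the gene symbol, assumed to precede the first underscore."""
    return name.split("_", 1)[0]


def target_bases_per_gene(bed_lines, panel_genes):
    """Sum interval lengths per gene for the genes in the panel.

    BED intervals are half-open, so the length is end - start.
    """
    totals = defaultdict(int)
    for line in bed_lines:
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        gene = gene_from_name(name)
        if gene in panel_genes:
            totals[gene] += int(end) - int(start)
    return dict(totals)


# Hypothetical panel holding a single gene from the example bed file.
totals = target_bases_per_gene(BED_LINES, {"HES4"})
print(totals)
```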