Core pipeline
About
This page describes the core pipeline, which is run via the `artic minion` command.
There are two workflows baked into the core pipeline: one which uses signal data (via nanopolish) and one which does not (via medaka). As the workflows are identical in many ways, this page describes the pipeline as a whole and notes where behaviour differs between the two workflows.
It should be noted here that by default the nanopolish workflow is selected; you need to specify `--medaka` (and `--medaka-model`) if you want the medaka workflow enabled.

NOTE: It is very important that you select the appropriate value for `--medaka-model`.
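For example, a typical invocation of each workflow might look something like the following (sample name, file paths, and medaka model are placeholders; pick the model that matches your basecaller):

```
# nanopolish workflow (default): signal data is required
artic minion --normalise 200 --threads 4 \
    --scheme-directory ~/artic/primer-schemes \
    --read-file sample01.fastq \
    --fast5-directory ./fast5_pass \
    --sequencing-summary ./sequencing_summary.txt \
    nCoV-2019/V3 sample01

# medaka workflow: no signal data needed, but a model must be given
artic minion --normalise 200 --threads 4 \
    --scheme-directory ~/artic/primer-schemes \
    --read-file sample01.fastq \
    --medaka --medaka-model r941_min_high_g351 \
    nCoV-2019/V3 sample01
```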
At the end of each stage, we list the "useful" output files that are kept. Some additional files will be left over at the end of the pipeline, but these can be ignored (and are hopefully quite intuitively named).
Stages
Input validation
The supplied `--scheme-directory` will be checked for a reference sequence and primer scheme. If `--scheme-version` is not supplied, version 1 will be assumed.
As of version 1.2.0, the pipeline will try to download the reference and scheme from the artic primer scheme repository if they are not found/provided.
Once the scheme and reference have been found, the pipeline will validate the scheme and extract the primer pool information.
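As a rough guide, a scheme directory is expected to follow a layout like this (using the nCoV-2019 V3 scheme as an example):

```
primer-schemes/
└── nCoV-2019/
    └── V3/
        ├── nCoV-2019.reference.fasta
        └── nCoV-2019.scheme.bed
```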
If running the nanopolish workflow, the pipeline will check that the reference contains only one sequence and then run `nanopolish index` to map basecalled reads to the signal data.
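Under the hood, this indexing step is equivalent to running something like the following (file names are placeholders):

```
# link each basecalled read to its signal data in the fast5 files
nanopolish index -s sequencing_summary.txt -d ./fast5_pass sample01.fastq
```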
stage output
- primer validation log (optional)
- nanopolish index (workflow dependent)
Reference alignment and post-processing
The pipeline will then perform a reference alignment of the basecalled reads against the specified reference sequence. By default minimap2 is used, but bwa can be chosen as an alternative; both aligners use their respective ONT presets. The alignments are filtered to keep only mapped reads, then sorted and indexed.
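The alignment step is roughly equivalent to the following (reference and read file names are placeholders):

```
# align with the ONT preset, drop unmapped reads, then sort and index
minimap2 -a -x map-ont nCoV-2019.reference.fasta sample01.fastq \
    | samtools view -bS -F 4 - \
    | samtools sort -o sample01.sorted.bam -
samtools index sample01.sorted.bam
```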
We then use the `align_trim` module to post-process the alignments.
The purpose of alignment post-processing is:
- assign each read alignment to a derived amplicon
- using the derived amplicon, assign each read a read group based on the primer pool
- softmask read alignments within their derived amplicon
Also, there is the option to:
- remove primer sequence by further softmasking the read alignments
- normalise/reduce the number of read alignments to each amplicon
- remove reads with imperfect primer pairing, e.g. from amplicon read-through
By softmasking, we refer to the process of adjusting the CIGAR of each alignment segment such that soft clips replace any reference or query consuming operations in regions of an alignment that fall outside of primer boundaries. The leftmost mapping position of each alignment is also updated during softmasking.
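To illustrate with a hypothetical alignment: suppose an amplicon's primer boundaries span reference positions 30-400 and a read's alignment starts before that window:

```
# before softmasking: POS=18, CIGAR=382M    (alignment starts 12 bases before the boundary)
# after softmasking:  POS=30, CIGAR=12S370M (the 12 out-of-boundary bases become soft clips)
```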
More information on how the primer scheme is used to infer amplicons can be found here.
stage output
file name | description |
---|---|
`$SAMPLE.sorted.bam` | the raw alignment of sample reads to the reference genome |
`$SAMPLE.trimmed.rg.sorted.bam` | the post-processed alignment |
`$SAMPLE.primertrimmed.rg.sorted.bam` | the post-processed alignment with additional softmasking to exclude primer sequences |
Variant calling
Once alignments have been softmasked, sorted and indexed (again), they are used for variant calling. This is where the two workflows actually differ.
For the medaka workflow, we use the following commands on the `$SAMPLE.primertrimmed.rg.sorted.bam` alignment (a rough sketch of these calls follows the list):

- `medaka consensus`
- `medaka variant` (or `medaka snp` if the pipeline has been told not to detect INDELs via `--no-indels`)
- `medaka tools annotate` (if `--no-longshot` has been selected)
- `longshot` (if `--no-longshot` not selected)
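As a rough sketch, the medaka steps for a single read group might look like the following (read group name, model, and file names are illustrative; exact options may differ between pipeline versions):

```
# generate consensus probabilities for one read group
medaka consensus --model r941_min_high_g351 --RG nCoV-2019_1 \
    sample01.primertrimmed.rg.sorted.bam sample01.nCoV-2019_1.hdf

# call variants from the consensus probabilities
medaka variant nCoV-2019.reference.fasta sample01.nCoV-2019_1.hdf sample01.nCoV-2019_1.vcf
```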
And for the nanopolish workflow, we use the following command on the `$SAMPLE.trimmed.rg.sorted.bam` alignment:

- `nanopolish variants`
For both workflows, the variant calling steps are run for each read group in turn. We then merge the variants reported per read group into a single file using the `artic_vcf_merge` module.
Note: we use the `$SAMPLE.trimmed.rg.sorted.bam` alignment for the nanopolish workflow as nanopolish requires some leading sequence to make variant calls; we use the primer sequence for this purpose in order to call variants at the start of amplicons.
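A sketch of the nanopolish call for one read group (window and file names are illustrative; the pipeline sets additional options, e.g. to restrict calls to a single read group):

```
# call variants for one read group across a reference window
nanopolish variants --ploidy 1 --min-candidate-frequency 0.15 \
    --reads sample01.fastq \
    --bam sample01.trimmed.rg.sorted.bam \
    --genome nCoV-2019.reference.fasta \
    -w "MN908947.3:0-29903" \
    -o sample01.nCoV-2019_1.vcf
```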
Optionally, we can check the merged variant file against the primer scheme. This allows us to detect variants called in primer and amplicon overlap regions.
Finally, we use the `artic_vcf_filter` module to filter the merged variant file through a set of workflow-specific checks and assign all variants as either PASS or FAIL. The final PASS file is subsequently indexed ready for the next stage.
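The indexing of the PASS file is equivalent to something like:

```
# compress and index the PASS variants for bcftools consensus
bgzip -f sample01.pass.vcf
tabix -p vcf sample01.pass.vcf.gz
```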
stage output
file name | description |
---|---|
`$SAMPLE.$READGROUP.vcf` | the raw variants detected (one file per primer pool) |
`$SAMPLE.merged.vcf` | the raw variants detected, merged into one file |
`$SAMPLE.vcfreport.txt` | a report evaluating the reported variants against the primer scheme |
`$SAMPLE.fail.vcf` | variants deemed too low quality |
`$SAMPLE.pass.vcf.gz` | detected variants (indexed) |
Consensus building
Prior to building a consensus, we use the post-processed alignment from the previous step to check each position of the reference sequence for sample coverage. Any position that is not covered by at least 20 reads from either read group is marked as low coverage. We use the `artic_make_depth_mask` module for this, which produces coverage information for each read group and also produces a coverage mask to tell us which coordinates in the reference sequence failed the coverage threshold.
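The effect is similar to this illustrative snippet, which flags reference positions with fewer than 20 reads (the real module works per read group and emits a mask file rather than a BED):

```
# report every reference position below the 20x coverage threshold
samtools depth -a sample01.primertrimmed.rg.sorted.bam \
    | awk '$3 < 20 {print $1"\t"$2-1"\t"$2}' > sample01.low_coverage.bed
```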
Next, to build a consensus sequence for a sample, we require a pre-consensus sequence based on the input reference sequence. The pre-consensus has low-quality sites masked out with `N`s, using the coverage mask and the `$SAMPLE.fail.vcf` file. We then use `bcftools consensus` to combine the pre-consensus with the `$SAMPLE.pass.vcf` variants to produce a consensus sequence for the sample. The consensus sequence has the artic workflow written to its header.
Finally, the consensus sequence is aligned against the reference sequence using `muscle`.
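Putting the consensus steps together, the stage looks roughly like this (the `artic_mask` argument order is an assumption; file names follow the pipeline's conventions):

```
# mask low-coverage sites and FAIL variants to make the pre-consensus
artic_mask nCoV-2019.reference.fasta sample01.coverage_mask.txt \
    sample01.fail.vcf sample01.preconsensus.fasta

# apply the PASS variants to the pre-consensus
bcftools consensus -f sample01.preconsensus.fasta \
    sample01.pass.vcf.gz -o sample01.consensus.fasta

# align the consensus back to the reference
cat sample01.consensus.fasta nCoV-2019.reference.fasta > sample01.muscle.in.fasta
muscle -in sample01.muscle.in.fasta -out sample01.muscle.out.fasta
```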
stage output
file name | description |
---|---|
`$SAMPLE.*_mqc.json` | stats files which MultiQC can use to make a report |
`$SAMPLE.consensus.fasta` | the consensus sequence for the input sample |
`$SAMPLE.muscle.out.fasta` | an alignment of the consensus sequence against the reference sequence |
Summary of pipeline modules
module | function |
---|---|
`align_trim` | alignment post-processing (amplicon assignment, softmasking, normalisation) |
`artic_vcf_merge` | combines VCF files from multiple read groups |
`artic_vcf_filter` | filters a combined VCF into PASS and FAIL variant files |
`artic_make_depth_mask` | creates a coverage mask from the post-processed alignment |
`artic_mask` | combines the reference sequence, FAIL variants, and coverage mask to produce a pre-consensus sequence |
`artic_fasta_header` | applies the artic workflow and identifier to the consensus sequence header |
Optional pipeline report
As of version 1.2.1, if you run the pipeline with `--strict`, you can run MultiQC (which should be installed as part of the artic conda environment) on the pipeline output directory and this will produce a report containing amplicon coverage plots and variant call information. To generate a report from within your pipeline output directory:

```
multiqc .
```