Core pipeline
About
This page describes the core pipeline, which is run via the `artic minion` command.
There are two workflows baked into the core pipeline: one which uses signal data (via nanopolish) and one which does not (via medaka). As the workflows are identical in many ways, this page describes the pipeline as a whole and notes where behaviour differs between the two workflows.
It should be noted here that by default the nanopolish workflow is selected; you need to specify `--medaka` (and `--medaka-model`) if you want the medaka workflow enabled.

NOTE: It is very important that you select the appropriate value for `--medaka-model`.
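For example, a typical invocation of each workflow might look something like the following (sample name, file paths, and medaka model are placeholders; pick the model that matches your basecaller):

```
# nanopolish workflow (default): signal data is required
artic minion --normalise 200 --threads 4 \
    --scheme-directory ~/artic/primer-schemes \
    --read-file sample01.fastq \
    --fast5-directory ./fast5_pass \
    --sequencing-summary ./sequencing_summary.txt \
    nCoV-2019/V3 sample01

# medaka workflow: no signal data needed, but a model must be given
artic minion --normalise 200 --threads 4 \
    --scheme-directory ~/artic/primer-schemes \
    --read-file sample01.fastq \
    --medaka --medaka-model r941_min_high_g351 \
    nCoV-2019/V3 sample01
```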
At the end of each stage, we list the "useful" output files that are kept. Some additional files will be left over at the end of the pipeline, but these can be ignored (and are hopefully quite intuitively named).
Stages
Input validation
The supplied `--scheme-directory` will be checked for a reference sequence and primer scheme. If `--scheme-version` is not supplied, version 1 will be assumed.
As of version 1.2.0, the pipeline will try to download the reference and scheme from the artic primer scheme repository if they are not found/provided.
Once the scheme and reference have been found, the pipeline will validate the scheme and extract the primer pool information.
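As a rough guide, a scheme directory is expected to follow a layout like this (using the nCoV-2019 V3 scheme as an example):

```
primer-schemes/
└── nCoV-2019/
    └── V3/
        ├── nCoV-2019.reference.fasta
        └── nCoV-2019.scheme.bed
```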
If running the nanopolish workflow, the pipeline will check that the reference contains only one sequence and then run `nanopolish index` to map basecalled reads to the signal data.
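Under the hood, this indexing step is equivalent to running something like the following (file names are placeholders):

```
# link each basecalled read to its signal data in the fast5 files
nanopolish index -s sequencing_summary.txt -d ./fast5_pass sample01.fastq
```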
stage output
- primer validation log (optional)
- nanopolish index (workflow dependent)
Reference alignment and post-processing
The pipeline will then perform a reference alignment of the basecalled reads against the specified reference sequence. By default minimap2 is used, but bwa can be chosen as an alternative; both aligners use their respective ONT presets. The alignments are filtered to keep only mapped reads, then sorted and indexed.
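The alignment step is roughly equivalent to the following (reference and read file names are placeholders):

```
# align with the ONT preset, drop unmapped reads, then sort and index
minimap2 -a -x map-ont nCoV-2019.reference.fasta sample01.fastq \
    | samtools view -bS -F 4 - \
    | samtools sort -o sample01.sorted.bam -
samtools index sample01.sorted.bam
```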
We then use the `align_trim` module to post-process the alignments.
The purpose of alignment post-processing is:
- assign each read alignment to a derived amplicon
- using the derived amplicon, assign each read a read group based on the primer pool
- softmask read alignments within their derived amplicon
Also, there is the option to:
- remove primer sequence by further softmasking the read alignments
- normalise/reduce the number of read alignments to each amplicon
- remove reads with imperfect primer pairing, e.g. from amplicon read-through
By softmasking, we refer to the process of adjusting the CIGAR of each alignment segment such that soft clips replace any reference or query consuming operations in regions of an alignment that fall outside of primer boundaries. The leftmost mapping position of each alignment is also updated during softmasking.
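To illustrate with a hypothetical alignment: suppose an amplicon's primer boundaries span reference positions 30-400 and a read's alignment starts before that window:

```
# before softmasking: POS=18, CIGAR=382M    (alignment starts 12 bases before the boundary)
# after softmasking:  POS=30, CIGAR=12S370M (the 12 out-of-boundary bases become soft clips)
```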
More information on how the primer scheme is used to infer amplicons can be found here.
stage output
file name | description |
---|---|
`$SAMPLE.sorted.bam` | the raw alignment of sample reads to the reference genome |
`$SAMPLE.trimmed.rg.sorted.bam` | the post-processed alignment |
`$SAMPLE.primertrimmed.rg.sorted.bam` | the post-processed alignment with additional softmasking to exclude primer sequences |
Variant calling
Once alignments have been softmasked, sorted and indexed (again), they are used for variant calling. This is where the two workflows actually differ.
For the medaka workflow, we use the following commands on the `$SAMPLE.primertrimmed.rg.sorted.bam` alignment (a rough sketch of these calls follows the list):

- `medaka consensus`
- `medaka variant` (or `medaka snp` if the pipeline has been told not to detect INDELs via `--no-indels`)
- `medaka tools annotate` (if `--no-longshot` has been selected)
- `longshot` (if `--no-longshot` not selected)
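As a rough sketch, the medaka steps for a single read group might look like the following (read group name, model, and file names are illustrative; exact options may differ between pipeline versions):

```
# generate consensus probabilities for one read group
medaka consensus --model r941_min_high_g351 --RG nCoV-2019_1 \
    sample01.primertrimmed.rg.sorted.bam sample01.nCoV-2019_1.hdf

# call variants from the consensus probabilities
medaka variant nCoV-2019.reference.fasta sample01.nCoV-2019_1.hdf sample01.nCoV-2019_1.vcf
```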
And for the nanopolish workflow, we use the following command on the `$SAMPLE.trimmed.rg.sorted.bam` alignment:

- `nanopolish variants`
For both workflows, the variant calling steps are run for each read group in turn. We then merge the variants reported per read group into a single file using the `artic_vcf_merge` module.
Note: we use the `$SAMPLE.trimmed.rg.sorted.bam` alignment for the nanopolish workflow as nanopolish requires some leading sequence to make variant calls; we use the primer sequence for this purpose in order to call variants at the start of amplicons.
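A sketch of the nanopolish call for one read group (window and file names are illustrative; the pipeline sets additional options, e.g. to restrict calls to a single read group):

```
# call variants for one read group across a reference window
nanopolish variants --ploidy 1 --min-candidate-frequency 0.15 \
    --reads sample01.fastq \
    --bam sample01.trimmed.rg.sorted.bam \
    --genome nCoV-2019.reference.fasta \
    -w "MN908947.3:0-29903" \
    -o sample01.nCoV-2019_1.vcf
```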
Optionally, we can check the merged variant file against the primer scheme. This allows us to detect variants called in primer and amplicon overlap regions.
Finally, we use the `artic_vcf_filter` module to filter the merged variant file through a set of workflow-specific checks and assign all variants as either PASS or FAIL. The final PASS file is subsequently indexed ready for the next stage.
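The indexing of the PASS file is equivalent to something like:

```
# compress and index the PASS variants for bcftools consensus
bgzip -f sample01.pass.vcf
tabix -p vcf sample01.pass.vcf.gz
```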
stage output
file name | description |
---|---|
`$SAMPLE.$READGROUP.vcf` | the raw variants detected (one file per primer pool) |
`$SAMPLE.merged.vcf` | the raw variants detected, merged into one file |
`$SAMPLE.vcfreport.txt` | a report evaluating the reported variants against the primer scheme |
`$SAMPLE.fail.vcf` | variants deemed too low quality |
`$SAMPLE.pass.vcf.gz` | detected variants (indexed) |
Consensus building
Prior to building a consensus, we use the post-processed alignment from the previous step to check each position of the reference sequence for sample coverage. Any position that is not covered by at least 20 reads from either read group is marked as low coverage. We use the `artic_make_depth_mask` module for this, which produces coverage information for each read group and also produces a coverage mask to tell us which coordinates in the reference sequence failed the coverage threshold.
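The effect is similar to this illustrative snippet, which flags reference positions with fewer than 20 reads (the real module works per read group and emits a mask file rather than a BED):

```
# report every reference position below the 20x coverage threshold
samtools depth -a sample01.primertrimmed.rg.sorted.bam \
    | awk '$3 < 20 {print $1"\t"$2-1"\t"$2}' > sample01.low_coverage.bed
```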
Next, to build a consensus sequence for a sample, we require a pre-consensus sequence based on the input reference sequence. The pre-consensus has low-quality sites masked out with `N`s, using the coverage mask and the `$SAMPLE.fail.vcf` file. We then use `bcftools consensus` to combine the pre-consensus with the `$SAMPLE.pass.vcf` variants to produce a consensus sequence for the sample. The consensus sequence has the artic workflow written to its header.
Finally, the consensus sequence is aligned against the reference sequence using `muscle`.
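Putting the consensus steps together, the stage looks roughly like this (the `artic_mask` argument order is an assumption; file names follow the pipeline's conventions):

```
# mask low-coverage sites and FAIL variants to make the pre-consensus
artic_mask nCoV-2019.reference.fasta sample01.coverage_mask.txt \
    sample01.fail.vcf sample01.preconsensus.fasta

# apply the PASS variants to the pre-consensus
bcftools consensus -f sample01.preconsensus.fasta \
    sample01.pass.vcf.gz -o sample01.consensus.fasta

# align the consensus back to the reference
cat sample01.consensus.fasta nCoV-2019.reference.fasta > sample01.muscle.in.fasta
muscle -in sample01.muscle.in.fasta -out sample01.muscle.out.fasta
```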
stage output
file name | description |
---|---|
`$SAMPLE.*_mqc.json` | stats files which MultiQC can use to make a report |
`$SAMPLE.consensus.fasta` | the consensus sequence for the input sample |
`$SAMPLE.muscle.out.fasta` | an alignment of the consensus sequence against the reference sequence |
Summary of pipeline modules
module | function |
---|---|
`align_trim` | alignment post-processing (amplicon assignment, softmasking, normalisation) |
`artic_vcf_merge` | combines VCF files from multiple read groups |
`artic_vcf_filter` | filters a combined VCF into PASS and FAIL variant files |
`artic_make_depth_mask` | creates a coverage mask from the post-processed alignment |
`artic_mask` | combines the reference sequence, FAIL variants, and coverage mask to produce a pre-consensus sequence |
`artic_fasta_header` | applies the artic workflow and identifier to the consensus sequence header |
Optional pipeline report
As of version 1.2.1, if you run the pipeline with `--strict`, you can run MultiQC (which should be installed as part of the artic conda environment) on the pipeline output directory and this will produce a report containing amplicon coverage plots and variant call information. To generate a report from within your pipeline output directory:

```
multiqc .
```