AI bioinformatics: explaining incorrect command lines

AI bioinformatics

A follow up to my previous article about using LLMs to explain complex bioinformatics command lines.

In the previous experiment we found that GTP5.2 did a pretty good job explaining our three command lines, correctly identifying the job of each option and adding mostly-helpful commentary.

But what about the dreaded hallucinations? Let’s try adding some plausible but invalid options to the command lines and see if the LLM can catch them.

For each I’ll include the complete prompt and the complete response - just click the green response box to expand it and see the full thing. Note: all prompts were given to GPT5.2 with all options on their default setting.

First challenge: adding a nonexistant argument

We will take our original bwa mem command which aligns reads to a reference genome, and add a --parallel argument. There is no such argument to any of the bwa commands - in fact bwa doesn’t use these long style arguments, so it would probably stand out right away to anyone who is familiar with the tool.

Prompt

Explain the options and arguments in this command line:

bwa mem -t 8 –parallel -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

include details on the roles of each file name

The relevant bit of the response is here:

--parallel

Meaning: enable additional internal parallel behavior (exact behavior can depend on > BWA version/build).

Effect: typically helps performance when multiple threads are used.

Caution: if you don’t see a speedup or your version doesn’t recognize it, check bwa > mem -h for your installed version’s supported flags (some distributions differ).

The output here is incorrect, but in an interesting way! GPT5.2 fails to explicitly call out the incorrect argument, and makes up a plausible explanation for its existence. However, the output hints at the fact that the LLM is less confident about this parameter than the other ones.

Response by GPT5.2

Here’s what every piece of this command does, including what each filename represents:

bwa mem -t 8 --parallel -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

Program + subcommand

bwa: the aligner (Burrows–Wheeler Aligner).
mem: the algorithm within BWA intended for mapping relatively long reads (commonly Illumina 70bp–250bp, and longer). It outputs alignments in SAM format.

Options / arguments

`-t 8`

Meaning: use 8 CPU threads.
Effect: speeds up alignment by parallelizing parts of the computation.
Notes: Threads help throughput, but memory usage can rise somewhat depending on data size and settings.

`--parallel`

Meaning: enable additional internal parallel behavior (exact behavior can depend on BWA version/build).
Effect: typically helps performance when multiple threads are used.
Caution: if you don’t see a speedup or your version doesn’t recognize it, check bwa mem -h for your installed version’s supported flags (some distributions differ).

`-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'`

Adds a read group (RG) line into the SAM header and tags each alignment with that RG.

The string is a literal SAM header line (an @RG header record) with tab-separated fields; \t means “tab”:

@RG: this is a read group header record.
ID:Sample1
- Read Group ID (must be unique within the file).
- Used by downstream tools to distinguish lanes/runs/libraries when multiple groups exist.
SM:Sample1
- Sample name.
- Very important for tools like GATK and many variant callers; it ties reads to the biological sample.
PL:illumina
- Platform/technology (e.g., illumina, ont, pacbio).
- Used in some pipelines for platform-specific assumptions or metadata.

Why read groups matter: If you later mark duplicates, do BQSR, or run joint variant calling, correct RG information (often also including LB library and PU platform unit) is important to avoid treating distinct runs/libraries incorrectly.

File arguments (roles of each filename)

`reference.fasta`

Role: the reference genome/sequence you are aligning the reads to.
What BWA expects: a FASTA containing one or more reference contigs/chromosomes.
Important dependency: BWA typically requires the reference to have been indexed first (e.g., bwa index reference.fasta), which creates auxiliary index files alongside it.

`reads_R1.fastq.gz`

Role: the forward reads (Read 1) from paired-end sequencing.
Format: FASTQ, gzip-compressed.
Content: sequences + base quality scores for the first read in each pair.

`reads_R2.fastq.gz`

Role: the reverse reads (Read 2) from paired-end sequencing.
Format: FASTQ, gzip-compressed.
Content: sequences + base quality scores for the second read in each pair.

Why order matters: With BWA MEM, giving two FASTQ files means “these are paired-end mates.” The i-th read in R1 is assumed to correspond to the i-th read in R2 (same read name, matching pairs).

`> alignment.sam`

Role: this is shell redirection, not a BWA argument.
Meaning: write BWA’s output (SAM text) into the file alignment.sam instead of printing it to the terminal.
Output content: alignments in SAM format (text), including header lines (@HD, @SQ, @RG etc.) and one line per aligned read.

Putting it together (plain-English description)

“Align the paired-end reads (reads_R1.fastq.gz and reads_R2.fastq.gz) to the reference genome (reference.fasta) using BWA-MEM, using 8 threads, annotate alignments with a read group identifying Sample1 on Illumina, and save the resulting SAM alignments to alignment.sam.”

Practical notes (common next steps)

SAM files get large; many pipelines write BAM directly, e.g. pipe to samtools:
- bwa mem ... | samtools sort -@ 8 -o alignment.sorted.bam
Make sure read group info matches your experiment. Often you’ll also include:
- LB: (library) and PU: (platform unit / flowcell-lane), especially if multiple lanes exist.

If you tell me your sequencing layout (single lane vs multiple lanes, libraries, flowcell/lane IDs), I can suggest a more complete -R line that plays nicely with downstream duplicate marking and variant calling.

Want to see future experiments? sign up for the mailing list:

Second challenge : adding an argument for a different subcomand

Let’s try something a bit trickier. Here we will add an argument -q that applies to a different subcommand. The -q option is valid for bwa aln but not bwa mem.

Prompt

Explain the options and arguments in this command line:

bwa mem -t 8 -q 25 -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

include details on the roles of each file name

Here’s relevant bit of the response:

-q 25

Minimum mapping quality

Reads with a mapping quality (MAPQ) < 25 are:

Either not output

Or output as low-quality/unmapped (implementation detail)

MAPQ reflects confidence that a read is correctly aligned.

A value of 25 is moderately strict (≈ 0.3% chance of misalignment).

👉 This is a post-alignment filter, not a base-quality filter.

While there is a lot of good, useful information in this response, it fails to identify that the -q argument isn’t valid, though it gives a good description of its role in bwa aln.

GPT5.2 response

Let’s break this command down piece by piece and explain what each option does and what role each file plays.

bwa mem -t 8 -q 25 -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' \
reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

1. The program and algorithm

`bwa mem`

bwa: A popular aligner for mapping sequencing reads to a reference genome.
mem: The BWA algorithm designed for:
- Illumina short reads
- Reads ~70 bp to several kb
- Single-end or paired-end data
It outputs alignments in SAM format by default.

2. Options (flags) and arguments

`-t 8`

Threads
Uses 8 CPU threads in parallel.
Speeds up alignment on multicore machines.
Does not change the results, only performance.

`-q 25`

Minimum mapping quality
Reads with a mapping quality (MAPQ) < 25 are:
- Either not output
- Or output as low-quality/unmapped (implementation detail)
MAPQ reflects confidence that a read is correctly aligned.
A value of 25 is moderately strict (≈ 0.3% chance of misalignment).

👉 This is a post-alignment filter, not a base-quality filter.

`-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'`

Adds a read group (RG) header to the SAM output.
Read groups are essential for many downstream tools (e.g., variant calling).

Inside the quoted string:

Tag	Meaning
`@RG`	Declares a read group
`ID:Sample1`	Read group ID (must be unique per group)
`SM:Sample1`	Sample name (used by GATK and others)
`PL:illumina`	Platform (sequencing technology)

Why `\t`?

\t inserts tab characters
SAM headers require tabs, not spaces

Without proper read group info, many pipelines will fail or give warnings.

3. Input files and their roles

`reference.fasta`

The reference genome
All reads are aligned against this sequence
Must be:
- Indexed beforehand (bwa index reference.fasta)
Determines:
- Chromosome names
- Coordinate system
- Alignment positions

`reads_R1.fastq.gz`

Forward reads (Read 1) from paired-end sequencing
FASTQ format contains:
- Read sequence
- Per-base quality scores
Compressed with gzip

`reads_R2.fastq.gz`

Reverse reads (Read 2) from paired-end sequencing
Corresponds to the other end of the same DNA fragments as R1
BWA uses pairing information to:
- Improve alignment accuracy
- Detect insert size
- Resolve ambiguous mappings

👉 The order matters: R1 first, R2 second.

4. Output redirection

`> alignment.sam`

Redirects standard output to a file
alignment.sam:
- Contains all alignments in SAM (text) format
- Includes:
  - Header lines (@HD, @SQ, @RG)
  - One line per read alignment

⚠️ SAM files can be very large. Often you’ll see:

bwa mem ... | samtools sort -o alignment.bam

to produce a compressed, sorted BAM file instead.

5. High-level summary

What this command does:

Aligns paired-end Illumina reads to a reference genome
Uses 8 threads
Filters out low-confidence alignments (MAPQ < 25)
Tags all reads as belonging to Sample1
Writes alignments to a SAM file

Conceptual flow:

reference.fasta
        ↑
reads_R1.fastq.gz + reads_R2.fastq.gz
        ↓
     bwa mem
        ↓
   alignment.sam

If you want, I can also:

Rewrite this command for BAM output
Explain SAM fields line-by-line
Show how changing -q affects results
Explain paired-end alignment logic in detail

Challenge three : similarity searching

Let’s try our blastn command line. We will add another completely spurious option --precise and see what happens:

Prompt

Explain the options and arguments in this command line:

blastn -query queries.fasta -db nt -remote –precise -outfmt “6 qseqid sseqid staxids pident length evalue bitscore” -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv

include details on the roles of each file name

and the response:

--precise

Requests higher-accuracy alignments

Trades speed for sensitivity

Useful when:

Sequences are short

You care about accurate percent identity

May increase runtime slightly

While the rest of the output is helpful, the bit about --precise is completely made up.

GPT5.2 response

Here’s a clean, line-by-line explanation of what this blastn command does, including what each option controls and the role of every file name.

The full command (reformatted)

blastn \
  -query queries.fasta \
  -db nt \
  --precise \
  -remote \
  -outfmt "6 qseqid sseqid staxids pident length evalue bitscore" \
  -max_target_seqs 5 \
  -evalue 1e-6 \
  -num_threads 8 \
  -out results.tsv

What `blastn` is doing overall

This command runs BLASTN (nucleotide–nucleotide BLAST) to:

Compare DNA sequences in queries.fasta
Against the NCBI nucleotide database (nt)
Using NCBI’s remote servers
Returning up to 5 significant matches per query
Writing a tab-delimited results file called results.tsv

File names and their roles

`queries.fasta`

Role: Input file (your data)
Contains one or more query nucleotide sequences
Must be in FASTA format
Each sequence in this file will be searched independently against the database

Example:

>query1
ATGCGTACGTTAGC...

`nt`

Role: Reference database
Stands for NCBI nucleotide database
Contains millions of curated and uncurated DNA sequences from GenBank, RefSeq, etc.
Because -remote is used, this database is not stored locally

`results.tsv`

Role: Output file
A tab-separated text file
Each line represents one BLAST hit
Easy to open in Excel, R, Python (pandas), or Unix tools

Explanation of each option

`-query queries.fasta`

Specifies the input FASTA file
BLAST reads each sequence in this file as a separate query

`-db nt`

Specifies the database to search against
nt = comprehensive nucleotide collection

`--precise`

Requests higher-accuracy alignments
Trades speed for sensitivity
Useful when:
- Sequences are short
- You care about accurate percent identity
May increase runtime slightly

`-remote`

Runs BLAST on NCBI’s servers instead of your local machine
Required when:
- You don’t have nt downloaded locally
Ignores many local performance settings (see -num_threads below)

`-outfmt "6 qseqid sseqid staxids pident length evalue bitscore"`

Controls output format and columns

`6` = tabular format (no comments)

The selected fields:

Field	Meaning
`qseqid`	Query sequence ID (from `queries.fasta`)
`sseqid`	Subject (database hit) sequence ID
`staxids`	NCBI taxonomic ID(s) of the hit
`pident`	Percent identity of the alignment
`length`	Alignment length (bp)
`evalue`	Expectation value (statistical significance)
`bitscore`	Alignment score (higher = better)

Each hit becomes one row in results.tsv.

`-max_target_seqs 5`

Limits output to at most 5 database hits per query
Keeps results concise
Important for large databases like nt

⚠️ Note: With -remote, this is treated as a soft limit—ordering is approximate.

`-evalue 1e-6`

Filters out weak matches
Only reports hits expected to occur by chance less than once per million searches
Common thresholds:
- 1e-3 → permissive
- 1e-6 → moderate/stringent
- 1e-10 → very strict

`-num_threads 8`

Requests use of 8 CPU threads
Important caveat: When using -remote, this is largely ignored because computation happens on NCBI’s servers
Only truly matters for local BLAST databases

`-out results.tsv`

Specifies the output file name
Overwrites the file if it already exists
Output is tab-delimited because of outfmt 6

What the final output represents

Each row = one alignment between:
- A sequence from queries.fasta
- A sequence from the nt database
Columns are strictly ordered as defined in -outfmt
Multiple rows per query are possible (up to 5)

In one sentence

This command compares nucleotide sequences in queries.fasta against the NCBI nt database using remote BLASTN, applies sensitive search settings, filters for significant matches, and writes the top 5 hits per query to a tab-delimited file called results.tsv.

If you want, I can also:

Help you interpret the results.tsv
Suggest better parameters for short reads, barcodes, or metagenomics
Rewrite this for local BLAST databases or faster runs

While the rest of the output is helpful, the bit about --precise is completely made up.

Verdict

So far the picture is not looking great for LLMs (I carried out these experiments with GPT5.2 via the chat interface but I have no reason to suspect that other LLMs will be significantly different). All three incorrect arguments failed to be properly flagged.

This is in very sharp contrast to the previous experiment, which showed incredibly good performance. It’s a constant source of surprise (to me at least) how widly the ability of LLMs varies on apparently similar tasks.

A follow-up for a future experiement: can we tweak the prompt to make it more likely for the AI to catch these kinds of errors?

First challenge: adding a nonexistant argument

--parallel

Program + subcommand

Options / arguments

-t 8

--parallel

-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'

File arguments (roles of each filename)

reference.fasta

reads_R1.fastq.gz

reads_R2.fastq.gz

> alignment.sam

Putting it together (plain-English description)

Practical notes (common next steps)

Second challenge : adding an argument for a different subcomand

-q 25

1. The program and algorithm

bwa mem

2. Options (flags) and arguments

-t 8

-q 25

-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'

Why \t?

3. Input files and their roles

reference.fasta

reads_R1.fastq.gz

reads_R2.fastq.gz

4. Output redirection

> alignment.sam

5. High-level summary

Challenge three : similarity searching

--precise

The full command (reformatted)

What blastn is doing overall

File names and their roles

queries.fasta

nt

results.tsv

Explanation of each option

-query queries.fasta

-db nt

--precise

-remote

-outfmt "6 qseqid sseqid staxids pident length evalue bitscore"

6 = tabular format (no comments)

-max_target_seqs 5

-evalue 1e-6

-num_threads 8

-out results.tsv

What the final output represents

In one sentence

Verdict

`--parallel`

`-t 8`

`--parallel`

`-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'`

`reference.fasta`

`reads_R1.fastq.gz`

`reads_R2.fastq.gz`

`> alignment.sam`

`-q 25`

`bwa mem`

`-t 8`

`-q 25`

`-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'`

Why `\t`?

`reference.fasta`

`reads_R1.fastq.gz`

`reads_R2.fastq.gz`

`> alignment.sam`

`--precise`

What `blastn` is doing overall

`queries.fasta`

`nt`

`results.tsv`

`-query queries.fasta`

`-db nt`

`--precise`

`-remote`

`-outfmt "6 qseqid sseqid staxids pident length evalue bitscore"`

`6` = tabular format (no comments)

`-max_target_seqs 5`

`-evalue 1e-6`

`-num_threads 8`

`-out results.tsv`