AI bioinformatics: explaining incorrect command lines

AI bioinformatics

A follow up to my previous article about using LLMs to explain complex bioinformatics command lines.

In the previous experiment we found that GTP5.2 did a pretty good job explaining our three command lines, correctly identifying the job of each option and adding mostly-helpful commentary.

But what about the dreaded hallucinations? Let’s try adding some plausible but invalid options to the command lines and see if the LLM can catch them.

For each I’ll include the complete prompt and the complete response - just click the green response box to expand it and see the full thing. Note: all prompts were given to GPT5.2 with all options on their default setting.

First challenge: adding a nonexistant argument

We will take our original bwa mem command which aligns reads to a reference genome, and add a --parallel argument. There is no such argument to any of the bwa commands - in fact bwa doesn’t use these long style arguments, so it would probably stand out right away to anyone who is familiar with the tool.

TipPrompt

Explain the options and arguments in this command line:

bwa mem -t 8 –parallel -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

include details on the roles of each file name

The relevant bit of the response is here:

--parallel

  • Meaning: enable additional internal parallel behavior (exact behavior can depend on > BWA version/build).
  • Effect: typically helps performance when multiple threads are used.
  • Caution: if you don’t see a speedup or your version doesn’t recognize it, check bwa > mem -h for your installed version’s supported flags (some distributions differ).

The output here is incorrect, but in an interesting way! GPT5.2 fails to explicitly call out the incorrect argument, and makes up a plausible explanation for its existence. However, the output hints at the fact that the LLM is less confident about this parameter than the other ones.

Here’s what every piece of this command does, including what each filename represents:

bwa mem -t 8 --parallel -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

Program + subcommand

  • bwa: the aligner (Burrows–Wheeler Aligner).
  • mem: the algorithm within BWA intended for mapping relatively long reads (commonly Illumina 70bp–250bp, and longer). It outputs alignments in SAM format.

Options / arguments

-t 8

  • Meaning: use 8 CPU threads.
  • Effect: speeds up alignment by parallelizing parts of the computation.
  • Notes: Threads help throughput, but memory usage can rise somewhat depending on data size and settings.

--parallel

  • Meaning: enable additional internal parallel behavior (exact behavior can depend on BWA version/build).
  • Effect: typically helps performance when multiple threads are used.
  • Caution: if you don’t see a speedup or your version doesn’t recognize it, check bwa mem -h for your installed version’s supported flags (some distributions differ).

-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'

Adds a read group (RG) line into the SAM header and tags each alignment with that RG.

The string is a literal SAM header line (an @RG header record) with tab-separated fields; \t means “tab”:

  • @RG: this is a read group header record.

  • ID:Sample1

    • Read Group ID (must be unique within the file).
    • Used by downstream tools to distinguish lanes/runs/libraries when multiple groups exist.
  • SM:Sample1

    • Sample name.
    • Very important for tools like GATK and many variant callers; it ties reads to the biological sample.
  • PL:illumina

    • Platform/technology (e.g., illumina, ont, pacbio).
    • Used in some pipelines for platform-specific assumptions or metadata.

Why read groups matter: If you later mark duplicates, do BQSR, or run joint variant calling, correct RG information (often also including LB library and PU platform unit) is important to avoid treating distinct runs/libraries incorrectly.

File arguments (roles of each filename)

reference.fasta

  • Role: the reference genome/sequence you are aligning the reads to.
  • What BWA expects: a FASTA containing one or more reference contigs/chromosomes.
  • Important dependency: BWA typically requires the reference to have been indexed first (e.g., bwa index reference.fasta), which creates auxiliary index files alongside it.

reads_R1.fastq.gz

  • Role: the forward reads (Read 1) from paired-end sequencing.
  • Format: FASTQ, gzip-compressed.
  • Content: sequences + base quality scores for the first read in each pair.

reads_R2.fastq.gz

  • Role: the reverse reads (Read 2) from paired-end sequencing.
  • Format: FASTQ, gzip-compressed.
  • Content: sequences + base quality scores for the second read in each pair.

Why order matters: With BWA MEM, giving two FASTQ files means “these are paired-end mates.” The i-th read in R1 is assumed to correspond to the i-th read in R2 (same read name, matching pairs).

> alignment.sam

  • Role: this is shell redirection, not a BWA argument.
  • Meaning: write BWA’s output (SAM text) into the file alignment.sam instead of printing it to the terminal.
  • Output content: alignments in SAM format (text), including header lines (@HD, @SQ, @RG etc.) and one line per aligned read.

Putting it together (plain-English description)

“Align the paired-end reads (reads_R1.fastq.gz and reads_R2.fastq.gz) to the reference genome (reference.fasta) using BWA-MEM, using 8 threads, annotate alignments with a read group identifying Sample1 on Illumina, and save the resulting SAM alignments to alignment.sam.”

Practical notes (common next steps)

  • SAM files get large; many pipelines write BAM directly, e.g. pipe to samtools:

    • bwa mem ... | samtools sort -@ 8 -o alignment.sorted.bam
  • Make sure read group info matches your experiment. Often you’ll also include:

    • LB: (library) and PU: (platform unit / flowcell-lane), especially if multiple lanes exist.

If you tell me your sequencing layout (single lane vs multiple lanes, libraries, flowcell/lane IDs), I can suggest a more complete -R line that plays nicely with downstream duplicate marking and variant calling.

Want to see future experiments? sign up for the mailing list:

Second challenge : adding an argument for a different subcomand

Let’s try something a bit trickier. Here we will add an argument -q that applies to a different subcommand. The -q option is valid for bwa aln but not bwa mem.

TipPrompt

Explain the options and arguments in this command line:

bwa mem -t 8 -q 25 -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

include details on the roles of each file name

Here’s relevant bit of the response:

-q 25

  • Minimum mapping quality

  • Reads with a mapping quality (MAPQ) < 25 are:

    • Either not output
    • Or output as low-quality/unmapped (implementation detail)
  • MAPQ reflects confidence that a read is correctly aligned.

  • A value of 25 is moderately strict (≈ 0.3% chance of misalignment).

👉 This is a post-alignment filter, not a base-quality filter.

While there is a lot of good, useful information in this response, it fails to identify that the -q argument isn’t valid, though it gives a good description of its role in bwa aln.

Let’s break this command down piece by piece and explain what each option does and what role each file plays.

bwa mem -t 8 -q 25 -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' \
reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam

1. The program and algorithm

bwa mem

  • bwa: A popular aligner for mapping sequencing reads to a reference genome.

  • mem: The BWA algorithm designed for:

    • Illumina short reads
    • Reads ~70 bp to several kb
    • Single-end or paired-end data
  • It outputs alignments in SAM format by default.


2. Options (flags) and arguments

-t 8

  • Threads
  • Uses 8 CPU threads in parallel.
  • Speeds up alignment on multicore machines.
  • Does not change the results, only performance.

-q 25

  • Minimum mapping quality

  • Reads with a mapping quality (MAPQ) < 25 are:

    • Either not output
    • Or output as low-quality/unmapped (implementation detail)
  • MAPQ reflects confidence that a read is correctly aligned.

  • A value of 25 is moderately strict (≈ 0.3% chance of misalignment).

👉 This is a post-alignment filter, not a base-quality filter.


-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'

  • Adds a read group (RG) header to the SAM output.
  • Read groups are essential for many downstream tools (e.g., variant calling).

Inside the quoted string:

Tag Meaning
@RG Declares a read group
ID:Sample1 Read group ID (must be unique per group)
SM:Sample1 Sample name (used by GATK and others)
PL:illumina Platform (sequencing technology)

Why \t?

  • \t inserts tab characters
  • SAM headers require tabs, not spaces

Without proper read group info, many pipelines will fail or give warnings.


3. Input files and their roles

reference.fasta

  • The reference genome

  • All reads are aligned against this sequence

  • Must be:

    • Indexed beforehand (bwa index reference.fasta)
  • Determines:

    • Chromosome names
    • Coordinate system
    • Alignment positions

reads_R1.fastq.gz

  • Forward reads (Read 1) from paired-end sequencing

  • FASTQ format contains:

    • Read sequence
    • Per-base quality scores
  • Compressed with gzip


reads_R2.fastq.gz

  • Reverse reads (Read 2) from paired-end sequencing

  • Corresponds to the other end of the same DNA fragments as R1

  • BWA uses pairing information to:

    • Improve alignment accuracy
    • Detect insert size
    • Resolve ambiguous mappings

👉 The order matters: R1 first, R2 second.


4. Output redirection

> alignment.sam

  • Redirects standard output to a file

  • alignment.sam:

    • Contains all alignments in SAM (text) format

    • Includes:

      • Header lines (@HD, @SQ, @RG)
      • One line per read alignment

⚠️ SAM files can be very large. Often you’ll see:

bwa mem ... | samtools sort -o alignment.bam

to produce a compressed, sorted BAM file instead.


5. High-level summary

What this command does:

  • Aligns paired-end Illumina reads to a reference genome
  • Uses 8 threads
  • Filters out low-confidence alignments (MAPQ < 25)
  • Tags all reads as belonging to Sample1
  • Writes alignments to a SAM file

Conceptual flow:

reference.fasta
        ↑
reads_R1.fastq.gz + reads_R2.fastq.gz
        ↓
     bwa mem
        ↓
   alignment.sam

If you want, I can also:

  • Rewrite this command for BAM output
  • Explain SAM fields line-by-line
  • Show how changing -q affects results
  • Explain paired-end alignment logic in detail

Challenge three : similarity searching

Let’s try our blastn command line. We will add another completely spurious option --precise and see what happens:

TipPrompt

Explain the options and arguments in this command line:

blastn -query queries.fasta -db nt -remote –precise  -outfmt “6 qseqid sseqid staxids pident length evalue bitscore”  -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv

include details on the roles of each file name

and the response:

--precise

  • Requests higher-accuracy alignments

  • Trades speed for sensitivity

  • Useful when:

    • Sequences are short
    • You care about accurate percent identity
  • May increase runtime slightly

While the rest of the output is helpful, the bit about --precise is completely made up.

Here’s a clean, line-by-line explanation of what this blastn command does, including what each option controls and the role of every file name.


The full command (reformatted)

blastn \
  -query queries.fasta \
  -db nt \
  --precise \
  -remote \
  -outfmt "6 qseqid sseqid staxids pident length evalue bitscore" \
  -max_target_seqs 5 \
  -evalue 1e-6 \
  -num_threads 8 \
  -out results.tsv

What blastn is doing overall

This command runs BLASTN (nucleotide–nucleotide BLAST) to:

  • Compare DNA sequences in queries.fasta
  • Against the NCBI nucleotide database (nt)
  • Using NCBI’s remote servers
  • Returning up to 5 significant matches per query
  • Writing a tab-delimited results file called results.tsv

File names and their roles

queries.fasta

  • Role: Input file (your data)
  • Contains one or more query nucleotide sequences
  • Must be in FASTA format
  • Each sequence in this file will be searched independently against the database

Example:

>query1
ATGCGTACGTTAGC...

nt

  • Role: Reference database
  • Stands for NCBI nucleotide database
  • Contains millions of curated and uncurated DNA sequences from GenBank, RefSeq, etc.
  • Because -remote is used, this database is not stored locally

results.tsv

  • Role: Output file
  • A tab-separated text file
  • Each line represents one BLAST hit
  • Easy to open in Excel, R, Python (pandas), or Unix tools

Explanation of each option

-query queries.fasta

  • Specifies the input FASTA file
  • BLAST reads each sequence in this file as a separate query

-db nt

  • Specifies the database to search against
  • nt = comprehensive nucleotide collection

--precise

  • Requests higher-accuracy alignments

  • Trades speed for sensitivity

  • Useful when:

    • Sequences are short
    • You care about accurate percent identity
  • May increase runtime slightly


-remote

  • Runs BLAST on NCBI’s servers instead of your local machine

  • Required when:

    • You don’t have nt downloaded locally
  • Ignores many local performance settings (see -num_threads below)


-outfmt "6 qseqid sseqid staxids pident length evalue bitscore"

  • Controls output format and columns

6 = tabular format (no comments)

The selected fields:

Field Meaning
qseqid Query sequence ID (from queries.fasta)
sseqid Subject (database hit) sequence ID
staxids NCBI taxonomic ID(s) of the hit
pident Percent identity of the alignment
length Alignment length (bp)
evalue Expectation value (statistical significance)
bitscore Alignment score (higher = better)

Each hit becomes one row in results.tsv.


-max_target_seqs 5

  • Limits output to at most 5 database hits per query
  • Keeps results concise
  • Important for large databases like nt

⚠️ Note: With -remote, this is treated as a soft limit—ordering is approximate.


-evalue 1e-6

  • Filters out weak matches

  • Only reports hits expected to occur by chance less than once per million searches

  • Common thresholds:

    • 1e-3 → permissive
    • 1e-6 → moderate/stringent
    • 1e-10 → very strict

-num_threads 8

  • Requests use of 8 CPU threads
  • Important caveat: When using -remote, this is largely ignored because computation happens on NCBI’s servers
  • Only truly matters for local BLAST databases

-out results.tsv

  • Specifies the output file name
  • Overwrites the file if it already exists
  • Output is tab-delimited because of outfmt 6

What the final output represents

  • Each row = one alignment between:

    • A sequence from queries.fasta
    • A sequence from the nt database
  • Columns are strictly ordered as defined in -outfmt

  • Multiple rows per query are possible (up to 5)


In one sentence

This command compares nucleotide sequences in queries.fasta against the NCBI nt database using remote BLASTN, applies sensitive search settings, filters for significant matches, and writes the top 5 hits per query to a tab-delimited file called results.tsv.

If you want, I can also:

  • Help you interpret the results.tsv
  • Suggest better parameters for short reads, barcodes, or metagenomics
  • Rewrite this for local BLAST databases or faster runs

While the rest of the output is helpful, the bit about --precise is completely made up.

Verdict

So far the picture is not looking great for LLMs (I carried out these experiments with GPT5.2 via the chat interface but I have no reason to suspect that other LLMs will be significantly different). All three incorrect arguments failed to be properly flagged.

This is in very sharp contrast to the previous experiment, which showed incredibly good performance. It’s a constant source of surprise (to me at least) how widly the ability of LLMs varies on apparently similar tasks.

A follow-up for a future experiement: can we tweak the prompt to make it more likely for the AI to catch these kinds of errors?