AI bioinformatics: explaining complex command lines
The first in a series of experiments to explore uses for AI models that are not simply using it as a search engine or writing exercise code.
Many command line bioinformatics tools (genome assemblers, read mappers, etc.) are on the fairly extreme end of the spectrum of power/simplicity. That is to say, they expose a lot of options and features, and as a result have a complicated interface with many different arguments.
Since all these options and arguments are structured and documented as text. interpreting them should be exactly the kind of thing that LLMs are good at. For each I’ll include the prompt and the complete response - just click the green response box to expand it and see the full thing. Note: all prompts were given to GPT5.2 with web searching turned on.
First challenge: mapping reads to a reference genome
First let’s try this bwa mem command which aligns reads to a reference genome.
Explain the options and arguments in this command line:
bwa mem -t 8 -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam
include details on the roles of each file name
GPT5.2 does a pretty good job here. Here’s the complete response:
Here’s a detailed breakdown of the components in this BWA-MEM command line and what each part (option or file) does:
bwa mem -t 8 -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' reference.fasta \
reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam
This command runs the BWA-MEM algorithm to align DNA sequencing reads to a reference genome. BWA-MEM is a widely used sequence aligner that maps short or moderate-length reads to large reference sequences such as whole genomes.(chipster.csc.fi)
🔧 Command and Options
bwa mem
This calls the MEM algorithm of BWA (Burrows–Wheeler Aligner), which is optimized for mapping longer reads (e.g., typical Illumina paired-end reads) and can handle paired and single-end input.(chipster.csc.fi)
-t 8
- Meaning: Use 8 CPU threads (parallel workers).
- Role: Speeds up alignment by letting BWA do multiple tasks simultaneously on multicore systems. Increasing this number generally decreases runtime (up to limits of memory and CPU).
- Without this option BWA runs in single-thread mode.(NodePit)
-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'
Meaning: Adds a read group (@RG) header line to the SAM output.
Role of the RG fields: These tags are written into each alignment record and are essential for downstream analyses (e.g., joint variant calling) that need to know which sample or run each read came from.
ID:Sample1— a unique identifier for this read groupSM:Sample1— sample name (often same as ID)PL:illumina— sequencing platform (here Illumina)
The string uses
\tto represent literal tabs in the@RGSAM header line.(chipster.csc.fi)
📂 File and Device Roles
reference.fasta
- This is the reference genome or sequence you are aligning to.
- Before running
bwa mem, BWA must have an index built for this FASTA (typically viabwa index). - The FASTA contains the actual nucleotides of the genome or contigs (e.g., human_chr1.fa).(chipster.csc.fi)
reads_R1.fastq.gz
- Contains the forward (R1) reads in FASTQ format (compressed with
gzip). - Each record contains a sequence and quality scores from sequencing.
- In paired-end data, these are the “first” mates.(chipster.csc.fi)
reads_R2.fastq.gz
- Contains the reverse (R2) reads in FASTQ format (also gzipped).
- These are the paired “second” mates corresponding to the reads in R1.
- Both R1 and R2 must be the same length and in the same read order for proper pairing.(Bioinformatics Stack Exchange)
📤 Redirecting Output
> alignment.sam
- The
>symbol redirects standard output to a file named alignment.sam. - This file will store the SAM (Sequence Alignment/Map) output: a text-based alignment format showing how each read maps to the reference, including mapping position, flags, CIGAR alignment string, and optional tags (including the RG tag you provided).(chipster.csc.fi)
🧠 Summary of Workflow
- BWA reads the indexed reference and loads it into memory.
- It reads the FASTQ reads (paired-end), finds maximal exact matches (MEMs) against the reference, and performs alignment.
- Each aligned read gets annotation/flags in SAM format, including the read group info you specified.
- The result (list of alignments) is written to alignment.sam.
- You can then convert, sort, or process this SAM into BAM format with tools like samtools for downstream variant calling or analysis.(chipster.csc.fi)
If you want, I can also walk through an example of parsing a SAM line to explain how those fields relate back to this command!
If we ignore the annoying overuse of emojis (which would be easy enough to turn off with a tweak to the prompt), this is pretty good, explaining both the literal meaning of each argument, and the role. It even points out a few extra helpful hints, like the fact that \t represents a literal tab, and the fact that the reference FASTA filename implies the existence of a matching index file.
The point that jumps out most to me is in the section discussing the -t flag to set the number of threads, where the response explicitly points out what the default behaviour would be in the absence of the option.
Want to see future experiments? sign up for the mailing list:
Second challenge : de novo genome assembly
Let’s see how we do with a spades command.
Explain the options and arguments in this command line:
spades.py -k 21,33,55,77,99,127 -t 16 -m 1000 –careful
–pacbio long_reads.fastq.gz
–pe1-1 illumina_lib1_R1.fastq.gz –pe1-2 illumina_lib1_R2.fastq.gz
–mp1-1 matepair_lib1_R1.fastq.gz –mp1-2 matepair_lib1_R2.fastq.gz
-o assembly_output/
include details on the roles of each file name
Here’s the response:
Here’s a breakdown of the SPAdes command you posted, explaining what each option means and what role each file plays in the assembly process. SPAdes is a de-novo genome assembler that takes sequencing reads from one or more technologies (e.g., Illumina, PacBio) and reconstructs contiguous sequences (contigs/scaffolds) without needing a reference sequence.(Wikipedia)
🧠 Command Line (Annotated)
spades.py \
-k 21,33,55,77,99,127 \
-t 16 \
-m 1000 \
--careful \
--pacbio long_reads.fastq.gz \
--pe1-1 illumina_lib1_R1.fastq.gz \
--pe1-2 illumina_lib1_R2.fastq.gz \
--mp1-1 matepair_lib1_R1.fastq.gz \
--mp1-2 matepair_lib1_R2.fastq.gz \
-o assembly_output/🔧 Core SPAdes Options
📌 -k 21,33,55,77,99,127
Specifies the k-mer sizes SPAdes will use in building its de-Bruijn graph.
- k-mers are substrings of length k extracted from reads.
- Using multiple k values helps SPAdes balance between assembling through repetitive regions (larger k) and handling low-coverage regions (smaller k).
- SPAdes will build and integrate graphs at each k value in the list.(ablab.github.io)
⚙️ -t 16
Sets the number of CPU threads to 16.
- More threads → faster assembly (if compute resources allow).
🧠 -m 1000
Limits SPAdes to 1000 GB of RAM.
- SPAdes can use a lot of memory, especially for large genomes or many input libraries.
🐞 --careful
Runs SPAdes in error-correction-aware mode.
- Tries to reduce the number of mismatches and short indels in the final assembly by more conservative graph construction.
- This usually increases accuracy but can slow the assembly.(inf-biox121.readthedocs.io)
📦 Input Read Libraries
SPAdes accepts reads from different sequencing technologies and library types. You tell SPAdes which files belong to which library — important because SPAdes uses this information during graph building and scaffolding.
🧬 --pacbio long_reads.fastq.gz
Specifies one file of PacBio CLR (continuous long reads).
- Long reads help bridge gaps and resolve repeats.
- SPAdes treats them as unpaired long reads for hybrid assembly.
- These can dramatically improve assembly contiguity when combined with accurate short reads.(ablab.github.io)
🔁 Illumina Paired-End Reads
--pe1-1 illumina_lib1_R1.fastq.gz \
--pe1-2 illumina_lib1_R2.fastq.gz--pe1-1= forward (R1) reads of paired-end Illumina library #1--pe1-2= reverse (R2) reads of paired-end Illumina library #1
Paired-end reads have a known approximate distance between mates and are excellent for local assembly and resolving small repeats.(ablab.github.io)
🔗 Illumina Mate-Pair Reads
--mp1-1 matepair_lib1_R1.fastq.gz \
--mp1-2 matepair_lib1_R2.fastq.gz--mp1-1= forward reads of mate-pair library #1--mp1-2= reverse reads of mate-pair library #1
Mate-pair libraries are similar to paired-end libraries but have larger insert sizes (often several kb).
- Useful for scaffolding and spanning longer repeats or structural gaps.
- SPAdes automatically recognises these are mate-pair reads and uses them accordingly.(ablab.github.io)
💡 Note: SPAdes options like
--pe1-1and--mp1-1follow a convention:pe= paired-end,mp= mate-pair, the number after indicates the library index (1here), and-1/-2identify forward/reverse read files, respectively.(ablab.github.io)
📁 Output Directory
📦 -o assembly_output/
Specifies the folder where SPAdes will write:
- error-corrected reads
- intermediate assembly data
- contigs (
contigs.fasta) - scaffolds (
scaffolds.fasta) - logs and assembly graph files
This directory must not exist before running SPAdes (or should be empty).(ablab.github.io)
🧠 Summary of File Roles
| File | Role |
|---|---|
illumina_lib1_R1.fastq.gz |
Illumina paired-end forward reads |
illumina_lib1_R2.fastq.gz |
Illumina paired-end reverse reads |
matepair_lib1_R1.fastq.gz |
Illumina mate-pair forward reads |
matepair_lib1_R2.fastq.gz |
Illumina mate-pair reverse reads |
long_reads.fastq.gz |
Long reads from PacBio instrument |
assembly_output/ |
Directory to collect final assembly results |
📌 How SPAdes Uses These Reads
- Pre-processing/error correction of short reads (Illumina) internally.
- Construction of De-Bruijn graphs at multiple k values.
- Integration of long reads (PacBio) for bridging and resolving repeats.
- Use of paired end and mate pair information for scaffolding assembled contigs.(ablab.github.io)
If you’d like, I can also explain how SPAdes chooses k-mer values automatically or suggest parameter sets for different read types/genome sizes.
In this response, I like how it repeats the relevant blocks of options when discussing the input files - especially useful since there are three different file types represented. It’s also notable how the response helpfully points out what we expect to find in the output directory after the assembly job has run.
Challenge three : similarity searching
For the next challenge, a blastn command line.
Explain the options and arguments in this command line:
blastn -query queries.fasta -db nt -remote -outfmt “6 qseqid sseqid staxids pident length evalue bitscore” -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv
include details on the roles of each file name
and the response:
Here’s a detailed, field-by-field breakdown of what each part of your BLAST command does and what the filenames refer to:
🧬 Command Overview
blastn -query queries.fasta -db nt -remote \
-outfmt "6 qseqid sseqid staxids pident length evalue bitscore" \
-max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv
This runs blastn, a BLAST+ program that compares nucleotide sequences against a nucleotide database to find similar sequences. (docs.metacentrum.cz)
🔹 Core Components and What They Mean
📌 blastn
The program name — it runs a nucleotide-to-nucleotide search (as opposed to e.g., blastp for proteins). (docs.metacentrum.cz)
📁 -query queries.fasta
Specifies the input file with your query sequences.
queries.fastashould be in FASTA format.- It can contain one or more nucleotide sequences that you want to search against a database.
FASTA format means each sequence begins with a header line (starting with >), followed by the sequence itself.
📚 -db nt
Specifies the BLAST database to search against.
ntis the NCBI nucleotide collection, a large, comprehensive nucleotide sequence database.- Because it’s a remote database (see
-remotebelow), you don’t need the database files locally — they’re on NCBI’s servers.
🌐 -remote
Tells BLAST to run the search on NCBI’s online servers instead of using a local database.
- This is useful when you don’t have a local copy of the database (like
nt) — which is huge and requires a lot of disk space. - The query and results are sent over the internet to NCBI’s BLAST service.
📤 Output Format and Fields
🧾 -outfmt "6 qseqid sseqid staxids pident length evalue bitscore"
Controls how the results are formatted.
- The number
6means tabular format (easy to parse). (metagenomics.wiki) - After
6, you list specific fields you want in the output:
| Field | Meaning |
|---|---|
qseqid |
Query sequence identifier (from your FASTA) |
sseqid |
Subject (database) sequence identifier |
staxids |
Taxonomy IDs of the subject hits |
pident |
Percentage identity between query and subject |
length |
Alignment length |
evalue |
Expectation value — measure of statistical significance |
bitscore |
Normalized score indicating match strength |
These fields control the columns in your output file, which is useful for scripts and downstream analysis. (NCBI)
🎯 Filtering and Limits
🔢 -max_target_seqs 5
Limits BLAST to reporting up to 5 top hits per query.
- Only the top N matches (by score/e-value) are kept in the results.
- With tabular outputs, this is how you tell BLAST to cut off after the best hits. (NCBI)
⚠️ Note: BLAST may not always sort perfectly by quality alone due to heuristic behavior — but this is the intended way to limit results.
🔍 -evalue 1e-6
Sets the E-value threshold:
- Only matches with E-value ≤ 1×10⁻⁶ are reported.
- The E-value estimates how many hits of similar quality would be expected by chance — lower is more significant. (open.oregonstate.education)
🧠 Performance and Output
⏱️ -num_threads 8
Specifies that BLAST should use 8 CPU threads for the search.
- This speeds up the search by parallelizing work across cores. (wiki.gacrc.uga.edu)
📄 -out results.tsv
Writes the output to a file named results.tsv.
.tsvstands for tab-separated values, matching the tabular format you requested with-outfmt 6.- This file will contain one row per hit with the columns you specified.
🧠 What Happens When You Run It
- BLAST reads
queries.fastaand parses each sequence. - It sends queries to the remote nt database due to
-remote. - BLAST searches for similar sequences and evaluates hits based on statistical significance (E-value, bitscore).
- Top matches (up to 5 per query with E-value ≤ 1e-6) are returned.
- The results are written in the specified tabular format to
results.tsv.
🎉 Summary of Each Option
| Option | Role |
|---|---|
blastn |
BLAST nucleotide search program |
-query queries.fasta |
FASTA file with sequences to search |
-db nt |
Target database (nucleotide) |
-remote |
Searches online at NCBI (no local DB needed) |
-outfmt ... |
Custom control of output fields and format |
-max_target_seqs 5 |
Return only top 5 matches per query |
-evalue 1e-6 |
Only report hits with E-value ≤ 1e-6 |
-num_threads 8 |
Use 8 parallel threads |
-out results.tsv |
Save output to this file |
If you’d like, I can also explain how to interpret the results.tsv contents for biological meaning — just let me know!
Here the interesting bits are the fact that it’s a remote query, and that it specifies a custom output format, both of which the LLM picks up on with no problems. As before, there are many useful extra bits of information sprinkled thoughout, like the brief summary of FASTA format, and the note that due to the heuristic rules of BLAST the hits are not always guaranteed to be the best ones.
One thing that I think is missing here is a note that the parallelisation might not have as much of a speed up as expected, given that the actual processing is done on remote servers.
Verdict
This is incredibly useful for anyone getting into bioinformatics work. Many of these options are pretty obvious if you’re familiar with the general job that the tools are doing, but the extra context that the LLM supplies seems very useful. Given that these tools are well documented, I would put a fairly high level of confidence in the accuracy of the summaries.
A possible drawback of all the responses here is that they systematically go through the arguments/files in the order that they appear in the command lines, though I’m sure that could be tweaked with a slight modification to the prompt. Indeed, the most impressive thing is how well the model performs with a very simple prompt - I was expecting to have to spend more time adjusting it.
A follow-up for a future experiement: how will the model cope with unrealistic, invalid options? we will see…..