Tutorial for Computer Scientists: Heuristic Sequence Alignment
- Algorithm for computing the sensitivity of a seed
- Minimizers: saving memory and time in heuristic sequence alignment
- MinHash: finding similar sequences without alignments
Algorithm for computing the sensitivity of a seed
- In the lecture we saw a plot of seed sensitivity for BLAST, i.e., the probability that in a random alignment of a certain length and similarity level, we find $k$ consecutive matches.
- For this probability, we do not have a simple formula, but it can be computed by dynamic programming.
- Consider a seed of length $k$ (as in the BLAST program for nucleotides; in the lecture, the seed length was denoted $w$, now $k$).
Combinatorial algorithm
- Let $a_n$ be the number of binary sequences of length $n$ that do not contain $k$ adjacent 1s
- When $n<k$, $a_n = 2^n$
- When $n \geq k$, $a_n = a_{n−1} + a_{n−2} + a_{n−3} + \ldots + a_{n−k}$
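A minimal sketch of this recurrence in Python (the function name `count_no_k_run` is our own):

```python
def count_no_k_run(n, k):
    """Number of binary sequences of length n with no run of k consecutive 1s."""
    a = [2**i for i in range(min(n + 1, k))]   # a[i] = 2^i for i < k
    for i in range(k, n + 1):
        a.append(sum(a[i - k:i]))              # a[i] = a[i-1] + a[i-2] + ... + a[i-k]
    return a[n]
```

For an unbiased model where each column is a match with probability $1/2$, $a_n/2^n$ is exactly the probability that a random length-$n$ alignment contains no seed; the next section generalizes this to arbitrary match probability $p$.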
Probabilistic DP algorithm
- Consider a probabilistic alignment model, where each position has probability $p$ of being a match and $(1-p)$ of being a mismatch or gap, and the alignment has length $L$.
- Random variable $X_i = 1$ if at position $i$ there is a match, 0 otherwise
- Random variable $Y_i = 1$ if position $i$ is the start of a seed, i.e., if $X_i=1, X_{i+1}=1, \dots, X_{i+k-1}=1$
- $\Pr(Y_i = 1) = p^k$, since $X_i$ are mutually independent
- Let $Y = \sum_{i=1}^{L-k+1} Y_i$ be the number of seed occurrences
- By linearity of expectation we easily get $E(Y) = (L-k+1)p^k$
- But we are interested in $\Pr(Y>0) = 1-\Pr(Y=0)$
- $\Pr(Y=0) = \Pr(Y_1=0 \wedge \dots \wedge Y_{L-k+1}=0)$
- Why doesn’t $\Pr(Y=0) = \Pr(Y_i = 0)^{L-k+1}$ hold?
- The individual $Y_i$ are not independent, e.g., $\Pr(Y_{i+1}=1|Y_i=1)=p$
- In the sequence of $Y_i$s, ones tend to cluster together
- The probability of absence of a seed $\Pr(Y=0)$ can be computed by dynamic programming
- Let $A[n]$ be the probability of absence of a seed in the first $n$ columns of the alignment ($0\le n\le L$)
- We distinguish cases according to how many 1s occur at the very end of sequence $X_1,\dots,X_n$
- This number can be $0,1,\dots k-1$ (if it were $k$ or more, we would have a seed occurrence)
$A[n] = \left\{\begin{array}{ll} 1 & \mbox{if } n < k\\ \sum_{i=0}^{k-1} p^i (1-p)\,A[n-i-1] & \mbox{if } n \ge k \end{array}\right.$
- In the second line, $p^i(1-p)$ corresponds to $\Pr(X_1\dots X_n\mbox{ ends with exactly }i\mbox{ ones})$
- $A[n-i-1]$ is $\Pr(X_1\dots X_{n-i-1}\mbox{ does not contain a seed})$, which is the same as $\Pr(X_1 \dots X_n\mbox{ does not contain a seed }| X_1 \dots X_n\mbox{ ends with exactly }i\mbox{ ones})$
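The seed sensitivity is then $\Pr(Y>0) = 1 - A[L]$. A direct transcription of the recurrence into Python (the function name and the example parameters are our own choices):

```python
def prob_no_seed(L, k, p):
    """A[L]: probability that L alignment columns contain no run of
    k consecutive matches, each column matching with probability p."""
    A = [1.0] * (L + 1)                        # A[n] = 1 for n < k
    for n in range(k, L + 1):
        # condition on the alignment ending with exactly i matches
        A[n] = sum(p**i * (1 - p) * A[n - i - 1] for i in range(k))
    return A[L]

# e.g. sensitivity of an 11-mer seed in a 90%-identity alignment of length 100:
# 1 - prob_no_seed(100, 11, 0.9)
```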
Minimizers: saving memory and time in heuristic sequence alignment
Heuristic sequence alignment
- A $k$-mer is defined as $k$ consecutive characters in a string (in our case DNA nucleotides)
- In lecture: use of $k$-mers as seeds for finding potential matches (BLAST)
- Note that in the lecture the length of seed was denoted $w$, here $k$
- When comparing two sequences (or sets of sequences), we store all $k$-mers of one sequence in a dictionary (such as a hash table), then go through all $k$-mers of the other sequence and look them up in the dictionary (see the sketch after the example below)
- Example with $k=5$:
- DB on the left; its $k$-mers and their positions are stored in the dictionary
- The dictionary is shown sorted here, but in practice a hash table is used
- Query on the right; its $k$-mers are looked up in the dictionary, and arrows indicate found $k$-mers (alignment seeds)
Alignment we look for (mismatches shown in small letters):

```
AGTGGCTGCCAGGCTGG
cGaGGCTGCCtGGtTGG
```

```
DB:                 Dict:         Query:
AGTGGCTGCCAGGCTGG                 cGaGGCTGCCtGGtTGG
AGTGG,1             AGGCT,11      CGAGG,1
GTGGC,2             AGTGG,1       GAGGC,2
TGGCT,3             CAGGC,10      AGGCT,3 -> 11
GGCTG,4             CCAGG,9       GGCTG,4 -> 4,12
GCTGC,5             CTGCC,6       GCTGC,5 -> 5
CTGCC,6             GCCAG,8       CTGCC,6 -> 6
TGCCA,7             GCTGC,5       TGCCT,7
GCCAG,8             GCTGG,13      GCCTG,8
CCAGG,9             GGCTG,4,12    CCTGG,9
CAGGC,10            GTGGC,2       CTGGT,10
AGGCT,11            TGCCA,7       TGGTT,11
GGCTG,12            TGGCT,3       GGTTG,12
GCTGG,13                          GTTGG,13
```
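A minimal sketch of the dictionary-based seeding described above (the function name `find_seeds` is ours; positions are 1-based to match the example):

```python
from collections import defaultdict

def find_seeds(db, query, k):
    """Index every k-mer of db, then look up every k-mer of the query.
    Returns (query position, db position) pairs -- the alignment seeds."""
    index = defaultdict(list)                 # k-mer -> list of positions in db
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i + 1)      # 1-based positions as in the example
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j + 1, i))
    return seeds

# reproduces the arrows in the example:
# find_seeds("AGTGGCTGCCAGGCTGG", "CGAGGCTGCCTGGTTGG", 5)
#   -> [(3, 11), (4, 4), (4, 12), (5, 5), (6, 6)]
```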
$k$-mers with a step
- Trick to reduce required memory (used in the BLAT program): store only every $s$-th $k$-mer of the first sequence, then search all $k$-mers of the second (a minimal sketch follows the example below)
- This slightly reduces sensitivity
- We are guaranteed to find a seed if there are at least $k+s-1$ consecutive matches
- Could we also search only every $s$-th $k$-mer in the query?
- What if the DB and the query were identical, except that the query was missing its first letter?
- Example with $k=5$, $s=3$: the $k$-mers with numbers are stored in the dictionary; all query $k$-mers are searched

Alignment we look for:

```
AGTGGCTGCCAGGCTGG
cGaGGCTGCCtGGtTGG
```

```
DB:                 Dict:        Query:
AGTGGCTGCCAGGCTGG                cGaGGCTGCCtGGtTGG
AGTGG               CCAGG,9      CGAGG,1
GTGGC               CTGCC,6      GAGGC,2
TGGCT,3             GGCTG,12     AGGCT,3
GGCTG               TGGCT,3      GGCTG,4 -> 12
GCTGC                            GCTGC,5
CTGCC,6                          CTGCC,6 -> 6
TGCCA                            TGCCT,7
GCCAG                            GCCTG,8
CCAGG,9                          CCTGG,9
CAGGC                            CTGGT,10
AGGCT                            TGGTT,11
GGCTG,12                         GGTTG,12
GCTGG                            GTTGG,13
```
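Compared to the previous sketch, only the indexing loop changes; a minimal illustration (the starting phase of the stored $k$-mers is arbitrary; the example above happens to store positions 3, 6, 9, 12):

```python
def index_with_step(db, k, s):
    """BLAT-style index: store only every s-th k-mer of db."""
    index = {}
    for i in range(0, len(db) - k + 1, s):    # phase 0; any phase works
        index.setdefault(db[i:i + k], []).append(i + 1)
    return index
```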
Minimizers
- To reduce dictionary lookups, we can use minimizers
- Consider all groups of $s$ consecutive $k$-mers (sliding window), in each group find the lexicographically smallest $k$-mer (minimizer) and store it in the dictionary
- When shifting the window by one to the right, often the smallest $k$-mer remains the same and does not need to be stored again, saving memory (and time)
- In the query we likewise search only the minimizers rather than all $k$-mers, which can save time (see the sketch after the example below)
- We are again guaranteed to find a seed if there are at least $k+s-1$ consecutive matches
- Example with $k=5$, $s=4$: the $k$-mers with numbers are stored/searched in the dictionary

```
DB:                 Dict:        Query:
AGTGGCTGCCAGGCTGG                cGaGGCTGCCtGGtTGG
AGTGG,1             AGGCT,11     CGAGG
GTGGC               AGTGG,1      GAGGC
TGGCT               CAGGC,10     AGGCT,3 -> 11
GGCTG               CCAGG,9      GGCTG
GCTGC,5             CTGCC,6      GCTGC
CTGCC,6             GCTGC,5      CTGCC,6 -> 6
TGCCA                            TGCCT
GCCAG                            GCCTG
CCAGG,9                          CCTGG,9
CAGGC,10                         CTGGT,10
AGGCT,11                         TGGTT
GGCTG                            GGTTG
GCTGG                            GTTGG
```
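A naive sketch of minimizer selection (the function name is ours; with the identity in place of a hash function it reproduces the lexicographic example above; real implementations use a monotone queue to avoid rescanning each window):

```python
def minimizers(seq, k, s, h=lambda x: x):
    """(position, k-mer) minimizers of seq: in every window of s consecutive
    k-mers keep the one minimizing h; the identity h picks the
    lexicographically smallest k-mer, as in the example above."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for w in range(len(kmers) - s + 1):
        # ties are broken by the leftmost position in the window
        best = min(range(w, w + s), key=lambda i: (h(kmers[i]), i))
        chosen.add((best + 1, kmers[best]))   # 1-based position
    return chosen

# minimizers("AGTGGCTGCCAGGCTGG", 5, 4) -> {(1, 'AGTGG'), (5, 'GCTGC'),
#   (6, 'CTGCC'), (9, 'CCAGG'), (10, 'CAGGC'), (11, 'AGGCT')}
```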
- Minimizers are especially advantageous if the first and second sets of sequences are the same, e.g., when searching for overlaps among reads during genome assembly. Each read has a set of minimizers, which are used as keys in the dictionary; the values are lists of reads. Pairs of reads sharing a minimizer end up in the same list (binning) and are considered for overlap computation
- In practice, we do not use the lexicographically smallest $k$-mer as a minimizer, but each $k$-mer is hashed by some function and the one with the minimal hash value is taken
- The reason is to avoid minimizers often being sequences like AAAAA, which are overrepresented in biological sequences and often not functionally interesting
- The average density of minimizers in a sequence with random hashing is about $2/(s+1)$ (in BLAT it was lower, $1/s$)
- Minimizers are used by minimap2, a very popular tool for aligning reads to each other and to genomes
- For aligning nanopore reads to a genome it uses $k=15$, $s=10$; for finding overlaps in nanopore reads, $k=15$, $s=5$; for comparing genomes with identity over 80%, $k=19$, $s=10$
Summary
| Program | # $k$-mers in dictionary | # $k$-mers searched | seed guaranteed given |
|---|---|---|---|
| BLAST | $n$ | $m$ | $k$ adjacent matches |
| BLAT | $n/s$ | $m$ | $k+s-1$ adjacent matches |
| minimizers | $2n/(s+1)$ | roughly $2m/(s+1)$ | $k+s-1$ adjacent matches |
- $n$, $m$ are the lengths of the two sequences
- In the $k$-mer counts we neglected lower-order terms like $-k+1$
Literature
- Li, Heng. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32.14 (2016): 2103-2110.
- Li, Heng. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34.18 (2018): 3094-3100.
- Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20.18 (2004): 3363-3369.
- Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., Kingsford, C. Improving the performance of minimizers and winnowing schemes. Bioinformatics 33.14 (2017): i110-i117.
MinHash: finding similar sequences without alignments
A short aside: Text similarity in web analysis
Suppose we have a set of web pages (a web page is a sequence of words) and we want to find pairs of similar pages among them. How can we define the similarity of two texts?
One way to do this is to look at the number of words shared by each pair of pages. We expect that the more words they have in common, the more similar they are. This measure of similarity is formalized by the mathematical concept of the Jaccard index.
Let $U$ be the universe of words and let $A, B \subseteq U$ be its subsets (i.e., the sets of words of two texts). Then the Jaccard index is $J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
This measure takes the value 0 only if the sets are disjoint, and the value 1 only if the sets are identical. Otherwise, its value lies in the open interval $(0, 1)$, and the more words they have in common, the higher its value.
Then the question “Which pairs of texts are similar?” can be reformulated, for example, as “Which pairs of texts have the Jaccard index higher than some threshold $\alpha$?”.
How quickly could we compute the Jaccard index for two sets of words, each with $n$ elements?
The exact computation of the Jaccard index is not always fast enough for the purposes of a particular application, so a logical solution is to try to compute its value only approximately.
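With hash-based sets, the exact index for two $n$-element sets is computable in expected $O(n)$ time, answering the question above; a minimal sketch:

```python
def jaccard(a, b):
    """Exact Jaccard index of two sets; set intersection via hashing
    runs in expected linear time in the sizes of the sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

print(jaccard({"the", "quick", "brown", "fox"},
              {"the", "lazy", "brown", "dog"}))  # 2 shared / 6 total = 1/3
```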
Approximation of the Jaccard coefficient: MinHash
We will have random hash functions $h_1, h_2, \dots, h_s$.
For each hash function $h$, we assume that if we apply it to a set $A = \{a_1, a_2, \ldots, a_n\}$, then ordering the elements by $h(a_1), h(a_2), \ldots, h(a_n)$ gives a uniformly random permutation of $A$ (not really true for practical hash functions).
For a set $A = \{a_1, a_2, \ldots, a_n\}$ and a hash function $h$, $minHash_{h}(A)$ is defined as follows:
$minHash_h(A) := \min\{h(a_1), h(a_2), \ldots, h(a_n)\}$
Since $h$ is a random hash function, we can view the value $minHash_h(A)$ as the hash of a uniformly randomly chosen element of the set $A$.
$\Pr(minHash_h(A) = minHash_h(B)) = J(A, B)$
Proof:
- Minimum of $A\cup B$ is in $A\cap B$ with probability equal to the Jaccard index
- If the minimum is outside of $A\cap B$, then the minHashes of $A$ and $B$ are different
- If it is inside $A\cap B$, then the minHashes of $A$ and $B$ are the same
Let us have $s$ independent hash functions $h_1, h_2, \dots, h_s$. For each $i=1,2,\dots,s$ we define the variable $X_i$ to be 1 if $minHash_{h_i}(A) = minHash_{h_i}(B)$ and 0 otherwise.
Then $E(X_i) = J(A, B)$ and the variables $X_i$ are independent.
MinHash algorithm
Computing sketches of documents:
- Choose “random” hash functions $h_1,\dots, h_s$.
- For each text $A=\{a_1, \dots, a_n\}$:
- For each function $h_i$ from $h_1,\dots, h_s$:
- $S_{A,i} = \min\{h_i(a_1), h_i(a_2), \ldots, h_i(a_n)\}$
- Vector $S_A = (S_{A,1}, S_{A,2}, \dots, S_{A,s})$ is called the sketch of $A$
Comparing sketches of documents:
- For each pair of documents $A$, $B$:
- $x = |\{ i : S_{A,i} = S_{B,i}\}|$
- $x/s$ is an estimate of $J(A,B)$
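A small self-contained sketch of the whole scheme (salting Python's built-in `hash` with random masks is our simplification; it does not give truly independent random hash functions):

```python
import random

def make_hashes(s, seed=0):
    """s 'random' hash functions: the built-in hash salted with random masks."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(s)]
    return [lambda x, m=m: hash(x) ^ m for m in masks]

def sketch(words, hashes):
    """MinHash sketch S_A: the minimum of each hash function over the set."""
    return [min(h(w) for w in words) for h in hashes]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch coordinates on which the two documents agree."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

hs = make_hashes(200)
A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown cat naps under the lazy dog".split())
print(estimate_jaccard(sketch(A, hs), sketch(B, hs)))  # close to J(A,B) = 5/11
```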
Running time and memory:
- Let us have $N$ documents, each with $M$ words of length $w$
- Computing all sketches takes time $O(NMsw)$
- Sketches take up memory $O(Ns)$, we do not need to keep the documents in memory (when computing sketches, we can read documents word by word and keep only the current sketch values in memory)
- Comparing one pair of documents takes $O(s)$, comparing all pairs takes $O(N^2s)$
For comparison, the trivial computation of the exact Jaccard measure using hashing would take $O(N^2Mw)$ and all documents would take up $O(NMw)$ memory. So if $N$ is large and $s$ is much smaller than $M$, we save both time and memory.
Alternative: instead of using $s$ different functions, we use only one hash function and keep not just the minimum but the $s$ smallest hash values. Then we estimate $J(A,B)$ as $|S_A\cap S_B|/s$, where $S_A$ is the set of values in the sketch of $A$. This saves time when computing the sketch, since we do not need to hash every element $s$ times.
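A sketch of this bottom-$s$ variant, following the estimator above (`heapq.nsmallest` keeps the $s$ smallest of the deduplicated hash values; the function names are ours):

```python
import heapq

def bottom_s_sketch(words, s, h=hash):
    """S_A: the s smallest values of a single hash function over the set."""
    return set(heapq.nsmallest(s, {h(w) for w in words}))

def estimate_jaccard_bottom_s(sketch_a, sketch_b, s):
    """Estimate of J(A, B) as |S_A intersect S_B| / s."""
    return len(sketch_a & sketch_b) / s
```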
To avoid comparing all pairs of documents, we can use binning:
- For each $h_i$, create a dictionary mapping hash value to a list of documents for which it was minHash (bins)
- We compare only pairs of documents that end up in the same bin (which means that their estimate of $J(A,B)$ will be nonzero)
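A sketch of the binning step (names ours), returning only the candidate pairs that share at least one sketch coordinate:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sketches):
    """sketches: dict mapping document id -> sketch vector.
    Documents are binned by (coordinate, hash value); only pairs that
    share a bin can have a nonzero sketch estimate of J(A, B)."""
    bins = defaultdict(set)
    for doc, sk in sketches.items():
        for i, v in enumerate(sk):
            bins[(i, v)].add(doc)
    pairs = set()
    for docs in bins.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs
```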
Literature
- Broder AZ. On the resemblance and containment of documents. Compression and Complexity of SEQUENCES 1997 (pp. 21-29). IEEE.
Finding similar sequences
- As “words” we use all $k$-mers of a given sequence.
- Instead of aligning them, we estimate their similarity by the Jaccard index of their $k$-mer sets.
- The Jaccard index is estimated by MinHash.
- For example, a popular program called Mash for genome comparison uses $k=21$, $s=1000$
- It stores the $s$ smallest values from one hash function
- The sketch takes about 8 kB per genome (a genome has millions or billions of nucleotides)
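Tying it together, a toy Mash-like comparison reusing `bottom_s_sketch` and `estimate_jaccard_bottom_s` from the sketches above (toy parameters; Mash itself uses $k=21$, $s=1000$ as noted):

```python
def kmer_set(seq, k):
    """All k-mers of a sequence, playing the role of the 'words' of a text."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# toy comparison of the two sequences from the seeding examples
a = kmer_set("AGTGGCTGCCAGGCTGG", 5)
b = kmer_set("CGAGGCTGCCTGGTTGG", 5)
print(estimate_jaccard_bottom_s(bottom_s_sketch(a, 8), bottom_s_sketch(b, 8), 8))
```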
Literature
- Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology. 2016 Dec;17(1):1-4.