Finding Sequence Motifs Defined by a Probability Matrix

  • We are given $n$ sequences $S=(S_1,\dots, S_n)$, each of length $m$, a motif length $L$, and a null hypothesis $q$ (nucleotide frequencies in the genome)
  • We are looking for a motif in the form of a probability profile of length L and its occurrence in each sequence
  • Let $W[a,i]$ be the probability that at position i of the motif there is base a, W is the whole matrix
  • Let $o_i$ be the position of occurrence in sequence $S_i$, $O=(o_1, \dots, o_n)$ are all occurrences
  • $\Pr(S|W,O)$ is a simple product, where for positions in the motif windows we use probabilities from W, for positions outside the windows we use q
    • $\Pr(S_i|W,o_i) = \prod_{j=1}^{L} W[S_i[j+o_i-1],j] \prod_{j=1}^{o_i-1} q[S_i[j]] \prod_{j=o_i+L}^m q[S_i[j]]$
    • $\Pr(S|W,O) = \prod_{i=1}^n \Pr(S_i|W,o_i)$
  • We are looking for W and O that maximize this likelihood $\Pr(S|W,O)$
    • There is no known efficient algorithm that always finds the maximum
    • We could try all possibilities for O. For a given O, the best W is given by the nucleotide frequencies observed in the selected windows.
    • Conversely, if we know W, we can find the best O:
      • in each sequence i we try all positions $o_i$ and choose the one with the highest value $\Pr(S_i|W,o_i)$
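
As an illustration, here is a minimal sketch in Python (0-based indexing, illustrative names) of computing the per-sequence likelihood $\Pr(S_i|W,o_i)$ from the formula above; evaluating it at every position gives the search for the best occurrence described in the last point.

    # Likelihood Pr(S_i | W, o_i) for one sequence, following the product formula above.
    # W[a][j] is the motif probability of base a at motif position j (0-based),
    # q[a] is the background frequency of base a, o is a 0-based occurrence position.
    def sequence_likelihood(s, W, q, o, L):
        p = 1.0
        for j, base in enumerate(s):
            if o <= j < o + L:
                p *= W[base][j - o]   # position inside the motif window: use profile W
            else:
                p *= q[base]          # position outside the window: use background q
        return p

    # toy example with a motif of length L = 2
    q = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    W = {'A': [0.7, 0.1], 'C': [0.1, 0.1], 'G': [0.1, 0.7], 'T': [0.1, 0.1]}
    best_o = max(range(6 - 2 + 1), key=lambda o: sequence_likelihood("TTAGTT", W, q, o, 2))
    print(best_o)   # position maximizing Pr(S_i | W, o_i); here 2 (window "AG")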

EM Algorithm

  • Iteratively improves W, taking all O weighted by their probability with respect to W from the previous round
  • Here is a slightly rewritten version of the algorithm seen in the lecture:

  • Initialization: assign each position j in sequence $S_i$ some score $p_{i,j}$
  • Iterate:
    • Compute W from all possible occurrences in $S_1,\dots,S_n$ weighted by $p_{i,j}$:
      for i in [0..n):                   # for all sequences
        for j in [0..|s_i|-L+1):         # for all starting positions of a window
          for x in [j..j+L):             # for all positions in the window starting at j
            W[s_i[x], x-j] += p_{i,j}    # depending on nucleotide s_i[x], add weight p_{i,j} to the corresponding motif column of W
      # finally, normalize each column of W so that it sums to 1
      
    • Recompute all scores $p_{i,j}$ so that they correspond to the ratios of probabilities of occurrence of W at position j in $S_i$. That is, $p_{i,j}$ is proportional to $\Pr(S_i|W,o_i=j)$, and normalized so that the sum of $p_{i,j}$ in each row $i$ is 1.
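
A minimal sketch of one iteration of this EM procedure in Python (0-based indexing; the names and the use of numpy are illustrative, not from the lecture):

    import numpy as np

    BASES = "ACGT"
    IDX = {b: k for k, b in enumerate(BASES)}

    def em_step(seqs, p, q, L):
        """One EM iteration: build W from weighted occurrences, then recompute the scores p[i][j]."""
        # build W: weighted counts over all candidate windows, then normalize each column
        W = np.zeros((4, L))
        for i, s in enumerate(seqs):
            for j in range(len(s) - L + 1):
                for x in range(j, j + L):
                    W[IDX[s[x]], x - j] += p[i][j]
        W /= W.sum(axis=0)                          # each motif column sums to 1

        # recompute scores: p[i][j] proportional to Pr(S_i | W, o_i = j), rows normalized to 1
        new_p = []
        for s in seqs:
            scores = []
            for j in range(len(s) - L + 1):
                pr = 1.0
                for x, base in enumerate(s):
                    pr *= W[IDX[base], x - j] if j <= x < j + L else q[base]
                scores.append(pr)
            total = sum(scores)
            new_p.append([v / total for v in scores])
        return W, new_p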

Gibbs Sampling

  • Initialization: Take random positions of occurrences O
  • Iterate:
    • Compute W from occurrences O
    • Randomly select one sequence $S_i$
    • For each possible position j in $S_i$ compute the score $p_{i,j}$ (as in EM) of occurrence of W at this position
    • Choose $o_i$ randomly, weighted by $p_{i,j}$ (one sampling pass is sketched in the code after this list)
  • In this way, we get a sequence of samples $O^{(0)}, O^{(1)}, \dots$.
  • Consecutive samples are similar (differ only in one component $o_i$), so they are not independent.
  • For each sample $O^{(t)}$ we find the best $W^{(t)}$ and compute the likelihood $\Pr(S|W^{(t)},O^{(t)})$. Finally, we select O and W where the likelihood was the highest.
  • This algorithm (with minor modifications) was used in the paper Lawrence, Charles E., et al. (1993) “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.” Science.
    • In the paper, in each iteration, the matrix W is computed from all sequences except $S_i$.
    • Occasionally, they make a step where they randomly try to shift all occurrences one position left or right.
    • This algorithm is not, strictly speaking, mathematically correct Gibbs sampling (it does not even have a properly defined distribution from which it samples). For reference, at the bottom of the page we provide the Gibbs sampling algorithm for motif finding from another paper.
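
A minimal sketch in Python of one pass of the sampling loop described above (0-based indexing; pseudocounts and variable names are illustrative additions, and for simplicity W is computed from all sequences rather than excluding $S_i$ as in the paper):

    import random
    import numpy as np

    BASES = "ACGT"
    IDX = {b: k for k, b in enumerate(BASES)}

    def profile_from_occurrences(seqs, O, L, pseudocount=1.0):
        """Column frequencies of the windows at positions O (with pseudocounts to avoid zeros)."""
        W = np.full((4, L), pseudocount)
        for s, o in zip(seqs, O):
            for x in range(L):
                W[IDX[s[o + x]], x] += 1
        return W / W.sum(axis=0)

    def gibbs_pass(seqs, O, q, L):
        """Resample the occurrence o_i of one randomly chosen sequence, weighted by p_{i,j}."""
        i = random.randrange(len(seqs))
        W = profile_from_occurrences(seqs, O, L)
        s = seqs[i]
        scores = []
        for j in range(len(s) - L + 1):
            pr = 1.0
            for x in range(L):
                pr *= W[IDX[s[j + x]], x] / q[s[j + x]]   # likelihood ratio motif vs. background
            scores.append(pr)                             # proportional to Pr(S_i | W, o_i = j)
        O[i] = random.choices(range(len(scores)), weights=scores)[0]
        return O

Running gibbs_pass many times yields the samples $O^{(0)}, O^{(1)}, \dots$; keeping track of the sample with the highest likelihood gives the final answer.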

Sampling from a Probabilistic Model in General

  • Suppose we have a probabilistic model where $D$ are some observed data and $X$ are unknown random variables (for us, $D$ are sequences $S$ and $X$ are occurrences $O$, possibly also matrix $W$)
  • We can search for $X$ for which the likelihood $\Pr(D|X)$ is highest
  • Or we can randomly sample different $X$ from $\Pr(X|D)$

Use of samples

  • Among the obtained samples, we choose the one for which the likelihood $\Pr(D|X)$ is highest (another heuristic approach to maximizing likelihood)
  • But samples also give us information about the uncertainty in the estimate of $X$
    • We can estimate means and deviations of various quantities
    • For example, in motif finding, we can track how often each position is an occurrence of the motif (see the sketch after this list)
  • Generating independent samples from $\Pr(X|D)$ can be difficult
  • The Markov chain Monte Carlo (MCMC) method generates a sequence of dependent samples $X^{(0)}, X^{(1)},\dots$, which converges in the limit to the target distribution $\Pr(X|D)$
  • Gibbs sampling is a special case of MCMC
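
A minimal sketch (illustrative names) of such an uncertainty estimate: the fraction of samples in which each position of sequence $i$ is the motif occurrence.

    from collections import Counter

    def occurrence_frequencies(samples, i):
        """samples is a list of occurrence vectors O^(t); returns the fraction of samples with o_i = j for each j."""
        counts = Counter(O[i] for O in samples)
        return {j: c / len(samples) for j, c in counts.items()}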

Markov Chains

  • See earlier exercises

  • Markov chain is a sequence of random variables $X^{(0)}, X^{(1)}, \dots,$ such that $\Pr(X^{(t)}|X^{(0)},\dots,X^{(t-1)}) = \Pr(X^{(t)}|X^{(t-1)})$, i.e. the value at time $t$ depends only on the value at time $t-1$ and not on earlier values.
  • We are interested in homogeneous Markov chains, in which $\Pr(X^{(t)}|X^{(t-1)})$ does not depend on $t$.
  • We are also only interested in chains where the random variables $X^{(t)}$ take values from a finite set (the possible values of $X^{(t)}$ are called states)
    • For example, states A, C, G, T (a simulation sketch follows this list)
    • In Gibbs sampling for motifs, the state is the configuration of variables $O$ (i.e. we have $(m-L+1)^n$ states)
      • The sample at step $t$ depends on the sample at step $t-1$ and differs only in the value of one $o_i$
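
A minimal sketch of simulating a homogeneous Markov chain over the states A, C, G, T (the transition probabilities below are made up for illustration):

    import random

    P = {  # P[x][y] = probability of going from state x to state y in one step
        'A': {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2},
        'C': {'A': 0.1, 'C': 0.5, 'G': 0.3, 'T': 0.1},
        'G': {'A': 0.1, 'C': 0.3, 'G': 0.5, 'T': 0.1},
        'T': {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4},
    }

    def simulate(P, start, steps):
        state, path = start, [start]
        for _ in range(steps):
            states = list(P[state])
            weights = [P[state][y] for y in states]
            state = random.choices(states, weights=weights)[0]
            path.append(state)
        return "".join(path)

    print(simulate(P, 'A', 20))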

Matrix

  • The probabilities of transitions between states in one step can be expressed by a probability matrix P, whose element $p_{x,y}$ denotes the probability of transition from state x to state y: $p_{x,y}=\Pr(X^{(t)}=y|X^{(t-1)}=x)$
    • The sum of each row is 1, numbers are non-negative
  • We denote $p_{x,y}^t = \Pr(X^{(t)}=y|X^{(0)}=x)$; these values are obtained by raising the matrix P to the power t
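
For example (illustrative numbers), the t-step probabilities can be computed with a matrix power:

    import numpy as np

    # states indexed 0..3; P[x, y] = one-step transition probability from x to y
    P = np.array([[0.4, 0.2, 0.2, 0.2],
                  [0.1, 0.5, 0.3, 0.1],
                  [0.1, 0.3, 0.5, 0.1],
                  [0.2, 0.2, 0.2, 0.4]])
    Pt = np.linalg.matrix_power(P, 10)   # Pt[x, y] = Pr(X^(10) = y | X^(0) = x)
    print(Pt)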

Stationary Distribution

  • A distribution $\pi$ on the set of states is called stationary for a Markov chain $P$ if for each j $\sum_{i}\pi(i)p_{i,j} = \pi(j)\,$ (or in matrix notation $\pi P = \pi$)
  • If matrix P satisfies certain conditions (is ergodic), there exists exactly one stationary distribution $\pi$. Moreover, for each x and y $\lim_{t\to\infty} p_{x,y}^{t} = \pi(y)\,$
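
A minimal sketch of computing the stationary distribution numerically, either as the left eigenvector of P for eigenvalue 1 or by observing the convergence of $P^t$ (the matrix is the illustrative one from above):

    import numpy as np

    def stationary_distribution(P):
        """Solve pi P = pi with sum(pi) = 1 via the left eigenvector of P for eigenvalue 1."""
        vals, vecs = np.linalg.eig(P.T)
        pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
        return pi / pi.sum()

    P = np.array([[0.4, 0.2, 0.2, 0.2],
                  [0.1, 0.5, 0.3, 0.1],
                  [0.1, 0.3, 0.5, 0.1],
                  [0.2, 0.2, 0.2, 0.4]])
    print(stationary_distribution(P))            # pi
    print(np.linalg.matrix_power(P, 50)[0])      # each row of P^t approaches pi for ergodic P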

Examples of Markov Chains in Bioinformatics

  • In HMMs, the states form a Markov chain
  • Other variants: infinite state spaces (more complex theory), continuous time (seen in substitution models), higher-order chains where we determine $\Pr(X_t|X_{t-r},\dots,X_{t-1})$, etc.
  • Use in bioinformatics: characterization of random sequences (null hypothesis); for DNA, chains of order up to 5 are used, which model sequences better than independent variables

Ergodic Markov Chains

  • We say that a transition matrix P is ergodic if $P^t$ has all entries nonzero for some t>0
  • Examples of non-ergodic matrices (and one ergodic matrix for comparison; an ergodicity check is sketched after this list):
    1 0            0.5 0.5            0 1            0.5 0.5
    0 1            0   1              1 0            1   0
    disconnected   weakly connected   periodic       ergodic
  • As noted above, in HMMs the states form a Markov chain; the gene finding HMM from the lecture has an ergodic transition matrix, while profile HMMs do not
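
A minimal sketch of checking ergodicity directly from the definition, i.e. testing whether some power $P^t$ has all entries nonzero (the bound on t is a simple safe choice, not from the text):

    import numpy as np

    def is_ergodic(P, t_max=None):
        """True if some power P^t (t >= 1) has all entries positive."""
        n = len(P)
        t_max = t_max if t_max is not None else n * n   # (n-1)^2 + 1 steps suffice (Wielandt); n*n is a simple safe bound
        Pt = np.eye(n)
        for _ in range(t_max):
            Pt = Pt @ P
            if np.all(Pt > 0):
                return True
        return False

    print(is_ergodic(np.array([[0.5, 0.5], [1.0, 0.0]])))   # the ergodic example above -> True
    print(is_ergodic(np.array([[0.0, 1.0], [1.0, 0.0]])))   # the periodic example above -> False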

Markov Chain Monte Carlo (MCMC)

  • We want to generate random samples from some target distribution $\pi$, but this distribution is too complex.
  • We construct an ergodic Markov chain whose stationary distribution is $\pi$, so that we can efficiently sample $X^{(t)}$ if we know $X^{(t-1)}$.
  • If we start from any point $X^{(0)}$, after some time t the distribution of $X^{(t)}$ is approximately $\pi$
  • But consecutive samples are not independent!
  • However, we can still estimate expected values of various quantities: $\frac{1}{t} \sum_{i=1}^t f(X^{(i)})$ converges to $E_\pi [f(X)]$
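
A minimal sketch of this kind of estimate, reusing the illustrative 4-state chain from above: the time average of f along one simulated run approaches $E_\pi[f(X)]$.

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.4, 0.2, 0.2, 0.2],
                  [0.1, 0.5, 0.3, 0.1],
                  [0.1, 0.3, 0.5, 0.1],
                  [0.2, 0.2, 0.2, 0.4]])
    f = np.array([0.0, 1.0, 1.0, 0.0])     # an arbitrary function of the state (here: state is 1 or 2)

    x, total, t = 0, 0.0, 100000
    for _ in range(t):
        x = rng.choice(4, p=P[x])           # one MCMC step
        total += f[x]
    print(total / t)                        # approximates E_pi[f(X)] despite dependent samples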

Gibbs Sampling

  • The target distribution $\pi(X)$ is over vectors of length n, $X=(x_1,\dots,x_n)$
  • In each step, we sample one component $x_i$ from the conditional probability $\Pr(x_i|x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$
  • The other values remain the same as in the previous step
  • The value $i$ is chosen randomly or cycled periodically $i=1,2,\dots,n$
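
A minimal sketch of general Gibbs sampling for a small discrete target distribution $\pi$ over pairs $(x_1, x_2)$ (the table of probabilities is made up for illustration); the conditional $\Pr(x_i|x_{-i})$ is obtained simply by restricting $\pi$ to the states that agree with the current value of the other component.

    import random
    from collections import Counter

    # target distribution pi over pairs (x1, x2) with values in {0, 1, 2}
    pi = {
        (0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.05,
        (1, 0): 0.05, (1, 1): 0.30, (1, 2): 0.05,
        (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.20,
    }

    def gibbs(pi, steps, x=(0, 0)):
        samples = []
        for t in range(steps):
            i = t % 2                            # cycle through the components
            values = (0, 1, 2)
            # Pr(x_i | x_{-i}) is proportional to pi restricted to the row/column given by x_{-i}
            weights = [pi[(v, x[1])] if i == 0 else pi[(x[0], v)] for v in values]
            v = random.choices(values, weights=weights)[0]
            x = (v, x[1]) if i == 0 else (x[0], v)
            samples.append(x)
        return samples

    freq = Counter(gibbs(pi, 200000))
    print({k: round(freq[k] / 200000, 3) for k in sorted(pi)})   # should be close to pi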

Proof of Correctness of Gibbs Sampling

  • Gibbs sampling is not always ergodic, e.g. when some combinations of values have zero probability. Therefore, when designing a Gibbs sampler, we must ensure that the chain is ergodic.
  • For general Gibbs sampling, we can show that if it is ergodic, then our chosen $\pi$ is its stationary distribution
  • Definition: We say that a matrix P and a distribution $\pi$ satisfy detailed balance if for every pair of states (two value vectors) x and y we have $\pi(x)p_{x,y} = \pi(y)p_{y,x}$
  • Lemma: if for some chain P and some distribution $\pi$ detailed balance holds, then $\pi$ is a stationary distribution for P
    • Proof: $\sum_x \pi(x)p_{x,y} = \sum_x \pi(y)p_{y,x} = \pi(y)\sum_x p_{y,x} = \pi(y)$
  • Lemma: for the Gibbs sampling chain, detailed balance holds with respect to the target distribution $\pi$
    • Proof: consider two consecutive value vectors x and y that differ only in the i-th coordinate. Let $x_{-i}$ be the values of all other variables except $x_i$; since x and y agree everywhere else, $x_{-i}=y_{-i}$
    • $\pi(x)p_{x,y} = \pi(x)\Pr(y_i|x_{-i}) = \Pr(x_{-i})\Pr(x_i|x_{-i}) \Pr(y_i|x_{-i}) = \pi(y)\Pr(x_i|x_{-i}) = \pi(y)\Pr(x_i|y_{-i}) = \pi(y)p_{y,x}$

More Proper Gibbs Sampling for Motifs

Provided for interest, following the paper Siddharthan, R., Siggia, E.D. and Van Nimwegen, E. (2005) “PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny.” PLoS Computational Biology, 1(7), e67.

Probabilistic model

  • We extend the model so that O and W are also random variables, so we have the distribution $\Pr(S,W,O)$
    • Then we want to sample from $\Pr(O|S)$ (marginalizing over all values of $W$)
  • A random probability matrix W is generated (e.g. from a uniform distribution over all matrices)
  • In each sequence i, a window $o_i$ of length L is chosen (uniformly from m-L+1 possibilities)
  • In the window, the sequence is generated according to profile W, and outside the window according to the null hypothesis (as before)
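
A minimal sketch of sampling data from this generative model (Python names are illustrative; the uniform distribution over each motif column is implemented as a Dirichlet(1,1,1,1) draw):

    import random
    import numpy as np

    BASES = "ACGT"

    def generate_data(n, m, L, q):
        """Sample (S, W, O): one random profile W shared by all sequences, a window o_i per sequence."""
        W = np.random.dirichlet(np.ones(4), size=L)    # each motif column uniform over probability vectors
        S, O = [], []
        for _ in range(n):
            o = random.randrange(m - L + 1)            # window position, uniform over m - L + 1 choices
            seq = []
            for j in range(m):
                probs = W[j - o] if o <= j < o + L else [q[b] for b in BASES]
                seq.append(random.choices(BASES, weights=probs)[0])
            S.append("".join(seq))
            O.append(o)
        return S, W, O

    q = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    S, W, O = generate_data(n=5, m=20, L=6, q=q)
    print(S, O)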

Gibbs sampling

  • Given S, we sample O ($O^{(0)}, O^{(1)}, \dots$) (if needed, from $O^{(t)}$ we can construct the matrix $W^{(t)}$)
    • start with random windows $O^{(0)}$
    • in step $t+1$, choose one sequence $i$ and for all positions $o'_i$ compute $\Pr(o'_i|O^{(t)}_{-i},S)$, where $O_{-i}=(o_1,\dots,o_{i-1},o_{i+1},\dots,o_n)$, i.e. all occurrence positions except the $i$-th.
    • randomly choose one $o'_i$ proportionally to these probabilities
    • $O^{(t+1)}$ is obtained from $O^{(t)}$ by replacing the position in sequence i with the newly chosen one
    • repeat many times
  • Converges to the target distribution $\Pr(O|S)$, but samples are not independent
  • Other possible steps in sampling: shift all windows by a constant left or right
  • Other possible model/algorithm extensions: add a distribution over L and randomly increase/decrease L, allow skipping the motif in some sequences, search for multiple motifs at once, …

How to compute $\Pr(o_i|O_{-i},S)$?

  • We are not interested in normalization constants; we can easily normalize by summing over all $o'_i$
  • $\Pr(o_i|O_{-i},S) = \Pr(O|S) / \Pr(O_{-i}|S)$, but the denominator is a constant
  • $\Pr(O|S) = \Pr(S|O)\Pr(O)/\Pr(S)$, where $\Pr(S)=\sum_{O'} \Pr(S|O')\Pr(O')$
    • Again, the denominator does not interest us (normalization constant)
    • $\Pr(O)$ is also a constant (uniform distribution of window positions)
  • Thus, $\Pr(o_i|O_{-i},S)$ is proportional to $\Pr(S|O)$
  • We can easily compute $\Pr(S|W,O)$, but we need to “eliminate” W, see below
  • We try all possible values $o'_i$, compute the probability $\Pr(S|O)$, and sample proportionally to it

Further details of computing $\Pr(S|O)$:

  • Let $S_o$ denote the parts of the sequences inside the windows and $S_{-o}$ the parts outside them. We have $\Pr(S|O) = \Pr(S_o|O)\Pr(S_{-o}|O)$
  • $\Pr(S_{-o}|O)$ can be easily computed (does not depend on W)
  • $\Pr(S_o|O) = \int \Pr(S_o|O,W)\Pr(W)dW$ where the integral is over values of matrix $W$ where $w_{a,i}\ge 0$ and $\sum_a w_{a,i} = 1\,$
  • $\Pr(W)$ is a constant (uniform distribution; not a probability but a density), $\Pr(S_o|O,W) = \prod_{i=1}^L \prod_a (w_{a,i})^{n_{a,i}}$, where $n_{a,i}$ is the number of occurrences of base $a$ at position $i$ in windows $o_1\dots o_n$
  • $\Pr(S_o|O) = \prod_{i=1}^L \frac{3!\,\prod_a n_{a,i}!}{(n+3)!}$ (stated without proof)
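
A minimal sketch (illustrative names) of evaluating this final formula in log space; in the sampler, one would evaluate it for each candidate $o'_i$, combine it with the corresponding background term $\Pr(S_{-o}|O)$ for the bases outside the windows, and sample proportionally.

    from math import lgamma

    BASES = "ACGT"

    def log_prob_windows(seqs, O, L):
        """log Pr(S_o | O) = sum over motif columns i of log( 3! * prod_a n_{a,i}! / (n+3)! )."""
        n = len(seqs)
        total = 0.0
        for i in range(L):
            counts = {b: 0 for b in BASES}
            for s, o in zip(seqs, O):
                counts[s[o + i]] += 1                                  # n_{a,i}: count of base a at motif column i
            total += lgamma(4) - lgamma(n + 4)                         # log 3! - log (n+3)!
            total += sum(lgamma(c + 1) for c in counts.values())       # log prod_a n_{a,i}!
        return total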