CAP5937 Bioinformatics

Home Work No. 3

Due: October 4, 2004

 

1. Given the following sequences, obtain a multiple alignment using the center star alignment. Use the BLOSSUM62 alignment matrix. You can use the program you have written for homework-2 to compute the alignment scores/alignments.

 

Name                                       Sequence

ENGC_NITEU/88-236       MFYRELGYPVLEISAKISVQPLIPLLSGQTSLLAQVLA

ENGC_PASMU/123-272      RIYQQIGYQTLMISALSGENMEKLTALFDEGTSIFVQVIH

ENGC_PORGI/88-241       AVYTAIGYPCCHVSAITGEGLPDLKSLLDGKLTLLAHVIA

ENGC_PROMA/97-245       KKLQTWGYQPIPISIVNGEGIQKLSARLKSKLGVLCPVIY

ENGC_PSEAE/120-269      NVYRTLGYPLIEVSAFNGLAMDELRGALDGHVSVFVQVVA

 

2. Semi-global alignment of two sequences is an alignment that ignores the end spaces which are spaces which appear before the first or at the end of the last character of the sequence in the sequences. For example, for the following alignment, the spaces in the second sequence are all end sequences whereas the space in the first sequence is not an end sequence.

                       C A G C A – C T T G G A T T C T C G G

                        -  -   - C A G CG T G G  -  -  -  -  -   -  -  -

With the scoring scheme, match=+1, mismatch=-1 and space (insert or delete)=-2, this alignment gives a score of  -19. But, the following alignment

                      C A G C A  C T T G G A T T C T C G G

                      C A G  C  -  -  -  -   -  G  - T  -  -  -  -  G G 

gives a better global score of -12. But, the first alignment is a better one for the biologists.

 

Write a modified dynamic programming algorithm to obtain an optimal semi-global alignment. Note your algorithm must incorporate affine gap penalty for internal gaps but must ignore the end spaces.

 

 

3. Tandem Repeats. Let P be a pattern of length n and T a text of length m. Let Pm be the concatenation of P with itself m times, so Pm has length mn. We want to compute a local alignment between Pm and T. That will find an interval in T that has the best global alignment (according to standard alignment criteria) with some tandem repeat of P. The particular problem arises in studying the secondary structure of proteins that form what is called a coiled coil. In that context, P represents a motif or domain (a pattern for our purposes) that can repeat in the protein an unknown number of times, and T represents the protein. Local alignment between Pm and T picks out an interval of T that “optimally” consists of tandem repeats of the motif (with errors allowed). If Pm is explicitly created, then standard local alignment will solve the problem in O(nm2)  time. But because Pm consists of identical copies of P, an O(nm)-time solution is possible. The method essentially simulates what the dynamic programming algorithm for local alignment would do if it were executed with Pm and T explicitly. Below we outline the method.

 

The dynamic Programming algorithm will fill an m+1 by n+1 table V, whose rows are numbered 0 to n, and whose columns are numbered 0 to m. Row 0 and column 0 are initialized to all 0 entries. Then in each row i, from 1 to m, the algorithm does the following: It executes the standard local alignment recurrences in row i; it sets V(i,0) to V(i,n); and then it executes the standard local alignment recurrences in row i again. After completely filling each row, the algorithm selects the cell with largest V value, as in the standard solution to the local alignment problem.

 

Clearly, this algorithm takes O(nm) time. Prove that it correctly finds the value of the optimal local alignment between Pm and T. Then give the details of the traceback to construct the optimal local alignment. Discuss why P was (conceptually) expanded to Pm and not a longer or shorter string. (Gusfield, 11.32 pg. 247)

 

4. We say that a string C is a merge of strings A and B, if it is obtained by merging the characters of A and B in their original relative order together. For example, string “FLORIDA” is a merge of strings “FOR” and “LIDA”, string “JEANNE” is a merge of strings “JANE” and “EN”.  Present a dynamic programming algorithm that takes three strings A[1…n], B[1…m] and C[1…n+m] and tests whether C is a merge of A and B. What is the complexity?