Bioinformatics Algorithms and Data Structures.ppt
《Bioinformatics Algorithms and Data Structures.ppt》由会员分享,可在线阅读,更多相关《Bioinformatics Algorithms and Data Structures.ppt(43页珍藏版)》请在麦多课文档分享上搜索。
1、Bioinformatics Algorithms and Data Structures,Chapter 1: Exact String Matching Instructor: Dr. Rose January 14, 2003,Exact String Matching,Types of Abstract Problems: Pattern matching, i.e., find pattern P in string S Similarity comparison, i.e., what is the longest common substring in S1 and S2? Ca
2、n we find P P in S? We can think of P as a mutation of P. What are the regions of similarity in S1 and S2? We can also do this with mutations,Exact String Matching,Q: What is the underlying theme common to these abstract problems? A: Correlation, i.e., correlation between two signals, strings, etc.,
3、Exact String Matching,Q: What is the simplest way to compare two strings? A: Look for a mapping of one string into the other.,Exact String Matching,Given two strings S1 and S2, Where length(S1) = length(S2) : Start at the beginning of S1 and S2 Compare corresponding characters, i.e., S11 & S21, S12
4、& S22, etc Continue until either: All the characters in S1 have been matched or A mismatch is found,Exact String Matching,If there is a mismatch shift 1 character position along S2 and start over, e.g., compare S11& S22, S12& S23, etc Keep doing this until a match is found or the possible starting p
5、ositions in S2 are exhausted.,Exact String Matching,Example: S1 = adab, S2=abaracadabaraa b a r a c a d a b a r a 1: a d d != a 2: _ a a != b 3: _ a d d != r 4: _a a != r 5: _a d d != c 6: _ a a != c 7: _ a d a b Finally! Q: How many comparisons? A: 13, looks ok?,Exact String Matching,Example 2: S1
6、= aaaaab, length(S1) = 6 S2 = aaaaaaaaaaab, length(S2) = 12a a a a a a a a a a a b 1: a a a a a b b != a 2: _ a a a a a b b != a 3: _a a a a a b b != a 4: _ a a a a a b b != a 6: _a a a a a b b != a 7: _ a a a a a b b != a,Exact String Matching,Example 2 continued from previous slidea a a a a a a a
7、a a a b 8: _ a a a a a b Finally!Q: How many comparisons were made? A: 42 = 7 X 6 = (12 6 + 1) X 6= (N M + 1) X M Where length(S2) = N and length(S1) = M Q: Where did this come from? A: There are N M + 1 possible match positions in S2,Exact String Matching,Bottom line, the time complexity is Q(NM) O
8、bservation: Notice that many of the characters in S2 are involved in multiple comparisons. WHY? A: Because the nave approach doesnt learn from previous matches. By the time the first mismatch occurs, we know what the first 6 characters of S1 and S2 are.,Exact String Matching,Note: A smarter approach
9、 would not involve the first 6 characters of S2 in subsequent comparisons. Fast matching algorithms take advantage of this insight. Q: Where does this insight come from? A: Preprocessing either S1 or S2.,Exact String Matching,Insight: if a match fails1) dont make redundant comparisons2) skip to the
10、first next possible match position. Note: the next possible match position may not be 1 character shift away. Lets consider both of these ideas with respect to examples 1 and 2,Exact String Matching,Lets review example 2: a a a a a a a a a a a b S2 1: a a a a a b S1 b != a, we have seen the first 6
11、characters 2: _a a a a a b b != a, we already know the as match,we only need to try to match the b 3: _a a a a a b b != a, ditto 4: _ a a a a a b b != a, ditto 6: _a a a a a b b != a, ditto 7: _ a a a a a b b != a, ditto 8: _a a a a a b Finally! The number of comparisons is 12 instead of the previou
12、s 42,Exact String Matching,Lets review example 1: S1 = adab, S2=abaracadabaraa b a r a c a d a b a r a 1:a d d != b, we have seen the first 2 charactersThe next possible match must be at least two positions away 2: _ a d d != r, we have seen the first 4 chars of S2The next possible match must be at
13、least two positions away 3: _ a d d != c, we have seen the first 6 chars of S2The next possible match must be at least two positions away 4: _ a d a b Finally! Q: How many comparisons? A: 10. The previous approach took 13 comparisons,Preprocessing a String,Core IdeaFor each position i1 in string S,
14、identify the maximal substring that matches a prefix of S. Q: Why do we want to do this? A: We will use this information in two ways:1) This tells us how far to skip for the next possible match. (Recall example 1)2) Knowledge of prefix matches allows us to avoid redundant comparisons (Recall example
15、 2) Do we need to go back and review examples 1 and 2?,Preprocessing a String,Let M(Si) denote the maximal substring that matches a prefix of S at position i1 Example: S = aabcaabxaaz (from textbook) M(S2) = a M(S3) = M(S4) = M(S5) = aab,Preprocessing a String,Let Z(Si) denote the length of the maxi
- 1.请仔细阅读文档,确保文档完整性,对于不预览、不比对内容而直接下载带来的问题本站不予受理。
- 2.下载的文档,不会出现我们的网址水印。
- 3、该文档所得收入(下载+内容+预览)归上传者、原创作者;如果您是本文档原作者,请点此认领!既往收益都归您。
下载文档到电脑,查找使用更方便
2000 积分 0人已下载
下载 | 加入VIP,交流精品资源 |
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- BIOINFORMATICSALGORITHMSANDDATASTRUCTURESPPT
