All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
解题思路:
1、用map存储已经扫描过的子串,并对之计数。时间复杂度为O(n)。代码如下:
class Solution { public: vector<string> findRepeatedDnaSequences(string s) { map<string, int> count; vector<string> result; int len = s.length(); for(int i=0; i<len-10; i++){ string str = s.substr(i, 10); map<string, int>::iterator it=count.find(str); if(it!=count.end()){ count[str]=1; }else{ if(it->second==1){ result.push_back(str); } count[str]++; } } return result; } };但是会报内存溢出错误。
2、为AGCT分别编码,共有4种,故只需两位便能编码。共十个字符,只需20位就能表示任何一种组合。int类型32位,因此可以用一个int类型来存储10个字符串。每检查一个字符之后,需要将最高位置为0。代码如下:
class Solution { public: vector<string> findRepeatedDnaSequences(string s) { const int subStrLen = 10; const int mask = 0x3ffff; map<int, int> count; map<char, int> cCode; cCode['A']=0; cCode['C']=1; cCode['G']=2; cCode['T']=3; vector<string> result; int len = s.length(); int code=0; if(len>subStrLen){ string str = s.substr(0, subStrLen); for(int i=0; i<subStrLen; i++){ code <<= 2; code |= cCode[str[i]]; } count[code] = 1; } for(int i=subStrLen; i<len; i++){ code &= mask; //清空最高位 code <<= 2; code |= cCode[s[i]]; count[code]++; if(count[code]==2){ result.push_back(s.substr(i-subStrLen + 1, subStrLen)); } } return result; } };在leetCode中,我跑上面的代码需要269ms,看了很多其他人的时间,都比我快很多。不知各位有更好的方法没有。
[LeetCode] Repeated DNA Sequences
原文地址:http://blog.csdn.net/kangrydotnet/article/details/44979491