All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
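Approach:
1. Use a map to store the substrings that have already been scanned, and count them. The scan performs O(n) map operations, each costing O(log n) with std::map. The code is as follows: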
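#include <map>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        map<string, int> count;
        vector<string> result;
        int len = s.length();
        // i + 10 <= len so that the last 10-letter window is checked too
        for (int i = 0; i + 10 <= len; i++) {
            string str = s.substr(i, 10);
            map<string, int>::iterator it = count.find(str);
            if (it == count.end()) {
                count[str] = 1;              // first occurrence
            } else {
                if (it->second == 1) {
                    result.push_back(str);   // second occurrence: report once
                }
                it->second++;
            }
        }
        return result;
    }
};
However, this fails with an out-of-memory (memory limit exceeded) error.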
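2. Give each of A, G, C and T its own code. There are only four letters, so two bits per letter are enough. A sequence has ten characters, so 20 bits can represent any combination. An int has 32 bits, so a single int can hold a 10-character substring. Each time a new character is processed, the two highest bits of the 20 (the oldest character) must be cleared first. The code is as follows: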
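#include <map>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        const int subStrLen = 10;
        const int mask = 0x3ffff;   // low 18 bits: the 9 most recent characters
        map<int, int> count;
        map<char, int> cCode;
        cCode['A'] = 0;
        cCode['C'] = 1;
        cCode['G'] = 2;
        cCode['T'] = 3;
        vector<string> result;
        int len = s.length();
        int code = 0;
        // Encode the first 10-character window.
        if (len > subStrLen) {
            string str = s.substr(0, subStrLen);
            for (int i = 0; i < subStrLen; i++) {
                code <<= 2;
                code |= cCode[str[i]];
            }
            count[code] = 1;
        }
        // Slide the window one character at a time.
        for (int i = subStrLen; i < len; i++) {
            code &= mask;           // clear the top two bits (the oldest character)
            code <<= 2;
            code |= cCode[s[i]];
            count[code]++;
            if (count[code] == 2) { // report a sequence on its second occurrence only
                result.push_back(s.substr(i - subStrLen + 1, subStrLen));
            }
        }
        return result;
    }
};
On LeetCode, the code above takes 269 ms for me. Looking at other people's times, many are much faster than mine. I wonder whether anyone has a better method.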
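One possible direction, offered as a minimal sketch rather than a measured improvement: because every window fits in 20 bits, the counts can live in a flat table of 2^20 one-byte slots (about 1 MiB) instead of a map, which removes the tree lookups. The helpers baseCode and encodeWindow and the function findRepeatedFlat are hypothetical names introduced here, not part of the original solution, and the sketch assumes the input contains only the letters A, C, G and T.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: 2-bit code for one base.
// Assumes the input contains only 'A', 'C', 'G', 'T'.
inline int baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Hypothetical helper: pack the 10 characters starting at pos into 20 bits.
// The caller must ensure pos + 10 <= s.size().
inline int encodeWindow(const std::string& s, std::size_t pos) {
    int code = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        code = (code << 2) | baseCode(s[pos + i]);
    }
    return code;
}

// Same rolling 20-bit code as in approach 2, but the counts are kept in a
// flat table indexed by the code instead of a map<int, int>.
std::vector<std::string> findRepeatedFlat(const std::string& s) {
    const std::size_t subStrLen = 10;
    const int mask = 0x3ffff;  // low 18 bits: the 9 most recent characters
    std::vector<std::string> result;
    if (s.size() < subStrLen) return result;

    // One slot per possible window: 0 = never seen, 1 = seen once, 2 = already reported.
    std::vector<unsigned char> seen(1u << 20, 0);

    int code = encodeWindow(s, 0);
    seen[code] = 1;
    for (std::size_t i = subStrLen; i < s.size(); ++i) {
        // Drop the oldest character, then append the new one.
        code = ((code & mask) << 2) | baseCode(s[i]);
        if (seen[code] == 1) {
            result.push_back(s.substr(i - subStrLen + 1, subStrLen));
        }
        if (seen[code] < 2) ++seen[code];  // saturate so the slot never overflows
    }
    return result;
}
```

As a worked instance of the encoding, "AAAAACCCCC" becomes five 00 pairs followed by five 01 pairs, that is 0x155. Calling findRepeatedFlat on the example string from the problem statement returns "AAAAACCCCC" and "CCCCCAAAAA", in that order. Whether this actually runs faster on LeetCode has not been checked here.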
[LeetCode] Repeated DNA Sequences
Original article: http://blog.csdn.net/kangrydotnet/article/details/44979491