标签:
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
非常好的思路: 转换成位操作。
首先考虑将ACGT进行二进制编码
A -> 00
C -> 01
G -> 10
T -> 11
在编码的情况下,每10位字符串的组合即为一个数字,且10位的字符串有20位;一般来说int有4个字节,32位,即可以用于对应一个10位的字符串。例如
ACGTACGTAC -> 00011011000110110001
AAAAAAAAAA -> 00000000000000000000
20位的二进制数,至多有2^20种组合,因此hash table的大小为2^20,即1024 * 1024,将hash table设计为bool hashTable[1024 * 1024];
vector<string> findRepeatedDnaSequences(string s) { int hashMap[1048576] = {0}; vector<string> ans; int len = s.size(),hashNum = 0; if (len < 11) return ans; for (int i = 0;i < 9;++i) hashNum = hashNum << 2 | (s[i] - ‘A‘ + 1) % 5; for (int i = 9;i < len;++i) if (hashMap[hashNum = (hashNum << 2 | (s[i] - ‘A‘ + 1) % 5) & 0xfffff]++ == 1) ans.push_back(s.substr(i-9,10)); return ans; }
LeetCode() Repeated DNA Sequences 看的非常的过瘾!
标签:
原文地址:http://www.cnblogs.com/yanqi110/p/4996001.html