
UVA 11512 - GATTACA (suffix array: finding the longest repeated substring and its number of occurrences)

Date: 2015-08-26 20:14:45


The Institute of Bioinformatics and Medicine (IBM) of your country has been studying the DNA
sequences of several organisms, including the human one. Before analyzing the DNA of an organism,
the investigators must extract the DNA from the cells of the organism and decode it with a process
called “sequencing”.
A technique used to decode a DNA sequence is the “shotgun sequencing”. This technique is a
method applied to decode long DNA strands by cutting randomly many copies of the same strand to
generate smaller fragments, which are sequenced reading the DNA bases (A, C, G and T) with a special
machine, and re-assembled together using a special algorithm to build the entire sequence.
Normally, a DNA strand has many segments that repeat two or more times over the sequence (these
segments are called “repetitions”). The repetitions are not completely identified by the shotgun method
because the re-assembling process is not able to differentiate two identical fragments that are substrings
of two distinct repetitions.
The scientists of the institute successfully decoded the DNA sequences of numerous bacteria from
the same family using another sequencing method (much more expensive than the shotgun process)
that avoids the problem of repetitions. The biologists wonder whether applying the other method was
a waste of money, because they believe there is no large repeated fragment in the DNA of the
bacteria of the family studied.
The biologists contacted you to write a program that, given a DNA strand, finds the largest substring
that is repeated two or more times in the sequence.
Input
The first line of the input contains an integer T specifying the number of test cases (1 ≤ T ≤ 100). Each
test case consists of a single line of text that represents a DNA sequence S of length n (1 ≤ n ≤ 1000).
You can suppose that each sequence S only contains the letters ‘A’, ‘C’, ‘G’ and ‘T’.
Output
For each sequence in the input, print a single line specifying the largest substring of S that appears two
or more times repeated in S, followed by a space, and the number of occurrences of the substring in S.
If there are two or more substrings of maximal length that are repeated, you must choose the least
according to the lexicographic order.
If there is no repetition in S, print ‘No repetitions found!’.
Sample Input
6
GATTACA
GAGAGAG
GATTACAGATTACA
TGAC
TGTAC
TTGGAACC
Sample Output
A 3
GAGAG 2
GATTACA 2
No repetitions found!
T 2
A 2
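
The output rules above can be cross-checked with a brute-force reference. The sketch below is not from the original post: `longestRepeatBrute` and `checkSamples` are hypothetical names, and the approach (trying every length from longest to shortest and counting substrings in an ordered map) is far slower than a suffix array, so it is meant only for verifying small inputs. It assumes occurrences may overlap, which the sample `GAGAGAG` requires.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not part of the original post): brute-force reference
// for the output rules. Returns {substring, occurrences}; an empty substring
// stands for "No repetitions found!".
std::pair<std::string, int> longestRepeatBrute(const std::string& s) {
    const int n = static_cast<int>(s.size());
    // Try lengths from longest to shortest; the first length with a repeat wins.
    for (int len = n - 1; len >= 1; --len) {
        std::map<std::string, int> freq;  // ordered map: iteration is lexicographic
        for (int i = 0; i + len <= n; ++i) {
            ++freq[s.substr(i, len)];     // overlapping occurrences are counted
        }
        for (const auto& kv : freq) {
            if (kv.second >= 2) {
                return {kv.first, kv.second};  // smallest repeated string of this length
            }
        }
    }
    return {std::string(), 0};
}

// Checks the helper against the six sample cases from the statement.
void checkSamples() {
    assert(longestRepeatBrute("GATTACA") == std::make_pair(std::string("A"), 3));
    assert(longestRepeatBrute("GAGAGAG") == std::make_pair(std::string("GAGAG"), 2));
    assert(longestRepeatBrute("GATTACAGATTACA") == std::make_pair(std::string("GATTACA"), 2));
    assert(longestRepeatBrute("TGAC").first.empty());
    assert(longestRepeatBrute("TGTAC") == std::make_pair(std::string("T"), 2));
    assert(longestRepeatBrute("TTGGAACC") == std::make_pair(std::string("A"), 2));
}
```

Calling `checkSamples()` from a test driver should pass silently when the helper matches the sample output.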

Accepted (AC) code


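#include<stdio.h>
#include<string.h>
#include<algorithm>
#include<utility>
using namespace std;
char str[1010];   // input sequence (n <= 1000)
// sa: suffix array; Rank: rank of each suffix; rank2: scratch buffer;
// height: LCP of adjacent suffixes; c: counting-sort buckets;
// x, y: current key arrays; s: input mapped to integers plus a sentinel.
int sa[1010],Rank[1010],rank2[1010],height[1010],c[1010],*x,*y,s[1010];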
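// One stable counting-sort pass (despite the name, not a comparator): takes
// the suffixes listed in y, already ordered by second key, and places them
// into sa ordered by first key x. Keys lie in [0,sz).
void cmp(int n,int sz)
{
    int i;
    memset(c,0,sizeof(c));
    for(i=0;i<n;i++)
        c[x[y[i]]]++;
    for(i=1;i<sz;i++)
        c[i]+=c[i-1];
    for(i=n-1;i>=0;i--)
        sa[--c[x[y[i]]]]=y[i];
}
// Prefix-doubling suffix array construction. s[0..n-1] must end with a unique
// smallest sentinel; sz is an exclusive upper bound on the values in s.
void build_sa(int *s,int n,int sz)
{
    x=Rank,y=rank2;
    int i;
    for(i=0;i<n;i++)
        x[i]=s[i],y[i]=i;
    cmp(n,sz);
    int len;
    for(len=1;len<n;len<<=1)
    {
        // Order by second key: suffixes with no second half come first.
        int yid=0;
        for(i=n-len;i<n;i++)
            y[yid++]=i;
        for(i=0;i<n;i++)
            if(sa[i]>=len)
                y[yid++]=sa[i]-len;
        cmp(n,sz);
        // Re-rank: suffixes with equal (first, second) key pairs share a rank.
        swap(x,y);
        x[sa[0]]=yid=0;
        for(i=1;i<n;i++)
        {
            if(y[sa[i-1]]==y[sa[i]]&&sa[i-1]+len<n&&sa[i]+len<n&&y[sa[i-1]+len]==y[sa[i]+len])
                x[sa[i]]=yid;
            else
                x[sa[i]]=++yid;
        }
        sz=yid+1;
        if(sz>=n)
            break;   // all ranks are distinct, so the order is final
    }
    for(i=0;i<n;i++)
        Rank[i]=x[i];
}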
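// Kasai's algorithm: height[r] is the length of the longest common prefix of
// the suffixes ranked r-1 and r. The unique sentinel at s[n] guarantees that
// the comparison loop stops before running past the end of s.
void getHeight(int *s,int n)
{
    int k=0;
    for(int i=0;i<n;i++)
    {
        if(Rank[i]==0)
            continue;
        k=max(0,k-1);
        int j=sa[Rank[i]-1];
        while(s[i+k]==s[j+k])
            k++;
        height[Rank[i]]=k;
    }
}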
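int main()
{
    int t;
    scanf("%d",&t);
    while(t--)
    {
        int i,j;
        scanf("%1009s",str);
        int n=strlen(str);
        // Map 'A'..'Z' to 1..26; 0 is reserved for the sentinel stored at s[n].
        for(i=0;i<n;i++)
            s[i]=str[i]-'A'+1;
        s[n]=0;
        build_sa(s,n+1,27);
        getHeight(s,n);
        // The longest repeated substring has length max(height[1..n]).
        int ans=0;
        for(i=1;i<=n;i++)
            if(height[i]>ans)
                ans=height[i];
        if(ans==0)
        {
            printf("No repetitions found!\n");
            continue;
        }
        // Suffixes are sorted, so the first rank that reaches ans gives the
        // lexicographically smallest substring of that length.
        for(i=1;i<=n;i++)
            if(height[i]>=ans)
                break;
        // Each consecutive height >= ans adds one more suffix sharing the prefix.
        int cnt=1;
        for(j=i;j<=n&&height[j]>=ans;j++)
            cnt++;
        for(j=0;j<ans;j++)
            printf("%c",str[sa[i]+j]);
        printf(" %d\n",cnt);
    }
    return 0;
}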
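
The scan in `main` relies on three facts about sorted suffixes: the maximum adjacent LCP is the length of the longest repeated substring, the first rank that reaches this maximum gives the lexicographically smallest such substring, and a run of consecutive adjacent LCPs at that maximum contains one more suffix than it has LCP entries. The sketch below illustrates the same scan under a simplifying assumption: it swaps the prefix-doubling construction for a naively sorted suffix array (`std::sort` on suffix start positions), which is slower but adequate for n ≤ 1000. `longestRepeatNaiveSA` is a hypothetical name, not part of the posted solution.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: the same scan over adjacent-suffix LCPs as the posted
// code, on top of a naively sorted suffix array. Returns {substring, count};
// an empty substring stands for "No repetitions found!".
std::pair<std::string, int> longestRepeatNaiveSA(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    // Sort suffix start positions by comparing the suffixes directly.
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    // lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i].
    std::vector<int> lcp(n, 0);
    int best = 0;
    for (int i = 1; i < n; ++i) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        lcp[i] = k;
        best = std::max(best, k);
    }
    if (best == 0) return {std::string(), 0};

    // First rank reaching the maximum: lexicographically smallest candidate.
    int first = 1;
    while (first < n && lcp[first] < best) ++first;
    assert(first < n);  // best > 0 means some adjacent pair reaches it

    // The run of lcp values >= best covers every suffix starting with the answer.
    int count = 1;  // accounts for suffix sa[first - 1]
    for (int j = first; j < n && lcp[j] >= best; ++j) ++count;
    return {s.substr(sa[first], best), count};
}
```

For example, the sorted suffixes of `GATTACA` are `A`, `ACA`, `ATTACA`, `CA`, `GATTACA`, `TACA`, `TTACA`, with adjacent LCPs 1, 1, 0, 0, 0, 1. The maximum is 1, the first run of 1s spans two entries, and so the result is `A` with 3 occurrences.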


Copyright notice: this is an original article by the blogger and may not be reposted without the blogger's permission.


Original article: http://blog.csdn.net/yu_ch_sh/article/details/48008119
