1, What is GATK?
The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data.
The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance.
Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
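2, How do I call SNPs with GATK?
SNP calling takes BAM files that have already been preprocessed; the preprocessing is covered in a separate post. The tool used is HaplotypeCaller. Suppose I have BAM files such as these four:
LC17-1_L005.sorted.rmp.rg.recal.bam
LC17-2_L008.sorted.rmp.rg.recal.bam
RC17-1_L003.sorted.rmp.rg.recal.bam
RC17-3_L004.sorted.rmp.rg.recal.bam
All of them have been processed to meet GATK's requirements, and all of them belong to sample C17. To call SNPs for sample C17, I pass every BAM file of that sample with its own -I. The command is: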
java -jar ./GenomeAnalysisTK.jar -nct 50 -T HaplotypeCaller -R RAP_cDAN.fasta \
-I LC17-1_L002.sorted.rmp.rg.recal.bam -I LC17-1_L005.sorted.rmp.rg.recal.bam \
-I LC17-2_L006.sorted.rmp.rg.recal.bam -I LC17-2_L008.sorted.rmp.rg.recal.bam \
-I LC17-3_L002.sorted.rmp.rg.recal.bam -I RC17-1_L003.sorted.rmp.rg.recal.bam \
-I RC17-2_L004.sorted.rmp.rg.recal.bam -I RC17-3_L004.sorted.rmp.rg.recal.bam \
-o gatk.vcf
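The lines above form a single command, which is why every line except the last ends with a backslash (line continuation). The tool selected with -T is GATK's HaplotypeCaller.
-R is followed by the reference sequence, each -I is followed by a BAM file (all of these BAM files belong to one sample), and -o is followed by the name of the output file.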
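HaplotypeCaller decides which reads belong to which sample from the SM field of the @RG lines in each BAM header, so it can be worth confirming that all the input files really carry the same sample name. The following is a minimal sketch of such a check. The function name list_samples and the example header are made up for illustration, and the header text is assumed to come from something like samtools view -H.

```shell
# list_samples (hypothetical helper): read SAM header text on stdin
# and print each distinct SM (sample) value found on @RG lines.
list_samples() {
    grep '^@RG' | tr '\t' '\n' | sed -n 's/^SM://p' | sort -u
}

# Demo with an invented header: two read groups, both tagged SM:C17.
printf '@HD\tVN:1.4\n@RG\tID:L005\tSM:C17\tPL:ILLUMINA\n@RG\tID:L003\tSM:C17\tPL:ILLUMINA\n' | list_samples
```

On a real file the header could be piped in with, for example, samtools view -H LC17-1_L005.sorted.rmp.rg.recal.bam | list_samples. If more than one name is printed across the inputs, HaplotypeCaller would treat them as different samples.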
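-nct sets the number of threads, but at the moment this does not actually run multithreaded for me, and only one CPU is used.
The result file is gatk.vcf.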
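With many BAM files, typing one -I per file by hand is error-prone. As a rough sketch, the same command line can be assembled from a list of files. The helper name build_gatk_cmd is hypothetical (it is not part of GATK), it only prints the command as a dry run, and it assumes file names without spaces.

```shell
# build_gatk_cmd (hypothetical helper): print a HaplotypeCaller command
# with one -I per BAM file. Usage: build_gatk_cmd REF OUT BAM [BAM...]
build_gatk_cmd() {
    ref=$1
    out=$2
    shift 2
    cmd="java -jar ./GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref"
    for bam in "$@"; do
        cmd="$cmd -I $bam"   # append one -I argument per input file
    done
    printf '%s\n' "$cmd -o $out"
}

# Dry run: prints the command instead of executing it.
build_gatk_cmd RAP_cDAN.fasta gatk.vcf \
    LC17-1_L005.sorted.rmp.rg.recal.bam \
    RC17-1_L003.sorted.rmp.rg.recal.bam
```

Calling it as build_gatk_cmd RAP_cDAN.fasta gatk.vcf *C17-*.sorted.rmp.rg.recal.bam would pick up every matching file in the current directory. Review the printed line before running it.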
Original article: http://www.cnblogs.com/freemao/p/3763885.html