
Can GATK's BaseRecalibrator still perform base quality score recalibration without a standard SNP database?

Date: 2016-07-24 17:54:09


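GATK (Genome Analysis Toolkit) is a set of tools for SNP detection (SNP calling) developed by the Broad Institute in the United States. The base qualities reported after sequencing deviate from the true values because of the physical and chemical reactions involved in sequencing and because of flaws in the sequencer, and the BaseRecalibrator program was developed to correct them. A database of known, standard SNPs is a very important input to base quality recalibration, for example the dbSNP database for human. But if the genome under study belongs to a relatively new species that has no standard SNP database, is it still feasible to recalibrate base qualities? The answer is that recalibration is still necessary, and in that case you should use your existing data to construct a stand-in for the standard SNP database. The relevant description from the GATK website is reproduced below (original URL: https://software.broadinstitute.org/gatk/documentation/article?id=44).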

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
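To make the effect concrete, here is a deliberately simplified sketch of how unmasked SNP sites drag down the inferred quality. It assumes the empirical quality is just the Phred-scaled mismatch rate, Q = -10 * log10(mismatches / total); the actual recalibrator is more elaborate than this, and every count below is a hypothetical number chosen only for illustration.

```python
import math


def phred(mismatches: int, total: int) -> float:
    """Phred-scaled empirical quality: Q = -10 * log10(mismatch rate)."""
    return -10.0 * math.log10(mismatches / total)


# Hypothetical counts, not taken from any real data set.
total_bases = 1_000_000
machine_errors = 1_000   # genuine sequencing errors (rate 1e-3, about Q30)
snp_mismatches = 9_000   # correctly read bases that differ from the
                         # reference because they sit on real SNPs

# Known sites masked: bases at SNP positions are skipped entirely, so they
# are removed from both the mismatch count and the total.
q_masked = phred(machine_errors, total_bases - snp_mismatches)

# No known sites: every SNP mismatch is counted as a machine error.
q_unmasked = phred(machine_errors + snp_mismatches, total_bases)

print(f"masked:   Q{q_masked:.2f}")
print(f"unmasked: Q{q_unmasked:.2f}")
```

Under these assumed counts the same reads look roughly ten Phred units worse when the polymorphic sites are not masked, even though the sequencer made no additional errors.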

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
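The three steps above can be sketched as a small driver loop. This is only an outline under stated assumptions, not GATK's own procedure: `call_snps` is a hypothetical callable you would supply (it should recalibrate with the given known sites, or use the raw data when the set is empty, then call SNPs and return VCF text lines); "highest confidence" is approximated by a threshold on the VCF QUAL column; and "convergence" is taken to mean that the high-confidence site set stops changing between rounds. The names `min_qual` and `max_rounds` are likewise illustrative.

```python
from typing import Callable, Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def high_confidence_sites(vcf_lines: Iterable[str], min_qual: float) -> Set[Site]:
    """Return the sites whose VCF QUAL (6th column) is at least min_qual."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, qual = fields[0], fields[1], fields[5]
        if qual != "." and float(qual) >= min_qual:
            sites.add((chrom, int(pos)))
    return sites


def bootstrap_known_sites(
    call_snps: Callable[[Set[Site]], Iterable[str]],
    min_qual: float,
    max_rounds: int = 5,
) -> Tuple[Set[Site], int]:
    """Iterate call -> filter -> recalibrate until the site set is stable.

    call_snps is a hypothetical, user-supplied function (an assumption of
    this sketch): given the current known sites, it returns VCF lines.
    Returns the final site set and the number of rounds that were run.
    """
    known: Set[Site] = set()  # round 1 starts with no known sites
    for round_no in range(1, max_rounds + 1):
        new_known = high_confidence_sites(call_snps(known), min_qual)
        if new_known == known:
            return known, round_no  # unchanged since last round: converged
        known = new_known
    return known, max_rounds  # stopped at the round limit without converging
```

In practice `call_snps` would wrap whatever recalibration and variant-calling commands you run, and the final site set would be written out as a VCF file to feed to the recalibrator. A cap on the number of rounds is included because nothing guarantees that the site set stabilises.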



Original article: http://www.cnblogs.com/jiaobingke/p/5701013.html
