标签:
PH525.4x第二周内容围绕GRange类的操作和使用Annotation进行数据关联两个主题展开,并展示了几个比较“炫”的功能。由于内容繁多,信息量大,故笔记之以便日后参考。
该课程的演示数据为ChIP-seq的实验数据,背景为人类肝细胞(cell line:HepG2和GM12878)中被ESRRA (estrogen related receptor alpha)绑定的基因片段。
在展示数据操作之前,首先检查bioconductor的版本号,不同版本的输出可能存在差异。
library(BiocInstaller) biocVersion() #[1] ‘3.0’
载入演示数据:
library(devtools) install_github("genomicsclass/ERBS") library(ERBS) data(GM12878) data(HepG2) class(HepG2) #[1] "GRanges" #attr(,"package") #[1] "GenomicRanges" HepG2 # GRanges object with 303 ranges and 7 metadata columns: # seqnames ranges strand | name score col signalValue pValue # <Rle> <IRanges> <Rle> | <numeric> <integer> <logical> <numeric> <numeric> # [1] chr2 [ 20335378, 20335787] * | <NA> 0 <NA> 58.251 75.899 # [2] chr20 [ 328285, 329145] * | <NA> 0 <NA> 10.808 69.685 # [3] chr6 [168135432, 168136587] * | <NA> 0 <NA> 17.103 54.311 # [4] chr19 [ 1244419, 1245304] * | <NA> 0 <NA> 12.427 43.855 # [5] chr11 [ 64071828, 64073069] * | <NA> 0 <NA> 10.85 40.977 # ... ... ... ... ... ... ... ... ... ... # [299] chr4 [ 1797182, 1797852] * | <NA> 0 <NA> 9.681 10.057 # [300] chr1 [110198573, 110199126] * | <NA> 0 <NA> 7.929 10.047 # [301] chr17 [ 17734052, 17734469] * | <NA> 0 <NA> 5.864 9.99 # [302] chr1 [ 48306453, 48306908] * | <NA> 0 <NA> 5.66 9.948 # [303] chr12 [123867207, 123867554] * | <NA> 0 <NA> 13.211 9.918 # qValue peak # <numeric> <integer> # [1] 6.143712e-72 195 # [2] 5.028065e-66 321 # [3] 7.930665e-51 930 # [4] 1.359756e-40 604 # [5] 7.333863e-38 492 # ... ... ... # [299] 1.423343e-08 402 # [300] 1.442076e-08 197 # [301] 1.638918e-08 227 # [302] 1.799414e-08 211 # [303] 1.921805e-08 163 # ------- # seqinfo: 93 sequences (1 circular) from hg19 genome
我们可以看见该HepG2数据有3个column为位置信息,7个为特征信息(metadata column)。我们可以直接使用granges提取所有位置信息或seqnames,ranges,strand提取相关信息,并支持IRanges操作,譬如start,end,width等;使用values或mcols函数可以提取metadata;提取row的方法同dataframe类操作或使用chr=="xxx"的形式索引。值得注意的是seqnames的输出并不是character类,有character需求的话需要转换。同时seqnames按原始order计算染色体位置出现频率,即同一个染色体位置如果分开于不相邻的row按不同染色体位置计数,需要直接计算每一个染色体位置出现的频率可以对原始数据order以后再提取或使用table函数。
seqnames(HepG2) # factor-Rle of length 303 with 292 runs # Lengths: 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 # Values : chr2 chr20 chr6 chr19 chr11 chr20 chr19 chr2 ... chr17 chr3 chr4 chr1 chr17 chr1 chr12 # Levels(93): chr1 chr2 chr3 chr4 chr5 ... chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249 seqnames(HepG2[order(HepG2)]) # factor-Rle of length 303 with 23 runs # Lengths: 18 38 15 4 8 24 14 11 ... 21 6 16 27 3 5 4 # Values : chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr17 chr18 chr19 chr20 chr21 chr22 chrX # Levels(93): chr1 chr2 chr3 chr4 chr5 ... chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
与GRanges类相关的是IRanges类的操作,以下代码一图读懂基本操作~
ir <- IRanges(5,10) # set up a plotting window so we can look at range operations plot(0,0,xlim=c(0,23),ylim=c(0,13),type="n",xlab="",ylab="",xaxt="n") axis(1,0:15) abline(v=0:14 + .5,col=rgb(0,0,0,.5)) # plot the original IRange plotir <- function(ir,i) { arrows(start(ir)-.5,i,end(ir)+.5,i,code=3,angle=90,lwd=3) } plotir(ir,1) # draw a red shadow for the original IRange polygon(c(start(ir)-.5,start(ir)-.5,end(ir)+.5,end(ir)+.5),c(-1,15,15,-1),col=rgb(1,0,0,.2),border=NA) # draw the different ranges plotir(shift(ir,-2), 2) plotir(narrow(ir, start=2), 3) plotir(narrow(ir, end=5), 4) plotir(flank(ir, width=3, start=TRUE, both=FALSE), 5) plotir(flank(ir, width=3, start=FALSE, both=FALSE), 6) plotir(flank(ir, width=3, start=TRUE, both=TRUE), 7) plotir(ir * 2, 8) plotir(ir * -2, 9) plotir(ir + 2, 10) plotir(ir - 2, 11) plotir(resize(ir, 1), 12) text(rep(15,12), 1:12, c("ir","shift(ir,-2)","narrow(ir,start=2)", "narrow(ir,end=5)", "flank(ir, start=T, both=F)", "flank(ir, start=F, both=T)", "flank(ir, start=T, both=T)", "ir * 2","ir * -2","ir + 2","ir - 2", "resize(ir, 1)"), pos=4)
GRanges类是IRanges类的拓展,除了IRanges类的基本特征之外,还需要提供染色体位置seqnames和正链还是反链strand两个关键信息(strand默认为“+”)。c函数可以合并两个GRanges类,gaps可以寻找未被原始序列覆盖的序列(补集),disjoint可以将原始序列分割成互不相交的序列,reduce输出整个序列覆盖的范围和gaps互补。%over%命令,如"HepG2 %over% GM12878",可以生成长度为HepG2长度的逻辑向量判断HepG2的各序列是否与GM12878存在交集。值得一提的是findOverlaps函数,输出Hits类对象:
fo <- findOverlaps(HepG2, GM12878) fo # Hits of length 75 # queryLength: 303 # subjectLength: 1873 # queryHits subjectHits # <integer> <integer> # 1 1 12 # 2 2 78 # 3 4 777 # 4 5 8 # 5 8 13 # ... ... ... # 71 285 621 # 72 287 174 # 73 291 1855 # 74 294 512 # 75 300 144 queryHits(fo) subjectHits(fo)
queryHits输出在HepG2中与GM12878重叠的序列,subjectHits反之。当然也可以用%over%。
HepG2 %over% GM12878 HepG2[HepG2 %over% GM12878]
除了findOverlaps,还可以使用distanceToNearest函数找到最近序列,输出类似findOverlaps,distance可以输出距离值。
distanceToNearest(HepG2[17],GM12878) # Hits of length 1 # queryLength: 1 # subjectLength: 1873 # queryHits subjectHits distance # <integer> <integer> <integer> # 1 1 945 2284 mcols(d)$distance
当然这里所有的操作都需要注意strand方向。
接着就看看我们的ERBS究竟绑了些什么基因。载入基因注释(annotation)数据Homo.sapiens并查看基因信息。
library(Homo.sapiens) ghs = genes(Homo.sapiens) genome(ghs)
我们捕捉在两种cell line,HepG2和GM12878,ERBS绑定一致的区域。
res = findOverlaps(HepG2,GM12878) erbs = HepG2[queryHits(res)] erbs = granges(erbs)
#Another way
erbs = intersect(HepG2, GM12878)
precede函数可以找出转录起始端在绑定位置上游的基因。
index <- precede(erbs, subject = ghs) ghs[index,]
更好的做法是首先对ghs使用resize函数收集转录起始端位置,再使用distanceToNearest函数,可以找出绑定位置即为转录起始端的基因和对距离进行筛选(precede命令不能做到)。
tssge <- resize(ghs, 1) d <- distanceToNearest(erbs, tssge) queryHits(d) dist <- values(d)$distance index <- subjectHits(d)[dist < 1000] tssge[index,] # GRanges object with 41 ranges and 1 metadata column: # seqnames ranges strand | GENEID # <Rle> <IRanges> <Rle> | <IntegerList> # 80023 chr20 [ 327370, 327370] + | 80023 # 2101 chr11 [ 64073044, 64073044] + | 2101 # 7086 chr3 [ 53290130, 53290130] - | 7086 # 5478 chr7 [ 44836241, 44836241] + | 5478 # 286262 chr9 [140095163, 140095163] - | 286262 # ... ... ... ... ... ... # 55090 chr17 [ 17380300, 17380300] + | 55090 # 11165 chr6 [ 34360457, 34360457] - | 11165 # 56181 chr1 [ 26146397, 26146397] + | 56181 # 347734 chr6 [ 44225283, 44225283] - | 347734 # 2948 chr1 [110198698, 110198698] + | 2948 # ------- # seqinfo: 93 sequences (1 circular) from hg19 genome
使用select函数对Homo.sapiens进行注释查询,select参数包括查询数据库,keys和对应的key type(关联注释的关键)及注释column。
首先我们看Homo.sapiens支持的key type。
keytypes(Homo.sapiens) # [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION" "ENTREZID" "PFAM" # [7] "IPI" "PROSITE" "ACCNUM" "ALIAS" "CHR" "CHRLOC" # [13] "CHRLOCEND" "ENZYME" "MAP" "PATH" "PMID" "REFSEQ" # [19] "SYMBOL" "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME" # [25] "UNIPROT" "GO" "EVIDENCE" "GOALL" "EVIDENCEALL" "ONTOLOGYALL" # [31] "OMIM" "UCSCKG" "GENEID" "TXID" "TXNAME" "EXONID" # [37] "EXONNAME" "CDSID" "CDSNAME"
我们使用GENEID进行关联,关联注释选择“SYMBOL”和“GENENAME”。注意keys一般接受character输入,需要转换。
res <- select(Homo.sapiens, keys = as.character(values(tssge[index, ])$GENEID), keytype = "GENEID", columns = c("SYMBOL", "GENENAME")) head(res) # GENEID SYMBOL GENENAME # 1 80023 NRSN2 neurensin 2 # 2 2101 ESRRA estrogen-related receptor alpha # 3 7086 TKT transketolase # 4 5478 PPIA peptidylprolyl isomerase A (cyclophilin A) # 5 286262 TPRN taperin # 6 79696 ZC2HC1C zinc finger, C2HC-type containing 1C
接着查找绑定的序列(motif),需要载入人类基因组数据库。注意在这里载入Hsapines数据和载入BSgenome.Hsapiens.UCSC.hg19是一样的(不是载入包,而是数据!)。提取单条染色体基因可以使用$。
library(BSgenome.Hsapiens.UCSC.hg19) Hsapines # Human genome # | # | organism: Homo sapiens (Human) # | provider: UCSC # | provider version: hg19 # | release date: Feb. 2009 # | release name: Genome Reference Consortium GRCh37 # | 93 sequences: # | chr1 chr2 chr3 chr4 # | chr5 chr6 chr7 chr8 # | chr9 chr10 chr11 chr12 # | chr13 chr14 chr15 chr16 # | chr17 chr18 chr19 chr20 # | ... ... ... ... # | chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 # | chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 # | chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 # | chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 # | chrUn_gl000249 # | (use ‘seqnames()‘ to see all the sequence names, use the ‘$‘ or ‘[[‘ operator to access a given # | sequence)
getSeq命令可以输出目标位置的基因序列以便对其进一步分析。
hepseq <- getSeq(Hsapiens, erbs) # A DNAStringSet instance of length 75 # width seq # [1] 410 GAGACAGGGTTTCACCATGTTGGCCAGGCTGGTTTCGAACTCCTGA...GAAGGAGATGCAACCTTCCAGGAAGCAGAAATGTTCAAGGACTCTC # [2] 861 TGGGAAGGACACAACTGAATGAGGCTGTGCAGAGGCGACAGATTCC...CCGTGTGACTCCAAGCAGAACCTCCAACCGTGTGTGTGTGTGTGTG # [3] 886 GTGAAGGCCCTGGAGTAGGCGGTGCGTACCCGGTGTCCCGAGGCCC...GTGCCGAATCTGCAGTGTTTTTGGCACCTCCGTGGGCACCTAGGCT # [4] 1242 CATCCTCCACCTTAACACTCAGCACCCTTAGAGATGCAAAGTTCCC...GGCCGCGATGTCCTTTTGTGTCCTACAAGCAGCCGGCGGCGCCGCC # [5] 477 AGGAATGAGCACCCACGGAGAAATGCAGTCCCGACCCCTTCCACCC...GTGATTGGACAAGGATGGTCAAAATGTATCCTTCTGGAATCTGAGC # ... ... ... # [71] 429 TGAGCCCCAACAAAACAAAAAAGCCAATCCTGCTGGTCTAAACTGG...CCATTCAACAAAGCACAGTATGCACCATGCAGTATTTTCTTGTGAT # [72] 551 GTGGCGGAAGGAAGCGGCGCGCTTGGCCGCCAGGAGCCGCTTGCGC...CAGACAAGGTCCAGGGACCTGGCGGCTGAGGGGACCCTCAAGGCTC # [73] 805 TACCAGGCCGAGGGATCTCCAAACGCAGCCAACTCCTCCCGGGGAC...GGAGAGGGTCCGCGCGGGTCTCCGGCCACTGCTGCGCGCGGCAAAG # [74] 1037 ATCCCCTCCGCCCTCCAGAGGGGACCACACTGGAGTAGTGCCCGGA...CGAGGCTCTACATGGCCTCGCCCGCGTTCCTCGCCCACTCACTCGC # [75] 554 TCCGGAGAAGAAGAAACGGGGGAAGAACTTTTCTCTTACGATCTGG...CGCGTGGGCGGGAAGTGCCGAGCGGCTGGGGACCGGCTCTAGGGAC
使用countPattern,vcountPattern,alphabetFrequency等函数可对某种pattern在序列中的出现进行统计分析。
vcountPattern("GC", hepseq) # [1] 34 75 94 139 28 52 47 79 9 109 187 58 32 21 113 18 23 83 110 58 142 33 106 40 89 # [26] 24 22 116 64 25 91 128 58 19 38 101 42 89 85 68 23 184 88 58 38 138 56 40 23 83 # [51] 101 90 43 20 47 104 62 71 133 62 144 67 42 119 82 127 78 27 47 99 22 64 98 121 51
待续...
HarvardX: PH525.4x Introduction to Bioconductor第二周笔记
标签:
原文地址:http://www.cnblogs.com/nutastray/p/4416020.html