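I couldn't resist: I wanted to see how much faster Java would handle the task from the previous post, since compiled languages generally run much faster than scripting languages. Without further ado, here is the code: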
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Split {
    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();

        // Large (5,000,000-char) buffers reduce the number of disk reads and writes.
        try (BufferedReader read_line = new BufferedReader(new FileReader("head_10000000.vcf"), 5000000);
             BufferedWriter write_line = new BufferedWriter(new FileWriter("result.tsv"), 5000000)) {
            String current_line;
            while ((current_line = read_line.readLine()) != null) {
                // Skip VCF header lines. Checking inside the main loop avoids a
                // NullPointerException if the file ends with a header line.
                if (current_line.startsWith("#")) {
                    continue;
                }
                // Column 8 (index 7) of a VCF record is the INFO field.
                String[] split1 = current_line.split("\t");
                String info = split1[7];
                // Take the text after ";AF=" up to the next ";".
                // This assumes every record's INFO field contains ";AF=".
                String[] split2 = info.split(";AF=");
                String str1 = split2[1];
                String[] split3 = str1.split(";");
                // Append the AF value to the original line.
                write_line.write(current_line + " " + split3[0]);
                write_line.newLine();
            }
        } // Leaving the try block flushes and closes both streams.

        long endTime = System.currentTimeMillis();
        System.out.println("run time:" + (endTime - startTime) + "ms");
    }
}
Program output:
run time:47473ms
Checking the result:
$ wc -l result.tsv
10000000 result.tsv
$ sed -n '3435534p' result.tsv
2 29509274 rs114511873 C A 100 PASS AA=C;AN=2184;AVGPOST=0.9997;VT=SNP;THETA=0.0006;AC=14;SNPSOURCE=LOWCOV;LDAF=0.0065;ERATE=0.0003;RSQ=0.9798;AF=0.01;AFR_AF=0.03 0.01
$ sed -n '7546563p' result.tsv
3 84580386 rs191768644 T C 100 PASS RSQ=0.6088;AA=T;AN=2184;VT=SNP;AVGPOST=0.9991;SNPSOURCE=LOWCOV;AC=1;THETA=0.0007;ERATE=0.0002;LDAF=0.0008;AF=0.0005;AFR_AF=0.0020 0.0005
$ sed -n '987345p' result.tsv
1 74709013 rs185004386 A C 100 PASS AN=2184;LDAF=0.0018;THETA=0.0005;VT=SNP;AA=A;SNPSOURCE=LOWCOV;RSQ=0.7110;ERATE=0.0003;AVGPOST=0.9987;AC=3;AF=0.0014;ASN_AF=0.01 0.0014
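We checked the total line count of the file and spot-checked a few randomly chosen lines, and the results are correct. Compared with the running time of the R version from the previous post, this result is astonishing. The gap is enormous!
Time(writing the Java code + compiling + running) < Time(running the R script)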
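One caveat about the split chain in the program above: splitting on ";AF=" only works when AF is not the first key in the INFO field. The following is a minimal sketch of a more defensive extraction, not the method used for the timing above. The class name AfExtractor and the method extractAf are hypothetical names introduced for illustration, and the sample INFO string is a shortened version of the one shown in the check above.

```java
// Hypothetical helper, not part of the original program.
final class AfExtractor {
    private static final String KEY = "AF=";

    /** Returns the value of the AF key in a VCF INFO string, or null if it is absent. */
    static String extractAf(String info) {
        int from = 0;
        while (from <= info.length()) {
            // Find the end of the current ";"-separated field.
            int end = info.indexOf(';', from);
            if (end < 0) {
                end = info.length();
            }
            // Match only a field that begins with "AF=", so keys such as
            // "LDAF=" or "AFR_AF=" are skipped.
            if (info.startsWith(KEY, from)) {
                return info.substring(from + KEY.length(), end);
            }
            from = end + 1;
        }
        return null;
    }

    public static void main(String[] args) {
        String info = "AA=C;AN=2184;AC=14;LDAF=0.0065;AF=0.01;AFR_AF=0.03";
        System.out.println(extractAf(info)); // prints 0.01
    }
}
```

Scanning with indexOf also avoids allocating an array for every line, which might help a little on a file of this size, though that has not been measured here.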