Stage: Stage-1
  Map Reduce
    Alias -> Map Operator Tree:
      dual
        TableScan
          alias: dual
          Select Operator
            expressions:
                  expr: '1'
                  type: string
                  expr: '2'
                  type: string
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: true
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.test

Stage: Stage-4
  Move Operator
    files:
        hdfs directory: true
        destination: hdfs://hadoop_namenode/tmp/hive-root/hive_2015-01-07_18-07-13_120_2026314954951095577/-ext-10000

Stage: Stage-0
  Move Operator
    tables:
        partition:
          day 20140101
        replace: true    -- overwrite
        table:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: default.test

The execution plan shows that the preceding MapReduce stage neither creates nor deletes the table's partition path. The operation that actually affects the data is the Move Operator.

(2) Find the source task class that corresponds to the Move Operator: org.apache.hadoop.hive.ql.exec.MoveTask.java
    Partition oldPart = getPartition(tbl, partSpec, false);
    Path oldPartPath = null;
    if (oldPart != null) {
      // Location defined for the partition; in our example, /{warehouse}/test/20140101/
      oldPartPath = oldPart.getDataLocation();
    }
    if (inheritTableSpecs) { // defaults to true
      Path partPath = new Path(tbl.getDataLocation(), Warehouse.makePartPath(partSpec));
      // Path built from the table's location and the partition value;
      // in our example, /{warehouse}/test/day=20140101/
      newPartPath = new Path(loadPath.toUri().getScheme(), loadPath.toUri().getAuthority(),
          partPath.toUri().getPath());
      if (oldPart != null) {
        /*
         * If we are moving the partition across filesystem boundaries
         * inherit from the table properties. Otherwise (same filesystem) use the
         * original partition location.
         *
         * See: HIVE-1707 and HIVE-2117 for background
         */
        /*
         * The fs.hdfs.impl.disable.cache parameter affects the two calls below: it decides
         * whether oldPartPathFS and loadPathFS refer to the same object, and therefore
         * which value newPartPath finally takes.
         */
        FileSystem oldPartPathFS = oldPartPath.getFileSystem(getConf()); // the partition's location
        FileSystem loadPathFS = loadPath.getFileSystem(getConf()); // the source data
        if (oldPartPathFS.equals(loadPathFS)) {
          newPartPath = oldPartPath;
        }
      }
    } else {
      newPartPath = oldPartPath;
    }

(4) Determining the destination path
The newPartPath variable determines the destination path of the data move, so once its value is known we know how the data will be moved.
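The branch structure of the partition-loading code can be condensed into a small decision function. The following is only a sketch under stated assumptions: the class NewPartPathSketch and the method resolve are hypothetical names, plain strings stand in for Hadoop Path objects, and the sameFileSystem flag stands in for the result of oldPartPathFS.equals(loadPathFS).

```java
// Simplified stand-in for the branch logic shown in the Hive code;
// not Hive's implementation. Strings replace Hadoop Path objects.
final class NewPartPathSketch {

    /**
     * @param inheritTableSpecs mirrors the flag of the same name (true by default)
     * @param oldPartPath       location of the existing partition, or null if none exists
     * @param tablePartPath     path built from the table location plus the partition value
     * @param sameFileSystem    stand-in for oldPartPathFS.equals(loadPathFS)
     * @return the destination the data would be moved to
     */
    static String resolve(boolean inheritTableSpecs, String oldPartPath,
                          String tablePartPath, boolean sameFileSystem) {
        if (!inheritTableSpecs) {
            // else-branch of the original code
            return oldPartPath;
        }
        if (oldPartPath != null && sameFileSystem) {
            // existing partition on the "same" filesystem: keep its location
            return oldPartPath;
        }
        // otherwise fall back to the table-derived path
        return tablePartPath;
    }

    public static void main(String[] args) {
        String oldLoc = "/warehouse/test/20140101";
        String tableLoc = "/warehouse/test/day=20140101";
        System.out.println(resolve(true, oldLoc, tableLoc, true));
        System.out.println(resolve(true, oldLoc, tableLoc, false));
    }
}
```

The only input that differs between the two calls in main is sameFileSystem, which is exactly the value the rest of the analysis tries to pin down.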
The getFileSystem method used on both paths is:
    public FileSystem getFileSystem(Configuration conf) throws IOException {
      return FileSystem.get(toUri(), conf);
    }
Continuing into FileSystem.get(toUri(), conf) leads to the class org.apache.hadoop.fs.FileSystem.java and its method public static FileSystem get(URI uri, Configuration conf){...}. The key lines are:
    String disableCacheName = String.format("fs.%s.impl.disable.cache", new Object[] { scheme });
    if (conf.getBoolean(disableCacheName, false))
When this setting is true, the FileSystem object is not taken from the cache.
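To see why a cache switch can change the result of equals(), here is a minimal sketch. FsCacheSketch and FakeFileSystem are hypothetical stand-ins, not Hadoop classes. The sketch assumes the cache is keyed by the URI string alone and that the handle does not override equals(), so the comparison falls back to object identity, which is what the explanation in this article implies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a filesystem-handle cache; not Hadoop's actual classes.
final class FsCacheSketch {

    // Stand-in for a filesystem handle. equals() is NOT overridden,
    // so two handles are equal only if they are the same object.
    static final class FakeFileSystem {
        final String uri;
        FakeFileSystem(String uri) { this.uri = uri; }
    }

    // Assumption: the cache is keyed by the URI string only.
    private static final Map<String, FakeFileSystem> CACHE = new HashMap<>();

    // disableCache plays the role of fs.hdfs.impl.disable.cache.
    static FakeFileSystem get(String uri, boolean disableCache) {
        if (disableCache) {
            // cache bypassed: every call yields a fresh object
            return new FakeFileSystem(uri);
        }
        // cache used: repeated calls for the same URI yield the same object
        return CACHE.computeIfAbsent(uri, FakeFileSystem::new);
    }

    public static void main(String[] args) {
        String uri = "hdfs://hadoop_namenode";
        System.out.println(get(uri, false).equals(get(uri, false))); // prints true
        System.out.println(get(uri, true).equals(get(uri, true)));   // prints false
    }
}
```

With the cache enabled, both lookups return one shared object and the comparison succeeds. With the cache disabled, two distinct objects are created for the same URI and the comparison fails even though they describe the same filesystem.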
Based on this analysis, look again at the following code from (3):
    if (oldPartPathFS.equals(loadPathFS)) {
      newPartPath = oldPartPath;
    }
If fs.hdfs.impl.disable.cache=false is set, oldPartPathFS.equals(loadPathFS) returns true and newPartPath takes the value of oldPartPath, which is /{warehouse}/test/20140101/ in the example above. Otherwise newPartPath keeps its value, /{warehouse}/test/day=20140101/.
Because we set fs.hdfs.impl.disable.cache=true in our operation, newPartPath ends up as /{warehouse}/test/day=20140101/.

(5) Moving the data: back in the class org.apache.hadoop.hive.ql.metadata.Hive.java
    /* We are running insert overwrite, so replace is true and the data is finally moved to newPartPath */
    if (replace) { // whether to replace the existing data
      Hive.replaceFiles(loadPath, newPartPath, oldPartPath, getConf());
    } else {
      FileSystem fs = tbl.getDataLocation().getFileSystem(conf);
      Hive.copyFiles(conf, loadPath, newPartPath, fs);
    }
Tracing into the method void replaceFiles(Path srcf, Path destf, Path oldPath, HiveConf conf){...}, look at what it does to the data. The method performs two main operations.
1. Delete the existing data at oldPath, which in our example is /{warehouse}/test/20140101/
    // use FsShell to move data to .Trash first rather than delete permanently
    if (fs2.exists(oldPath)) {
      FsShell fshell = new FsShell();
      fshell.setConf(conf);
      fshell.run(new String[]{"-rmr", oldPath.toString()});
    }
2. Rename the source data to the destination path to complete the move, srcf -> destf. In the example above, destf is /{warehouse}/test/day=20140101/ at this point.
    boolean b = renameFile(conf, srcs[0].getPath(), destf, fs, true);
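The effect of these two steps can be reproduced on a local temporary directory. This is only a sketch: ReplaceFilesSketch and its methods are hypothetical, it uses java.nio.file instead of HDFS, and it deletes permanently rather than moving data to .Trash. It assumes the destination does not exist once the old path has been removed, as in the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem imitation of "delete oldPath, then rename src to dest"; not Hive code.
final class ReplaceFilesSketch {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // reverse order so children are deleted before their parents
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    static void replaceFiles(Path src, Path dest, Path oldPath) throws IOException {
        deleteRecursively(oldPath);                 // step 1: remove the old data
        Files.createDirectories(dest.getParent());
        Files.move(src, dest);                      // step 2: rename source to destination
    }

    // Runs the scenario once and reports whether the old partition directory still exists.
    static boolean oldDirSurvives(boolean destIsOldPath) {
        try {
            Path warehouse = Files.createTempDirectory("warehouse");
            Path oldPath = warehouse.resolve("test").resolve("20140101");
            Path tableStylePath = warehouse.resolve("test").resolve("day=20140101");
            Path staging = warehouse.resolve("staging");
            Files.createDirectories(oldPath);
            Files.createDirectories(staging);
            Files.write(oldPath.resolve("old.txt"), "old".getBytes());
            Files.write(staging.resolve("new.txt"), "new".getBytes());

            Path dest = destIsOldPath ? oldPath : tableStylePath;
            replaceFiles(staging, dest, oldPath);

            boolean survives = Files.exists(oldPath);
            deleteRecursively(warehouse);           // clean up the temp directory
            return survives;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(oldDirSurvives(true));   // destination equals the old path
        System.out.println(oldDirSurvives(false));  // destination is the day=20140101 path
    }
}
```

When the destination equals the old path, the directory is deleted and immediately recreated by the rename, so the partition location keeps working. When the destination differs, the old directory is deleted and nothing puts it back.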
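The analysis above shows what happened. Because fs.hdfs.impl.disable.cache=true was set, the FileSystem object could no longer be obtained from the cache, so newPartPath could not take the value of oldPartPath and ended up as /{warehouse}/test/day=20140101/. As a result, a new directory was created on HDFS and the original data at oldPartPath was deleted, so the directory /{warehouse}/test/20140101/ and the files under it were removed. That is what caused the situation described above.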
Original article: http://blog.csdn.net/jyl1798/article/details/42521789