hadoop数据去重

时间：2014-06-16 13:22:07 阅读：309 评论：0 收藏：0 [点我收藏+]

"数据去重"主要是为了掌握和利用并行化思想来对数据进行有意义的筛选。统计大数据集上的数据种类个数、从网站日志中计算访问地等这些看似庞杂的任务都会涉及数据去重。下面就进入这个实例的MapReduce程序设计。

1.1 实例描述

　　对数据文件中的数据进行去重。数据文件中的每行都是一个数据。

　　样例输入如下所示：

　　1.txt 内容

　　2.txt内容

　　样例输出如下：

　　将测试数据上传到hdfs上/input目录下，同时确保输出目录不存在。

1.2 设计思路

　　数据去重的最终目标是让原始数据中出现次数超过一次的数据在输出文件中只出现一次。

　　Map输入key 为行号，value为行的内容（假设这是我们所需要取得值，当然如果不是，可是使用字符分割得到）

　　Map输出key为行的内容，value为NullWritable(无值)

　　Reduce输入Key为值，value为NullWritable

　　Reduce原样输出

　　中间可在map端使用combine

1.3 程序代码

package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedub {
	
	public static class Map extends Mapper<Object, Text, Text, NullWritable>{
		
		protected void map(Object key, Text value, Context context) 
				throws java.io.IOException ,InterruptedException {
			
			context.write(value, NullWritable.get());
		};
	}
	public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable>{
		
		protected void reduce(Text key, Iterable<NullWritable> values, Context context) 
				throws java.io.IOException ,InterruptedException {
			
			context.write(key, NullWritable.get());
		};
	}
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
	    if (otherArgs.length != 2) {
	    	System.err.println("Usage: Data Deduplication <in> <out>");
	    	System.exit(2);
        }
	    Job job = new Job(conf, "Data Duplication");
	    job.setJarByClass(Dedub.class);
	    
	    //设置Map、Combine和Reduce处理类
	    job.setMapperClass(Map.class);
	    job.setCombinerClass(Reduce.class);
	    job.setReducerClass(Reduce.class);
	    
	    //设置输出类型
	    job.setOutputKeyClass(Text.class);
	    job.setOutputValueClass(NullWritable.class);
	    
	    //设置输入和输出目录
	    FileInputFormat.addInputPath(job, new Path(args[0]));
	    FileOutputFormat.setOutputPath(job, new Path(args[1]));
	    
	    System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

4、设置程序输入参数，myeclipse设置。运行，得到结果：

14/06/15 22:01:32 INFO mapred.JobClient: map 100% reduce 100%
14/06/15 22:01:32 INFO mapred.JobClient: Job complete: job_local_0001
14/06/15 22:01:32 INFO mapred.JobClient: Counters: 19
14/06/15 22:01:32 INFO mapred.JobClient:   File Output Format Counters
14/06/15 22:01:32 INFO mapred.JobClient:     Bytes Written=9
14/06/15 22:01:32 INFO mapred.JobClient:   FileSystemCounters
14/06/15 22:01:32 INFO mapred.JobClient:     FILE_BYTES_READ=81479
14/06/15 22:01:32 INFO mapred.JobClient:     HDFS_BYTES_READ=43
14/06/15 22:01:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279482
14/06/15 22:01:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=9
14/06/15 22:01:32 INFO mapred.JobClient:   File Input Format Counters
14/06/15 22:01:32 INFO mapred.JobClient:     Bytes Read=17
14/06/15 22:01:32 INFO mapred.JobClient:   Map-Reduce Framework
14/06/15 22:01:32 INFO mapred.JobClient:     Map output materialized bytes=31
14/06/15 22:01:32 INFO mapred.JobClient:     Map input records=9
14/06/15 22:01:32 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/06/15 22:01:32 INFO mapred.JobClient:     Spilled Records=10
14/06/15 22:01:32 INFO mapred.JobClient:     Map output bytes=17
14/06/15 22:01:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=492109824
14/06/15 22:01:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=190
14/06/15 22:01:32 INFO mapred.JobClient:     Combine input records=9
14/06/15 22:01:32 INFO mapred.JobClient:     Reduce input records=5
14/06/15 22:01:32 INFO mapred.JobClient:     Reduce input groups=5
14/06/15 22:01:32 INFO mapred.JobClient:     Combine output records=5
14/06/15 22:01:32 INFO mapred.JobClient:     Reduce output records=5
14/06/15 22:01:32 INFO mapred.JobClient:     Map output records=9

hadoop数据去重,布布扣,bubuko.com

hadoop数据去重

标签：style class blog java ext color

原文地址：http://www.cnblogs.com/jsunday/p/3790023.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行