Hadoop: Inverted Index

Posted: 2016-05-13 03:03:24

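Inverted index:
A conventional (forward) lookup goes from a file's location to the file, and then to the words inside it.
An inverted index works the other way around:
given a word, it returns the files in which the word appears and how often it appears in each.
This is what a search engine such as Baidu does: you enter a keyword, and the engine quickly
finds the documents on its servers that contain it, then ranks the results by frequency and other signals
(such as page click-through votes). The inverted index plays a key role in this process.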
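
To make the idea concrete before bringing in Hadoop, here is a minimal in-memory sketch in plain Java. The class name InMemoryInvertedIndex and the file names a.txt and b.txt are made up for illustration; it only shows the word -> (file -> count) shape that the job described in this post produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of the Hadoop job in this post.
class InMemoryInvertedIndex {

    // Turns "file -> text" into "word -> (file -> number of occurrences)".
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Same tokenizer as the job: splits on whitespace by default.
            StringTokenizer itr = new StringTokenizer(doc.getValue());
            while (itr.hasMoreTokens()) {
                index.computeIfAbsent(itr.nextToken(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello world hello"); // made-up sample data
        docs.put("b.txt", "hello hadoop");
        System.out.println(build(docs).get("hello")); // prints {a.txt=2, b.txt=1}
    }
}
```

The MapReduce version reaches the same result, but spreads the counting and merging across separate phases so it can run over files too large for one machine.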

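Building the index means splitting the words of several texts, counting them, recording where they occur, and merging the results.
The job has three phases: Map, Combiner, and Reduce.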
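I. Map phase:
1. Get the path of the file the current words come from, via FileSplit.
2. Use StringTokenizer to break each line of the file into tokens; by default it splits on whitespace.
3. Loop over the StringTokenizer, setting the key to word + path and the value (the term frequency) to 1.
Then write the key/value pair out through the context. Note that the same word appearing in different files is still
not grouped into one iterable, because the paths in the keys differ.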
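
The map steps above can be sketched without Hadoop as a plain function from one line of text to a list of key/value pairs. MapStepSketch and the path file1.txt are hypothetical names used only for this illustration; the real job gets the path from the FileSplit and writes through the context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map phase; no Hadoop classes involved.
class MapStepSketch {

    // Emits one ("word:path", "1") pair per token in the line.
    static List<Map.Entry<String, String>> map(String path, String line) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken() + ":" + path, "1"));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [MapReduce:file1.txt=1, is:file1.txt=1, simple:file1.txt=1]
        System.out.println(map("file1.txt", "MapReduce is simple"));
    }
}
```

Because the path is part of the key, the same word read from a second file would produce a different key and so would not be grouped with these pairs.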

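II. Combiner phase:
Before this phase runs, the values (term frequencies) whose keys are identical (same word + path, i.e. the same word in the same file) are collected into
one Iterable of values under that single key.
1. Loop over the values to total how often the word appears in that file.
2. Split the input key: the extracted word becomes the new key,
and the path plus the term frequency becomes the new value.
In the following Reduce phase, the values (path + frequency) that share a key (the word)
are grouped into one iterable, which makes the final merge straightforward.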
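
As a rough illustration of the two combiner steps, the sketch below sums the counts for one word + path key and then re-keys the result by word alone. CombineStepSketch and the path hdfs://host:9000/in/a.txt are assumptions made for the example. Like the job's own code, it splits at the first ":" and therefore assumes the word itself contains no colon.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the combiner phase; no Hadoop classes involved.
class CombineStepSketch {

    // Input:  key "word:path", values such as ["1", "1"].
    // Output: key "word", value "path:sum".
    static Map.Entry<String, String> combine(String key, List<String> values) {
        int sum = 0;
        for (String value : values) {
            sum += Integer.parseInt(value);
        }
        // First colon separates word from path (assumes no ":" in the word).
        int splitIndex = key.indexOf(':');
        String word = key.substring(0, splitIndex);
        String path = key.substring(splitIndex + 1);
        return new SimpleEntry<>(word, path + ":" + sum);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e =
                combine("hello:hdfs://host:9000/in/a.txt", Arrays.asList("1", "1"));
        // prints hello -> hdfs://host:9000/in/a.txt:2
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```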

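III. Reduce phase:
This phase is essentially a merge: iterate over the values for each key,
append every path + frequency entry to a single value, and then write it out.
The output key is the same key that came in (the word).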
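
The merge itself is small enough to show on its own. In this sketch, ReduceStepSketch and the entries a.txt:2 and b.txt:1 are invented for illustration; the function simply joins every path + frequency entry for one word, ending each with ";".

```java
import java.util.Arrays;

// Hypothetical stand-in for the reduce phase; no Hadoop classes involved.
class ReduceStepSketch {

    // Concatenates all "path:count" entries for one word, each followed by ";".
    static String reduce(Iterable<String> values) {
        StringBuilder fileList = new StringBuilder();
        for (String value : values) {
            fileList.append(value).append(';');
        }
        return fileList.toString();
    }

    public static void main(String[] args) {
        // prints a.txt:2;b.txt:1;
        System.out.println(reduce(Arrays.asList("a.txt:2", "b.txt:1")));
    }
}
```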

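Walkthrough:
1. The input is four files, each containing a few sentences:
[Image: contents of the input files]

2. The output:
[Image: job output]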

Code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text>{
        private Text keyInfo = new Text();  // holds the word combined with the file URI
        private Text valueInfo = new Text(); // holds the term frequency
        private FileSplit split;  // the split the current record belongs to
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit that this <key,value> pair belongs to.
            split = (FileSplit) context.getInputSplit();
            System.out.println("******split======"+split);
            StringTokenizer itr = new StringTokenizer( value.toString());
            while( itr.hasMoreTokens() ){
                // The key is made up of the word and the file URI.
                keyInfo.set( itr.nextToken()+":"+split.getPath().toString());
                // The term frequency starts at 1.
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text>{

        private Text info = new Text();

        // Note: Hadoop treats a combiner as an optional optimization and may run it
        // zero or more times. This class changes the key, so the job relies on it
        // running exactly once for each word + URI key.
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            // Total the term frequency for this word + URI key.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString() );
            }
            // The first ":" separates the word from the URI
            // (assumes the word itself contains no ":").
            int splitIndex = key.toString().indexOf(":");

            // The new value is made up of the URI and the term frequency.
            info.set( key.toString().substring( splitIndex + 1) +":"+sum );

            // The new key is the word alone.
            key.set( key.toString().substring(0,splitIndex));
            System.out.println("Combiner-----key===="+key.toString());
            System.out.println("-----------------------");
            System.out.println("Combiner------info==="+info.toString());
            context.write(key, info);
        }
    }


    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text>{

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {

            // Build the document list: every "URI:frequency" entry followed by ";".
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }

    }

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();

            Job job = Job.getInstance(conf,"InvertedIndex");
            job.setJarByClass(InvertedIndex.class);

            // The mapper turns each input <key,value> pair into intermediate results.
            job.setMapperClass(InvertedIndexMapper.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Input directory, and a fresh timestamped output directory for each run.
            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.128:9000/daopai2/"));
            FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.61.128:9000/outdaopai1/"+System.currentTimeMillis()+"/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IllegalStateException | IllegalArgumentException | ClassNotFoundException
                | IOException | InterruptedException e) {
            e.printStackTrace();
            // Report failure instead of falling through with exit code 0.
            System.exit(1);
        }

    }
}
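
Once the job has run, each output line should hold a word, a separator, and the merged list of URI:frequency entries. The sketch below shows one way such a line might be read back into a map. IndexLineParser and the sample URIs are hypothetical. It assumes the default tab separator between key and value, that every entry has the URI:frequency form (that is, the combiner ran), and that no URI contains ";". Since the URIs themselves contain colons, the frequency is taken from after the last ":".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one line of the job's text output.
class IndexLineParser {

    // Parses "word<TAB>uri:count;uri:count;" into uri -> count.
    static Map<String, Integer> parsePostings(String line) {
        String postings = line.substring(line.indexOf('\t') + 1);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String posting : postings.split(";")) {
            if (posting.isEmpty()) {
                continue;
            }
            // URIs such as hdfs://host:9000/... contain ":", so split at the last one.
            int countStart = posting.lastIndexOf(':');
            result.put(posting.substring(0, countStart),
                       Integer.parseInt(posting.substring(countStart + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "hello\thdfs://host:9000/in/a.txt:2;hdfs://host:9000/in/b.txt:1;";
        // prints {hdfs://host:9000/in/a.txt=2, hdfs://host:9000/in/b.txt=1}
        System.out.println(parsePostings(line));
    }
}
```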

Original article: http://blog.csdn.net/young_so_nice/article/details/51340708
