Hadoop: Writing an Imitation Search Engine

Date: 2016-05-12 11:32:45

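This post may run long. If you find it useful, I hope you will read it through and that it helps.
Before writing a search engine, here is the overall plan. It breaks down into three steps:
crawl data from the web, tokenize the collected data, and then retrieve results by matching keywords against the tokens. I will walk through each step in detail.
First, the effect I want to achieve: fetch titles and their corresponding URLs from the web, tokenize the titles, match the keywords a user enters against those tokens, and return the matching URLs.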

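I. Crawling the data:
At first I did a vertical crawl of a whole site, collecting each title and its a tag. Processing the data that way turned out to be too slow, so I simplified things and crawled only a single page.
1. Packages used for crawling:
[image: packages used for crawling]

2. Access the site by simulating a browser. If the response code is 200, the request succeeded and the page data can be downloaded.
3. Take the crawled data, that is, the page converted into a String and stored. Then use a regular expression to pick out the tags we need, here the a tags. The a tags are filtered and grouped so that only those with a link are kept; writing several capture groups in the regex makes it easy to pull out the title and its URL later.
4. Once we have the title and URL, replace characters such as "/" in the URL, because the URL will become the file name and the title the file content when stored in HDFS.
Crawler code:
Page download code: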

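    import java.io.BufferedInputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    public class DownLoadTool {
        // Download the page at addr and return its content as one String
        // (empty if the request fails or the status is not 200).
        public String downLoadUrl(String addr) {
            StringBuilder sb = new StringBuilder();
            try {
                URL url = new URL(addr);
                HttpURLConnection con = (HttpURLConnection) url.openConnection();
                con.setConnectTimeout(5000);
                con.setReadTimeout(5000);
                con.connect();
                if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    // The target page is assumed to be GBK-encoded.
                    // try-with-resources closes the scanner and the underlying stream.
                    try (Scanner sc = new Scanner(new BufferedInputStream(con.getInputStream()), "GBK")) {
                        while (sc.hasNextLine()) {
                            // Lines are joined without a separator, so a tag that is
                            // split across lines can still be matched by the regex.
                            sb.append(sc.nextLine());
                        }
                    }
                }
                con.disconnect();
            } catch (Exception e) {
                e.printStackTrace();
            }
            return sb.toString();
        }
    }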
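The downloader above hard-codes GBK as the page encoding. As a hedged sketch, the charset could instead be taken from the response's Content-Type header (which `HttpURLConnection.getContentType()` returns), falling back to GBK when none is declared. The names `CharsetSniffer` and `charsetOf` are made up for this illustration and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches the charset parameter of a Content-Type value, e.g. "text/html; charset=GBK"
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset if the JVM supports it, otherwise the fallback.
    static String charsetOf(String contentType, String fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    if (Charset.isSupported(m.group(1))) {
                        return m.group(1);
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name in the header: fall through to the fallback
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", "GBK"));
        System.out.println(charsetOf("text/html", "GBK"));
    }
}
```

The returned name could then be passed to the `Scanner` constructor in place of the fixed "GBK" string.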

Regex matching code:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts link titles and URLs from a downloaded page
public class TiltelDownLoad {
    // Earlier attempts at the pattern:
    //<\\s*a\\s+([^>]*)\\s*>([^>]*)</a>
    //<\\s*a\\s+href\\s*=\\s*\"?http://(.*?)(\"|>|\\s+)([^>]*)\\s*>([^>]*)</a>
    // Matches <a href="http://...">text</a>.
    // Group 1 is the URL without the "http://" prefix, group 4 is the link text.
    static String a_url = "<\\s*a\\s+href\\s*=\\s*\"?http://(.*?)(\"|>|\\s+)([^>]*)\\s*>(.+?)</a>";
    // Matches the value of an href attribute
    static String href_url = "href\\s*=\\s*\"?(.*?)(\"|>|\\s+)";
    // Matches the value of an alt attribute
    static String alt_url = "alt\\s*=\\s*\"?(.*?)(\"|>|\\s+)";

    // Return one "title<TAB>url" string for every a tag that has an http link
    public Set<String> getTilteLink(String html) {
        Set<String> result = new HashSet<String>();
        // Compile the regular expression into a Pattern
        Pattern p = Pattern.compile(a_url, Pattern.CASE_INSENSITIVE);
        // Create a Matcher over the page content
        Matcher matcher = p.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(4).trim() + "\t" + matcher.group(1).trim());
        }
        return result;
    }

    // Return every href attribute found in the given strings
    public Set<String> getTitleSrc(Set<String> tilteLinks) {
        Set<String> result = new HashSet<String>();
        // Compile the regular expression into a Pattern
        Pattern p = Pattern.compile(href_url, Pattern.CASE_INSENSITIVE);
        for (String tiltelLink : tilteLinks) {
            Matcher matcher = p.matcher(tiltelLink);
            while (matcher.find()) {
                result.add(matcher.group(0));
            }
        }
        return result;
    }
}
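To make the role of the capture groups concrete, here is a small sketch that applies the same a-tag pattern to a hand-written HTML fragment. The sample HTML and the class name `LinkRegexDemo` are invented for illustration. Note that the pattern only matches links that start with `http://`, and that group 1 therefore holds the URL without that prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRegexDemo {
    // Same pattern as a_url in the listing above
    static final Pattern A_TAG = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"?http://(.*?)(\"|>|\\s+)([^>]*)\\s*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns "title<TAB>url" for each linked a tag, in document order
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            result.add(m.group(4).trim() + "\t" + m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.sohu.com/a.shtml\" target=\"_blank\">Headline</a>"
                + "<a name=\"top\">no link</a>"
                + "<a href=\"http://sports.sohu.com/\">Sports</a>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

The a tag without an href is skipped, which is the filtering described in step 3.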

Uploading the data to HDFS:

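import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Test {
public static void main(String[] args) throws URISyntaxException, IOException {
    Configuration conf = new Configuration();
    URI uri = new URI("hdfs://192.168.61.128:9000");
    FileSystem hdfs = FileSystem.get(uri, conf);
    TiltelDownLoad pcl = new TiltelDownLoad();
    String addr = "http://www.sohu.com";
    DownLoadTool dlt = new DownLoadTool();
    String html = dlt.downLoadUrl(addr);
    Set<String> srcs = pcl.getTilteLink(html);
    for (String title : srcs) {
        // Each entry has the form "title<TAB>url"
        String[] sts = title.split("\t");
        if (sts.length < 2) {
            continue; // skip entries that lack a title or a URL
        }
        // The URL becomes the file name, so "/" is replaced with "-"
        String url = sts[1].replace("/", "-");
        // Note: "？" here is the full-width question mark, which the mapper restores later
        String url1 = url.replace("？", "#");
        String content = sts[0];
        Path dfs = new Path("hdfs://192.168.61.128:9000/sohu/1/" + url1 + ".txt");
        System.out.println(url + "-------" + content);
        // try-with-resources closes the stream so the title is actually flushed to HDFS
        try (FSDataOutputStream outputStream = hdfs.create(dfs)) {
            outputStream.write((content + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }
    hdfs.close();
}

}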
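One caveat with replacing "/" by "-": the mapping is not reversible when the URL itself contains "-", because the reverse step turns every "-" back into "/". As a hedged alternative (a swapped-in technique, not what the code above does), the URL could be percent-encoded with `java.net.URLEncoder`, which round-trips exactly. The class and method names below are hypothetical, and whether every tool in a given Hadoop version copes with "%" or "*" in file names is something to verify before relying on this.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlFileNames {
    // Encode a URL so that it contains no "/" or "?" and can serve as a file name
    static String toFileName(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Recover the original URL from the file name
    static String toUrl(String fileName) {
        try {
            return URLDecoder.decode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "news.sohu.com/a-b.shtml?id=1";
        String name = toFileName(url);
        System.out.println(name);
        System.out.println(toUrl(name).equals(url));
    }
}
```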

Part of the upload result:
[image: part of the upload result]
Contents of a single file:

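II. Tokenization:
We tokenize the content, that is, the titles, in preparation for the search matching later.
This process is much like building an inverted index on top of the word count example. The earlier version counted words and split on whitespace by default, but the text here mixes Chinese and English, so I brought in a Lucene analyzer.
1. Introduce tokenization in the map phase. The default is StringTokenizer itr = new StringTokenizer(value.toString()); we now need to replace it with the analyzer in order to tokenize the titles.
2. Once the tokens are produced, write out each token together with its URL and weight, where the weight is the number of occurrences.
token -> key; URL, weight -> value; the weight defaults to 1 (in the code, the mapper emits token:URL as the key and 1 as the value, and the Combiner regroups them into this form)
3. Two Reducer-style passes (a Combiner and a Reducer) total the weight of each token and merge the URLs. If one token maps to two URLs, they are separated by ";".
4. Finally, write the result to HDFS.
Code:
Imported packages:
[image: imported packages]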
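To see why whitespace splitting is not enough here, the following sketch runs `StringTokenizer` on an English phrase and on a Chinese title. The class name `WhitespaceSplitDemo` is made up for the illustration. Chinese text has no spaces between words, so the whole title comes back as a single token, which is useless as an index key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WhitespaceSplitDemo {
    // Split text the way the stock word count mapper does
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("hello hadoop world").size()); // three tokens
        System.out.println(split("我的万套别墅").size());        // one token: the whole title
    }
}
```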
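The data flow in steps 2 and 3 can be traced in plain Java without a cluster. The sketch below is an in-memory stand-in, not the MapReduce job itself: it counts each (token, URL) pair, as the map and combine passes do, and then regroups by token and joins the "URL:count" entries with ";", as the reduce pass does. The class name, the sample URLs, and the pre-tokenized English titles are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // docs maps a URL to the tokens of its title; the result maps token -> "URL:count;..."
    static Map<String, String> build(Map<String, List<String>> docs) {
        // "Map" + "combine": one counter per "token:URL" pair
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String token : doc.getValue()) {
                pairCounts.merge(token + ":" + doc.getKey(), 1, Integer::sum);
            }
        }
        // "Reduce": regroup by token and concatenate the postings
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            int cut = e.getKey().indexOf(':');
            String token = e.getKey().substring(0, cut);
            String posting = e.getKey().substring(cut + 1) + ":" + e.getValue() + ";";
            index.merge(token, posting, String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("a.com/1", Arrays.asList("villa", "sale", "villa"));
        docs.put("b.com/2", Arrays.asList("villa"));
        build(docs).forEach((token, postings) -> System.out.println(token + "\t" + postings));
    }
}
```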

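import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
// Assumed package for Lucene 5.x/6.x; from Lucene 7 it is org.apache.lucene.analysis.CharArraySet
import org.apache.lucene.analysis.util.CharArraySet;

public class InvertedIndex {
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text>{
        private Text keyInfo = new Text();   // holds the "token:URL" combination
        private Text valueInfo = new Text(); // holds the term frequency
        private FileSplit split;             // the split this record belongs to
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit that this <key,value> pair belongs to.
            split = (FileSplit) context.getInputSplit();
            // Recover the URL from the file name: drop the directory and the ".txt"
            // extension, then undo the replacements made at upload time. Note that a
            // "-" that was part of the original URL is also turned into "/".
            String path = split.getPath().toString();
            int indexone = path.lastIndexOf("/");
            int indextow = path.lastIndexOf(".");
            String path1 = path.substring(indexone + 1, indextow);
            String path2 = path1.replace("-", "/");
            String path3 = path2.replace("#", "？");

            // Custom stop words
            String st = value.toString();
            String[] self_stop_words = { "了", "，", "：", "," };
            CharArraySet cas = new CharArraySet(0, true);
            for (int i = 0; i < self_stop_words.length; i++) {
                cas.add(self_stop_words[i]);
            }
            // Add the analyzer's default stop words
            Iterator<Object> itor = SmartChineseAnalyzer.getDefaultStopSet().iterator();
            while (itor.hasNext()) {
                cas.add(itor.next());
            }
            // Analyzer for mixed Chinese and English text (the other analyzers handled Chinese poorly)
            SmartChineseAnalyzer sca = new SmartChineseAnalyzer(cas);

            TokenStream ts = sca.tokenStream("field", st);
            CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                // Emit <"token:URL", "1"> for every token in the title
                keyInfo.set(ch.toString() + ":" + path3);
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
            ts.end();
            ts.close();
            sca.close();
        }
    }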

    // Caution: Hadoop treats a combiner as an optional optimization and may run it
    // zero or more times, while this job relies on it running exactly once per key.
    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text>{
        private Text info = new Text();
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            // Sum the term frequency for this "token:URL" key
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");

            // The new value is made up of the URL and the frequency: "URL:frequency"
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);

            // The new key is the token alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }


    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text>{
        private Text result = new Text();
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {

            // Build the document list: "URL:frequency;URL:frequency;..."
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }

    }

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "InvertedIndex");
            job.setJarByClass(InvertedIndex.class);
            // The map function turns each input <key,value> pair into intermediate results.
            job.setMapperClass(InvertedIndexMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Input: the title files uploaded earlier. The upload step wrote them under
            // /sohu/1/, so adjust this path if the files sit in a subdirectory.
            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.61.128:9000/sohu/"));
            // Output: a fresh timestamped directory, because the output path must not exist yet.
            FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.61.128:9000/sohuout/"+System.currentTimeMillis()+"/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IllegalStateException | IllegalArgumentException | ClassNotFoundException
                | IOException | InterruptedException e) {
            e.printStackTrace();
        }

    }
}
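One point worth checking in the job above: the mapper's key is "token:URL", and Hadoop assigns each record to a reduce task from that map output key, before the Combiner rewrites it. With the default hash partitioning and more than one reduce task, two records for the same token but different URLs can therefore end up on different reducers. Below is a minimal sketch of a partitioning rule that hashes only the token part. It is plain Java so that it runs without Hadoop; `partitionFor` is a hypothetical helper, and the formula mirrors what Hadoop's `HashPartitioner` applies to the whole key.

```java
class TokenPartitionSketch {
    // Choose a partition from the token part of a "token:URL" key only
    static int partitionFor(String compositeKey, int numPartitions) {
        int cut = compositeKey.indexOf(':');
        String token = cut < 0 ? compositeKey : compositeKey.substring(0, cut);
        // Mask the sign bit so the result is never negative
        return (token.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Same token, different URLs: both land in the same partition
        System.out.println(partitionFor("villa:a.com/1", 4) == partitionFor("villa:b.com/2", 4));
    }
}
```

In a real job this logic would live in a class extending `org.apache.hadoop.mapreduce.Partitioner<Text, Text>` and be registered with `job.setPartitionerClass(...)`.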

Part of the result:
[image: part of the index output]

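III. Search:
We enter a sentence and the system returns URLs.
The titles have already been tokenized, which amounts to having built an index file. We take that file and work on it.
1. In the map phase, tokenize the input sentence with the same analyzer approach used earlier.
2. Match the tokens from this pass against the key of each input record; that key is a token produced from the titles earlier. If they are equal, write out the key and value.
3. In the Reducer phase, clean up the value by dropping the weight, so that only the URL and the keyword are returned. Output the result.
Code: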
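The three steps above can be sketched in plain Java against a few index lines, without running a job. This is only an illustration under stated assumptions: the class name, the sample index lines, and the already tokenized query are invented, and each index line is assumed to have the form "token<TAB>URL:weight;URL:weight;", which is how a text output of the indexing job would be split into key and value at the first tab.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexLookupSketch {
    // Step 3: drop the ":weight" suffix from each "URL:weight" entry
    static String stripWeights(String postings) {
        StringBuilder urls = new StringBuilder();
        for (String entry : postings.split(";")) {
            // Cut at the last ":" so that a port inside the URL is kept
            int cut = entry.lastIndexOf(':');
            if (cut > 0) {
                urls.append(entry, 0, cut).append(';');
            }
        }
        return urls.toString();
    }

    // Step 2: keep only the index lines whose key equals one of the query tokens
    static Map<String, String> search(List<String> indexLines, List<String> queryTokens) {
        Map<String, String> hits = new LinkedHashMap<>();
        for (String line : indexLines) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2 && queryTokens.contains(kv[0])) {
                hits.put(kv[0], stripWeights(kv[1]));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
                "villa\ta.com/1:2;b.com:8080/2:1;",
                "sale\ta.com/1:1;");
        System.out.println(search(index, Arrays.asList("my", "villa")));
    }
}
```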

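import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
// Assumed package for Lucene 5.x/6.x; from Lucene 7 it is org.apache.lucene.analysis.CharArraySet
import org.apache.lucene.analysis.util.CharArraySet;

public class FindWord {

public static class FindMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Mapper<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        try {
            // The query sentence (hard-coded here)
            String text = "我的万套别墅";

            // Custom stop words
            String[] self_stop_words = { "的", "了", "呢", "，", "0", "：", ",", "是", "流" };
            CharArraySet cas = new CharArraySet(0, true);
            for (int i = 0; i < self_stop_words.length; i++) {
                cas.add(self_stop_words[i]);
            }

            // Add the analyzer's default stop words
            Iterator<Object> itor = SmartChineseAnalyzer.getDefaultStopSet().iterator();
            while (itor.hasNext()) {
                cas.add(itor.next());
            }

            // Analyzer for mixed Chinese and English text (the other analyzers handled Chinese poorly)
            SmartChineseAnalyzer sca = new SmartChineseAnalyzer(cas);

            TokenStream ts = sca.tokenStream("field", text);
            CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                // The input key is an index token; emit the record when it equals a query token
                if (key.toString().equals(ch.toString())) {
                    context.write(key, value);
                }
            }
            ts.end();
            ts.close();
            sca.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}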

public static class FindReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Build the URL list. A StringBuilder avoids the literal "null" prefix
        // that appending to a null String would produce.
        StringBuilder val = new StringBuilder();
        for (Text text : values) {
            // Each value looks like "URL:weight;URL:weight;"
            String[] sts = text.toString().split(";");
            for (String entry : sts) {
                // Cut at the last ":" so a port in the URL is not mistaken for the weight
                int idx = entry.lastIndexOf(":");
                if (idx > 0) {
                    val.append(entry, 0, idx).append(";");
                }
            }
        }
        // Write one line per keyword with all of its URLs
        context.write(key, new Text(val.toString()));
    }
}

public static void main(String[] args) {
    try {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "FindWord");
        job.setJarByClass(FindWord.class);
        // The map function turns each input <key,value> pair into intermediate results.
        job.setMapperClass(FindMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Read each index line as <token, postings>, split at the first tab
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setReducerClass(FindReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input: the index file written by the InvertedIndex job
        FileInputFormat.addInputPath(job,
                new Path("hdfs://192.168.61.128:9000/sohuout/1462889264074/part-r-00000"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.61.128:9000/sohufind/"+System.currentTimeMillis()+"/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (IllegalStateException | IllegalArgumentException | ClassNotFoundException
            | IOException | InterruptedException e) {
        e.printStackTrace();
    }
}

}

Part of the result:
[image: part of the search output]
I am new to this, so please bear with anything I got wrong!

Original article: http://blog.csdn.net/young_so_nice/article/details/51376146
