手把手教你做关键词匹配项目（搜索引擎）---- 第二十天

时间：2014-09-04 16:55:59 阅读：249 评论：0 收藏：0 [点我收藏+]

标签：des style blog http color os io ar strong

客串：屌丝的坑人表单神器、数据库那点事儿

面向对象升华：面向对象的认识----新生的初识、面向对象的番外----思想的梦游篇（1）、面向对象的认识---如何找出类

负载均衡：负载均衡----概念认识篇、负载均衡----实现配置篇（Nginx）

吐槽：有人反馈了这样的一个信息，说该文章越到最后越难看懂，跟不上节奏，也有的人说小帅帅的能力怎么飙的那么快，是不是我比较蠢。也有的直接看文字，不看代码，代码太难懂了。

其实我这几天也一直在思考这个问题，所以没办法就去开展了一些面向对象的课程，希望对那些跟不上的有些帮助。其实说真的，读者不反馈的话，我只好按照我认为的小帅帅去开展课程了。

第二十天

起点：手把手教你做关键词匹配项目（搜索引擎）---- 第一天

回顾：手把手教你做关键词匹配项目（搜索引擎）---- 第十九天

话说小帅帅为了解决那个分词算法写出了初版，他拿给于老大看的时候，被要求重写了。

原因有以下几点：

1. 如何测试，测试数据呢？

2. Splitter是不是做了太多事情?

3. 连衣裙xxl裙连衣裙这种有重复词组怎么办？

小帅帅拿着这些问题，开始重构。

首先他发现了这点，中文、英文和中英文的判断，以及长度的计算，他把这个写成了类：

<?php

class UTF8 {

    /**
     * 检测是否utf8
     * @param $char
     * @return bool
     */
    public static function is($char){
        return (preg_match("/^([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){1}/",$char) ||
            preg_match("/([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){1}$/",$char) ||
            preg_match("/([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){2,}/",$char));
    }

    /**
     * 计算utf8字的个数
     * @param $char
     * @return float|int
     */
    public static function length($char) {

        if(self::is($char))
            return ceil(strlen($char)/3);
        return strlen($char);
    }

    /**
     * 检测是否为词组
     * @param $word
     * @return bool
     */
    public static function isPhrase($word){

        if(self::length($word)<=1)
            return false;
        return true;
    }

}

小帅帅又考虑到词典的来源有可能来自多个地方，比如我给的测试数据，这样不就是可以解决于老大说到无法测试的问题了，小帅帅把词典的来源抽成了个类，类如下：

<?php

class DBSegmentation {

    public $cid;

    /**
     * 获取类目下分词的词组数据
     * @return array
     */
    public function transferDictionary(){
        $ret = array();
        $sql = "select word from category_linklist where cid=‘$this->cid‘";
        $words = DB::makeArray($sql);
        foreach($words as $strWords){
            $words = explode(",",$strWords);

            foreach($words as $word){
                if(UTF8::isPhrase($word)){
                    $ret[] = $word;
                }
            }
        }
        return $ret;
    }
} 

class TestSegmentation {
    
    public function transferDictionary(){
        $words = array(
            "连衣裙,连衣",
            "XXL,xxl,加大,加大码",
            "X码,中码",
            "外套,衣,衣服,外衣,上衣",
            "女款,女士,女生,女性"
        );

        $ret = array();
        foreach($words as $strWords){
            $words = explode(",",$strWords);

            foreach($words as $word){
                if(UTF8::isPhrase($word)){
                    $ret[] = $word;
                }
            }
        }
        return $ret;

    }
}

那么Splitter 就专心分词把，代码如下：

class Splitter {

    public $keyword;
    private $dictionary = array();

    public function setDictionary($dictionary = array()){

        usort($dictionary,function($a,$b){
            return (UTF8::length($a)>UTF8::length($b))?1:-1;
        });

        $this->dictionary = $dictionary;
    }

    public function getDictionary(){
        return $this->dictionary;
    }

    /**
     * 把关键词拆分成词组或者单词
     * @return KeywordEntity $keywordEntity
     */
    public function split(){

        $remainKeyword = $this->keyword;

        $keywordEntity = new KeywordEntity($this->keyword);

        foreach($this->dictionary as $phrase){

            $matchTimes = preg_match_all("/$phrase/",$remainKeyword,$matches);
            if($matchTimes>0){
                $keywordEntity->addElement($phrase,$matchTimes);

                $remainKeyword = str_replace($phrase,"::",$remainKeyword);
            }
        }

        $remainKeywords = explode("::",$remainKeyword);
        foreach($remainKeywords as $splitWord){

            if(!empty($splitWord)){
                $keywordEntity->addElement($splitWord);
            }
        }

        return $keywordEntity;

    }

}


class KeywordEntity {

    public $keyword;
    public $elements = array();

    public function __construct($keyword){
        $this->keyword = $keyword;
    }

    public function addElement($word,$times=1){

        if(isset($this->elements[$word])){
            $this->elements[$word]->times += $times;
        }else
            $this->elements[] = new KeywordElement($word,$times);
    }

    /**
     * @desc 计算UTF8字符串权重
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return ROUND(strlen($element->word)*$element->times / strlen($this->keyword), 3);
    }
}


class KeywordElement {
    public $word;
    public $times;

    public function __construct($word,$times){
        $this->word = $word;
        $this->times = $times;
    }
}

他把算权重的也丢给了一个类专门去处理。

小帅帅写完之后，也顺手写了测试实例：

<?php

$segmentation = new TestSegmentation();

$splitter = new Splitter();
$splitter->setDictionary($segmentation->transferDictionary());
$splitter->keyword = "连衣裙xxl裙连衣裙";
$keywordEntity = $splitter->split();

var_dump($keywordEntity);

这样就算你的算法怎么改，它也能从容面对了。

小帅帅理解了这个，当你觉得类做的事情太多的时候，可以考虑下单一职责原则。

单一职责原则：一个类，只有一个引起它变化的原因。应该只有一个职责。每一个职责都是变化的一个轴线，如果一个类有一个以上的职责，这些职责就耦合在了一起。这会导致脆弱的设计。当一个职责发生变化时，可能会影响其它的职责。另外，多个职责耦合在一起，会影响复用性。例如：要实现逻辑和界面的分离。【来自百度百科】

当于老大提到是不是有其他分词算法的时候，我们能不能拿来用，小帅帅很高兴，因为现在它的代码是多么美好。

小帅帅如何玩转第三方分词扩展，请继续关注下回分解：手把手教你做关键词匹配项目（搜索引擎）---- 第二十一天

手把手教你做关键词匹配项目（搜索引擎）---- 第二十天

标签：des style blog http color os io ar strong

原文地址：http://www.cnblogs.com/oshine/p/3956322.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行