
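The Elasticsearch analysis process: built-in character filters, analyzers, tokenizers, and token filters (there are an absurd number of them, which is great!)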

Date: 2019-08-24 00:36:15


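The analysis process

After a document is sent to Elasticsearch and before it is added to the inverted index, Elasticsearch processes it in four steps:

  • Character filtering: character filters transform the raw characters.
  • Tokenization: the tokenizer splits the text into one or more tokens.
  • Token filtering: token filters transform each token.
  • Token indexing: the resulting tokens are stored in the Lucene inverted index.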

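Overall flow:

[Figure: overall analysis flow]

The goal is tokenization that matches how people naturally read and search text.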
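The four steps can be sketched in plain Python to make the order of operations concrete. This is only a toy model, not how Elasticsearch or Lucene is implemented: the function names `analyze` and `index_tokens`, the toy filters, and the dictionary used as an inverted index are all assumptions made for illustration.

```python
from collections import defaultdict


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through character filters, a tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # step 1: character filtering
    tokens = tokenizer(text)              # step 2: tokenization
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # step 3: token filtering
    return tokens


def index_tokens(inverted_index, doc_id, tokens):
    """Step 4: record each token with the document id and token position."""
    for position, token in enumerate(tokens):
        inverted_index[token].append((doc_id, position))


# Toy components, chosen only for illustration.
char_filters = [lambda s: s.replace("&", " and ")]
tokenizer = str.split
token_filters = [lambda toks: [t.lower() for t in toks]]

inverted_index = defaultdict(list)
tokens = analyze("Tom & Jerry", char_filters, tokenizer, token_filters)
index_tokens(inverted_index, 1, tokens)
print(tokens)
print(dict(inverted_index))
```

Each stage only sees the output of the previous one, which is why a character filter can change what the tokenizer later splits on.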

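Built-in character filters

Elasticsearch provides three built-in character filters: the HTML strip character filter (html_strip), the mapping character filter (mapping), and the pattern replace character filter (pattern_replace).

HTML strip character filter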

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

 

Result

{
  "tokens" : [
    {
      "token" : """

I'm so happy!

""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

 

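Custom HTML strip character filter

The escaped_tags parameter lists HTML tags that the filter should leave in the text; here <b> tags are kept.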

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
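To see what the two html_strip configurations above do to the sample text, here is a rough Python approximation. It is a sketch under stated assumptions, not the real filter: `strip_html`, `BLOCK_TAGS`, and `TAG_RE` are hypothetical names, the block-tag list is deliberately incomplete, and the real filter handles many more cases (comments, scripts, malformed markup).

```python
import html
import re

# Assumption: a small, incomplete set of block-level tags that become newlines.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "tr", "td"}
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text, escaped_tags=()):
    """Approximate html_strip: drop tags, decode entities, keep escaped tags."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)                    # escaped tag: left untouched
        return "\n" if name in BLOCK_TAGS else ""    # block tag -> newline

    without_tags = TAG_RE.sub(replace_tag, text)
    return html.unescape(without_tags)               # &apos; -> '


sample = "<p>I&apos;m so <b>happy</b>!</p>"
default_result = strip_html(sample)
escaped_result = strip_html(sample, escaped_tags=["b"])
print(repr(default_result))
print(repr(escaped_result))
```

With the default settings the <b> tags disappear, while passing `escaped_tags=["b"]` keeps them, mirroring the difference between the built-in filter and `my_char_filter`.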

 

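Mapping character filter

The mapping character filter replaces every occurrence of a configured key with its value. If my_index still exists from the previous example, delete it (DELETE my_index) before running this request.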

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["苍井空 => 666", "武藤兰 => 888"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text":"苍井空热爱武藤兰,可惜苍井空后来结婚了"
}

 

Result

{
  "tokens" : [
    {
      "token" : "666热爱888,可惜666后来结婚了",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
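The replacement performed by the mapping character filter can be imitated with a few lines of Python. This is a minimal sketch, assuming rules in the same "key => value" form as the request above; `apply_mappings` is a hypothetical helper, and preferring the longest key is implemented here simply by sorting the keys before building the regular expression.

```python
import re


def apply_mappings(text, mappings):
    """Replace each mapped key in text with its value, longest key first."""
    table = {}
    for rule in mappings:
        key, value = rule.split("=>", 1)
        table[key.strip()] = value.strip()
    # Longer keys come first so the alternation prefers the longest match.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


mapped = apply_mappings(
    "苍井空热爱武藤兰,可惜苍井空后来结婚了",
    ["苍井空 => 666", "武藤兰 => 888"],
)
print(mapped)
```

Because the keyword tokenizer emits the whole input as one token, the single token in the result above is just this replaced string.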

 

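Pattern replace character filter

The pattern in this example matches a run of digits followed by a hyphen that is itself followed by a digit, and rewrites the hyphen as an underscore, so the standard tokenizer keeps the card number as a single token.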

PUT my_index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my_index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

 

Result

{
  "tokens" : [
    {
      "token" : "My",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123_456_789",
      "start_offset" : 18,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 4
    }
  ]
}
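The same substitution can be tried outside Elasticsearch with Python's `re` module. This is only a sketch: Python writes the group reference as `\g<1>` where the Java-style replacement above uses `$1`, and the `\w+` split is a crude stand-in for the standard tokenizer, assumed here just to show why the underscores keep the number together.

```python
import re

PATTERN = r"(\d+)-(?=\d)"   # digits, then a hyphen that is followed by a digit
REPLACEMENT = r"\g<1>_"     # keep the digits, turn the hyphen into an underscore

text = "My credit card is 123-456-789"
filtered = re.sub(PATTERN, REPLACEMENT, text)

# Crude approximation of the tokenizer: runs of letters, digits, and underscores.
approx_tokens = re.findall(r"\w+", filtered)

print(filtered)
print(approx_tokens)
```

The lookahead `(?=\d)` does not consume the following digit, which is what lets the second hyphen in 123-456-789 match as well.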

 

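Built-in analyzers

[Figure: built-in analyzers]

Built-in tokenizers

[Figure: built-in tokenizers]

UAX URL email tokenizer

The uax_url_email tokenizer works like the standard tokenizer, except that it keeps URLs and email addresses as single tokens. Sample text (an author line, a source URL, an email address, and a copyright notice), sent in the request as one string:

作者:一线码农
来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html
邮箱:22222@qq.com
版权声明:本文为博主原创文章,转载请附上博文链接!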

 

 

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text":"作者:一线码农来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html邮箱:22222@qq.com版权声明:本文为博主原创文章,转载请附上博文链接!"
}

 

 

Result

{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "一",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "线",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "农",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "来",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "源",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "未",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "知",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "原",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "文",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "https://www.cnblogs.com/Mc_HotHog/articles/1111111.html",
      "start_offset" : 15,
      "end_offset" : 70,
      "type" : "<URL>",
      "position" : 12
    },
    {
      "token" : "邮",
      "start_offset" : 70,
      "end_offset" : 71,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    },
    {
      "token" : "箱",
      "start_offset" : 71,
      "end_offset" : 72,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "22222@qq.com",
      "start_offset" : 73,
      "end_offset" : 85,
      "type" : "<EMAIL>",
      "position" : 15
    },
    {
      "token" : "版",
      "start_offset" : 85,
      "end_offset" : 86,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "权",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "声",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 18
    },
    {
      "token" : "明",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 19
    },
    {
      "token" : "本",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "文",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "为",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "博",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "主",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "原",
      "start_offset" : 95,
      "end_offset" : 96,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "创",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "文",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "章",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "转",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "载",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "请",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "附",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "上",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "博",
      "start_offset" : 105,
      "end_offset" : 106,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "文",
      "start_offset" : 106,
      "end_offset" : 107,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    },
    {
      "token" : "链",
      "start_offset" : 107,
      "end_offset" : 108,
      "type" : "<IDEOGRAPHIC>",
      "position" : 36
    },
    {
      "token" : "接",
      "start_offset" : 108,
      "end_offset" : 109,
      "type" : "<IDEOGRAPHIC>",
      "position" : 37
    }
  ]
}
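The shape of this output, with one token per Chinese character and the URL and email address kept whole, can be reproduced with a small regex-based sketch. It is only an approximation under stated assumptions: the real tokenizer follows the Unicode text segmentation rules, whereas `TOKEN_RE` and `tokenize` below are hypothetical names with deliberately simplified URL, email, and CJK patterns.

```python
import re

# Alternatives are tried in order, so URLs and emails win over plain words.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://[A-Za-z0-9._~/?#%&=+-]+)"
    r"|(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<IDEOGRAPHIC>[\u4e00-\u9fff])"   # one token per CJK character
    r"|(?P<ALPHANUM>[A-Za-z0-9]+)"
)


def tokenize(text):
    """Return (token, type) pairs; punctuation matches nothing and is skipped."""
    return [(m.group(0), m.lastgroup) for m in TOKEN_RE.finditer(text)]


sample = (
    "作者:一线码农来源:未知原文:"
    "https://www.cnblogs.com/Mc_HotHog/articles/1111111.html"
    "邮箱:22222@qq.com版权声明:本文为博主原创文章,转载请附上博文链接!"
)
tokens = tokenize(sample)
for position, (token, token_type) in enumerate(tokens):
    print(position, token, token_type)
```

The list index of each pair corresponds to the position field in the result above; offsets are not tracked in this sketch.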

 

Built-in token filters

[Figure: built-in token filters]

Learn more: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/index.html

 


Original article: https://www.cnblogs.com/Alexephor/p/11396724.html
