Elasticsearch Usage Summary

Posted: 2017-10-26 10:22:20

Original source: https://www.2cto.com/database/201612/580142.html

Saved here for my own future reference.

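That is the official positioning of Elasticsearch. Put simply, Elasticsearch is a document-oriented NoSQL database that uses JSON as its document serialization format. What sets it apart is that it uses Lucene at its core to implement all indexing and search functionality, so the content of every document can be indexed, searched, sorted, and filtered. It also provides rich aggregation features for multi-dimensional analysis of the data. Externally it exposes a uniform REST API, meaning that client and server communicate over HTTP.
First, let's look at the basic storage concepts. They are compared with MySQL here to make the meaning of each concept clearer.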

Elasticsearch	MySQL
index (noun)	database
doc type	table
document	row
field	column
mapping	schema
query DSL (query language)	SQL

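Next, the concept of the inverted index (official explanation). The inverted index is the cornerstone of a search engine and the reason Elasticsearch can do fast full-text search. In short, two things are done to a document's content: it is split into terms, and a term-to-document list is built. For example, suppose we have the following two documents: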

1. {"content": "The quick brown fox jumped over the lazy dog"}
2. {"content": "Quick brown foxes leap over lazy dogs in summer"}

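Elasticsearch uses an analyzer to split the content field into terms, then builds the list below according to whether each term appears in each document; √ means the term appears in that document. To search for "quick brown", we only need to find which documents each term appears in. If several documents match, they can be scored by how well they match, so that the most relevant documents are found.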

Term	Doc_1	Doc_2
Quick		√
The	√
brown	√	√
dog	√
dogs		√
fox	√
foxes		√
in		√
jumped	√
lazy	√	√
leap		√
over	√	√
quick	√
summer		√
the	√
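
The two steps described above (splitting into terms, then building the term-to-document list) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `tokenize`, `build_inverted_index`, and `search` helpers are hypothetical names, the tokenizer is a plain regular expression that keeps case as the table does, and the score is just the number of matched query terms, which is far simpler than the relevance scoring Elasticsearch actually performs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens, keeping case to mirror the table above."""
    return re.findall(r"[A-Za-z]+", text)


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Score each document by how many of the query terms it contains."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # [(1, 2), (2, 1)]
```

Document 1 contains both query terms and document 2 only "brown", so document 1 ranks first.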

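Finally, let's return to the mapping concept mentioned above. Just as MySQL declares each column's data type, index type, and so on in the db schema, Elasticsearch uses a mapping for this. Typically, a mapping declares each field's data type, whether to build an inverted index for it, and which analyzer to use when building that index. By default, Elasticsearch builds an inverted index for all string data using the standard analyzer.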

View the mapping: GET https://localhost:9200/<index name>/_mapping
NOTE: here the index is blog and the doc type is test
{
    "blog": {
        "mappings": {
            "test": {
                "properties": {
                    "activity_type": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "address": {
                        "type": "string",
                        "analyzer": "ik_smart"
                    },
                    "happy_party_id": {
                        "type": "integer"
                    },
                    "last_update_time": {
                        "type": "date",
                        "format": "yyyy-MM-dd HH:mm:ss"
                    }
                }
            }
        }
    }
}

Data Insertion

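In MySQL, we must create the database and table and declare the db schema before inserting data. In Elasticsearch, data can be inserted directly: the system automatically creates the missing index and doc type and builds a mapping for the fields. The structure of semi-structured data usually changes dynamically, and we cannot know in advance which fields a document will contain; if every insert required creating the index, type, and mapping first, Elasticsearch would lose its advantage as a NoSQL store.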

Insert data directly: POST https://localhost:9200/blog/test
{
    "count": 5,
    "desc": "hello world"
}

View the mapping: GET https://localhost:9200/blog/_mapping
{
    "blog": {
        "mappings": {
            "test": {
                "properties": {
                    "count": {
                        "type": "long"
                    },
                    "desc": {
                        "type": "string"
                    }
                }
            }
        }
    }
}
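
For readers who want to issue the insert from a script, here is a minimal sketch using only the Python standard library. It builds the POST request without sending it; `build_index_request` is a hypothetical helper, and the base URL is simply copied from the examples above, so adjust it to your own node.

```python
import json
import urllib.request

BASE_URL = "https://localhost:9200"  # assumption: copied from the examples above


def build_index_request(index, doc_type, document, base_url=BASE_URL):
    """Build (but do not send) the POST request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("blog", "test", {"count": 5, "desc": "hello world"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```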

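This flexibility has limits, however. As mentioned above, by default Elasticsearch builds an inverted index for all string data using the standard analyzer, so what if some fields should not be indexed this way? Elasticsearch provides templates (the _template API, whose mappings can carry dynamic_templates) to set a default mapping for a group of indices: any index whose name matches the template's pattern gets the mapping defined in that template.
The following creates a template named blog. It automatically matches indices whose names start with "blog_", builds their mapping automatically, and adds a .raw sub-field to every string field in a document; that sub-field is not analyzed, so it keeps the exact original value. ELK does the same: if you look at the Elasticsearch templates in an ELK system, you will find one named logstash.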

Create the template: POST https://localhost:9200/_template/blog
{
    "template": "blog_*",
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "string_fields": {
                    "mapping": {
                        "type": "string",
                        "fields": {
                            "raw": {
                                "index": "not_analyzed",
                                "ignore_above": 256,
                                "type": "string"
                            }
                        }
                    },
                    "match_mapping_type": "string"
                }
            }],
            "properties": {
                "timestamp": {
                    "doc_values": true,
                    "type": "date"
                }
            },
            "_all": {
                "enabled": false
            }
        }
    }
}

Insert data directly: POST https://localhost:9200/blog_2016-12-25/test
{
    "count": 5,
    "desc": "hello world"
}
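
To make the name matching concrete, the sketch below approximates the "blog_*" pattern with shell-style wildcards from Python's standard library. It is an illustration only, not how Elasticsearch implements template matching; the `templates` dictionary and `matching_templates` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: template name -> index-name pattern.
templates = {"blog": "blog_*"}


def matching_templates(index_name, templates):
    """Return the names of the templates whose pattern matches the index name."""
    return [
        name
        for name, pattern in templates.items()
        if fnmatchcase(index_name, pattern)
    ]


print(matching_templates("blog_2016-12-25", templates))  # ['blog']
print(matching_templates("blog", templates))             # []
```

Note that the plain blog index used earlier does not match the pattern, so under this reading only the date-suffixed indices pick up the template's mapping.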

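Another topic related to insertion is bulk insertion. Elasticsearch provides the bulk API for batch operations: you can freely combine operations and data in one call and send them to Elasticsearch in a single request. The format looks like this.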

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

For example:
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
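
Because the body is newline-delimited JSON rather than a single JSON document, it is easy to get wrong when assembled by hand. The sketch below shows one way to serialize a list of operations into that format; `build_bulk_body` is a hypothetical helper, and the operations are the ones from the example above.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) triples into a bulk request body.

    source is None for actions that take no source line, such as delete.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


operations = [
    ("index", {"_index": "test", "_type": "type1", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_type": "type1", "_id": "2"}, None),
    ("create", {"_index": "test", "_type": "type1", "_id": "3"}, {"field1": "value3"}),
    ("update", {"_index": "test", "_type": "type1", "_id": "1"}, {"doc": {"field2": "value2"}}),
]
print(build_bulk_body(operations), end="")
```

The delete action contributes only its metadata line, which is why the four operations serialize to seven lines.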

If all operations target the same index and doc type, it is enough to specify the index and type in the REST API URL. An example of bulk insertion:

Bulk insert: POST https://localhost:9200/blog_2016-12-24/test/_bulk
{"index": {}}
{"count": 5, "desc": "hello world 111"}
{"index": {}}
{"count": 6, "desc": "hello world 222"}
{"index": {}}
{"count": 7, "desc": "hello world 333"}
{"index": {}}
{"count": 8, "desc": "hello world 444"}

View the inserted results: GET https://localhost:9200/blog_2016-12-24/test/_search

Data Querying

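Elasticsearch's query syntax (query DSL) has two parts, query and filter; the difference is whether results must match exactly or are matched by relevance. A filter asks "does the field value in the document equal the given value?", and the answer is either yes or no. A query asks "how well does the field value in the document match the given value?"; it computes a relevance score for each document against the given value and uses that score to rank the matching documents.
In practice, note two points. First, exact-value filters should be run on fields that are not analyzed, i.e. the .raw field added in the mapping above. Second, a filter is usually used to narrow the search scope and a query to do the searching, so the two are used together. The examples below show this; note the differences in how the three queries are written: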

1. Using only query:
POST https://localhost:9200/user_action/_search
The result is all documents whose page_name field contains wechat.
size specifies how many documents to return; by default Elasticsearch returns the first 10.
{
    "query": {
        "bool": {
            "must": [{
                "match": {
                    "page_name": "wechat"
                }
            },
            {
                "range": {
                    "timestamp": {
                        "gte": 1481218631,
                        "lte": 1481258231,
                        "format": "epoch_second"
                    }
                }
            }]
        }
    },
    "size": 2
}

2. Using only filter:
POST https://localhost:9200/user_action/_search
The result is all documents whose page_name field value equals "example.cn/wechat/view.html".
{
    "filter": {
        "bool": {
            "must": [{
                "term": {
                    "page_name.raw": "example.cn/wechat/view.html"
                }
            },
            {
                "range": {
                    "timestamp": {
                        "gte": 1481218631,
                        "lte": 1481258231,
                        "format": "epoch_second"
                    }
                }
            }]
        }
    },
    "size": 2
}

3. Using query and filter together:
POST https://localhost:9200/user_action/_search
The result is all documents whose page_name field value equals "job.gikoo.cn/wechat/view.html".
{
    "query": {
        "bool": {
            "filter": [{
                "bool": {
                    "must": [{
                        "term": {
                            "page_name.raw": "job.gikoo.cn/wechat/view.html"
                        }
                    },
                    {
                        "range": {
                            "timestamp": {
                                "gte": 1481218631,
                                "lte": 1481258231,
                                "format": "epoch_second"
                            }
                        }
                    }]
                }
            }]
        }
    },
    "size": 2
}
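
The "filter narrows, query searches" pattern can also be assembled programmatically. The following is a minimal sketch, not a client library: `build_search_body` is a hypothetical helper, the timestamp field name is taken from the examples above, and scored match clauses go under must while exact-value and range clauses go under filter of the same bool query.

```python
import json


def build_search_body(match=None, term=None, time_range=None, size=10):
    """Combine scored clauses (must) and yes/no clauses (filter) in one bool query."""
    bool_query = {}
    if match:
        # Scored clauses: these contribute to relevance ranking.
        bool_query["must"] = [
            {"match": {field: text}} for field, text in match.items()
        ]
    filters = [{"term": {field: value}} for field, value in (term or {}).items()]
    if time_range:
        start, end = time_range
        filters.append(
            {"range": {"timestamp": {"gte": start, "lte": end, "format": "epoch_second"}}}
        )
    if filters:
        # Unscored clauses: these only narrow the candidate set.
        bool_query["filter"] = filters
    return {"query": {"bool": bool_query}, "size": size}


body = build_search_body(
    match={"page_name": "wechat"},
    term={"page_name.raw": "example.cn/wechat/view.html"},
    time_range=(1481218631, 1481258231),
    size=2,
)
print(json.dumps(body, indent=4))
```

The resulting dictionary can be serialized with json.dumps and sent as the body of the _search request shown above.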

Aggregation Analysis

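Just as aggregation in MySQL consists of grouping and aggregate computation, aggregation in Elasticsearch has two parts: Buckets and Metrics. Buckets correspond to GROUP BY in SQL, and Metrics correspond to SQL aggregate functions such as COUNT, SUM, MAX, and MIN. Aggregation analysis naturally involves grouping by several fields. In MySQL, "group by c1, c2, c3" does this, but Elasticsearch has no such syntax. It offers another approach, nested Buckets, which on reflection seems closer to how people think. The examples below show how: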

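1. The simplest aggregation query
POST https://localhost:9200/user_action/_search
For simplicity, the query conditions are omitted here.
Matching documents are bucketed by company.
There are two size parameters: the size=0 alongside aggs means the response contains no hits, only the aggregation results; the size inside terms is the number of buckets returned.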
{
    "aggs": {
        "company_terms": {
            "terms": {
                "field": "company",
                "size": 2
            }
        }
    },
    "size": 0
}
 
2. Buckets combined with a Metric
POST https://localhost:9200/user_action/_search
Bucket matching documents by company and get the time of each company's most recent action.
{
    "aggs": {
        "company_terms": {
            "terms": {
                "field": "company",
                "size": 2
            },
            "aggs": {
                "latest_record": {
                    "max": {
                        "field": "timestamp"
                    }
                }
            }
        }
    },
    "size": 0
}
 
3. Nested Buckets
POST https://localhost:9200/user_action/_search
Bucket matching documents by company first, then by store within each company, and get the time of each store's most recent action.
{
    "aggs": {
        "company_terms": {
            "terms": {
                "field": "company",
                "size": 1
            },
            "aggs": {
                "store_terms": {
                    "terms": {
                        "field": "store",
                        "size": 2
                    },
                    "aggs": {
                        "latest_record": {
                            "max": {
                                "field": "timestamp"
                            }
                        }
                    }
                }
            }
        }
    },
    "size": 0
}
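
Since nesting stands in for "group by c1, c2, c3", the request body can be generated from an ordered list of fields. The sketch below is one way to do that; `build_nested_terms` is a hypothetical helper, and the `<field>_terms` bucket names simply follow the naming used in the examples above.

```python
import json


def build_nested_terms(fields, metric=None, size=10):
    """Nest one terms bucket per field, outermost first, like GROUP BY f1, f2, ..."""
    aggs = dict(metric) if metric else None
    # Build from the innermost field outwards so each level wraps the previous one.
    for field in reversed(fields):
        level = {"terms": {"field": field, "size": size}}
        if aggs:
            level["aggs"] = aggs
        aggs = {f"{field}_terms": level}
    # size=0 at the top level: return aggregation results only, no hits.
    return {"aggs": aggs or {}, "size": 0}


body = build_nested_terms(
    ["company", "store"],
    metric={"latest_record": {"max": {"field": "timestamp"}}},
    size=2,
)
print(json.dumps(body, indent=4))
```

With the two fields above, this produces the same shape as the third example: a company_terms bucket containing a store_terms bucket, which in turn contains the latest_record metric.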


Original address: http://www.cnblogs.com/yzw23333/p/7735369.html
