Mysql 统计标签出现次数（一行变多行）

时间：2020-06-24 16:34:59 阅读：59 评论：0 收藏：0 [点我收藏+]

需求背景

需求

一张数据表
其中有sid字段，代表tag，每行数据可能有多个tag字段
统计全量数据中所有tag出现的次数（按tag分组，分别有多少数据）

source table demo

id	sid
1	a3,a4,a1,a2,a5
2	a5,a3,a4,a2
3	a5,a3,a4

target table demo

sid_tag	sid_occurrence
a1	1
a2	2
a3	3
a4	3
a5	3

SQL query in mysql

select substring_index( substring_index( sid , ‘,‘, id_table.help_topic_id + 1 ), ‘,‘,- 1 ) as sid_tag,count(sid) as sid_occurrence from src_t JOIN mysql.help_topic id_table on id_table.help_topic_id < (
  length(src_t.sid) - length(replace(src_t.sid, ‘,‘, ‘‘)) + 1
) GROUP BY sid_tag

hive

select tag,count(*) from(
select tag from table_name
lateral view explode(split(taglist,‘,‘)) r1 AS tag) a group by a.tag;

解析

利用系统表mysql.help_topic进行了index的记录（可以使用其他表），更暴力的解法是直接新建一张只有id列的表，可以预防系统表id不够用（一般足够了）
JOIN a.id < b.id，解决了复制行的操作；若b.id = 5，a.id 为（1，2，3，4，5），返回数据为 [(1,5),(2,5),(3,5),(4,5)]
substring_index( substring_index( sid , ‘,‘, id_table.help_topic_id + 1 ), ‘,‘,- 1 ) 利用index表中的index，作为子串索引进行数据拆分。索引构建基于上一步的join。
随后正常groupby统计

实验

mysql

select * from (
  SELECT concat(‘a‘,topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN (‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘)
) as id_table JOIN (
  SELECT concat(‘b‘,topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN (‘1‘, ‘2‘, ‘3‘)
) as id_table2 on id_table2.id <= id_table.id

结果

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

Mysql 统计标签出现次数（一行变多行）

标签：暴力多个 view 索引 targe cat count 字段多行

原文地址：https://www.cnblogs.com/suanec/p/13187736.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3