Introduction to Trident

Date: 2015-06-30 16:24:58

I. Theoretical background
More theory will be added later; in the meantime, see books on the subject.
1. What is Trident?
Trident is a high-level abstraction for doing realtime computing on top of Storm. It allows you to seamlessly intermix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar – Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.

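Put simply, Trident is a higher-level abstraction over Storm. Compared with plain Storm, it offers three main benefits:

(1) Higher-level abstractions: common operations such as count and sum are packaged as ready-made methods that can be called directly, so you do not have to implement them yourself.

(2) Built-in primitives such as groupBy.

(3) Transaction support, which guarantees that every tuple is processed and that its effect is applied exactly once.

2. Trident always processes messages in batches, that is, several tuples are handled at a time.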
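
To make the combination of batches and exactly-once updates more concrete, here is a minimal plain-Java sketch. It does not use the Storm API; the class `TxCountStore` and its methods are hypothetical names invented for this illustration. It assumes that every batch carries a transaction id and that a replayed batch has the same id and the same contents, so remembering the last applied id per key is enough to skip a replay.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Storm/Trident API.
class TxCountStore {
    // Per word: the running count and the id of the last batch applied to it.
    private static final class Slot {
        long count = 0;
        long lastTxId = -1;
    }

    private final Map<String, Slot> slots = new HashMap<>();

    // Applies one batch of words. Replaying a batch with the same txId
    // leaves the counts unchanged.
    void applyBatch(long txId, List<String> words) {
        // Pre-aggregate inside the batch so each key is updated once per batch.
        Map<String, Long> partial = new HashMap<>();
        for (String word : words) {
            partial.merge(word, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            Slot slot = slots.computeIfAbsent(e.getKey(), k -> new Slot());
            if (slot.lastTxId == txId) {
                continue; // this batch was already applied to this key
            }
            slot.count += e.getValue();
            slot.lastTxId = txId;
        }
    }

    // Returns 0 for a word that has never been seen.
    long count(String word) {
        Slot slot = slots.get(word);
        return slot == null ? 0L : slot.count;
    }
}
```

Applying batch 1 twice and then batch 2 once gives the same counts as applying each batch a single time, which is the property the transaction support is meant to provide.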

II. The official example

package org.ljh.tridentdemo;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;


public class TridentWordCount {
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout =
                new FixedBatchSpout(new Fields("sentence"), 3, new Values(
                        "the cow jumped over the moon"), new Values(
                        "the man went to the store and bought some candy"), new Values(
                        "four score and seven years ago"),
                        new Values("how many apples can you eat"), new Values(
                                "to be or not to be the person"));
        spout.setCycle(true);

        // Create the topology object
        TridentTopology topology = new TridentTopology();

        // This stream counts the words; the result is kept in wordCounts
        TridentState wordCounts =
                topology.newStream("spout1", spout)
                        .parallelismHint(16)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                new Fields("count")).parallelismHint(16);
        // This stream queries the counts produced above
        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
        return topology.build();
    }

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        if (args.length == 0) {
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
            for (int i = 0; i < 100; i++) {
                System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
                Thread.sleep(1000);
            }
        } else {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, buildTopology(null));
        }
    }
}

The example implements a basic word count and then outputs the result. The key steps are as follows:

1. Define the input stream

        FixedBatchSpout spout =
                new FixedBatchSpout(new Fields("sentence"), 3, new Values(
                        "the cow jumped over the moon"), new Values(
                        "the man went to the store and bought some candy"), new Values(
                        "four score and seven years ago"),
                        new Values("how many apples can you eat"), new Values(
                                "to be or not to be the person"));
        spout.setCycle(true);

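(1) FixedBatchSpout creates the input spout. Its output field is sentence, and each batch holds at most 3 tuples.
(2) setCycle(true) makes the spout emit the same data over and over.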
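
The batching and cycling behavior described above can be sketched in plain Java. This is not the real FixedBatchSpout; `CyclingBatchSource` is a hypothetical stand-in, and it assumes that a batch never wraps around in the middle, so five sentences with a batch size of 3 come out as a batch of 3 followed by a batch of 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the behavior described above; not the real FixedBatchSpout.
class CyclingBatchSource {
    private final List<String> sentences;
    private final int maxBatchSize;
    private final boolean cycle;
    private int index = 0;

    CyclingBatchSource(int maxBatchSize, boolean cycle, String... sentences) {
        this.maxBatchSize = maxBatchSize;
        this.cycle = cycle;
        this.sentences = new ArrayList<>(Arrays.asList(sentences));
    }

    // Returns the next batch of at most maxBatchSize sentences.
    // With cycle == true the source restarts from the beginning once it is
    // exhausted; otherwise it returns empty batches from then on.
    List<String> nextBatch() {
        if (index >= sentences.size() && cycle) {
            index = 0;
        }
        List<String> batch = new ArrayList<>();
        while (index < sentences.size() && batch.size() < maxBatchSize) {
            batch.add(sentences.get(index++));
        }
        return batch;
    }
}
```

With the five sentences of the example and a batch size of 3, successive calls to `nextBatch()` in this sketch return batches of sizes 3, 2, 3, 2, and so on.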

2. Count the words

        TridentState wordCounts =
                topology.newStream("spout1", spout)
                        .parallelismHint(16)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                new Fields("count")).parallelismHint(16);

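This stream counts the words and keeps the result in wordCounts. The six lines of code mean the following:

(1) Read messages from the spout. The name spout1 identifies the ZooKeeper node under which Trident stores the metadata for this input stream.

(2) Set the parallelism to 16, so 16 threads read from the spout at the same time.

(3) The three arguments of each are: the input field names, the processing function, and the output field names. Here it reads the field named sentence, passes it through new Split(), and emits the result under the field name word. Split is described later; it simply splits its input on spaces.

(4) Group the stream by the field word, so that tuples with the same value end up in the same group.

(5) Count the tuples in each group, store the result in a MemoryMapState, and emit it under the field name count. This step stores both the data and the state, and returns a TridentState object.

(6) Set the parallelism of the aggregation.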
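
The data flow of steps (3) to (5) can be sketched without Storm. The class `WordCountSketch` below is a hypothetical illustration: a plain HashMap plays the role of the state that survives from one batch to the next, and parallelism and ZooKeeper metadata are left out entirely.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of split -> group -> count; not the Trident API.
class WordCountSketch {
    // Plays the role of the persistent state: it survives across batches.
    private final Map<String, Long> counts = new HashMap<>();

    // Processes one batch of sentences and folds it into the running counts.
    void processBatch(List<String> sentences) {
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) { // the "each ... Split" step
                counts.merge(word, 1L, Long::sum);    // grouping by word, then counting
            }
        }
    }

    // Returns null for a word that has never been seen.
    Long get(String word) {
        return counts.get(word);
    }
}
```

Feeding it the first three sentences of the example as one batch gives a count of 4 for "the" and 2 for "and"; a word such as "cat" that never occurs has no entry at all.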

3. Output the counts

        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

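This stream reads the results from the wordCounts object above and returns them. The six lines of code mean the following:

(1) Wait for a DRPC call: the stream is fed by calls to the function named words on the DRPC server. The call looks like this:

drpc.execute("words", "cat the dog jumped")

(2) The input is the argument supplied in that call. It is passed through Split() and emitted under the field name word.

(3) Group by the value of word.

(4) Query the wordCounts object. The four arguments are: the state to query, the input field, a built-in function (MapGet, which looks up a value in the map by key), and the output field name.

(5) Filter out empty query results. In this example, cat and dog have no result.

(6) Sum the results and emit them under the field name sum; this is what the DRPC call returns. Without this line, the output looks like this:

DRPC RESULT: [["cat the dog jumped","the",2310],["cat the dog jumped","jumped",462]]
With this line, the result is:
DRPC RESULT: [[180]]
The counts keep growing while the spout cycles, so the two samples above were taken at different moments and their numbers do not add up to each other.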
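
What the query stream computes can be sketched as an ordinary function. `DrpcQuerySketch.sumOfCounts` is a hypothetical name, the map stands in for the wordCounts state, and in this sketch every occurrence of a word in the query string is looked up separately.

```java
import java.util.Map;

// Hypothetical illustration of the query stream; not the Trident or DRPC API.
class DrpcQuerySketch {
    // Splits the query on spaces, looks each word up, skips words that have
    // no count, and returns the sum of the counts that were found.
    static long sumOfCounts(Map<String, Long> wordCounts, String args) {
        long sum = 0;
        for (String word : args.split(" ")) {
            Long count = wordCounts.get(word); // the role of MapGet
            if (count == null) {
                continue;                      // the role of FilterNull
            }
            sum += count;                      // the role of Sum
        }
        return sum;
    }
}
```

With a state in which "the" is 2310 and "jumped" is 462, as in the first sample output, `sumOfCounts(state, "cat the dog jumped")` returns 2772, because cat and dog contribute nothing.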

4. Definition of Split

    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

Note that it emits every word it produces through collector.emit.
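
One detail of splitting on a single space is worth seeing in isolation: in Java, consecutive spaces produce empty strings, which would then be counted as words. The sketch below is plain Java with hypothetical names (`SplitSketch`, `splitOnSpace`, `splitOnWhitespace`); the second method is a possible variant, not something the original example does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the splitting logic; not a Trident function.
class SplitSketch {
    // Same splitting rule as the Split class above: one token per single space.
    static List<String> splitOnSpace(String sentence) {
        return Arrays.asList(sentence.split(" "));
    }

    // Variant: split on runs of whitespace and drop empty tokens.
    static List<String> splitOnWhitespace(String sentence) {
        List<String> words = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

For the input "the  cow" (two spaces), `splitOnSpace` returns three tokens, the middle one empty, while `splitOnWhitespace` returns just "the" and "cow". The fixed sentences in the example contain only single spaces, so the simple version is sufficient there.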

5. Create and start the topology

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        if (args.length == 0) {
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
            for (int i = 0; i < 100; i++) {
                System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
                Thread.sleep(1000);
            }
        } else {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, buildTopology(null));
        }
    }

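(1) When run without arguments, the program starts a local cluster and creates its own LocalDRPC object to issue the queries.
(2) When run with arguments, it sets the number of workers to 3, submits the topology to the cluster under the name given in args[0], and waits for remote DRPC calls.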


Original article: http://blog.csdn.net/jinhong_lu/article/details/46696549
