1. Prepare Cassandra
Start cqlsh:
CQLSH_HOST=172.16.163.131 bin/cqlsh
cqlsh>CREATE KEYSPACE productlogs WITH REPLICATION = { 'class' : 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2' };
cqlsh>CREATE TABLE productlogs.logs (
    ids uuid,
    app_name text,
    app_version text,
    city text,
    client_time timestamp,
    country text,
    created_at timestamp,
    cs_count int,
    device_id text,
    id int,
    modle_name text,
    province text,
    remote_ip text,
    updated_at timestamp,
    PRIMARY KEY (ids)
);
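2. Build the Spark Cassandra Connector jar
Create a new empty project with sbt, add the connector as a dependency, and package it as spark-cassandra-connector-full.jar.
The reason for this step: the official connector jar does not bundle its dependencies, so using it directly means tracking down every dependency yourself. The required packages and their versions also differ between connector versions, so the simplest approach is to build a single "full" jar.
3. Start spark-shell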
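The original post does not show the sbt project, so the following is only a minimal sketch of what such a build might look like. It assumes the sbt-assembly plugin is used to produce the full jar; the plugin version, the Scala version, and the connector version (1.5.0, chosen to match Spark 1.5.2) are all assumptions and should be adjusted to your environment.

```scala
// project/plugins.sbt (assumed plugin and version, not from the original post)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
name := "spark-cassandra-connector-full"

version := "0.1"

// Assumption: the prebuilt Spark 1.5.2 distribution is compiled against Scala 2.10.
scalaVersion := "2.10.5"

// Assumption: connector 1.5.0 is the release line matching Spark 1.5.x.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0"

// Name the output jar as used in the rest of this post.
assemblyJarName in assembly := "spark-cassandra-connector-full.jar"

// A blunt merge strategy: drop META-INF entries and take the first copy of
// any other duplicated file. Refine this if the assembly step reports conflicts
// or if a dependency relies on META-INF/services files.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```

With this layout, running `sbt assembly` should write the jar under `target/scala-2.10/`. If the resulting jar turns out to include Spark itself, declaring the Spark modules explicitly with the `provided` scope is one way to keep them out.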
/opt/db/spark-1.5.2-bin-hadoop2.6/bin/spark-shell --master spark://u1:7077 --jars ~/spark-cassandra-connector-full.jar
The commands below are all run inside spark-shell.
4. Prepare the data source:
// Most tutorials stop the current sc and start a new one. That is unnecessary: just add the Cassandra setting to the existing sc.
// Note: SparkContext.getConf returns a copy of the configuration. If this setting does not take effect,
// pass --conf spark.cassandra.connection.host=172.16.163.131 when launching spark-shell instead.
scala>sc.getConf.set("spark.cassandra.connection.host", "172.16.163.131")
// Read the source data from HDFS
scala>val lines = sc.textFile("/data/logs")
// Import the required namespaces
scala>import org.apache.spark.sql._
scala>import org.apache.spark.sql.types._
scala>import com.datastax.spark.connector._
scala>import java.util.UUID
// Define the schema
scala>val schema = StructType(
  StructField("ids", StringType, true) ::
  StructField("id", IntegerType, true) ::
  StructField("app_name", StringType, true) ::
  StructField("app_version", StringType, true) ::
  StructField("client_time", TimestampType, true) ::
  StructField("device_id", StringType, true) ::
  StructField("modle_name", StringType, true) ::
  StructField("cs_count", IntegerType, true) ::
  StructField("created_at", TimestampType, true) ::
  StructField("updated_at", TimestampType, true) ::
  StructField("remote_ip", StringType, true) ::
  StructField("country", StringType, true) ::
  StructField("province", StringType, true) ::
  StructField("city", StringType, true) :: Nil)
// Convert each tab-separated line into a Row (with a generated UUID as ids) and apply the schema
scala>val rowRDD = lines.map(_.split("\t")).map(p => Row(
  UUID.randomUUID().toString(), p(0).toInt, p(1), p(2),
  java.sql.Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt,
  java.sql.Timestamp.valueOf(p(7)), java.sql.Timestamp.valueOf(p(8)),
  p(9), p(10), p(11), p(12)))
scala>val df = sqlContext.createDataFrame(rowRDD, schema)
scala>df.registerTempTable("logs")
// Check the result
scala>sqlContext.sql("select * from logs limit 1").show
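The row conversion above throws on the first malformed line (a non-numeric id, a timestamp in a different format, or a missing column). As a hedged sketch, a small helper like the one below could be used to skip such lines. `LogLineParser` and `parseLine` are hypothetical names introduced here for illustration, and the 13-column layout is inferred from the indices p(0) to p(12) used above.

```scala
import java.sql.Timestamp
import scala.util.Try

object LogLineParser {
  // Number of tab-separated columns read by the Row(...) expression: p(0) .. p(12).
  val ExpectedFields = 13

  // Returns the 13 converted values in file order, or None when the line is malformed.
  def parseLine(line: String): Option[Seq[Any]] = {
    // limit = -1 keeps trailing empty fields, which a plain split("\t") would drop.
    val p = line.split("\t", -1)
    if (p.length != ExpectedFields) None
    else Try(Seq[Any](
      p(0).toInt, p(1), p(2), Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt,
      Timestamp.valueOf(p(7)), Timestamp.valueOf(p(8)), p(9), p(10), p(11), p(12)
    )).toOption
  }
}
```

In the shell, one way to use it would be `sc.textFile("/data/logs").flatMap(l => LogLineParser.parseLine(l).toList).map(v => Row.fromSeq(UUID.randomUUID().toString +: v))`, which yields the same Row layout as before while silently dropping lines that fail to parse. Whether dropping is acceptable, or whether bad lines should be counted or logged, depends on the data.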
5. Save the data to Cassandra
scala>import org.apache.spark.sql.cassandra._
scala>df.write.
  format("org.apache.spark.sql.cassandra").
  options(Map("table" -> "logs", "keyspace" -> "productlogs")).
  save()
6. Read back the data just saved:
scala>import org.apache.spark.sql.cassandra._
scala>val cdf = sqlContext.read.
  format("org.apache.spark.sql.cassandra").
  options(Map("table" -> "logs", "keyspace" -> "productlogs")).
  load()
scala>cdf.registerTempTable("logs_just_saved")
scala>sqlContext.sql("select * from logs_just_saved limit 1").show
Original article: http://www.cnblogs.com/piaolingzxh/p/5427568.html