Are you ready for Apache Spark 2.0?
If you are just getting started with Apache Spark, the 2.0 release is the one to start with as the APIs have just gone through a major overhaul to improve ease-of-use.
If you are using an older version and want to learn what has changed, this article will give you the lowdown on why you should upgrade and what the impact on your code will be.
Let’s start with the good news, and there’s plenty.
CSV support is now built in and based on the Databricks spark-csv project, making it a breeze to create Datasets from CSV data with little coding.
Spark 2.0 is a major release, and there are some breaking changes that mean you may need to rewrite some of your code. Here are some things we ran into when updating our apache-spark-examples.
Get help upgrading to Apache Spark 2.0 or making the transition from Java to Scala. Contact Us!
Both the RDD API and the Dataset API represent data sets of a specific class. For instance, you can create an RDD[Person] as well as a Dataset[Person] so both can provide compile-time type-safety. Both can also be used with the generic Row structure provided in Spark for cases where classes might not exist that represent the data being manipulated, such as when reading CSV files.
RDDs can be used with any Java or Scala class and operate by manipulating those objects directly with all of the associated costs of object creation, serialization and garbage collection.
In Scala, Datasets of custom types are limited to classes that implement the Scala Product trait, such as case classes (the Java API uses bean encoders instead, as shown later). There is a very good reason for this limitation. Datasets store data in an optimized binary format, often in off-heap memory, to avoid the costs of deserialization and garbage collection. Even though it feels like you are coding against regular objects, Spark is really generating its own optimized bytecode for accessing the data directly.
RDD
// raw object manipulation
val rdd: RDD[Person] = …
val rdd2: RDD[String] = rdd.map(person => person.lastName)
Dataset
// optimized direct access to off-heap memory without deserializing objects
val ds: Dataset[Person] = …
val ds2: Dataset[String] = ds.map(person => person.lastName)
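To make the Product trait mentioned above a little more concrete, here is a minimal sketch in plain Scala (no Spark required). The Person case class and its fields are our own assumption for illustration. It only shows that a case class exposes its fields generically through Product; it does not show Spark's encoder machinery itself.

```scala
// Hypothetical case class, used only for illustration.
final case class Person(id: Int, firstName: String, lastName: String)

object ProductDemo {
  def main(args: Array[String]): Unit = {
    val p = Person(1, "Joe", "Bloggs")

    // Every case class extends scala.Product, so its fields can be
    // enumerated generically, in declaration order.
    val asProduct: Product = p
    println(asProduct.productArity)                   // prints 3
    println(asProduct.productIterator.mkString(","))  // prints 1,Joe,Bloggs
  }
}
```

A regular class without this structure gives Spark no equally simple way to map fields to columns, which is one way to read the limitation described above.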
Here are some code samples to help you get started fast with Apache Spark 2.0 and Scala.
SparkSession is now the starting point for a Spark driver program, instead of creating a SparkContext and a SQLContext.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("Example")
  .getOrCreate()
// accessing legacy SparkContext and SQLContext
spark.sparkContext
spark.sqlContext
SparkSession provides a createDataset method that accepts a collection.
// requires import spark.implicits._ so that an Encoder[String] is in scope
val ds: Dataset[String] = spark.createDataset(List("one", "two", "three"))
SparkSession provides a createDataset method for converting an RDD to a Dataset. This only works if you import spark.implicits._ (where spark is the name of the SparkSession variable).
// always import implicits so that Spark can infer types when creating Datasets
import spark.implicits._
val rdd: RDD[Person] = ??? // assume this exists
val dataset: Dataset[Person] = spark.createDataset[Person](rdd)
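The same implicits import also adds a toDS() method to RDDs and local sequences, which can read a little more naturally than calling createDataset. The following is a minimal sketch under our own assumptions: the Person case class, the sample values and the object name are hypothetical and not taken from the examples repo.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class, used only for illustration.
// It is declared at the top level so Spark can derive an encoder for it.
final case class Person(id: Int, firstName: String, lastName: String)

object ToDsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("ToDsSketch")
      .getOrCreate()

    // Brings the encoders and the toDS() syntax into scope.
    import spark.implicits._

    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person(1, "Joe", "Bloggs")))

    // Same effect as spark.createDataset(rdd).
    val fromRdd: Dataset[Person] = rdd.toDS()
    // Works for local collections too.
    val fromSeq: Dataset[Person] = Seq(Person(2, "Jane", "Doe")).toDS()

    println(fromRdd.union(fromSeq).count())

    spark.stop()
  }
}
```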
A DataFrame (which is really a Dataset[Row]) can be converted to a Dataset of a specific class by performing a map() operation.
// needed so that an Encoder[Person] is available for map()
import spark.implicits._
// read a text file into a DataFrame a.k.a. Dataset[Row]
val df: Dataset[Row] = spark.read.text("people.txt")
// use map() to convert to a Dataset of a specific class
val ds: Dataset[Person] = df.map(row => parsePerson(row))
def parsePerson(row: Row): Person = ??? // fill in parsing logic here
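As an illustration of what the parsing logic might look like, here is a minimal sketch. It assumes each line has the form "id,firstName,lastName" and that Person is a case class with those three fields; both are our assumptions, since the real file format is not shown here. An in-memory DataFrame with a single string column named value stands in for spark.read.text so the sketch runs without a file.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical case class, used only for illustration.
final case class Person(id: Int, firstName: String, lastName: String)

object ParsePersonSketch {
  // Assumed line format: "<id>,<firstName>,<lastName>".
  // Malformed lines will throw; real code would want error handling.
  def parsePerson(row: Row): Person = {
    val fields = row.getString(0).split(",").map(_.trim)
    Person(fields(0).toInt, fields(1), fields(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("ParsePersonSketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.text("people.txt"): one string column named "value".
    val df: Dataset[Row] = Seq("1,Joe,Bloggs", "2,Jane,Doe").toDF("value")

    val ds: Dataset[Person] = df.map(row => parsePerson(row))
    ds.show()

    spark.stop()
  }
}
```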
The built-in CSV support makes it easy to read a CSV and return a Dataset of a specific case class. This only works if the CSV contains a header row and the field names match the case class.
// requires import spark.implicits._ for the Person encoder
val ds: Dataset[Person] = spark.read
  .option("header", "true")
  .csv("people.csv")
  .as[Person]
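One thing to watch for: unless a schema is supplied or inferSchema is enabled, the CSV reader treats every column as a string, so converting to a case class with non-string fields may fail at analysis time. A minimal sketch of one way around this is to derive the schema from the case class itself. The Person fields and the sample file contents below are our own assumptions for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical case class; the CSV header is assumed to use the same names.
final case class Person(id: Int, firstName: String, lastName: String)

object CsvSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("CsvSchemaSketch")
      .getOrCreate()
    import spark.implicits._

    // Write a small sample file so the sketch is self-contained.
    val path = Files.createTempFile("people", ".csv")
    Files.write(path, "id,firstName,lastName\n1,Joe,Bloggs\n".getBytes("UTF-8"))

    val ds: Dataset[Person] = spark.read
      .option("header", "true")
      // Derive column names and types from the case class, so id is read as an Int.
      .schema(Encoders.product[Person].schema)
      .csv(path.toString)
      .as[Person]

    ds.show()

    spark.stop()
  }
}
```

Setting .option("inferSchema", "true") is an alternative, at the cost of an extra pass over the data.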
Here are some code samples to help you get started fast with Spark 2.0 and Java.
SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .appName("Example")
    .getOrCreate();
// Java still requires the use of JavaSparkContext
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
SparkSession provides a createDataset method that accepts a collection.
Dataset<Person> ds = spark.createDataset(
    Collections.singletonList(new Person(1, "Joe", "Bloggs")),
    Encoders.bean(Person.class)
);
SparkSession provides a createDataset method for converting an RDD to a Dataset.
Dataset<Person> ds = spark.createDataset(
    javaRDD.rdd(), // convert a JavaRDD to an RDD
    Encoders.bean(Person.class)
);
A DataFrame (which is really a Dataset<Row>) can be converted to a Dataset of a specific class by performing a map() operation.
Dataset<Person> ds = df.map(new MapFunction<Row, Person>() {
    @Override
    public Person call(Row value) throws Exception {
        return new Person(Integer.parseInt(value.getString(0)),
                          value.getString(1),
                          value.getString(2));
    }
}, Encoders.bean(Person.class));
The built-in CSV support makes it easy to read a CSV and return a Dataset of a specific Java bean class. This only works if the CSV contains a header row and the field names match the bean's properties.
Dataset<Person> ds = spark.read()
    .option("header", "true")
    .csv("testdata/people.csv")
    .as(Encoders.bean(Person.class));
Using Apache Spark with Java is harder than using it with Scala, and we spent significantly longer upgrading our Java examples than our Scala examples. Along the way we ran into some confusing runtime errors that were hard to track down. For example, we hit a runtime error in Spark's code generation because one of our Java classes was not declared as public.
Also, we weren't always able to use concise lambda functions even though we are using Java 8, and had to revert to anonymous inner classes with verbose (and confusing) syntax.
Spark 2.0 represents a significant milestone in the evolution of this open source project and provides cleaner APIs and improved performance compared to the 1.6 release.
The Scala API is a joy to code with, but the Java API can often be frustrating. It’s worth biting the bullet and switching to Scala.
Full source code for a number of examples is available from our GitHub repo.
APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
Original source: https://www.cnblogs.com/felixzh/p/9527758.html