
Extending SparkContext with a custom textFiles method to read text files from multiple input directories


Requirement

Extend SparkContext with a custom textFiles method that can read text files from multiple input directories at once.
 
Extension
 
import pyspark
from pyspark.serializers import PickleSerializer


class SparkContext(pyspark.SparkContext):

    def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
                 environment=None, batchSize=0, serializer=PickleSerializer(),
                 conf=None, gateway=None, jsc=None):
        pyspark.SparkContext.__init__(self, master=master, appName=appName,
                                      sparkHome=sparkHome, pyFiles=pyFiles,
                                      environment=environment, batchSize=batchSize,
                                      serializer=serializer, conf=conf,
                                      gateway=gateway, jsc=jsc)

    def textFiles(self, dirs):
        # Join the directories into a single comma-separated input path and
        # ask the input format to descend into subdirectories.
        hadoopConf = {
            "mapreduce.input.fileinputformat.inputdir": ",".join(dirs),
            "mapreduce.input.fileinputformat.input.dir.recursive": "true",
        }

        # TextInputFormat produces (byte offset, line) pairs; keep only the line.
        pairs = self.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                               keyClass="org.apache.hadoop.io.LongWritable",
                               valueClass="org.apache.hadoop.io.Text",
                               conf=hadoopConf)

        return pairs.map(lambda kv: kv[1])
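
For comparison, stock PySpark can already read several directories without any subclassing: textFile accepts a comma-separated list of paths, and per-directory RDDs can be unioned. What the extension above adds is the recursive descent into subdirectories. A minimal sketch of the built-in route, assuming a plain SparkContext (the app name here is illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("textFile_multi_dir")
sc = SparkContext(conf=conf)

dirs = ["hdfs://dip.cdh5.dev:8020/user/yurun/dir1",
        "hdfs://dip.cdh5.dev:8020/user/yurun/dir2"]

# textFile accepts comma-separated paths, but it does not recurse
# into subdirectories the way textFiles above does.
lines = sc.textFile(",".join(dirs))

# Equivalent alternative: build one RDD per directory and union them.
lines = sc.union([sc.textFile(d) for d in dirs])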

 

 
Example
 
from pyspark import SparkConf
from dip.spark import SparkContext

conf = SparkConf().setAppName("spark_textFiles_test")

sc = SparkContext(conf=conf)

dirs = ["hdfs://dip.cdh5.dev:8020/user/yurun/dir1",
        "hdfs://dip.cdh5.dev:8020/user/yurun/dir2"]


def printLines(lines):
    if lines:
        for line in lines:
            print(line)


# Read every text file under both directories and pull the lines
# back to the driver.
lines = sc.textFiles(dirs).collect()

printLines(lines)

sc.stop()
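
One caveat: collect() ships every line back to the driver, which is only reasonable for small test directories. For larger inputs, aggregate or sample on the executors instead; a hedged sketch, to be run before sc.stop() (the "ERROR" predicate is purely illustrative):

# Count lines without materializing them on the driver.
total = sc.textFiles(dirs).count()

# Or inspect just a handful of matching lines.
samples = sc.textFiles(dirs).filter(lambda line: "ERROR" in line).take(10)

print(total)
printLines(samples)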

 

 


Original article: http://www.cnblogs.com/yurunmiao/p/4893946.html
