data audit on hadoop fs

Date: 2014-07-18 17:33:40

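In a recent project we found that some of the data stored on HDFS was malformed: the records contained \r\n characters, a case the processing program had not accounted for. About a year of historical data was affected, and the wrong or duplicated records had to be deleted while the correct ones were kept. The project uses Pig for data processing, so I wrote a UDF as a Java class to filter out the bad records and stored the bad data and the good data separately. I then wrote the following script to collect the schema and the row counts of the data, recorded here so that future projects can refer to it.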
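As an aside before the audit script, here is a minimal sketch of how one might check how many records in a stream are affected by the stray carriage returns described above. It is not the author's Pig UDF and not part of the script; the function name count_cr_lines is hypothetical, and it assumes a bad record can be recognized by a \r anywhere in the line.

```sh
#!/bin/sh
# Hypothetical helper (not from the original post): reads text on stdin and
# prints the number of lines that contain at least one carriage return (\r).
count_cr_lines() {
        awk '/\r/ { n++ } END { print n + 0 }'
}

# Possible usage against one job folder (the path is a placeholder):
#   hadoop fs -text "$job/par*" | count_cr_lines
```

Printing n + 0 makes the function print 0 rather than an empty line when no line matches. The author's audit script follows.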

#!/bin/sh

# directory that contains this script
curDir=$(cd "$(dirname "$0")" && pwd)

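# Write a per-job report of total, good and bad record counts.
#   $1: local file listing one HDFS job folder per line
#   $2: local report file (tab-separated)
summary(){
        files=""

        printf "job\ttotalQueries\tgoodQueries\tbadQueries\n" > "$2"
        while read -r job
        do
                # collect the part-file globs of every job folder seen so far
                if [ -z "$files" ]; then
                        files="$job/par*"
                else
                        files="$files $job/par*"
                fi

                # the globs are quoted so that HDFS, not the local shell, expands them
                totalQueries=$(hadoop fs -text "$job/par*" | wc -l)
                goodQueries=$(hadoop fs -text "/user/chran/txt$job/par*" | wc -l)
                badQueries=$(hadoop fs -text "/user/chran/txt/error$job/par*" | wc -l)
                #distinctQueries=$(hadoop fs -text "$job/par*" | awk -F '\a' '{ print NF }' | sort | uniq)
                printf "%s\t%s\t%s\t%s\n" "$job" "$totalQueries" "$goodQueries" "$badQueries" >> "$2"
        done < "$1"
}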

# Audit one HDFS root folder.
#   $1: HDFS folder to audit
#   $2: name of the report file created under $curDir/temp
check(){
        tempDir=$curDir/temp

        mkdir -p "$tempDir"

        #clean up result files
        output=$tempDir/$2
        rm -f "$output"

        if ! hadoop fs -test -d "$1" ; then
                echo "$1 in HDFS doesn't exist"
                exit 1
        fi

        #list all sub folders
        folderList=$tempDir/$2.folderlist.temp
        #hadoop fs -ls "$1" | awk '{ print $NF }' | sort -u > "$folderList"
        # keep only paths whose last component is the two-digit folder 00
        # (-lsr is deprecated on Hadoop 2 and later in favor of -ls -R)
        hadoop fs -lsr "$1" | grep '/[0-9][0-9]$' | grep '00$' | awk '{ print $NF }' | sort -u > "$folderList"

        summary "$folderList" "$output"

        rm "$folderList"
}

check "/apps/risk/ars/social/raw/SOCIAL_FACEBOOK_RAW" "check_facebook.output.txt"
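The commented-out distinctQueries line in summary points at the schema side of the audit: counting how many fields each record has. A minimal sketch of that idea follows, assuming the records are delimited by the BEL character (\a) as in that line. The function name field_count_histogram is hypothetical and not part of the original script.

```sh
#!/bin/sh
# Hypothetical helper (not from the original post): reads records on stdin and
# prints one "<field count><TAB><number of records>" line per distinct field
# count, sorted by field count. Assumes fields are separated by BEL (\a).
field_count_histogram() {
        awk -F '\a' '{ count[NF]++ }
                END { for (n in count) printf "%d\t%d\n", n, count[n] }' | sort -n
}

# Possible usage against one job folder (the path is a placeholder):
#   hadoop fs -text "$job/par*" | field_count_histogram
```

If every record follows the same schema, the output has a single line. Any additional line means some records have a different number of fields, which is worth inspecting.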


Original post: http://www.cnblogs.com/hugeshi/p/3853286.html
