The current situation is:
1. One of my folders contains many data files with names like these:
part-m-0000
part-m-0001
part-m-0002
part-m-0003
...
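2. The data in each file has this format:
"460030730101160","3","0","0","0","2013/8/31 0:21:42"
I now need to strip the quotation marks around the numbers and extract the hour from the timestamp in the last column. Here is how I handled it in Python:
1. First, walk through all files in the current folder whose names start with 'part';
2. For each file, read every line and split it on ",";
3. Then take the part between the quotation marks for each field, and for the last field (the timestamp) take the hour part; here you need to check whether the hour has 1 or 2 digits;
4. Write one output line for every line read.
Below is the actual code: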
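# coding: utf-8
import os

for root, dirs, files in os.walk("./"):
    for name in files:
        if name.startswith("part"):
            filepath = "./" + name  # This is the current file path
            print(filepath)
            # This is the file to write into; the ./data_handled folder must already exist
            newfilepath = "./data_handled/" + name[7:]
            infile = open(filepath)
            newfile = open(newfilepath, 'w')
            for line in infile:
                string = ""
                line_ = line.split(',')
                # Strip the surrounding quotes from every field except the last
                for i in range(len(line_) - 1):
                    j = line_[i][1:len(line_[i]) - 1]  # Delete the " "
                    string += j
                    string += ','
                len1 = len(line_)
                # The last field looks like "2013/8/31 0:21:42": the hour starts at
                # index 11 and is one digit long if index 12 is ':', otherwise two
                if len(line_[len1 - 1]) > 12:
                    if line_[len1 - 1][12] == ':':
                        k = line_[len1 - 1][11:12]
                    else:
                        k = line_[len1 - 1][11:13]
                else:
                    k = "-1"  # Marker for a missing or too-short timestamp
                string += k
                newfile.write(string + "\n")
            infile.close()
            newfile.close()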
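The fixed offsets 11 and 12 used above only fit timestamps shaped exactly like the sample (one-digit month, two-digit day). As a comparison, here is a minimal sketch of the same four steps that lets the standard csv module strip the quotes and lets datetime.strptime find the hour. This is not the code from the original post: the function names extract_hour, convert_row, convert_file and convert_folder are made up for illustration, and the sketch assumes Python 3 and that every timestamp follows the YYYY/M/D H:MM:SS pattern of the sample line.

```python
import csv
import glob
import os
from datetime import datetime


def extract_hour(timestamp):
    """Return the hour of a 'YYYY/M/D H:MM:SS' timestamp as a string.

    Returns "-1" when the timestamp cannot be parsed, mirroring the
    marker value used in the script above.
    """
    try:
        parsed = datetime.strptime(timestamp.strip(), "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return "-1"
    return str(parsed.hour)


def convert_row(row):
    """Keep all fields but the last, and replace the last with its hour."""
    return row[:-1] + [extract_hour(row[-1])]


def convert_file(src_path, dst_path):
    """Read one quoted CSV file and write the unquoted, hour-only version."""
    with open(src_path, newline="") as src, open(dst_path, "w") as dst:
        # csv.reader removes the surrounding double quotes from each field
        for row in csv.reader(src):
            if row:  # skip blank lines
                dst.write(",".join(convert_row(row)) + "\n")


def convert_folder(src_dir=".", dst_dir="data_handled"):
    """Convert every file in src_dir whose name starts with 'part'."""
    os.makedirs(dst_dir, exist_ok=True)
    for src_path in sorted(glob.glob(os.path.join(src_dir, "part*"))):
        if os.path.isfile(src_path):
            name = os.path.basename(src_path)
            # Same naming as above: drop the first 7 characters ("part-m-"),
            # falling back to the full name if nothing would be left
            convert_file(src_path, os.path.join(dst_dir, name[7:] or name))
```

Calling convert_folder() from the data folder processes every part file and writes the results to ./data_handled, creating that folder if needed. With this sketch the sample line becomes 460030730101160,3,0,0,0,0.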
Original article: http://blog.csdn.net/michael_kong_nju/article/details/39482903