
Hualala's Road to Quantitative Finance with Python - 1 - Simple Data Processing and Plotting

Date: 2016-07-11 18:48:45


The first step in quantitative finance is data statistics and analysis.

The textbook I chose is Python for Data Analysis, published by O'Reilly.


Practical examples

1. Processing the 1.usa.gov data from bit.ly.

 

  1) Data: http://www.usa.gov/About/developer-resources/1usagov.shtml

    The data is in the common JSON format, with one record per line.

 

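  2) Convert the JSON to dictionaries

    Note: I saved the data locally as a TXT file before processing it. The line separators need to be stripped, and because the file contains a BOM character, that has to be removed as well. Each line is then parsed into a dictionary and appended to a list.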

import os
import json,pickle

from collections import defaultdict
from collections import Counter

records = []
with open("haha6.txt", encoding="utf8") as f:
    for line in f:
        line = line.strip("\n")
        if line.startswith("\ufeff"):
            line = line.encode("utf8")[3:].decode("utf8")  # strip the BOM character
        line = json.loads(line)  # parse one JSON record into a dict
        records.append(line)

print(records[0])

#output (the first record):
#{'u': 'http://today.lbl.gov/2016/06/24/saudi-minister-of-energy-visits-lab-on-june-20/#main',
#'_id': '27e6808c-3750-e5ac-002a-cfb577e72a48', 'r': 'direct', 'sl': '2963Ceb', 'h': '2963Ceb',
#'k': '', 'a': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML', 'c': 'FR',
#'hc': 1466804416, 'nk': 0, 'll': [48.8582, 2.3387], 'g': '2963Fqo',
#'t': 1467187377, 'hh': '1.usa.gov', 'l': 'anonymous', 'i': '', 'tz': 'Europe/Paris'}
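As an alternative to removing the BOM by hand, Python's built-in "utf-8-sig" codec drops a leading BOM automatically when one is present. The sketch below is only an illustration: the sample records, the temporary file, and the helper name load_records are made up for the example and are not part of the original data or code.

```python
import json
import os
import tempfile

# Build a tiny sample file that starts with a UTF-8 BOM (made-up records).
sample_lines = [
    '{"tz": "Europe/Paris", "c": "FR"}',
    '{"tz": "America/New_York", "c": "US"}',
]
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + "\n".join(sample_lines).encode("utf-8") + b"\n")
    sample_path = tmp.name


def load_records(path):
    """Read one JSON object per line; "utf-8-sig" drops a leading BOM if present."""
    loaded = []
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                loaded.append(json.loads(line))
    return loaded


sample_records = load_records(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(sample_records[0]["tz"])  # Europe/Paris
```

With this codec the startswith check and the encode/decode round trip are no longer needed, and a file without a BOM is read the same way.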

 

  3) Find all the time zones and count them

time_zones = [rec["tz"] for rec in records if "tz" in rec]

## Time zone counts: tally the values stored under the "tz" key of the dictionaries in the list
# Method 1
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
counts = get_counts(time_zones)
print(counts["America/New_York"])

# Method 2
def get_counts1(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
counts = get_counts1(time_zones)
print(counts["America/New_York"])

#output: 353
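To see what the `if "tz" in rec` filter and the counting loop do, here is a small sketch on made-up records (the sample data is an assumption for illustration, not taken from the 1.usa.gov file). Records with no "tz" key are skipped, while an empty string is still counted as a value of its own.

```python
from collections import defaultdict

# Made-up records; the second one has no "tz" key at all.
sample_records = [
    {"tz": "America/New_York"},
    {"c": "US"},
    {"tz": ""},
    {"tz": "America/New_York"},
]

# Same filter as above: keep only records that carry a "tz" key.
sample_zones = [rec["tz"] for rec in sample_records if "tz" in rec]

# Same counting idea as method 2: missing keys start at 0.
sample_counts = defaultdict(int)
for tz in sample_zones:
    sample_counts[tz] += 1

print(dict(sample_counts))  # {'America/New_York': 2, '': 1}
```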

 

  4) Get the top ten time zones and their counts

# Method 1
def top_counts(count_dict, n = 10):
    value_key_pairs = [(count,tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
print(top_counts(counts))
# Method 2
counts = Counter(time_zones)
counts.most_common(10)
print(counts.most_common(10))
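The two methods return the same top entries but in different shapes: method 1 yields (count, tz) pairs in ascending order, while Counter.most_common yields (tz, count) pairs with the largest first. A minimal sketch on made-up labels (the data and variable names are assumptions for illustration):

```python
from collections import Counter

sample_zones = ["A", "B", "A", "C", "A", "B"]  # made-up labels
sample_counts = Counter(sample_zones)

# Method 1 style: sort (count, tz) pairs ascending, keep the last n.
pairs = sorted((count, tz) for tz, count in sample_counts.items())
top_two_ascending = pairs[-2:]

# Method 2 style: most_common returns (tz, count), largest first.
top_two_descending = sample_counts.most_common(2)

print(top_two_ascending)   # [(2, 'B'), (3, 'A')]
print(top_two_descending)  # [('A', 3), ('B', 2)]
```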

 

  5) Simplify with pandas: count the time zones and plot a bar chart of the top ten

# Count the time zones with pandas
from pandas import DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
frame = DataFrame(records)
#print(frame)
#tz_counts = frame["tz"].value_counts()
#print(tz_counts[:10])
clean_tz = frame["tz"].fillna("missing")  # label records that have no "tz" value
clean_tz[clean_tz == ""] = "unknown"  # label empty strings
tz_counts = clean_tz.value_counts()
tz_counts[:10].plot(kind="barh", rot=0)
plt.show()  # display the chart

#output: a horizontal bar chart
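The cleaning step can be checked on a few made-up records without plotting anything. This is a sketch under assumptions: the sample data is invented, and Series.replace is used here in place of the boolean-mask assignment above, which should give the same labels.

```python
import pandas as pd

# Made-up records: one empty "tz" string and one record with no "tz" key.
sample_records = [
    {"tz": "America/New_York"},
    {"tz": ""},
    {"c": "US"},
    {"tz": "America/New_York"},
]
sample_frame = pd.DataFrame(sample_records)

# Missing keys become NaN in the frame; label them, then label empty strings.
sample_tz = sample_frame["tz"].fillna("missing").replace("", "unknown")
sample_tz_counts = sample_tz.value_counts()

print(sample_tz_counts["America/New_York"])  # 2
```

Keeping "missing" and "unknown" as separate labels makes it possible to tell records that lacked the field from records where it was present but empty.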

 


Original article: http://www.cnblogs.com/hualala/p/5661251.html
