http://www.cnblogs.com/yonghao/p/5061873.html
DT (decision tree) is a supervised learning method, so we need labeled data.
It runs in a single process, so it is not a good fit for giant datasets.
PS: It works pretty well on small, clean datasets.
690 data entries, a relatively small dataset
15 attributes, pretty tiny to be honest
Only 5% of the data has missing values
2-class data
By looking at these two, we know DT should work well for our dataset
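A quick way to check figures like the ones above on your own copy of the data is sketched below. The helper name `summarize` is made up for this example, and it assumes rows shaped like the output of readfile(), with the label in the last column and the string "missing" marking missing values.

```python
def summarize(rows):
    """Return (n_rows, n_attributes, share of rows with a missing value, labels)."""
    n_rows = len(rows)
    n_attributes = len(rows[0]) - 1  # assumption: the last column is the label
    # Count rows that have at least one attribute marked "missing"
    rows_with_missing = sum(1 for row in rows if 'missing' in row[:-1])
    labels = sorted({row[-1] for row in rows})
    return n_rows, n_attributes, rows_with_missing / n_rows, labels


# Tiny in-memory sample standing in for the real file
sample = [
    ['a', 58.67, '+'],
    ['b', 'missing', '-'],
    ['a', 24.5, '+'],
    ['b', 27.83, '-'],
]
print(summarize(sample))  # (4, 2, 0.25, ['+', '-'])
```

On the real dataset the same call should report the row count, attribute count, and number of classes listed above.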
Copy and paste your code into the function readfile(file_name), under the comment # Your code here.
Make sure your input and output match what I described in the docstring.
Make a minor improvement to handle missing data: in this case let's use the string "missing" to represent missing data. Note that it is given as "?" in the data file.
Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstring.
Determine: this object represents a node of our DT. It has 2 inputs and a function.
We can think of it as the question we are asking at each node.
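As a rough illustration of the idea, a question node might look like the sketch below. The attribute names `column` and `value`, the method name `match`, and the rule of using `>=` for numbers and `==` for everything else are all assumptions made for this example; the docstring in the assignment is the authority on the real interface.

```python
class Determine:
    """A question asked at a node, e.g. "is column 1 >= 30.0?" (hypothetical sketch)."""

    def __init__(self, column, value):
        self.column = column  # assumed input 1: index of the attribute to test
        self.value = value    # assumed input 2: value to compare against

    def match(self, row):
        """Return True if the row answers "yes" to this question."""
        val = row[self.column]
        # Numeric attributes are compared with >=, everything else with ==
        if isinstance(self.value, (int, float)) and isinstance(val, (int, float)):
            return val >= self.value
        return val == self.value


row = ['a', 58.67, '+']
print(Determine(1, 30.0).match(row))  # True: 58.67 >= 30.0
print(Determine(0, 'b').match(row))   # False: 'a' != 'b'
```

With this rule a "missing" entry never matches a numeric question, which is one possible way to treat missing data, not necessarily the intended one.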
Implement partition(rows, question) as described in the docstring. Use the Determine class to partition the data into 2 groups.
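The split into 2 groups can be sketched as follows. `Threshold` is a hypothetical stand-in for the Determine class with an assumed `match(row)` method, and returning the matching rows first is also an assumption; follow the docstring for the real contract.

```python
class Threshold:
    """Hypothetical stand-in for Determine: asks "is row[column] >= value?"."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, row):
        return row[self.column] >= self.value


def partition(rows, question):
    """Split rows into those that match the question and those that do not."""
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows


rows = [[58.67, '+'], [24.5, '+'], [27.83, '-']]
yes, no = partition(rows, Threshold(0, 27.0))
print(len(yes), len(no))  # 2 1
```

Every row lands in exactly one of the two groups, so the group sizes always add up to the number of input rows.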
Implement gini(rows) as described in the docstring.
Implement info_gain(left, right, current_uncertainty) as described in the docstring.

def readfile(file_name):
    """
    This function reads a data file and returns structured and cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    i.e.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
     ...]
    A couple of things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful of data types. For instance,
       "58.67" and "0.2356" should be numbers and not strings
       "00043" should be a string and not a number
       It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    output = []
    # never use built-in names as variable names unless you mean to replace them
    # A context manager closes the file even if parsing fails
    with open(file_name, 'r') as data_:
        for list_str in data_:
            if not list_str.strip():
                continue  # skip blank lines
            # Strip only the line ending; list_str[:-1] would also cut the
            # last character of a final line that has no trailing newline
            str_list = list_str.rstrip('\r\n').split(",")
            # keep it
            # str_list.remove(str_list[len(str_list)-1])
            data = []
            for substr in str_list:
                if substr == '?':
                    # "?" marks missing data in the raw file
                    data.append('missing')
                elif substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                    # zero-padded codes such as "00043" stay strings
                    data.append(substr)
                else:
                    try:
                        # every other number becomes a float, matching the docstring example
                        data.append(float(substr))
                    except ValueError:
                        # non-numeric attribute or label: keep the string
                        data.append(substr)
            output.append(data)
    return output


def is_missing(value):
    """
    Determines if the given value is missing data. Please refer back to readfile(), where we defined what "missing" data is
    :param value: value to be checked
    :return: boolean (True, False) of whether the input value is the same as our "missing" notation
    """
    return value == 'missing'


def class_counts(rows):
    """
    Count how many data samples there are for each label
    :param rows: Input is a 2D list in the form of what you have returned in readfile()
    :return: Output is a dictionary/map in the form:
    {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
    }
    """
    # This approach is hard-coded: it only counts samples for the labels given here ('+', '-').
    # To count data with arbitrary, unknown labels, it is extended by the approaches below.
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict
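
    # Extension 1
    # This approach can count samples for any set of labels. It uses two loops: the first collects the distinct labels present in the data into a list, lable_list.
    # The second iterates over the labels in lable_list and, nested inside it, loops over all rows to count how many rows carry the same label.
    # label_dict = {}
    # lable_list = []
    # for row in rows: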