Tags: stop, fusion, fetch data, print, red, ext, ati, reverse, mat
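1. Data preprocessing
The training data for xgb should be a DataFrame (or another array type), not a plain Python list.
Read the data with pandas; here it comes from an Excel file.
import pandas as pd
data = pd.read_excel(filename, sheet_name='tablename')  # 'tablename' is a placeholder for the sheet name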
Use jieba for word segmentation.
import jieba
seg1 = jieba.cut(str(text))  # returns a generator; join the tokens with join()
seg1 = ' '.join(seg1)
seg1 = jieba.lcut(str(text))  # lcut returns a list instead
Write a function that takes the segmented text; make sure it returns a string.
def ting(content):
    content = content.split(" ")
    content = [w for w in content if w not in stopwords]  # stopwords: a collection loaded beforehand
    return " ".join(content)
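As a rough illustration of the stopword step, the sketch below applies the same filtering logic to one segmented sentence. The `stopwords` set and the sample sentence are made up for the example; in practice the set would be loaded from a stopword file.

```python
# Hypothetical stopword set, for illustration only.
stopwords = {"的", "了", "是"}


def ting(content):
    """Drop stopwords from a space-separated token string and return a string."""
    tokens = content.split(" ")
    tokens = [w for w in tokens if w not in stopwords]
    return " ".join(tokens)


# A sentence that has already been segmented and joined with spaces.
segmented = "今天 的 天气 是 晴朗 的"
cleaned = ting(segmented)
print(cleaned)
```

Returning a string rather than a list matters because CountVectorizer expects each document to be a single string.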
x_train, x_test, y_train, y_test = train_test_split(data, label, test_size=0.2)
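The split above is random on every run. A minimal sketch of a reproducible, class-balanced variant follows; `random_state` and `stratify` are real `train_test_split` parameters, but using them here is an assumption rather than part of the original code, and the toy texts and labels are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 already-segmented texts, two classes with 5 samples each.
texts = [f"doc {i}" for i in range(10)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(x_train), len(x_test))
print(sorted(y_test))
```

With 16 classes, stratifying may help keep rare classes from vanishing from the test set, provided every class has at least two samples.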
# CountVectorizer converts the words in the texts into a term-count matrix
vectorizer = CountVectorizer(max_features=5000)
# TfidfTransformer computes the TF-IDF value of each word counted by the vectorizer
tf_idf_transformer = TfidfTransformer()
# vectorizer.fit_transform() counts how often each word occurs
# tf_idf_transformer.fit_transform() turns the count matrix into TF-IDF values
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_train_weight = tf_idf.toarray()  # TF-IDF weight matrix of the training set
tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_test_weight = tf_idf.toarray()  # TF-IDF weight matrix of the test set
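To make the weights less of a black box, here is a pure-Python sketch of the formula TfidfTransformer uses with its default settings: smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The three documents are invented, and the sketch tokenizes on whitespace, whereas CountVectorizer's default token pattern keeps only tokens of two or more word characters, so single-character words are dropped there unless `token_pattern` is changed.

```python
import math
from collections import Counter

# Hypothetical corpus of already-segmented documents.
docs = ["apple banana apple", "banana cherry", "apple cherry cherry"]


def tfidf(documents):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [d.split() for d in documents]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, weights = tfidf(docs)
print(vocab)
for row in weights:
    print([round(v, 4) for v in row])
```

In the real pipeline the IDF values are learned from the training set only, which is why the test set goes through `transform()` and not `fit_transform()`.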
2. XGBoost implementation
For details, see https://blog.csdn.net/hbpartty/article/details/96098495
For the parameters, see https://blog.csdn.net/iyuanshuo/article/details/80142730
Step 1: convert the data to DMatrix format
dtrain = xgb.DMatrix(x_train_weight, label=y_train)
dtest = xgb.DMatrix(x_test_weight, label=y_test)
Step 2: define the parameters and start training
param = {'silent': 0,  # deprecated in newer XGBoost releases, which use 'verbosity' instead
         'eta': 0.3,
         'max_depth': 6,
         'objective': 'multi:softmax',
         'num_class': 16,
         'eval_metric': 'merror'}  # parameters
evallist = [(dtrain, 'train'), (dtest, 'test')]
num_round = 100  # number of boosting rounds
xgb_model = xgb.train(param, dtrain, num_round, evallist)
# Save the trained model
xgb_model.save_model('data/xgb_model')
y_predict = xgb_model.predict(dtest)  # predict with the model
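With 'multi:softmax', predict() returns the predicted class index for each row (as floats), and labels are expected to lie in the range 0 to num_class - 1. The following pure-Python sketch shows the idea behind that objective and behind the 'merror' metric: per-class scores go through softmax, the largest probability picks the class, and merror is the fraction of wrong predictions. The scores and labels are invented and the helper names are hypothetical; this is not XGBoost's internal code.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Pick the class index with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def merror(y_true, y_pred):
    """Multiclass error rate: wrong predictions divided by all predictions."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical raw scores for 3 samples and 3 classes.
raw_scores = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.3, 0.9, 0.4]]
y_true = [0, 2, 0]
y_pred = [predict_class(s) for s in raw_scores]

print(y_pred)
print(merror(y_true, y_pred))
```

If per-class probabilities are needed instead of a single class index, the 'multi:softprob' objective returns them.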
label_all = categories
confusion_mat = metrics.confusion_matrix(y_test, y_predict)
df = pd.DataFrame(confusion_mat, columns=label_all)
df.index = label_all
print('Accuracy:', metrics.accuracy_score(y_test, y_predict))
print('Confusion matrix:', df)
print('Classification report:', metrics.classification_report(y_test, y_predict))
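For reading the evaluation output, it may help to see how the numbers relate. The sketch below rebuilds a confusion matrix by hand with the same orientation as scikit-learn (rows are true labels, columns are predicted labels) and derives accuracy, recall, and precision from it. The labels are made up and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs: rows are true labels, columns are predictions."""
    mat = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        mat[int(t)][int(p)] += 1  # int() also handles float predictions such as 1.0
    return mat


# Hypothetical labels for 6 samples and 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0.0, 1.0, 1.0, 1.0, 2.0, 0.0]

mat = confusion_counts(y_true, y_pred, 3)

# Accuracy is the diagonal (correct predictions) divided by the total count.
accuracy = sum(mat[i][i] for i in range(3)) / len(y_true)
# Recall for class 0: correct class-0 predictions over all true class-0 samples.
recall_0 = mat[0][0] / sum(mat[0])
# Precision for class 1: correct class-1 predictions over everything predicted as 1.
precision_1 = mat[1][1] / sum(row[1] for row in mat)

for row in mat:
    print(row)
print(accuracy, recall_0, precision_1)
```

The per-class precision and recall computed this way are the quantities that the classification report lists for each class.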
Original post: https://www.cnblogs.com/acthis/p/13269597.html