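With the data in hand, the rest is assembly-line work: use some machine learning algorithm to learn a model, then use the model to make predictions, and finally evaluate the model's performance.
1 Splitting the training and test sets
Python's machine learning package sklearn is very powerful. It contains not only supervised and unsupervised learning algorithms but also functions for common preprocessing and other workflow steps. The function for splitting training and test sets is simple, but it too is included in sklearn.
X usually denotes the feature data of a number of samples. It can be a Python list, in which case X is a list of lists: len(X) is the number of samples and X[0] is the feature vector of one sample.
y usually denotes the target values. In this problem we need to predict whether a small square image block contains a digit, so each target value is True or False.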
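# sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)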
The code above holds out 20% of the data for testing and uses the remaining 80% to train the model.
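To make the shapes of X and y concrete, here is a minimal sketch of the split on made-up data. The random 5x5 "images", the sample count of 100 and the random_state value are all assumptions for illustration, not the CAPTCHA data used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 "images" of 5x5 pixels, flattened to 25 features each.
rng = np.random.RandomState(0)
X = rng.rand(100, 25).tolist()        # list of lists; len(X) is the number of samples
y = (rng.rand(100) > 0.5).tolist()    # True/False target for each sample

# random_state only makes the shuffle reproducible; it is not required.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X), len(X[0]))              # 100 25
print(len(X_train), len(X_test))      # 80 20
```

The split shuffles the samples before dividing them, so the test set is a random 20% rather than the last 20% of the list.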
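2 Training the model
The learning algorithms in sklearn share a very uniform interface, which means you can swap one learning algorithm for another by changing just a few places.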
from sklearn import linear_model
clf = linear_model.LogisticRegression()
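Here I use logistic regression, about the simplest choice. After importing the package, you instantiate the class to get a classifier. Some learning algorithms are usually given parameters at initialization, for example the kernel type for a support vector machine; logistic regression works with its defaults here.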
clf.fit(X_train,y_train)
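This trains the model. Different algorithms do completely different computations behind this step, but the interface in sklearn is the same for all of them. This completes fitting (learning) the model.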
y_predict = clf.predict(X_test)
The classifier's predict method makes predictions for the test samples and returns the predicted results y_predict.
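The uniform interface is easiest to see by running two algorithms through the same fit/predict steps. The sketch below is only an illustration: the synthetic data from make_classification and the choice of a linear-kernel SVC as the second algorithm are my assumptions, not what the post actually trained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the post uses CAPTCHA image blocks).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Only the constructor call differs between the two algorithms.
classifiers = [
    ("logistic regression", LogisticRegression()),
    ("SVM with linear kernel", SVC(kernel="linear")),
]

predictions = {}
for name, clf in classifiers:
    clf.fit(X_train, y_train)                 # same training call for both
    predictions[name] = clf.predict(X_test)   # same prediction call for both
    print(name, "accuracy:", clf.score(X_test, y_test))
```

Everything after the constructor is identical, which is what makes it cheap to try several algorithms on the same data.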
3 Evaluating model performance
You don't need to worry about this either; sklearn takes care of all of it.
from sklearn import metrics
print('Confusion matrix:\n%s' % metrics.confusion_matrix(y_test, y_predict))
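Evaluating a model systematically requires the confusion matrix, which holds the counts of true positive, true negative, false positive, and false negative samples.
The output looks like this: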
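| | Predicted False | Predicted True |
| Actual False | true negatives: 76 | false positives: 1 |
| Actual True | false negatives: 4 | true positives: 56 |
Of course, I added the table headers and labels myself. sklearn puts the true labels on the rows and the predicted labels on the columns, which is also the only reading consistent with the precision and recall values reported further down.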
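For a long time I was puzzled about what true and false, positive and negative meant; textbooks seem to consider it too simple to spell out. But I see that many people still don't understand it, and perhaps the Chinese translation of the terms adds another layer of difficulty.
Positive and negative refer to the predicted result.
Positive means predicted True, usually the rarer of the two classes, such as having a certain disease or an image containing a digit; negative means predicted False.
True means the prediction is correct: a true negative (True Negative) is a sample that is predicted negative and really is negative. False means the prediction is wrong.
I didn't expect that with simple logistic regression most predictions would turn out "true", that is, correct.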
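If the four names are still hard to keep apart, it can help to pull them out of a tiny matrix by hand. This is a minimal sketch on eight made-up labels (hypothetical, not the post's data); it relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (hypothetical labels).
y_true = [False, False, False, True, True, True, True, False]
y_pred = [False, True, False, True, False, True, True, False]

# Rows: true False, true True. Columns: predicted False, predicted True.
cm = confusion_matrix(y_true, y_pred)

# Flattening row by row gives TN, FP, FN, TP for a binary problem.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))  # TN=3 FP=1 FN=1 TP=3
```

The second word (positive/negative) is what the model said; the first word (true/false) is whether it was right.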
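Judging a model by the counts of true positives, true negatives, false positives, and false negatives is still not very intuitive. Going one step further, precision and recall (I call the latter the detection rate) are computed from these counts.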
print('Classification report for classifier %s:\n%s\n' % (clf, metrics.classification_report(y_test, y_predict)))
              precision    recall  f1-score   support
       False       0.95      0.99      0.97        77
        True       0.98      0.93      0.96        60
 avg / total       0.96      0.96      0.96       137
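Precision is the fraction of predictions of a given class that are correct. It can be computed separately for the samples predicted True and for the samples predicted False:
precision(True) = true positives / (true positives + false positives)
Recall is the fraction of samples truly belonging to a class that are predicted correctly:
recall(True) = true positives / (true positives + false negatives)
The f1-score combines precision and recall; it becomes very small as soon as either recall or precision is close to 0:
f1-score = 2 * precision * recall / (precision + recall)
Overall, logistic regression performs unexpectedly well: average precision and recall are both 0.96, and no per-class value is below 0.93.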
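As a check on the formulas, the three numbers for the True class can be recomputed by hand. This sketch assumes the four counts from the confusion matrix are read with scikit-learn's convention (rows are true labels, columns are predicted labels), giving TN=76, FP=1, FN=4, TP=56; the label lists are rebuilt from those counts, not taken from the real test set.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts read from the confusion matrix, assuming rows = true labels
# and columns = predicted labels.
tn, fp, fn, tp = 76, 1, 4, 56

# Rebuild label lists that produce exactly these counts.
y_true = [False] * (tn + fp) + [True] * (fn + tp)
y_pred = [False] * tn + [True] * fp + [False] * fn + [True] * tp

# The formulas from the text, for the True class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("by hand: %.2f %.2f %.2f" % (precision, recall, f1))

# The same values from sklearn, with True as the positive class.
print("sklearn: %.2f %.2f %.2f" % (
    precision_score(y_true, y_pred, pos_label=True),
    recall_score(y_true, y_pred, pos_label=True),
    f1_score(y_true, y_pred, pos_label=True)))
```

Both lines should agree with the True row of the classification report (0.98, 0.93, 0.96).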
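4 Behind the magic
Logistic regression is a very simple linear model. As in regression analysis, each feature value is multiplied by its corresponding constant and the products are summed together with an intercept. If the result is greater than 0 the block is considered to contain a digit; if it is less than 0 the block is considered to contain none.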
clf.coef_
This returns the model's coefficients. Each coefficient corresponds to one feature of the image being analyzed, that is, one pixel.
array([[ -2.56139945e-03, -2.75258012e-03, -5.31968495e-04, 3.97503197e-03, -4.44468541e-03, 6.81884865e-03, 5.02848092e-03, 3.71932312e-03, 5.39551134e-03, 9.25196949e-03, 4.36739083e-03, 7.08357030e-03, 5.84996428e-03, 5.05861727e-03, 5.86927009e-03, 2.12006563e-04, 1.83236348e-03, 2.86887367e-04, -1.60054788e-03, 6.11420888e-04, 1.08336757e-04, 2.49622737e-03, 3.74562382e-03, 6.13236412e-03, 3.33607269e-03, -3.24881692e-03, 5.74140904e-04, -1.22561879e-03, 4.37700792e-03, 1.76217248e-03, -1.24557500e-03, 2.61096358e-03, 2.30601120e-03, -2.83905385e-03, -1.19670904e-03, -8.19275158e-04, -6.44944632e-04, -5.05038691e-04, 5.52497690e-03, 2.05702811e-03, -2.43458886e-03, -2.83737410e-05, 6.78199654e-04, -1.28987251e-03, 4.56909934e-03, -1.01416535e-03, -4.23644789e-05, -2.83648771e-03, 1.68822571e-04, -7.60660440e-05, 3.36552860e-03, -1.11415804e-03, -9.63607637e-04, 3.31942394e-03, 5.72105593e-03, -1.35952444e-05, -6.58437051e-04, -3.82020702e-04, -1.27826080e-03, -7.99044797e-04, -5.67146839e-03, -4.25316734e-03, 1.83626714e-03, -2.78343826e-03, -2.07640734e-03, -3.49593939e-03, -1.70463105e-03, -3.84863781e-03, 8.24664241e-04, 1.50409312e-03, 2.90331874e-03, -3.03167979e-03, 1.81563441e-03, -1.52265512e-03, 3.36457675e-03, 4.81122573e-04, 2.26554206e-03, -2.35301784e-03, 8.52133302e-04, -3.47625137e-03, -4.69526778e-03, -9.23085091e-04, -2.65283197e-03, -1.13519152e-03, -4.13610316e-03, -9.66252318e-04, 7.55483133e-04, 3.15259161e-03, -5.27083518e-03, 2.07319627e-03, 1.03384540e-03, 1.32133461e-03, -1.97213479e-03, 4.00445941e-03, -3.39089764e-03, 2.66239249e-04, -5.56404297e-04, -8.03310870e-03, -3.00343377e-03, -6.56676988e-03, 2.26530299e-03, -3.93015386e-03, 2.89514964e-03, -3.20929410e-03, -2.43164834e-03, 2.08445894e-03, 1.66398867e-03, -5.30108888e-03, -1.34884685e-03, 5.18203522e-04, 5.01436351e-04, -6.67828433e-04, -1.91048336e-03, -9.78206074e-04, -1.01926859e-02, -3.76850966e-03, -4.69293942e-03, -5.78586568e-03, -3.10393223e-03, -5.32801075e-03, 
1.50549060e-03, 2.52518032e-03, -2.37414841e-03, -9.49611291e-04, -1.01359478e-03, 5.64377850e-03, 2.53662479e-03, 2.08825692e-04, -3.38701845e-03, 5.19172076e-04, -1.46759524e-03, -3.27752577e-04, 8.11867682e-04, 3.59883974e-03, -2.54373438e-03, -5.90755272e-03, -4.83954063e-03, -9.04861523e-03, -1.38052393e-03, 9.42032985e-04, 1.90854533e-03, -3.78755042e-03, 4.42240294e-04, -3.72275984e-05, 1.12836339e-03, -5.13071609e-04, -1.38829079e-03, 6.76082019e-04, -1.43772760e-03, -5.17576299e-03, -7.29235584e-03, -3.08174424e-03, 2.12773740e-03, -3.76542728e-03, -3.51670263e-04, -6.59119706e-03, -3.79001246e-03, -8.97712108e-04, 1.39573714e-03, 1.03794597e-03, 7.38581141e-03, -6.09842283e-04, 4.28880895e-03, -5.85887446e-03, 2.75440511e-03, 4.75591104e-03, 2.70668279e-03, -6.12732885e-03, -2.63392240e-03, -1.22843444e-03, -6.19796298e-03, -5.48805535e-03, 7.95625236e-05, 6.97994750e-04, -2.60386848e-03, 3.67855058e-04, 3.14745357e-03, -2.93531270e-03, 5.39926700e-03, 3.27031297e-03, 4.72582671e-03, 1.95212471e-03, -5.23686231e-03, 1.04283598e-04, -4.96368657e-03, -1.41585781e-04, -2.21140099e-03, 1.20421926e-03, -5.00210160e-03, 1.38909431e-03, -3.53666741e-03, 6.19806131e-04, 2.75729680e-03, 7.31164464e-04, -6.29700739e-03, -5.93013031e-04, 6.04825323e-03, -2.84917346e-03, 7.99351601e-03, 4.47589057e-03, 3.10468824e-03, 2.51596859e-03, 2.57717786e-04, -3.01800336e-03, 5.77308452e-04, 2.11532790e-03, 9.56314260e-06, -1.31857971e-03, 5.59309822e-04, -3.46348089e-03, -3.18747290e-03, -1.23120806e-03, -3.74417132e-03, 4.91736080e-04, 1.06464721e-03, 1.51992610e-03, 3.06938016e-03, 3.91741249e-03, 1.23027608e-02, 4.67528488e-04, 2.77043461e-03, -9.76654188e-04, -1.07911245e-02, -5.08900112e-03, -2.32087989e-03, -6.48131799e-03, -5.74448577e-03, -3.15094097e-04, 2.34358750e-03, -2.86364443e-03, -4.95540054e-04, 2.80312553e-03, 1.10865982e-03, 1.44602453e-03, 5.76924197e-03, 7.13387692e-05, 9.62853757e-04, 8.73790791e-04, 5.19818527e-03, 2.37576646e-03, 5.79825096e-04, 
3.03416588e-05, -4.04365432e-03, -9.24804973e-04, 5.84764772e-03, 1.99951794e-03, -2.93143644e-03, 1.33716200e-04, -7.73417123e-05, 6.13021426e-03, -1.17824922e-03, 4.51548244e-03, 2.01647381e-03, 3.35221498e-03, 2.92103954e-03, -1.65440967e-03, -1.84581127e-03, -6.64682948e-03, 3.89793301e-04, 3.35493706e-03, -3.56240877e-03, -6.02756394e-03, -1.53553401e-03, 1.32827858e-03, -3.74875999e-03, -3.36515528e-03, 9.11400046e-04, 4.68510214e-03, 3.81594242e-03, 4.43658898e-03, 1.53881614e-03, 2.82551066e-03, 1.53655132e-03, 2.54271293e-03, -2.69429440e-03, 5.69739019e-04, 2.15592781e-03, -2.27916466e-03, -1.49687487e-03, 5.19139428e-05, 2.81137298e-03, -1.22697041e-03, -1.15348586e-03, -2.14934244e-03, 4.47759284e-04, -1.00424467e-03, 2.08708304e-03, 2.75652879e-03, 3.38016036e-03, 2.33732861e-03, 1.46094873e-03, 7.35184525e-03, 1.50941689e-03, -7.40743881e-04, -1.61913515e-03, -5.25394588e-03, 2.72163685e-03, -2.78606942e-03, -4.16177660e-03, -2.68669363e-03, -5.19658681e-03, -5.16910663e-03, -2.63763143e-03, -3.62434399e-03, -1.02610653e-03, -9.10417767e-04, 2.47384678e-03, 1.47010078e-03, 3.97049150e-03, 1.59091370e-03, -2.15634092e-03, -1.71951499e-03, -1.77312622e-03, -4.59520849e-03, 4.11194688e-03, -6.92854270e-03, -4.37689748e-03, -5.21307441e-03, -2.63619132e-03, 4.23279802e-03, -2.26747150e-03, 2.06543571e-03, 5.32133709e-03, 2.70080747e-03, 3.30225323e-03, 5.25671231e-03, 2.49122812e-03, 4.64310922e-03, -4.76939533e-03, -3.57712728e-03, 4.47400505e-03, -3.04562602e-03, -5.72868439e-03, 1.66318591e-04, -1.04108616e-03, -2.03108548e-03, -4.74736009e-04, 1.72270514e-03, 1.11208635e-03, 4.40334390e-04, -2.48325165e-03, 5.50780677e-03, 3.64594260e-03, 1.94247691e-03, 2.73757992e-03, 4.95431117e-03, -1.04369763e-03, -4.29123006e-04, 3.26100602e-03, -5.71818631e-03, -4.65439326e-03, -8.02655959e-03, -3.45473022e-03, 4.02523699e-03, 3.40183005e-03, -1.02538809e-03, -6.02059733e-04, 1.33523607e-03, -1.83807032e-03, 3.93462664e-03, -2.54725629e-03, 1.42075384e-03, 
4.34467357e-03, 4.95094928e-03, 1.77358310e-04, -4.56130544e-03, 1.25794322e-03, 5.90246498e-04, 4.63643052e-04, -7.69648223e-03, -2.80739980e-03, -6.78112020e-03, -2.14894858e-03, -1.81469401e-03, -2.11669943e-03, 4.02096206e-03, 2.51420874e-03, 1.80614576e-04, 1.51796043e-03, 3.86622406e-03, 1.59411717e-03, 2.20409364e-03, 6.29895833e-04, -4.22056706e-03, -4.09798177e-03, 4.16094897e-04, -8.46606579e-04, -2.68709134e-03, -2.69588890e-03, -2.82040061e-03, -1.34632735e-03, 7.53324811e-04, -8.18104595e-04, 1.64211467e-03, 4.54944121e-03, 1.27077831e-03, -4.71765564e-03, 7.77776618e-04, 9.94912884e-04, 5.11114494e-05, 2.67684556e-04, -2.82292759e-03, -3.96944658e-03, -6.13793890e-03, -6.52427326e-04, -1.85522869e-03, -1.42620355e-03, 2.70045514e-04, 3.07247472e-03, 3.54542386e-03, 5.50694470e-03, -1.48702671e-03, 1.19550942e-03, 2.22658765e-03, 2.10573442e-03, -5.88441942e-05, 2.21007257e-03, -1.07699489e-04, -4.54425504e-03, -1.07385611e-03, -3.67573528e-04, 3.44609201e-04, -3.16044812e-03, -3.36530877e-03, -3.95622536e-03, 1.43149147e-03, -3.31763110e-03, -3.44537238e-03, -2.35134639e-03, 1.79640507e-03, 1.02597557e-03, 3.45353045e-03, -4.86053025e-03, 1.83903418e-03, -4.08906445e-04, -1.86879935e-05, 1.56767365e-03, 4.69210716e-04, 1.56072497e-03, 3.19265117e-03, -3.30414162e-03, 9.49158185e-05, 3.12229776e-03, -5.12022873e-03, -5.85707486e-03, -2.29236547e-03, -6.34190433e-03, -7.35152452e-03, -8.71900345e-04, 3.36665007e-03, -4.83359118e-03, -4.07594388e-03, 1.90616778e-03, 2.77873920e-03, 4.97290557e-03, 7.76535909e-03, 4.32362637e-03, 1.29321850e-03, -3.67396968e-03, -8.44682654e-04, 3.28271837e-03, -1.20730993e-03, -1.96092533e-03, 4.12536967e-03, 1.37496600e-03, 3.62493853e-03, -5.82427193e-03, -7.32347050e-03]])
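To see that these coefficients really are the whole model, the prediction can be redone by hand as a weighted sum plus the intercept, compared against 0. The sketch below uses synthetic data from make_classification as a stand-in (an assumption; the post's image data is not available here), and compares the manual result with the classifier's own decision_function and predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the CAPTCHA blocks).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = LogisticRegression().fit(X, y)

# Weighted sum of the features plus the intercept: one score per sample.
scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# A score above 0 maps to the second class in clf.classes_, otherwise the first.
manual = clf.classes_[(scores > 0).astype(int)]

print(np.allclose(scores, clf.decision_function(X)))  # True
print(np.array_equal(manual, clf.predict(X)))         # True
```

Passing the score through the logistic function turns it into a probability, but the True/False decision depends only on the sign of the score.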
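If you don't believe the results are this good, look at the predictions for a CAPTCHA after it has been split into small blocks.
A red border means the block is predicted to contain a digit.
I have to say, my first day in the lab felt great.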
Original article: http://www.cnblogs.com/meelo/p/4314423.html