Copyright notice: This post is the author's original work. If you repost it, please credit the source @ http://blog.csdn.net/gamer_gyt
======================================================================
This blog series follows the algorithms on the official Scikit-Learn website, with partial translation. If you find errors, please point them out.
======================================================================
For another post on logistic regression analysis based on Machine Learning in Action, which implements the regression model in Python code, see the "click to read" link in the original post.
There is also a pure hands-on case study: Logistic Regression Model in Practice: Machine Learning in Action, Logistic Regression Algorithm (2).
Contents:
1. Concepts
2. Simple Linear Regression
3. Multiple Linear Regression
4. Nonlinear Regression (Logistic Regression)
I: Concepts
1: Measures of central tendency
1.1 Mean (average)
1.2 Median: sort all the values in the data; the value in the middle is the median.
    If the count is odd, take the middle value.
    If the count is even, take the average of the two middle values.
1.3 Mode: the value that appears most often in the data.
2: Measures of dispersion
2.1 Variance: the average squared deviation from the mean; for a sample, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$.
2.2 Standard deviation: the square root of the variance, $s = \sqrt{s^2}$. A short sketch computing these measures follows this list.
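A minimal sketch of these descriptive statistics using Python's standard library (the sample values are illustrative, not from the original post):

import statistics

data = [1, 2, 2, 3, 4, 8, 9]  # illustrative sample

print("mean:", statistics.mean(data))          # average value
print("median:", statistics.median(data))      # middle value after sorting
print("mode:", statistics.mode(data))          # most frequent value
print("variance:", statistics.variance(data))  # sample variance (n-1 denominator)
print("std dev:", statistics.stdev(data))      # square root of the variance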
3: Correlation in regression
3.1 Pearson correlation coefficient
A measure of the strength of the linear relationship between two variables, ranging over [-1, 1]. It is computed as:
$r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2 \sum_{i}(y_i-\bar{y})^2}}$
4: R-squared
The coefficient of determination (goodness of fit) measures the proportion of the total variation in the dependent variable that can be explained by the regression relationship with the independent variables; it ranges over [0, 1]. The larger the coefficient of determination, the larger the share of the total variation accounted for by the model and the better the fit; conversely, a small value means the model fits the observed data poorly.
Interpretation: an R-squared of 0.8 means the regression relationship explains 80% of the variation in the dependent variable. Put another way, if we could hold the independent variables fixed, the variation in the dependent variable would shrink by 80%.
For simple linear regression, $R^2 = r \times r$ (the square of the Pearson coefficient).
For multiple linear regression, $R^2 = \frac{SSR}{SST}$,
where SSR is the regression (explained) sum of squares, $\sum_i(\hat{y}_i-\bar{y})^2$, and SST is the total sum of squares, $\sum_i(y_i-\bar{y})^2$.
R-squared also has a limitation: it never decreases as more independent variables are added to the model (which is why adjusted R-squared is often used).
5: Example: computing the Pearson correlation coefficient and the R-squared value
import numpy as np
import math


def computeCorrelation(X, Y):
    """Pearson correlation coefficient between X and Y."""
    xBar = np.mean(X)
    yBar = np.mean(Y)
    SSR = 0   # accumulates the cross-deviation products (numerator)
    varX = 0  # accumulates squared deviations of X
    varY = 0  # accumulates squared deviations of Y
    for i in range(0, len(X)):
        diffXXBar = X[i] - xBar
        diffYYBar = Y[i] - yBar
        SSR += diffXXBar * diffYYBar
        varX += diffXXBar ** 2
        varY += diffYYBar ** 2
    SST = math.sqrt(varX * varY)  # denominator
    return SSR / SST


def polyfit(x, y, degree):
    """Fit a polynomial of the given degree and report its R-squared."""
    results = {}
    coeffs = np.polyfit(x, y, degree)
    results['polynomial'] = coeffs.tolist()

    p = np.poly1d(coeffs)
    yhat = p(x)                  # predicted values
    ybar = np.sum(y) / len(y)    # mean of the observed values

    ssreg = np.sum((yhat - ybar) ** 2)  # regression (explained) sum of squares
    sstot = np.sum((y - ybar) ** 2)     # total sum of squares
    results['determination'] = ssreg / sstot

    print("results:", results)
    return results


testX = [1, 3, 8, 7, 9]
testY = [10, 12, 24, 21, 34]

print("r :", computeCorrelation(testX, testY))
print("r^2 :", computeCorrelation(testX, testY) ** 2)

print(polyfit(testX, testY, 1)["determination"])
Running the script shows r ≈ 0.940 and r² ≈ 0.884 (values rounded); polyfit reports the same determination value, confirming that R² = r² for a degree-1 fit.
II: Simple Linear Regression
1: Regression vs. classification
Regression: the target variable Y is a continuous numeric value, e.g. house price, head count, rainfall.
Classification: the target variable Y is categorical, e.g. color, computer brand, creditworthy or not.
2: Introduction to simple linear regression
Regression analysis: building an equation to model how two or more variables are related.
Regression model: the equation describing the relationship between the dependent variable (y), the independent variable (x), and the error term: $y = \beta_0 + \beta_1 x + \varepsilon$.
Simple linear regression equation: taking the expectation of the model gives the regression equation $E(y) = \beta_0 + \beta_1 x$ (a straight line whose parameters are the slope and the y-intercept).
A linear relationship can be positive, negative, or absent.
Estimated regression equation: $\hat{y} = b_0 + b_1 x$, with least-squares estimates $b_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}$ and $b_0 = \bar{y} - b_1\bar{x}$.
About the error term: $\varepsilon$ captures the variation in y not explained by the linear relationship and is assumed to be random with mean 0.
3: Simple linear regression example
import numpy as np

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]


def fitSLR(x, y):
    """Least-squares fit of a simple linear regression y = b0 + b1*x."""
    n = len(x)
    xBar = np.mean(x)
    yBar = np.mean(y)
    denominator = 0
    numerator = 0
    for i in range(0, n):
        numerator += (x[i] - xBar) * (y[i] - yBar)
        denominator += (x[i] - xBar) ** 2

    print("denominator:", denominator)
    print("numerator:", numerator)

    b1 = numerator / float(denominator)
    b0 = yBar - b1 * xBar
    return b0, b1


def predict(b0, b1, x):
    return b0 + b1 * x


b0, b1 = fitSLR(x, y)

x_test = 6
print("y_test:", predict(b0, b1, x_test))
III: Multiple Linear Regression
1: Introduction to multiple regression
Difference from simple linear regression: there are multiple independent variables $x_1, \dots, x_p$.
Multiple regression model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$.
Multiple regression equation: $E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$.
Estimated multiple regression equation: $\hat{y} = b_0 + b_1 x_1 + \dots + b_p x_p$ (y becomes $\hat{y}$, i.e. we compute an estimate).
Estimation method: least squares, choosing the $b_j$ that minimize $\sum_i (y_i - \hat{y}_i)^2$; see the sketch below.
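A minimal sketch of the least-squares estimate in matrix form (a standard derivation, not from the original post): with a leading column of ones for the intercept, the estimate is $b = (X^TX)^{-1}X^Ty$.

import numpy as np

# Hypothetical data: intercept column plus two features, five samples.
X = np.array([[1.0, 100, 4],
              [1.0,  50, 3],
              [1.0, 100, 2],
              [1.0,  80, 2],
              [1.0,  60, 5]])
y = np.array([9.3, 4.8, 6.5, 6.2, 7.0])

# Solve the normal equations (X^T X) b = X^T y; solve() is more stable than inverting.
b = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept and coefficients:", b)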
2: Multiple linear regression example
The data we need are the second, third, and fourth columns of the data file.
from sklearn import linear_model
import numpy as np
from numpy import genfromtxt

datapath = "data.csv"
deliverData = genfromtxt(datapath, delimiter=",")

print("data:", deliverData)

X = deliverData[:, :-1]  # all columns except the last are features
Y = deliverData[:, -1]   # the last column is the target
print("X:", X)
print("Y:", Y)

regr = linear_model.LinearRegression()
regr.fit(X, Y)

print("coefficients:", regr.coef_)
print("intercept:", regr.intercept_)

x_pre = [[102, 6]]  # predict() expects a 2D array of samples
y_pre = regr.predict(x_pre)
print("Y-Predict:", y_pre)
3: What if some independent variables are categorical (categorical data)?
e.g.:
First convert the categorical variable into dummy columns (the fourth, fifth, and sixth columns stand for the levels 0, 1, and 2; a 1 means that car model was used). A sketch of this encoding step follows the code below.
The calling code is essentially the same as above:
from numpy import genfromtxt
import numpy as np
from sklearn import datasets, linear_model

dataPath = "dataDumpy.csv"
deleveryData = genfromtxt(dataPath, delimiter=',')

print("data:\n", deleveryData)

X = deleveryData[:, :-1]
Y = deleveryData[:, -1]
print("X: ", X)
print("Y: ", Y)

regr = linear_model.LinearRegression()
regr.fit(X, Y)

print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)

xPred = [[102, 6, 0, 0, 1]]  # last three columns are the dummy-coded car model
yPred = regr.predict(xPred)
print("predict y : ", yPred)
4: About the error
The error is the gap between observed and predicted values, commonly summarized by the sum of squared errors $\sum_i (y_i - \hat{y}_i)^2$ or the mean squared error. A sketch follows.
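A minimal sketch of measuring these errors for a fitted model (the data are hypothetical, shaped like the delivery example above):

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Hypothetical data with the same shape as the delivery example.
X = np.array([[100, 4], [50, 3], [100, 2], [80, 2], [60, 5]], dtype=float)
Y = np.array([9.3, 4.8, 6.5, 6.2, 7.0])

regr = linear_model.LinearRegression().fit(X, Y)
Y_hat = regr.predict(X)                      # in-sample predictions

print("SSE:", ((Y - Y_hat) ** 2).sum())      # sum of squared errors
print("MSE:", mean_squared_error(Y, Y_hat))  # average squared error
print("R^2:", regr.score(X, Y))              # coefficient of determination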
IV: Nonlinear Regression
The nonlinear regression covered here is logistic regression (a nonlinear model used for classification).
1: Probability
Probability measures how likely an event is to occur; it ranges from 0 to 1 and can be estimated from personal belief, historical data, or simulated data.
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.
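A quick worked example with illustrative numbers: if $P(B) = 0.5$ and $P(A \cap B) = 0.2$, then
$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.2}{0.5} = 0.4$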
Example: batch gradient descent (applied here to a linear model; the same optimization procedure is used to fit logistic regression):
import numpy as np
import random


def gradientDescent(X, Y, theta, alpha, m, numIterations):
    """Batch gradient descent: step theta against the cost gradient each iteration."""
    x_trans = X.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(X, theta)         # current predictions
        loss = hypothesis - Y                 # residuals
        cost = np.sum(loss ** 2) / (2 * m)    # squared-error cost
        print("Iteration %d | Cost: %f" % (i, cost))
        gradient = np.dot(x_trans, loss) / m  # gradient of the cost w.r.t. theta
        theta = theta - alpha * gradient      # descent step
    return theta


def genData(numPoints, bias, variance):
    """Generate y = x + bias plus uniform noise; X carries a column of ones for the intercept."""
    X = np.zeros(shape=(numPoints, 2))
    Y = np.zeros(shape=numPoints)
    for i in range(0, numPoints):
        X[i][0] = 1  # intercept term
        X[i][1] = i
        Y[i] = (i + bias) + random.uniform(0, 1) * variance
    return X, Y


X, Y = genData(100, 25, 10)

m, n = np.shape(X)

numIterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(X, Y, theta, alpha, m, numIterations)

print("theta: ", theta)
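The demo above applies gradient descent to a linear model. For logistic regression itself, scikit-learn provides LogisticRegression; here is a minimal sketch on hypothetical toy data (hours studied vs. pass/fail):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-classification data: hours studied -> pass (1) / fail (0).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print("coef:", clf.coef_, "intercept:", clf.intercept_)
print("P(pass | 2.2 hours):", clf.predict_proba([[2.2]])[0][1])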
My regression-analysis notes on Youdao Cloud Notes are linked in the original post, for anyone interested. Reposted from: scikit-learn学习之回归分析 (Regression Analysis with scikit-learn).
Original source: http://www.cnblogs.com/harvey888/p/5852757.html