Learning Big Data, Day 5: A Python Implementation of Least Squares (Part 2)

Date: 2016-04-26 19:48:44

1.numpy.random.normal

numpy.random.normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently [R250], is often called the bell curve because of its characteristic shape (see the example below).

The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution [R250].

Parameters:

loc : float

Mean (“centre”) of the distribution.

scale : float

Standard deviation (spread or “width”) of the distribution.

size : int or tuple of ints, optional

Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
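As a small illustration of the `size` parameter described above, the sketch below draws a single value, a 1-D array, and a 3-D array. The seed and the particular `loc`/`scale` values are arbitrary choices made here for the example, not something the documentation prescribes.

```python
import numpy as np

# Fixing the seed is an assumption made here so the run is repeatable.
np.random.seed(0)

# size=None (the default): a single scalar sample is returned.
single = np.random.normal(loc=0.0, scale=1.0)

# size as an int: a 1-D array with that many samples.
vector = np.random.normal(loc=5.0, scale=2.0, size=4)

# size as a tuple (m, n, k): an array of that shape, m * n * k samples in total.
cube = np.random.normal(size=(2, 3, 4))

print(np.isscalar(single))
print(vector.shape)
print(cube.shape, cube.size)
```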

See also

scipy.stats.distributions.norm: probability density function, distribution or cumulative density function, etc.

Notes

The probability density for the Gaussian distribution is

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

where $\mu$ is the mean and $\sigma$ the standard deviation. The square of the standard deviation, $\sigma^2$, is called the variance.

The function has its peak at the mean, and its “spread” increases with the standard deviation (the function reaches 0.607 times its maximum at $\mu + \sigma$ and $\mu - \sigma$ [R250]). This implies that numpy.random.normal is more likely to return samples lying close to the mean, rather than those far away.
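The 0.607 figure can be checked numerically: one standard deviation away from the mean, the density drops by a factor of exp(-1/2) relative to its peak. The helper `normal_pdf` below is a name introduced here for illustration; it simply evaluates the Gaussian density formula, and the values of `mu` and `sigma` are arbitrary.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma (illustrative helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 0.0, 0.1  # arbitrary example values

peak = normal_pdf(mu, mu, sigma)                        # the maximum is at the mean
ratio_right = normal_pdf(mu + sigma, mu, sigma) / peak  # one sigma above the mean
ratio_left = normal_pdf(mu - sigma, mu, sigma) / peak   # one sigma below the mean

# Both ratios should match exp(-0.5), independent of mu and sigma.
print(ratio_right, ratio_left, np.exp(-0.5))
```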

References

[R249] Wikipedia, “Normal distribution”, http://en.wikipedia.org/wiki/Normal_distribution

[R250] P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th ed., 2001, pp. 51, 51, 125.

Examples

Draw samples from the distribution:

>>> import numpy as np
>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)

Verify the mean and the variance:

>>> abs(mu - np.mean(s)) < 0.01
True

>>> abs(sigma - np.std(s, ddof=1)) < 0.01
True

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()


2.numpy.random.randn

import numpy as np
np.random.randn(2,3)

array([[ 0.59941534,  1.0991949 ,  1.36316028],
       [-0.01979197,  1.30783162, -0.69808199]])

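This draws random samples from the standard normal distribution (mean 0, standard deviation 1); here the result is a 2x3 array.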
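Since `randn` samples the standard normal distribution, shifting and scaling its output gives samples with any other mean and standard deviation, which is what `np.random.normal(mu, sigma, ...)` provides directly. The sketch below compares the two statistically; the seed, sample count, and the values of `mu` and `sigma` are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(42)  # seed chosen here only so the run is repeatable

mu, sigma = 3.0, 0.5
n = 100000

# Shift and scale standard normal samples...
scaled = mu + sigma * np.random.randn(n)
# ...or ask np.random.normal for the target distribution directly.
direct = np.random.normal(mu, sigma, size=n)

# Both sample sets should have mean close to mu and std close to sigma.
print(scaled.mean(), scaled.std(ddof=1))
print(direct.mean(), direct.std(ddof=1))
```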

3.scipy.optimize.leastsq

Least squares

import numpy as np
from scipy.optimize import leastsq

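# The function to fit: x is the independent variable, p holds the parameters
def fun(x, p):
    a, b = p
    return a*x + b

# Residual between the fitted values and the observed data: p holds the parameters being fitted, x and y are the observed data
def residuals(p, x, y):
    return fun(x, p) - y

# A set of observed data, generated with a=2, b=1
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y1 = np.array([3, 5, 7, 9, 11, 13], dtype=float)

# Call leastsq: the first argument is the residual function, the second is the initial guess, and args carries the extra arguments passed to the residual function
r = leastsq(residuals, [1, 1], args=(x1, y1))

# Print the result: r[0] holds the fitted parameters; r[1] is an integer status flag
print(r[0])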

After running, the fitted result is:

[2. 1.]
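As a sanity check on this result, the sketch below unpacks both values that `leastsq` returns by default (the parameters and an integer status flag, where 1 to 4 indicate that a solution was found) and compares them with `np.polyfit`, which solves the same straight-line fit. Using `np.polyfit` as a cross-check is a choice made here, not part of the original post.

```python
import numpy as np
from scipy.optimize import leastsq

def residuals(p, x, y):
    """Residual of the line a*x + b against the observed y."""
    a, b = p
    return a * x + b - y

# Same data as above, generated with a=2, b=1
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y1 = np.array([3, 5, 7, 9, 11, 13], dtype=float)

# Without full_output, leastsq returns (parameters, status flag).
params, ier = leastsq(residuals, [1, 1], args=(x1, y1))

# Cross-check: a degree-1 polynomial fit returns [slope, intercept].
slope, intercept = np.polyfit(x1, y1, 1)

print(params, ier)
print(slope, intercept)
```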

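In actual use, however, the function I needed to fit was not this simple. One difficulty was that it was a piecewise function: it has to check the value of the independent variable and apply a different expression accordingly. For example, take the piecewise function y = ax + b when x > 3 and y = ax - b when x <= 3. In Python: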

def fun(x, p):
    a, b = p
    if (x > 3):
        return a*x + b
    else:
        return a*x - b

If we still fit with the original residual function, we get this error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

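The reason is simple: fun can now only evaluate a single value, so passing it an array raises an error, because `x > 3` on an array yields an array of booleans that `if` cannot reduce to one truth value. What to do? I was stuck too, so I asked for help on the scipy mailing list, where people were very helpful and quickly pointed out the problem. I had misunderstood the residual function: the residual function passed to leastsq must return an array. So we can modify the residual function like this: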

def residuals(p, x, y):
    # Evaluate fun one element at a time and collect the results in an array
    temp = np.zeros(len(x), dtype=float)
    for i in range(0, len(x)):
        temp[i] = fun(x[i], p)
    return temp - y
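An alternative to looping is to make the piecewise function itself work on arrays with `np.where`, so the original one-line residual function can be reused. This is a sketch of that option, not the approach suggested on the mailing list; the test data (generated with a=2, b=1) and the initial guess are made up for the example.

```python
import numpy as np
from scipy.optimize import leastsq

def fun(x, p):
    """Piecewise model: a*x + b where x > 3, a*x - b elsewhere (works on arrays)."""
    a, b = p
    return np.where(x > 3, a * x + b, a * x - b)

def residuals(p, x, y):
    # fun already returns an array, so no explicit loop is needed.
    return fun(x, p) - y

# Hypothetical data generated from the model with a=2, b=1
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = fun(x, (2.0, 1.0))

p_fit, ier = leastsq(residuals, [1.0, 0.5], args=(x, y))
print(p_fit)
```

Note that `np.where` evaluates both branches for every element, which is harmless here but matters when one branch is expensive or undefined for some inputs.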

import numpy as np  # conventional alias
import scipy as sp  # conventional alias
from scipy.optimize import leastsq  # the least-squares function used here
import pylab as pl

m = 9  # number of polynomial coefficients (the polynomial has degree m - 1)

def real_func(x):
    return np.sin(2*np.pi*x)  # sin(2 pi x)

def fake_func(p, x):
    f = np.poly1d(p)  # polynomial with coefficients p
    return f(x)

# Residual function
def residuals(p, y, x):
    return y - fake_func(p, x)

# 9 evenly spaced points on [0, 1], used as x
x = np.linspace(0, 1, 9)
# Many closely spaced points, needed to draw "continuous" curves
x_show = np.linspace(0, 1, 1000)

y0 = real_func(x)
# y with normally distributed noise added
y1 = [np.random.normal(0, 0.1) + y for y in y0]

# Start from a random set of polynomial coefficients
p0 = np.random.randn(m)

plsq = leastsq(residuals, p0, args=(y1, x))

print('Fitting Parameters:', plsq[0])  # print the fitted parameters

pl.plot(x_show, real_func(x_show), label='real')
pl.plot(x_show, fake_func(plsq[0], x_show), label='fitted curve')
pl.plot(x, y1, 'bo', label='with noise')
pl.legend()
pl.show()
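Because a polynomial is linear in its coefficients, the result of `leastsq` on this kind of problem can be cross-checked against `np.polyfit`, which solves the same least-squares problem directly. The sketch below does so under assumptions that differ from the script above and are chosen here only to keep the check well conditioned and repeatable: a fixed seed, 20 sample points, a cubic (4 coefficients), and an all-ones initial guess.

```python
import numpy as np
from scipy.optimize import leastsq

np.random.seed(0)  # fixed seed, an assumption made for repeatability

def real_func(x):
    return np.sin(2 * np.pi * x)

def residuals(p, y, x):
    # np.polyval evaluates the polynomial with coefficients p (highest power first).
    return y - np.polyval(p, x)

n_coeffs = 4  # cubic; chosen here, the script above uses 9 coefficients
x = np.linspace(0, 1, 20)
y_noisy = real_func(x) + np.random.normal(0, 0.1, size=x.shape)

# Iterative fit with leastsq, starting from an arbitrary initial guess.
p_lsq, ier = leastsq(residuals, np.ones(n_coeffs), args=(y_noisy, x))

# Direct linear least-squares fit of the same cubic.
p_ref = np.polyfit(x, y_noisy, n_coeffs - 1)

# Compare the two fitted curves at the sample points.
max_diff = np.max(np.abs(np.polyval(p_lsq, x) - np.polyval(p_ref, x)))
print(p_lsq)
print(p_ref)
print(max_diff)
```

With as many coefficients as data points, as in the script above, a converged fit passes through every noisy point, so using more points than coefficients is what lets the fit average out some of the noise.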

Original article: http://blog.csdn.net/liangzuojiayi/article/details/51247489
