Neighbor-Based Outlier Detection

Posted: 2015-10-01 11:41:23


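Today I'll write up a neighbor-based outlier detection method. Because of the curse of dimensionality, use it with caution on high-dimensional data. It is also unsuitable for data that contains several clusters whose densities differ widely, because a single global radius cannot fit dense and sparse clusters at the same time.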

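The idea is as follows:

1) For each sample, count the neighbors that lie within a given radius of it.

2) If a sample has fewer neighbors than the specified minimum minPts, it is an outlier.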
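
The two steps above can be sketched with plain NumPy before bringing in sklearn. This is only a brute-force illustration: the function name count_neighbors_bruteforce and the toy data are made up for this example, and the O(n^2) distance matrix is only reasonable for small inputs.

```python
import numpy as np


def count_neighbors_bruteforce(X, radius):
    """Count, for each row of X, the other rows within `radius` (Euclidean)."""
    X = np.asarray(X, dtype=float)
    # Pairwise differences via broadcasting: shape (n, n, d).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Every point is at distance 0 from itself, so subtract 1 to exclude it.
    return (dist <= radius).sum(axis=1) - 1


# Hypothetical toy data: four points close together and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

counts = count_neighbors_bruteforce(X, radius=0.5)  # step 1
is_outlier = counts < 3                             # step 2, with minPts = 3
print(counts, is_outlier)
```

Only the isolated point falls below minPts here; the four clustered points each see the other three.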

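Generally speaking, outliers are few in number and their values differ markedly from the normal ones; otherwise we are probably looking at a separate cluster rather than outliers. That is why I set the default minPts to 3. What about radius? sklearn.cluster has a function, estimate_bandwidth, that estimates the bandwidth for MeanShift. For each point, it finds the largest distance to that point's nearest neighbors, where the number of neighbors is the fraction quantile of all samples, and then averages these maximum distances over all points: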

    nbrs = NearestNeighbors(n_neighbors=int(X.shape[0] * quantile))
    nbrs.fit(X)

    bandwidth = 0.
    for batch in gen_batches(len(X), 500):
        d, _ = nbrs.kneighbors(X[batch, :], return_distance=True)
        bandwidth += np.max(d, axis=1).sum()

    return bandwidth / X.shape[0]

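The default quantile of estimate_bandwidth is 0.3, so the result is the average, over all points, of each point's maximum distance to its nearest 30% of samples. With such a large share of neighbors, and with the maximum distance taken, it should be a workable estimate for our radius. Users can of course supply their own radius.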
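
As a rough illustration of how quantile affects the estimate, the sketch below calls estimate_bandwidth on made-up uniform random data. The data and variable names are assumptions for this example only; the point is that a smaller quantile looks at fewer neighbors per point and therefore gives a smaller radius.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Hypothetical data: 200 points drawn uniformly from the unit square.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)

# Default quantile=0.3: average distance to the farthest of the nearest 30%.
radius_default = estimate_bandwidth(X)
# A smaller quantile considers fewer neighbors, so the estimate shrinks.
radius_small = estimate_bandwidth(X, quantile=0.1)

print(radius_default, radius_small)
```

If the default radius turns out too permissive for a given data set, lowering quantile is one knob to try before hand-picking a value.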

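Next we compute the neighbors within radius, using sklearn. Because sklearn's radius_neighbors includes the query point itself when it is queried with the training data, we subtract 1 (the point itself) when counting neighbors.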
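
The self-inclusion behaviour can be checked on a tiny made-up 1-D data set. The sketch also shows an alternative that, as far as I know from the sklearn documentation, avoids the subtraction: when radius_neighbors is called without X, it returns the neighbors of the indexed points and does not count a point as its own neighbor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 1-D data: three nearby points and one isolated point.
X = np.array([[0.0], [0.2], [0.4], [5.0]])
nn = NearestNeighbors().fit(X)

# Querying with the training data: each point finds itself at distance 0.
with_self = nn.radius_neighbors(X, radius=0.3, return_distance=False)
# Querying without X: neighbors of the indexed points, self excluded.
without_self = nn.radius_neighbors(radius=0.3, return_distance=False)

counts_with_self = [len(i) for i in with_self]
counts_without_self = [len(i) for i in without_self]
print(counts_with_self, counts_without_self)
```

Either approach gives the same neighbor counts once the point itself is discounted.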

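import numpy as np
from sklearn.neighbors import NearestNeighbors


def calNeighborsByRadius(X, radius):
    # Index the samples so that radius queries can be answered.
    nhbr = NearestNeighbors()
    nhbr.fit(X)

    # For each sample, the indices of all samples within radius (itself included).
    indices = nhbr.radius_neighbors(X, radius=radius, return_distance=False)
    # Subtract 1 so that a sample does not count itself as a neighbor.
    neighborCounts = np.array([len(i) - 1 for i in indices])

    return neighborCounts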

Then we decide whether each sample is an outlier from minPts and the neighbor counts.

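from sklearn.cluster import estimate_bandwidth

def detectOutliers(X, radius=None, minPts=3):
    if radius is None:
        # No radius given: fall back to the estimate_bandwidth heuristic described above.
        radius = estimate_bandwidth(X)
    neighborCounts = calNeighborsByRadius(X, radius)
    # 1 = outlier, 0 = normal; the builtin int replaces the removed np.int alias.
    labels = np.array(neighborCounts < minPts, int)
    return labels, neighborCounts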
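
To see the whole pipeline run end to end, here is a self-contained sketch on synthetic data. The data set (one tight Gaussian cluster plus two far-away points) is invented for this example and is not the data from the previous post; the steps are inlined rather than calling the functions above so that the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: a tight cluster of 100 points plus two distant points.
rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=0.05, size=(100, 2))
far_points = np.array([[3.0, 3.0], [-3.0, 2.5]])
X = np.vstack([inliers, far_points])

# Estimate the radius, then count the neighbors within it (minus the point itself).
radius = estimate_bandwidth(X)
nn = NearestNeighbors().fit(X)
indices = nn.radius_neighbors(X, radius=radius, return_distance=False)
neighbor_counts = np.array([len(idx) - 1 for idx in indices])

# Flag samples with fewer than minPts = 3 neighbors.
labels = (neighbor_counts < 3).astype(int)
print("radius:", radius)
print("flagged indices:", np.flatnonzero(labels))
```

The two appended points (indices 100 and 101) have no neighbors within the estimated radius, so they are flagged.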

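Finally, let's see how the method performs on the example from the previous post. Green marks the outliers, and all of them are caught. The data here was min-max normalized. In my experience, for outlier detection, min-max normalization reflects outliers more faithfully than Z-score standardization.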
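
For reference, the two scalings mentioned above can be sketched as follows. The helper names min_max_scale and z_score_scale and the toy matrix are assumptions made for this example; the post does not show which implementation was actually used.

```python
import numpy as np


def min_max_scale(X):
    """Min-max normalization: map each column linearly onto [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero on constant columns
    return (X - col_min) / col_range


def z_score_scale(X):
    """Z-score standardization: zero mean and unit variance per column."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return (X - X.mean(axis=0)) / std


# Hypothetical data: the last row has an extreme value in the first column.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [50.0, 40.0]])
X_minmax = min_max_scale(X)
X_z = z_score_scale(X)
print(X_minmax)
print(X_z)
```

Whichever scaling is chosen, the radius has to be estimated on the scaled data, because the distances change with the scaling.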

[Figure: detection result on the example data, with outliers shown in green]

The code is written for ease of illustration and is not optimized for runtime efficiency.


Original article: http://www.cnblogs.com/zhuyubei/p/4850786.html
