Sparse Coding

时间：2014-09-19 18:57:05 阅读：272 评论：0 收藏：0 [点我收藏+]

Sparse Coding

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. —— 过完备的基，无监督 The aim of sparse coding is to find a set of basis vectors $bubuko.com,布布扣$ such that we can represent an input vector $bubuko.com,布布扣$ as a linear combination of these basis vectors:

$bubuko.com,布布扣$

While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an
over-complete
set of basis vectors to represent input vectors
$bubuko.com,布布扣$
(i.e. such that
k
>
n
).
The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data.
However, with an over-complete basis,
the coefficients a_i are no longer uniquely determined
by the input vector
$bubuko.com,布布扣$
. Therefore, in sparse coding, we introduce the additional criterion of sparsity t
o resolve the degeneracy introduced by over-completeness.

过完备的基能发掘出数据的内在结构和模型，但是其系数表示将不唯一

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients a_i be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. 原子图像的叠加 Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.

We define the sparse coding cost function on a set of m input vectors as

$bubuko.com,布布扣$

where S(.) is a sparsity cost function which penalizes a_i for being far from zero.

稀疏惩罚项

We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $bubuko.com,布布扣$ and the second term as a sparsity penalty which forces our representation of $bubuko.com,布布扣$ to be sparse. The constant λ is a scaling constant to determine the relative importance of these two contributions.

Although the most direct measure of sparsity is the "L₀" norm ( $bubuko.com,布布扣$ ), it is non-differentiable （不可微） and difficult to optimize in general. In practice, common choices for the sparsity cost S(.) are the L₁ penalty $bubuko.com,布布扣$ and the log penalty $bubuko.com,布布扣$ .

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down a_i and scaling $bubuko.com,布布扣$ up by some large constant. To prevent this from happening, we will constrain $bubuko.com,布布扣$ to be less than some constant C. The full sparse coding cost function including our constraint on $bubuko.com,布布扣$ is

$bubuko.com,布布扣$

Probabilistic Interpretation

到目前为止，我们所考虑的稀疏编码，是为了寻找到一个稀疏的、超完备基向量集，来覆盖我们的输入数据空间。现在换一种方式，我们可以从概率的角度出发，将稀疏编码算法当作一种“生成模型”。

我们将自然图像建模问题看成是一种线性叠加，叠加元素包括 k 个独立的源特征 $bubuko.com,布布扣$ 以及加性噪声 ν ：

$bubuko.com,布布扣$

我们的目标是找到一组特征基向量 $bubuko.com,布布扣$ ，它使得图像的分布函数 $bubuko.com,布布扣$ 尽可能地近似于输入数据的经验分布函数 $bubuko.com,布布扣$ 。一种实现方式是，最小化 $bubuko.com,布布扣$ 与 $bubuko.com,布布扣$ 之间的 KL 散度，此 KL 散度表示如下：

$bubuko.com,布布扣$

因为无论我们如何选择 $bubuko.com,布布扣$ ，经验分布函数 $bubuko.com,布布扣$ 都是常量，也就是说我们只需要最大化对数似然函数 $bubuko.com,布布扣$ 。假设 ν 是具有方差σ² 的高斯白噪音，则有下式：

$bubuko.com,布布扣$

为了确定分布 $bubuko.com,布布扣$ ，我们需要指定先验分布 $bubuko.com,布布扣$ 。假定我们的特征变量是独立的，我们就可以将先验概率分解为：

$bubuko.com,布布扣$

此时，我们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。因此，我们希望 a_i 的概率分布在零值附近是凸起的，而且峰值很高。一个方便的参数化先验分布就是：

$bubuko.com,布布扣$

这里 S(a_i) 是决定先验分布的形状的函数。

当定义了 $bubuko.com,布布扣$ 和 $bubuko.com,布布扣$ 后，我们就可以写出在由 $bubuko.com,布布扣$ 定义的模型之下的数据 $bubuko.com,布布扣$ 的概率分布：

$bubuko.com,布布扣$

那么，我们的问题就简化为寻找：

$bubuko.com,布布扣$

这里 < . > 表示的是输入数据的期望值。

不幸的是，通过对 $bubuko.com,布布扣$ 的积分计算 $bubuko.com,布布扣$ 通常是难以实现的。虽然如此，我们注意到如果 $bubuko.com,布布扣$ 的分布（对于相应的 $bubuko.com,布布扣$ ）足够陡峭的话，我们就可以用 $bubuko.com,布布扣$ 的最大值来估算以上积分。估算方法如下：

$bubuko.com,布布扣$

跟之前一样，我们可以通过减小 a_i 或增大 $bubuko.com,布布扣$ 来增加概率的估算值（因为 P(a_i) 在零值附近陡升）。因此我们要对特征向量 $bubuko.com,布布扣$ 加一个限制以防止这种情况发生。

最后，我们可以定义一种线性生成模型的能量函数，从而将原先的代价函数重新表述为：

$bubuko.com,布布扣$

其中 λ = 2σ²β ，并且关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数，我们就可以将原先的优化问题重新表述为：

$bubuko.com,布布扣$

使用概率理论来分析，我们可以发现，选择 L₁ 惩罚和 $bubuko.com,布布扣$ 惩罚作为函数 S(.) ，分别对应于使用了拉普拉斯概率 $bubuko.com,布布扣$ 和柯西先验概率 $bubuko.com,布布扣$ 。

Learning

使用稀疏编码算法学习基向量集的方法，是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 $bubuko.com,布布扣$ 来优化系数 a_i ，第二个是一次性处理多个样本对基向量 $bubuko.com,布布扣$ 进行优化。

如果使用 L₁ 范式作为稀疏惩罚函数，对 $bubuko.com,布布扣$ 的学习过程就简化为求解由 L₁ 范式正则化的最小二乘法问题，这个问题函数在域 $bubuko.com,布布扣$ 内为凸，已经有很多技术方法来解决这个问题（诸如CVX之类的凸优化软件可以用来解决L1正则化的最小二乘法问题）。如果 S(.) 是可微的，比如是对数惩罚函数，则可以采用基于梯度算法的方法，如共轭梯度法。

用 L₂ 范式约束来学习基向量，同样可以简化为一个带有二次约束的最小二乘问题，其问题函数在域 $bubuko.com,布布扣$ 内也为凸。标准的凸优化软件（如CVX）或其它迭代方法就可以用来求解 $bubuko.com,布布扣$ ，虽然已经有了更有效的方法，比如求解拉格朗日对偶函数（Lagrange dual）。

根据前面的的描述，稀疏编码是有一个明显的局限性的，这就是即使已经学习得到一组基向量，如果为了对新的数据样本进行“编码”，我们必须再次执行优化过程来得到所需的系数。这个显著的“实时”消耗意味着，即使是在测试中,实现稀疏编码也需要高昂的计算成本，尤其是与典型的前馈结构算法相比。

Sparse Coding

标签：des http io os 使用 ar strong for 数据

原文地址：http://www.cnblogs.com/sprint1989/p/3981918.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行