Manifold learning-based methods for analyzing single-cell RNA-sequencing data

Posted: 2018-06-18 16:02:09


https://doi.org/10.1016/j.coisb.2017.12.008

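Published by Yale University in December 2017: an improvement to dimensionality reduction and denoising for single-cell data, based on manifold learning from machine learning.

Manifold learning:

Assume the data are sampled uniformly from a low-dimensional manifold embedded in a high-dimensional Euclidean space. Manifold learning recovers that low-dimensional structure from the high-dimensional samples: it finds the low-dimensional manifold inside the high-dimensional space and computes the corresponding embedding map, in order to reduce dimensionality or visualize the data. In other words, it looks past the observed phenomena for the underlying rule that generated the data.

Common manifold learning (MFL) methods include PCA, MDS, and diffusion maps. The figure below briefly summarizes the strengths and weaknesses of the different methods.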

[Figure: strengths and weaknesses of the different methods]
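To make the difference between a linear method and a diffusion map concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration and not taken from the paper: the toy data (points along a two-turn helix, i.e. a 1-D manifold in 3-D), the fixed Gaussian bandwidth `sigma`, and the helper name `diffusion_components`. It compares how well the first principal component and the first non-trivial diffusion component track the position `t` along the curve.

```python
import numpy as np

def diffusion_components(X, sigma, n_components=2):
    """Leading non-trivial diffusion components of X (rows = cells)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian affinities
    deg = K.sum(axis=1)
    S = K / np.sqrt(np.outer(deg, deg))       # symmetric form of P = D^-1 K
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # largest first
    psi = vecs / np.sqrt(deg)[:, None]        # right eigenvectors of P
    # Component 0 is the trivial constant one (eigenvalue 1), so skip it.
    return vals[1:n_components + 1], psi[:, 1:n_components + 1]

# Toy data (assumption): 300 points along a two-turn helix in 3-D.
n = 300
t = np.linspace(0.0, 1.0, n)                  # true position along the curve
X = np.column_stack([np.cos(4 * np.pi * t), np.sin(4 * np.pi * t), 2.0 * t])

# PCA via SVD of the centered data: PC1 is a linear projection.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# Diffusion map with a bandwidth well below the spacing between helix turns.
vals, psi = diffusion_components(X, sigma=0.15)

# Eigenvector signs are arbitrary, so compare absolute correlations.
corr_pca = np.corrcoef(pc1, t)[0, 1]
corr_dm = np.corrcoef(psi[:, 0], t)[0, 1]
print(f"|corr(PC1, t)|                 = {abs(corr_pca):.2f}")
print(f"|corr(diffusion component, t)| = {abs(corr_dm):.2f}")
```

The bandwidth matters here: if `sigma` were comparable to the distance between neighboring turns, the kernel would connect points across turns and the diffusion component would no longer follow the curve.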

Key terms in the paper:

MFL: "Manifold models can also be useful for analyzing data generated from disparate dynamics or profiles as the data can be modeled with several disconnected manifolds."

DPT (diffusion pseudotime): "a pseudotime trajectory through the data to describe a latent axis of development or cell state transition."

DPT method: "to find a major axis of variability in the data, DPT defines a distance from a source cell to all other cells over a modified transition operator that includes only non-trivial diffusion components. This produces trajectories of nonlinear variation across a dataset."

The paper's approach uses MFL in the second step of the scRNA-seq analysis:
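The DPT description above can be sketched in a few lines of NumPy. This is only a simplified illustration, not the reference implementation: the toy expression matrix, the fixed Gaussian bandwidth `sigma`, and the function name `dpt_from_source` are all assumptions of mine. The idea is to build a symmetric transition matrix, drop the trivial component (eigenvalue 1), sum the remaining operator over all walk lengths in closed form (weights `lam / (1 - lam)`), and take the distance between the source cell's row and every other row. The closed form assumes a connected graph, so that only one eigenvalue equals 1.

```python
import numpy as np

def dpt_from_source(X, source, sigma):
    """DPT-style distance from one source cell to all cells (rows of X)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian affinities
    q = K.sum(axis=1)
    W = K / np.outer(q, q)                    # density normalization
    z = W.sum(axis=1)
    T = W / np.sqrt(np.outer(z, z))           # symmetric transition matrix
    lam, V = np.linalg.eigh(T)                # eigenvalues in ascending order
    lam, V = lam[:-1], V[:, :-1]              # drop the trivial component
    w = lam / (1.0 - lam)                     # sum over walk lengths t >= 1
    diff = V - V[source]                      # row differences in eigenbasis
    return np.sqrt(np.sum((w * diff) ** 2, axis=1))

# Toy data (assumption): 300 cells ordered along one trajectory, 20 "genes"
# that each peak at a different point of the true time t, plus small noise.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 20
t = np.linspace(0.0, 1.0, n_cells)
peaks = np.linspace(0.0, 1.0, n_genes)
X = np.exp(-(t[:, None] - peaks[None, :]) ** 2 / (2.0 * 0.2 ** 2))
X += rng.normal(scale=0.01, size=X.shape)

# Cell 0 (earliest true time) is used as the source cell.
pseudotime = dpt_from_source(X, source=0, sigma=0.3)
corr = np.corrcoef(pseudotime, t)[0, 1]
print(f"corr(pseudotime, true time) = {corr:.2f}")
```

Note that nothing in the function assumes a tree or a bifurcation; the ordering comes only from distances in the diffusion space.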

1. gene selection
2. manifold learning
3. cell organization
4. dimensionality reduction and visualization
5. density estimation and clustering

The first three steps together are what the paper calls pseudotime methods.
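The three pseudotime stages can be sketched end to end on toy data. This is a generic illustration under my own assumptions, not any of the specific methods compared in the paper: gene selection is done by plain variance ranking, the manifold learning step is classical MDS (one of the methods listed earlier), and cell organization is a simple sort along the first MDS coordinate, oriented from a chosen root cell. The simulated data, the `root` choice, and the helper `classical_mds` are hypothetical.

```python
import numpy as np

# Toy data (assumption): 200 cells, 10 genes that switch on at different
# points of the true time t, and 40 flat genes carrying only noise.
rng = np.random.default_rng(1)
n_cells, n_info, n_noise = 200, 10, 40
t = np.linspace(0.0, 1.0, n_cells)
centers = np.linspace(0.2, 0.8, n_info)
informative = 1.0 / (1.0 + np.exp(-(t[:, None] - centers[None, :]) / 0.1))
flat = np.full((n_cells, n_noise), 0.5)
X = np.hstack([informative, flat])
X = X + rng.normal(scale=0.05, size=X.shape)

# Stage 1, gene selection: keep the most variable genes.
variances = X.var(axis=0)
selected = np.argsort(variances)[::-1][:n_info]
Xs = X[:, selected]

# Stage 2, manifold learning: classical MDS on Euclidean distances.
def classical_mds(X, n_components=2):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ d2 @ J                            # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

Y = classical_mds(Xs)

# Stage 3, cell organization: order cells along the first coordinate,
# flipping its sign if needed so that the root cell comes early.
root = 0
coord = Y[:, 0].copy()
if coord[root] > np.median(coord):
    coord = -coord
order = np.argsort(coord)

print("selected genes:", sorted(selected.tolist()))
print(f"corr(coordinate, true time) = {np.corrcoef(coord, t)[0, 1]:.2f}")
```

Sorting along a single linear coordinate only works here because the toy trajectory is unbranched and nearly straight in the embedding; that kind of shape assumption is exactly what the paper flags as a limitation of existing methods.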

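The figure below lays out the paper's analysis workflow clearly, and it is a gorgeous figure too. I think I need a lot more practice before I can make figures like this; picking apart the reasoning is more my thing, haha: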

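[Figure: overview of the paper's analysis workflow]

Each plot has its own subtitle, which is enough to understand what the authors are doing at that step.

Of these, the algorithmic core of the paper is shown in the figure below: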

[Figure: comparison of pseudotime methods]

Comparison of pseudotime methods. Pseudotime methods (four kinds of methods) may generally be broken down into three stages: gene selection, manifold learning, and cell organization.

From this comparison the authors point out a limitation of the existing methods:

A current limitation of these methods is their reliance to varying degrees on assumptions about the underlying shape of the data (e.g. a tree, bifurcating trajectory, etc.). (My note: the geometry assumed for the underlying shape of the data has a large effect on the downstream cell typing.)

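DPT, the last of the methods compared, is described as providing two significant advantages over other pseudotemporal techniques. First, working directly on a diffusion map does not require any greedy computational steps (my note: in classic algorithms such as hierarchical clustering every step is greedy, so the result is locally optimal rather than globally optimal for the tree). Second and most importantly, because DPT operates directly on the diffusion space, it features the least coarse graining or over-fitting of data into low-dimensional assumptions (my note: DPT works on the whole diffusion space rather than on a bifurcating or tree-shaped structure, so it forces the data into a low-dimensional form with the least coarse-graining and over-fitting).

The validation at the end of the paper:

[Figure: validation results]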