上一小节,我们对p稳定分布LSH的基本原理进行了介绍(http://blog.csdn.net/jasonding1354/article/details/38237353),在接下来的博文中,我将以E2LSH开源代码为基础,对E2LSH的源码进行注解学习,从而为掌握LSH的基本原理以及未来对相似性搜索的扩展学习打下基础。
1、代码概况
E2LSH的核心代码可以分为3部分:
其他代码说明:
2、主要的数据结构
(1)RNearNeighborStructureT(LocalitySensitiveHashing.h中定义)——R near neighbor数据结构。该结构包含构建数据结构的参数、哈希函数族gi的描述、结构中数据点的索引和用于存储哈希桶的L个哈希表的指针。
typedef struct _RNearNeighborStructT {
IntT dimension; // dimension of points.
IntT parameterK; // parameter K of the algorithm.
IntT parameterL; // parameter L of the algorithm.
RealT parameterW; // parameter W of the algorithm.
IntT parameterT; // parameter T of the algorithm.
RealT parameterR; // parameter R of the algorithm.
RealT parameterR2; // = parameterR^2
// Whether to use <u> hash functions instead of usual <g>
// functions. When this flag is set to TRUE, <u> functions are
// generated (which are roughly k/2-tuples of LSH), and a <g>
// function is a pair of 2 different <u> functions.
BooleanT useUfunctions;
// the number of tuples of hash functions used (= # of rows of
// <lshFunctions>). When useUfunctions == FALSE, this field is equal
// to parameterL, otherwise, to <m>, the number of <u> hash
// functions (in this case, parameterL = m*(m-1)/2 = nHFTuples*(nHFTuples-1)/2
IntT nHFTuples;
// How many LSH functions each of the tuple has (it is <k> when
// useUfunctions == FALSE, and <k/2> when useUfunctions == TRUE).
IntT hfTuplesLength;
// number of points in the data set
Int32T nPoints;
// The array of pointers to the points that are contained in the
// structure. Some types of this structure (of UHashStructureT,
// actually) use indeces in this array to refer to points (as
// opposed to using pointers).
PPointT *points;
// The size of the array <points>
Int32T pointsArraySize;
// If <reportingResult> == FALSE, no points are reported back in a
// <get*> function. In particular any point that is found in the
// bucket is considered to be outside the R-ball of the query point
// (the distance is still computed). If <reportingResult> == TRUE,
// then the structure behaves normally.
BooleanT reportingResult;
// This table stores the LSH functions. There are <nHFTuples> rows
// of <hfTuplesLength> LSH functions.
LSHFunctionT **lshFunctions;
// Precomputed hashes of each of the <nHFTuples> of <u> functions
// (to be used by the bucket hashing module).
Uns32T **precomputedHashesOfULSHs;
// The set of non-empty buckets (which are hashed using
// PUHashStructureT).
PUHashStructureT *hashedBuckets;
// ***
// The following vectors are used only for temporary operations
// within this R-NN structure during a query operation.
// ***
// This vector is used to store the values of hash functions <u>
// (<hfTuplesLength>-tuple of LSH fuctions). One <g> function is a concatenation
// of <nHFTuples> of <u> LSH functions.
Uns32T **pointULSHVectors;
// A vector of length <dimension> to store the reduced point (point
// with coordinates divided by <parameterR>).
RealT *reducedPoint;
// This vector is used for storing marked points in a query
// operation (for computing distances to a point at most once). If
// markedPoints[i]=TRUE then point <i> was examined already.
BooleanT *markedPoints;
// This vector stored the indeces in the vector <markedPoints> of all
// TRUE entries.
Int32T *markedPointsIndeces;
// the size of <markedPoints> and of <markedPointsIndeces>
IntT sizeMarkedPoints;
} RNearNeighborStructT, *PRNearNeighborStructT;
(2)UHashStructureT(BucketHashing.h中定义)——该结构定义了用于映射哈希桶的哈希表。通过链表方式解决冲突的问题。
主要有两种哈希表类型:HT_LINKED_LIST 和HT_HYBRID_CHAINS。
HT_LINKED_LIST对应哈希表的链表版本;HT_HYBRID_CHAINS对应带有混合存储的哈希表。
两种哈希表都具有指向哈希函数h1(·)和h2(·)的指针。
typedef struct _UHashStructureT {
// The type of the hash table (can take values HT_*). when
// <typeHT>=HT_LINKED_LIST, chains&buckets are linked lists. when
// <typeHT>=HT_PACKED, chains&buckets are static arrays. when
// <typeHT>=HT_STATISTICS, chains are static arrays and buckets only
// count # of elements. when <typeHT>=HT_HYBRID_CHAINS, a chain is
// a "hybrid" array that contains both the buckets and the points
// (the an element of the chain array is of type
// <HybridChainEntryT>). all chains are conglamerated in the same
// array <hybridChainsStorage>.
IntT typeHT;
// The array containing the hash slots of the universal hashing.
union _hashTableT {
PGBucketT *llHashTable;
PackedGBucketT **packedHashTable;
LinkPackedGBucketT **linkHashTable;
PHybridChainEntryT *hybridHashTable;
} hashTable;
// The sizes of each of the chains of the hashtable (used only when
// typeHT=HT_PACKED or HT_STATISTICS.
IntT *chainSizes;
union _bucketPoints{
PPointT *pointsArray;
PointsListEntryT *pointsList;
} bucketPoints;
HybridChainEntryT *hybridChainsStorage;
// The size of hashTable.
Int32T hashTableSize;
// Number of elements(buckets) stored in the hash table in total (that
// is the number of non-empty buckets).
Int32T nHashedBuckets;
Int32T nHashedPoints;
// Unused (but allocated) instances of the corresponding
// structs. May be reused if needed (instead of allocated new
// memory).
PGBucketT unusedPGBuckets;
PBucketEntryT unusedPBucketEntrys;
Uns32T prime; // the prime used for the universal hash functions.
IntT hashedDataLength;// the number of IntT's in an element from U (U is the set of values to hash).
// The hash functions used for the universal hashing.
// The main hash function (that defines the index
// of the slot in the table).
// The type of the hash function is: h_{a}(k) = ((a\cdot k)mod p)mod hashTableSize.
Uns32T *mainHashA;
// Control hash functions: used to compute/check the <controlValue>s
// of <GBucket>s.
// The type of the hash function is: h_{a}(k) = (a\cdot k)mod p
Uns32T *controlHash1;
} UHashStructureT, *PUHashStructureT;
(3)RNNParametersT(LocalitySensitiveHashing.h中定义)——包含构建RNearNeighborStructureT数据结构的必要参数的结构体。
typedef struct _RNNParametersT {
RealT parameterR; // parameter R of the algorithm.
RealT successProbability; // the success probability 1-\delta
IntT dimension; // dimension of points.
RealT parameterR2; // = parameterR^2
// Whether to use <u> hash functions instead of usual <g>
// functions. When this flag is set to TRUE, <u> functions are
// generated (which are roughly k/2-tuples of LSH), and a <g>
// function is a pair of 2 different <u> functions.
BooleanT useUfunctions;
IntT parameterK; // parameter K of the algorithm.
// parameter M (# of independent tuples of LSH functions)
// if useUfunctions==TRUE, parameterL = parameterM * (parameterM - 1) / 2
// if useUfunctions==FALSE, parameterL = parameterM
IntT parameterM;
IntT parameterL; // parameter L of the algorithm.
RealT parameterW; // parameter W of the algorithm.
IntT parameterT; // parameter T of the algorithm.
// The type of the hash table used for storing the buckets (of the
// same <g> function).
IntT typeHT;
} RNNParametersT, *PRNNParametersT;
(4)PPoint(Geometry.h中定义)——用于存储数据点的结构体。该结构包含数据的坐标(coordinates),数据点范数的平方和该数据点在数据集P的下标索引。
typedef struct _PointT {
IntT index; // the index of this point in the dataset list of points
RealT *coordinates;
RealT sqrLength; // the square of the length of the vector
} PointT, *PPointT;转载请注明作者及文章出处:http://blog.csdn.net/jasonding1354/article/details/38331229
E2LSH源代码下载地址:http://download.csdn.net/detail/jasonding1354/7704277
参考资料:
1、M.Datar,N.Immorlica,P.Indyk,and V.Mirrokni,“Locality-SensitiveHashing Scheme Based on p-Stable Distributions,”Proc.Symp. ComputationalGeometry, 2004.
2、A.Andoni,P.Indyk.E2lsh:Exact Euclidean locality-sensitive hashing.http://web.mit.edu/andoni/www/LSH/.2004.
【E2LSH源码分析】E2LSH源码综述及主要数据结构,布布扣,bubuko.com
原文地址:http://blog.csdn.net/jasonding1354/article/details/38331229