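I have long had the notion that hashing allows fast data access, but I never dug into how it is actually implemented. Recently, while learning about custom hashCode methods, I saw a book raise efficiency concerns, so I decided to look at how HashMap locates entries by hash (HashSet is also implemented internally on top of HashMap).
HashMap stores its data in the Node array table (Node[]). The most basic Node is a node of a singly linked list: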
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // getKey(), getValue(), setValue(), equals() and hashCode() omitted
}
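HashMap stores and retrieves data by converting the key's hash value into an index into the table array. Entries that map to the same index are chained onto a singly linked list.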
// n is the size of the Node array table
// hash is the hash value computed from the key (note: not the raw return value of Object.hashCode)
table[(n - 1) & hash]
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
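To make the lookup path concrete, here is a minimal sketch of a chained table that uses the same spreading function and the same `(n - 1) & hash` index. `TinyChainedMap`, `spread` and `indexFor` are names I made up for illustration. This is not the JDK's HashMap: it never resizes, never converts long bins to trees, and it prepends new nodes to the chain for brevity.

```java
import java.util.Objects;

// Hypothetical, simplified sketch; NOT the JDK implementation.
final class TinyChainedMap<K, V> {
    // Singly linked bucket node with the same fields as HashMap.Node.
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] table;

    @SuppressWarnings("unchecked")
    TinyChainedMap(int capacity) {
        // The mask trick below only works when the length is a power of two.
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        table = (Node<K, V>[]) new Node[capacity];
    }

    // Same formula as HashMap.hash(Object): XOR the high 16 bits into the low 16.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // table[(n - 1) & hash]
    int indexFor(int hash) {
        return (table.length - 1) & hash;
    }

    V put(K key, V value) {
        int hash = spread(key);
        int i = indexFor(hash);
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value; // key already present: overwrite
                return old;
            }
        }
        table[i] = new Node<>(hash, key, value, table[i]); // prepend to the chain
        return null;
    }

    V get(Object key) {
        int hash = spread(key);
        for (Node<K, V> e = table[indexFor(hash)]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

With a capacity of 4, the Integer keys 1 and 5 land in the same bucket, so `get(5)` has to walk the chain and compare keys one by one. That walk is the cost that grows when many keys collide.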
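Here the high bits are XORed into the low bits. This choice balances several concerns, including the distribution of hash values, the bound imposed by the table size, and the speed of bit spreading. The official explanation is as follows: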
Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.
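A small experiment makes the "vary only in bits above the current mask" case visible. This is only a sketch: `SpreadDemo`, `spread` and `index` are hypothetical helper names, and the four hash codes are values I picked so that they differ only above bit 15.

```java
// Hypothetical demo class; the formulas are the ones quoted from HashMap above.
final class SpreadDemo {
    // h ^ (h >>> 16): fold the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // (n - 1) & hash, valid when n is a power of two.
    static int index(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16; // table size, so the mask keeps only the low 4 bits
        int[] hashCodes = {0x10000, 0x20000, 0x30000, 0x40000};
        for (int h : hashCodes) {
            // The raw index is 0 for all four values; the spread indices are 1, 2, 3, 4.
            System.out.printf("hashCode=0x%05x  raw index=%d  spread index=%d%n",
                    h, index(h, n), index(spread(h), n));
        }
    }
}
```

Without spreading, all four keys fall into bucket 0 because the mask throws away everything above bit 3. After the XOR, the high bits influence the low bits and the keys spread over four buckets.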
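If most keys share the same hash value, every lookup has to walk through a long chain of colliding entries, and efficiency drops. So hashCode should be overridden sensibly when needed. The Objects class provides a hash method, which simply delegates to Arrays.hashCode: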
// java.util.Objects
public static int hash(Object... values) {
    return Arrays.hashCode(values);
}
// java.util.Arrays
public static int hashCode(Object a[]) {
    if (a == null)
        return 0;
    int result = 1;
    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());
    return result;
}
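As a sketch of how this might be used when overriding hashCode, here is a made-up value class. `LabeledPoint` and `manualHash` are hypothetical names used only for illustration. `manualHash` unrolls the loop from Arrays.hashCode by hand to show that both routes give the same number.

```java
import java.util.Objects;

// Hypothetical value class for illustration only.
final class LabeledPoint {
    private final int x;
    private final String label;

    LabeledPoint(int x, String label) {
        this.x = x;
        this.label = label;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LabeledPoint)) return false;
        LabeledPoint p = (LabeledPoint) o;
        return x == p.x && Objects.equals(label, p.label);
    }

    // Uses the same fields as equals, so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, label);
    }

    // The same computation written out by hand: result = 31 * result + elementHash.
    int manualHash() {
        int result = 1;
        result = 31 * result + Integer.hashCode(x);
        result = 31 * result + (label == null ? 0 : label.hashCode());
        return result;
    }
}
```

Note that Objects.hash boxes its arguments into an Object[] on every call, so hand-unrolling as in `manualHash` may be worth considering when hashCode sits on a hot path. I have not measured the difference myself.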
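As for why the multiplier is 31, here is what I have found so far in reference [1]:
First, it is an (odd) prime. The more unique hash values are, the better, and according to [1] a product involving a prime has a better chance of being unique than a product of other numbers.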
Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
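Reference [1] says that primes can yield sufficiently unique keys, but that they are not the only option. Going deeper would require studying hash functions themselves: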
However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.
Some of the other material makes me not want to dig any further:
Excerpt [1]
Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions. No one knows why, the last i know and i had this question answered by Chris Torek himself, who is generally credited with coming up with 31 hash, on the C++ or C mailing list a while back.
Excerpt [2]
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to chose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime
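The "shift by 5" remark in the excerpt can be checked directly. The sketch below uses hypothetical names (`MultiplierDemo`, `times31`, `times33`, `polyHash`). It shows that 31 * i equals (i << 5) - i and 33 * i equals (i << 5) + i in int arithmetic, and that a polynomial hash with multiplier 31 reproduces String.hashCode.

```java
// Hypothetical demo class; the identities hold even when int arithmetic overflows.
final class MultiplierDemo {
    // 31 * i == 32 * i - i == (i << 5) - i
    static int times31(int i) {
        return (i << 5) - i;
    }

    // 33 * i == 32 * i + i == (i << 5) + i
    static int times33(int i) {
        return (i << 5) + i;
    }

    // P(m): h = m * h + c for each character, the polynomial form the excerpt refers to.
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int k = 0; k < s.length(); k++) {
            h = multiplier * h + s.charAt(k);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(times31(12345) == 31 * 12345);              // true
        System.out.println(times33(12345) == 33 * 12345);              // true
        System.out.println(polyHash("hello", 31) == "hello".hashCode()); // true
    }
}
```

So in terms of raw cost the two multipliers are on equal footing, which matches the excerpt: the final choice of 31 came down to it being prime and performing well enough, not to speed.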
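Some people even say the hash implementation may change in future versions, which is not something a humble programmer like me needs to worry about. Personally, I think there are only a few things to remember: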
[1] : Why do hash functions use prime numbers?
[2] : Stack Overflow - Why does Java's hashCode() in String use 31 as a multiplier?
[3] : Hash functions.
Original article: https://www.cnblogs.com/wcmk21/p/10035907.html