
The hashCode Method and 31

Date: 2018-11-29 11:01:56


Locating entries by hash code

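I have long had the notion that hashing makes storing and retrieving data fast, but I never dug into how it is actually implemented. Recently, while learning how to write a custom hashCode method, I read a book's discussion of efficiency and decided to look into how HashMap locates entries by hash (HashSet is also implemented on top of HashMap internally).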

HashMap's storage structure

HashMap stores its data in an array of Node objects called table (Node[]). The basic Node is an entry in a singly linked list:

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}

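HashMap stores and retrieves data by converting the hash value of a key into an index into the table array. Entries that map to the same index are chained together in a singly linked list.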
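To make the bucket-and-chain idea concrete, here is a minimal sketch. TinyChainedMap is a hypothetical teaching class, not the JDK code: it assumes a fixed capacity of 16, never resizes, uses the raw hashCode() without any extra bit mixing, and prepends new nodes to the chain for brevity.

```java
import java.util.Objects;

// Hypothetical teaching class; a simplified sketch, not java.util.HashMap.
final class TinyChainedMap<K, V> {
    // Same shape as the Node shown above: one entry in a singly linked list.
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Assumption: fixed power-of-two capacity; a real map would also resize.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Simplification: use the raw hashCode (0 for a null key).
    private static int hashOf(Object key) {
        return (key == null) ? 0 : key.hashCode();
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int h = hashOf(key);
        int i = (table.length - 1) & h; // hash -> bucket index
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.hash == h && Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // Same index but a different key: hang a new node on the chain.
        table[i] = new Node<>(h, key, value, table[i]);
        return null;
    }

    // Walks the chain of the key's bucket; returns null if the key is absent.
    V get(Object key) {
        int h = hashOf(key);
        for (Node<K, V> n = table[(table.length - 1) & h]; n != null; n = n.next) {
            if (n.hash == h && Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

With this sketch, the Integer keys 1 and 17 both land in bucket 1 of the 16-slot table, so a lookup of either key has to walk a two-node chain.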

HashMap's index calculation

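  1. First, the index computed from the hash value can never go out of bounds:
// n is the size of the Node array table (always a power of two, so n - 1 is a bit mask)
// hash is the value computed from the key (note: this is not the raw return value of Object.hashCode)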
table[(n - 1) & hash]
  2. The hash value mentioned above is computed as follows:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
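The bounds claim in step 1 can be checked directly. The sketch below uses a hypothetical helper, indexFor, and assumes n is a power of two. It shows that (n - 1) & hash stays within [0, n) even for negative hash values, and that it agrees with Math.floorMod(hash, n).

```java
// Hypothetical demo class; indexFor mirrors the expression table[(n - 1) & hash].
final class IndexMaskDemo {
    // Assumption: n is a power of two, so n - 1 has all of its low bits set.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int[] samples = {0, 1, 15, 16, 17, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int h : samples) {
            // For a power-of-two n, masking equals the non-negative remainder.
            System.out.println(h + " -> index " + indexFor(h, n)
                    + ", floorMod " + Math.floorMod(h, n));
        }
    }
}
```

The mask only works because the table length is a power of two. For any other length, n - 1 would have gaps in its bit pattern and some slots would never be used.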

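The hash method XORs the high 16 bits of the hash code into the low 16 bits. This choice balances several concerns, including the distribution of hash values, the bound imposed by the table size, and the speed of the bit spreading. The comment in the JDK source explains it as follows: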

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.
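A small experiment illustrates the point about hashes that differ only above the mask. This is a sketch: the class and method names are made up, and the keys are deliberately contrived Integer values of the form i << 16, whose hash codes differ only in their high bits.

```java
// Hypothetical demo class; spread repeats the XOR step from HashMap.hash.
final class SpreadDemo {
    // XOR the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table of length n (n assumed to be a power of two).
    static int index(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        for (int i = 0; i < 4; i++) {
            // Integer.hashCode() returns the int value itself.
            int h = Integer.valueOf(i << 16).hashCode();
            System.out.println("hashCode=0x" + Integer.toHexString(h)
                    + " raw index=" + index(h, n)
                    + " spread index=" + index(spread(h), n));
        }
    }
}
```

Without spreading, every one of these keys masks to index 0 in a 16-slot table. After the XOR, key i << 16 lands at index i for i from 0 to 15.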

hashCode() and "31"

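If most keys share the same hash value, every lookup has to walk a linked list and performance degrades, so hashCode should be overridden sensibly when needed. The Objects class provides a hash method, which delegates to Arrays.hashCode. The two implementations are: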

// java.util.Objects
public static int hash(Object... values) {
    return Arrays.hashCode(values);
}
// java.util.Arrays
public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}
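As a usage sketch, a custom class can override hashCode with Objects.hash. GridPoint and its fields are hypothetical names chosen for illustration. The manualHash method writes out the same 31-based loop by hand, to show what Objects.hash computes for these three fields.

```java
import java.util.Objects;

// Hypothetical value class used only to illustrate overriding hashCode.
final class GridPoint {
    private final int x;
    private final int y;
    private final String label; // may be null

    GridPoint(int x, int y, String label) {
        this.x = x;
        this.y = y;
        this.label = label;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y && Objects.equals(label, p.label);
    }

    // Equal objects must produce equal hash codes, so use the same fields as equals.
    @Override
    public int hashCode() {
        return Objects.hash(x, y, label);
    }

    // The same computation unrolled by hand, following the Arrays.hashCode loop.
    int manualHash() {
        int result = 1;
        result = 31 * result + Integer.hashCode(x);
        result = 31 * result + Integer.hashCode(y);
        result = 31 * result + (label == null ? 0 : label.hashCode());
        return result;
    }
}
```

Objects.hash boxes the int fields and allocates a varargs array on every call, so the hand-written form may be preferable on hot paths. That is a design trade-off rather than a correctness issue.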
As in the Arrays.hashCode loop, the number 31 shows up often in hashing code. Why was 31 chosen?

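Here is what reference [1] says:
First, 31 is an (odd) prime. The more unique hash values are, the better, and according to [1] a prime has a better chance than other numbers of producing unique values.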

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

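Reference [1] says that primes yield sufficiently unique keys, but also that they are not the only option. Going deeper would require studying hashing itself: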

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

Some other material makes one not want to dig any further:

Excerpt [1]

Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions. No one knows why, the last i know and i had this question answered by Chris Torek himself, who is generally credited with coming up with 31 hash, on the C++ or C mailing list a while back.

Excerpt [2]

You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to chose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime

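Some people even say the hash implementation may change in a future version, which is not something a humble programmer like me needs to worry about. Personally, I think it is enough to remember a few things: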

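  1. Primes have been used as multipliers in hash algorithms for a long time
  2. 31 may not be the optimal choice, but it is good enough
  3. Multiplication by 31 can be optimized: 31 * i == (i << 5) - i
  4. 31 is already widespread in the source code of Java's core libraries and third-party libraries, and developers at large have accepted it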
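Point 3 and the role of 31 in String.hashCode can be checked with a short sketch. The class and method names are hypothetical. polyHash follows the formula documented for String.hashCode, s[0]*31^(n-1) + ... + s[n-1], evaluated with int arithmetic that wraps on overflow.

```java
// Hypothetical demo class for the multiplier 31.
final class Multiplier31Demo {
    // Polynomial hash in the style of String.hashCode: h = 31 * h + c per char.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Shift-and-subtract form of multiplying by 31 (32 * i - i).
    static int times31(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("hello") == "hello".hashCode());
        // The identity also holds when the multiplication overflows,
        // because both sides wrap around modulo 2^32.
        System.out.println(times31(Integer.MAX_VALUE) == 31 * Integer.MAX_VALUE);
    }
}
```

Whether a JIT compiler actually emits the shift-and-subtract form depends on the JVM and the hardware. The identity only shows that the substitution is always valid.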

References

[1] : Why do hash functions use prime numbers?

[2] : stackoverflow - why-does-javas-hashcode-in-string-use-31-as-a-multiplier

[3] : Hash functions.


Original article: https://www.cnblogs.com/wcmk21/p/10035907.html
