Tags: java
Note: everything in this article uses the UTF-8 encoder.
Converting the characters of a Java String to byte[]
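import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Build every 4-byte UTF-8 "character" from all high/low surrogate pairs.
List<String> utf8mb4String = new ArrayList<>();
for (char i = Character.MIN_HIGH_SURROGATE; i <= Character.MAX_HIGH_SURROGATE; i++) {
    for (char j = Character.MIN_LOW_SURROGATE; j <= Character.MAX_LOW_SURROGATE; j++) {
        int character = Character.toCodePoint(i, j);
        byte[] abyte0 = new byte[4];
        abyte0[0] = (byte) (240 | character >> 18);
        abyte0[1] = (byte) (128 | character >> 12 & 63);
        abyte0[2] = (byte) (128 | character >> 6 & 63);
        abyte0[3] = (byte) (128 | character & 63);
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        String str = new String(abyte0, StandardCharsets.UTF_8);
        utf8mb4String.add(str);
        System.out.println(str);
    }
}
This covers every 4-byte "character". You can paste the code into your editor and run it. Take any one of the 4-byte "characters" and put it between single quotes, and compilation fails, because a 4-byte "character" consists of two chars rather than one; the details are given below.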
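To make the two-char point concrete, here is a minimal sketch. The class name SurrogatePairDemo and the choice of U+1F600 as the sample code point are illustrative assumptions, not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {
    // U+1F600 is used only as a sample supplementary code point (assumption).
    static final String SAMPLE = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        // Number of char units in the string.
        System.out.println(SAMPLE.length());                                // 2
        // Number of Unicode code points in the string.
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));      // 1
        // Number of bytes after UTF-8 encoding.
        System.out.println(SAMPLE.getBytes(StandardCharsets.UTF_8).length); // 4
        // The two chars are a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(0)));    // true
        System.out.println(Character.isLowSurrogate(SAMPLE.charAt(1)));     // true
    }
}
```

Since the string holds two chars, it cannot fit in a single-quoted char literal.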
/** The value is used for character storage. */
private final char value[];
In the JDK source shown here, a String is stored internally as a char[].
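Summary of the analysis below (# stands for a bit that is 0 or 1):
1) Single-byte characters
Character c is in the range [0x0000, 0x0080)
byte[] rule:
b[0] = 0### #### (always less than 0x80)
2) Two-byte characters
Character c is in the range [0x0080, 0x0800)
byte[] rule:
High 5 bits: b[0] = 110# ####
Low 6 bits: b[1] = 10## ####
3) Three-byte characters
Character c is in the remaining range: [0x0800, 0xFFFF] excluding the surrogates [0xD800, 0xDFFF]
byte[] rule:
High 4 bits: b[0] = 1110 ####
Middle 6 bits: b[1] = 10## ####
Low 6 bits: b[2] = 10## ####
4) Four-byte "characters"
Character c is in the range [0xD800, 0xDFFF] (high surrogate range: [0xD800, 0xDBFF]; low surrogate range: [0xDC00, 0xDFFF]); the pair is first combined into one code point
byte[] rule (applied to the code point):
High 3 bits: b[0] = 1111 0###
Upper-middle 6 bits: b[1] = 10## ####
Lower-middle 6 bits: b[2] = 10## ####
Low 6 bits: b[3] = 10## ####
Note: Java writes the bytes in big-endian order here: the lower index holds the higher-order bits and the higher index holds the lower-order bits.
(b[i] & 0xF8) == 0xF0 tests whether a byte is the lead byte of a 4-byte "character", so it can be used to check whether a string contains one.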
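The lead-byte test can be sketched as a small helper. The class and method names (FourByteCheck, hasFourByteChar) are hypothetical, and the sketch assumes its input is well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class FourByteCheck {
    // Returns true if any byte matches the 4-byte lead pattern 1111 0###.
    // Assumes the array holds well-formed UTF-8.
    static boolean hasFourByteChar(byte[] utf8) {
        for (byte b : utf8) {
            if ((b & 0xF8) == 0xF0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+4E2D encodes to 3 bytes; U+1F600 encodes to 4 bytes.
        String bmpOnly = "abc\u4E2D";
        String withSupplementary = "abc" + new String(Character.toChars(0x1F600));
        System.out.println(hasFourByteChar(bmpOnly.getBytes(StandardCharsets.UTF_8)));           // false
        System.out.println(hasFourByteChar(withSupplementary.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

A check like this is handy before writing text to a store that only accepts UTF-8 sequences of up to 3 bytes.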
I.encoder constructor
public byte[] getBytes(String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    return StringCoding.encode(charsetName, value, 0, value.length);
}
The above is the code of String.getBytes, the method that converts the characters into a byte[].
private StringEncoder(Charset cs, String rcn) {
    this.requestedCharsetName = rcn;
    this.cs = cs;
    this.ce = cs.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    this.isTrusted = (cs.getClass().getClassLoader0() == null);
}
The above is the StringEncoder constructor. Note the calls
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
StringEncoder therefore defaults to CodingErrorAction.REPLACE.
The effect is that when a character cannot be converted to bytes according to the rules, a replacement is written in place of the character that could not be converted.
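The effect of REPLACE can be observed from outside the JDK with a short sketch. The class name ReplaceActionDemo and the sample string are assumptions for illustration. The sketch contrasts String.getBytes with an encoder obtained from Charset.newEncoder(), whose default action is REPORT.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplaceActionDemo {
    // A high surrogate followed by 'b' instead of a low surrogate: malformed input.
    static final String MALFORMED = "a\uD800b";

    // String.getBytes replaces the malformed char instead of failing.
    static byte[] viaGetBytes() {
        return MALFORMED.getBytes(StandardCharsets.UTF_8);
    }

    // An encoder left at its default action (REPORT) throws on the same input.
    static boolean defaultEncoderThrows() {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(MALFORMED));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(viaGetBytes())); // [97, 63, 98], i.e. "a?b"
        System.out.println(defaultEncoderThrows());         // true
    }
}
```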
this.ce = cs.newEncoder()
protected CharsetEncoder(Charset cs,
                         float averageBytesPerChar,
                         float maxBytesPerChar)
{
    this(cs,
         averageBytesPerChar, maxBytesPerChar,
         new byte[] { (byte)'?' });
}
The above is the CharsetEncoder constructor, which is the one the UTF-8 encoder's constructor calls. As shown, the default replacement is the '?' character.
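The default replacement can also be read back through the public API, without opening the JDK source. This is a minimal sketch; the class name ReplacementByteDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ReplacementByteDemo {
    // Returns the replacement byte sequence of a fresh UTF-8 encoder.
    static byte[] utf8Replacement() {
        return StandardCharsets.UTF_8.newEncoder().replacement();
    }

    public static void main(String[] args) {
        byte[] r = utf8Replacement();
        // A single byte holding the ASCII code of '?'.
        System.out.println(r.length);    // 1
        System.out.println((char) r[0]); // ?
    }
}
```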
II.encode chars to byte[]
public int encode(char ac[], int i, int j, byte abyte0[])
{
    int k = i + j;
    int l = 0;
    for(int i1 = l + Math.min(j, abyte0.length); l < i1 && ac[i] < '\200';)
        abyte0[l++] = (byte)ac[i++];
    while(i < k)
    {
        char c = ac[i++];
        if(c < '\200')//1
            abyte0[l++] = (byte)c;
        else
        if(c < '\u0800')//2
        {
            abyte0[l++] = (byte)(192 | c >> 6);
            abyte0[l++] = (byte)(128 | c & 63);
        } else
        if(Character.isSurrogate(c))//4
        {
            if(sgp == null)
                sgp = new Surrogate.Parser();
            int j1 = sgp.parse(c, ac, i - 1, k);//4.1
            if(j1 < 0)//4.2
            {
                if(malformedInputAction() != CodingErrorAction.REPLACE)
                    return -1;
                abyte0[l++] = replacement()[0];
            } else//4.3
            {
                abyte0[l++] = (byte)(240 | j1 >> 18);
                abyte0[l++] = (byte)(128 | j1 >> 12 & 63);
                abyte0[l++] = (byte)(128 | j1 >> 6 & 63);
                abyte0[l++] = (byte)(128 | j1 & 63);
                i++;
            }
        } else//3
        {
            abyte0[l++] = (byte)(224 | c >> 12);
            abyte0[l++] = (byte)(128 | c >> 6 & 63);
            abyte0[l++] = (byte)(128 | c & 63);
        }
    }
    return l;
}
1) Single-byte characters
Character c is in the range [0x0000, 0x0080) ('\200' is the octal escape for 0x80)
abyte0[l++] = (byte)c;
byte[] rule:
b[0] = 0### #### (always less than 0x80)
2) Two-byte characters
Character c is in the range [0x0080, 0x0800)
abyte0[l++] = (byte)(192 | c >> 6);
abyte0[l++] = (byte)(128 | c & 63);
192 <=> 1100 0000
128 <=> 1000 0000
63 <=> 0011 1111
High 5 bits: b[0] = 110# ####
Low 6 bits: b[1] = 10## ####
3) Three-byte characters
Character c is in the remaining range: [0x0800, 0xFFFF] excluding the surrogates [0xD800, 0xDFFF]
abyte0[l++] = (byte)(224 | c >> 12);
abyte0[l++] = (byte)(128 | c >> 6 & 63);
abyte0[l++] = (byte)(128 | c & 63);
224 <=> 1110 0000
128 <=> 1000 0000
63 <=> 0011 1111
High 4 bits: b[0] = 1110 ####
Middle 6 bits: b[1] = 10## ####
Low 6 bits: b[2] = 10## ####
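The three rules for non-surrogate chars can be tried out in isolation. The following is a sketch, not the JDK implementation: the class name BmpUtf8Sketch and the sample characters are assumptions, and the method expects a char outside the surrogate range. It compares its output with String.getBytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BmpUtf8Sketch {
    // Encodes one non-surrogate char with the 1-, 2- and 3-byte rules.
    static byte[] encode(char c) {
        if (c < 0x80) {
            // 1 byte: 0### ####
            return new byte[] { (byte) c };
        } else if (c < 0x800) {
            // 2 bytes: 110# #### 10## ####
            return new byte[] { (byte) (0xC0 | c >> 6), (byte) (0x80 | c & 0x3F) };
        } else {
            // 3 bytes: 1110 #### 10## #### 10## ####
            return new byte[] {
                (byte) (0xE0 | c >> 12),
                (byte) (0x80 | c >> 6 & 0x3F),
                (byte) (0x80 | c & 0x3F)
            };
        }
    }

    public static void main(String[] args) {
        // One sample per rule: U+0041, U+00E9, U+4E2D.
        char[] samples = { 'A', '\u00E9', '\u4E2D' };
        for (char c : samples) {
            byte[] expected = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(encode(c), expected)); // true for each sample
        }
    }
}
```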
4) Four-byte "characters"
Character.isSurrogate(c)
public static boolean isSurrogate(char ch) {
    return ch >= MIN_SURROGATE && ch < (MAX_SURROGATE + 1);
}
As shown, character c must be in the range [0xD800, 0xDFFF]; only chars in this range can form a 4-byte "character".
/**
 * The minimum value of a Unicode high-surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuD800'}.
 * A high-surrogate is also known as a leading-surrogate
 */
public static final char MIN_HIGH_SURROGATE = '\uD800';
/**
 * The maximum value of a Unicode high-surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuDBFF'}.
 * A high-surrogate is also known as a leading-surrogate
 */
public static final char MAX_HIGH_SURROGATE = '\uDBFF';
/**
 * The minimum value of a Unicode low-surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuDC00'}.
 * A low-surrogate is also known as a trailing-surrogate.
 */
public static final char MIN_LOW_SURROGATE = '\uDC00';
/**
 * The maximum value of a Unicode low-surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuDFFF'}.
 * A low-surrogate is also known as a trailing-surrogate
 */
public static final char MAX_LOW_SURROGATE = '\uDFFF';
/**
 * The minimum value of a Unicode surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuD800'}.
 */
public static final char MIN_SURROGATE = MIN_HIGH_SURROGATE;
/**
 * The maximum value of a Unicode surrogate code unit in the UTF-16 encoding, constant {@code '\u005CuDFFF'}.
 */
public static final char MAX_SURROGATE = MAX_LOW_SURROGATE;
/**
 * The minimum value of a Unicode supplementary code point, constant {@code U+10000}.
 */
public static final int MIN_SUPPLEMENTARY_CODE_POINT = 0x010000;
The above are some of the constants from the java.lang.Character source code that are used below.
4.1>int j1 = sgp.parse(c, ac, i - 1, k);//4.1
public int parse(char c, char ac[], int i, int j)
{
    if(!$assertionsDisabled && ac[i] != c)
        throw new AssertionError();
    if(Character.isHighSurrogate(c))
    {
        if(j - i < 2)
        {
            error = CoderResult.UNDERFLOW;
            return -1;
        }
        char c1 = ac[i + 1];
        if(Character.isLowSurrogate(c1))
        {
            character = Character.toCodePoint(c, c1);//a
            isPair = true;
            error = null;
            return character;
        } else
        {
            error = CoderResult.malformedForLength(1);
            return -1;
        }
    }
    if(Character.isLowSurrogate(c))
    {
        error = CoderResult.malformedForLength(1);
        return -1;
    } else
    {//b
        character = c;
        isPair = false;
        error = null;
        return character;
    }
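}
public static boolean isHighSurrogate(char ch) {
    // Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE
    return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
}
The high char (high surrogate) ranges over [0xD800, 0xDBFF].
public static boolean isLowSurrogate(char ch) {
    return ch >= MIN_LOW_SURROGATE && ch < (MAX_LOW_SURROGATE + 1);
}
The low char (low surrogate) ranges over [0xDC00, 0xDFFF].
Code block b cannot be reached from this call site, because encode only calls parse when c is a surrogate. parse succeeds only when the high char is in [0xD800, 0xDBFF] and the low char is in [0xDC00, 0xDFFF]; that is, the char passed in must be in [0xD800, 0xDBFF], it must be followed by another char, and that char must be in [0xDC00, 0xDFFF].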
character = Character.toCodePoint(c, c1);//a
public static int toCodePoint(char high, char low) {
    // Optimized form of:
    // return ((high - MIN_HIGH_SURROGATE) << 10)
    //         + (low - MIN_LOW_SURROGATE)
    //         + MIN_SUPPLEMENTARY_CODE_POINT;
    return ((high << 10) + low) + (MIN_SUPPLEMENTARY_CODE_POINT
                                   - (MIN_HIGH_SURROGATE << 10)
                                   - MIN_LOW_SURROGATE);
}
Character c is the high surrogate and c1 is the low surrogate; the code point produced here is what step 4.3 converts to bytes.
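| Term added into the code point | Bits it occupies |
| (high - MIN_HIGH) << 10 | bits 19..10 (10 bits) |
| low - MIN_LOW | bits 9..0 (10 bits) |
| MIN_SUPPLEMENTARY_CODE_POINT (0x10000) | bit 16 set, followed by 16 zero bits |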
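The arithmetic in toCodePoint can be checked by hand with a small sketch that uses the unoptimized form from the comment. The class name ToCodePointSketch and the sample pair D83D/DE00 are assumptions for illustration.

```java
class ToCodePointSketch {
    // Unoptimized form: 10 bits from the high surrogate, 10 bits from the
    // low surrogate, plus the 0x10000 offset of the supplementary planes.
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDE00';
        // (0x3D << 10) + 0x200 + 0x10000 = 0xF400 + 0x200 + 0x10000 = 0x1F600
        System.out.println(Integer.toHexString(combine(high, low)));          // 1f600
        System.out.println(combine(high, low) == Character.toCodePoint(high, low)); // true
    }
}
```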
4.2>if(j1 < 0)//4.2
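if(malformedInputAction() != CodingErrorAction.REPLACE)
    return -1;
abyte0[l++] = replacement()[0];
Here, any input that does not satisfy the rule for 4-byte characters is replaced by the replacement byte (by default '?'). Once converted to byte[], such a character can no longer be turned back into its original Unicode value, because the replacement's value was written in its place.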
4.3>else
} else//4.3
{
    abyte0[l++] = (byte)(240 | j1 >> 18);
    abyte0[l++] = (byte)(128 | j1 >> 12 & 63);
    abyte0[l++] = (byte)(128 | j1 >> 6 & 63);
    abyte0[l++] = (byte)(128 | j1 & 63);
    i++;
}
240 <=> 1111 0000
128 <=> 1000 0000
63 <=> 0011 1111
High 3 bits: b[0] = 1111 0###
Upper-middle 6 bits: b[1] = 10## ####
Lower-middle 6 bits: b[2] = 10## ####
Low 6 bits: b[3] = 10## ####
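As a worked example of the 4-byte rule, the sketch below encodes one supplementary code point by hand and compares the result with String.getBytes. The class name FourByteUtf8Sketch and the sample code point U+1F600 are assumptions; the method expects a code point in [0x10000, 0x10FFFF].

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FourByteUtf8Sketch {
    // Splits a 21-bit code point into 3 + 6 + 6 + 6 bits.
    static byte[] encode(int cp) {
        return new byte[] {
            (byte) (0xF0 | cp >> 18),
            (byte) (0x80 | cp >> 12 & 0x3F),
            (byte) (0x80 | cp >> 6 & 0x3F),
            (byte) (0x80 | cp & 0x3F)
        };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        byte[] manual = encode(cp);
        StringBuilder hex = new StringBuilder();
        for (byte b : manual) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // F0 9F 98 80
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```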
Original article: http://6325423.blog.51cto.com/6315423/1685469