The classes related to regular expressions are Pattern, Matcher and String. Today we start learning regular expressions in Java.
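1. How to use regular expressions
The generally recommended approach is to compile a Pattern once and obtain a Matcher from it:
Pattern pattern = Pattern.compile("^[^abc]h$");
Matcher matcher = pattern.matcher("hh");
boolean isMatch = matcher.matches();
The other approach is a one-shot call that compiles the expression on every invocation, so the Pattern cannot be reused: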
boolean b = Pattern.matches("a*b", "aaaaab");
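To make the reuse point concrete, here is a minimal sketch that compiles the expression once and shares it across calls. The class name PatternReuseDemo and the helper isMatch are made up for this illustration and are not part of the original article.

```java
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once; a Pattern is immutable and safe to share between threads.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Hypothetical helper: a Matcher is not thread-safe, so create one per call.
    static boolean isMatch(String input) {
        return A_STAR_B.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaaaab", "b", "aab c"}) {
            System.out.println(s + " -> " + isMatch(s));
        }
    }
}
```

Pattern.matches(regex, input) is still convenient for a single check; the shared Pattern only pays off when the same expression is applied repeatedly.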
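2. An introduction to some Pattern methods
Declaration of the Pattern class:
public final class Pattern extends Object implements Serializable
Pattern has no public constructor; instances are created through the static factory methods:
static Pattern compile(String regex)
static Pattern compile(String regex, int flags)
The single-argument version simply delegates to the private constructor with no flags set:
public static Pattern compile(String regex) {
    return new Pattern(regex, 0);
}
3. A few small examples of Pattern's methods; the more important ones are covered in detail later: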
public void pattern3() {
    String regex = "abc";
    Pattern pattern = Pattern.compile(regex);
    System.out.println("flag: " + pattern.flags());
    System.out.println("pattern: " + pattern.pattern());
    System.out.println("quote: " + Pattern.quote(regex));
    System.out.println("toString: " + pattern.toString());
    System.out.println("-------------------------------------------------");
    // Split on every occurrence of the pattern
    String[] strings = pattern.split("helloabcdabcefgabcgoup");
    for (String string : strings) {
        System.out.println("string: " + string);
    }
    System.out.println("-------------------------------------------------");
    // Split into at most 2 pieces
    String[] stringsLimit = pattern.split("helloabcdabcefgabcgoup", 2);
    for (String string : stringsLimit) {
        System.out.println("string: " + string);
    }
}
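The output is:
flag: 0
pattern: abc
quote: \Qabc\E
toString: abc
-------------------------------------------------
string: hello
string: d
string: efg
string: goup
-------------------------------------------------
string: hello
string: dabcefgabcgoup
Both toString() and pattern() return the pattern string, that is, the first argument passed to Pattern.compile. Pattern.quote(String) returns a literal pattern string: the text wrapped in \Q...\E, so that any metacharacters inside it lose their special meaning.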
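Since "abc" contains no metacharacters, the effect of Pattern.quote is hard to see in the output above. The sketch below, with the made-up class QuoteDemo and helpers splitRegex and splitLiteral, splits on a dot with and without quoting to show the difference.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDemo {
    // The delimiter is interpreted as a regular expression.
    static String[] splitRegex(String input, String delimiter) {
        return Pattern.compile(delimiter).split(input);
    }

    // The delimiter is quoted first, so it is matched as literal text.
    static String[] splitLiteral(String input, String delimiter) {
        return Pattern.compile(Pattern.quote(delimiter)).split(input);
    }

    public static void main(String[] args) {
        // "." as a regex matches every character, leaving only empty pieces,
        // and trailing empty strings are dropped from the result.
        System.out.println(Arrays.toString(splitRegex("a.b.c", ".")));
        // Quoted, "." matches only a literal dot.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
    }
}
```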
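Matcher is the more important class, so we cover it in detail. After the following calls, three methods can be used to perform the match: matches(), lookingAt() and find().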
Pattern pattern = Pattern.compile("^[^abc]h$");
Matcher matcher = pattern.matcher("hh");
An example shows the differences between them:
String regex = "abc";
String input = "abcabcabc";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
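The matches method: nothing is printed, because matches() requires the entire input to match and abc does not match all of abcabcabc. matches() always starts at the beginning of the region, so it should be tested with if; inside a while loop a successful match would repeat forever.
if (matcher.matches()) {
    System.out.println(matcher.start() + ", end: " + matcher.end() + "group: " + matcher.group());
}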
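The lookingAt method: the console prints 0, end: 3group: abc. lookingAt() only requires the pattern to match a prefix of the input, and it always starts at the beginning of the region, so for this input every call returns true. It should therefore be tested with if; a while loop here would never terminate.
if (matcher.lookingAt()) {
    System.out.println(matcher.start() + ", end: " + matcher.end() + "group: " + matcher.group());
}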
The find method: after it returns true, the next call resumes searching from the end of the previous match.
while (matcher.find()) {
    System.out.println(matcher.start() + ", end: " + matcher.end() + "group: " + matcher.group());
}
The output is:
0, end: 3group: abc
3, end: 6group: abc
6, end: 9group: abc
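The three methods can also be compared side by side. The following is a small sketch; the class MatchModesDemo and its helper names are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModesDemo {
    // True only if the whole input matches.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if some prefix of the input matches.
    static boolean matchesPrefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // Number of successive, non-overlapping matches anywhere in the input.
    static int countFinds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"abc", "abcabcabc", "xabc"}) {
            System.out.println(input
                    + ": matches=" + matchesWhole("abc", input)
                    + ", lookingAt=" + matchesPrefix("abc", input)
                    + ", finds=" + countFinds("abc", input));
        }
    }
}
```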
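1. The main method is shown below. Note that the loop condition is i <= matcher.groupCount(), because groupCount() does not count group 0.
Groups are delimited by parentheses and can be referenced by number. Group 0 is the whole expression, group 1 is the group opened by the first left parenthesis, and so on. For example, a(bc(d)e)f matched against abcdef has three groups: group 0 is abcdef, group 1 is bcde, and group 2 is d.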
public static void main(String[] args) {
    String regex = "a(b(c))d";
    String input = "abcdebcabcedbcfcdcfdcd";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    // Only search the first 8 characters of the input
    matcher.region(0, 8);
    while(matcher.find()) {
        for(int i = 0; i <= matcher.groupCount(); i ++) {
            System.out.println("i = " + i + ", start: " + matcher.start(i) + ", end: " + matcher.end(i) + ", group: " + matcher.group(i));
        }
    }
    System.out.println("------------------------------------------------------");
    // reset(CharSequence) replaces the input and resets the region to the whole input
    matcher.reset("abcd");
    if(matcher.matches()) {
        for(int i = 0; i <= matcher.groupCount(); i ++) {
            System.out.println("start: " + matcher.start(i) + ", end: " + matcher.end(i) + ", group: " + matcher.group(i));
        }
    }
    System.out.println("------------------------------------------------------");
    matcher.reset("abcdabcd");
    if (matcher.lookingAt()) {
        for(int i = 0; i <= matcher.groupCount(); i ++) {
            System.out.println("start: " + matcher.start(i) + ", end: " + matcher.end(i) + ", group: " + matcher.group(i));
        }
    }
}
The output is:
i = 0, start: 0, end: 4, group: abcd
i = 1, start: 1, end: 3, group: bc
i = 2, start: 2, end: 3, group: c
------------------------------------------------------
start: 0, end: 4, group: abcd
start: 1, end: 3, group: bc
start: 2, end: 3, group: c
------------------------------------------------------
start: 0, end: 4, group: abcd
start: 1, end: 3, group: bc
start: 2, end: 3, group: c
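The a(bc(d)e)f example from the explanation of group numbering can be checked directly. This is a minimal sketch; GroupNumberingDemo and its groups helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns group 0 .. groupCount() for a full match, or an empty list if there is none.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            // groupCount() excludes group 0, hence the <= in the condition.
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("a(bc(d)e)f", "abcdef"));
    }
}
```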
2. Pattern's split() method:
@Test
public void pattern3() {
    String regex = "abc";
    Pattern pattern = Pattern.compile(regex);
    String[] strings = pattern.split("helloabcdabcefgabcgoup");
    for (String string : strings) {
        System.out.println("string: " + string);
    }
    System.out.println("-------------------------------------------------");
    String[] stringsLimit = pattern.split("helloabcdabcefgabcgoup", 2);
    for (String string : stringsLimit) {
        System.out.println("string: " + string);
    }
}
The output is:
string: hello
string: d
string: efg
string: goup
-------------------------------------------------
string: hello
string: dabcefgabcgoup
How it works, from the JDK source:
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index, input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
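The source above distinguishes three cases for limit: positive (at most limit pieces), zero (no cap, trailing empty strings removed) and negative (no cap, trailing empty strings kept). The sketch below exercises each case; SplitLimitDemo and the sample input are chosen for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    static String[] split(String input, int limit) {
        return COMMA.split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a,b,,c,,";
        // limit == 0: trailing empty strings are stripped
        System.out.println(Arrays.toString(split(input, 0)));
        // limit == 2: at most two pieces, the last one holds the rest of the input
        System.out.println(Arrays.toString(split(input, 2)));
        // limit < 0: every piece is kept, including trailing empty strings
        System.out.println(Arrays.toString(split(input, -1)));
    }
}
```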
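3. Pattern flags
With the default flags, the following code prints a single line, huhx. Only the last occurrence matches, because without MULTILINE the $ anchor matches only at the end of the input:
@Test
public void pattern4() {
    String regex = "huhx$";
    String input = "My name is huhx\n how are you, HUHX\n I love you, Huhx\n huhx";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}
If we add the multiline and case-insensitive flags: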
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
the output becomes:
huhx
HUHX
Huhx
huhx
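To see what each flag contributes on its own, the sketch below counts matches under several flag combinations, including the embedded flag expression (?im), which switches on the same two flags from inside the pattern. FlagsDemo and countMatches are made-up names; the input is the one used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    static final String INPUT = "My name is huhx\n how are you, HUHX\n I love you, Huhx\n huhx";

    // Counts how many times the regex is found in INPUT under the given flags.
    static int countMatches(String regex, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(INPUT);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("default: " + countMatches("huhx$", 0));
        System.out.println("CASE_INSENSITIVE: " + countMatches("huhx$", Pattern.CASE_INSENSITIVE));
        System.out.println("MULTILINE: " + countMatches("huhx$", Pattern.MULTILINE));
        System.out.println("both: "
                + countMatches("huhx$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE));
        System.out.println("embedded (?im): " + countMatches("(?im)huhx$", 0));
    }
}
```

MULTILINE alone lets $ match before each line terminator, but only the two lowercase occurrences qualify; CASE_INSENSITIVE alone still anchors $ to the end of the input.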
4. The replacement methods of the String class
@Test
public void pattern6() {
    String regex = "huhx$";
    String input = "My name is huhx$ and I love you. huhx";
    String inputReplace = input.replace(regex, "Linux");
    String inputReplaceFirst = input.replaceFirst(regex, "Linux");
    String inputReplaceAll = input.replaceAll(regex, "Linux");
    System.out.println("inputReplace: " + inputReplace);
    System.out.println("inputReplaceFirst: " + inputReplaceFirst);
    System.out.println("inputReplaceAll: " + inputReplaceAll);
}
The output is shown below. replace performs plain string replacement: the target is taken literally, as if it had been compiled with the Pattern.LITERAL flag. replaceFirst and replaceAll treat their first argument as a regular expression with the default flags, so huhx$ matches only the huhx at the very end of the input.
inputReplace: My name is Linux and I love you. huhx
inputReplaceFirst: My name is huhx$ and I love you. Linux
inputReplaceAll: My name is huhx$ and I love you. Linux
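When the regex-based methods have to treat their arguments literally, both sides can be quoted: Pattern.quote for the pattern and Matcher.quoteReplacement for the replacement, where $ and \ are otherwise special. The class LiteralReplaceDemo and its helpers are assumptions made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplaceDemo {
    // replaceAll with the target treated as literal text rather than a regex.
    static String replaceLiteralTarget(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target), replacement);
    }

    // replaceAll with the replacement treated as literal text ('$' is not a group reference).
    static String replaceWithLiteral(String input, String regex, String replacement) {
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String input = "My name is huhx$ and I love you. huhx";
        System.out.println(replaceLiteralTarget(input, "huhx$", "Linux"));
        System.out.println(replaceWithLiteral(input, "huhx", "$HOME"));
    }
}
```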
5. The appendReplacement method:
@Test
public void pattern7() {
    String regex = "huhx";
    String input = "My name is huhx and I love you. huhx";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        // Appends the text between the previous match and this one, then the replacement
        m.appendReplacement(sb, "Linux");
    }
    // Appends whatever follows the last match
    m.appendTail(sb);
    System.out.println(sb.toString());
}
The output is:
My name is Linux and I love you. Linux
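The main reason to use appendReplacement instead of replaceAll is that the replacement can be computed per match. As a sketch of that idea, the hypothetical helper upperCaseMatches below replaces each match with its upper-case form.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Replaces every match of regex with the upper-cased matched text.
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group().toUpperCase(Locale.ROOT);
            // quoteReplacement guards against '$' or '\' in the computed text
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseMatches("huhx", "My name is huhx and I love you. huhx"));
    }
}
```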
1. The code behind Pattern.compile(String regex), the private constructor:
private Pattern(String p, int f) {
    pattern = p;
    flags = f;

    // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
    if ((flags & UNICODE_CHARACTER_CLASS) != 0)
        flags |= UNICODE_CASE;

    // Reset group index count
    capturingGroupCount = 1;
    localCount = 0;

    if (pattern.length() > 0) {
        compile();
    } else {
        root = new Start(lastAccept);
        matchRoot = lastAccept;
    }
}
capturingGroupCount tracks the capturing groups and starts at 1, because group 0 (the whole match) always exists. The key method is compile(), which initializes the Node fields root and matchRoot. These Node objects are what actually carry out the matching.
2. Pattern's matcher method:
public Matcher matcher(CharSequence input) {
    if (!compiled) {
        synchronized(this) {
            if (!compiled)
                compile();
        }
    }
    Matcher m = new Matcher(this, input);
    return m;
}
The code of Matcher(this, input) is shown below. groups is an int array that stores the start and end index of each group's match. The result later returned by group(int i) is built from these indices and text.
Matcher(Pattern parent, CharSequence text) {
    this.parentPattern = parent;
    this.text = text;

    // Allocate state storage
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[parent.localCount];

    // Put fields into initial states
    reset();
}
3. Matcher's find method. It resumes from last, the end of the previous match, and advances by one position when the previous match was empty (last == first):
public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}
The search method: if nothing matches, first is set to -1, which is why start(), end() and group() throw IllegalStateException after a failed find. To search from the beginning again, call reset().
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
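The position bookkeeping in find can be observed from the outside. The sketch below, with the invented class FindResetDemo, exhausts a matcher, calls reset() and searches again, and also uses find(int), which resets the matcher and starts searching at the given index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindResetDemo {
    // Consumes the matcher from its current position and counts the matches.
    static int countRemaining(Matcher m) {
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts all matches, resets the matcher, then counts again.
    static int[] countTwice(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int first = countRemaining(m);
        m.reset(); // back to the start of the input
        int second = countRemaining(m);
        return new int[] {first, second};
    }

    // Start index of the first match at or after fromIndex, or -1 if there is none.
    static int startOfFindFrom(String regex, String input, int fromIndex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(fromIndex) ? m.start() : -1;
    }

    public static void main(String[] args) {
        int[] counts = countTwice("abc", "abcabcabc");
        System.out.println(counts[0] + " then " + counts[1]);
        System.out.println(startOfFindFrom("abc", "abcabcabc", 1));
    }
}
```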
4. The lookingAt method:
public boolean lookingAt() { return match(from, NOANCHOR); }
The match method. Unlike search, which uses root, it runs matchRoot, so the match must begin exactly at from:
boolean match(int from, int anchor) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = anchor;
    boolean result = parentPattern.matchRoot.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
5. The matches method: it calls the same match method shown above, but with ENDANCHOR, so the match must also extend to the end of the region.
public boolean matches() { return match(from, ENDANCHOR); }
lookingAt and matches both pass the field from into match, which then executes this.first = from;. That field, together with to, can be set with the region method, which restricts matching to part of the input:
public Matcher region(int start, int end) {
    if ((start < 0) || (start > getTextLength()))
        throw new IndexOutOfBoundsException("start");
    if ((end < 0) || (end > getTextLength()))
        throw new IndexOutOfBoundsException("end");
    if (start > end)
        throw new IndexOutOfBoundsException("start > end");
    reset();
    from = start;
    to = end;
    return this;
}
The official documentation describes it as follows:
A matcher finds matches in a subset of its input called the region. By default, the region contains all of the matcher's input. The region can be modified via the region method and queried via the regionStart and regionEnd methods. The way that the region boundaries interact with some pattern constructs can be changed.
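A short sketch of the region behaviour described above: matches are only found inside the region, reported indices are still relative to the whole input, and with the default anchoring bounds ^ matches at the start of the region. RegionDemo and findStarts are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Start indices of all matches found inside the region [start, end).
    static List<Integer> findStarts(String regex, String input, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, end);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start()); // index into the full input, not into the region
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "abc abc abc";
        System.out.println(findStarts("abc", input, 0, input.length()));
        System.out.println(findStarts("abc", input, 4, input.length()));
        // Default anchoring bounds: ^ matches at the region start (index 4)
        System.out.println(findStarts("^abc", input, 4, input.length()));
    }
}
```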
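6. Reading the match results: the groups array stores the start and end index of each matched group. A pair of indices is enough to identify a substring of the input.
The group() method: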
public String group() { return group(0); }
The group(int group) method:
public String group(int group) {
    if (first < 0)
        throw new IllegalStateException("No match found");
    if (group < 0 || group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    // A group that did not take part in the match has indices of -1
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
        return null;
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
The getSubSequence(int start, int end) method:
CharSequence getSubSequence(int beginIndex, int endIndex) { return text.subSequence(beginIndex, endIndex); }
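The null branch in group(int) is reached when a group exists in the pattern but did not take part in the match, for example the unused side of an alternation. The sketch below shows this; OptionalGroupDemo and describe are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Lists each group as index=text@start for a full match.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= m.groupCount(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            // group(i) is null and start(i) is -1 for a group that did not participate
            sb.append(i).append('=').append(m.group(i)).append('@').append(m.start(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("(a)|(b)", "b"));
    }
}
```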
7. The start and end methods:
The start() method:
public int start() {
    if (first < 0)
        throw new IllegalStateException("No match available");
    return first;
}
The start(int group) method:
public int start(int group) {
    if (first < 0)
        throw new IllegalStateException("No match available");
    if (group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    return groups[group * 2];
}
The end() method:
public int end() {
    if (first < 0)
        throw new IllegalStateException("No match available");
    return last;
}
The end(int group) method:
public int end(int group) {
    if (first < 0)
        throw new IllegalStateException("No match available");
    if (group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    return groups[group * 2 + 1];
}
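Because start(i) and end(i) are plain offsets into the input, taking the substring between them gives the same text as group(i). A minimal sketch follows, assuming every group takes part in the match (otherwise the offsets would be -1); GroupSpanDemo and spans are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSpanDemo {
    // For the first match, lists each group as index:[start,end)=text.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                int start = m.start(i);
                int end = m.end(i); // exclusive: one past the last matched character
                result.add(i + ":[" + start + "," + end + ")=" + input.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a(b(c))d", "xxabcd"));
    }
}
```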
Original article: http://www.cnblogs.com/huhx/p/javaRegrex2.html