
Several practical examples of Java regular expressions (matching URLs, matching US Social Security numbers, matching dates)

Posted: 2014-11-29 17:13:00


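For a recent project I needed to extract strings from English text for topic clustering, so I spent a day learning Java regular expressions. The small examples below are my practice exercises. If anything in them is unreasonable, corrections are welcome!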

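1. This example filters the URLs out of English text and prints the filtered strings

First, here is the English text to be filtered. I stored it in a file named englishtxt.txt, with the following content: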

  

www.baidu.com
银行挤兑:可能引发下一轮金融危机的盲点 http://mp.weixin.qq.com/s?__biz=MjM5MDY4Mzg2MA==&mid=200223248&idx=1&sn=a5b668754a60a8e07f335bd59521fb03#rd?…
Beijing CBD right now 01 pic.twitter.com/zCNP4CFrrk
I see more and more Chinese ask the same question online: what if most #MH370 passengers were Americans; how would the US government react?
10:27:01 Chinese Net friend expectations http://chinafree.greatzhonghua.org/showthread.php?tid=5377?… Chinese Net friend expectations -...
01:47:01 Times silly and fantastic notions, Gu Xiaojun Thought Yiu glorious http://chinafree.greatzhonghua.org/showthread.php?tid=4969?… T...
[強國空氣問題比愛滋更嚴重] China Smog at Center of <<Air Pollution Deaths Cited>> by WHO http://bloom.bg/1rqNRBP? /via @BloombergNews
[Android 高登仔] LIHK 已重生,你會花 HK$10 買嗎? https://play.google.com/store/apps/details?id=com.lihk.hkgolden.app.reborn?…
#Taiwan protests: Water cannons are an indiscriminate tool for dispersing protesters & can result in serious injury
NASA 的新太空衣... http://jscfeatures.jsc.nasa.gov/z2/?
PHOTOS: Marijuana through the years http://ow.ly/uXzuq? (AP Photo/DEA) pic.twitter.com/4LSP4nlLMQ
Protest in Taiwan http://blog.flickr.net/en/2014/03/24/protest-in-taiwan/?… /via @flickr
[原來昨天說的那位嬰兒已經...] Baby born on board diverted Cathay flight dies http://www.scmp.com/news/hong-kong/article/1456417/baby-born-board-diverted-cathay-flight-dies?… /via @SCMP_News
What does Apple think about the lack of diversity in emojis? We have their response. http://on.mtv.com/OWu6D7? /via @MTVact
Linkin Park releases customizable music video powered by Xbox's Project Spark http://www.theverge.com/2014/3/25/5546982/linkin-park-releases-customizable-music-video-powered-by-xboxs?…
Full draw for @afcasiancup 2015 is here pic.twitter.com/nrYJo1mm9G #AC2015
Interesting draw RT @afcasiancup: Group B: Saudi Arabia, China PR, DPR Korea, Uzbekistan #AC2015
Finally: @emirates are activating their Twitter account.
Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet —SPA pic.twitter.com/NDSGJVbXTs

   

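As the file shows, the text contains a large number of URLs. Feeding it directly into topic clustering would introduce a lot of noisy data, so the URLs need to be removed first. My code is as follows: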

   

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLMatcher {
    // Optional scheme, dotted host name, optional port, then the rest of the URL up to '#' or whitespace.
    private static final String URL_REGEX = "(http://|https://|ftp://)?(\\w+\\.)+\\w+(:\\d*)?([^#\\s]*)";
    // Punctuation to drop once the URLs are gone.
    private static final String PUNCT_REGEX = "[\\/?:;!@#$%^&*+()【】<<>>...-]";
    // Compiled once instead of once per input line.
    private static final Pattern WORD = Pattern.compile("\\w+");

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader("D://englishtxt.txt"))) {
            System.out.println("Start reading data from the text");
            String line;
            while ((line = br.readLine()) != null) {
                String value = line.replaceAll(URL_REGEX, "").replaceAll(PUNCT_REGEX, "");
                // Collect the remaining word tokens, each followed by a single space.
                StringBuilder strb = new StringBuilder();
                Matcher mch = WORD.matcher(value);
                while (mch.find()) {
                    strb.append(mch.group()).append(" ");
                }
                System.out.println(strb);
            }
        }
    }
}

The code above not only filters out most of the URLs, it also removes some special punctuation characters.

The output is as follows:

   

Start reading data from the text

rd 
Beijing CBD right now 
I see more and more Chinese ask the same question online what if most MH passengers were Americans how would the US government react 
Chinese Net friend expectations Chinese Net friend expectations 
Times silly and fantastic notions Gu Xiaojun Thought Yiu glorious T 
China Smog at Center of Air Pollution Deaths Cited by WHO via BloombergNews 
Android LIHK HK I 
Taiwan protests Water cannons are an indiscriminate tool for dispersing protesters can result in serious injury 
NASA 
PHOTOS Marijuana through the years AP PhotoDEA 
Protest in Taiwan via flickr 
f Baby born on board diverted Cathay flight dies via SCMP News 
What does Apple think about the lack of diversity in emojis We have their response via MTVact 
Linkin Park releases customizable music video powered by Xbox s Project Spark 
Full draw for afcasiancup is here AC 
Interesting draw RT afcasiancup Group B Saudi Arabia China PR DPR Korea Uzbekistan AC 
Finally emirates are activating their Twitter account 
Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet SPA 

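As the output shows, the URLs have essentially all been filtered out. A few fragments remain, such as the "rd" on the second line: the URL pattern stops at "#", so the "rd" that follows it in the original URL survives the replacement.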
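To remove the trailing URL fragment as well, the tail of the pattern can consume everything up to the next whitespace. The following is a minimal sketch, not the original code: the class name UrlCleaner, the method clean, and the example.com input are illustrative assumptions. It also replaces each URL with a space rather than an empty string, and collapses all remaining non-word characters in one step.

```java
import java.util.regex.Pattern;

public class UrlCleaner {
    // Optional scheme, dotted host, optional port, then every non-whitespace
    // character that follows (so a "#fragment" is removed too).
    private static final Pattern URL = Pattern.compile(
            "(https?://|ftp://)?(\\w+\\.)+\\w+(:\\d+)?\\S*");
    // Any run of characters that are not ASCII letters, digits, or underscore.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    public static String clean(String line) {
        String noUrls = URL.matcher(line).replaceAll(" ");
        return NON_WORD.matcher(noUrls).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical input with a fragment after '#'.
        System.out.println(clean("news http://example.com/s?mid=1#rd end"));
        // prints "news end"
        System.out.println(clean("Full draw pic.twitter.com/nrYJo1mm9G #AC2015"));
        // prints "Full draw AC2015"
    }
}
```

Because the scheme is optional, this pattern (like the original) treats any dotted token as a URL, so ordinary text such as "3.14" would be removed as well. That is acceptable for noisy tweets but is a heuristic, not a URL validator.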

 

2. This example matches US Social Security numbers

The code is as follows:

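import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnMatcher {
    public static void main(String[] args) {
        String safeNum = "This is a safe num 999-99-9999,this is the second num 456003348,this is the third num 456-909090,this is the fourth num 45677-0764";
        // Three digits, two digits, four digits; each hyphen is optional on its own.
        Pattern ptn = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");
        Matcher mch = ptn.matcher(safeNum);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}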

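The final output is shown below. All four numbers match, because each hyphen in the pattern is optional independently of the other: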

999-99-9999
456003348
456-909090
45677-0764
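If only the two consistent forms (both hyphens or none) should match, a backreference can tie the second separator to the first. This is a sketch under that assumption rather than part of the original exercise. The class name StrictSsnMatcher and the method findAll are illustrative, and the pattern checks the format only, not whether the number was ever issued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictSsnMatcher {
    // (-?) captures the first separator (a hyphen or nothing) and \1 requires
    // the second separator to be the same. The lookarounds reject candidates
    // that are part of a longer run of digits.
    private static final Pattern SSN =
            Pattern.compile("(?<!\\d)\\d{3}(-?)\\d{2}\\1\\d{4}(?!\\d)");

    public static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "This is a safe num 999-99-9999,this is the second num 456003348,"
                + "this is the third num 456-909090,this is the fourth num 45677-0764";
        System.out.println(findAll(text));
        // prints [999-99-9999, 456003348]
    }
}
```

With this pattern, "456-909090" and "45677-0764" are no longer reported, since their separators are inconsistent.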

 

3. This example matches dates in English text

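import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateMatcher {
    public static void main(String[] args) {
        String strDate = "this is a date June 26,1951";
        // A word (the month), one whitespace, a 1-2 digit day, a comma, optional whitespace, a 4-digit year.
        Pattern ptn = Pattern.compile("([a-zA-Z]+)\\s[0-9]{1,2},\\s*[0-9]{4}");
        Matcher mch = ptn.matcher(strDate);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}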

The output is:

June 26,1951
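When the individual parts of the date are needed rather than the whole match, the same pattern can be written with named capture groups (available since Java 7). This is a minimal sketch. The class name DateParts, the method describe, and the group names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateParts {
    // Same shape as the pattern above, with each component in a named group.
    private static final Pattern DATE = Pattern.compile(
            "(?<month>[a-zA-Z]+)\\s(?<day>[0-9]{1,2}),\\s*(?<year>[0-9]{4})");

    public static String describe(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return "no date found";
        }
        return "month=" + m.group("month")
                + " day=" + m.group("day")
                + " year=" + m.group("year");
    }

    public static void main(String[] args) {
        System.out.println(describe("this is a date June 26,1951"));
        // prints "month=June day=26 year=1951"
    }
}
```

Note that the month group accepts any alphabetic word, so a string such as "page 12,2014" would also match. Restricting the group to the twelve month names would tighten it.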

These three small examples are the practice exercises I wrote while learning regular expressions. I hope they help with your own learning!

 


Original article: http://www.cnblogs.com/westlake1990/p/4131047.html
