Java Regular Expressions: Web Crawler

Posted: 2016-03-31 20:19:27


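Web crawler: a program that retrieves data matching specified rules from the internet. The examples below crawl for email addresses. Only the source differs: it is either a local file or a page on the network.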

(1) Crawling local data:

    // Needs: java.io.BufferedReader, java.io.FileReader, java.io.IOException,
    // java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
    public static List<String> getMails() throws IOException {
        // 2. The rule the data must match: an email-like pattern
        String mail_regex = "\\w+@\\w+(\\.\\w+)+";
        List<String> list = new ArrayList<String>();
        Pattern p = Pattern.compile(mail_regex);
        // 1. Read the source, here a local file; try-with-resources closes the reader
        try (BufferedReader bufr = new BufferedReader(new FileReader("D:\\mail.txt"))) {
            String line = null;
            while ((line = bufr.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // 3. Store every match in the list
                    list.add(m.group());
                }
            }
        }
        return list;
    }

Output:

emdm@cnw.cjn
cwec@cwc.cwk.cwe
163@com.cn
shuwei_yao@163.com.cn
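
To see what the pattern above accepts and rejects, here is a minimal sketch that runs it against in-memory strings instead of a file. The class name MailRegexDemo, the helper extractMails, and the sample inputs are made up for illustration and are not part of the original post. Note that \w covers only letters, digits and underscore, so the pattern is loose: it accepts anything shaped like word@word.word, and it truncates addresses whose local part contains a dot or a hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailRegexDemo {
    // Same pattern as in the article: word chars, '@', word chars,
    // then one or more ".word" groups.
    private static final Pattern MAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Hypothetical helper: collects every match found in the given text.
    static List<String> extractMails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [shuwei_yao@163.com.cn]
        System.out.println(extractMails("contact: shuwei_yao@163.com.cn"));
        // prints [last@example.com] because '.' is not a \w character
        System.out.println(extractMails("first.last@example.com"));
        // prints [] because at least one ".word" group is required after '@'
        System.out.println(extractMails("no-domain@localhost"));
    }
}
```

If stricter matching is needed, the character classes would have to be widened (for example to allow '.' and '-' before the '@'); that is a design choice the original post does not make.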

(2) Crawling network data:

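    // Needs: java.io.BufferedReader, java.io.InputStreamReader, java.io.IOException, java.net.URL,
    // java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
    public static List<String> getWebMails() throws IOException {
        // 2. The rule the data must match: an email-like pattern
        String mail_regex = "\\w+@\\w+(\\.\\w+)+";
        List<String> list = new ArrayList<String>();
        Pattern p = Pattern.compile(mail_regex);
        // 1. Read the source, here a web page (decoded with the platform default charset);
        //    try-with-resources closes the stream
        URL url = new URL("http://sina.com.cn");
        try (BufferedReader bufIn = new BufferedReader(new InputStreamReader(
                url.openStream()))) {
            String line = null;
            while ((line = bufIn.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // 3. Store every match in the list
                    list.add(m.group());
                }
            }
        }
        return list;
    }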

Output:

jubao@vip.sina.com
jubao@vip.sina.com
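
The two methods differ only in where the text comes from, and the network run reports the same address twice. One possible way to address both points, sketched under assumptions: a single method that takes any java.io.Reader and collects matches into a LinkedHashSet, which drops duplicates while keeping first-seen order. The class name MailCrawlerSketch, the method collectUniqueMails, and the sample HTML are hypothetical and not from the original post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailCrawlerSketch {
    // Same pattern as in the article.
    private static final Pattern MAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Hypothetical helper: reads the source line by line and returns each
    // distinct match once, in the order it was first seen.
    static List<String> collectUniqueMails(Reader source) throws IOException {
        Set<String> unique = new LinkedHashSet<>();
        // try-with-resources closes the reader (and the underlying source)
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = MAIL.matcher(line);
                while (m.find()) {
                    unique.add(m.group());
                }
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page standing in for a downloaded document.
        String page = "<a href=\"mailto:jubao@vip.sina.com\">jubao@vip.sina.com</a>\n"
                    + "<p>other: emdm@cnw.cjn</p>\n";
        // prints [jubao@vip.sina.com, emdm@cnw.cjn]
        System.out.println(collectUniqueMails(new StringReader(page)));
    }
}
```

With this shape, the local case would pass new FileReader("D:\\mail.txt") and the network case would pass new InputStreamReader(url.openStream(), StandardCharsets.UTF_8). The UTF-8 charset here is an assumption; the actual encoding of the page should be checked.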


Original article: http://www.cnblogs.com/ysw-go/p/5342445.html
