码迷,mamicode.com
首页 > 编程语言 > 详细

java爬取网站信息和url实例

时间:2018-12-26 11:38:08      阅读:215      评论:0      收藏:0      [点我收藏+]

标签:spi   tin   iter   url   cat   mkdir   buffer   blog   writer   


https://blog.csdn.net/weixin_38409425/article/details/78616688(出自此為博主)

 

具體代碼如下:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* 网络爬虫
*
* @author jacke 陈
*
*/
public class SpirderUrl {

public static void spiderURL(String url, String regex, String filename) {

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

String time = sdf.format(new Date());
System.out.println(time);

URL realURL = null;
URLConnection connection = null;
BufferedReader br = null;
PrintWriter pw = null;
PrintWriter pw1 = null;

Pattern pattern = Pattern.compile(regex);
try {
realURL = new URL(url);
connection = realURL.openConnection();
// connection.connect();

File fileDir = new File("E:/spider/" + time);
if (!fileDir.exists()) {
fileDir.mkdirs();
}
// 将爬取到的内容放到E盘相应目录下
pw = new PrintWriter(
new FileWriter("E:/spider/" + time + "/" + filename + "_content.txt"), true);
pw1 = new PrintWriter(new FileWriter("E:/spider/" + time + "/" + filename + "_URL.txt"),
true);

br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line = null;

// 读写
while ((line = br.readLine()) != null) {
pw.println(line);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
pw1.println(matcher.group());
}

}
System.out.println("爬取成功!");
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
br.close();
pw.close();
pw1.close();
} catch (IOException e) {
e.printStackTrace();
}

}

}

public static void main(String[] args) {
String url = "https://www.cnblogs.com/csh520mjy/p/";
String regex = "(http|https)://[\\w+\\.?/?]+\\.[A-Za-z]+";
spiderURL(url, regex, "8btc");
}

}

爬取結果:

 

技术分享图片

 

 

java爬取网站信息和url实例

标签:spi   tin   iter   url   cat   mkdir   buffer   blog   writer   

原文地址:https://www.cnblogs.com/csh520mjy/p/10177885.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!