I'm taking a data mining course and wanted to build a small example. The basic idea: open the URLs listed in a configuration file, then automatically select and save the articles whose category information matches. For efficiency this should be multi-threaded. So far only the basic skeleton is done: reading the config file, opening the URLs, and saving the files. A lot of work remains, such as parsing the pages and analyzing the categories (a rough sketch of that step is at the end of the post). It feels like the biggest part of the work will be the page analysis, which doesn't have much to do with data mining itself, so I'm not sure it meets the requirements. The code follows:
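First, for reference, the config.properties it reads could look something like this; the key names match what the code below reads, while all the values here are made-up examples:

# made-up example values; only the key names come from the code
baseUrl=http://blog.csdn.net/
fileDir=D:/blogs
# blog IDs separated by semicolons
searchBlogs=user1;user2;user3
# non-ASCII values survive via the ISO-8859-1 round trip done in main()
category=数据挖掘

Then the main class: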
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class CsdnBlogMining {
    public static void main(String[] args) {
        String baseUrl = null;
        String searchBlogs = null;
        String[] blogs = null;
        String fileDir = null;
        String category = null;
        // config.properties is loaded from the classpath
        InputStream inputStream = CsdnBlogMining.class.getClassLoader().getResourceAsStream("config.properties");
        Properties p = new Properties();
        try {
            p.load(inputStream);
            baseUrl = p.getProperty("baseUrl");
            fileDir = p.getProperty("fileDir");
            searchBlogs = p.getProperty("searchBlogs");
            // != compares references, not content; check for null/empty instead
            if (searchBlogs != null && !searchBlogs.isEmpty()) {
                blogs = searchBlogs.split(";");
            }
            if (blogs != null) {
                for (String s : blogs) {
                    // TODO: this loop should run in multiple threads; see the sketch after the class
                    SaveFile sf = new SaveFile(baseUrl + s, fileDir + "/" + s + ".html");
                    sf.save();
                }
            }
            // Properties.load() reads bytes as ISO-8859-1, so re-decode as UTF-8
            // to keep a non-ASCII category intact (category is not used yet)
            category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
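The loop above is where the multithreading belongs. A minimal, self-contained sketch of one way to do it with a fixed-size thread pool; the class name ParallelFetch, the pool size of 4, and all the values are placeholders, not part of the original code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFetch {
    public static void main(String[] args) throws InterruptedException {
        final String baseUrl = "http://blog.csdn.net/";  // placeholder; would come from config.properties
        final String fileDir = "blogs";                  // placeholder; would come from config.properties
        String[] blogs = {"user1", "user2", "user3"};    // placeholder blog IDs

        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is an arbitrary choice
        for (final String s : blogs) {
            pool.submit(new Runnable() {
                public void run() {
                    new SaveFile(baseUrl + s, fileDir + "/" + s + ".html").save();
                }
            });
        }
        pool.shutdown();                             // accept no new tasks
        pool.awaitTermination(10, TimeUnit.MINUTES); // wait for all downloads to finish
    }
}

A fixed pool keeps the number of simultaneous connections bounded instead of spawning one thread per blog.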
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveFile {
    private String url;
    private String filename;

    public SaveFile(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }
    // Download the page at url and save it to filename as UTF-8
    public void save() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // some servers refuse requests that carry no browser User-Agent
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                String res = EntityUtils.toString(entity, "UTF-8");
                // try-with-resources closes the stream even when write() throws,
                // which the original version did not guarantee
                try (BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(filename))) {
                    outputStream.write(res.getBytes("UTF-8"));
                    outputStream.flush();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
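The remaining work, parsing the saved pages and checking their category, could lean on an HTML parser such as Jsoup instead of hand-written string matching. A rough sketch, assuming Jsoup is on the classpath; the class name CategoryFilter and the ".category a" selector are placeholders, since the real selector depends on how the pages are structured:

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CategoryFilter {
    // Return true if the saved page belongs to the wanted category.
    // ".category a" is a placeholder selector, to be replaced once the
    // structure of the actual pages has been inspected.
    public static boolean matches(File htmlFile, String wanted) throws IOException {
        Document doc = Jsoup.parse(htmlFile, "UTF-8");
        for (Element link : doc.select(".category a")) {
            if (wanted.equals(link.text().trim())) {
                return true;
            }
        }
        return false;
    }
}

Original article: http://blog.csdn.net/zjc/article/details/44308945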