I'm taking a data mining course and wanted to build a small example. The basic idea: open the URLs listed in a configuration file, then use each page's category information to automatically select and save the matching articles. For efficiency, the downloads should run in multiple threads. So far only a basic framework is done: reading the config file, fetching the URLs, and saving the files. A lot of work remains, such as parsing the pages and analyzing the categories. It feels like the biggest workload will be the HTML parsing, which has little to do with data mining itself, so I'm not sure this fits the course requirements. The code is as follows:
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class CsdnBlogMining {
    // Expected keys in config.properties: baseUrl, fileDir,
    // searchBlogs (semicolon-separated blog IDs), category
    public static void main(String[] args) {
        String baseUrl = null;
        String searchBlogs = null;
        String[] blogs = null;
        String fileDir = null;
        String category = null;
        InputStream inputStream = CsdnBlogMining.class.getClassLoader()
                .getResourceAsStream("config.properties");
        Properties p = new Properties();
        try {
            p.load(inputStream);
            baseUrl = p.getProperty("baseUrl");
            fileDir = p.getProperty("fileDir");
            searchBlogs = p.getProperty("searchBlogs");
            // Strings must be compared with equals()/isEmpty(), not != "";
            // the null check also prevents an NPE when the key is missing
            if (searchBlogs != null && !searchBlogs.isEmpty()) {
                blogs = searchBlogs.split(";");
                for (String s : blogs) {
                    // TODO: run these downloads in multiple threads
                    // (see the ExecutorService sketch below)
                    SaveFile sf = new SaveFile(baseUrl + s, fileDir + "/" + s + ".html");
                    sf.save();
                }
            }
            // Properties.load decodes ISO-8859-1 by default; re-decode
            // the bytes to recover the UTF-8 category text
            category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
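The TODO above marks where the downloads should become concurrent. Here is a minimal sketch of that step using a fixed-size thread pool, assuming the SaveFile class defined below; the ParallelSaver class name, the pool size of 4, and the 10-minute timeout are placeholders of mine, not values from the original code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSaver {
    // Download all blogs concurrently instead of one by one
    public static void saveAll(String baseUrl, String fileDir, String[] blogs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // placeholder pool size
        for (String s : blogs) {
            // Each task fetches one blog page and writes it to disk
            pool.submit(() -> new SaveFile(baseUrl + s, fileDir + "/" + s + ".html").save());
        }
        pool.shutdown();                             // no more tasks will be submitted
        pool.awaitTermination(10, TimeUnit.MINUTES); // placeholder timeout
    }
}

With this in place, the for loop in main would collapse to a single ParallelSaver.saveAll(baseUrl, fileDir, blogs) call.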
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveFile {
    private String url;
    private String filename;

    public SaveFile(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    // Fetch the page at url and write the response body to filename
    public void save() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Some sites reject requests without a browser-like User-Agent
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                String res = EntityUtils.toString(entity, "UTF-8");
                // try-with-resources closes the file even if write() throws,
                // and the file is only created once the request has succeeded
                try (BufferedOutputStream outputStream =
                        new BufferedOutputStream(new FileOutputStream(filename))) {
                    outputStream.write(res.getBytes("UTF-8"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
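For the parsing step mentioned at the top, the part that still looks like the biggest workload, an HTML parser such as Jsoup would keep the category analysis short. Below is a rough sketch of how a saved page could be checked against the configured category; the CategoryParser class and the ".category a" CSS selector are hypothetical placeholders of mine, since the real selector depends on the structure of the saved CSDN pages:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CategoryParser {
    // Return true if the saved page carries the wanted category label
    public static boolean matchesCategory(File htmlFile, String category) throws IOException {
        Document doc = Jsoup.parse(htmlFile, "UTF-8");
        // Placeholder selector: inspect the saved pages to find the
        // element that actually holds the category information
        for (Element e : doc.select(".category a")) {
            if (e.text().contains(category)) {
                return true;
            }
        }
        return false;
    }
}

A check like this could also run before saving, so pages outside the wanted category are never written to disk at all.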
Original post: http://blog.csdn.net/zjc/article/details/44308945