For my data mining course, I planned the data preparation step like this: open the URLs listed in a configuration file and save each page to disk. The saved files then go through content parsing, text extraction, matrix conversion, clustering, and so on.
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CsdnBlogMining {
    public static void main(String[] args) {
        final int THREAD_COUNT = 5;
        String baseUrl = null;
        String searchBlogs = null;
        String[] blogs = null;
        String fileDir = null;
        //String category = null;

        // Load crawl settings from config.properties on the classpath.
        InputStream inputStream = CsdnBlogMining.class.getClassLoader()
                .getResourceAsStream("config.properties");
        Properties p = new Properties();
        try {
            p.load(inputStream);
            baseUrl = p.getProperty("baseUrl");
            fileDir = p.getProperty("fileDir");
            searchBlogs = p.getProperty("searchBlogs");
            // The original used searchBlogs != "", which compares references;
            // test the string's content instead.
            if (searchBlogs != null && !searchBlogs.isEmpty()) {
                blogs = searchBlogs.split(";");
            }
            if (blogs != null) {
                // Fetch all blog pages concurrently with a fixed-size pool.
                ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
                for (String s : blogs) {
                    pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
                }
                pool.shutdown();
            }
            //category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
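For reference, the config.properties file the code expects would look roughly like this. The keys come straight from the code above; the values are made-up placeholders:

# Hypothetical example of config.properties (values are placeholders)
baseUrl=http://blog.csdn.net/
fileDir=/tmp/blogs
# blog IDs to fetch, separated by semicolons
searchBlogs=user1;user2;user3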
The module that opens a web page and saves it:
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private final String url;
    private final String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Send a browser-like User-Agent so the server does not reject the crawler.
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        // try-with-resources closes the stream even when an exception is thrown;
        // the original could leak the stream on an early IOException.
        try (BufferedOutputStream outputStream =
                new BufferedOutputStream(new FileOutputStream(filename))) {
            HttpResponse response = httpclient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Decode the body as UTF-8 and write it out as UTF-8.
                    String res = EntityUtils.toString(entity, "UTF-8");
                    outputStream.write(res.getBytes("UTF-8"));
                    outputStream.flush();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
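DefaultHttpClient comes from Apache HttpClient 4.x (it was deprecated in later 4.x releases in favor of HttpClientBuilder, but the code above keeps the original API). If the project is built with Maven, a dependency along these lines would pull it in; the version here is only an example of a 4.x release that still ships this class:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.2.5</version>
</dependency>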
Follow-up:
The assignment is finished, but it ended up having almost nothing to do with the code above, so I considered deleting all of it. On reflection, though, nothing here is actually wrong; it just went unused, so I'll keep it.
In the end, the Java code, a loop plus a thread pool, was only used to fetch a list of URLs and save them to files, while the actual mining was done in R: fetching pages, extracting the body text, word segmentation, clustering, and writing out the results. R really saves effort; a few dozen lines of code handled everything. The final clustering, however, was disappointing: features computed over the full text are not distinctive enough, so the resulting clusters are quite inaccurate, and the approach still needs improvement.
This post is from the "空空如也" blog; please contact the author before reposting!
Original URL: http://6738767.blog.51cto.com/6728767/1920069