Data Collection Class

Date: 2014-07-19 18:15:25

A crawler, also called a spider, is a program that fetches resources from other websites. In C#/.NET, a page can be fetched as follows:

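// Requires: using System.IO; using System.Net; using System.Text;
protected string GetPageHtml(string url)
{
    try
    {
        WebRequest myreq = WebRequest.Create(url);
        // Dispose the response and the reader so the connection is released.
        using (WebResponse myrep = myreq.GetResponse())
        using (StreamReader reader = new StreamReader(myrep.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            // The page is assumed to be encoded as GB2312.
            return reader.ReadToEnd();
        }
    }
    catch
    {
        // On any failure (invalid URL, network error, HTTP error status), return an empty string.
        return "";
    }
}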

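This method returns the HTML source of the page at the given URL.
Some sites block crawlers, however. For those, the request has to imitate a browser by sending browser-like Accept and User-Agent headers: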

// Requires: using System.IO; using System.Net; using System.Text;
protected string GetPageHtml(string url)
{
    try
    {
        HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
        // Send the Accept and User-Agent headers that a browser (here IE 6) would send.
        myReq.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
        myReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)";
        using (HttpWebResponse myRep = (HttpWebResponse)myReq.GetResponse())
        using (StreamReader sr = new StreamReader(myRep.GetResponseStream(), Encoding.Default))
        {
            // Encoding.Default depends on the machine's settings, so the text
            // decodes correctly only when it matches the page's encoding.
            return sr.ReadToEnd();
        }
    }
    catch
    {
        // On any failure, return an empty string.
        return "";
    }
}
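
Both versions above hard-code the text encoding (gb2312 in the first, Encoding.Default in the second), so pages in any other encoding will come back garbled. One possible refinement, shown here only as a sketch, is to use the charset the server declares. EncodingHelper and ResolveEncoding are hypothetical names that are not part of the original article; the charset string is assumed to come from the response's HttpWebResponse.CharacterSet property.

```csharp
using System;
using System.Text;

public static class EncodingHelper
{
    // Hypothetical helper (not from the original article): maps a charset
    // name, e.g. the value of HttpWebResponse.CharacterSet, to an Encoding,
    // and returns the supplied fallback when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // Some servers wrap the charset name in quotes, so strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // The name is not a valid or supported encoding name.
            return fallback;
        }
    }
}
```

In the second GetPageHtml, the Encoding.Default argument passed to the StreamReader could then be replaced with EncodingHelper.ResolveEncoding(myRep.CharacterSet, Encoding.Default). This trusts the server's header, which may itself be wrong or absent, so the fallback still matters.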

Original article: http://www.cnblogs.com/yujinchao88/p/3855051.html
