
Building a Proxy Pool

Date: 2016-12-24 17:19:33


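Use case

   A crawler that hits a site too often gets blocked by its anti-crawling mechanism, and some sites restrict by IP, for example allowing only one action per IP. That is when you need proxies. Demand for them is high and there are commercial providers, but if you would rather not pay, you can build a proxy pool yourself.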

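Tools used

  O/RM: Entity Framework

  HTML parsing: HtmlAgilityPack

  Task scheduling: Quartz.NET

Basic principle

  Some sites publish free proxy IPs, for example xicidaili.com and proxy360.cn. They list many free proxies, but some are of poor quality, so the program has to remove low-quality proxies from the pool promptly and keep adding good ones.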


Building the solution

   Create a console application named ProxyPool and use NuGet to add Entity Framework, HtmlAgilityPack, and Quartz.NET.

   Create an IP class

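    public class IP
    {
        public int Id { get; set; }
        public string ip { get; set; }
        public int port { get; set; }
        // Numeric id of the site the proxy was scraped from (Downloadproxy360 sets it to 3).
        public int From { get; set; }
    }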

 

 Create the database context class

 public  class ProxyEntity:DbContext
    {
        public DbSet<IP> IPS { get; set; }
    }

Edit the configuration file and add:
<connectionStrings>
        <add name="ProxyEntity" connectionString="server=.;database=Proxy;uid=sa;pwd=000000"  providerName="System.Data.SqlClient" />
    </connectionStrings>
Managing the proxy pool
Create a PoolManage class to manage the pool. It needs a database context object.
 private ProxyEntity Database;
        public PoolManage()
        {
            Database = new ProxyEntity();
        }

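Wrap the insertion of a proxy IP. First query the database to see whether the IP is already in the pool, and add it only if it is not, so that the data contains no duplicates.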

 public void Add(IP ip)
        {
            var ips = Database.IPS;
            if (ips.Where(i => i.ip == ip.ip).Count() == 0)
            {
                ips.Add(ip);
                Database.SaveChanges();
            }

        }

Wrap a page download method
        public string DownloadHtml(string url)
        {
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
                // Dispose the response as well as the stream so the connection is released.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream dataStream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(dataStream, Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
            catch
            {
                // Any network or protocol error yields an empty page.
                return "";
            }
        }

Scraping xicidaili
        public void Downloadxicidaili() // download and parse the xicidaili HTML page
        {
            string url = "http://www.xicidaili.com/";
            string html = DownloadHtml(url);

            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode node = doc.DocumentNode;
            string xpathstring = "//tr[@class='odd']";
            HtmlNodeCollection collection = node.SelectNodes(xpathstring);
            if (collection == null) return; // SelectNodes returns null when nothing matches
            foreach (var item in collection)
            {
                IP ip = new IP();
                string xpath = "td[2]";
                ip.ip = item.SelectSingleNode(xpath).InnerHtml;
                xpath = "td[3]";
                ip.port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                Add(ip);
            }
        }

 Scraping proxy360

        public void Downloadproxy360() // download and parse the proxy360 page
        {
            string url = "http://www.proxy360.cn/default.aspx";
            string html = DownloadHtml(url);
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            string xpathstring = "//div[@class='proxylistitem']";
            HtmlNode node = doc.DocumentNode;
            HtmlNodeCollection collection = node.SelectNodes(xpathstring);
            if (collection == null) return; // SelectNodes returns null when nothing matches
            foreach (var item in collection)
            {
                IP ip = new IP();
                var childnode = item.ChildNodes[1];
                xpathstring = "span[1]";
                ip.ip = childnode.SelectSingleNode(xpathstring).InnerHtml.Trim();
                xpathstring = "span[2]";
                ip.port = int.Parse(childnode.SelectSingleNode(xpathstring).InnerHtml);
                ip.From = 3;
                Add(ip);
            }
        }
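
Both scrapers pass the cell text straight to int.Parse, so one malformed row (an empty cell, an HTML entity, a header row) throws and aborts the whole run. A minimal sketch of a validation step that could sit in front of Add is shown here. ProxyText and TryParse are hypothetical names that do not appear in the original code, and the sketch assumes only IPv4 proxies are wanted.

```csharp
using System.Net;
using System.Net.Sockets;

// Hypothetical helper, not part of the original PoolManage class.
public static class ProxyText
{
    // Returns true only when ipText is a four-part IPv4 address
    // and portText is an integer in the range 1..65535.
    public static bool TryParse(string ipText, string portText, out string ip, out int port)
    {
        ip = null;
        port = 0;
        if (ipText == null || portText == null) return false;

        string candidate = ipText.Trim();
        // IPAddress.TryParse also accepts short forms such as "1",
        // so require exactly four dot-separated parts first.
        if (candidate.Split('.').Length != 4) return false;

        IPAddress address;
        if (!IPAddress.TryParse(candidate, out address)) return false;
        if (address.AddressFamily != AddressFamily.InterNetwork) return false;

        int parsedPort;
        if (!int.TryParse(portText.Trim(), out parsedPort)) return false;
        if (parsedPort < 1 || parsedPort > 65535) return false;

        ip = candidate;
        port = parsedPort;
        return true;
    }
}
```

Inside the foreach loops, a row would then be added only when ProxyText.TryParse returns true, and skipped otherwise.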

Adding proxies to the pool is not enough; poor-quality ones have to be pulled out promptly.
Checking a proxy
        public static bool IsAvailable(IP ip)
        {
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://www.baidu.com/");
                // Route the request through the proxy under test; without this line
                // the check only exercises the local connection.
                request.Proxy = new WebProxy(ip.ip, ip.port);
                request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream dataStream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(dataStream, Encoding.UTF8))
                {
                    // The proxy passes if the Baidu home page comes back.
                    return reader.ReadToEnd().Contains("百度");
                }
            }
            catch
            {
                return false;
            }
        }
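
A check like this only says something about a proxy if the request is actually routed through it, and a dead proxy can otherwise hold a thread for the default timeout (100 seconds for HttpWebRequest). The sketch below shows one way to build such a request with WebProxy and explicit timeouts. ProxyRequest and Create are hypothetical names, and the timeout value is an assumption to tune, not something the original specifies.

```csharp
using System.Net;

// Hypothetical helper, not part of the original code.
public static class ProxyRequest
{
    // Builds a request that goes through proxyHost:proxyPort and gives up
    // after timeoutMs milliseconds instead of the much longer default.
    public static HttpWebRequest Create(string url, string proxyHost, int proxyPort, int timeoutMs)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy(proxyHost, proxyPort);
        request.Timeout = timeoutMs;          // limit for GetResponse()
        request.ReadWriteTimeout = timeoutMs; // limit for reading the body
        return request;
    }
}
```

IsAvailable could then start from ProxyRequest.Create("https://www.baidu.com/", ip.ip, ip.port, 5000) and keep the rest of its logic unchanged.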
Iterate over every proxy in the pool and kick out the poor ones. For speed this is done with multiple threads, each responsible for checking 10 proxies. Create a class:
    public class TextThread
    {
        private ProxyEntity Database;
        private int PageNum;
        private int pageSize = 10;

        public TextThread(int _PageNum)
        {
            Database = new ProxyEntity();
            PageNum = _PageNum;
        }

        public void test()
        {
            // Load this thread's page of 10 proxies, ordered by Id.
            List<IP> Ips = Database.IPS.OrderBy(i => i.Id).Skip(pageSize * PageNum).Take(pageSize).ToList();
            foreach (var item in Ips)
            {
                if (!PoolManage.IsAvailable(item))
                {
                    Database.IPS.Remove(item);
                    Database.SaveChanges();
                    Console.WriteLine("Removed one");
                    continue; // keep checking the rest of this page
                }
                Console.WriteLine("Test passed");
            }
        }
    }
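
One thing to watch with per-thread Skip/Take queries: each thread runs its query at a slightly different moment, and rows deleted by one thread can shift the pages seen by another, so some proxies may be skipped in a given pass. A possible alternative, sketched here under that assumption, is to load the list once and split it into fixed batches in memory. Batching and Split are hypothetical names that are not in the original code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original code.
public static class Batching
{
    // Splits items into consecutive batches of at most `size` elements.
    public static List<List<T>> Split<T>(IList<T> items, int size)
    {
        if (items == null) throw new ArgumentNullException("items");
        if (size <= 0) throw new ArgumentOutOfRangeException("size");

        var batches = new List<List<T>>();
        for (int start = 0; start < items.Count; start += size)
        {
            int length = Math.Min(size, items.Count - start);
            var batch = new List<T>(length);
            for (int i = 0; i < length; i++)
            {
                batch.Add(items[start + i]);
            }
            batches.Add(batch);
        }
        return batches;
    }
}
```

Each batch could then be handed to one worker thread, which checks exactly the proxies it was given regardless of what the other workers delete.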

 Back in PoolManage

        public void TextAllIps()
        {
            var ips = Database.IPS;
            int pageSize = 10;
            // Cast before dividing: int / int truncates, which would skip the last partial page.
            double s = (double)ips.Count() / pageSize;
            int PageCount = (int)Math.Ceiling(s);

            // One background thread per page of 10 proxies.
            for (int i = 0; i < PageCount; i++)
            {
                TextThread test = new TextThread(i);
                Thread thread = new Thread(test.test);
                thread.IsBackground = true;
                thread.Start();
            }
        }
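
The page count here is a ceiling division. In C#, dividing one int by another truncates before Math.Ceiling ever sees the value, so 25 proxies at 10 per page would come out as 2 pages unless one operand is a double. A small sketch of an integer-only form that avoids the issue is shown below; Paging and PageCount are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper, not part of the original code.
public static class Paging
{
    // Number of pages needed to hold itemCount items at pageSize items per page.
    public static int PageCount(int itemCount, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException("pageSize");
        if (itemCount <= 0) return 0;
        // Adding pageSize - 1 before dividing rounds any partial page up.
        return (itemCount + pageSize - 1) / pageSize;
    }
}
```

With this helper, 20 items at 10 per page give 2 pages, and 25 items give 3.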

 

Running this job means scraping proxies and then checking them, so wrap those steps together
        public void Initial()
        {
            Downloadxicidaili();
            Downloadproxy360();
            TextAllIps();
        }
To test it, add the following to the main method
            PoolManage manager = new PoolManage();
            manager.Initial();
            Console.ReadLine();

 

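 Many proxies will die over time and go from good to poor, so the program cannot simply stop; it has to keep checking and keep scraping. Quartz is used here to run the job on a schedule.

First create the job class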
 class TotalJob:IJob
    {
        public void Execute(IJobExecutionContext context)
        {
            PoolManage manager = new PoolManage();
            manager.Initial();
        }
    }
Then make it run on a schedule
 private static void Run()
        {
            try
            {
                StdSchedulerFactory factory = new StdSchedulerFactory();
                IScheduler scheduler = factory.GetScheduler();
                scheduler.Start();
                IJobDetail job = JobBuilder.Create<TotalJob>().WithIdentity("job1", "group1").Build();
                ITrigger trigger = TriggerBuilder.Create()
                 .WithIdentity("trigger1", "group1")
                 .StartNow()
                 .WithSimpleSchedule(
                 x => x
                .WithIntervalInMinutes(5) // run every 5 minutes
                 .RepeatForever()
                ).Build();
                scheduler.ScheduleJob(job, trigger);
                //Thread.Sleep(TimeSpan.FromSeconds(60));
                //  scheduler.Shutdown();
            }
            catch (SchedulerException se)
            {
                Console.WriteLine(se);
            }
        }

 Add this code to the main function

            Run();
            Console.WriteLine("Press any key to close the application");

 


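The proxies are now being managed, but the pool still has to be made available to callers. Wrap a method that returns a random proxy from the pool
    public class Pool
    {
        // One shared generator; on .NET Framework a new Random() per call
        // can repeat values when calls are close together.
        private static readonly Random random = new Random();

        public static IP GetIP()
        {
            using (ProxyEntity Database = new ProxyEntity())
            {
                List<IP> list = Database.IPS.ToList();
                if (list.Count == 0)
                {
                    return null; // the pool is empty
                }
                lock (random) // Random is not thread-safe
                {
                    return list[random.Next(list.Count)];
                }
            }
        }
    }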
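
Two details matter when picking a random element: an empty list has no valid index, and System.Random is not safe to share across threads without a lock. The sketch below isolates that selection logic from the database so it can be tried on its own. RandomPick and PickOrDefault are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original code.
public static class RandomPick
{
    private static readonly Random Rng = new Random();
    private static readonly object Gate = new object();

    // Returns a uniformly chosen element, or default(T) when the list is null or empty.
    public static T PickOrDefault<T>(IList<T> items)
    {
        if (items == null || items.Count == 0) return default(T);

        int index;
        lock (Gate) // Random is not thread-safe
        {
            index = Rng.Next(items.Count); // 0 <= index < items.Count
        }
        return items[index];
    }
}
```

A caller has to be ready for the empty case: for a reference type such as IP, the result is null when the pool holds no proxies.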

 

 Now test the quality of the proxy pool

        public static void Test()
        {
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://www.baidu.com/");
                // Send the request through a proxy taken from the pool.
                IP ip = Pool.GetIP();
                request.Proxy = new WebProxy(ip.ip, ip.port);
                request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream dataStream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(dataStream, Encoding.UTF8))
                {
                    Console.WriteLine("Request succeeded. The response begins with:");
                    Console.WriteLine(reader.ReadToEnd().Substring(0, 25));
                    Console.WriteLine("...");
                }
            }
            catch
            {
                Console.WriteLine("Failed");
            }
        }

 

 Test result
The quality is decent.

 

 

 

 


 

 

 



Original article: http://www.cnblogs.com/zuin/p/6217421.html
