Crawling Book Information from Dangdang: The Final Part

Date: 2016-11-27 16:58:02

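Dangdang lists a huge number of books, and crawling all of them would be a lot of work, so this crawler covers only one category of books.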

In the Main() method:

First, the user enters the seed URL:

 string starturl = Console.ReadLine();

Create the database context object:

   BookStoreEntities storeDB = new BookStoreEntities();

Get the URLs of the book categories:

 

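    // Download the seed page and extract the category URLs from it.
    string html = Tool.GetHtml(starturl);
    ArrayList list = Tool.GetList(html);
    foreach (var item in list)
    {
        // Store each category URL as one BookClass row.
        BookClass bookclass = new BookClass();
        bookclass.Url = item.ToString();
        storeDB.BookClass.Add(bookclass);
    }
    // Write all the new rows to the database in one call.
    storeDB.SaveChanges();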
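
Tool.GetList is not shown in this part of the series, so its real implementation is unknown here. As a rough stand-in only, the sketch below assumes it pulls category links out of the page with a regular expression. The class name CategoryLinks, the method Extract, and the filter on "category.dangdang.com/cp" are all assumptions made for illustration, not the author's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryLinks
{
    // Matches href="..." or href='...' and captures the target in the "url" group.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical stand-in for Tool.GetList: returns the distinct category-page
    // links found in the HTML, in order of first appearance.
    public static List<string> Extract(string html)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            string url = m.Groups["url"].Value;
            // Assumption: category listing pages live under category.dangdang.com/cp...
            if (url.Contains("category.dangdang.com/cp") && seen.Add(url))
            {
                result.Add(url);
            }
        }
        return result;
    }
}
```

A regular expression is fragile against unusual markup, so a real crawler might prefer an HTML parser; the sketch only shows the shape of the step.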

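Crawling the book information with multiple threads

  Each book category gets its own thread, which crawls the books in that category.

Wrap the work in a process class: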

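    public class process
    {
        // Each process object owns its own context, so no context is shared between threads.
        BookStoreEntities storeDB = new BookStoreEntities();

        // The category this object is responsible for crawling.
        public BookClass BookClass;

        public process(int BookClassId)
        {
            BookClass = storeDB.BookClass.Find(BookClassId);
        }
    }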

Next, implement the crawling of book information inside this class:

    public void threads()
    {
    }

Implementing paging

The URLs of a category's listing pages follow a regular pattern:

http://category.dangdang.com/cp01.54.06.00.00.00.html
http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html
http://category.dangdang.com/pg3-cp01.54.06.00.00.00.html

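Split the first page's URL into two parts: the front part, http://category.dangdang.com/, and the rear part, cp01.54.06.00.00.00.html.

Pages 2 through 100 all have the form: front part + "pg" + page number + "-" + rear part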

    // Split the first-page URL once, just before "cp":
    //   front: http://category.dangdang.com/
    //   rear:  cp01.54.06.00.00.00.html
    string firstUrl = BookClass.Url;
    int p1 = firstUrl.IndexOf("cp");
    if (p1 <= 0)
    {
        // Not a category listing URL (e.g. http://book.dangdang.com/01.54.htm?ref=book-01-A),
        // so there is nothing to page through.
        return;
    }
    string front = firstUrl.Substring(0, p1);
    string rear = firstUrl.Substring(p1);

    for (int i = 1; i <= BookClass.Pages; i++)
    {
        // Page 1 is the category URL itself; later pages insert "pg<N>-" before "cp",
        // e.g. http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html
        string url = (i == 1) ? firstUrl : front + "pg" + i + "-" + rear;
    }
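
The paging rule can also be checked on its own, outside the crawler. The following is a minimal sketch under stated assumptions: the class PageUrls and the method Build are hypothetical names that do not appear in the original code, and, unlike the loop above, the sketch throws an exception instead of returning silently when a later page is requested for a URL that contains no "cp".

```csharp
using System;

public static class PageUrls
{
    // Builds the URL of the given 1-based listing page from the first page's URL.
    public static string Build(string firstPageUrl, int page)
    {
        if (page < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(page));
        }
        if (page == 1)
        {
            // The first page is the category URL itself.
            return firstPageUrl;
        }

        // Split just before "cp": front part + "pg" + page number + "-" + rear part.
        int split = firstPageUrl.IndexOf("cp", StringComparison.Ordinal);
        if (split <= 0)
        {
            throw new ArgumentException("URL does not contain a \"cp\" segment.", nameof(firstPageUrl));
        }
        string front = firstPageUrl.Substring(0, split);
        string rear = firstPageUrl.Substring(split);
        return front + "pg" + page + "-" + rear;
    }
}
```

For example, PageUrls.Build("http://category.dangdang.com/cp01.54.06.00.00.00.html", 2) gives http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html, which matches the second URL listed above.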

Continue filling in this method:

    public void threads()
    {
        // Split the first-page URL once, just before "cp".
        string firstUrl = BookClass.Url;
        int p1 = firstUrl.IndexOf("cp");
        if (p1 <= 0)
        {
            // Not a category listing URL (e.g. http://book.dangdang.com/01.54.htm?ref=book-01-A).
            return;
        }
        string front = firstUrl.Substring(0, p1);
        string rear = firstUrl.Substring(p1);

        for (int i = 1; i <= BookClass.Pages; i++)
        {
            // Page 1 is the category URL itself; page N is front + "pg" + N + "-" + rear,
            // e.g. http://category.dangdang.com/pg100-cp01.54.06.00.00.00.html
            string url = (i == 1) ? firstUrl : front + "pg" + i + "-" + rear;

            // Download the listing page and collect the product-page URLs on it.
            string listingHtml = Tool.GetHtml(url);
            ArrayList products = Tool.GetProduct(listingHtml);
            foreach (var item in products)
            {
                Console.WriteLine(item.ToString());

                // Download the product page and parse it into numbered fields.
                string html = Tool.GetHtml(item.ToString());
                Dictionary<int, string> dict = Tool.analysis(html);
                Book book = new Book
                {
                    BookName = dict[1],
                    Price = Convert.ToDecimal(dict[2]),
                    AuthorName = dict[3],
                    Publisher = dict[4],
                    PictureUrl = dict[5],
                    BookContent = dict[6]
                };

                // Attach the book to this category and save it right away.
                BookClass.Books.Add(book);
                storeDB.SaveChanges();
            }
        }
    }
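
One fragile spot in the method above is Convert.ToDecimal(dict[2]), which throws a FormatException if the scraped price text is not a plain number, for instance when it still carries a currency sign. What Tool.analysis actually returns for the price is not shown in this post, so the following is only a sketch of a more tolerant parse; PriceParser and TryParse are hypothetical names, and the stripped yuan signs are an assumption about the page content.

```csharp
using System.Globalization;

public static class PriceParser
{
    // Tries to turn scraped price text such as "23.50" into a decimal.
    // Returns false instead of throwing when the text is not a number.
    public static bool TryParse(string raw, out decimal price)
    {
        price = 0m;
        if (string.IsNullOrWhiteSpace(raw))
        {
            return false;
        }

        // Assumption: the price may be prefixed with a half-width (U+00A5)
        // or full-width (U+FFE5) yuan sign, which is stripped before parsing.
        string cleaned = raw.Trim().TrimStart('\u00A5', '\uFFE5').Trim();

        // Invariant culture: "." is always the decimal separator.
        return decimal.TryParse(cleaned, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out price);
    }
}
```

With a helper like this, the loop could skip a book whose price fails to parse rather than letting one bad page end the whole thread.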

Back in the Main function:

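    var items = storeDB.BookClass;

    foreach (var bookclass in items)
    {
        // One process object, and therefore one database context, per thread.
        process p = new process(bookclass.BookClassId);
        Thread th = new Thread(p.threads);
        th.IsBackground = true;
        th.Start();
        // Stagger the thread starts by one second.
        Thread.Sleep(1000);
    }
    storeDB.SaveChanges();
    // Background threads are stopped when Main returns, so block here until the user presses Enter.
    Console.ReadLine();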
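
Because the worker threads are background threads, the program above only stays alive while Console.ReadLine() is waiting. An alternative, sketched here and not part of the original code, is to keep a reference to every thread and call Thread.Join on each, so that Main returns exactly when all categories are done. CategoryRunner and RunAll are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class CategoryRunner
{
    // Starts one thread per item, then blocks until every thread has finished.
    public static void RunAll<T>(IEnumerable<T> items, Action<T> work)
    {
        var threads = new List<Thread>();
        foreach (T item in items)
        {
            T captured = item; // local copy so each thread sees its own item
            var th = new Thread(() => work(captured));
            th.IsBackground = true;
            th.Start();
            threads.Add(th);
        }

        // Wait for all workers instead of waiting for a key press.
        foreach (Thread worker in threads)
        {
            worker.Join();
        }
    }
}
```

In this crawler the call might look like CategoryRunner.RunAll(ids, id => new process(id).threads()), where ids is the list of BookClassId values read from the database beforehand.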


Original article: http://www.cnblogs.com/zuin/p/6106468.html
