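The previous post analyzed AliExpress product collection: which product fields to collect and how to extract them. This post puts that analysis into practice. The techniques were covered last time, so let's go straight to the project.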
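I. Create the solution and write the collection code
1. Create a solution named "CollectorSolution" and add an empty ASP.NET MVC project named "Collector" to it. The solution structure is shown below:
2. In the "Collector" project, add a "CollectingController" controller and its view, then change the default route from Home -> Index to Collecting -> Index, as in the screenshot below:
Change RouteConfig to the following: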
using System.Web.Mvc;
using System.Web.Routing;

namespace Collector
{
    public class RouteConfig
    {
        public static void RegisterRoutes(RouteCollection routes)
        {
            routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

            routes.MapRoute(
                name: "Default",
                url: "{controller}/{action}/{id}",
                defaults: new { controller = "Collecting", action = "Index", id = UrlParameter.Optional }
            );
        }
    }
}
3. Add three view models, "CollectionViewModel", "CollectedProductViewModel", and "CollectedProductImageViewModel", plus a struct that holds the regular expressions, "ParseProductPatterns". The code for each follows.
1.> CollectionViewModel
using System.Collections.Generic;

namespace Collector.Models
{
    public class CollectionViewModel
    {
        public CollectionViewModel()
        {
            ProductViews = new List<CollectedProductViewModel>();
        }
        public string CollectionUrl { get; set; }
        public IEnumerable<CollectedProductViewModel> ProductViews { get; set; }
    }
}
2.> CollectedProductViewModel
using System.Collections.Generic;

namespace Collector.Models
{
    public class CollectedProductViewModel
    {
        public CollectedProductViewModel()
        {
            ProductImages = new List<CollectedProductImageViewModel>();
        }
        public string ProductName { get; set; }
        public decimal ProductPrice { get; set; }
        public decimal ProductDiscountPrice { get; set; }
        public string ProductCurrency { get; set; }
        public string ProductColor { get; set; }
        public string ProductSize { get; set; }
        public IEnumerable<CollectedProductImageViewModel> ProductImages { get; set; }
    }
}
3.> CollectedProductImageViewModel
namespace Collector.Models
{
    public class CollectedProductImageViewModel
    {
        public string ImageUrl { get; set; }
        public int Sort { get; set; }
    }
}
4.> ParseProductPatterns
namespace Collector.Models
{
    public struct ParseProductPatterns
    {
        // Each pattern matches only the text between a lookbehind marker and a lookahead marker.
        public static string ProductNamePattern = "(?<=<h1 class=\"product-name\" itemprop=\"name\">).*?(?=</h1>)";
        public static string ProductJsnPattern = @"(?<=var skuProducts=).*?(?=;\s*var skuAttrIds=)";
        // Dots are escaped so they match a literal '.' in the script text.
        public static string ProductImageJsonPattern = "(?<=window\\.runParams\\.imageBigViewURL=).*?(?=;)";
        public static string ProductCurrencyPattern = "(?<=window\\.runParams\\.currencyCode=\").*?(?=\";)";
        // {0} is filled in with a SKU property id via string.Format before matching.
        public static string ProductColorPattern =
            "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-1-{0}\" title=\").*?(?=\")";
        public static string ProductSizePattern =
            "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-2-{0}\" href=\"javascript:void\\(0\\)\"\\s+><span>).*?(?=</)";
    }
}
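These are mostly self-explanatory, so I won't go through them one by one. In short, each pattern wraps a lazy match between a lookbehind and a lookahead, so the match value is exactly the text between the two markers, and the {0} placeholders in the color and size patterns are replaced with a SKU property id before matching.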
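To make the lookbehind/lookahead idea concrete, here is a minimal sketch that runs two patterns of the same shape against a made-up page fragment. The class name PatternDemo, the helper Extract, and the SampleHtml fragment are all assumptions for illustration; they are not part of the project, and the real AliExpress markup may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Made-up fragment (assumption) that mimics the markup the patterns expect.
    public const string SampleHtml =
        "<h1 class=\"product-name\" itemprop=\"name\">Yoga Top</h1>\n" +
        "<script>window.runParams.currencyCode=\"USD\";</script>";

    // Same shape as ProductNamePattern and ProductCurrencyPattern above.
    public const string NamePattern =
        "(?<=<h1 class=\"product-name\" itemprop=\"name\">).*?(?=</h1>)";
    public const string CurrencyPattern =
        "(?<=window\\.runParams\\.currencyCode=\").*?(?=\";)";

    // Hypothetical helper: returns the text between the two markers,
    // or null when the markers are not found in the input.
    public static string Extract(string pattern, string input)
    {
        var match = Regex.Match(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Value : null;
    }
}
```

With this fragment, Extract(NamePattern, SampleHtml) yields "Yoga Top" and Extract(CurrencyPattern, SampleHtml) yields "USD". Because the markers sit inside lookarounds, they are checked but never included in the match value, which is why no capture group is needed.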
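4. The view layout is very simple, as shown below:
The collection URL is an AliExpress product page URL; store and category URLs are not supported. The table displays the collected product information.
5. The controller and view code are as follows: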
1.> CollectingController
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Web.Mvc;
using Collector.Models;
using Newtonsoft.Json.Linq;
using RestSharp;

namespace Collector.Controllers
{
    public class CollectingController : Controller
    {
        // GET: Collecting
        public ActionResult Index()
        {
            return View();
        }

        [HttpPost]
        public ActionResult Index(CollectionViewModel collectionView)
        {
            collectionView = CollectWithParse(collectionView);
            return View(collectionView);
        }

        // [NonAction] keeps this public helper from being routable as an action.
        [NonAction]
        public CollectionViewModel CollectWithParse(CollectionViewModel collectionView)
        {
            if (collectionView == null || string.IsNullOrEmpty(collectionView.CollectionUrl))
            {
                return collectionView;
            }
            // The URL comes from a textarea, so strip stray whitespace and line breaks.
            var client = new RestClient(collectionView.CollectionUrl.Trim());
            var request = new RestRequest(Method.GET);
            var response = client.Execute(request);
            var htmlContent = response.Content ?? string.Empty;
            collectionView.ProductViews = ParseProducts(htmlContent);
            return collectionView;
        }

        [NonAction]
        public IEnumerable<CollectedProductViewModel> ParseProducts(string productHtmlContent)
        {
            var productName = RegexMatchValue(ParseProductPatterns.ProductNamePattern, productHtmlContent);
            var productCurrency = RegexMatchValue(ParseProductPatterns.ProductCurrencyPattern, productHtmlContent);

            var productJson = RegexMatchValue(ParseProductPatterns.ProductJsnPattern, productHtmlContent);
            // A failed match yields an empty string, which JArray.Parse would reject.
            if (string.IsNullOrEmpty(productJson))
            {
                return new List<CollectedProductViewModel>();
            }

            var productJsonArray = JArray.Parse(productJson);
            var products =
                productJsonArray.Select(pja =>
                {
                    var colorWithSizeCode = pja["skuPropIds"].ToString().Split(',');
                    var priceJson = pja["skuVal"];
                    return new
                    {
                        ColorCode = colorWithSizeCode.First(),
                        SizeCode = colorWithSizeCode.Last(),
                        // Value<decimal?> converts independently of the server culture
                        // and returns null when the field is missing.
                        Price = priceJson.Value<decimal?>("skuPrice") ?? 0m,
                        DiscountPrice = priceJson.Value<decimal?>("actSkuPrice") ?? 0m
                    };
                }).ToList();

            var collectedImages = ParseProductImages(productHtmlContent);

            var collectedProducts = products.Select(p => new CollectedProductViewModel
            {
                ProductName = productName,
                ProductPrice = p.Price,
                ProductDiscountPrice = p.DiscountPrice,
                ProductCurrency = productCurrency,
                ProductColor = SetProductColorWithSize(ParseProductPatterns.ProductColorPattern, p.ColorCode, productHtmlContent),
                ProductSize = SetProductColorWithSize(ParseProductPatterns.ProductSizePattern, p.SizeCode, productHtmlContent),
                ProductImages = collectedImages
            }).ToList();
            return collectedProducts;
        }

        private IEnumerable<CollectedProductImageViewModel> ParseProductImages(string productHtmlContent)
        {
            var imagesJson = RegexMatchValue(ParseProductPatterns.ProductImageJsonPattern, productHtmlContent);
            if (string.IsNullOrEmpty(imagesJson))
            {
                return new List<CollectedProductImageViewModel>();
            }
            var imageJsonArray = JArray.Parse(imagesJson);

            var images = imageJsonArray.ToObject<List<string>>();
            // Materialize once; every product row shares this list.
            return images.Select((t, i) => new CollectedProductImageViewModel
            {
                ImageUrl = t,
                Sort = i
            }).ToList();
        }

        // Fills the {0} placeholders with the SKU property id, then matches.
        private string SetProductColorWithSize(string pattern, string colorWithSizeCode, string input)
        {
            var newPattern = string.Format(pattern, colorWithSizeCode);
            return RegexMatchValue(newPattern, input);
        }

        // Returns the matched text, or an empty string when nothing matches.
        private string RegexMatchValue(string pattern, string input, RegexOptions regexOptions = RegexOptions.IgnoreCase | RegexOptions.Singleline)
        {
            var regex = new Regex(pattern, regexOptions);
            var match = regex.Match(input);
            return match.Value;
        }
    }
}
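The color and size lookup in SetProductColorWithSize works by substituting a SKU property id into a pattern template before matching. The following is a minimal sketch of that step in isolation. The class SkuLookupDemo, the method FindColor, the sample anchors, and the ids 193 and 175 are assumptions made up for illustration; Regex.Escape is an extra safeguard that the controller itself does not apply.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkuLookupDemo
{
    // Made-up SKU anchors (assumption) in the shape ProductColorPattern expects.
    public const string SampleHtml =
        "<a data-role=\"sku\" data-sku-id=\"193\" id=\"sku-1-193\" title=\"Black\" href=\"#\"></a>" +
        "<a data-role=\"sku\" data-sku-id=\"175\" id=\"sku-1-175\" title=\"Green\" href=\"#\"></a>";

    // Same template as ProductColorPattern; {0} stands for the SKU property id.
    public const string ColorPattern =
        "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-1-{0}\" title=\").*?(?=\")";

    // Hypothetical helper: returns the title of the anchor with the given id,
    // or an empty string when no such anchor exists.
    public static string FindColor(string skuId, string html)
    {
        // Escaping guards against ids that contain regex metacharacters.
        var pattern = string.Format(ColorPattern, Regex.Escape(skuId));
        var match = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Here FindColor("175", SampleHtml) gives "Green", FindColor("193", SampleHtml) gives "Black", and an id that is not on the page gives an empty string. The same approach applies to the size template; only the id prefix and the trailing markup in the lookbehind differ.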
2.> Collecting->Index
@model Collector.Models.CollectionViewModel
<!DOCTYPE html>

<html>
<head>
    <meta name="viewport" content="width=device-width" />
    <title></title>
    <!-- CSS goes in the document HEAD or added to your external stylesheet -->
    <style type="text/css">
        table.gridtable {
            font-family: verdana,arial,sans-serif;
            font-size: 11px;
            color: #333333;
            border-width: 1px;
            border-color: #666666;
            border-collapse: collapse;
        }

        table.gridtable th {
            border-width: 1px;
            padding: 8px;
            border-style: solid;
            border-color: #666666;
            background-color: #dedede;
        }

        table.gridtable td {
            border-width: 1px;
            padding: 8px;
            border-style: solid;
            border-color: #666666;
            background-color: #ffffff;
        }
    </style>
</head>
<body>
    <div>
        @using (Html.BeginForm("Index", "Collecting", FormMethod.Post))
        {
            <table>
                <tr>
                    <td>Collection URL:</td>
                    <td>
                        @Html.TextAreaFor(m => m.CollectionUrl, 4, 0, new { style = "width:1500px;" })
                    </td>

                </tr>
                <tr><td colspan="2" style="text-align: right;"><input type="submit" value="Start collecting" /></td></tr>
            </table>
        }
    </div>
    <div>
        <table class="gridtable">
            <thead>
                <tr>
                    <th width="5%">No.</th>
                    <th width="5%">Image</th>
                    <th width="30%">Product name</th>

                    <th width="10%">Unit price</th>
                    <th width="10%">Reference price</th>
                    <th width="10%">Currency</th>
                    <th width="10%">Color</th>
                    <th width="10%">Size</th>
                </tr>
            </thead>
            <tbody>
                @{
                    var i = 0;
                }
                @* Skip the rows instead of returning early, so the closing tags are still rendered. *@
                @if (Model != null && Model.ProductViews != null)
                {
                    foreach (var collectedProduct in Model.ProductViews)
                    {
                        i++;
                        var firstImage = collectedProduct.ProductImages.FirstOrDefault();
                        <tr>
                            <td align="center">@i</td>
                            <td><img src="@(firstImage == null ? "" : firstImage.ImageUrl)" width="60" height="60" /></td>
                            <td>@collectedProduct.ProductName</td>
                            <td>@collectedProduct.ProductDiscountPrice</td>
                            <td>@collectedProduct.ProductPrice</td>
                            <td>@collectedProduct.ProductCurrency</td>
                            <td>@collectedProduct.ProductColor</td>
                            <td>@collectedProduct.ProductSize</td>
                        </tr>
                    }
                }

            </tbody>

        </table>
    </div>
</body>
</html>
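Note that this post shows only the tip of the iceberg of collection, so I kept things simple and did not encapsulate anything strictly, on either the front end or the back end. Please bear that in mind. Also, I am not fond of writing comments in code; to my mind, the code is the comment.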
II. Test results: deploy the MVC project to IIS on port 1005 and see how it runs.
1. Test with the AliExpress product URL from the previous post:
http://www.aliexpress.com/store/product/Yoga-Tops-Women-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirt-Camisetas-Deporte-Mujer-Gym/1025110_32620359354.html?spm=a2g01.8032156.template-section-container.27.wcM8ES&sdom=3514.555719.493653.0_32620359354
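The result is shown in the screenshot below:
When I ran the collection just now, I found that AliExpress no longer discounts the product from the previous post, so there is no discount price.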
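When the page carries no actSkuPrice, the controller stores 0 as the discount price, and the view shows that 0 in the unit price column. One possible way to handle this is to fall back to skuPrice. The sketch below illustrates the idea under some assumptions: it uses System.Text.Json (available in .NET Core 3.0 and later) instead of the Newtonsoft.Json used in the project, only to stay dependency-free, and the names PriceDemo, ReadPrice, and EffectivePrice are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

public static class PriceDemo
{
    // Hypothetical helper: reads one price field from a skuVal object.
    // Returns 0 when the field is missing or cannot be parsed.
    public static decimal ReadPrice(JsonElement skuVal, string field)
    {
        JsonElement value;
        if (!skuVal.TryGetProperty(field, out value))
        {
            return 0m;
        }
        // Accept both "12.50" (string) and 12.50 (number) forms.
        var raw = value.ValueKind == JsonValueKind.String
            ? value.GetString()
            : value.GetRawText();
        decimal price;
        return decimal.TryParse(raw, NumberStyles.Number,
            CultureInfo.InvariantCulture, out price) ? price : 0m;
    }

    // Uses the discount price when present, otherwise the regular price.
    public static decimal EffectivePrice(string skuValJson)
    {
        using (var doc = JsonDocument.Parse(skuValJson))
        {
            var discount = ReadPrice(doc.RootElement, "actSkuPrice");
            return discount > 0m ? discount : ReadPrice(doc.RootElement, "skuPrice");
        }
    }
}
```

For a skuVal of {"skuPrice":"12.50"}, EffectivePrice returns 12.50; when "actSkuPrice":"9.99" is also present it returns 9.99. Whether a fallback like this is the right display rule is a design choice, and the original project simply shows 0.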
2. Collect another URL:
http://www.aliexpress.com/store/product/LEVEL-4-shock-Professional-running-intensive-training-without-rims-snow-sports-bra-open-front-zipper-style/1025110_32357688343.html?spm=2114.12010108.1000013.1.uvJqBj
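The screenshot is below:
This product has so many variants that a single page cannot show them all.
Source code: https://github.com/haibozhou1011/Collector
Summary:
That concludes the AliExpress product collection series. Overall, the techniques involved in collection are ones everyone uses regularly; the real work is the up-front analysis to figure out the rules for extracting product information. Every site has its own patterns, and if you look closely you will find clues that lead to a breakthrough. If you have a better collection method, please share it with everyone.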
Original article: http://www.cnblogs.com/davidzhou/p/5479958.html