
gxmatmars / Crawling website content with wget and processing the data with Ruby

Date: 2018-01-07 20:08:38


pachong.rb
 
# Base URL of the character pages; the numeric ID is appended to it.
URL = 'bangumi.tv/character/'

# IDs already handled in an earlier run (found in download, fail, or error).
READY = []
Dir.glob('download/*').each do |f|
  if f =~ /download\/(\d+)/
    READY << $1.to_i
  end
end

Dir.glob('fail/*').each do |f|
  if f =~ /fail\/(\d+)/
    READY << $1.to_i
  end
end

Dir.glob('error/*').each do |f|
  if f =~ /error\/(\d+)/
    READY << $1.to_i
  end
end

READY.uniq!
 
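# Fetch the page for ID i with wget, log its title, and download the cover image.
# Returns the log text for this ID ('' if wget left no file behind).
def download(i)
  log = ''
  fn = i.to_s
  system "wget #{URL}#{fn}"

  lines = []

  if !FileTest.exist?(fn)
    return ''
  end

  File.open(fn, 'r') do |f|
    lines = f.readlines
  end

  find = false
  lines.each do |l|
    # Split the page title on '|' into name and description.
    if l =~ /<title>(.+)<\/title>/
      name, description = $1.split('|').collect { |e| e.strip }
      log << "#{i}: #{name}, #{description}\n"
    end
    # Cover link: prepend the scheme and drop the query string before fetching.
    if l =~ /href="(.+)" class="cover thickbox"/
      url = 'http:' + $1
      url.slice!(/\?.+$/)
      log << url + "\n"
      system "wget #{url}"
      system "rm #{fn}"
      find = true
      break
    end
  end

  # No cover link found: keep the page in fail\ so this ID is skipped next time.
  if !find
    system "mv #{fn} fail\\"
    log << "\n"
  end

  return log
end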
 
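# Usage: ruby pachong.rb <first_id> <count>
i = ARGV[0].to_i
n = ARGV[1].to_i

log = ''

n.times do
  log << download(i) if !READY.include?(i)
  i += 1
end

# Move the downloaded covers into download\ and append this run's log.
system "mv *.jpg download\\"

File.open('pachong.txt', 'a') do |f|
  f << log
end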
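
The two regular expressions in download() can be tried without touching the network. The following is only a sketch: parse_character_page is a hypothetical helper that is not part of pachong.rb, and the sample markup is made up to match the shape the regexes expect.

```ruby
# Hypothetical helper (not in pachong.rb): extracts the same fields that
# download() looks for, from HTML passed in as a string.
def parse_character_page(html)
  result = { name: nil, description: nil, cover_url: nil }
  html.each_line do |line|
    if line =~ /<title>(.+)<\/title>/
      name, description = $1.split('|').map(&:strip)
      result[:name] = name
      result[:description] = description
    end
    if line =~ /href="(.+)" class="cover thickbox"/
      # Prepend the scheme and drop the query string, as the script does.
      result[:cover_url] = ('http:' + $1).sub(/\?.+$/, '')
      break
    end
  end
  result
end

# Made-up sample markup; real pages may differ.
sample = <<~HTML
  <title>Example Name | Bangumi</title>
  <a href="//example.com/pic/cover/1.jpg?r=123" class="cover thickbox">
HTML

info = parse_character_page(sample)
puts info[:name]
puts info[:cover_url]
```

Keeping the parsing separate from the wget calls makes it possible to check the patterns against a saved page before starting a long crawl.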
readme.md
 

before running

  1. Install wget and ruby.
  2. Create the folders download and fail.
  3. Modify forloop.bat:
    • line 5: set (start, step, end), e.g. step = 50 and end = start + 1000 (20 threads, i.e. 20 parallel cmd windows).
    • line 7: the second parameter passed to pachong.rb should be >= step.
  4. Run forloop.bat.
  5. When most of the pictures have been downloaded, run ruby run.rb 50.

tips

  1. The script may miss some pictures. Just run it again; pictures already in the folders are skipped.
  2. If a cmd window gets stuck, press Enter to skip the current wget command.
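
Re-running is safe because the scripts skip every ID that already has a file in download, fail, or error. A minimal sketch of that check follows; processed_ids and pending_ids are hypothetical names, and the path list is hard-coded here instead of being read from disk.

```ruby
require 'set'

# Hypothetical helper: pull the leading numeric ID out of paths such as
# "download/30001.jpg" or "fail/30003".
def processed_ids(paths, folders = %w[download fail error])
  pattern = %r{\A(?:#{folders.join('|')})/(\d+)}
  paths.each_with_object(Set.new) do |path, ids|
    ids << $1.to_i if path =~ pattern
  end
end

# Hypothetical helper: IDs in [start, start + count) that still need fetching.
def pending_ids(start, count, done)
  (start...(start + count)).reject { |id| done.include?(id) }
end

# Made-up paths standing in for the folder contents.
paths = ['download/30001.jpg', 'fail/30003', 'error/30004', 'notes.txt']
done = processed_ids(paths)
p pending_ids(30001, 5, done)   # prints [30002, 30005]
```

In the scripts the path list would come from something like Dir.glob('{download,fail,error}/*'). A Set keeps each lookup cheap when the ID range is large.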
forloop.bat
 
@echo off
mkdir download
mkdir fail
mkdir error
for /l %%i in (30001,500,40000) do (
    @ping 127.0.0.1 -n 1 >nul
    start /min cmd /c ruby pachong.rb %%i 500
)
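
To see which commands the for /l loop above starts, the same (start, step, end) arithmetic can be written in Ruby. This is only an illustration: launch_plan is a hypothetical helper, and the three numbers are copied from the batch file.

```ruby
# Hypothetical helper: one [start, count] pair per process the loop launches.
def launch_plan(first_id, step, last_id)
  first_id.step(last_id, step).map { |start| [start, step] }
end

# Values copied from forloop.bat: for /l %%i in (30001,500,40000)
plan = launch_plan(30_001, 500, 40_000)

puts "#{plan.size} processes"
plan.first(3).each do |start, count|
  puts "ruby pachong.rb #{start} #{count}"
end
```

With these values the plan has 20 entries, the last one starting at 39501, so the blocks of 500 cover 30001 through 40000 without overlap.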
run.rb
 
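# Move leftover files whose names start with a digit into error\,
# then move any remaining .jpg files into download\.
Dir.glob('*').each do |f|
  if f =~ /^\d+/
    system "mv #{f} error\\"
  end
end
system "mv *.jpg download\\"

# Only gaps longer than Limit IDs are retried; defaults to 50.
Limit = ARGV[0] ? ARGV[0].to_i : 50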
 
READY = []
Dir.glob('download/*').each do |f|
  if f =~ /download\/(\d+)/
    READY << $1.to_i
  end
end

Dir.glob('fail/*').each do |f|
  if f =~ /fail\/(\d+)/
    READY << $1.to_i
  end
end

Dir.glob('error/*').each do |f|
  if f =~ /error\/(\d+)/
    READY << $1.to_i
  end
end
 
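r = READY.sort
show = true   # true while scanning IDs that are already handled
j = 0         # first ID of the current gap

start = []    # first missing ID of each gap
step = []     # length of each gap

for i in 20001..40000
  if show
    if !r.include?(i)
      start << i
      show = !show
      j = i
    end
  else
    if r.include?(i)
      step << i - j
      print "#{j} -> #{i} : #{i-j}\n"
      show = !show
    end
  end
end

# Close a gap that runs to the end of the range so start and step stay aligned.
if !show
  step << 40001 - j
  print "#{j} -> 40001 : #{40001 - j}\n"
end

print "total: #{step.sum}\n"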
 
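n = 0
i = 0
while start[i]
  if step[i] > Limit
    # Cap each launch at 2 * Limit IDs; queue the remainder as a new gap.
    if step[i] > 2 * Limit
      start << start[i] + 2 * Limit
      step << step[i] - 2 * Limit
      step[i] = 2 * Limit
    end
    # As written, the launch begins one ID after the recorded gap start.
    start[i] += 1
    print "#{start[i]} + #{step[i]}\n"
    system "start /min cmd /c ruby pachong.rb #{start[i]} #{step[i]}"
    sleep(1)
    n += 1
    break if n > 20   # stop after 21 launches
  end
  i += 1
end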
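
The gap search in run.rb can also be written with Enumerable#slice_when. The sketch below is not the original logic: missing_runs and split_runs are hypothetical helpers, the sample IDs are made up, and the Limit filter, the 2 * Limit cap, and the launch count limit of run.rb are left out.

```ruby
require 'set'

# Hypothetical helper: [start, length] for every run of consecutive IDs in
# `range` that is absent from `done`.
def missing_runs(range, done)
  done = done.to_set
  range.reject { |id| done.include?(id) }
       .slice_when { |a, b| b != a + 1 }
       .map { |run| [run.first, run.size] }
end

# Hypothetical helper: split runs into pieces of at most max_len IDs.
def split_runs(runs, max_len)
  runs.flat_map do |start, length|
    (0...length).step(max_len).map do |offset|
      [start + offset, [max_len, length - offset].min]
    end
  end
end

# Made-up example: IDs 1..12 with 1, 2, 3, 7, 8 already handled.
done = [1, 2, 3, 7, 8]
runs = missing_runs(1..12, done)
p runs                  # prints [[4, 3], [9, 4]]
p split_runs(runs, 2)   # prints [[4, 2], [6, 1], [9, 2], [11, 2]]
```

Each [start, length] pair maps directly onto the two arguments of pachong.rb, and a gap that reaches the end of the range is reported like any other.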


Original article: https://www.cnblogs.com/PHPnetc/p/8228942.html
