标签:dem connect mod 内容 获取数据 ted client fan 手动
最近老大让从网站上获取数据,手动太慢,网上找了点python,用脚本操作。
1 import os 2 import re 3 4 import xlrd 5 import requests 6 import xlwt 7 from bs4 import BeautifulSoup 8 from xlutils.copy import copy 9 from xlwt import * 10 11 12 def read_excel(path): 13 # 打开文件 14 workbook = xlrd.open_workbook(path) 15 # 获取所有sheet 16 17 # 根据sheet索引或者名称获取sheet内容 18 sheet1 = workbook.sheet_by_index(0) # sheet索引从0开始 19 20 # sheet的名称,行数,列数 21 i = 0 22 for sheet1_values in sheet1._cell_values: 23 24 str = sheet1_values[0] 25 str.replace(‘\‘‘,‘‘) 26 print (str,i) 27 response = get_responseHtml(str) 28 soup = get_beautifulSoup(response) 29 pattern1 = ‘^https://ews-aln-core.cisco.com/applmgmt/view-appl/+[0-9]*$‘ 30 pattern2 = ‘^https://ews-aln-core.cisco.com/applmgmt/view-endpoint/+[0-9]*$‘ 31 pattern3 = ‘^https://ews-aln-core.cisco.com/applmgmt/view-appl/by-name/‘ 32 if pattern_match(str,pattern1) or pattern_match(str,pattern3): 33 priority = soup.find("table", class_="main_table_layout").find("tr", class_="centered sub_section_header").find_next("tr", 34 align="center").find_all( 35 "td") 36 elif pattern_match(str,pattern2): 37 priority = soup.find("table", class_="main_table_layout").find("tr", 38 class_="centered").find_next( 39 "tr", 40 align="center").find_all( 41 "td") 42 else: 43 print("no pattern") 44 try: 45 priorityNumble =‘P‘ + get_last_td(priority) 46 47 except Exception: 48 print("没有找到"+str) 49 priorityNumble = ‘P‘ + get_last_td(priority) 50 write_excel(path,i,1,priorityNumble) 51 i = i + 1 52 def write_excel(path,row,col,value): 53 oldwb = xlrd.open_workbook(path) 54 wb =copy(oldwb) 55 ws = wb.get_sheet(0) 56 ws.write(row,col,value) 57 wb.save(path) 58 def get_last_td(result): 59 for idx in range(len(result)): 60 returnResult = result[idx].contents[0] 61 return returnResult 62 def get_beautifulSoup(request): 63 soup = BeautifulSoup(request, ‘html.parser‘, from_encoding=‘utf-8‘, exclude_encodings=‘utf-8‘) 64 return soup 65 def get_responseHtml(url): 66 headers = { 67 ‘User-Agent‘: ‘User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36‘} 68 response = requests.get(url, auth=(userName, passWord),headers=headers).content 69 return response 70 def pattern_match(str,pattern,flags = 0): 71 pattern = re.compile(pattern) 72 return re.match(pattern,str,flags) 73 74 if __name__ == ‘__main__‘: 75 userName = ‘*‘; 76 passWord = ‘*‘ 77 path = r‘*‘ 78 read_excel(path)
这里面坑可是不少
1.刚开始xlsx格式文件,save后不能打开,把excel格式改为xls才正确。
2.header网上找的,这样不会被认为是网络爬虫而报错:http.client.RemoteDisconnected: Remote end closed connection without response.
3.copy的参数要为workbook而不是xls的fileName,否则报错:AttributeError: ‘str’ object has no attribute ‘datemode’.
4.找到一篇很好的博客:Python中,添加写入数据到已经存在的Excel的xls文件,即打开excel文件,写入新数据
5.刚开始想往新的文件里save,用了新的路径,发现不可行,因为在for循环中每次都是从源excel中copy,所以实际结果只插入了一行。
6.正则表达式的语法:正则表达式 - 语法 和 Python正则表达式
6.python中beautiful soup的用法,很全的文档:Beautiful Soup 4.2.0 文档
7.一个爬小说的demo:Python3网络爬虫(七):使用Beautiful Soup爬取小说
8.从没写过python,第一次写,花了半天时间,还有很多可以改进的地方。
标签:dem connect mod 内容 获取数据 ted client fan 手动
原文地址:http://www.cnblogs.com/lizhang4/p/7988065.html