Here are a few small programming tips I am writing down as notes, to record them and keep improving.
1. For membership lookups in Python, prefer dict or set.
Both dict and set are implemented internally with hash tables, so lookups take O(1) time on average, far faster than scanning a list.
Once a program performs more than roughly a thousand (1K) lookups, it is worth organizing the data as a dict or set.
#coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator
import datetime

commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that", "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", "an", "at", "but", "we", "his", "from", "that", "not", "by", "she", "or", "as", "what", "go", "their", "can", "who", "get", "if", "would", "her", "all", "my", "make", "about", "know", "will", "as", "up", "one", "time", "has", "been", "there", "year", "so", "think", "when", "which", "them", "some", "me", "people", "take", "out", "into", "just", "see", "him", "your", "come", "could", "now", "than", "like", "other", "how", "then", "its", "our", "two", "more", "these", "want", "way", "look", "first", "also", "new", "because", "day", "more", "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
# Uncomment the next line to make commonWords a set; run the script both ways and compare the timings!
#commonWords = set(commonWords)

def isCommon(word):
    global commonWords
    if word in commonWords:
        return True
    return False


def cleanText(input):
    input = re.sub(r'\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(r' +', " ", input)
    input = re.sub(r"u\.s\.", "us", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)

    cleanContent = []
    for word in cleanInput:
        if not isCommon(word):
            cleanContent.append(word)
    return cleanContent

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

def getFirstSentenceContaining(ngram, content):
    #print(ngram)
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')

print('Use the set as the format of common words.')
print('Begin:', datetime.datetime.now())
for i in range(50):
    ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print('End:', datetime.datetime.now())
print(sortedNGrams)
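For a quicker, self-contained comparison than the full script above, here is a small timeit sketch of my own (the word list and repeat counts are illustrative, not from the original post) that measures a membership test against a list versus a set:

import timeit

words = ["word%d" % i for i in range(1000)]
word_list = list(words)
word_set = set(words)

# The list is scanned linearly on every test (worst case: the last element),
# while the set does a single hash lookup on average.
print("list:", timeit.timeit(lambda: "word999" in word_list, number=100000))
print("set: ", timeit.timeit(lambda: "word999" in word_set, number=100000))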
2. Inserting data into a database from Python
Before inserting a row, run a query first to check whether the record is already in the database.
This makes the program more robust, and it also conveniently avoids inserting the same data twice; a sketch of the pattern follows below.
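A minimal sketch of this check-before-insert pattern, using the standard-library sqlite3 module; the database file, the pages table, and its columns are my own illustration, not from the original post:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")

def insert_if_absent(url, title):
    # Query first: only insert when the record is not already stored.
    cur.execute("SELECT 1 FROM pages WHERE url = ?", (url,))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO pages (url, title) VALUES (?, ?)", (url, title))
        conn.commit()
        return True
    return False

insert_if_absent("http://example.com", "Example")  # inserted
insert_if_absent("http://example.com", "Example")  # skipped: already present
conn.close()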
3. When creating database tables, it is best to add indexes.
Recently I needed to insert millions of rows into a database; after the first hundred thousand or so, the database became extremely slow and disk reads and writes were saturated.
It turned out the table was being queried far too often, so I recreated it and added indexes along the way, in particular a unique index. My guess was that it is implemented as a hash map underneath (in most relational databases the default index structure is actually a B-tree, though some engines also offer hash indexes).
With the indexes in place, the disk load dropped dramatically and query time stayed almost constant, though read and write speed still fell somewhat as the database grew, which is only to be expected.
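A sketch of the same idea with sqlite3; the table, columns, and row count below are illustrative assumptions, not the schema from the original post. The unique index lets the database answer existence checks without scanning the whole table, and rejects duplicates on insert:

import sqlite3

conn = sqlite3.connect("bulk.db")  # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, key TEXT, value TEXT)")
# The unique index keeps lookups on `key` near-constant time and
# enforces uniqueness, so INSERT OR IGNORE can skip duplicates cheaply.
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_items_key ON items (key)")

rows = [("key-%d" % i, "value-%d" % i) for i in range(100000)]
cur.executemany("INSERT OR IGNORE INTO items (key, value) VALUES (?, ?)", rows)
conn.commit()
conn.close()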
Original post: http://www.cnblogs.com/flyinghorse/p/5735276.html