Regular Expressions: Python's re Module

Date: 2017-09-11 14:17:16


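Regular expressions

1 Almost every programmer needs them
2 Data extraction relies on regular expressions
3 They are a foundation of web crawling

Regular expressions are not specific to Python. They are shared across most languages: a set of rules for matching the content of strings.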

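1 Metacharacters

Character sets: [0123456789] can be written as [0-9]. A range must run from the smaller character to the larger one, so [9-0] is an error. Use [a-zA-Z] for letters, because a single range such as [A-z] also covers the other characters that lie between Z and a.

A character set matches exactly one character.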
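
The character set rules above can be checked with a short sketch. The sample strings are made up for illustration; only standard re functions are used.

```python
import re

# Each character set consumes exactly one character of the input.
digits = re.findall(r'[0-9]', 'a1b22')

# [A-z] is one ASCII range, so it also covers [ \ ] ^ _ and the backtick.
loose = re.findall(r'[A-z]', 'a_Z^')
strict = re.findall(r'[a-zA-Z]', 'a_Z^')

# A range written from large to small is rejected when compiling.
try:
    re.compile(r'[9-0]')
    reversed_ok = True
except re.error:
    reversed_ok = False

print(digits)       # ['1', '2', '2']
print(loose)        # ['a', '_', 'Z', '^']
print(strict)       # ['a', 'Z']
print(reversed_ok)  # False
```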

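.    matches any character except the newline '\n'

\w    (word) a letter, digit, or underscore; \W matches anything else
\s    (space) a whitespace character; \S matches any non-whitespace character
\d    (digit) a digit; \D matches any non-digit

^      start of the string. Inside a character set, [^...] means negation; outside a set, ^ means the start
$      end of the string

()     grouping

a|b    alternation: a or b

[^...] any one character that is not listed in the set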
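
A minimal sketch of the metacharacters listed above, run against a made-up sample string (the variable names are illustrative only):

```python
import re

text = 'id_42 ok\nend'

words = re.findall(r'\w+', text)       # runs of letters, digits, underscores
digits = re.findall(r'\d', text)       # single digits
spaces = re.findall(r'\s', text)       # whitespace, including the newline
start = re.findall(r'^\w+', text)      # word at the very start of the string
end = re.findall(r'\w+$', text)        # word at the very end of the string
either = re.findall(r'ok|end', text)   # alternation
others = re.findall(r'[^\w\s]', 'a-b c!')  # neither word char nor whitespace
dot = re.findall(r'k.', text)          # 'k' is followed by '\n', which . skips

print(words)   # ['id_42', 'ok', 'end']
print(digits)  # ['4', '2']
print(spaces)  # [' ', '\n']
print(start)   # ['id_42']
print(end)     # ['end']
print(either)  # ['ok', 'end']
print(others)  # ['-', '!']
print(dot)     # []
```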

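2 Quantifiers

A quantifier sets how many times the item in front of it may occur.

By default quantifiers are greedy: they match as much as possible.

Adding ? after a quantifier makes it non-greedy: it matches as little as possible.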

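*      zero or more times, greedy: as many as possible      *?  prefers zero times

+      one or more times, greedy                             +?  prefers one time

?      zero or one time, greedy: prefers one                 ??  prefers zero times

{n}    exactly n times

{n,}   n or more times, greedy

{n,m}  n to m times, greedy                                  {1,2}?  prefers one time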
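
To see the greedy and non-greedy forms side by side, here is a small sketch that applies each quantifier to the same made-up input 'aaaa'. The helper name first is hypothetical.

```python
import re

text = 'aaaa'

def first(pattern):
    """Return the text that pattern matches at the start of text."""
    return re.match(pattern, text).group()

patterns = [r'a*', r'a*?', r'a+', r'a+?', r'a?', r'a??',
            r'a{2}', r'a{2,}', r'a{1,2}', r'a{1,2}?']
results = {p: first(p) for p in patterns}

for pattern, matched in results.items():
    print(f'{pattern:10} -> {matched!r}')
```

The greedy forms take as many a's as they are allowed, while each lazy form stops at its minimum: zero characters for *? and ??, one for +? and {1,2}?.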

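Example 1

李[杰莲英二棍子]{1,3}

李[^和]*

import re

my_str = '李杰和李莲英和李二棍子'

# pattern = re.compile(r'李[杰莲英二棍子]{1,3}')
pattern = re.compile(r'李[^和]{1,3}')

# Use a name other than "re" so the module is not shadowed.
result = pattern.findall(my_str)

print(result)  # ['李杰', '李莲英', '李二棍子']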

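Example 2: ID card number

import re

id_number = input('ID card number: ')
# 18 characters (the last a digit or X) or 15 digits; neither form starts with 0.
pattern = re.compile(r'[1-9]\d{16}[0-9X]|[1-9]\d{14}')
result = pattern.findall(id_number)
print(result)

The two lengths can also be written as one pattern with an optional group:

  [1-9]\d{14}(\d{2}[0-9X])?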
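
As a sketch of how the compact pattern could be used for validation, the function below checks the whole input with fullmatch. The name looks_like_id is hypothetical, the group is written as non-capturing, and the check covers only the shape of the number (length and allowed characters), not any checksum. The test inputs are dummy strings, not real ID numbers.

```python
import re

# 15 digits, optionally followed by 2 digits and a final digit or X.
ID_PATTERN = re.compile(r'[1-9]\d{14}(?:\d{2}[0-9X])?')

def looks_like_id(value):
    """Return True if value has the shape of a 15- or 18-character ID."""
    return ID_PATTERN.fullmatch(value) is not None

print(looks_like_id('1' + '0' * 14))        # 15 digits
print(looks_like_id('1' + '0' * 16 + 'X'))  # 18 characters ending in X
print(looks_like_id('1' + '0' * 15))        # 16 digits: wrong length
print(looks_like_id('0' + '1' * 17))        # leading zero
```

fullmatch is used instead of findall so that extra characters before or after the number cause a rejection.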

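3 The escape character \

      \ is the regex escape character. In an ordinary Python string literal \ is also an escape, so each regex backslash has to be written as \\.

      To match the literal text \d, the regex is \\d, which as an ordinary Python string is '\\\\d'.

      Prefix the literal with r to make it a raw string, for example r'\\d\n', and the backslashes reach the regex engine exactly as written.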
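
The double escaping can be verified with a short sketch. The sample text is made up: it contains a real backslash followed by d, then a digit.

```python
import re

text = 'a\\d1'   # four characters: a, backslash, d, 1

normal = re.findall('\\\\d', text)  # ordinary string: four backslashes
raw = re.findall(r'\\d', text)      # raw string: two backslashes
digit = re.findall(r'\d', text)     # \d as a metacharacter matches the digit

print('\\\\d' == r'\\d')  # True: both literals hold the same 3 characters
print(normal)             # ['\\d'], i.e. the literal text \d
print(raw)                # ['\\d']
print(digit)              # ['1']
```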

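4 Greedy matching

  Greedy matching works by backtracking: the quantifier first takes as much as it can, then gives characters back one at a time until the rest of the pattern matches.

 .*?x  A lazy .*? followed by a character x takes as few characters as possible and stops at the first x.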
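
A minimal comparison of the greedy and lazy forms on a made-up string with two x characters:

```python
import re

text = 'aaxbbxcc'

# Greedy: .* grabs the whole string, then backs off to the LAST x.
greedy = re.search(r'.*x', text).group()

# Lazy: .*? grows one character at a time and stops at the FIRST x.
lazy = re.search(r'.*?x', text).group()

print(greedy)  # aaxbbx
print(lazy)    # aax
```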


The re module: Python's regular expression module

import re

Functions in the re module

import re

pattern = re.compile(r'<.*?>')
string = 'script>XXXXX<script'       # contains no complete <...> tag

result = re.findall(pattern, string)   # list of every match; [] when nothing matches
print(result)


result1 = re.search(pattern, string)   # first match anywhere in the string, or None
print(result1)
if result1:                            # guard against None before calling group()
    print(result1.group())             # group() returns the matched text


result3 = re.match(pattern, string)    # matches only at the start, like ^pattern; None otherwise
print(result3)
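
For contrast, here is a sketch of the same three functions on a made-up string that does contain complete tags, so the difference between findall, search, and match is visible:

```python
import re

pattern = re.compile(r'<.*?>')
html = 'text <b>bold</b> end'

all_tags = re.findall(pattern, html)    # every non-overlapping match
first_tag = re.search(pattern, html)    # first match anywhere
at_start = re.match(pattern, html)      # only tries position 0
at_start_ok = re.match(pattern, '<b>bold</b>')

print(all_tags)                            # ['<b>', '</b>']
print(first_tag.group(), first_tag.span()) # <b> (5, 8)
print(at_start)                            # None: html starts with 't'
print(at_start_ok.group())                 # <b>
```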


re.split('[ab]', string)  splits on 'a' and on 'b'; re.split('ab', string) would split only on the substring 'ab'
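
A quick sketch of the difference between splitting on a character set and splitting on a literal substring, using the made-up input 'abcd':

```python
import re

# Character set: every 'a' and every 'b' is a separator.
by_either = re.split(r'[ab]', 'abcd')

# Literal pattern: only the two-character substring 'ab' is a separator.
by_substring = re.split(r'ab', 'abcd')

print(by_either)     # ['', '', 'cd']
print(by_substring)  # ['', 'cd']
```

The empty strings appear because a separator sits at the very start of the input and, in the first case, two separators are adjacent.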


re.sub(r'\d', 'H', string)  replaces every digit matched by \d with 'H'
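
A short sketch of re.sub on a made-up string, together with the count argument and re.subn, which also reports how many replacements were made:

```python
import re

text = 'a1b22c'

all_replaced = re.sub(r'\d', 'H', text)            # replace every digit
first_only = re.sub(r'\d', 'H', text, count=1)     # replace at most one
replaced, how_many = re.subn(r'\d', 'H', text)     # result plus a count

print(all_replaced)        # aHbHHc
print(first_only)          # aHb22c
print(replaced, how_many)  # aHbHHc 3
```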


============= re.finditer() =================> returns an iterator of match objects; call group() on each one to get the text
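
A minimal sketch of iterating over finditer results. The input string is made up; each item is a match object, so its position is available as well as its text.

```python
import re

matches = re.finditer(r'\d+', '123gg6gg4')

# Collect the text and the (start, end) span of every match.
found = [(m.group(), m.span()) for m in matches]

for text, span in found:
    print(text, span)
# 123 (0, 3)
# 6 (5, 6)
# 4 (8, 9)
```

Because finditer is lazy, it can be preferable to findall when the input is large and only some matches are needed.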

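================= re.compile() ==============

pattern = re.compile(regex) === compiles the regular expression into a pattern object

The compiled object can be reused without compiling again, which saves time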

pattern.findall
pattern.search
pattern.match
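
A sketch of reusing one compiled pattern through its own methods. The sample inputs are made up. Note that the re module also keeps an internal cache of recently used patterns, so the time saved by compiling explicitly is usually modest; the main gain is often readability.

```python
import re

number = re.compile(r'\d+')   # compiled once, reused below

everything = number.findall('a1b22')   # all matches
first = number.search('a1b22')         # first match anywhere
at_start = number.match('a1b22')       # None: the string starts with 'a'
at_start_ok = number.match('12ab')     # matches at position 0

print(everything)           # ['1', '22']
print(first.group())        # 1
print(at_start)             # None
print(at_start_ok.group())  # 12
```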

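=================== Group priority in findall: with a group, only the group is shown ====================

import re

# The dots are escaped so that they match a literal '.'.
ret = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
print(ret)   # ['oldboy']

# (?:...) is a non-capturing group, so findall returns the whole match.
ret = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')
print(ret)   # ['www.oldboy.com']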
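
When a pattern has more than one capturing group, findall returns a tuple per match. A small sketch with a made-up second group for the domain suffix:

```python
import re

text = 'www.baidu.cn www.oldboy.com'

# Two capturing groups: each match becomes a (name, suffix) tuple.
pairs = re.findall(r'www\.(baidu|oldboy)\.(com|cn)', text)

# With non-capturing groups the full matched text is returned instead.
full = re.findall(r'www\.(?:baidu|oldboy)\.(?:com|cn)', text)

print(pairs)  # [('baidu', 'cn'), ('oldboy', 'com')]
print(full)   # ['www.baidu.cn', 'www.oldboy.com']
```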


=================== split with a group: wrapping the pattern in () keeps the separators ===================

import re

ret = re.split(r'\d+', '123gg6gg4ds45fff')
print(ret)  # ['', 'gg', 'gg', 'ds', 'fff']

ret = re.split(r'(\d+)', '123gg6gg4ds45fff')
print(ret)  # ['', '123', 'gg', '6', 'gg', '4', 'ds', '45', 'fff']


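=================== Named groups: name a group with (?P<name>...) ===================

  Groups serve two purposes in a regular expression: applying one quantifier to several characters at once,
  and, within a pattern that matches as a whole, capturing only the part you need.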


import re

# ret = re.search(r'<\w+>\w+</\w+>', '<h1>hello</h1>')

# (?P=t) must repeat whatever the group named t matched.
ret1 = re.search(r'<(?P<t>\w+)>\w+</(?P=t)>', '<h1>hello</h1>')
print(ret1.group(), ret1.group('t'))   # <h1>hello</h1> h1

ret2 = re.search(r'<\w+>(?P<content>\w+)</\w+>', '<h1>hello</h1>')
print(ret2.group(), ret2.group('content'))   # <h1>hello</h1> hello


ret = re.search(r'<(\w+)>\w+</\1>', '<h1>hello</h1>')
print(ret.group(), ret.group(1))   # <h1>hello</h1> h1

# Without a name, a group can be referenced as \number, meaning the text here must equal what that group matched.
# The matched value can then be read with group(number).
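
As a sketch of what the back-reference buys, the pattern below combines two named groups and is tried on a made-up input whose closing tag does not match the opening tag. The group names tag and content are illustrative.

```python
import re

TAG = re.compile(r'<(?P<tag>\w+)>(?P<content>\w+)</(?P=tag)>')

good = TAG.search('<h1>hello</h1>')
bad = TAG.search('<h1>hello</h2>')   # closing tag differs from opening tag

print(good.groupdict())  # {'tag': 'h1', 'content': 'hello'}
print(good.groups())     # ('h1', 'hello')
print(bad)               # None
```

groupdict() returns all named groups at once, which can be handier than calling group(name) for each one.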


Original URL: http://www.cnblogs.com/big-handsome-guy/p/7504574.html
