This post reviews the regular expressions and the re module covered earlier, then puts them to work on two practice problems.
Common regex syntax
Start by getting familiar with the most frequently used regex syntax.
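The constructs reviewed in this section can be tried out directly with re.findall and re.search. The snippet below is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "cat bat 42 rat_7"  # made-up sample string

# Single characters and character sets
any_at = re.findall(r".at", text)        # any character followed by "at"
b_or_c = re.findall(r"[bc]at", text)     # "b" or "c" followed by "at"
not_b_c = re.findall(r"[^bc]at", text)   # any character except "b"/"c", then "at"
literal_dot = re.findall(r"\.", "a.b")   # an escaped dot matches only "."

# Predefined character classes
digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)         # \w also matches the underscore

# Quantifiers apply to the preceding character
star = re.findall(r"ab*", "a ab abbb")   # "a" followed by zero or more "b"
plus = re.findall(r"ab+", "a ab abbb")   # "a" followed by one or more "b"
opt = re.findall(r"ab?", "a ab abbb")    # "a" followed by at most one "b"

# Boundaries
starts = re.findall(r"^cat", text)       # "cat" only at the start
ends = re.findall(r"\d$", text)          # a digit only at the end

# Named group plus a backreference to it
m = re.search(r"(?P<word>\w+) (?P=word)", "hello hello world")
repeated = m.group("word")

print(any_at, b_or_c, not_b_c, literal_dot)
print(digits, words)
print(star, plus, opt)
print(starts, ends, repeated)
```

Note that \w+ keeps rat_7 in one piece, because the underscore counts as a word character.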
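- Single-character matching
  - . matches any single character except a newline.
  - [...] matches any one character from a set; for example, [A-Za-z0-9] matches any ASCII letter or digit.
  - [^...] matches any one character that is not in the set; for example, [^0-9] matches any character other than a digit.
  - \ is the escape character; it removes the special meaning of a special character so that it matches itself.
- Predefined character classes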
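  - \d matches a digit.
  - \D matches a non-digit.
  - \s matches a whitespace character.
  - \S matches a non-whitespace character.
  - \w matches a word character: a letter, a digit, or an underscore.
  - \W matches a non-word character.
- Quantifiers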
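  - * matches the preceding character zero or more times.
  - + matches the preceding character one or more times.
  - ? matches the preceding character zero or one time.
- Boundary matching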
  - ^ matches the start of the string.
  - $ matches the end of the string (or just before a trailing newline).
- Grouping
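  - (...) defines a group.
  - (?P<NAME>...) defines a group and names it NAME.
  - (?P=NAME) refers back to the text matched by the group named NAME; it is used together with the previous form.
Problem 1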
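Fetch the JSON string returned by http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681 and use a regular expression to extract each product's skuid (the unique product code) and skuimgurl (the product image URL).
- Problem analysis
  - First, fetch the data to be matched with a simple crawler request.
  - Then write a regular expression that follows the regular structure of the JSON string.
  - Finally, print the result.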
- Implementation
import re
import requests

url = "http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681"
session = requests.Session()
r = session.get(url)  # a simple crawler request for illustration; crawlers are covered later
html = r.text
# Group 1 captures the digits after "skuid"; group 2 captures the .jpg URL after "skuimgurl".
# The dot in \.jpg is escaped so that it matches a literal dot rather than any character.
reg = re.compile(r"\s*\"skuid\":\"(\d+)\",\s*\S*\s*\S*\s*\"skuimgurl\":\"(\S*\.jpg)\"")
result = reg.findall(html)
print(result)  # with two () groups, findall returns one (skuid, skuimgurl) tuple per match
Output
[('26878432382', 'https://img20.360buyimg.com/n7/jfs/t18226/169/1318243724/390477/5b0718ff/5ac44edcNa350dbd9.jpg'), ('5327182', 'https://img20.360buyimg.com/n7/jfs/t17461/138/1837663326/68820/5f8da5cd/5ad9b1e2N42bce837.jpg'), ('11731514723', 'https://img20.360buyimg.com/n7/jfs/t19231/337/2147939016/196162/4210a6ae/5aea6250N0235cd05.jpg'), ('19588651151', 'https://img20.360buyimg.com/n7/jfs/t11341/60/1553062810/120774/ab9534ff/5a02c3f4Naebe34b7.jpg'), ('15495544751', 'https://img20.360buyimg.com/n7/jfs/t18088/43/2048465630/167669/dd3c8b7b/5ae12c40N57c98ea8.jpg'), ('16675691362', 'https://img20.360buyimg.com/n7/jfs/t18490/21/2141098141/120513/b3ca521a/5ae90247N3b4909ae.jpg'), ('26222795271', 'https://img20.360buyimg.com/n7/jfs/t19441/291/1597121495/310550/9bc2e141/5ad05fc0N1510cae5.jpg'), ('1780924', 'https://img20.360buyimg.com/n7/jfs/t17167/97/1957869461/43204/d064647b/5adda3e0Ne1d3aa86.jpg'), ('4813030', 'https://img20.360buyimg.com/n7/jfs/t19198/83/1908967366/189260/7538e84b/5adda865N8f547981.jpg'), ('27036535156', 'https://img20.360buyimg.com/n7/jfs/t19399/140/2175516321/123017/41e6d6a8/5aea87d3N9736cc9d.jpg'), ('26348513019', 'https://img20.360buyimg.com/n7/jfs/t14857/240/2643838980/220943/c982fda1/5aaf2002Ndd25bc52.jpg'), ('26016197600', 'https://img20.360buyimg.com/n7/jfs/t19894/76/195725612/190103/23c60ca1/5aeabb94N3e0266bc.jpg'), ('25168000024', 'https://img20.360buyimg.com/n7/jfs/t17629/301/2062161127/434152/aa3560a5/5ae319f9N1ae1146c.jpg'), ('25965247088', 'https://img20.360buyimg.com/n7/jfs/t19270/67/2232771964/253207/25f41fd9/5aea61b0Nfd21a809.jpg'), ('10123099847', 'https://img20.360buyimg.com/n7/jfs/t15511/14/1469153129/729958/b0af0ca1/5a533063N15fea56c.jpg'), ('20000220615', 'https://img20.360buyimg.com/n7/jfs/t16426/172/2638358261/151693/87020840/5ab869ddN30621fec.jpg'), ('15904713681', 'https://img20.360buyimg.com/n7/jfs/t17287/197/2249621651/366556/d36ae213/5aeadb4cN97f413f3.jpg'), ('10114188069', 
'https://img20.360buyimg.com/n7/jfs/t19927/88/179058964/386205/afd08ef1/5ae9717fN07f116d9.jpg'), ('10503200866', 'https://img20.360buyimg.com/n7/jfs/t18139/246/1628563908/114414/9315ac7c/5ad0647eNa9f1e2af.jpg'), ('1658610413', 'https://img20.360buyimg.com/n7/jfs/t19411/79/1017814440/108641/1b185d6d/5ab8b479Nd2417e97.jpg')]
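The same extraction can also be written with the named groups reviewed earlier, which lets each match be read as a dictionary instead of a positional tuple. The sketch below runs offline on a made-up sample: the "sku" wrapper, the "title" field, and the example.com URLs are assumptions for illustration, not the real response format.

```python
import re

# Hypothetical sample that only imitates the shape of the response;
# the real payload may contain other fields and different formatting.
sample = '''
{"sku": [
  {"skuid":"1001", "title":"demo-a", "skuimgurl":"https://img.example.com/a.jpg"},
  {"skuid":"1002", "title":"demo-b", "skuimgurl":"https://img.example.com/b.jpg"}
]}
'''

pattern = re.compile(
    r'"skuid":"(?P<skuid>\d+)"'                  # product ID
    r'.*?'                                       # skip fields in between (non-greedy)
    r'"skuimgurl":"(?P<skuimgurl>[^"]+\.jpg)"'   # image URL ending in .jpg
)

# finditer yields match objects; groupdict() maps group names to matched text.
items = [m.groupdict() for m in pattern.finditer(sample)]
for item in items:
    print(item["skuid"], item["skuimgurl"])
```

Because . does not match a newline by default, the non-greedy .*? here cannot run past the end of a line, so this sketch assumes each product sits on its own line.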
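Problem 2
Using the contents of the file ga10.wms5.jd.com.txt, match the upstream blocks and the location blocks (including the content inside their {}), and write them into two folders named upstream and location. Each folder holds one file per block, named after that block's configuration name. The expected result is shown below.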
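- Problem analysis
  - Use a regex to match each upstream block. The groups should capture both the name and the full block: the name becomes the file name, and the full block is what gets written to the file.
  - Use the os module to check whether a folder exists, create it, and switch into it.
  - Finally, write the files.
  - The location blocks are handled in essentially the same way.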
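To make the grouping idea concrete, the sketch below shows what findall returns when one group is nested inside another. The configuration text is a made-up nginx-style snippet, since the real ga10.wms5.jd.com.txt is not reproduced here, and the variable names are illustrative.

```python
import re

# Hypothetical snippet standing in for the real configuration file.
conf = """
upstream backend_a {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
upstream backend_b {
    server 10.0.0.3:8080;
}
"""

# Outer group: the whole block. Inner group: the block name.
reg_upstream = re.compile(r"(upstream\s+(\S+)\s+\{[^}]+\})")

# With more than one group, findall returns a list of tuples,
# one element per group, ordered by opening parenthesis.
blocks = reg_upstream.findall(conf)
for whole, name in blocks:
    print(name, "->", len(whole.splitlines()), "lines")
```

Since [^}]+ stops at the first closing brace, this approach only works for blocks that contain no nested braces.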
- Implementation
import re
import os

# Group 1 is the whole upstream block; group 2 is its name.
regupstream = re.compile(r"\s*(upstream\s+(\S+)\s+{[^}]+})")
with open("ga10.wms5.jd.com.txt") as fum:
    upstmlist = regupstream.findall(fum.read())
# Create the output folder if needed, then switch into it.
if not os.path.exists("upstream"):
    os.mkdir("upstream")
os.chdir("upstream")
for item in upstmlist:
    # item[1] is the name (used as the file name); item[0] is the block text.
    with open(item[1], "w") as fumw:
        fumw.write(item[0])
os.chdir("..")

# Same approach for location blocks; group 2 is the path between the slashes.
reglocation = re.compile(r"\s*(location\s+\/(\S+)\/\s+{[^}]+})")
with open("ga10.wms5.jd.com.txt") as flc:
    lcalist = reglocation.findall(flc.read())
if not os.path.exists("location"):
    os.mkdir("location")
os.chdir("location")
for ilocal in lcalist:
    filename1 = ilocal[1] + ".conf"
    with open(filename1, "w") as flcw:
        flcw.write(ilocal[0])
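As a variation on the script above, the folder handling can be done without changing the working directory by combining os.makedirs(..., exist_ok=True) with os.path.join. This is only a sketch: write_blocks is a hypothetical helper, the configuration text is made up, and a temporary directory is used so that nothing is left on disk.

```python
import os
import re
import tempfile


def write_blocks(conf_text, pattern, out_dir, suffix=""):
    """Write each matched block to out_dir/<name><suffix>; return the file names."""
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    names = []
    for whole, name in pattern.findall(conf_text):
        filename = name + suffix
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
            f.write(whole)
        names.append(filename)
    return names


# Hypothetical snippet standing in for the real configuration file.
conf = """
location /api/ {
    proxy_pass http://backend_a;
}
location /static/ {
    root /data/www;
}
"""
reg_location = re.compile(r"(location\s+/(\S+)/\s+\{[^}]+\})")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "location")
    written = write_blocks(conf, reg_location, out_dir, ".conf")
    on_disk = sorted(os.listdir(out_dir))

print(written)
```

Avoiding os.chdir keeps relative paths stable for the rest of the program. A location path that itself contains a slash, such as /a/b/, would produce a name with a path separator in it and would need extra handling.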