2018-05-04

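Python regular expression exercises

This post reviews the regular expression syntax and the re module covered earlier, then applies them to two practical exercises.

Common regex syntax

A quick refresher on the most frequently used regex syntax.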

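Single-character matching
. matches any single character except a newline.
[...] matches one character from the given set; for example, [A-Za-z0-9] matches any letter or digit. [^...] matches any one character that is not in the set; for example, [^0-9] matches any character that is not a digit.
\ is the escape character; it removes the special meaning of a metacharacter so that it matches itself.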
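
A minimal sketch of the single-character rules above. The sample strings are made up for illustration.

```python
import re

# "." matches any single character except a newline,
# so "a\nc" is not matched by "a.c".
dot_hits = re.findall(r"a.c", "abc a-c a\nc")

# [A-Za-z0-9] matches one letter or digit; [^0-9] matches one non-digit.
alnum_hits = re.findall(r"[A-Za-z0-9]", "a1_-B")
non_digit_hits = re.findall(r"[^0-9]", "a1b2")

# A backslash makes a metacharacter literal: "\." matches a real dot only.
literal_dot_hits = re.findall(r"1\.5", "1.5 and 1x5")

print(dot_hits, alnum_hits, non_digit_hits, literal_dot_hits)
```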
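Predefined character classes
\d matches a digit
\D matches a non-digit
\s matches a whitespace character
\S matches a non-whitespace character
\w matches a word character: a letter, a digit, or an underscore
\W matches any character that is not a word character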
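
The predefined classes can be checked on a short string. This is only a sketch with a made-up input; note that the underscore counts as a word character.

```python
import re

text = "id_42 ok\t7"

digits = re.findall(r"\d", text)      # each digit
spaces = re.findall(r"\s", text)      # each whitespace character
words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
non_words = re.findall(r"\W", text)   # everything \w does not match

print(digits, spaces, words, non_words)
```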
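Quantifiers
* matches the preceding character zero or more times
+ matches the preceding character one or more times
? matches the preceding character zero or one time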
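
To compare the three quantifiers side by side, the sketch below tests the same made-up samples against each pattern with re.fullmatch, so the whole string has to match.

```python
import re

samples = ["ac", "abc", "abbbc"]

# "b*" allows zero or more "b", so all three samples match.
star = [s for s in samples if re.fullmatch(r"ab*c", s)]

# "b+" requires at least one "b", so "ac" is rejected.
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]

# "b?" allows at most one "b", so "abbbc" is rejected.
optional = [s for s in samples if re.fullmatch(r"ab?c", s)]

print(star, plus, optional)
```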
Anchors
^ matches the start of the string
$ matches the end of the string
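
A small sketch of how the anchors narrow a match, using a made-up string. Without the re.MULTILINE flag, ^ only matches at the very start, and $ matches at the very end or just before a trailing newline.

```python
import re

text = "cat concat cat"

# Unanchored: "cat" is found three times (also inside "concat").
anywhere = re.findall(r"cat", text)

# "^cat" only matches at the start of the string.
at_start = re.findall(r"^cat", text)

# "cat$" only matches at the end; start() gives the match position.
end_position = re.search(r"cat$", text).start()

# "$" also matches just before a newline that ends the string.
before_newline = re.search(r"cat$", "cat\n") is not None

print(len(anywhere), at_start, end_position, before_newline)
```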
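Groups
(...) groups a subpattern and captures the text it matches.
(?P<NAME>...) groups a subpattern and gives the group the name NAME.
(?P=NAME) matches the same text that the group named NAME matched earlier; it is used together with the previous form.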
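
Named groups and named backreferences are easier to see in a small example. The sketch below uses a made-up tag-matching pattern; the group names tag and body are chosen for illustration only.

```python
import re

# (?P<tag>...) names the group "tag"; (?P=tag) must match the same text again,
# so the closing tag has to repeat the opening tag.
pattern = re.compile(r"<(?P<tag>\w+)>(?P<body>[^<]*)</(?P=tag)>")

match = pattern.search("<b>bold</b>")
mismatch = pattern.search("<b>bold</i>")   # closing tag differs, so no match

print(match.group("tag"), match.group("body"), mismatch)
```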
Exercise 1
Fetch the JSON string returned by the address http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681 and use a regular expression to find each product's skuid (unique product code) and skuimgurl (product image).
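Analysis

First, use a simple crawler to fetch the data to be matched.
Then write a regular expression based on the regular structure of the JSON string.
Finally, print the result.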
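
The matching step can be tried offline before touching the network. The sketch below is only an illustration: the sample string, the skuname field, and the example.com URLs are hypothetical stand-ins for the real response, and the lazy pattern shown here is a simplified alternative, not the pattern used in the post's own implementation.

```python
import re

# Hypothetical sample shaped like a JSON response; the real layout may differ.
sample = '''
{"sku": [
  {"skuid":"100", "skuname":"demo item A", "skuimgurl":"https://example.com/a.jpg"},
  {"skuid":"200", "skuname":"demo item B", "skuimgurl":"https://example.com/b.jpg"}
]}
'''

# Group 1: the digits of skuid. Group 2: the image URL up to the closing quote.
# ".*?" lazily skips whatever lies between the two fields; re.S lets it cross
# line breaks. If an object lacked skuimgurl, the match would run on into the
# next object, so this is only safe for uniformly structured data.
pattern = re.compile(r'"skuid":"(\d+)".*?"skuimgurl":"([^"]*)"', re.S)

pairs = pattern.findall(sample)
print(pairs)
```

With two groups in the pattern, findall returns a list of (skuid, skuimgurl) tuples, one per product.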

Implementation

import re
import requests

url = "http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681"
session = requests.session()
r = session.get(url)    # simple crawler usage, shown for illustration; covered in a later post
html = r.text

# Group 1 is the skuid digits, group 2 is the image URL ending in ".jpg".
# The dot before "jpg" is escaped so that it matches a literal dot only.
reg = re.compile(r'\s*"skuid":"(\d+)",\s*\S*\s*\S*\s*"skuimgurl":"(\S*\.jpg)"')
result = reg.findall(html)
print(result)    # with two () groups, findall returns a list of 2-tuples

Output

[('26878432382', 'https://img20.360buyimg.com/n7/jfs/t18226/169/1318243724/390477/5b0718ff/5ac44edcNa350dbd9.jpg'), ('5327182', 'https://img20.360buyimg.com/n7/jfs/t17461/138/1837663326/68820/5f8da5cd/5ad9b1e2N42bce837.jpg'), ('11731514723', 'https://img20.360buyimg.com/n7/jfs/t19231/337/2147939016/196162/4210a6ae/5aea6250N0235cd05.jpg'), ('19588651151', 'https://img20.360buyimg.com/n7/jfs/t11341/60/1553062810/120774/ab9534ff/5a02c3f4Naebe34b7.jpg'), ('15495544751', 'https://img20.360buyimg.com/n7/jfs/t18088/43/2048465630/167669/dd3c8b7b/5ae12c40N57c98ea8.jpg'), ('16675691362', 'https://img20.360buyimg.com/n7/jfs/t18490/21/2141098141/120513/b3ca521a/5ae90247N3b4909ae.jpg'), ('26222795271', 'https://img20.360buyimg.com/n7/jfs/t19441/291/1597121495/310550/9bc2e141/5ad05fc0N1510cae5.jpg'), ('1780924', 'https://img20.360buyimg.com/n7/jfs/t17167/97/1957869461/43204/d064647b/5adda3e0Ne1d3aa86.jpg'), ('4813030', 'https://img20.360buyimg.com/n7/jfs/t19198/83/1908967366/189260/7538e84b/5adda865N8f547981.jpg'), ('27036535156', 'https://img20.360buyimg.com/n7/jfs/t19399/140/2175516321/123017/41e6d6a8/5aea87d3N9736cc9d.jpg'), ('26348513019', 'https://img20.360buyimg.com/n7/jfs/t14857/240/2643838980/220943/c982fda1/5aaf2002Ndd25bc52.jpg'), ('26016197600', 'https://img20.360buyimg.com/n7/jfs/t19894/76/195725612/190103/23c60ca1/5aeabb94N3e0266bc.jpg'), ('25168000024', 'https://img20.360buyimg.com/n7/jfs/t17629/301/2062161127/434152/aa3560a5/5ae319f9N1ae1146c.jpg'), ('25965247088', 'https://img20.360buyimg.com/n7/jfs/t19270/67/2232771964/253207/25f41fd9/5aea61b0Nfd21a809.jpg'), ('10123099847', 'https://img20.360buyimg.com/n7/jfs/t15511/14/1469153129/729958/b0af0ca1/5a533063N15fea56c.jpg'), ('20000220615', 'https://img20.360buyimg.com/n7/jfs/t16426/172/2638358261/151693/87020840/5ab869ddN30621fec.jpg'), ('15904713681', 'https://img20.360buyimg.com/n7/jfs/t17287/197/2249621651/366556/d36ae213/5aeadb4cN97f413f3.jpg'), ('10114188069', 
'https://img20.360buyimg.com/n7/jfs/t19927/88/179058964/386205/afd08ef1/5ae9717fN07f116d9.jpg'), ('10503200866', 'https://img20.360buyimg.com/n7/jfs/t18139/246/1628563908/114414/9315ac7c/5ad0647eNa9f1e2af.jpg'), ('1658610413', 'https://img20.360buyimg.com/n7/jfs/t19411/79/1017814440/108641/1b185d6d/5ab8b479Nd2417e97.jpg')]

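Exercise 2

From the contents of the file ga10.wms5.jd.com.txt, match the upstream blocks and the location {} blocks, and write them into the folders upstream and location respectively. Each folder holds one file per configuration block, named after that block. The expected result is shown below.
[image: regular]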

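Analysis

Match the upstream blocks with a regular expression whose groups capture both the name and the full block: the name is used as the file name, and the full block is written to the file.
Use the os module to check for the folder, create it, and switch into it.
Finally, write the files.
The location blocks are handled in essentially the same way.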
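
The steps above can be sketched on an in-memory config. Everything here is an assumption made for illustration: the config text is a hypothetical stand-in for ga10.wms5.jd.com.txt (which the post does not show), write_blocks is a made-up helper, and the files go into a temporary directory instead of the working directory.

```python
import os
import re
import tempfile

# Hypothetical config text; the real file contents are not shown in the post.
config = """
upstream app_a {
    server 10.0.0.1:8080;
}
location /api/ {
    proxy_pass http://app_a;
}
"""

# Group 1 is the whole block, group 2 is the name used for the file.
upstream_re = re.compile(r"(upstream\s+(\S+)\s+\{[^}]+\})")
location_re = re.compile(r"(location\s+/(\S+)/\s+\{[^}]+\})")


def write_blocks(matches, folder, suffix=""):
    """Write each (content, name) pair to folder/name+suffix; return the paths."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for content, name in matches:
        path = os.path.join(folder, name + suffix)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        paths.append(path)
    return paths


base = tempfile.mkdtemp()  # scratch directory, so nothing lands in the project
upstream_files = write_blocks(
    upstream_re.findall(config), os.path.join(base, "upstream"))
location_files = write_blocks(
    location_re.findall(config), os.path.join(base, "location"), ".conf")

print(upstream_files, location_files)
```

Building paths with os.path.join avoids changing the working directory, which is one possible variation on the os.chdir approach described above.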

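Implementation

import codecs
import re
import os

# Group 1 captures the whole upstream block, group 2 captures its name.
# [^}]+ stops at the first "}", so blocks with nested braces are not handled.
regupstream = re.compile(r"\s*(upstream\s+(\S+)\s+{[^}]+})")
with codecs.open("ga10.wms5.jd.com.txt") as fum:
    upstmlist = regupstream.findall(fum.read())
    if not os.path.exists("upstream"):
        os.mkdir("upstream")
    os.chdir("upstream")
    for item in upstmlist:
        # One file per upstream block, named after the upstream.
        with codecs.open(item[1], "w") as fumw:
            fumw.write(item[0])
    os.chdir("..")


# Same approach for location blocks; group 2 is the path between the slashes.
reglocation = re.compile(r"\s*(location\s+\/(\S+)\/\s+{[^}]+})")
with codecs.open("ga10.wms5.jd.com.txt") as flc:
    lcalist = reglocation.findall(flc.read())
    if not os.path.exists("location"):
        os.mkdir("location")
    os.chdir("location")
    for ilocal in lcalist:
        # One file per location block, named <path>.conf.
        filename1 = ilocal[1]+".conf"
        with codecs.open(filename1, "w") as flcw:
            flcw.write(ilocal[0])
    os.chdir("..")    # return to the starting directory, as in the upstream step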

Output
[image: regular_rex]
