2018-06-25

Python Web Scraping (Part 3)

Beautiful Soup

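Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files, which makes it well suited to scraping web pages.
It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work.
With it, you can extract information from a page without writing regular expressions.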

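Installation

Installing Beautiful Soup

Installing Beautiful Soup is simple: run pip install beautifulsoup4.

Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.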

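Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance of malformed documents; parses pages the way a browser does; produces HTML5 documents	Slow; requires an external Python package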

To install lxml: pip install lxml
To install html5lib: pip install html5lib
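
The parser can also be chosen at run time. The sketch below assumes a hypothetical helper named make_soup (it is not part of Beautiful Soup) that uses lxml when it can be imported and otherwise falls back to the built-in html.parser.

```python
from bs4 import BeautifulSoup


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        import lxml  # noqa: F401  (imported only to check availability)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(markup, parser)


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Naming the parser explicitly, rather than leaving the second argument out, keeps the result predictable across machines with different packages installed.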

Quick start

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

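Note that this example uses the lxml parser, which must be installed beforehand.
Parsing the markup with BeautifulSoup yields a BeautifulSoup object, and prettify() prints it as a tree with standard indentation.
The following code retrieves every link and then all of the text content: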

for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())
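
As a small variation on the loop above, the link text and the target URL can be collected together. This is a minimal sketch that reuses part of the quick-start markup and the built-in html.parser so that it runs without lxml; the variable name links is only illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
soup = BeautifulSoup(html, "html.parser")

# Map each link's visible text to its href attribute.
# Tag.get() returns None instead of raising when the attribute is missing.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}
print(links)
```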

Basic usage

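Tag selectors
The code above uses soup.title and soup.p; soup.head works the same way.
soup.<tag name> returns the matching tag together with its contents. When the document contains several such tags, only the first one is returned.
Getting the name
soup.title.name returns the name of the title tag.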
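Getting attributes
Either of the following returns the name attribute of the first p tag. Both raise KeyError when the attribute is missing, as it is on the p tags of the quick-start document, which carry only a class attribute.
soup.p.attrs['name']
soup.p['name']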
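
A short sketch of attribute access follows. The markup is made up for illustration: it adds a name attribute (with the arbitrary value "dromouse") to a p tag so that both access styles succeed.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the name attribute is an assumption for this example.
markup = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

print(p.attrs)      # all attributes as a dictionary
print(p["name"])    # single-valued attribute: a string
print(p["class"])   # class can hold several values, so it is a list
print(p.get("id"))  # get() returns None for a missing attribute
```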
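Getting the content
soup.p.string returns the text of the first p tag (it is None when the tag has more than one child node).
Nested selection
soup.head.title.string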
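Children and descendants
contents
soup.p.contents returns all direct child nodes of the p tag as a list.
children
soup.p.children returns the same child nodes as an iterator.
contents and children yield the same nodes; the only difference is that one is a list and the other an iterator that has to be consumed in a loop:
for i, child in enumerate(soup.p.children):
    print(i, child)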
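
The difference between direct children and all descendants can be seen on a small, made-up paragraph. The sketch also uses the descendants attribute, which walks every nested node rather than only the first level.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the second link wraps its text in a <b> tag.
markup = '<p>Once <a id="link1">Elsie</a>, <a id="link2"><b>Lacie</b></a></p>'
p = BeautifulSoup(markup, "html.parser").p

# Direct children only: two text nodes and two <a> tags.
for i, child in enumerate(p.children):
    print(i, repr(child))

# Descendants also include the nodes nested inside the <a> tags.
for i, node in enumerate(p.descendants):
    print(i, repr(node))
```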
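Parent nodes
soup.a.parent returns the parent node.
list(enumerate(soup.a.parents)) lists every ancestor: the a tag's parent comes first, then that parent's parent, and so on up to the document itself. The last element is the BeautifulSoup object and the second-to-last is the html tag, so both of them print as the whole document.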
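
Printing only the names of the ancestors makes the order easier to see. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

markup = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(markup, "html.parser")

# Walk upward from the <a> tag; the BeautifulSoup object itself is the
# final ancestor and reports the special name "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```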

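Standard selectors
find_all
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content.
name
soup.find_all('ul') returns a list of matching tags.
attrs
attrs takes a dictionary of attribute names and values. class needs special care because it is a reserved word in Python: as a keyword argument it is written class_, as in soup.find_all(class_='element'), while inside the dictionary it keeps its real name, as in soup.find_all(attrs={'class': 'element'}). Common attributes such as id can be passed as keyword arguments without attrs, for example soup.find_all(id='list-1').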
soup.find_all(attrs={'id': 'list-1'})
soup.find_all(attrs={'name': 'elements'})
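
The different ways of filtering by attribute can be compared side by side. The markup below, including the list-1, list-2, and element names, is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '''
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
</ul>
'''
soup = BeautifulSoup(markup, "html.parser")

by_keyword = soup.find_all(class_="element")        # keyword form uses class_
by_dict = soup.find_all(attrs={"class": "element"})  # dictionary form uses class
by_id = soup.find_all(id="list-1")                   # id needs no attrs
lists = soup.find_all("ul", class_="list")           # matches any one of a tag's classes

print(len(by_keyword), len(by_dict), len(by_id), len(lists))
```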

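text
soup.find_all(text='Foo')
Returns a list of all strings in the document that are exactly 'Foo'. Beautiful Soup 4.4.0 and later also accept this argument under the name string.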
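
A brief sketch of searching by text follows. It assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string, and the list items are made up.

```python
import re

from bs4 import BeautifulSoup

markup = "<ul><li>Foo</li><li>Bar</li><li>Foo</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

exact = soup.find_all(string="Foo")                     # the matching strings themselves
starts_with_b = soup.find_all(string=re.compile("^B"))  # a regular expression also works
foo_items = soup.find_all("li", string="Foo")           # with a tag name, tags are returned

print(exact, starts_with_b, foo_items)
```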

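find
find(name, attrs, recursive, text, **kwargs)
find returns only the first match, or None when nothing matches.
Other related methods:
find_parents() returns all ancestors; find_parent() returns the direct parent.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first one.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.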
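
A few of these methods in action, as a minimal sketch on a made-up list (the markup has no whitespace between tags, so the siblings are the li tags themselves):

```python
from bs4 import BeautifulSoup

markup = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("li")               # first match only
missing = soup.find("table")          # None, because there is no table
bar = soup.find("li", string="Bar")   # start from the middle item

next_item = bar.find_next_sibling("li")
previous_item = bar.find_previous_sibling("li")
parent_list = bar.find_parent("ul")

print(first.string, missing, next_item.string, previous_item.string, parent_list["id"])
```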
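CSS selectors
Pass a CSS selector directly to select() to make a selection.
. selects by class
# selects by id
tag1, tag2 finds all tag1 and all tag2 elements
tag1 tag2 finds all tag2 elements inside tag1
[attr=value] finds all tags whose attr attribute equals value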
soup.select('.panel .panel-heading')
soup.select('ul li')
soup.select('#list-2 .element')
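
The selectors above can be tried on a small document. The markup is made up to match the class and id names used in the calls, and select_one is shown as the single-result counterpart of select.

```python
from bs4 import BeautifulSoup

markup = '''
<div class="panel">
<div class="panel-heading"><h4>Hello</h4></div>
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
</div>
'''
soup = BeautifulSoup(markup, "html.parser")

headings = soup.select(".panel .panel-heading")  # descendant by class
items = soup.select("ul li")                     # every li inside a ul
second = soup.select("#list-2 .element")         # class inside an id
by_attr = soup.select('ul[id="list-1"]')         # attribute selector
mixed = soup.select("h4, li")                    # either tag
first_item = soup.select_one("ul li")            # first match or None

print([li.get_text() for li in items])
```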

Getting the content
get_text() returns the text content.

for li in soup.select('li'):
    print(li.get_text())

Getting attributes
Use [attribute name] or attrs[attribute name] to read an attribute.

for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

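For more on Beautiful Soup, see the official Beautiful Soup 4.2.0 documentation.
For a further summary, see the article "Python爬虫利器二之Beautiful Soup的用法" (Python Crawler Tools, Part 2: Using Beautiful Soup).
For more content, see the python修行路 blog.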

webdriver

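Introduction

Selenium is an automated testing tool. It supports the mainstream graphical browsers, including Chrome, Safari, and Firefox; once the Selenium driver for one of these browsers is installed, web interfaces can be tested in it conveniently. In other words, Selenium supports a driver for each of these browsers.
Selenium 2 is also known as WebDriver. Its main new feature is the integration of WebDriver, formerly a competitor of Selenium, into Selenium 1.0. Selenium 2 is therefore the merger of the Selenium and WebDriver projects: it stays compatible with Selenium 1 and supports both the Selenium API and the WebDriver API.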

Installation

pip install selenium

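A browser driver is also required, and each browser needs its own. The link below is for the Chrome driver.
Link: https://pan.baidu.com/s/1qZ2LfmW Password: qixa
After downloading, put chromedriver in the same directory as chrome.exe; on my Windows machine that is C:\Program Files (x86)\Google\Chrome\Application.
Alternatively, add that directory to the PATH environment variable.
The code below opens the Baidu home page in Chrome and then closes the browser automatically.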

from selenium import webdriver

chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"  # raw string keeps the backslashes literal
browser = webdriver.Chrome(chromedriver)
url = "https://www.baidu.com"
browser.get(url=url)
browser.close()

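Finding elements

Finding a single element

The following methods find a single element. They belong to the Selenium 3 API; Selenium 4.3 and later removed them in favor of find_element(By.X, ...).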
find_element_by_name
find_element_by_id
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Example

from selenium import webdriver

chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"  # raw string keeps the backslashes literal
browser = webdriver.Chrome(chromedriver)
url = "http://www.taobao.com"
browser.get(url=url)
input_first = browser.find_element_by_id("q")  # by id
input_second = browser.find_element_by_css_selector("#q")  # by CSS selector
input_third = browser.find_element_by_xpath('//*[@id="q"]')  # by XPath
print(input_first)
print(input_second)
print(input_third)
browser.close()

Output

<selenium.webdriver.remote.webelement.WebElement (session="7341f32aea4238856409f236325848fc", element="0.4317776711082031-1")>
<selenium.webdriver.remote.webelement.WebElement (session="7341f32aea4238856409f236325848fc", element="0.4317776711082031-1")>
<selenium.webdriver.remote.webelement.WebElement (session="7341f32aea4238856409f236325848fc", element="0.4317776711082031-1")>

The same lookup can also be written with the By class:

from selenium.webdriver.common.by import By
input_first = browser.find_element(By.ID, "q")

The other locator strategies work the same way: By.ID can be replaced with By.NAME and so on.

Finding multiple elements

To find multiple elements use find_elements; to find a single element use find_element.

from selenium import webdriver

chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"  # raw string keeps the backslashes literal
browser = webdriver.Chrome(chromedriver)
browser.get("http://www.taobao.com")
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

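This time the result is a list.
About XPath
XPath is short for XML Path Language. An HTML page is parsed into a tree of elements just as an XML document is, so XPath syntax can be used to locate elements on the page.
Absolute path
Starts from the root element with /
Relative path
Matches any qualifying element anywhere in the document with //
Find all input elements on the page
//input
Find the input elements that are direct children of the first form element on the page (only inputs one level below the form; the single / denotes a direct child)
//form[1]/input
Find all input elements inside the first form element on the page (any input within the form, however deeply nested; the double // denotes any descendant)
//form[1]//input
Find the first form element on the page
//form[1]
Find the form element whose id is loginForm
//form[@id='loginForm']
Find the input element whose name attribute is username
//input[@name='username']
Find the first input element under the form element whose id is loginForm
//form[@id='loginForm']/input[1]
Find the input element whose name attribute is continue and whose type attribute is button
//input[@name='continue'][@type='button']
Find the fourth input element under the form element whose id is loginForm
//form[@id='loginForm']/input[4]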
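
Several of these expressions can be tried without a browser. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in: it implements only a subset of XPath and needs each path to start with a dot, so it is not a replacement for the browser's XPath engine. The login form is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page; one input is nested inside a <div>.
page = ET.fromstring("""
<html><body>
  <form id="loginForm">
    <input name="username" type="text"/>
    <div><input name="password" type="password"/></div>
    <input name="continue" type="submit"/>
    <input name="continue" type="button"/>
  </form>
</body></html>
""".strip())

all_inputs = page.findall(".//input")              # //input
direct_inputs = page.findall(".//form[1]/input")   # //form[1]/input
nested_inputs = page.findall(".//form[1]//input")  # //form[1]//input
login_form = page.find(".//form[@id='loginForm']")
username = page.find(".//input[@name='username']")
first_input = page.find(".//form[@id='loginForm']/input[1]")
button = page.find(".//input[@name='continue'][@type='button']")

print(len(all_inputs), len(direct_inputs), len(nested_inputs))
print(first_input.get("name"), button.get("type"))
```

The single slash skips the password field because it sits one level deeper, inside the div, while the double slash finds it.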

Interacting with controls

Clear an input box
element.clear()
Send text
element.send_keys("username")
Get the element's text
element.text
Click a button
element.click()
Submit a form
element.submit()
Radio buttons and checkboxes
element = browser.find_element_by_id('checkbox')
element.click()
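
These calls are usually combined. The sketch below assumes a hypothetical helper, fill_field, that works on any element object offering clear(), send_keys(), and submit(), such as a WebElement returned by the find_element methods.

```python
def fill_field(element, text, submit=False):
    """Replace the content of an input element with text.

    element is expected to be a Selenium WebElement; fill_field is a
    hypothetical helper and not part of Selenium.
    """
    element.clear()          # remove any pre-filled value first
    element.send_keys(text)  # then type the new value
    if submit:
        element.submit()     # submit the form that contains the element
```

For example, fill_field(browser.find_element_by_id("q"), "python", submit=True) would type a query into the search box from the earlier example and submit it.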

Common methods

Get the cookies
browser.get_cookies()
Get the page title
browser.title
Close the browser
browser.close()
Go forward
browser.forward()
Go back
browser.back()
Refresh
browser.refresh()
Get the URL of the current page
browser.current_url

Example

The following uses the driver to log in to the JD.com website automatically and retrieve the cookies.

import random
import time
from selenium import webdriver

url = 'https://passport.jd.com/new/login.aspx?ReturnUrl=https%3A%2F%2Fwww.jd.com%2F'
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"  # raw string keeps the backslashes literal
driver = webdriver.Chrome(chromedriver)
driver.get(url)
time.sleep(random.uniform(1, 3))
driver.find_elements_by_xpath('//a[@clstag="pageclick|keycount|login_pc_201804112|10"]')[0].click()  # QR-code login is the default; switch to account login
time.sleep(random.uniform(1, 3))
driver.find_element_by_id('loginname').clear()  # clear the pre-filled user name
time.sleep(random.uniform(1, 3))
driver.find_element_by_id('loginname').send_keys("xxxxx")  # enter the user name
time.sleep(random.uniform(1, 3))
driver.find_element_by_id('nloginpwd').send_keys("xxxxx")  # enter the password
time.sleep(random.uniform(1, 3))
driver.find_element_by_id('loginsubmit').click()  # click the login button
time.sleep(random.uniform(5, 10))
print(driver.get_cookies())
driver.close()

time.sleep(random.uniform(1, 3)) pauses for a random interval after each action to imitate a human user and reduce the chance of being blocked.
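
get_cookies() returns a list of dictionaries, each with at least a name and a value key. To reuse the login in an ordinary HTTP client, that list usually has to be flattened first. The two helpers below are a hypothetical sketch, and the sample cookies are made up.

```python
def cookies_to_dict(cookies):
    """Turn the list from driver.get_cookies() into a name-to-value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookie_header(cookies):
    """Build the value of a Cookie request header from the same list."""
    return "; ".join(f"{cookie['name']}={cookie['value']}" for cookie in cookies)


# Made-up data in the shape that get_cookies() returns.
sample = [
    {"name": "sid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "zh", "domain": ".example.com", "path": "/"},
]
print(cookies_to_dict(sample))
print(cookie_header(sample))
```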

For more on webdriver, see the official Selenium with Python documentation.
