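Contents
- Part 6: Anti-Scraping and selenium
- Topic 1: What Website Anti-Scraping Means
- Topic 2: How to Deal with Anti-Scraping
- 1. Anti-scraping via the User-Agent field in the headers
- 2. Anti-scraping via the Referer field or other fields
- Topic 3: Intermediate_12_selenium: basic usage
- Topic 4: Intermediate_13_selenium: inspecting request information
- Topic 5: Intermediate_14_selenium: loading a new URL in the same tab
- Topic 6: Intermediate_15_selenium: opening a new tab
- Topic 7: Intermediate_16_selenium: switching tabs
- Topic 8: Intermediate_17_selenium: locating a single element
- Topic 9: Intermediate_18_selenium: locating multiple elements
- Topic 10: Intermediate_19_selenium: reading information from a located element
- Topic 11: Intermediate_20_selenium: handling cookies
- Topic 12: Intermediate_21_selenium: waiting for the page to load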
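Part 6: Anti-Scraping and selenium
Topic 1: What Website Anti-Scraping Means
Website anti-scraping exists to protect a site's valuable data and its server resources: the site identifies crawler programs and limits malicious requests.
No. | Approach | Details |
1 | Inspecting request information | Headers: UA, Referer. Cookie: login state, CAPTCHA. Request parameters: whether the token, sign, and other encrypted parameters are correct |
2 | Identifying user behavior | Request frequency of a single client per unit of time (by IP, UA, account). Mouse movement: whether mouse-move events occur and whether the movement looks human (e.g. accelerating, then decelerating). Keyboard input events |
3 | Changing anti-scraping measures frequently or on a schedule | None |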
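Topic 2: How to Deal with Anti-Scraping
1. Anti-scraping via the User-Agent field in the headers
: By default a crawler does not send a browser User-Agent.
: Adding a User-Agent before each request works; a better approach is to rotate through a User-Agent pool.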
# User-Agent pool
user_agent = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_0) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/537.4',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_0; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/4.0.206.1 Safari/532.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20110517 Firefox/5.0 Fennec/5.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 Camino/2.2.1',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.1) Gecko/20110318 Firefox/4.0b13pre Fennec/4.0',
'Mozilla/5.0 (Windows NT 6.0; rv:2.1.1) Gecko/20110415 Firefox/4.0.2pre Fennec/4.0.1',
'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b2pre) Gecko/20081015 Fennec/1.0a1',
'Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11',
'Chrome/15.0.860.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/15.0.860.0',
'Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.220 Safari/535.1',
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
]
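To rotate through the pool, one option is to draw a random entry for each request. The sketch below is only an illustration: `random_headers` is a hypothetical helper, and `sample_pool` is a two-entry stand-in so the snippet runs on its own. In practice the `user_agent` list above would be passed in instead.

```python
import random

# Stand-in pool for illustration; in practice pass the full user_agent list.
sample_pool = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
]


def random_headers(pool):
    """Return a headers dict whose User-Agent is drawn at random from pool."""
    if not pool:
        raise ValueError("User-Agent pool is empty")
    return {"User-Agent": random.choice(pool)}


headers = random_headers(sample_pool)
print(headers["User-Agent"])
```

The resulting dict can be passed as the `headers` argument of an HTTP client, for example `requests.get(url, headers=headers)`.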
2. Anti-scraping via the Referer field or other fields
: By default a crawler does not send a Referer field.
: Add a Referer field to the request headers.
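Adding the Referer works the same way as adding a User-Agent: it is one more entry in the request headers. Here is a minimal sketch using the standard library's `urllib.request`. The helper `build_request` and the example.com URLs are assumptions made for illustration, and the request is only built, never sent.

```python
import urllib.request


def build_request(url, referer, user_agent):
    """Build (but do not send) a request carrying Referer and User-Agent."""
    headers = {"Referer": referer, "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, for illustration only.
req = build_request(
    "https://www.example.com/list?page=2",
    referer="https://www.example.com/list?page=1",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36",
)
print(req.get_header("Referer"))
```

A reasonable Referer is usually the page a real visitor would have come from, such as the previous list page when requesting the next one.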
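Topic 3: Intermediate_12_selenium: basic usage
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
# 1. Open the browser
browser = webdriver.Chrome()
# 2. Load the target page
browser.get("https://www.baidu.com")
# 3. Locate the search box and type a query
# (find_element_by_id was removed in Selenium 4.3; use find_element(By.ID, ...))
browser.find_element(By.ID, "kw").send_keys("重庆")
time.sleep(3)
# 4. Click the "百度一下" (search) button
browser.find_element(By.ID, "su").click()
time.sleep(3)
# 5. Take a screenshot of the current page (assumes the ./FileSave directory already exists)
browser.save_screenshot("./FileSave/百度搜索.png")
# 6. Close the current page
browser.close()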
Topic 4: Intermediate_13_selenium: inspecting request information
import time
from selenium import webdriver
# 1. Open the browser
browser = webdriver.Chrome()
# 2. Load the target page
browser.get("https://www.baidu.com")
# 3. Print the source of the loaded page
print(browser.page_source)
# 4. Print the cookies
print(browser.get_cookies())
# 5. Print the URL the page finally ended up at (after a 302, this is the redirect target)
print(browser.current_url)
time.sleep(3)
# 6. Close the current page
browser.close()
Topic 5: Intermediate_14_selenium: loading a new URL in the same tab
import time
from selenium import webdriver
# 1. Create the browser
browser = webdriver.Chrome()
# 2_1. Load the Baidu page
browser.get("https://www.baidu.com")
time.sleep(3)
# 2_2. Load the JD page (in the same tab, replacing Baidu)
browser.get("https://www.jd.com")
time.sleep(3)
# 3. Close the current page
browser.close()
Topic 6: Intermediate_15_selenium: opening a new tab
import time
from selenium import webdriver
# 1. Create the browser
browser = webdriver.Chrome()
# 2_1. Open the Taobao login page
browser.get("https://login.taobao.com")
time.sleep(3)
# 2_2. Open the JD page (in a new tab)
js = "window.open('https://www.jd.com')"
browser.execute_script(js)
Topic 7: Intermediate_16_selenium: switching tabs
import time
from selenium import webdriver
# 1. Create the browser
browser = webdriver.Chrome()
# 2_1. Open the Taobao login page
browser.get("https://login.taobao.com")
time.sleep(3)
# 2_2. Open the JD page (in a new tab)
js = "window.open('https://www.jd.com')"
browser.execute_script(js)
time.sleep(3)
# 3_1. Switch to the first tab
browser.switch_to.window(browser.window_handles[0])
time.sleep(1)
# 3_2. Switch to the second tab
browser.switch_to.window(browser.window_handles[1])
time.sleep(1)
# 4_1. Close the second tab
browser.close()
time.sleep(1)
# 4_2. Switch back to the first tab and close it
browser.switch_to.window(browser.window_handles[0])
browser.close()
# 5. Alternatively, call browser.quit() to shut down the whole browser, however many tabs are open
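Indexing `window_handles` by position works when the tab order is known. Another option is to compare the handles before and after opening the tab and switch to whichever one is new. The helper below, `open_tab_and_switch`, is a hypothetical sketch that relies only on `window_handles`, `execute_script`, and `switch_to.window` as used above, and assumes the new handle is available as soon as `window.open` returns.

```python
def open_tab_and_switch(driver, url):
    """Open url in a new tab and switch the driver to it.

    Returns the handle of the new tab.
    """
    before = set(driver.window_handles)
    # arguments[0] is filled in by execute_script from the extra argument.
    driver.execute_script("window.open(arguments[0])", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new tab appeared after window.open")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

With the `browser` object from the example above, this could be called as `jd_handle = open_tab_and_switch(browser, "https://www.jd.com")`, followed by `browser.quit()` once all tabs are done.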
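Topic 8: Intermediate_17_selenium: locating a single element
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
# 1. Create the browser
browser = webdriver.Chrome()
# 2. Open the Baidu News page
browser.get("https://news.baidu.com")
# 3. Locate the search box
# (find_element_by_xpath was removed in Selenium 4.3; use find_element(By.XPATH, ...))
ret = browser.find_element(By.XPATH, "//input[@class='word']")
print(ret)
time.sleep(3)
# 4. Quit the browser
browser.quit()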
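Topic 9: Intermediate_18_selenium: locating multiple elements
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
# 1. Create the browser
browser = webdriver.Chrome()
# 2. Open the Douban movie Top 250 page
browser.get("https://movie.douban.com/top250")
# 3. Locate the 25 movie entries on the page
# (find_elements_by_xpath was removed in Selenium 4.3; use find_elements(By.XPATH, ...))
ret = browser.find_elements(By.XPATH, "//*[@class='item']")
for temp in ret:
    print(temp)
time.sleep(3)
# 4. Quit the browser
browser.quit()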
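Topic 10: Intermediate_19_selenium: reading information from a located element
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
# 1. Create the browser
browser = webdriver.Chrome()
# 2. Open the Douban home page
browser.get("https://www.douban.com")
# 3_1. Locate the h1 tags and print the text of the first one
ret = browser.find_elements(By.TAG_NAME, "h1")
print(ret[0].text)
# 3_2. Locate links whose text contains "下载豆瓣 App" and print the first link's href attribute
ret = browser.find_elements(By.PARTIAL_LINK_TEXT, "下载豆瓣 App")
print(ret[0].get_attribute("href"))
time.sleep(3)
# 4. Quit the browser
browser.quit()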
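Topic 11: Intermediate_20_selenium: handling cookies
from selenium import webdriver
# 1. Create the browser
browser = webdriver.Chrome()
# 2. Open the Baidu page
browser.get("https://www.baidu.com")
# 3. Get the cookies
cookie_list = browser.get_cookies()
print(cookie_list)
""" Convert to the dict form that requests and similar libraries expect. A browser sends only each cookie's name and value with a new request, so only name and value are extracted here; the other fields are not needed. """
# 4. Keep only the name and value of each cookie
cookie_dict = {
    x['name']: x["value"] for x in cookie_list
}
print(cookie_dict)
# 5. Close the page
browser.close()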
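The same conversion can be tried without a browser. The sketch below runs on a hand-written list shaped like the output of `get_cookies()`. The cookie names and values are made up, and `cookies_to_dict` and `cookie_header` are hypothetical helpers rather than part of selenium.

```python
def cookies_to_dict(cookie_list):
    """Keep only the name/value pair of each cookie."""
    return {c["name"]: c["value"] for c in cookie_list}


def cookie_header(cookie_dict):
    """Join the pairs into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


# Made-up sample data shaped like the list that get_cookies() returns.
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "zh", "domain": ".example.com", "path": "/"},
]

cookie_dict = cookies_to_dict(sample_cookies)
print(cookie_dict)
print(cookie_header(cookie_dict))
```

With requests, the dict form is enough: `requests.get(url, cookies=cookie_dict)`. The header string is useful for clients that only accept raw headers.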
Topic 12: Intermediate_21_selenium: waiting for the page to load
""" help1: browser.back() and browser.forward() go back and forward in the page history, but they are not recommended: going back uses the cached page instead of sending a new request, so the data may be missing """
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# 1. Create the browser driver object
browser = webdriver.Chrome()
# 2. Create the wait object
wait_ob = WebDriverWait(browser, 10)  # wait at most 10 seconds
# 3. Load the URL
browser.get("https://jd.com")
# 4. Wait until the search box is present in the DOM
search_input = wait_ob.until(EC.presence_of_element_located((By.ID, 'key')))
print(search_input)
# 5. Type the query
search_input.send_keys("Mac Pro")
time.sleep(3)
# 6. Quit the browser
browser.quit()
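An explicit wait like the one above boils down to polling: call a condition repeatedly until it returns something truthy or the timeout expires. The plain-Python sketch below illustrates that idea only; it is not Selenium's implementation, and `wait_until`, `WaitTimeout`, and `ready_on_third_call` are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Example: a condition that becomes truthy on the third call.
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None


print(wait_until(ready_on_third_call, timeout=2.0, poll=0.01))
```

Compared with a fixed `time.sleep(3)`, polling returns as soon as the condition holds and fails loudly when it never does.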