self.crawl
2016-09-05 15:00:31
manbuheiniu
Last edited by manbuheiniu on 2016-09-05 16:13:49
self.crawl(url, **kwargs)
self.crawl is one of the most important interfaces in pyspider: it tells pyspider which URLs should be crawled.
Parameters:
url
The URL or list of URLs to be crawled.
callback
The method used to parse the response after the page is crawled. default: __call__. For example:

def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
self.crawl also accepts the following optional parameters:
age
The validity period of the task; the same task will not be crawled again within this period. default: -1 (never expires, i.e. the page is only crawled once).

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...

Here every task whose callback is index_page has a validity period of 10 days; if the same task is encountered again within 10 days, it will be ignored (unless a force parameter such as force_update is set).
priority
The priority of the task: the higher the value, the earlier the task is scheduled. default: 0

def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page, priority=1)

If both tasks are put into the queue at the same time, 233.html will be crawled first. You can use this parameter to perform a BFS and reduce the number of tasks waiting in the queue (which might otherwise cost more memory).
exetime
The time at which to execute the task, as a unix timestamp. default: 0 (immediately)

import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time() + 30*60)
The page would be crawled 30 minutes later.
retries
Number of retries when the fetch fails. default: 3
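For instance, a minimal sketch that raises the retry count for an unreliable host (the URL is just a placeholder):

def on_start(self):
    # retry up to 5 times instead of the default 3
    self.crawl('http://www.example.org/', callback=self.callback, retries=5)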
itag
A marker value attached to the task that is compared when the task is scheduled. If the value changes, the page will be recrawled regardless of whether its validity period (age) has expired. It is mostly used to detect whether the content has been modified, or to force a recrawl. default: None

def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())

In this example, the text of the page's .update-time element is used as the itag to decide whether the content has been updated.
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }

Change the project-wide itag in crawl_config to re-crawl every task (you still need to click the run button to restart the tasks).
auto_recrawl
When enabled, the task will be recrawled every time its age expires. default: False

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)

The page would be recrawled every 5 hours (its age).
method
The HTTP method to use. default: GET
params
A dictionary of query parameters to append to the URL, e.g.:

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)

These two tasks are identical.
data
The body to attach to the request. If a dict is given, it will be form-encoded before being attached.

def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
files
dictionary of {field: {filename: 'content'}} files to multipart upload.
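A minimal sketch of a multipart upload, assuming httpbin.org as the target; the field name and file content are hypothetical:

def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               # hypothetical field name, filename and content for illustration
               files={'upload': {'report.txt': 'file content'}})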
headers
Custom HTTP headers for the request (dict).
cookies
Custom cookies for the request (dict).
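A minimal sketch that sets both headers and cookies on a request; the header values and the cookie name/value are placeholders:

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               headers={'User-Agent': 'Mozilla/5.0',
                        'Referer': 'http://www.example.org/'},
               # placeholder session cookie for illustration
               cookies={'sessionid': 'abc123'})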
connect_timeout
Timeout for the initial connection, in seconds. default: 20
timeout
Maximum time, in seconds, to wait for the page content. default: 120
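A minimal sketch that tightens both timeouts for a slow site (the values are arbitrary):

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               connect_timeout=10, timeout=60)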
allow_redirects
Whether to follow 30x redirects automatically. default: True
validate_cert
Whether to validate the certificate when crawling HTTPS URLs. default: True
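A minimal sketch that turns off both redirect following and certificate validation, e.g. for a host with a self-signed certificate (the URL is a placeholder):

def on_start(self):
    self.crawl('https://www.example.org/', callback=self.callback,
               allow_redirects=False, validate_cert=False)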
proxy
Proxy server to use, in the form username:password@hostname:port. Currently only HTTP proxies are supported.

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }

Setting proxy in Handler.crawl_config applies to the whole project; every task in the project will be fetched through the proxy.
etag
use the HTTP Etag mechanism to skip fetching when the content of the page has not changed. default: True
last_modified
use the HTTP Last-Modified header mechanism to skip fetching when the content of the page has not changed. default: True
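If a server sends misleading cache headers, both mechanisms can presumably be disabled per request; a minimal sketch under that assumption:

def on_start(self):
    # assumption: passing False disables the Etag / Last-Modified checks for this task
    self.crawl('http://www.example.org/', callback=self.callback,
               etag=False, last_modified=False)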
fetch_type
Whether to enable the JavaScript rendering engine; set it to 'js' to enable. default: None
js_script
JavaScript to run before or after the page is loaded; it should be wrapped in a function, e.g. function() { document.write("binux"); }.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0, document.body.scrollHeight);
                   return 123;
               }
               ''')
The script would scroll the page to the bottom. The value returned by the function can be captured via Response.js_script_result.
js_run_at
run JavaScript specified via js_script at document-start or document-end. default: document-end
js_viewport_width/js_viewport_height
set the size of the viewport used by the JavaScript fetcher for the layout process.
load_images
load images when the JavaScript fetcher is enabled. default: False
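A minimal sketch that combines fetch_type='js' with js_run_at, the viewport size and load_images (the values are arbitrary):

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_run_at='document-end',
               js_viewport_width=1024, js_viewport_height=768*3,
               load_images=True)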
save
An object to pass along with the task; it can be retrieved in the callback via response.save.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']

In the callback, 123 will be returned.
taskid
A unique taskid used to distinguish different tasks. By default the taskid is the md5 of the URL. You can also override the built-in behaviour with a def get_taskid(self, task) method to customize the task id, e.g.:

import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))

In this example the task id is based on more than the URL: tasks with different data parameters get different task ids.
force_update
force update task params even if the task is in ACTIVE status.
cancel
cancel a task; it should be used together with force_update to cancel an active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
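A minimal sketch of cancelling a previously scheduled auto_recrawl task, assuming the task is identified by the same URL:

def on_start(self):
    # resubmit the same URL with force_update + cancel, and turn off auto_recrawl
    self.crawl('http://www.example.org/', callback=self.callback,
               force_update=True, cancel=True, auto_recrawl=False)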
cURL command
self.crawl(curl_command)
cURL is a command-line tool for making HTTP requests. A cURL command can easily be obtained from the Chrome DevTools > Network panel: right-click the request and choose "Copy as cURL".
You can pass a cURL command as the first argument of self.crawl; pyspider will parse the command and make the HTTP request just like curl does.
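A minimal sketch, assuming a command copied from Chrome DevTools; the header values are placeholders and only basic flags such as -H are shown:

def on_start(self):
    self.crawl("curl 'http://httpbin.org/get' -H 'User-Agent: Mozilla/5.0' -H 'Accept: text/html'",
               callback=self.callback)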
@config(**kwargs)
Default parameters for self.crawl when the decorated method is used as the callback. For example:

@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}

The age of list-1.html is 15 minutes, while the age of product-233 is 10 days: because the callback of product-233 is detail_page, it shares the config of detail_page.
Handler.crawl_config = {}
Default parameters of self.crawl for the whole project. The parameters in crawl_config for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are merged when the task is created; the parameters for the fetcher and processor are merged when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }
    ...

This crawl_config sets a project-level User-Agent.