I play CS:GO myself and wanted to grab the whole site's data for visualization and price-prediction analysis, so let's get straight into the scraping.
1. Analyze the site structure, figure out how the data is loaded, and scrape the skin categories from the start page
Open the homepage of the NetEase BUFF CS:GO skin marketplace (buff.163.com) and look at the skin categories. Start with the knife category, open the browser DevTools to capture the network requests, and search the captured responses for a keyword — for example a skin's listed price of 7120.
It turns out that the skin details and prices are returned by the API endpoint https://buff.163.com/api/market/goods? and then rendered into the page, so parsing the front-end HTML with XPath alone won't get us this data. Switching to other categories reveals the parameter pattern of the interface: the key parameters are game=csgo (the game), page_num=1 (the page number) and category_group=knife (the category), and every category's value can be found in the front-end page. For example, page 1 of the knife category is requested as https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group=knife&use_suggestion=0&trigger=undefined_trigger&_=<timestamp>.
With that analysis done we can get the category titles, category links and skin attributes, so let's build a standard Scrapy project.
Run scrapy startproject Buff
cd into the Buff directory, then run
scrapy genspider arms https://buff.163.com/market/csgo
That completes the project scaffold.
After running those two commands, Scrapy generates the project files; open the newly created arms spider (arms.py).
The freshly generated spider only has a few lines: name is the spider name we defined, allowed_domains is the allowed domain, and start_urls holds the starting URL. After creating the project, first check that the domain and start URL match the site; if they don't, fix them yourself.
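For reference, the generated spider looks roughly like this (the exact template varies a bit between Scrapy versions):

# -- arms.py (as generated) --
import scrapy

class ArmsSpider(scrapy.Spider):
    name = 'arms'
    allowed_domains = ['buff.163.com']
    start_urls = ['https://buff.163.com/market/csgo']

    def parse(self, response):
        pass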
Now for the code. First be clear about what we need to scrape at this stage: the category name and the category page URL. Open items.py and model those fields.
# -- items.py --
import scrapy

class BuffItem(scrapy.Item):
    biglabel = scrapy.Field()
    biglabel_link = scrapy.Field()
Over in arms.py, Scrapy's built-in extractors are XPath and CSS; here we use XPath to extract the category names and then build each category's API URL. The URL carries a timestamp parameter, and I'm not sure whether the server validates it, so we keep it just in case.
# -- arms.py --
import scrapy
import time

class ArmsSpider(scrapy.Spider):
    name = 'arms'
    allowed_domains = ['buff.163.com']
    start_urls = ['https://buff.163.com/market/csgo']

    def parse(self, response):
        node_list = response.xpath('//*[@class="h1z1-selType type_csgo"]/div')
        for node in node_list:
            base_data = {}
            base_data['biglabel'] = node.xpath('.//p/text()').get()
            base_data['biglabel_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(node.xpath('.//p/@value').get(), int(time.time() * 1000))
            print(base_data)
Once that's in place, run the arms spider and check that base_data prints correctly.
In the terminal, run scrapy crawl arms  # run the spider
With step one done, the next job is to work out how to fetch the detail data in step two and implement pagination.
2. Scrape the detail data
To get the detail data we request the URLs we just built and define a new callback, parse_img(self, response); the data collected in parse is passed along to parse_img through meta. We also need to attach our cookies in parse. Note the format Scrapy expects: cookies is a dict, so every field of the raw cookie string has to be turned into a key/value pair, as shown below. Without the cookies, the request to biglabel_link returns no data. At the same time we open settings.py and middlewares.py to configure the User-Agent, because Scrapy's default UA is easy for the server to spot.
# -- arms.py --
def parse(self, response):
    node_list = response.xpath('//*[@class="h1z1-selType type_csgo"]/div')
    for node in node_list:
        base_data = {}
        base_data['biglabel'] = node.xpath('.//p/text()').get()
        base_data['value'] = node.xpath('.//p/@value').get()
        base_data['biglabel_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(node.xpath('.//p/@value').get(), int(time.time() * 1000))
        # cookie string copied from a logged-in browser session
        cookie = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676; _gid=GA1.2.109923849.1658721677; NTES_YD_SESS=XFu19pwcHN6Blr5FRmVVtMV81vu_q8LPnNvqzF_aBBLVAZ_WA7Kw6g3z0x.OnVZMav2Ct9VuprshE6tCMRo1iMtqZtzZa9kp4Y77cost521PJbbZt_Zw9WtdpVwDUUF4QXKWPYURB6P8PZT97Ar4Rde7Tg2EiB1L5n9lVw.3Z6GrETAU6i5ct03n9LcMEld0JF7Zqj_Gl2wTJGt1fx3tyz8NuI1YoOmb7Oh9VTxwoqYE3; S_INFO=1658722544|0|0&60##|18958675241; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-UWdoO73qKkqBcWzo4Cz2l1lZz2HToVVUjAFknHzNIT6n2038696921; _gat_gtag_UA_109989484_1=1; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658722546; csrf_token=ImZjMDM1NzJhOTVmYWI2NGRmMjJkN2I1ZDUzYTBkMGIzZGM4N2ZjOTIi.Fb-fQ.kliU6aNIb4iHTYaX16iMNAY73VI'
        # Scrapy wants cookies as a dict, so split the raw string into key/value pairs
        cookies = {data.split('=')[0]: data.split('=')[1] for data in cookie.split(';')}
        yield scrapy.Request(
            url=base_data['biglabel_link'],
            callback=self.parse_img,
            meta={'base_data': base_data},
            cookies=cookies
        )

def parse_img(self, response):
    temp = response.meta['base_data']
The UA setup looks like this: add a USER_AGENT_LIST to settings.py and pick a random UA in a downloader middleware; remember to enable the downloader middleware in settings.
# -- middlewares.py --
from .settings import USER_AGENT_LIST
import random

class RandomUserAgent:
    def process_request(self, request, spider):
        UserAgent = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = UserAgent  # swap in a random UA for every request
Add the following to settings.py:
# -- settings.py --
DOWNLOADER_MIDDLEWARES = {
'Buff.middlewares.RandomUserAgent': 300,
}
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
"Opera/8.0 (Windows NT 5.1; U; en)",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
]
ROBOTSTXT_OBEY = False
Since the detail data comes back from the API as JSON, Scrapy's default XPath and CSS selectors are no use here. Instead we import the json and jsonpath packages in arms.py; jsonpath is an extractor for JSON data and its syntax feels much like XPath. When debugging JSON extraction for Scrapy, I recommend requesting a single page with requests first, getting the extraction rules right, and only then dropping them into the spider.
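A minimal standalone sketch of that prototyping step might look like this (my own sketch rather than code from the project; the UA is one entry from the list above, and the cookie values are placeholders for your own logged-in cookie string):

# quick single-page test of the jsonpath rules before moving them into Scrapy
import time
import requests
import jsonpath

url = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group=knife&use_suggestion=0&trigger=undefined_trigger&_={}'.format(int(time.time() * 1000))
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
cookie = 'session=YOUR_SESSION; csrf_token=YOUR_TOKEN'  # placeholder: paste your own cookie string here
cookies = {data.split('=')[0]: data.split('=')[1] for data in cookie.split(';')}

json_data = requests.get(url, headers=headers, cookies=cookies).json()
print(jsonpath.jsonpath(json_data, '$..items[*].name'))
print(jsonpath.jsonpath(json_data, '$..items[*].sell_min_price'))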
Drop the jsonpath rules straight into the spider to extract the data. At the same time, open items.py and model these new fields; once that's done, import the BuffItem class from items.py into arms.py and instantiate it.
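The expanded items.py isn't shown in the original post, but to hold the fields used in parse_img below it needs to look roughly like this:

# -- items.py --
import scrapy

class BuffItem(scrapy.Item):
    biglabel = scrapy.Field()
    biglabel_link = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    market_name = scrapy.Field()
    price = scrapy.Field()
    exterior_wear = scrapy.Field()
    quality = scrapy.Field()
    rarity = scrapy.Field()
    type = scrapy.Field()
    weapon_type = scrapy.Field()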
# -- arms.py --
def parse_img(self, response):
    base_data = response.meta['base_data']
    json_data = json.loads(response.text)
    id = jsonpath.jsonpath(json_data, '$..items[*].id')
    name = jsonpath.jsonpath(json_data, '$..items[*].name')
    market_name = jsonpath.jsonpath(json_data, '$..items[*].market_hash_name')
    price = jsonpath.jsonpath(json_data, '$..items[*].sell_min_price')
    exterior_wear = jsonpath.jsonpath(json_data, '$..info.tags.exterior.localized_name')
    quality = jsonpath.jsonpath(json_data, '$..info.tags.quality.localized_name')
    rarity = jsonpath.jsonpath(json_data, '$..info.tags.rarity.localized_name')
    type = jsonpath.jsonpath(json_data, '$..info.tags.type.localized_name')
    weapon_type = jsonpath.jsonpath(json_data, '$..info.tags.weapon.localized_name')
    for i in range(len(id)):
        item = BuffItem()
        item['biglabel'] = base_data['biglabel']
        item['biglabel_link'] = base_data['biglabel_link']
        item['id'] = id[i]
        item['name'] = name[i]
        item['market_name'] = market_name[i]
        item['price'] = price[i]
        item['exterior_wear'] = exterior_wear[i]
        item['quality'] = quality[i]
        item['rarity'] = rarity[i]
        item['type'] = type[i]
        item['weapon_type'] = weapon_type[i]
        yield item
Run the spider again and the items now come out with the full detail data.
3. Implementing pagination
Up to this point we can only crawl a single page per category, which is nowhere near the goal of grabbing the whole site. In this step we implement pagination so that every page gets extracted.
The pagination logic lives in the parse_img function of arms.py. First extract the total page count, which is trivial:
page = jsonpath.jsonpath(json_data, '$.data.total_page')[0]
Then write it into the spider:
# -- arms.py --
page = jsonpath.jsonpath(json_data, '$.data.page_num')[0] + 1
pages = jsonpath.jsonpath(json_data, '$.data.total_page')[0]
if page <= pages:
    # print(base_data['value'])
    next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(page, base_data['value'], int(time.time() * 1000))
    cookie = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676; _gid=GA1.2.109923849.1658721677; NTES_YD_SESS=XFu19pwcHN6Blr5FRmVVtMV81vu_q8LPnNvqzF_aBBLVAZ_WA7Kw6g3z0x.OnVZMav2Ct9VuprshE6tCMRo1iMtqZtzZa9kp4Y77cost521PJbbZt_Zw9WtdpVwDUUF4QXKWPYURB6P8PZT97Ar4Rde7Tg2EiB1L5n9lVw.3Z6GrETAU6i5ct03n9LcMEld0JF7Zqj_Gl2wTJGt1fx3tyz8NuI1YoOmb7Oh9VTxwoqYE3; S_INFO=1658722544|0|0&60##|18958675241; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-UWdoO73qKkqBcWzo4Cz2l1lZz2HToVVUjAFknHzNIT6n2038696921; _gat_gtag_UA_109989484_1=1; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658722546; csrf_token=ImZjMDM1NzJhOTVmYWI2NGRmMjJkN2I1ZDUzYTBkMGIzZGM4N2ZjOTIi.Fb-qfQ.kliU6aNIb4iHTYaX16iMNAY73VI'
    cookies = {data.split('=')[0]: data.split('=')[1] for data in cookie.split(';')}
    yield scrapy.Request(
        url=next_url,
        callback=self.parse_img,
        meta={'base_data': base_data},
        cookies=cookies
    )
page is defined as the current page number plus one — on the first page, page becomes 2 — and pages is the total page count, say 100. As long as page is less than or equal to pages, the spider keeps requesting the next page.
A small hiccup here: crawling too fast got me rate-limited by the site. To stay on good terms, open settings.py and set a download delay.
Add to settings.py:
DOWNLOAD_DELAY = 3       # download delay in seconds
CONCURRENT_REQUESTS = 8  # Scrapy's maximum concurrency, default is 16
During testing a recurring error kept showing up: 'bool' object is not subscriptable. jsonpath returns False when a rule matches nothing, and opening the corresponding URLs showed that some skins are missing a few of the tags — normally there are five under info.
Fix that in arms.py; the final version looks like this:
# -- arms.py --
import scrapy
import time
import json
import jsonpath

from ..items import BuffItem

# logged-in cookie string copied from the browser, defined once so it isn't repeated in every request
COOKIE = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676; _gid=GA1.2.109923849.1658721677; NTES_YD_SESS=XFu19pwcHN6Blr5FRmVVtMV81vu_q8LPnNvqzF_aBBLVAZ_WA7Kw6g3z0x.OnVZMav2Ct9VuprshE6tCMRo1iMtqZtzZa9kp4Y77cost521PJbbZt_Zw9WtdpVwDUUF4QXKWPYURB6P8PZT97Ar4Rde7Tg2EiB1L5n9lVw.3Z6GrETAU6i5ct03n9LcMEld0JF7Zqj_Gl2wTJGt1fx3tyz8NuI1YoOmb7Oh9VTxwoqYE3; S_INFO=1658722544|0|0&60##|18958675241; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-UWdoO73qKkqBcWzo4Cz2l1lZz2HToVVUjAFknHzNIT6n2038696921; _gat_gtag_UA_109989484_1=1; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658722546; csrf_token=ImZjMDM1NzJhOTVmYWI2NGRmMjJkN2I1ZDUzYTBkMGIzZGM4N2ZjOTIi.Fb-qfQ.kliU6aNIb4iHTYaX16iMNAY73VI'
# Scrapy expects cookies as a dict of key/value pairs
COOKIES = {data.split('=')[0]: data.split('=')[1] for data in COOKIE.split(';')}


class ArmsSpider(scrapy.Spider):
    name = 'arms'
    allowed_domains = ['buff.163.com']
    start_urls = ['https://buff.163.com/market/csgo']

    def parse(self, response):
        # one node per skin category in the category bar
        node_list = response.xpath('//*[@class="h1z1-selType type_csgo"]/div')
        for node in node_list:
            base_data = {}
            base_data['biglabel'] = node.xpath('.//p/text()').get()
            base_data['value'] = node.xpath('.//p/@value').get()
            base_data['biglabel_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(base_data['value'], int(time.time() * 1000))
            yield scrapy.Request(
                url=base_data['biglabel_link'],
                callback=self.parse_img,
                meta={'base_data': base_data},
                cookies=COOKIES
            )

    def parse_img(self, response):
        base_data = response.meta['base_data']
        json_data = json.loads(response.text)
        id = jsonpath.jsonpath(json_data, '$..items[*].id')
        name = jsonpath.jsonpath(json_data, '$..items[*].name')
        market_name = jsonpath.jsonpath(json_data, '$..items[*].market_hash_name')
        price = jsonpath.jsonpath(json_data, '$..items[*].sell_min_price')
        exterior_wear = jsonpath.jsonpath(json_data, '$..info.tags.exterior.localized_name')
        quality = jsonpath.jsonpath(json_data, '$..info.tags.quality.localized_name')
        rarity = jsonpath.jsonpath(json_data, '$..info.tags.rarity.localized_name')
        type = jsonpath.jsonpath(json_data, '$..info.tags.type.localized_name')
        weapon_type = jsonpath.jsonpath(json_data, '$..info.tags.weapon.localized_name')
        for i in range(len(id)):
            item = BuffItem()
            item['biglabel'] = base_data['biglabel']
            item['biglabel_link'] = base_data['biglabel_link']
            item['id'] = id[i]
            item['name'] = name[i]
            item['market_name'] = market_name[i]
            item['price'] = price[i]
            # jsonpath returns False when a tag is missing, so fall back to an empty string
            if not exterior_wear:
                item['exterior_wear'] = ''
            else:
                item['exterior_wear'] = exterior_wear[i]
            if not quality:
                item['quality'] = ''
            else:
                item['quality'] = quality[i]
            if not rarity:
                item['rarity'] = ''
            else:
                item['rarity'] = rarity[i]
            if not type:
                item['type'] = ''
            else:
                item['type'] = type[i]
            if not weapon_type:
                item['weapon_type'] = ''
            else:
                item['weapon_type'] = weapon_type[i]
            yield item

        # pagination: keep requesting the next page of the same category until total_page is reached
        page = jsonpath.jsonpath(json_data, '$.data.page_num')[0] + 1
        pages = jsonpath.jsonpath(json_data, '$.data.total_page')[0]
        if page <= pages:
            next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(page, base_data['value'], int(time.time() * 1000))
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_img,
                meta={'base_data': base_data},
                cookies=COOKIES
            )
There are also third- and fourth-level pages: the third level gives the exact wear (float) value and the seller listings, and the fourth level exposes some of the seller's profile information. To my surprise, NetEase BUFF has essentially no anti-scraping measures; passing the login slider captcha is all it takes to get the account's cookies.
For deeper crawling we could analyze those third- and fourth-level pages and use scrapy-redis for distributed crawling to speed things up — recommended when time is tight. For this demo we'll only use Redis for resumable (breakpoint-continue) crawling, so let's get to it.
4. Resumable crawling, and saving the data to MySQL
Resumable crawling is really just a couple of extra steps: plug in the scrapy-redis duplicate filter and its Redis-backed scheduler.
Start Redis, then add the following to settings.py.
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# the duplicate filter module
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# the scheduler; the scrapy_redis scheduler knows how to talk to the Redis database
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# whether to keep the dedup set and request queue in Redis when the spider finishes
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    # 'Buff.pipelines.BuffPipeline': 300,
    # when enabled, this pipeline writes the items into Redis
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
REDIS_HOST = 'host'
REDIS_PORT = port
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': 'your redis password'}
After a run you can see two new keys in Redis: the dupefilter key holds the request fingerprints used for URL deduplication — Scrapy stores hashed fingerprints rather than raw URLs, which saves a lot of memory — and the items key holds our scraped data.
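To peek at those keys from Python, a quick sketch of my own — assuming the usual scrapy-redis key names arms:dupefilter and arms:items, and the same connection details as in settings.py:

import redis

# connect with the host/port/password you configured for scrapy-redis
r = redis.Redis(host='host', port=6379, password='your redis password')
print(r.scard('arms:dupefilter'))  # number of stored request fingerprints
print(r.llen('arms:items'))        # number of items pushed by RedisPipeline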
Incremental crawling works on the same principle: before requesting a page, check the database; if the chosen key is already there, skip that request and only fetch the data we haven't collected yet.
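As a rough illustration of that idea (not something this project actually uses), an incremental filter could be a small pipeline that records each item id in a Redis set and drops anything it has already seen:

import redis
from scrapy.exceptions import DropItem

class IncrementalFilterPipeline:
    def open_spider(self, spider):
        # assumption: the same Redis instance configured for scrapy-redis above
        self.r = redis.Redis(host='host', port=6379, password='your redis password')

    def process_item(self, item, spider):
        # sadd returns 0 when the id is already in the set, i.e. we scraped this item before
        if self.r.sadd('buff:seen_ids', item['id']) == 0:
            raise DropItem('item {} already scraped'.format(item['id']))
        return item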
For persistent storage we need MySQL or a CSV file; here we'll demo both. Open pipelines.py and write our pipelines (by default the file only contains an empty BuffPipeline stub).
Modify pipelines.py as follows.
# -- pipelines.py --
import csv
import pymysql

class BuffPipeline:
    def open_spider(self, spider):
        # open the csv file once when the spider starts and write the header row
        self.file = open(file='arms.csv', mode='a', encoding='utf-8-sig', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['biglabel', 'biglabel_link', 'id', 'name', 'market_name', 'price', 'exterior_wear', 'quality', 'rarity', 'type', 'weapon_type'])

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['biglabel'], item['biglabel_link'], item['id'], item['name'], item['market_name'], item['price'], item['exterior_wear'], item['quality'], item['rarity'], item['type'], item['weapon_type']])
        return item

    def close_spider(self, spider):
        self.file.close()

class MysqlPipeline:
    def open_spider(self, spider):
        self.mysql = pymysql.connect(host='localhost', user='root', password='your mysql password', db='sys', port=3306, charset='utf8')
        self.cursor = self.mysql.cursor()
        sql = '''
        create table if not exists buff(
            biglabel char(255),
            biglabel_link char(255),
            id char(255),
            name char(255),
            market_name varchar(500),
            price char(30),
            exterior_wear char(255),
            quality varchar(255),
            rarity char(255),
            type char(255),
            weapon_type char(255))
        '''
        self.cursor.execute(sql)  # create the table (skipped if it already exists)

    def process_item(self, item, spider):
        # parameterized insert, so quotes in skin names don't break the SQL
        insert_sql = '''insert into buff(biglabel, biglabel_link, id, name, market_name, price, exterior_wear, quality, rarity, type, weapon_type) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''
        self.cursor.execute(insert_sql, (item['biglabel'], item['biglabel_link'], item['id'], item['name'], item['market_name'], item['price'], item['exterior_wear'], item['quality'], item['rarity'], item['type'], item['weapon_type']))
        self.mysql.commit()  # commit after every row
        return item

    # close the cursor and connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.mysql.close()
Remember to start MySQL and Redis beforehand, then register our pipelines in settings.py next to the Redis settings from the previous step (see the sketch below), and run the spider to get the final result.
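The final ITEM_PIPELINES block isn't shown in the original post, but based on the classes above it would look something like this (the priority numbers are my own choice):

# -- settings.py --
ITEM_PIPELINES = {
    'Buff.pipelines.BuffPipeline': 300,           # write to csv
    'Buff.pipelines.MysqlPipeline': 301,          # write to mysql
    'scrapy_redis.pipelines.RedisPipeline': 400,  # keep pushing items to redis for the resumable crawl
}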
That's all for now.