moguibeijing 2019-06-26
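The goal of this Scrapy crawler is to scrape the information on every bank wealth-management product listed on the Rong360 (融360) website and store it in MongoDB. A screenshot of the page is shown below; the full data set contains more than 120,000 records.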

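  We will not go over how to create and run a Scrapy project again and only give the relevant code. Readers interested in those steps can refer to: Scrapy爬虫(4)爬取豆瓣电影Top250图片 ("Scrapy Crawler (4): Crawling the Douban Movie Top 250 Images").
  Modify items.py as follows. It defines the fields that store each product's information, such as the product name and the issuing bank.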
import scrapy


class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # product name
    bank = scrapy.Field()  # issuing bank
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()

Create the spider file bankSpider.py as follows. It scrapes the details of each wealth-management product from the page.
import scrapy
from bank.items import BankItem


class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):
        # Skip the first <tr> (presumably the table header row)
        trs = response.css('tr')[1:]

        for tr in trs:
            # Create a fresh item for every row so yielded items do not share state
            item = BankItem()
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()
            yield item

        # With a single a.next-page link, use it; with several, use the second one.
        # With none, next_page_link stays None and the crawl stops.
        next_pages = response.css('a.next-page')
        next_page_link = None
        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        elif len(next_pages) > 1:
            next_page_link = next_pages[1].xpath('@href').extract_first()

        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)

To store the scraped data in MongoDB, we need to modify pipelines.py as follows:
# Pipeline that inserts the scraped items into MongoDB
import pymongo


class BankPipeline(object):
    def __init__(self, settings):
        # Connect to the database
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        # Log in to MongoDB with a user name and password (authenticate() exists only in PyMongo 3.x)
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])

        # Handles to the MongoDB database and collection
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline; crawler.settings holds the values from settings.py
        return cls(crawler.settings)

    def process_item(self, item, spider):
        postItem = dict(item)
        # insert_one is the current PyMongo call for inserting a single document
        self.coll.insert_one(postItem)
        return item

The MongoDB parameters, such as MONGO_HOST and MONGO_PORT, are set in settings.py. Modify settings.py as follows:
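MONGO_HOST = "localhost"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "Spider"  # database name
MONGO_COLL = "bank"  # collection name
# MONGO_USER = ""
# MONGO_PSW = ""
# Enable the pipeline so that Scrapy calls it (module path assumes the project is named bank)
ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}

The user name and password can be added as needed.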
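If authentication is needed, one option is to pass the credentials to PyMongo through the connection URI. The following is only a minimal sketch under stated assumptions: `build_mongo_uri` is a hypothetical helper that is not part of the original project, and `authSource=admin` is assumed because the commented-out login in the pipeline targets the `admin` database.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, password=None, auth_db="admin"):
    """Build a MongoDB connection URI, adding credentials only when both are given.

    Hypothetical helper for illustration; auth_db="admin" is an assumption.
    """
    if user and password:
        # Percent-encode so characters such as '@' or ':' do not break the URI.
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
        return f"mongodb://{credentials}{host}:{port}/?authSource={auth_db}"
    return f"mongodb://{host}:{port}/"


# Usage inside the pipeline (requires pymongo and a running MongoDB server):
#   client = pymongo.MongoClient(build_mongo_uri(
#       settings['MONGO_HOST'], settings['MONGO_PORT'],
#       settings.get('MONGO_USER'), settings.get('MONGO_PSW')))
```

With empty MONGO_USER and MONGO_PSW values the helper falls back to an unauthenticated URI, so the same pipeline code would work in both cases.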
Next, we can run the crawler. The output is as follows:

The crawl took 3 hours in total and collected more than 120,000 records. The efficiency is impressive!
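To put these figures in perspective, the average throughput can be worked out directly. This is a rough estimate only: 120,000 is used as a lower bound because the exact record count is not given, and the 3 hours are taken as exact.

```python
records = 120_000            # lower bound: the post reports "more than 120,000"
elapsed_seconds = 3 * 3600   # 3 hours

records_per_second = records / elapsed_seconds
print(f"{records_per_second:.1f} records/s")  # prints "11.1 records/s"
```

So the crawl averaged at least about 11 records per second.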
  Finally, let's take a look at the data in MongoDB:
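The same check could also be scripted. The sketch below is an assumption about how one might do it, not the method used in this post: `collection_summary` is a hypothetical helper that relies only on PyMongo's `count_documents` and `find_one` collection methods.

```python
def collection_summary(coll):
    """Return the number of stored documents and one sample document.

    `coll` is expected to behave like a pymongo Collection.
    """
    return {
        "count": coll.count_documents({}),  # an empty filter matches every document
        "sample": coll.find_one(),          # any single document, or None if empty
    }


# Usage against a live server (requires pymongo and a running MongoDB instance):
#   import pymongo
#   client = pymongo.MongoClient("localhost", 27017)
#   print(collection_summary(client["Spider"]["bank"]))
```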

Perfect! That's all for this post. Feel free to get in touch and discuss~~