ZHANGRENXIANG00 2020-06-09
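Scrapy has two kinds of middleware: spider middleware and downloader middleware
Spider middleware: sits between the engine and the spider
Downloader middleware: sits between the engine and the downloader
These notes deal mainly with downloader middleware
Purpose: intercept requests and responses in bulk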
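UA spoofing: give the intercepted requests as many different User-Agent identities as possible
request.headers['User-Agent'] = 'xxx'
This requires building a pool of User-Agent strings: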
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
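A minimal sketch of drawing one entry from such a pool per request. The helper name apply_random_user_agent and the plain dict standing in for request.headers are assumptions made for illustration; inside a real middleware the same assignment would target request.headers.

```python
import random

# A shortened stand-in for the pool; any list of User-Agent strings works.
user_agent_pool = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def apply_random_user_agent(headers, pool, rng=random):
    """Overwrite the User-Agent entry of `headers` with a random pool member."""
    headers["User-Agent"] = rng.choice(pool)
    return headers


# Hypothetical usage: an empty dict plays the role of request.headers.
headers = apply_random_user_agent({}, user_agent_pool)
```

Passing the random source as a parameter is only a convenience: a seeded random.Random instance can be handed in when repeatable picks are wanted.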
Proxying: send requests through a proxy
request.meta['proxy'] = 'http://ip:port'
This requires building a proxy pool
# Proxies have to be sourced yourself; the ones listed here no longer work
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]
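Choosing a proxy that matches the scheme of the request URL might look like the sketch below. The function name choose_proxy is hypothetical, and the addresses are documentation-range placeholders rather than working proxies. It parses the URL with urllib.parse instead of splitting on ':' by hand.

```python
import random
from urllib.parse import urlsplit

# Placeholder addresses (reserved documentation ranges), not real proxies.
PROXY_http = ["192.0.2.10:8080", "192.0.2.11:8080"]
PROXY_https = ["198.51.100.20:8443"]


def choose_proxy(url, http_pool, https_pool, rng=random):
    """Return a proxy URL taken from the pool that matches the scheme of `url`."""
    if urlsplit(url).scheme == "http":
        return "http://" + rng.choice(http_pool)
    # Everything else is treated as https.
    return "https://" + rng.choice(https_pool)


http_proxy = choose_proxy("http://example.com/page", PROXY_http, PROXY_https)
https_proxy = choose_proxy("https://example.com/page", PROXY_http, PROXY_https)
```

In a middleware the returned string would be assigned to request.meta['proxy'].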
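Note: the proxy should be set not only in process_request but also in process_exception
Reason: once an IP is banned, some sites still answer the request successfully but return an error page, while others make the request fail outright; the two cases should be handled separately in the middleware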
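The two failure cases can be sketched as follows: an error page is caught in process_response, a failed request in process_exception. This is an illustration under stated assumptions, not the author's code: the class name, PROXY_POOL, MAX_PROXY_RETRIES, the 'proxy_retries' meta key and the BLOCK_MARKER signature are all made up. The retry cap and the dont_filter flag are my additions; as far as I know a request handed back by a middleware is rescheduled and could otherwise be dropped as a duplicate. Stand-in objects are used at the end so the sketch runs without Scrapy.

```python
import random
from types import SimpleNamespace

# Placeholder values; every name here is illustrative.
PROXY_POOL = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
MAX_PROXY_RETRIES = 3
BLOCK_MARKER = b"access denied"  # assumed signature of the site's error page


class ProxyRetryMiddleware:
    """Swap the proxy and retry on both error pages and failed requests."""

    def _retry_with_new_proxy(self, request):
        retries = request.meta.get("proxy_retries", 0)
        if retries >= MAX_PROXY_RETRIES:
            return None  # give up and let normal handling continue
        request.meta["proxy_retries"] = retries + 1
        request.meta["proxy"] = random.choice(PROXY_POOL)
        request.dont_filter = True  # avoid the retry being dropped as a duplicate
        return request

    def process_response(self, request, response, spider):
        # Case 1: the request "succeeded" but the body is an error page.
        if BLOCK_MARKER in response.body.lower():
            retry = self._retry_with_new_proxy(request)
            if retry is not None:
                return retry
        return response

    def process_exception(self, request, exception, spider):
        # Case 2: the request failed outright (timeout, refused connection, ...).
        return self._retry_with_new_proxy(request)


# Quick check with stand-in objects in place of Scrapy's Request/Response.
mw = ProxyRetryMiddleware()
request = SimpleNamespace(meta={}, dont_filter=False)
blocked = SimpleNamespace(body=b"<h1>Access Denied</h1>")
retried = mw.process_response(request, blocked, spider=None)
```

Capping the retries keeps a pool of dead proxies from bouncing the same request forever.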
Tampering with the response data, or replacing the response object outright
Example of intercepting requests:
import random

class MovieproDownloaderMiddleware(object):
    # Intercept normal requests; the request argument is the intercepted request object
    def process_request(self, request, spider):
        print('i am process_request()')
        # Give the intercepted requests as many different User-Agent identities as possible
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # Proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return None

    # Intercept responses; the response argument is the intercepted response
    def process_response(self, request, response, spider):
        print('i am process_response()')
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('i am process_exception()')
        # Fix up the failed request, then send it again
        # Proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return request  # resend the corrected request
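A downloader middleware only runs once it is enabled in the project's settings.py. A minimal sketch, assuming the project package is named moviePro (a guess from the class name, not something these notes state); 543 is the priority that Scrapy's project template suggests for the generated downloader middleware.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Dotted path is an assumption: <project package>.middlewares.<class name>
    "moviePro.middlewares.MovieproDownloaderMiddleware": 543,
}
```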
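Use case: the data is loaded dynamically, so requesting the home page URL does not return the data you want; selenium is used to fetch it
1. Instantiate the browser object: do this in the spider class's constructor
bro = webdriver.Chrome(executable_path=r'C:\Users\oldboy-python\Desktop\爬虫+数据\tools\chromedriver.exe')
2. Run the browser automation inside the middleware
3. Close the browser: do this in the spider class's closed(self, reason) method
def closed(self, reason):
    # Scrapy passes the close reason as the argument
    self.bro.quit()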
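Example of a response-intercepting middleware
1. First find the request objects whose responses do not meet the requirement
A container can be defined in the spider class to store the URLs of the requests whose responses do not meet the requirement
It is read as an attribute of spider (spider.<attribute>); spider is the instance of the spider class defined in the spider file
2. Fetch the page again with the browser and wrap the page source in an HtmlResponse; this new response object replaces the one that did not meet the requirement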
from scrapy.http import HtmlResponse
new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)
from time import sleep

def process_response(self, request, response, spider):
    # spider.five_model_urls: the URLs of the five sections
    bro = spider.bro
    if request.url in spider.five_model_urls:
        bro.get(request.url)
        sleep(1)
        page_text = bro.page_source  # contains the dynamically loaded news data
        # If the condition holds, this response belongs to one of the five sections
        new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)
        return new_response
    return response
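A fixed sleep(1) can be too short on a slow page and wasteful on a fast one. One alternative is to poll until the expected content shows up. The sketch below is illustrative only: wait_until and FakeBrowser are hypothetical names, and FakeBrowser merely imitates a page_source that fills in after a few reads. Selenium also ships its own WebDriverWait for this purpose.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeBrowser:
    """Stand-in for a browser whose page fills in on the third read."""

    def __init__(self):
        self._reads = 0

    @property
    def page_source(self):
        self._reads += 1
        if self._reads >= 3:
            return "<div class='news'>...</div>"
        return "<div></div>"


bro = FakeBrowser()
loaded = wait_until(lambda: "news" in bro.page_source, timeout=2.0, interval=0.01)
```

With a real browser the predicate would check bro.page_source (or an element lookup) for whatever marks the dynamically loaded data.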