How to crawl a website's sitemap with Scrapy in Python

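This article shows, by example, how to use Scrapy in Python to crawl the URLs listed in a website's sitemap. It is shared here for reference. The details are as follows: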

import re

import scrapy


class PageItem(scrapy.Item):
    # Placeholder item; replace it with the fields your pages need
    divText = scrapy.Field()


class SitemapSpider(scrapy.Spider):
    name = "SitemapSpider"
    start_urls = ["http://www.domain.com/sitemap.xml"]

    def parse(self, response):
        # Pull every URL wrapped in <loc>...</loc> out of the sitemap body
        nodename = "loc"
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(response.text):
            url = match.group(2).strip()
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do all your page parsing here and select the elements you want
        item = PageItem()
        # Text of the first <div> text node, or None if the page has none
        item["divText"] = response.xpath("//div/text()").get()
        yield item
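
To see what the regular expression in parse actually captures, the extraction step can be tried on its own without running a crawl. The following is a minimal sketch, not part of the original spider: the sample sitemap text, the example.com URLs, and the helper names extract_locs_regex and extract_locs_xml are all made up for illustration. It also shows an XML-parser variant based on the standard library, which is one possible alternative to the regex.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, made up for this illustration
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2015-01-01</lastmod>
  </url>
  <url>
    <loc>
      http://www.example.com/about
    </loc>
  </url>
</urlset>
"""


def extract_locs_regex(text, nodename="loc"):
    """Return the text of every <nodename> element, using the same regex as the spider."""
    pattern = re.compile(
        r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL
    )
    # group(2) is the text between the opening and closing tags;
    # strip() removes any whitespace or newlines around the URL
    return [m.group(2).strip() for m in pattern.finditer(text)]


def extract_locs_xml(text):
    """Return the same URLs using an XML parser instead of a regex."""
    # Sitemap elements live in this namespace, so it must be named in the query
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]


urls = extract_locs_regex(SAMPLE_SITEMAP)
print(urls)
# ['http://www.example.com/', 'http://www.example.com/about']
```

The regex approach needs no knowledge of the XML namespace, which is probably why it works on most sitemaps as-is. The parser variant is stricter and will raise an error on malformed XML, so which one fits depends on how well-formed the target sitemap is.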

I hope this article is helpful for your Python programming.

wlpython
