learnpy 2020-04-26
lxml is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.
I. lxml examples
1. Getting started
# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is missing its closing </li> tag
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the HTML document back to a string (tostring returns bytes)
result = etree.tostring(html)
print(result.decode('utf-8'))
Output
<html><body>
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
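The repair of the missing closing tag can be checked on a smaller input. The following is a minimal sketch: the `broken` string is a made-up fragment, not the document above, and `encoding="unicode"` is passed so that `tostring` returns `str` instead of bytes.

```python
from lxml import etree

# Hypothetical fragment: the second <li> is never closed.
broken = "<ul><li>one</li><li>two</ul>"

# etree.HTML uses the lenient HTML parser, which repairs the markup.
root = etree.HTML(broken)
serialized = etree.tostring(root, encoding="unicode")

# The parser wraps the fragment in <html><body> and closes the open <li>.
print(root.tag)
print(serialized.count("<li>"), serialized.count("</li>"))
```

Counting opening and closing tags in the serialized string is only a quick sanity check; it would not be a reliable way to validate real documents.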
2. Reading from a file
from lxml import etree

# Read the external file hello.html
html = etree.parse('./hello.html')
result = etree.tostring(html, pretty_print=True)
print(result.decode('utf-8'))
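By default `etree.parse` uses the strict XML parser, so it only works here because hello.html is well formed. As a hedged sketch of what happens with less tidy input, the block below writes a made-up broken file (`broken.html` in a temporary directory, both assumptions for illustration) and parses it again with an explicit `etree.HTMLParser()`.

```python
import os
import tempfile

from lxml import etree

# Hypothetical file content with an unclosed <li>.
broken = "<ul><li>one</li><li>two</ul>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "broken.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(broken)

    # The default parser is strict XML and rejects the mismatched tags.
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False

    # An explicit HTMLParser tolerates and repairs the markup.
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath("//li")

print(xml_ok, len(items))
```

Passing the parser explicitly is usually the safer choice for pages fetched from the web, where well-formed markup cannot be assumed.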
3. Contents of hello.html
<html><body>
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
@1. Get all <li> tags
from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')
print(result)  # the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))
Output
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
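Since `xpath` returns an ordinary list of elements, the results can be looped over and inspected one by one. A minimal sketch, assuming an inline copy of part of the sample markup (the `DOC` string and `rows` list are names introduced here for illustration) so that no file is needed:

```python
from lxml import etree

# Inline copy of part of the sample markup, so no file is needed.
DOC = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""

root = etree.HTML(DOC)

rows = []
for li in root.xpath("//li"):
    link = li[0]  # first child element of <li>, here the <a>
    rows.append((li.get("class"), link.get("href"), link.text))

for row in rows:
    print(row)
```

`li[0]` indexes child elements only; whitespace between tags is stored as text and does not count as a child.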
@2. Get the class attribute of every <li> tag
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')
print(result)
Output
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
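Attribute queries return plain string values, so they combine easily with the standard library. The sketch below assumes an inline copy of the sample list (`DOC` is a name introduced here) and tallies the class values with `collections.Counter`.

```python
from collections import Counter

from lxml import etree

# Inline copy of the sample list, so no file is needed.
DOC = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
  <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
"""

root = etree.HTML(DOC)
classes = root.xpath("//li/@class")

# Each result behaves like a normal str.
print(all(isinstance(c, str) for c in classes))

# Count how often each class value occurs.
counts = Counter(classes)
print(counts["item-0"], counts["item-1"], counts["item-inactive"])
```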
@3. Get the <a> tag under <li> whose href is link1.html
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
Output
[<Element a at 0x10ffaae18>]
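Attribute predicates do not have to hard-code the value. As a sketch under the assumption of an inline sample (`DOC`, `target`, `exact` and `inactive` are illustrative names), the block below binds an XPath variable from a keyword argument and uses the XPath 1.0 `contains()` function for a partial match.

```python
from lxml import etree

# Inline copy of part of the sample list, so no file is needed.
DOC = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
"""

root = etree.HTML(DOC)

# $target is an XPath variable; lxml binds it from the keyword argument.
exact = root.xpath("//li/a[@href=$target]", target="link2.html")

# contains() matches a substring of the attribute value.
inactive = root.xpath('//li[contains(@class, "inactive")]/a/@href')

print([a.text for a in exact])
print(inactive)
```

Using a variable avoids building the expression by string concatenation, which gets awkward when the value itself contains quotes.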
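@4. Get all <span> tags under <li>
from lxml import etree

html = etree.parse('hello.html')

# result = html.xpath('//li/span')
# The line above is wrong: / selects direct children only, and <span> is
# not a direct child of <li>, so use // to match descendants at any depth.
# (This assumes hello.html contains a <span>; the listing in section 3
# does not show one.)
result = html.xpath('//li//span')
print(result)
Output
[<Element span at 0x10d698e18>]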
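The difference between `/` and `//` is easy to see side by side. A minimal sketch, assuming a made-up one-item list in which a `<span>` is nested inside the `<a>` (this markup is an assumption, not the exact file used above):

```python
from lxml import etree

# Hypothetical markup: <span> sits inside <a>, not directly inside <li>.
DOC = (
    '<ul><li class="item-0"><a href="link1.html">'
    '<span class="bold">first item</span></a></li></ul>'
)

root = etree.HTML(DOC)

# / looks only at direct children of <li>, so nothing matches.
children = root.xpath("//li/span")

# // looks at descendants of <li> at any depth.
descendants = root.xpath("//li//span")

print(len(children), len(descendants))
```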
@5. Get every class attribute inside the <a> tags under <li>
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')
print(result)
Output
['bold']
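Note that `a//@class` also picks up a class on the `<a>` itself, because `//` expands to the descendant-or-self axis. The sketch below assumes a made-up snippet where both the `<a>` and a nested `<span>` carry a class (the class name `link` is invented for illustration).

```python
from lxml import etree

# Hypothetical markup: both <a> and the nested <span> have a class.
DOC = (
    '<ul><li><a class="link" href="link1.html">'
    '<span class="bold">first item</span></a></li></ul>'
)

root = etree.HTML(DOC)

# Only the class attribute on <a> itself.
own = root.xpath("//li/a/@class")

# Class attributes on <a> and on everything nested inside it.
nested = root.xpath("//li/a//@class")

print(own)
print(nested)
```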
@6. Get the href of the <a> in the last <li>
from lxml import etree

html = etree.parse('hello.html')
# The predicate [last()] selects the last element
result = html.xpath('//li[last()]/a/@href')
print(result)
Output
['link5.html']
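One subtlety worth illustrating: in `//li[last()]` the position is counted within each parent, so a page with several lists yields one match per list. The sketch below uses a made-up document with two lists (the file names `a1.html` and so on are invented) and compares it with `(//li)[last()]`, which picks the last `<li>` overall.

```python
from lxml import etree

# Hypothetical markup with two separate lists.
DOC = """
<div>
  <ul>
    <li><a href="a1.html">a1</a></li>
    <li><a href="a2.html">a2</a></li>
  </ul>
  <ul>
    <li><a href="b1.html">b1</a></li>
    <li><a href="b2.html">b2</a></li>
  </ul>
</div>
"""

root = etree.HTML(DOC)

# last() is evaluated per parent: the last <li> of each <ul>.
per_parent = root.xpath("//li[last()]/a/@href")

# Parentheses collect all <li> first, then take the last one overall.
overall = root.xpath("(//li)[last()]/a/@href")

print(per_parent)
print(overall)
```

With the single list in hello.html both forms give the same result, which is why the simpler expression is enough there.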
@7. Get the text of the second-to-last element
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')
# The .text attribute holds the element's text content
print(result[0].text)
Output
fourth item
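The `.text` attribute only covers the text that comes before an element's first child, so it can be `None` when the text is wrapped in a nested tag. A minimal sketch, assuming a made-up link whose label is partly inside a `<span>`:

```python
from lxml import etree

# Hypothetical markup: the link text is split across a nested <span>.
DOC = '<ul><li><a href="link1.html"><span>first</span> item</a></li></ul>'

root = etree.HTML(DOC)
a = root.xpath("//li/a")[0]

# No text appears before the first child, so .text is None.
direct = a.text

# itertext() walks all text inside the element, nested tags included.
joined = "".join(a.itertext())

# The XPath string() function gives the same concatenated text.
via_xpath = a.xpath("string()")

print(direct, joined, via_xpath, sep=" | ")
```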
@8. Get the tag name of the element whose class is bold
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//*[@class="bold"]')
# The .tag attribute holds the element's tag name
print(result[0].tag)
Output
span
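The wildcard `*` matches elements of any name, which makes `.tag` useful for finding out what was matched. The sketch below assumes a small made-up snippet (the `box` class and `<p>` element are invented) and also shows the XPath 1.0 `name()` function as an alternative.

```python
from lxml import etree

# Hypothetical markup: two elements carry a class, one does not.
DOC = '<div class="box"><span class="bold">first item</span><p>plain</p></div>'

root = etree.HTML(DOC)

# Tag names of every element that has a class attribute.
tags = [el.tag for el in root.xpath("//*[@class]")]

# name() returns the name of the first node in the matched set.
name = root.xpath('name(//*[@class="bold"])')

print(tags)
print(name)
```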