shengge0 2020-02-13
Someone reminded me that I forgot to post how the IDs in the page URLs are crawled. An example detail page:
http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20021300174
AH20021300174 is the ID that needs to be crawled.
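To make the relationship between the ID and the detail page concrete, here is a minimal sketch that builds the detail URL from an ID and reads the ID back out of such a URL. The helper names `detail_url` and `extract_original_id` are hypothetical, introduced only for illustration; the base URL and the `originalId` parameter are taken from the example link above.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Base address of the detail page, taken from the example URL above
DETAIL_BASE = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow"


def detail_url(original_id):
    # Hypothetical helper: append the ID as the originalId query parameter
    return DETAIL_BASE + "?" + urlencode({"originalId": original_id})


def extract_original_id(url):
    # Hypothetical helper: return the originalId value, or None if absent
    query = parse_qs(urlparse(url).query)
    values = query.get("originalId", [])
    return values[0] if values else None


if __name__ == "__main__":
    link = detail_url("AH20021300174")
    print(link)
    print(extract_original_id(link))
```
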
The current code is as follows:
import json
import requests

# Endpoint that returns the message list as JSON
url = "http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.mailList.biz.ext"

# Request headers copied from the browser (Firefox).
# Content-Length is left out: requests computes it from the body.
kv = {
    'Host': 'www.beijing.gov.cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'text/json',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'http://www.beijing.gov.cn',
    'Connection': 'keep-alive',
    'Referer': 'http://www.beijing.gov.cn/hudong/hdjl/',
}


def page(begin):
    # Request one page of 6 records starting at offset `begin`
    query = {
        'PageCond/begin': begin,
        'PageCond/isCount': 'true',
        'PageCond/length': 6,
    }
    datas = json.dumps(query)
    r = requests.post(url, data=datas, headers=kv)
    print(r.status_code)
    print(r.text)
    js = json.loads(r.text)
    # Each entry in "mailList" carries the ID under "original_id"
    for j in js["mailList"]:
        print(j)
        print(j.get("original_id"))


def href():
    # Walk the offsets 0, 6, 12, ... below 5584, one request per page
    for i in range(0, 5584, 6):
        page(i)


if __name__ == "__main__":
    href()
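
The script above only prints the IDs. As a rough sketch of how they could be collected, the snippet below separates the offset arithmetic and the JSON parsing into two small functions that can be tried without touching the network. The names `page_offsets` and `extract_ids` are hypothetical, and the sample response is made up: it only mimics the two fields the script reads (`mailList` and `original_id`), so the real response may well contain more.

```python
import json


def page_offsets(total, page_size=6):
    # Offsets 0, page_size, 2*page_size, ... strictly below total
    return list(range(0, total, page_size))


def extract_ids(response_text):
    # Parse the JSON body and keep every non-empty original_id
    payload = json.loads(response_text)
    ids = []
    for item in payload.get("mailList", []):
        original_id = item.get("original_id")
        if original_id:
            ids.append(original_id)
    return ids


if __name__ == "__main__":
    # Made-up response shaped like the fields the script reads (an assumption)
    sample = json.dumps({
        "mailList": [
            {"original_id": "AH20021300174"},
            {"title": "entry without an id"},
        ]
    })
    print(extract_ids(sample))
    print(page_offsets(20))
```

In the script, `extract_ids(r.text)` could take the place of the printing loop inside `page`, with the returned IDs appended to a list or written to a file.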