Python 人类的HTML解析器:requests_html
该库旨在使解析HTML(例如,抓取Web)尽可能简单直观。
使用该库时,您将自动获得:
- 全面的JavaScript支持!
- CSS选择器(又称jQuery风格,多亏了PyQuery)。
- XPath Selectors,让您感到昏厥。
- 模拟的用户代理(如真实的Web浏览器)。
- 自动跟随重定向。
- 连接池和cookie持久性。
- 您知道和喜欢的请求体验具有神奇的解析能力。
安装
>>> pip install requests-html
仅支持Python 3.6。(及以上版本)
教程和用法
使用Requests向python.org发出GET请求
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
按原样获取页面上所有链接的列表(不包括锚):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
以绝对形式(不包括锚)获取页面上所有链接的列表:
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
Element
使用CSS选择器选择一个(了解更多信息):
>>> about = r.html.find('#about', first=True)
抓取Element
的文本内容:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
内省an Element
的属性(了解更多信息):
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
呈现Element
的HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
在以下Element
列表中选择一个列表Element
:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
搜索元素内的链接:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
在页面上搜索文本:
>>> r.html.search('Python is a {} language')[0]
programming
更复杂的CSS选择器示例(从Chrome开发工具复制):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath也受支持(了解更多信息):
>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
您还可以仅选择包含某些文本的元素:
>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a' href='http://kennethreitz.org/'>, <Element 'a' href='https://twitter.com/kennethreitz' class=('twitter-follow-button',) data-show-count='false'>, <Element 'a' class=('reference', 'internal') href='dev/contributing/#kenneth-reitz-s-code-style'>]
支持JavaScript
让我们获取一些JavaScript渲染的文本:
>>> r = session.get('http://python-requests.org/')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
请注意,您第一次运行该render()
方法时,会将Chromium下载到您的主目录(例如~/.pyppeteer/
)中。这只会发生一次。
分页
还提供智能分页支持(始终在改进):
>>> r = session.get('https://reddit.com')
>>> for html in r.html:
... print(html)
<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>
<HTML url='https://www.reddit.com/?count=100&after=t3_81k1c8'>
<HTML url='https://www.reddit.com/?count=125&after=t3_81p438'>
<HTML url='https://www.reddit.com/?count=150&after=t3_81nrcd'>
…
您也可以轻松地请求下一个URL:python
>>> r = session.get('https://reddit.com')
>>> r.html.next()
'https://www.reddit.com/?count=25&after=t3_81pm82'
在没有请求的情况下使用
您也可以在没有请求的情况下使用此库:
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
您也可以在没有请求的情况下呈现JavaScript页面:
# ^^ proceeding from above ^^
>>> script = """
() => {
return {
width: document.documentElement.clientWidth,
height: document.documentElement.clientHeight,
deviceScaleFactor: window.devicePixelRatio,
}
}
"""
>>> val = html.render(script=script, reload=False)
>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>
以上内容来自于requests_html作者官网,简单翻译过来就是很好的使用教程呢!
更详细的介绍可以参考原作者的网站或者GitHub
由于requests_html和之前介绍的requests库是同一作者,使用很多原来在requests上的一些操作在这个新库也是可以使用的。如:自定义header
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
r=session.get("http://httpbin.org/get",headers={'user-agent':ua})
print(r.text)
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"
},
"origin":
"url": "https://httpbin.org/get"
}
requests_html对我目前来说已经是一个非常优秀的库了,因为与其他的爬虫操作不同,网页获取,解析器,和Js渲染用在一个库中解决了,使用者一个库就可以写出一个比较完整的爬虫Dome。如:
#!/usr/bin/env python
# @coding: UTF-8
# @Time : 2019/9/28 17:11
# @Author : lengqie
# @File : MySpider1.1.py
# @Explain : 对原版本的1.0的优化版本
import re,pymysql,time
from requests_html import HTMLSession
session=HTMLSession()
# 获取文章链接
def GetPostLinks(html):
try:
# 获取全部链接
All_links=html.html.links
Post_links=set()
# 筛选出文章链接
pattern = r'(https?|ftp|file)(:/)?/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]?p=\d{1,5}$'
for link in All_links:
if re.match(pattern,link):
Post_links.add(link)
except Exception as e:
print("连接获取出错!\t"+str(e))
exit()
else:
# 返回一个集合
print("链接获取成功!\n")
return Post_links
# 获取文章信息
def GetPost(url):
try:
# 提取文章id,使用正则表达式提取出含有ID的一个元组中,再取出元组中的数字与默认结果一起组成css选择器
id=str(re.findall(r'\d+(?:\.\d+)?', url)[0])
# 提取文章标题
title_selector = '#post-'+id+'> header'
title_session=Response.html.find(title_selector,first=True)
title=title_session.text
# 提取文章时间
time_selector = '#post-'+id+ '> footer > span.posted-on > a > time.entry-date.published'
time_session=Response.html.find(time_selector,first=True)
time=time_session.text
# 提取所在分类
tag_selector ='#post-'+id+' > footer > span.cat-links > a'
tag_session=Response.html.find(tag_selector,first=True)
tag = tag_session.text
# 提取文章内容
content_selector='#post-'+id+ '> div'
content_session = Response.html.find(content_selector,first=True)
content=content_session.text
except Exception as e:
print("文章提取出错!\t"+str(e))
exit()
else:
# 会返回一个元组
print("ID={}".format(id))
print("文章获取成功!")
return title,time,tag,content
# 创建数据库
def CreateDatabase():
# 连接数据库
try:
db = pymysql.connect("127.0.0.1", "root", "", "test")
cursor = db.cursor()
# 表存在需要删除
cursor.execute("DROP TABLE IF EXISTS POST")
except Exception as e:
print("数据库连接失败!\t"+str(e))
exit()
else:
print("数据库连接成功!\n")
try:
sql = """CREATE TABLE POST (
title CHAR(255),
time CHAR(255),
tag CHAR(255),
content TEXT )"""
cursor.execute(sql)
except Exception as e:
print("表创建错误!\t"+str(e))
exit()
else:
print("新建表成功!\n")
db.close()
# 写入数据库
def Writer2Mysql(tuple):
# 连接数据库 若创建数据表成功,此处则不需要异常处理!
db = pymysql.connect("127.0.0.1", "root", "", "test")
cursor = db.cursor()
sql = "INSERT INTO POST(title,time,tag,content) VALUES (%s, %s, %s,%s)"
try:
cursor.execute(sql, (tuple[0],tuple[1],tuple[2],tuple[3]))
db.commit()
print('插入数据成功!\n')
except Exception as e:
# 数据回滚
db.rollback()
print("插入数据失败!\t"+str(e))
exit()
db.close()
if __name__ == '__main__':
# 程序计时开始
start=time.perf_counter()
url="http://jianghaodong.com"
# 浏览器标识
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
Response = session.get(url,headers={'user-agent':ua})
# javascript渲染
#Response.html.render()
# 休眠时间 默认 0
SleepTime = 0
# 获取链接
links=GetPostLinks(Response)
# 创建数据库
CreateDatabase()
for link in links:
ReturnTuple = GetPost(link)
time.sleep(SleepTime)
# 写入到表
Writer2Mysql(ReturnTuple)
print("爬取完毕!")
# 程序计时结束
end=time.perf_counter()
print("耗时{:>.2}s".format(end-start))