requests_html: HTML Parsing for Humans, in Python

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection-pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.

Installation

$ pip install requests-html

Only Python 3.6 and above is supported.

Tutorial & Usage

Make a GET request to python.org, using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()

>>> r = session.get("https://python.org/")

Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links
{"//docs.python.org/3/tutorial/", "/about/apps/", "https://github.com/python/pythondotorg/issues", "/accounts/login/", "/dev/peps/", "/about/legal/", "//docs.python.org/3/tutorial/introduction.html#lists", "/download/alternatives", "http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html", "/download/other/", "/downloads/windows/", "https://mail.python.org/mailman/listinfo/python-dev", "/doc/av", "https://devguide.python.org/", "/about/success/#engineering", "https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event", "https://www.openstack.org", "/about/gettingstarted/", "http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html", "/success-stories/industrial-light-magic-runs-python/", "http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator", "/", "http://pyfound.blogspot.com/", "/events/python-events/past/", "/downloads/release/python-2714/", "https://wiki.python.org/moin/PythonBooks", "http://plus.google.com/+Python", "https://wiki.python.org/moin/", "https://status.python.org/", "/community/workshops/", "/community/lists/", "http://buildbot.net/", "/community/awards", "http://twitter.com/ThePSF", "https://docs.python.org/3/license.html", "/psf/donations/", "http://wiki.python.org/moin/Languages", "/dev/", "/events/python-user-group/", "https://wiki.qt.io/PySide", "/community/sigs/", "https://wiki.gnome.org/Projects/PyGObject", "http://www.ansible.com", "http://www.saltstack.com", "http://planetpython.org/", "/events/python-events", "/about/help/", "/events/python-user-group/past/", "/about/success/", "/psf-landing/", "/about/apps", "/about/", "http://www.wxpython.org/", "/events/python-user-group/665/", "https://www.python.org/psf/codeofconduct/", "/dev/peps/peps.rss", "/downloads/source/", "/psf/sponsorship/sponsors/", "http://bottlepy.org", "http://roundup.sourceforge.net/", "http://pandas.pydata.org/", "http://brochure.getpython.info/", 
"https://bugs.python.org/", "/community/merchandise/", "http://tornadoweb.org", "/events/python-user-group/650/", "http://flask.pocoo.org/", "/downloads/release/python-364/", "/events/python-user-group/660/", "/events/python-user-group/638/", "/psf/", "/doc/", "http://blog.python.org", "/events/python-events/604/", "/about/success/#government", "http://python.org/dev/peps/", "https://docs.python.org", "http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html", "/users/membership/", "/about/success/#arts", "https://wiki.python.org/moin/Python2orPython3", "/downloads/", "/jobs/", "http://trac.edgewall.org/", "http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html", "/privacy/", "https://pypi.python.org/", "http://www.riverbankcomputing.co.uk/software/pyqt/intro", "http://www.scipy.org", "/community/forums/", "/about/success/#scientific", "/about/success/#software-development", "/shell/", "/accounts/signup/", "http://www.facebook.com/pythonlang?fref=ts", "/community/", "https://kivy.org/", "/about/quotes/", "http://www.web2py.com/", "/community/logos/", "/community/diversity/", "/events/calendars/", "https://wiki.python.org/moin/BeginnersGuide", "/success-stories/", "/doc/essays/", "/dev/core-mentorship/", "http://ipython.org", "/events/", "//docs.python.org/3/tutorial/controlflow.html", "/about/success/#education", "/blogs/", "/community/irc/", "http://pycon.blogspot.com/", "//jobs.python.org", "http://www.pylonsproject.org/", "http://www.djangoproject.com/", "/downloads/mac-osx/", "/about/success/#business", "http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html", "http://wiki.python.org/moin/TkInter", "https://docs.python.org/faq/", "//docs.python.org/3/tutorial/controlflow.html#defining-functions"}

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{"https://github.com/python/pythondotorg/issues", "https://docs.python.org/3/tutorial/", "https://www.python.org/about/success/", "http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html", "https://www.python.org/dev/peps/", "https://mail.python.org/mailman/listinfo/python-dev", "https://www.python.org/doc/", "https://www.python.org/", "https://www.python.org/about/", "https://www.python.org/events/python-events/past/", "https://devguide.python.org/", "https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event", "https://www.openstack.org", "http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html", "https://docs.python.org/3/tutorial/introduction.html#lists", "http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator", "http://pyfound.blogspot.com/", "https://wiki.python.org/moin/PythonBooks", "http://plus.google.com/+Python", "https://wiki.python.org/moin/", "https://www.python.org/events/python-events", "https://status.python.org/", "https://www.python.org/about/apps", "https://www.python.org/downloads/release/python-2714/", "https://www.python.org/psf/donations/", "http://buildbot.net/", "http://twitter.com/ThePSF", "https://docs.python.org/3/license.html", "http://wiki.python.org/moin/Languages", "https://docs.python.org/faq/", "https://jobs.python.org", "https://www.python.org/about/success/#software-development", "https://www.python.org/about/success/#education", "https://www.python.org/community/logos/", "https://www.python.org/doc/av", "https://wiki.qt.io/PySide", "https://www.python.org/events/python-user-group/660/", "https://wiki.gnome.org/Projects/PyGObject", "http://www.ansible.com", "http://www.saltstack.com", "https://www.python.org/dev/peps/peps.rss", "http://planetpython.org/", "https://www.python.org/events/python-user-group/past/", "https://docs.python.org/3/tutorial/controlflow.html#defining-functions", 
"https://www.python.org/community/diversity/", "https://docs.python.org/3/tutorial/controlflow.html", "https://www.python.org/community/awards", "https://www.python.org/events/python-user-group/638/", "https://www.python.org/about/legal/", "https://www.python.org/dev/", "https://www.python.org/download/alternatives", "https://www.python.org/downloads/", "https://www.python.org/community/lists/", "http://www.wxpython.org/", "https://www.python.org/about/success/#government", "https://www.python.org/psf/", "https://www.python.org/psf/codeofconduct/", "http://bottlepy.org", "http://roundup.sourceforge.net/", "http://pandas.pydata.org/", "http://brochure.getpython.info/", "https://www.python.org/downloads/source/", "https://bugs.python.org/", "https://www.python.org/downloads/mac-osx/", "https://www.python.org/about/help/", "http://tornadoweb.org", "http://flask.pocoo.org/", "https://www.python.org/users/membership/", "http://blog.python.org", "https://www.python.org/privacy/", "https://www.python.org/about/gettingstarted/", "http://python.org/dev/peps/", "https://www.python.org/about/apps/", "https://docs.python.org", "https://www.python.org/success-stories/", "https://www.python.org/community/forums/", "http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html", "https://www.python.org/community/merchandise/", "https://www.python.org/about/success/#arts", "https://wiki.python.org/moin/Python2orPython3", "http://trac.edgewall.org/", "http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html", "https://pypi.python.org/", "https://www.python.org/events/python-user-group/650/", "http://www.riverbankcomputing.co.uk/software/pyqt/intro", "https://www.python.org/about/quotes/", "https://www.python.org/downloads/windows/", "https://www.python.org/events/calendars/", "http://www.scipy.org", "https://www.python.org/community/workshops/", "https://www.python.org/blogs/", 
"https://www.python.org/accounts/signup/", "https://www.python.org/events/", "https://kivy.org/", "http://www.facebook.com/pythonlang?fref=ts", "http://www.web2py.com/", "https://www.python.org/psf/sponsorship/sponsors/", "https://www.python.org/community/", "https://www.python.org/download/other/", "https://www.python.org/psf-landing/", "https://www.python.org/events/python-user-group/665/", "https://wiki.python.org/moin/BeginnersGuide", "https://www.python.org/accounts/login/", "https://www.python.org/downloads/release/python-364/", "https://www.python.org/dev/core-mentorship/", "https://www.python.org/about/success/#business", "https://www.python.org/community/sigs/", "https://www.python.org/events/python-user-group/", "http://ipython.org", "https://www.python.org/shell/", "https://www.python.org/community/irc/", "https://www.python.org/about/success/#engineering", "http://www.pylonsproject.org/", "http://pycon.blogspot.com/", "https://www.python.org/about/success/#scientific", "https://www.python.org/doc/essays/", "http://www.djangoproject.com/", "https://www.python.org/success-stories/industrial-light-magic-runs-python/", "http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html", "http://wiki.python.org/moin/TkInter", "https://www.python.org/jobs/", "https://www.python.org/events/python-events/604/"}

Select an Element with a CSS Selector (learn more):

>>> about = r.html.find("#about", first=True)

Grab an Element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes (learn more):

>>> about.attrs
{"id": "about", "class": ("tier-1", "element-1"), "aria-haspopup": "true"}

Render out an Element's HTML:

>>> about.html
"<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>"

Select a list of Elements within an Element:

>>> about.find("a")
[<Element "a" href="/about/" title="" class="">, <Element "a" href="/about/apps/" title="">, <Element "a" href="/about/quotes/" title="">, <Element "a" href="/about/gettingstarted/" title="">, <Element "a" href="/about/help/" title="">, <Element "a" href="http://brochure.getpython.info/" title="">]

Search for links within an element:

>>> about.absolute_links
{"http://brochure.getpython.info/", "https://www.python.org/about/gettingstarted/", "https://www.python.org/about/", "https://www.python.org/about/quotes/", "https://www.python.org/about/help/", "https://www.python.org/about/apps/"}

Search for text on the page:

>>> r.html.search("Python is a {} language")[0]
programming
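The {} placeholder syntax here comes from the parse library, which html.search uses under the hood. A rough standard-library analogue using re, where a capture group plays the role of the placeholder (the sample text below is made up for illustration):

```python
import re

# Hypothetical text standing in for rendered page content.
text = "Python is a programming language that lets you work quickly."

# The capture group extracts the same span a {} placeholder would.
match = re.search(r"Python is a (\S+) language", text)
print(match.group(1))  # programming
```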

A more complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get("https://github.com/")
>>> sel = "body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p"

>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported (learn more):

>>> r.html.xpath("a")
[<Element "a" class="btn" href="https://help.github.com/articles/supported-browsers">]

You can also select only elements containing certain text:

>>> r = session.get("http://python-requests.org/")
>>> r.html.find("a", containing="kenneth")
[<Element "a" href="http://kennethreitz.com/pages/open-projects.html">, <Element "a" href="http://kennethreitz.org/">, <Element "a" href="https://twitter.com/kennethreitz" class=("twitter-follow-button",) data-show-count="false">, <Element "a" class=("reference", "internal") href="dev/contributing/#kenneth-reitz-s-code-style">]

JavaScript Support

Let's grab some text that's rendered by JavaScript:

>>> r = session.get("http://python-requests.org/")

>>> r.html.render()

>>> r.html.search("Python 2 will retire in only {months} months!")["months"]
"<time>25</time>"

Note that the first time you run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Pagination

There's also intelligent pagination support (always improving):

>>> r = session.get("https://reddit.com")
>>> for html in r.html:
...     print(html)
<HTML url="https://www.reddit.com/">
<HTML url="https://www.reddit.com/?count=25&after=t3_81puu5">
<HTML url="https://www.reddit.com/?count=50&after=t3_81nevg">
<HTML url="https://www.reddit.com/?count=75&after=t3_81lqtp">
<HTML url="https://www.reddit.com/?count=100&after=t3_81k1c8">
<HTML url="https://www.reddit.com/?count=125&after=t3_81p438">
<HTML url="https://www.reddit.com/?count=150&after=t3_81nrcd">
…

You can also just request the next URL easily:

>>> r = session.get("https://reddit.com")
>>> r.html.next()
"https://www.reddit.com/?count=25&after=t3_81pm82"

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href="https://httpbin.org">"""

>>> html = HTML(html=doc)
>>> html.links
{"https://httpbin.org"}
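For comparison, roughly the same link extraction can be sketched with nothing but the standard library (a toy analogue only; requests_html additionally resolves relative URLs, filters anchors, and more):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, similar in spirit to html.links."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

collector = LinkCollector()
collector.feed('<a href="https://httpbin.org">')
print(collector.links)  # {'https://httpbin.org'}
```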

You can also render JavaScript pages without Requests:

# ^^ proceeding from above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
>>> val = html.render(script=script, reload=False)

>>> print(val)
{"width": 800, "height": 600, "deviceScaleFactor": 1}

>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>

The content above comes from the requests_html author's official site; even a simple translation of it makes a great tutorial!

For a more detailed introduction, see the original author's website or GitHub.

Since requests_html is by the same author as the Requests library introduced earlier, many of the familiar Requests operations also work in this new library, e.g. setting a custom header:

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"
r = session.get("http://httpbin.org/get", headers={"user-agent": ua})

print(r.text)
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"
  }, 
  "origin": 
  "url": "https://httpbin.org/get"
}

To me, requests_html is already an excellent library: unlike other scraping stacks, page fetching, parsing, and JavaScript rendering are all solved within a single library, so one library is enough to write a fairly complete spider demo. For example:

#!/usr/bin/env python
# @coding: UTF-8
# @Time : 2019/9/28 17:11
# @Author : lengqie
# @File : MySpider1.1.py
# @Explain : Optimized version of the original 1.0

import re
import time

import pymysql
from requests_html import HTMLSession

session = HTMLSession()

# Collect the post links from the page
def GetPostLinks(html):
    try:
        # Grab every link on the page
        All_links = html.html.links
        Post_links = set()
        # Keep only links that look like posts (ending in ?p=<id>)
        pattern = r"(https?|ftp|file)(:/)?/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]?p=\d{1,5}$"
        for link in All_links:
            if re.match(pattern, link):
                Post_links.add(link)
    except Exception as e:
        print("Failed to collect links!\t" + str(e))
        exit()
    else:
        # Returns a set of post URLs
        print("Links collected successfully!\n")
        return Post_links


# Extract a post's details (reads the already-fetched homepage Response)
def GetPost(url):
    try:
        # Extract the post id: the regex returns a list of number strings;
        # take the first and splice it into the fixed CSS selectors below
        id = str(re.findall(r"\d+(?:\.\d+)?", url)[0])

        # Post title
        title_selector = "#post-" + id + " > header"
        title_session = Response.html.find(title_selector, first=True)
        title = title_session.text

        # Post date
        time_selector = "#post-" + id + " > footer > span.posted-on > a > time.entry-date.published"
        time_session = Response.html.find(time_selector, first=True)
        post_time = time_session.text

        # Post category
        tag_selector = "#post-" + id + " > footer > span.cat-links > a"
        tag_session = Response.html.find(tag_selector, first=True)
        tag = tag_session.text

        # Post content
        content_selector = "#post-" + id + " > div"
        content_session = Response.html.find(content_selector, first=True)
        content = content_session.text

    except Exception as e:
        print("Failed to extract the post!\t" + str(e))
        exit()

    else:
        # Returns a tuple
        print("ID={}".format(id))
        print("Post extracted successfully!")
        return title, post_time, tag, content


# Create the database table
def CreateDatabase():
    # Connect to the database
    try:
        db = pymysql.connect(host="127.0.0.1", user="root", password="", database="test")
        cursor = db.cursor()
        # Drop the table if it already exists
        cursor.execute("DROP TABLE IF EXISTS POST")
    except Exception as e:
        print("Database connection failed!\t" + str(e))
        exit()
    else:
        print("Database connected successfully!\n")
    try:
        sql = """CREATE TABLE POST (
                title  CHAR(255),
                time  CHAR(255),
                tag CHAR(255),
                content TEXT )"""
        cursor.execute(sql)
    except Exception as e:
        print("Failed to create the table!\t" + str(e))
        exit()
    else:
        print("Table created successfully!\n")
    db.close()


# Write one record to the database
def Writer2Mysql(record):
    # Connect to the database; if creating the table succeeded, no extra error handling is needed here
    db = pymysql.connect(host="127.0.0.1", user="root", password="", database="test")
    cursor = db.cursor()

    sql = "INSERT INTO POST(title, time, tag, content) VALUES (%s, %s, %s, %s)"
    try:
        cursor.execute(sql, (record[0], record[1], record[2], record[3]))
        db.commit()
        print("Row inserted successfully!\n")
    except Exception as e:
        # Roll back on failure
        db.rollback()
        print("Failed to insert row!\t" + str(e))
        exit()
    db.close()


if __name__ == "__main__":
    # Start timing
    start = time.perf_counter()

    url = "http://jianghaodong.com"
    # Browser user-agent string
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
    Response = session.get(url, headers={"user-agent": ua})
    # Render JavaScript if needed
    #Response.html.render()
    # Sleep between posts, 0 by default
    SleepTime = 0

    # Collect the post links
    links = GetPostLinks(Response)
    # Create the database table
    CreateDatabase()
    for link in links:
        ReturnTuple = GetPost(link)
        time.sleep(SleepTime)
        # Write to the table
        Writer2Mysql(ReturnTuple)

    print("Crawl finished!")
    # Stop timing
    end = time.perf_counter()
    print("Took {:.2f}s".format(end - start))
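The post-link filter in GetPostLinks can be tried in isolation. This sketch uses the same kind of regex, written out plainly, against a couple of made-up URLs:

```python
import re

# Post links end in ?p=<numeric id>; other pages do not.
pattern = r"(https?|ftp|file)(:/)?/[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]?p=\d{1,5}$"
links = {
    "http://jianghaodong.com/?p=123",  # looks like a post link
    "http://jianghaodong.com/about/",  # not a post link
}
post_links = {link for link in links if re.match(pattern, link)}
print(post_links)  # {'http://jianghaodong.com/?p=123'}
```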