MechanicalSoup: a lightweight Python library for automating interaction with websites. Think of it as a Selenium that does not execute JavaScript

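A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It does not execute JavaScript.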

https://github.com/MechanicalSoup/MechanicalSoup

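A crawler usually works in two steps: request a page, then parse it. This project wraps Requests (requesting) and BeautifulSoup (parsing), two of the most widely used Python scraping libraries, into a single browser object (StatefulBrowser) that merges those two steps into one. From then on, that one browser object is all you need to request pages, filter content, submit forms, and navigate between URLs, which keeps the code simpler and more convenient to work with. Because it does not depend on a browser process, it is much lighter than Selenium; the trade-off is that it cannot handle pages rendered dynamically by JavaScript.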

"""Example usage of MechanicalSoup to get the results from the Qwant
search engine.
"""

import re
import urllib.parse

import mechanicalsoup

# Connect to Qwant
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open("https://lite.qwant.com/")

# Fill-in the search form
browser.select_form('#search-form')
browser["q"] = "MechanicalSoup"
browser.submit_selected()

# Display the results
for link in browser.page.select('.result a'):
    # Qwant shows redirection links, not the actual URL, so extract
    # the actual URL from the redirect link:
    href = link.attrs['href']
    m = re.match(r"^/redirect/[^/]*/(.*)$", href)
    if m:
        href = urllib.parse.unquote(m.group(1))
    print(link.text, '->', href)
