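Most pages today are rendered dynamically with HTML5 and JavaScript, and conventional scraping methods struggle with them. PhantomJS appeared a few years ago and could scrape dynamic pages effectively, but it has drawbacks: problems such as memory overflows frequently cause it to hang. Its author has also stopped updating PhantomJS.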

So now it is time to drop PhantomJS!

A new discovery

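Since v59, Chrome has shipped a headless mode. Combined with the Chrome DevTools Protocol, which exposes the browser engine's API, it makes distributed remote debugging of Chrome possible (data scraping and so on).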

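The Chrome DevTools Protocol allows tools to instrument, inspect, debug and profile Chromium, Chrome and other Blink-based browsers. Many existing projects currently use the protocol. Chrome DevTools itself uses this protocol, and the DevTools team maintains its API.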

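Server side: on a server with Chrome installed, start Chrome with remote debugging enabled. The commands below were run in a Docker environment (Alpine with Chrome). For more Chrome startup flags, see https://peter.sh/experiments/chromium-command-line-switches/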

chromium-browser --headless --no-sandbox --disable-gpu --remote-debugging-port=9222

chrome --headless --no-sandbox --disable-gpu --remote-debugging-port=9222 --remote-debugging-address=0.0.0.0 --window-size=1920,1080 --user-data-dir=
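When the browser is started from a script rather than typed by hand, the flags above can be assembled programmatically. The following is a minimal Python sketch under stated assumptions: the binary is called `chromium-browser` and is on PATH, and the helper names `build_chrome_command` and `launch_chrome` are hypothetical. Only the flags themselves come from the commands above.

```python
import subprocess


def build_chrome_command(binary="chromium-browser", port=9222,
                         address=None, window_size=None, user_data_dir=None):
    """Assemble the headless Chrome command line shown above as an argv list."""
    cmd = [
        binary,
        "--headless",
        "--no-sandbox",   # commonly needed when Chrome runs as root, e.g. in Docker
        "--disable-gpu",
        f"--remote-debugging-port={port}",
    ]
    # The remaining flags are optional and only added when a value is given.
    if address:
        cmd.append(f"--remote-debugging-address={address}")
    if window_size:
        width, height = window_size
        cmd.append(f"--window-size={width},{height}")
    if user_data_dir:
        cmd.append(f"--user-data-dir={user_data_dir}")
    return cmd


def launch_chrome(**kwargs):
    """Start Chrome in the background and return the Popen handle."""
    return subprocess.Popen(build_chrome_command(**kwargs))
```

For example, `build_chrome_command(port=9444, address="0.0.0.0", window_size=(1920, 1080))` produces the second command above, minus the profile directory.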

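Note: the examples below use remote-debugging-port 9444, the value set in the startup command (the sample commands above show 9222; substitute whichever port Chrome was actually started with). You can open the following URL in a browser to view information about the Chrome instance on the remote server: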

http://192.168.110.128:9444/json

[
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)",
    "id": "(9E4790959AAB0C8FB8F309ABB204729C)",
    "title": "百度一下,你就知道",
    "type": "page",
    "url": "https://www.baidu.com/",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(C8A6E4D304F820AC9F48AC9A34137F78)",
    "id": "(C8A6E4D304F820AC9F48AC9A34137F78)",
    "title": "百度一下,你就知道",
    "type": "page",
    "url": "https://www.baidu.com/",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(C8A6E4D304F820AC9F48AC9A34137F78)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(E18749BAD4802F598A844A7EE14BA9C4)",
    "id": "(E18749BAD4802F598A844A7EE14BA9C4)",
    "title": "about:blank",
    "type": "page",
    "url": "about:blank",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(E18749BAD4802F598A844A7EE14BA9C4)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(2C5CCAACD2BFBA9E39D73EBAB2291C87)",
    "id": "(2C5CCAACD2BFBA9E39D73EBAB2291C87)",
    "title": "",
    "type": "page",
    "url": "file:///",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(2C5CCAACD2BFBA9E39D73EBAB2291C87)"
  }
]

Open a new tab:
http://localhost:9222/json/new
http://localhost:9222/json/new?http://www.baidu.com

Close a tab:
http://localhost:9222/json/close/477810FF-323E-44C5-997C-89B7FAC7B158

Activate a tab:
http://localhost:9222/json/activate/477810FF-323E-44C5-997C-89B7FAC7B158

View version information:
http://localhost:9222/json/version

Client side: connect to the Chrome remote debugging port over the WebSocket protocol

ws://192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)

and send the following API commands.

Open a page:
{"id":200,"method":"Page.navigate","params":{"url":"https://www.baidu.com"}}

Get the DOM:
{"id":200,"method":"DOM.getDocument"}

Get the HTML:
{"id":200,"method":"DOM.getOuterHTML","params":{"nodeId":1,"backendNodeId":12}}

Get the resource tree:
{"id":200,"method":"Page.getResourceTree","params":{}}

JavaScript can be executed through the API method Runtime.evaluate, much like typing into the Chrome console:

{"id":200,"method":"Runtime.evaluate","params":{"expression":"document.title","objectGroup":"console","includeCommandLineAPI":true,"silent":false,"contextId":1,"returnByValue":false,"generatePreview":true,"userGesture":true,"awaitPromise":false}}

{"id":200,"method":"Runtime.evaluate","params":{"expression":"document.title","objectGroup":"console","includeCommandLineAPI":true,"silent":false,"returnByValue":false,"generatePreview":true,"userGesture":true,"awaitPromise":false}}

Response:

{
  "id": 200,
  "result": {
    "result": {
      "type": "string",
      "value": "百度一下,你就知道"
    }
  }
}
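The exchange above can be driven from a script. The following is a minimal Python sketch, assuming the third-party `websocket-client` package (imported as `websocket`) is installed and Chrome is listening on the host and port used above. The helper names `page_ws_urls`, `list_targets`, `build_command`, `call`, `evaluate` and `page_title` are hypothetical and not part of any library.

```python
import itertools
import json
import urllib.request

_ids = itertools.count(1)  # every command gets its own id so replies can be matched


def page_ws_urls(targets):
    """Pick the WebSocket URLs of all 'page' targets from the /json listing."""
    return [t["webSocketDebuggerUrl"] for t in targets
            if t.get("type") == "page" and "webSocketDebuggerUrl" in t]


def list_targets(host, port):
    """Fetch and decode http://host:port/json."""
    with urllib.request.urlopen(f"http://{host}:{port}/json") as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_command(method, params=None):
    """Build one protocol message: an id, a method name and optional params."""
    msg = {"id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def call(ws, method, params=None):
    """Send one command and wait for the reply carrying the same id."""
    msg = build_command(method, params)
    ws.send(json.dumps(msg))
    while True:
        reply = json.loads(ws.recv())
        if reply.get("id") == msg["id"]:  # skip events, which carry no id
            return reply


def evaluate(ws, expression):
    """Run a JavaScript expression via Runtime.evaluate and return its value."""
    reply = call(ws, "Runtime.evaluate",
                 {"expression": expression, "returnByValue": True})
    return reply["result"]["result"].get("value")


def page_title(host="192.168.110.128", port=9444):
    """Connect to the first page target and read document.title."""
    import websocket  # provided by the websocket-client package

    ws = websocket.create_connection(page_ws_urls(list_targets(host, port))[0])
    try:
        return evaluate(ws, "document.title")
    finally:
        ws.close()
```

Unlike the hand-written messages, which all reuse id 200, this sketch gives each command a fresh id. That matters once several commands are in flight, because the reply is matched to its request by id.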

API domains:
https://chromedevtools.github.io/debugger-protocol-viewer/1-2/

Extension API: many extensions use this protocol to interact with and debug pages, and the official site lists many Sample Extensions:
https://developer.chrome.com/extensions/samples#search:debugger

Chrome API:
https://chromedevtools.github.io/devtools-protocol/

API for simulating keyboard input:
https://chromedevtools.github.io/devtools-protocol/tot/Input/
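As an illustration of the Input domain linked above, the sketch below builds the messages that would type a string into whichever element currently has focus. It assumes `Input.dispatchKeyEvent` with `type` set to `char`, one message per character. `typing_commands` is a hypothetical helper name, and the messages still have to be sent over the WebSocket connection described earlier.

```python
import json


def typing_commands(text, first_id=1):
    """Build one Input.dispatchKeyEvent 'char' message per character of text."""
    return [
        {
            "id": first_id + offset,
            "method": "Input.dispatchKeyEvent",
            "params": {"type": "char", "text": ch},
        }
        for offset, ch in enumerate(text)
    ]


# Print the messages as JSON lines, ready to be sent one by one.
for message in typing_commands("hi", first_id=300):
    print(json.dumps(message))
```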

Chrome startup flags:
https://peter.sh/experiments/chromium-command-line-switches/

Some interesting tools:
https://developer.chrome.com/devtools/docs/debugging-clients

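Afterword

Many tools use the Chrome debugging protocol, including PhantomJS and Selenium's ChromeDriver. Underneath they are all implemented the same way: the protocol is effectively an API that the Chrome engine exposes for applications to call.

The official site lists many interesting tools. It is because the API is so rich that there are so many Chrome extensions.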

Node libraries that implement the Remote debugging protocol:

chrome-debug-protocol, written with ES6 and TypeScript: https://github.com/DickvdBrink/chrome-debug-protocol
chrome-remote-interface, the one recommended by the official site: https://github.com/cyrus-and/chrome-remote-interface
chrome-har-capturer, which takes a URL and returns a file in HAR format: https://github.com/cyrus-and/chrome-har-capturer

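What is WebDriver

WebDriver is an open-source tool for automated testing of web applications across many browsers. It provides capabilities for navigating to web pages, user input, JavaScript execution and more.

WebDriver W3C standard: https://w3c.github.io/webdriver/webdriver-spec.html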

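What is ChromeDriver

ChromeDriver is a standalone server that implements WebDriver's wire protocol for Chromium. ChromeDriver is in the process of implementing and moving to the W3C standard. It is available for Chrome on Android and Chrome on desktop (Mac, Linux, Windows and ChromeOS).

W3C standard features already implemented by ChromeDriver: https://chromium.googlesource.com/chromium/src/+/master/docs/chromedriver_status.md

ChromeDriver is maintained by the Chromium team.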
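Because ChromeDriver is a standalone server speaking WebDriver over HTTP, a client is ultimately just sending JSON requests to it. The sketch below builds the W3C-style requests for creating a session and navigating to a page. It assumes ChromeDriver is already running on 127.0.0.1:9515; all helper names are hypothetical, and nothing goes over the network until `send` is called.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:9515"  # assumed address of a running chromedriver


def _json_request(url, body):
    """Wrap a dict as a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(headless=True, base=BASE):
    """Build the POST /session request defined by the W3C WebDriver spec."""
    args = ["--headless"] if headless else []
    body = {"capabilities": {"alwaysMatch": {
        "browserName": "chrome",
        "goog:chromeOptions": {"args": args},
    }}}
    return _json_request(f"{base}/session", body)


def navigate_request(session_id, url, base=BASE):
    """Build the POST /session/{id}/url request that loads a page."""
    return _json_request(f"{base}/session/{session_id}/url", {"url": url})


def send(request):
    """Send a prepared request and return the decoded 'value' field."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))["value"]
```

With a server running, `session = send(new_session_request())` followed by `send(navigate_request(session["sessionId"], "https://www.baidu.com"))` would open the page. Client libraries such as Selenium wrap this same exchange.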

Driving ChromeDriver with Selenium

    import time

    # Import webdriver plus the Selenium 4 helpers used below
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Path to the chromedriver binary. Optional: if not specified, PATH is searched.
    driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
    driver.get('http://www.google.com/xhtml')
    time.sleep(5)  # Let the user actually see something!
    search_box = driver.find_element(By.NAME, 'q')
    search_box.send_keys('ChromeDriver')
    search_box.submit()
    time.sleep(5)  # Let the user actually see something!
    driver.quit()

Controlling ChromeDriver's lifetime

The ChromeDriver class starts the ChromeDriver server process when it is created and terminates it when quit is called. This can waste a significant amount of time in large test suites where every test creates its own ChromeDriver instance.

There are two ways to remedy this:

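1. Use the ChromeDriverService. This is available for most languages and allows you to start/stop the ChromeDriver server yourself. A Java example (with JUnit 4):

```
import java.io.File;
import java.io.IOException;

import junit.framework.TestCase;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.BlockJUnit4ClassRunner;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

@RunWith(BlockJUnit4ClassRunner.class)
public class ChromeTest extends TestCase {
  private static ChromeDriverService service;
  private WebDriver driver;

  // Start one ChromeDriver server for the whole test class.
  @BeforeClass
  public static void createAndStartService() throws IOException {
    service = new ChromeDriverService.Builder()
        .usingDriverExecutable(new File("path/to/my/chromedriver"))
        .usingAnyFreePort()
        .build();
    service.start();
  }

  @AfterClass
  public static void createAndStopService() {
    service.stop();
  }

  // Each test gets a fresh browser session on the shared server.
  // ChromeOptions replaces DesiredCapabilities.chrome(), which Selenium 4 removed.
  @Before
  public void createDriver() {
    driver = new RemoteWebDriver(service.getUrl(), new ChromeOptions());
  }

  @After
  public void quitDriver() {
    driver.quit();
  }

  @Test
  public void testGoogleSearch() {
    driver.get("http://www.google.com");
    // rest of the test...
  }
}
```

Python:

```
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Start the ChromeDriver server ourselves instead of letting webdriver.Chrome do it.
service = Service('/path/to/chromedriver')
service.start()
options = webdriver.ChromeOptions()
options.binary_location = '/path/to/custom/chrome'
driver = webdriver.Remote(service.service_url, options=options)
driver.get('http://www.google.com/xhtml')
time.sleep(5)  # Let the user actually see something!
driver.quit()
service.stop()
```

2. Start the ChromeDriver server separately before running your tests, and connect to it using the Remote WebDriver. Terminal:

```
$ ./chromedriver
Started ChromeDriver port=9515 version=14.0.836.0
```

Java:

```
// Requires java.net.URL; ChromeOptions replaces DesiredCapabilities.chrome().
WebDriver driver = new RemoteWebDriver(new URL("http://127.0.0.1:9515"), new ChromeOptions());
driver.get("http://www.google.com");
```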
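The second approach can also be written in Python. This is a minimal sketch, assuming Selenium 4 is installed and a ChromeDriver server was started separately on the port printed in the terminal output above; `chromedriver_url` and `connect` are hypothetical helper names.

```python
def chromedriver_url(host="127.0.0.1", port=9515):
    """Address of a ChromeDriver server that is already running."""
    return f"http://{host}:{port}"


def connect(server_url=None, headless=True):
    """Attach a Remote WebDriver to the running server instead of spawning one."""
    from selenium import webdriver  # imported lazily; requires the selenium package

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    return webdriver.Remote(command_executor=server_url or chromedriver_url(),
                            options=options)
```

A test would call `driver = connect()`, use the driver, and finish with `driver.quit()`. That ends only the browser session; the ChromeDriver server keeps running for the next test.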

https://div.io/topic/1464
https://sites.google.com/a/chromium.org/chromedriver/
https://github.com/SeleniumHQ/selenium/wiki/JsonWireProtocol
https://github.com/seleniumhq/selenium
https://sites.google.com/a/chromium.org/chromedriver/getting-started
https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities.md
https://sites.google.com/a/chromium.org/chromedriver/capabilities
http://peter.sh/examples/?/chromium-switches.html
