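Overview
I had to use wappalyzer, a library that detects the technologies a site is built with and that depends on puppeteer, so I ended up learning puppeteer along the way. These are my notes. The environment is as follows.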
docker info
Client: Docker Engine - Community
Version: 26.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.25.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
google-chrome --version
Google Chrome 123.0.6312.105
Trying it out
Both patterns rely on Google Chrome's remote debugging.
Pattern: run headless Chrome on the host and connect to it from a Docker container
- Start the container that will run the script.
docker run -it --rm --add-host="host.docker.internal:host-gateway" -v .:/app node:20-alpine /bin/ash
/ # cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.1 host.docker.internal
172.17.0.2 75db39fe9484
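- On the host, start Chrome in headless mode with the remote debugging address set to the IP that host.docker.internal resolves to (172.17.0.1 in the /etc/hosts output).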
google-chrome --remote-debugging-port=9222 --remote-debugging-address=172.17.0.1 --headless --user-data-dir=/tmp --disable-gpu --enable-logging
- Check that the container can reach Chrome on the host. A request addressed to host.docker.internal is rejected, so send the request to the IP address instead.
/ # curl http://host.docker.internal:9222/json/version
Host header is specified and is not an IP address or localhost./ #
/ # curl http://172.17.0.1:9222/json/version
{
"Browser": "HeadlessChrome/123.0.6312.105",
"Protocol-Version": "1.3",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/123.0.6312.105 Safari/537.36",
"V8-Version": "12.3.219.16",
"WebKit-Version": "537.36 (@399174dbe6eff0f59de9a6096129c0c827002b3a)",
"webSocketDebuggerUrl": "ws://172.17.0.1:9222/devtools/browser/e4e73e63-bd32-4fd8-bb35-2957709be2eb"
}
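The rejection comes from Chrome's error message itself: the Host header has to be an IP address or localhost. As a rough sketch of that rule in Node, the helper below tells you whether a host can be used as-is or has to be resolved to an IP first. `isAcceptedDebugHost` is a hypothetical name, and the check only approximates the behavior seen in the output above; it is not Chrome's actual implementation.

```javascript
const net = require("node:net");

// Hypothetical helper: approximates the rule implied by Chrome's error message
// ("Host header is specified and is not an IP address or localhost").
function isAcceptedDebugHost(host) {
  // Strip the brackets from IPv6 literals such as "[::1]".
  const bare = host.replace(/^\[|\]$/g, "");
  // net.isIP returns 4 or 6 for IP literals and 0 for anything else.
  return bare === "localhost" || net.isIP(bare) !== 0;
}

console.log(isAcceptedDebugHost("host.docker.internal")); // false
console.log(isAcceptedDebugHost("172.17.0.1")); // true
console.log(isAcceptedDebugHost("localhost")); // true
```

When the check returns false, the hostname needs to be resolved to an IP before building the request URL, which is what was done by hand with curl here.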
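- The library sets browserWSEndpoint (https://pptr.dev/api/puppeteer.connectoptions) to the value of the CHROMIUM_WEBSOCKET environment variable when that variable is set, so I passed it on the command line and ran the script. The analysis worked.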
/app # CHROMIUM_WEBSOCKET="ws://172.17.0.1:9222/devtools/browser/e4e73e63-bd32-4fd8-bb35-2957709be2eb" node index.js
{
"urls": {
"https://example.com/": {
"status": 200
}
},
"technologies": [
{
"slug": "azure",
"name": "Azure",
"description": "Azure is a cloud computing service for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.",
"confidence": 100,
"version": null,
"icon": "Azure.svg",
"website": "https://azure.microsoft.com",
"cpe": null,
"categories": [
{
"id": 62,
"slug": "paas",
"name": "PaaS"
}
]
},
{
"slug": "docker",
"name": "Docker",
(rest of the output omitted)
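For reference, this is roughly what wiring CHROMIUM_WEBSOCKET into puppeteer's connect options might look like. It is a minimal sketch and not the library's actual code: `connectOptionsFromEnv` and `fetchTitle` are hypothetical names, and only the environment variable and `browserWSEndpoint` come from the notes above. It assumes the `puppeteer` package is installed.

```javascript
// Hypothetical helper: turn CHROMIUM_WEBSOCKET into puppeteer connect options.
function connectOptionsFromEnv(env) {
  const endpoint = env.CHROMIUM_WEBSOCKET;
  if (!endpoint) {
    return null; // a caller could fall back to launching a local browser
  }
  if (!/^wss?:\/\//.test(endpoint)) {
    throw new Error(`CHROMIUM_WEBSOCKET must be a ws:// or wss:// URL: ${endpoint}`);
  }
  return { browserWSEndpoint: endpoint };
}

// Hypothetical example: open a page in the remote Chrome and return its title.
async function fetchTitle(url, env = process.env) {
  const options = connectOptionsFromEnv(env);
  if (!options) throw new Error("CHROMIUM_WEBSOCKET is not set");
  // Loaded lazily so the helper above can be used without puppeteer installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.connect(options);
  try {
    const page = await browser.newPage();
    try {
      await page.goto(url);
      return await page.title();
    } finally {
      await page.close(); // avoid leaving tabs open in the shared Chrome
    }
  } finally {
    // disconnect() leaves the remote Chrome running; close() would terminate it.
    await browser.disconnect();
  }
}
```

Using `disconnect()` instead of `close()` matters in this setup because the Chrome process is started separately and is meant to outlive the script.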
Pattern: run headless Chrome in Docker and connect to it from another Docker container
- The puppeteer image ships with Chrome, so the steps are mostly the same as when Chrome runs on the host.
services:
  app:
    image: node:20-alpine
    volumes:
      - .:/app
    tty: true
  puppeteer:
    image: ghcr.io/puppeteer/puppeteer
    networks:
      - default
    cap_add: [SYS_ADMIN]
    command: >
      /bin/sh -c 'google-chrome --remote-debugging-port=9222 --remote-debugging-address=0.0.0.0 --headless --user-data-dir=/tmp --disable-gpu --enable-logging'
    extra_hosts:
      - "host.docker.internal:host-gateway"
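- Resolve the puppeteer service name to an IP with dig; after that the steps are the same as before. Connectivity was confirmed.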
docker compose exec app ash
/ # apk add curl bind-tools
OK: 22 MiB in 38 packages
/ # dig puppeteer
; <<>> DiG 9.18.24 <<>> puppeteer
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32473
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;puppeteer. IN A
;; ANSWER SECTION:
puppeteer. 600 IN A 172.30.0.2
;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Apr 09 11:57:23 UTC 2024
;; MSG SIZE rcvd: 52
/ # curl http://172.30.0.2:9222/json/version
{
"Browser": "HeadlessChrome/123.0.6312.86",
"Protocol-Version": "1.3",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/123.0.6312.86 Safari/537.36",
"V8-Version": "12.3.219.14",
"WebKit-Version": "537.36 (@9b72c47a053648d405376c5cf07999ed626728da)",
"webSocketDebuggerUrl": "ws://172.30.0.2:9222/devtools/browser/3eefaec7-61da-4604-9202-fafb77a88dff"
}
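The dig and curl steps can also be done from Node. The sketch below resolves the service name, queries /json/version by IP, and returns webSocketDebuggerUrl. All function names are hypothetical; it assumes Node 18 or later (for the global `fetch`) and the default port 9222 used in these notes.

```javascript
const dns = require("node:dns/promises");

// Build the /json/version URL from an already resolved IP address.
function versionUrl(ip, port = 9222) {
  const host = ip.includes(":") ? `[${ip}]` : ip; // bracket IPv6 literals
  return `http://${host}:${port}/json/version`;
}

// Pull the WebSocket endpoint out of a parsed /json/version response.
function extractWsEndpoint(version) {
  const endpoint = version.webSocketDebuggerUrl;
  if (typeof endpoint !== "string") {
    throw new Error("webSocketDebuggerUrl is missing from the response");
  }
  return endpoint;
}

// Resolve the service name (the job dig did by hand), then ask Chrome by IP
// so that the Host header is an IP address.
async function discoverWsEndpoint(serviceName, port = 9222) {
  const { address } = await dns.lookup(serviceName);
  const res = await fetch(versionUrl(address, port));
  if (!res.ok) throw new Error(`unexpected status ${res.status}`);
  return extractWsEndpoint(await res.json());
}
```

Inside the app container, `discoverWsEndpoint("puppeteer")` would be expected to return the same kind of ws:// URL as the curl output, which can then be passed as CHROMIUM_WEBSOCKET. The browser ID in that URL is not fixed, so looking it up at startup avoids copying it by hand.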
- Run the script. It worked here as well.
/ # cd app
/app # ls
compose.yml index.js node_modules package-lock.json package.json
/app # CHROMIUM_WEBSOCKET="ws://172.30.0.2:9222/devtools/browser/3eefaec7-61da-4604-9202-fafb77a88dff" node index.js
{
"urls": {
"https://example.com/": {
"status": 200
}
},
"technologies": [
{
"slug": "azure",
"name": "Azure",
"description": "Azure is a cloud computing service for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.",
"confidence": 100,
"version": null,
"icon": "Azure.svg",
"website": "https://azure.microsoft.com",
"cpe": null,
"categories": [
{
"id": 62,