To sum up:
We are trying to use https://github.com/internetarchive/brozzler with browserless/chrome and have trouble invoking the Page.navigate Chrome DevTools Protocol API function. We send various commands before that and they work as expected, but this specific command doesn't return anything and we get a websocket timeout after 30 seconds.
Details:
We use this common send_to_chrome function to send the various API commands listed below:
https://github.com/internetarchive/brozzler/blob/master/brozzler/browser.py#L328
[2022-09-13 09:43:59,849: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Network.enable","id":0}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Page.enable","id":1}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Console.enable","id":2}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Runtime.enable","id":3}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"ServiceWorker.enable","id":4}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"ServiceWorker.setForceUpdateOnPageLoad","id":5}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Network.setBlockedURLs","params":{"urls":["*google-analytics.com/analytics.js*","*google-analytics.com/ga.js*","*google-analytics.com/ga_exp.js*","*google-analytics.com/urchin.js*","*google-analytics.com/collect*","*google-analytics.com/r/collect*","*google-analytics.com/__utm.gif*","*google-analytics.com/gtm/js?*","*google-analytics.com/cx/api.js*","*cdn.ampproject.org/*/amp-analytics*.js"]},"id":6}
We even tried Browser.getVersion and got a valid response:
[2022-09-13 09:44:00,355: WARNING/ForkPoolWorker-1] {'id': 8, 'result': {'protocolVersion': '1.3', 'product': 'Chrome/89.0.4389.114', 'revision': '@1ea76e193b4fadb723bfea2a19a66c93a1bc0ca6', 'userAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36', 'jsVersion': '8.9.255.24'}}
But when we try Page.navigate, the websocket connection hangs and times out after 30 seconds.
[2022-09-13 09:44:00,356: INFO/ForkPoolWorker-1] navigating to page https://iskme.org/
[2022-09-13 09:44:00,356: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Page.navigate","params":{"url":"https://iskme.org/"},"id":9}
We also tried navigating to about:blank, which should return instantly, with the same results.
Some extra details in case it helps.
We tried connecting both with and without a proxy, with no difference:
ws://vbanos-dev.us.archive.org:3000/?--proxy-server=http://207.241.233.197:8000
ws://vbanos-dev.us.archive.org:3000/
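For reference, the two endpoints above differ only in the Chrome flag passed as a query parameter. A minimal sketch of building such a URL follows; `browserless_ws_url` is a hypothetical helper name, and the only flag format assumed is the `--proxy-server` one shown above.

```python
from urllib.parse import urlencode


def browserless_ws_url(base, flags=None):
    """Append Chrome flags to a browserless websocket endpoint as query parameters.

    Hypothetical helper: it only reproduces the URL shape shown in the thread.
    """
    if not flags:
        return base
    # Keep ':' and '/' unescaped so the proxy URL stays readable, as in the thread.
    query = urlencode(flags, safe=":/")
    return f"{base}?{query}"


if __name__ == "__main__":
    base = "ws://vbanos-dev.us.archive.org:3000/"
    print(browserless_ws_url(base))
    print(browserless_ws_url(base, {"--proxy-server": "http://207.241.233.197:8000"}))
```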
We run browserless/chrome like this:
sudo docker run \
-e "MAX_CONCURRENT_SESSIONS=1" \
-e "EXIT_ON_HEALTH_FAILURE=true" \
-e 'DISABLED_FEATURES=["pdfEndpoint", "contentEndpoint"]' \
-e "DEFAULT_IGNORE_HTTPS_ERRORS=true" \
-e "DEFAULT_HEADLESS=false" \
-e "DEFAULT_BLOCK_ADS=true" \
-e "PREBOOT_CHROME=true" \
-e "KEEP_ALIVE=true" \
-p 3000:3000 --restart always \
-d browserless/chrome:1.45-chrome-stable
We picked that version because it has Chrome 89, which we already use. We also tried with image browserless/chrome but we got the same results.
Any ideas would be welcome.
Thank you.
Possibly related to the things that puppeteer sends. I'll debug puppeteer and see what messages they send. Is there a branch you've got your changes on?
@vbanos do you happen to have some docker logs from browserless? You can set DEBUG=* in your env variables to print more logs.
I'm sorry but I don't have a branch with these changes.
We have another internal software package that imports Brozzler as a library and I did these experiments there.
You just need to make a websocket connection to the browser and send the following msg:
{"method":"Page.navigate","params":{"url":"https://iskme.org/"},"id":9}
If this is difficult for you, I will make a branch.
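A minimal sketch of that reproduction, assuming the third-party websocket-client package is installed. The helper names `cdp_message` and `navigate_once` are made up for illustration and are not part of Brozzler; the message format is the one shown in the logs above.

```python
import json


def cdp_message(method, params=None, msg_id=0):
    """Serialize one DevTools Protocol command in the compact form seen in the logs."""
    msg = {"method": method}
    if params is not None:
        msg["params"] = params
    msg["id"] = msg_id
    return json.dumps(msg, separators=(",", ":"))


def navigate_once(ws_url, page_url, msg_id=9, timeout=30):
    """Send Page.navigate and return the reply whose id matches msg_id.

    Each recv() waits up to `timeout` seconds; if nothing arrives in that
    window, websocket-client raises WebSocketTimeoutException.
    """
    # Assumption: websocket-client is installed (pip install websocket-client).
    import websocket

    ws = websocket.create_connection(ws_url, timeout=timeout)
    try:
        ws.send(cdp_message("Page.navigate", {"url": page_url}, msg_id))
        while True:
            reply = json.loads(ws.recv())
            # Events carry no "id"; skip them until the command reply shows up.
            if reply.get("id") == msg_id:
                return reply
    finally:
        ws.close()


if __name__ == "__main__":
    print(navigate_once("ws://vbanos-dev.us.archive.org:3000/", "https://iskme.org/"))
```

Running the script against the endpoint from this thread should either print the reply to message 9 or raise a timeout, which is the behaviour being reported.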
@joelgriffith I discovered that it works as expected with -e "DEFAULT_HEADLESS=true" (all earlier attempts were using false).
Are there any special requirements when using DEFAULT_HEADLESS=false? E.g. should I have a DISPLAY set?
For example, in other cases, we set export DISPLAY=:1; and run Xvnc4 :1 to create a virtual display that is used by the headful browser.
From our experience, a headful browser is more stable and faster than headless. We prefer that, but it's not set in stone; we could change it.
@vbanos no need to do anything special for headful, it's supported both as a docker parameter (which in this case will default to whatever value you set it to) and as a "just in time" parameter when you connect to the endpoint: https://www.browserless.io/docs/chrome-flags#setting-headless
We do use xvfb internally for headful sessions, and you can see how we launch it here: https://github.com/browserless/chrome/blob/master/start.sh#L16
EDIT: If you happen to have logs for when headless is defaulted to false, that'd help!
@joelgriffith please help me with a newbie question about logging.
I use option -e "DEBUG=-*" and I log in to the running container using bash.
Where can I see the logs? I can't find them and I can't find any relevant info in the docs. Thank you.
np -- I assume you're running it via docker, if so it'd be like:
docker run \
--rm \
-p 3000:3000 \
-e "DEBUG=*" \
browserless/chrome:latest
The DEBUG env variable is a nodeJS convention for essentially telling all the libraries and packages to print to console. Depending on how you've got docker set up, this will either print straight into your terminal session or, if docker is daemonized, retain the logs so you can read them with a second docker command (such as docker logs).
@vbanos just wanted to check in -- were you able to get this working with the above docker command?
@joelgriffith I focused on this task again and have the following question.
When I send a Page.navigate cmd to browserless/chrome, I see the following logs and then it stalls:
2022-09-30T11:19:06.446Z browserless:job W0PFDE5BYAKCVLLDL4A7JGNWTP2GW2Q8: Starting session.
2022-09-30T11:19:06.446Z browserless:job W0PFDE5BYAKCVLLDL4A7JGNWTP2GW2Q8: Proxying request to /devtools/browser route: ws://127.0.0.1:40849/devtools/browser/863aa496-fc09-48c3-9f1a-e38392626ff6.
2022-09-30T11:19:06.448Z puppeteer:protocol:RECV ◀ {"id":17,"result":{},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:06.448Z puppeteer:protocol:RECV ◀ {"id":16,"result":{},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:06.449Z puppeteer:protocol:RECV ◀ {"method":"Target.targetCreated","params":{"targetInfo":{"targetId":"4a4486b4-e461-44f2-8d5d-b3b7f5cca83e","type":"browser","title":"","url":"","attached":false,"canAccessOpener":false}}}
2022-09-30T11:19:06.449Z puppeteer:protocol:RECV ◀ {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"4a4486b4-e461-44f2-8d5d-b3b7f5cca83e","type":"browser","title":"","url":"","attached":true,"canAccessOpener":false}}}
2022-09-30T11:19:07.430Z puppeteer:protocol:RECV ◀ {"method":"Page.lifecycleEvent","params":{"frameId":"CEC97C9476943519C0621772B7BC30E6","loaderId":"C944D80AB2F22440F12A067FA9DCE8D5","name":"networkAlmostIdle","timestamp":22341248.432368},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:07.431Z puppeteer:protocol:RECV ◀ {"method":"Page.lifecycleEvent","params":{"frameId":"CEC97C9476943519C0621772B7BC30E6","loaderId":"C944D80AB2F22440F12A067FA9DCE8D5","name":"networkIdle","timestamp":22341248.432368},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
From my understanding, my Brozzler client code is waiting for a Chrome API message with method=Page.loadEventFired, but it never receives one, and this is why it stalls.
https://github.com/internetarchive/brozzler/blob/master/brozzler/browser.py#L239
Any ideas would be welcome. Thanks.
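To make the stall concrete, here is a small sketch that separates command replies (which carry an "id") from events (which carry only a "method") and checks whether Page.loadEventFired ever appears. The names `classify` and `saw_load_event` are hypothetical and are not Brozzler's actual code; the sample messages are shortened from the browserless logs above.

```python
import json


def classify(raw):
    """Return ("response", id) for a command reply, or ("event", method) otherwise."""
    msg = json.loads(raw)
    if "id" in msg:
        return ("response", msg["id"])
    return ("event", msg.get("method"))


def saw_load_event(raw_messages):
    """True if any message in the stream is a Page.loadEventFired event."""
    return any(classify(raw) == ("event", "Page.loadEventFired") for raw in raw_messages)


if __name__ == "__main__":
    # Shortened from the browserless logs in this thread.
    received = [
        '{"id":17,"result":{},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}',
        '{"method":"Page.lifecycleEvent","params":{"name":"networkAlmostIdle"}}',
        '{"method":"Page.lifecycleEvent","params":{"name":"networkIdle"}}',
    ]
    for raw in received:
        print(classify(raw))
    print("load event seen:", saw_load_event(received))
```

With the messages logged so far, only lifecycle events and command replies show up, so a client that blocks until Page.loadEventFired would keep waiting.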
In case it helps, the Brozzler logs when we try to navigate to a page are the following:
[2022-09-30 11:54:39,991: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Network.enable","id":0}
[2022-09-30 11:54:39,991: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Page.enable","id":1}
[2022-09-30 11:54:39,992: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Console.enable","id":2}
[2022-09-30 11:54:39,992: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Runtime.enable","id":3}
[2022-09-30 11:54:39,993: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"ServiceWorker.enable","id":4}
[2022-09-30 11:54:39,993: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"ServiceWorker.setForceUpdateOnPageLoad","id":5}
[2022-09-30 11:54:40,498: INFO/ForkPoolWorker-1] navigating to page https://cnn.com/
[2022-09-30 11:54:40,499: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Page.navigate","params":{"url":"https://cnn.com/"},"id":8}
@vbanos is there a way I can easily replicate this in the brozzler repo? I think there's something missing either in the launch flags or in the CDP domains that will require me to tinker with it a bit. Let me know if a call is easier.