Page.navigate Chrome devtools API command stalling

This issue has been tracked since 2022-09-13.

To sum up:
We are trying to use https://github.com/internetarchive/brozzler with browserless/chrome and have trouble invoking the Page.navigate command of the Chrome DevTools Protocol. The commands we send before it work as expected, but this specific command never returns anything and we get a websocket timeout after 30 seconds.

Details:
We use this common send_to_chrome function to send various API commands that are listed below:
https://github.com/internetarchive/brozzler/blob/master/brozzler/browser.py#L328

[2022-09-13 09:43:59,849: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Network.enable","id":0}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Page.enable","id":1}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Console.enable","id":2}
[2022-09-13 09:43:59,850: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Runtime.enable","id":3}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"ServiceWorker.enable","id":4}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"ServiceWorker.setForceUpdateOnPageLoad","id":5}
[2022-09-13 09:43:59,851: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Network.setBlockedURLs","params":{"urls":["*google-analytics.com/analytics.js*","*google-analytics.com/ga.js*","*google-analytics.com/ga_exp.js*","*google-analytics.com/urchin.js*","*google-analytics.com/collect*","*google-analytics.com/r/collect*","*google-analytics.com/__utm.gif*","*google-analytics.com/gtm/js?*","*google-analytics.com/cx/api.js*","*cdn.ampproject.org/*/amp-analytics*.js"]},"id":6}
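
For readers unfamiliar with the protocol, each line in the log above is a DevTools command: a method, optional params, and a client-chosen id that goes up by one per command. A minimal sketch of how such messages could be built follows. The class name CdpMessageBuilder is hypothetical, and this is not Brozzler's actual send_to_chrome code.

```python
import itertools
import json


class CdpMessageBuilder:
    """Hypothetical helper: builds DevTools protocol command messages."""

    def __init__(self):
        # Ids start at 0 and increase by one per command, as in the log above.
        self._ids = itertools.count()

    def build(self, method, params=None):
        msg = {"method": method}
        if params is not None:
            msg["params"] = params
        msg["id"] = next(self._ids)
        # Compact separators reproduce the formatting seen in the log.
        return json.dumps(msg, separators=(",", ":"))


builder = CdpMessageBuilder()
print(builder.build("Network.enable"))
print(builder.build("Page.enable"))
```

The browser echoes the id back in its reply, which is how a client pairs each response with the command that caused it.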

We even try Browser.getVersion and get a valid response:

[2022-09-13 09:44:00,355: WARNING/ForkPoolWorker-1] {'id': 8, 'result': {'protocolVersion': '1.3', 'product': 'Chrome/89.0.4389.114', 'revision': '@1ea76e193b4fadb723bfea2a19a66c93a1bc0ca6', 'userAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36', 'jsVersion': '8.9.255.24'}}

But when we try Page.navigate, the websocket connection hangs and times out after 30 sec.

[2022-09-13 09:44:00,356: INFO/ForkPoolWorker-1] navigating to page https://iskme.org/
[2022-09-13 09:44:00,356: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7fc13f1512e0>: {"method":"Page.navigate","params":{"url":"https://iskme.org/"},"id":9}

We also tried navigating to about:blank with the same results. This should return instantly.

Some extra details in case it helps.
We tried connecting both with and without a proxy, with no difference:

ws://vbanos-dev.us.archive.org:3000/?--proxy-server=http://207.241.233.197:8000
ws://vbanos-dev.us.archive.org:3000/
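
The two endpoints above differ only in the query string that carries the Chrome flag. A small sketch of building such URLs, assuming flags are passed as plain query parameters exactly as in the URLs shown; the function name browserless_ws_url is hypothetical:

```python
from urllib.parse import quote


def browserless_ws_url(host, port=3000, flags=None):
    """Build a ws:// URL, passing Chrome flags as query parameters."""
    base = f"ws://{host}:{port}/"
    if not flags:
        return base
    # Keep ':' and '/' unescaped so the result matches the URLs shown above.
    query = "&".join(
        f"{name}={quote(value, safe=':/')}" for name, value in flags.items()
    )
    return f"{base}?{query}"


print(browserless_ws_url("vbanos-dev.us.archive.org"))
print(browserless_ws_url(
    "vbanos-dev.us.archive.org",
    flags={"--proxy-server": "http://207.241.233.197:8000"},
))
```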

We run browserless/chrome like this:

sudo docker run \
            -e "MAX_CONCURRENT_SESSIONS=1" \
            -e "EXIT_ON_HEALTH_FAILURE=true" \
            -e 'DISABLED_FEATURES=["pdfEndpoint", "contentEndpoint"]' \
            -e "DEFAULT_IGNORE_HTTPS_ERRORS=true" \
            -e "DEFAULT_HEADLESS=false" \
            -e "DEFAULT_BLOCK_ADS=true" \
            -e "PREBOOT_CHROME=true" \
            -e "KEEP_ALIVE=true" \
            -p 3000:3000 --restart always \
            -d browserless/chrome:1.45-chrome-stable

We picked that version because it has Chrome 89 which we already use. We also tried with image browserless/chrome but we got the same results.

Any ideas would be welcome.
Thank you.

joelgriffith wrote this answer on 2022-10-02

Possibly related to the things that puppeteer sends. I'll debug puppeteer and see what messages they send. Is there a branch you've got your changes on?

@vbanos do you happen to have some docker logs from browserless? You can do DEBUG=* in your env variables to print more logs.

vbanos wrote this answer on 2022-10-02

I'm sorry but I don't have a branch with these changes.
We have another internal software package that imports Brozzler as a library and I did these experiments there.
You just need to make a websocket connection to the browser and send the following msg:
{"method":"Page.navigate","params":{"url":"https://iskme.org/"},"id":9}

If this is difficult for you, I will make a branch.
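
The repro described in this comment could be sketched roughly as follows. It assumes the third-party websocket-client package (the websocket._app.WebSocketApp objects in the Brozzler log come from it) and a browserless endpoint URL supplied by the caller. The helper names are hypothetical. The waiting logic is kept separate from the socket so it can be exercised without a browser.

```python
import json
import time


def wait_for_response(recv, msg_id, timeout=30.0):
    """Read messages via recv() until one carries the given id.

    Returns the decoded response, or None if `timeout` seconds pass first.
    `recv` is any callable returning one JSON string per call. The deadline
    is only checked between messages, so a blocking recv needs its own
    socket-level timeout as well.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        msg = json.loads(recv())
        if msg.get("id") == msg_id:
            return msg
        # Anything else is an event or a reply to another command.
    return None


def navigate(ws_url, url, msg_id=9):
    """Send Page.navigate over a fresh connection (needs websocket-client)."""
    import websocket  # third-party: pip install websocket-client

    ws = websocket.create_connection(ws_url, timeout=30)
    try:
        ws.send(json.dumps({"method": "Page.navigate",
                            "params": {"url": url}, "id": msg_id}))
        return wait_for_response(ws.recv, msg_id)
    finally:
        ws.close()
```

Calling navigate("ws://<host>:3000/", "about:blank") against a running container would then either return the reply or fail once the socket timeout elapses, which is the stall reported here.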

vbanos wrote this answer on 2022-10-02

@joelgriffith I discovered that it works as expected with -e "DEFAULT_HEADLESS=true" (all earlier attempts used false).

Are there any special requirements when using DEFAULT_HEADLESS=false? E.g. should I have a DISPLAY set?

For example, in other cases, we set export DISPLAY=:1; and we run Xvnc4 :1 to create a virtual display that is used by the headful browser.

In our experience, a headful browser is more stable and faster than a headless one. We prefer that, but it's not set in stone; we could change it.

joelgriffith wrote this answer on 2022-10-02

@vbanos no need to do anything special for headful, it's supported as both a docker parameter (which in this case will default to whatever value you set it to), or as a "just in time" parameter when you connect to the endpoint: https://www.browserless.io/docs/chrome-flags#setting-headless

We do use xvfb internally for headful sessions, and you can see how we launch it here: https://github.com/browserless/chrome/blob/master/start.sh#L16

EDIT: If you happen to have logs for when headless is defaulted to false that'd help!

vbanos wrote this answer on 2022-10-02

@joelgriffith please help me with a newbie question about logging.
I use the option -e "DEBUG=-*" and I log in to the running container using bash.
Where can I see the logs? I can't find them and I can't find any relevant info in the docs. Thank you.

joelgriffith wrote this answer on 2022-10-02

np -- I assume you're running it via docker, if so it'd be like:

docker run \
  --rm \
  -p 3000:3000 \
  -e "DEBUG=*" \
  browserless/chrome:latest

The DEBUG env variable is a Node.js convention for essentially telling all the libraries and packages to print to the console. Depending on how you've got docker set up, this will either print straight into your terminal session or, if docker is daemonized, retain the logs so you can retrieve them with a second docker command.

joelgriffith wrote this answer on 2022-10-02

@vbanos just wanted to check in -- were you able to get this working with the above docker command?

vbanos wrote this answer on 2022-10-02

Not tried it yet due to other tasks, I'll get back to you soon. Thank you for asking.

vbanos wrote this answer on 2022-10-02

@joelgriffith I focused on this task again and have the following question.

When I send a Page.navigate cmd to browserless/chrome, I see the following logs and then it stalls:

2022-09-30T11:19:06.446Z browserless:job W0PFDE5BYAKCVLLDL4A7JGNWTP2GW2Q8: Starting session.
2022-09-30T11:19:06.446Z browserless:job W0PFDE5BYAKCVLLDL4A7JGNWTP2GW2Q8: Proxying request to /devtools/browser route: ws://127.0.0.1:40849/devtools/browser/863aa496-fc09-48c3-9f1a-e38392626ff6.
2022-09-30T11:19:06.448Z puppeteer:protocol:RECV ◀ {"id":17,"result":{},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:06.448Z puppeteer:protocol:RECV ◀ {"id":16,"result":{},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:06.449Z puppeteer:protocol:RECV ◀ {"method":"Target.targetCreated","params":{"targetInfo":{"targetId":"4a4486b4-e461-44f2-8d5d-b3b7f5cca83e","type":"browser","title":"","url":"","attached":false,"canAccessOpener":false}}}
2022-09-30T11:19:06.449Z puppeteer:protocol:RECV ◀ {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"4a4486b4-e461-44f2-8d5d-b3b7f5cca83e","type":"browser","title":"","url":"","attached":true,"canAccessOpener":false}}}
2022-09-30T11:19:07.430Z puppeteer:protocol:RECV ◀ {"method":"Page.lifecycleEvent","params":{"frameId":"CEC97C9476943519C0621772B7BC30E6","loaderId":"C944D80AB2F22440F12A067FA9DCE8D5","name":"networkAlmostIdle","timestamp":22341248.432368},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}
2022-09-30T11:19:07.431Z puppeteer:protocol:RECV ◀ {"method":"Page.lifecycleEvent","params":{"frameId":"CEC97C9476943519C0621772B7BC30E6","loaderId":"C944D80AB2F22440F12A067FA9DCE8D5","name":"networkIdle","timestamp":22341248.432368},"sessionId":"EBEF83A0F0E20C99A1FF898F45969A5B"}

From my understanding, my Brozzler client code is waiting for a Chrome DevTools event with method=Page.loadEventFired, but it never receives it, and this is why it stalls.
https://github.com/internetarchive/brozzler/blob/master/brozzler/browser.py#L239

Any ideas would be welcome. Thanks.
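
To make the suspected cause concrete: the log excerpt above contains only Page.lifecycleEvent messages, so a client that waits specifically for Page.loadEventFired would keep waiting. A minimal sketch of that check follows. The function name saw_load_event is hypothetical, the sample messages are shortened versions of the two lifecycle events in the log, and this is not Brozzler's actual code.

```python
import json


def saw_load_event(raw_messages):
    """Return True if any message is a Page.loadEventFired event."""
    for raw in raw_messages:
        if json.loads(raw).get("method") == "Page.loadEventFired":
            return True
    return False


# Shortened versions of the two lifecycle events from the browserless log.
received = [
    '{"method":"Page.lifecycleEvent","params":{"name":"networkAlmostIdle"}}',
    '{"method":"Page.lifecycleEvent","params":{"name":"networkIdle"}}',
]
print(saw_load_event(received))  # prints False
```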

vbanos wrote this answer on 2022-10-02

In case it helps, the Brozzler logs when we try to navigate to a page are the following:

[2022-09-30 11:54:39,991: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Network.enable","id":0}
[2022-09-30 11:54:39,991: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Page.enable","id":1}
[2022-09-30 11:54:39,992: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Console.enable","id":2}
[2022-09-30 11:54:39,992: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Runtime.enable","id":3}
[2022-09-30 11:54:39,993: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"ServiceWorker.enable","id":4}
[2022-09-30 11:54:39,993: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"ServiceWorker.setForceUpdateOnPageLoad","id":5}
[2022-09-30 11:54:40,498: INFO/ForkPoolWorker-1] navigating to page https://cnn.com/
[2022-09-30 11:54:40,499: DEBUG/ForkPoolWorker-1] sending message to <websocket._app.WebSocketApp object at 0x7f9531bcea60>: {"method":"Page.navigate","params":{"url":"https://cnn.com/"},"id":8}

joelgriffith wrote this answer on 2022-10-02

Right off the bat it is interesting that the IDs aren't sequential. We should be seeing 6 and 7 someplace...
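
A quick way to check a Brozzler log for skipped command ids is sketched below. It assumes the "sending message to <...>: {json}" line format shown in the logs in this thread; the function names are hypothetical.

```python
import json
import re

# Matches the JSON payload at the end of a Brozzler "sending message" log line.
_PAYLOAD = re.compile(r"sending message to <[^>]*>: (\{.*\})\s*$")


def sent_ids(log_lines):
    """Extract the command ids from the log lines that record a sent message."""
    ids = []
    for line in log_lines:
        match = _PAYLOAD.search(line)
        if match:
            ids.append(json.loads(match.group(1))["id"])
    return ids


def missing_ids(ids):
    """Return the ids absent between the smallest and largest id seen."""
    if not ids:
        return []
    present = set(ids)
    return [i for i in range(min(ids), max(ids) + 1) if i not in present]


# The ids visible in the Brozzler log from 2022-09-30.
print(missing_ids([0, 1, 2, 3, 4, 5, 8]))  # prints [6, 7]
```

A gap only shows that some commands were not logged at this level. It does not by itself show that they were never sent.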

joelgriffith wrote this answer on 2022-10-05

@vbanos is there a way I can easily replicate this in the brozzler repo? I think there's something missing either in the launch flags or in the CDP domains that will require me to tinker with it a bit. Let me know if a call is easier.

