I discovered a minor problem with the parser, which handles line-breaks differently than a real browser DOM.
A quick example to demonstrate the browser DOM behavior:
As you can see, browsers convert CRLF into LF when parsing text. (I used a test script to confirm that this is the behavior in both the latest and very old versions of Chrome, Edge, Firefox, Opera, Safari and IE.)
To demonstrate the behavior in htmlparser2:
import { parseDocument } from "htmlparser2";

// Note: the <script> element is never closed here, so its text node runs to the end of the input.
const doc = parseDocument(
    "<html>" +
        "<head>" +
        "<script>" +
        "function hello(){}\r\n" +
        "hello();" +
        "</head>" +
        "</html>"
);

// Path: html -> head -> script -> text node
console.log(JSON.stringify(doc.children[0].children[0].children[0].children[0].data));
The output is:
"function hello(){}\r\nhello();</head></html>"
                   ^^^^
This is of course easy to work around with e.g. string.replace(/\r\n/g, "\n") on the document before parsing.
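The replace call above only covers CRLF pairs. As a minimal sketch, assuming the goal is to mirror the newline normalization the HTML Standard applies to the input stream (each CRLF pair becomes LF, then any remaining lone CR becomes LF), a pre-processing helper could look like this. The name normalizeNewlines is hypothetical and is not part of htmlparser2.

```typescript
// Hypothetical helper (not an htmlparser2 API): normalize newlines before parsing.
// The pattern matches a CR optionally followed by an LF, so both CRLF pairs
// and lone CR characters are replaced by a single LF.
function normalizeNewlines(input: string): string {
    return input.replace(/\r\n?/g, "\n");
}

const source = "function hello(){}\r\nhello();";
console.log(JSON.stringify(normalizeNewlines(source)));
// prints "function hello(){}\nhello();"
```

The normalized string would then be passed to the parser in place of the raw document.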
But it isn't strictly compatible with standard HTML/DOM behavior, and it cost me a lot of debugging hours. For example, if somebody were to checksum the contents of text nodes, or work with browser automation scripts whose test expectations depend on the contents of text nodes, this is going to cause problems.
I figured this was worth reporting to save somebody else from the same headache.
Thanks for the report! This is unfortunately one of the scenarios where htmlparser2 prioritises performance over correctness, and in most cases this is a valid choice.
As always, parse5 is the better choice when trying to parse markup exactly the way a browser would. So if you actually want to checksum parsed data, have a look at that!
Unfortunately, my workaround also breaks the start/end indices, which won't be accurate if I change the input before parsing.
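One possible way around the index problem, sketched here under the assumption that only CRLF pairs are collapsed (as in the replace workaround), is to record where each pair was collapsed and shift indices back afterwards. The names NormalizedInput, normalizeWithMap and toOriginalIndex are hypothetical; nothing here is an htmlparser2 API.

```typescript
// Hypothetical sketch: normalize CRLF to LF and keep enough information
// to translate indices in the normalized text back to the original input.
interface NormalizedInput {
    text: string;
    // Offsets, in the normalized text, of each LF that replaced a CRLF pair.
    collapsedAt: number[];
}

function normalizeWithMap(input: string): NormalizedInput {
    const collapsedAt: number[] = [];
    let removed = 0; // number of CR characters dropped so far
    const text = input.replace(/\r\n/g, (_match: string, offset: number) => {
        // offset is where the CR sat in the original input.
        collapsedAt.push(offset - removed);
        removed += 1;
        return "\n";
    });
    return { text, collapsedAt };
}

function toOriginalIndex(map: NormalizedInput, index: number): number {
    // Every pair collapsed strictly before the index shifts it right by one.
    // An index that lands exactly on a collapsed LF maps to the original CR.
    let shift = 0;
    for (const pos of map.collapsedAt) {
        if (pos >= index) break;
        shift += 1;
    }
    return index + shift;
}

const mapped = normalizeWithMap("ab\r\ncd");
console.log(mapped.text === "ab\ncd");    // prints true
console.log(toOriginalIndex(mapped, 3));  // prints 4 (the "c" in the original)
```

The linear scan keeps the sketch short; a binary search over collapsedAt would be the obvious refinement for large documents.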
For what it's worth, the DOM itself gets around the performance problem by offering the normalize method, which lets you do this kind of normalization on-demand, avoiding the trade-off between performance and correctness.
Owner Name | fb55 |
Repo Name | htmlparser2 |
Full Name | fb55/htmlparser2 |
Language | TypeScript |
Created Date | 2011-08-27 |
Updated Date | 2023-03-19 |
Star Count | 3793 |
Watcher Count | 50 |
Fork Count | 370 |
Issue Count | 4 |