Non-standard handling of `crlf` line breaks

This issue has been tracked since 2021-10-28.

I discovered a minor problem with the parser: it handles line breaks differently from a real browser DOM.

A quick example to demonstrate the browser DOM behavior:

[image: screenshot of the browser DOM behavior]

As you can see, browsers convert crlf into lf when parsing text. (I used a test script to confirm that this is the behavior in both the latest and very old versions of Chrome, Edge, Firefox, Opera, Safari and IE.)

To demonstrate the behavior in htmlparser2:

import { parseDocument } from "htmlparser2";

const doc = parseDocument(
  "<html>" +
  "<head>" +
  "<script>" +
  "function hello(){}\r\n" +
  "hello();" +
  "</script>" +
  "</head>" +
  "</html>"
);

console.log(JSON.stringify(doc.children[0].children[0].children[0].children[0].data));
//                             ^ html      ^ head      ^ script    ^ text node

The output is:

"function hello(){}\r\nhello();"
                   ^^^^

This is, of course, easy to work around, e.g. by calling `input.replace(/\r\n/g, "\n")` on the document before parsing.
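A minimal sketch of that workaround, for illustration only. The helper name `normalizeLineBreaks` is hypothetical and not part of htmlparser2. It also converts a lone `\r` to `\n`, which goes slightly beyond the one-liner above but matches how the HTML specification normalizes newlines in the input stream.

```typescript
// Hypothetical helper (not part of htmlparser2): turn CRLF and lone CR into LF,
// mirroring the newline normalization the HTML spec applies to the input stream.
function normalizeLineBreaks(input: string): string {
  return input.replace(/\r\n?/g, "\n");
}

const rawScript = "function hello(){}\r\nhello();";
const normalizedScript = normalizeLineBreaks(rawScript);

console.log(JSON.stringify(normalizedScript));
// → "function hello(){}\nhello();"
```

The normalized string would then be passed to the parser in place of the original input.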

But it isn't strictly compatible with standard HTML/DOM behavior, and it cost me a lot of debugging hours. For example, if somebody were to checksum the contents of text nodes, or work with browser automation scripts whose test expectations depend on the contents of text nodes, this is going to cause problems.

I figured this was worth reporting to save somebody else from the same headache.

fb55 wrote this answer on 2021-10-28

Thanks for the report! Unfortunately, this is one of the scenarios where htmlparser2 prioritises performance over correctness, and in most cases this is a valid choice.

As always, parse5 is the better choice when trying to parse markup exactly the way a browser would. So if you actually want to checksum parsed data, have a look at that!

mindplay-dk wrote this answer on 2021-10-29

parse5 doesn't support XML (XHTML5), so that's not an option. But I will stick with my workaround then. 🙂

mindplay-dk wrote this answer on 2021-12-06

Unfortunately, my workaround also breaks the start/end indices, which won't be accurate if I change the input before parsing.
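One possible way to keep the pre-parse workaround and still recover original positions is to record where each `\r` was dropped and map indices back afterwards. This is only a sketch: `normalizeWithOffsets` and `toOriginal` are hypothetical names, and it assumes the indices reported by the parser are plain character offsets into the string it was given.

```typescript
// Hypothetical helper: collapse CRLF to LF and remember where each collapse happened,
// so an offset into the normalized text can be mapped back to the original input.
function normalizeWithOffsets(input: string): {
  text: string;
  toOriginal: (index: number) => number;
} {
  // Positions, in the normalized text, of every LF that replaced a CRLF.
  const collapsed: number[] = [];
  const text = input.replace(/\r\n/g, (_match: string, offset: number) => {
    // `offset` is relative to the original input; subtract the CRs dropped so far.
    collapsed.push(offset - collapsed.length);
    return "\n";
  });

  const toOriginal = (index: number): number => {
    // Each CRLF collapsed strictly before `index` shifts the original offset by one.
    // An index that lands on a collapsed line break maps to its "\r".
    let shift = 0;
    for (const pos of collapsed) {
      if (pos < index) shift++;
      else break;
    }
    return index + shift;
  };

  return { text, toOriginal };
}

const mapped = normalizeWithOffsets("one\r\ntwo\r\nthree");
console.log(mapped.toOriginal(mapped.text.indexOf("three")));
// → 10
```

The linear scan keeps the sketch short; for large inputs with many line breaks, a binary search over `collapsed` would be the obvious refinement.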

For what it's worth, the DOM itself gets around the performance problem by offering the normalize method, which lets you do this kind of normalization on-demand, avoiding the trade-off between performance and correctness.
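An on-demand approach along those lines could also be approximated after parsing, by walking the tree and rewriting text data in place. Because the input string is left untouched, any start/end indices would still refer to the original document. The sketch below is an assumption-laden illustration: `NodeLike` is a hypothetical structural type (text-like nodes carry `data`, containers carry `children`), and `normalizeTextNodes` is not an htmlparser2 API. Note that it touches every node with a `data` string, which may include comments.

```typescript
// Hypothetical minimal node shape, assumed to mirror the parsed tree:
// text-like nodes carry `data`, container nodes carry `children`.
interface NodeLike {
  data?: string;
  children?: NodeLike[];
}

// Hypothetical helper: rewrite CRLF to LF in every node's `data`, in place.
function normalizeTextNodes(node: NodeLike): void {
  if (typeof node.data === "string") {
    node.data = node.data.replace(/\r\n/g, "\n");
  }
  for (const child of node.children ?? []) {
    normalizeTextNodes(child);
  }
}

// Stand-in for a parsed document: root > element > text node.
const tree: NodeLike = {
  children: [{ children: [{ data: "function hello(){}\r\nhello();" }] }],
};

normalizeTextNodes(tree);
console.log(JSON.stringify(tree.children?.[0].children?.[0].data));
// → "function hello(){}\nhello();"
```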

