WordPress/wp-includes/html-api
dmsnell 5b3b3f7df2 HTML API: Add normalize() to give us the HTML we always wanted.
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all of these inputs, but downstream parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.
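As a rough illustration of the two entry points described above, the following sketch calls `normalize()` as a static wrapper and `serialize()` on a fragment processor. It assumes a WordPress build that includes this patch is loaded; the helper names `example_normalize_html()` and `example_serialize_html()` and the sample input are hypothetical, and the nullable return handling is an assumption about how unsupported markup is reported.

```php
<?php
// Sketch only: assumes a WordPress build containing this patch is loaded.
// The helper names below are hypothetical and not part of WordPress.

/**
 * Normalizes an HTML fragment via the static convenience wrapper.
 * Returns null when the HTML API is unavailable or cannot handle the input.
 */
function example_normalize_html( string $html ): ?string {
	if ( ! class_exists( 'WP_HTML_Processor' ) ) {
		return null; // HTML API not loaded (e.g. running outside WordPress).
	}

	// normalize() constructs a fragment parser and immediately serializes it.
	return WP_HTML_Processor::normalize( $html );
}

/**
 * Does the same in two explicit steps: create a fragment parser, then serialize.
 */
function example_serialize_html( string $html ): ?string {
	if ( ! class_exists( 'WP_HTML_Processor' ) ) {
		return null;
	}

	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return null; // The fragment parser could not be created.
	}

	return $processor->serialize();
}

// Hypothetical input: unquoted and duplicate attributes, missing closing tags.
$messy = "<p class=intro class='dup'>One<p>Two";
$clean = example_normalize_html( $messy );

echo null === $clean ? "Could not normalize.\n" : $clean . "\n";
```

Checking for `null` before using the result keeps callers from concatenating a failed normalization into other markup.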

Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not able to provide the full safety that will eventually appear with `set_inner_html()`.

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in https://github.com/wordpress/wordpress-develop/pull/7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.

Built from https://develop.svn.wordpress.org/trunk@59076


git-svn-id: http://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd
2024-09-20 22:32:17 +00:00
..
class-wp-html-active-formatting-elements.php HTML API: Add missing tags in IN BODY insertion mode to HTML Processor. 2024-07-22 22:24:15 +00:00
class-wp-html-attribute-token.php HTML API: Track spans of text with (offset, length) instead of (start, end). 2023-12-10 13:19:28 +00:00
class-wp-html-decoder.php HTML API: Add missing @global tag on HTML Decoder. 2024-09-02 20:55:14 +00:00
class-wp-html-doctype-info.php HTML API: Parse DOCTYPE tokens and set HTML parser mode accordingly. 2024-08-23 14:55:15 +00:00
class-wp-html-open-elements.php HTML API: Only examine HTML nodes in pop_until() in stack of open elements. 2024-09-04 19:25:14 +00:00
class-wp-html-processor-state.php HTML API: Respect document compat mode when handling CSS class names. 2024-09-04 04:34:15 +00:00
class-wp-html-processor.php HTML API: Add normalize() to give us the HTML we always wanted. 2024-09-20 22:32:17 +00:00
class-wp-html-span.php HTML API: Add PHP type annotations. 2024-07-19 23:44:16 +00:00
class-wp-html-stack-event.php HTML API: Add PHP type annotations. 2024-07-19 23:44:16 +00:00
class-wp-html-tag-processor.php HTML API: Add normalize() to give us the HTML we always wanted. 2024-09-20 22:32:17 +00:00
class-wp-html-text-replacement.php HTML API: Add PHP type annotations. 2024-07-19 23:44:16 +00:00
class-wp-html-token.php HTML API: Add support for SVG and MathML (Foreign content) 2024-08-08 07:25:15 +00:00
class-wp-html-unsupported-exception.php HTML API: Add context to Unsupported_Exception class for improved debugging. 2024-07-12 22:29:13 +00:00
html5-named-character-references.php Introduce Token Map: An optimized static translation class. 2024-05-23 19:56:08 +00:00