Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC via the `normalizer_normalize()` PHP function, which is available when the recommended `intl` PHP extension is installed.

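This fixes an issue where characters in NFD form (a base letter followed by combining marks) were not properly sanitized, because `remove_accents()` works on precomposed (NFC) characters. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).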
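
To illustrate the difference between the two forms, here is a minimal sketch rather than the actual WordPress code. `demo_strip_accents()` and its two-entry replacement table are hypothetical stand-ins for `remove_accents()` and its full character map; the `intl` guard mirrors the one added in this changeset.

```php
<?php
// Hypothetical helper for illustration only; this is not WordPress's remove_accents().
function demo_strip_accents( $text ) {
	// Same guard as the changeset: normalize only when the intl extension is loaded.
	if ( function_exists( 'normalizer_normalize' ) ) {
		if ( ! normalizer_is_normalized( $text, Normalizer::FORM_C ) ) {
			$normalized = normalizer_normalize( $text, Normalizer::FORM_C );
			// normalizer_normalize() returns false on failure; keep the input in that case.
			if ( false !== $normalized ) {
				$text = $normalized;
			}
		}
	}

	// Tiny stand-in for the real table, keyed on precomposed (NFC) characters.
	return strtr( $text, array( "\u{00E9}" => 'e', "\u{00E0}" => 'a' ) );
}

$nfc = "caf\u{00E9}";  // "é" as a single code point, U+00E9.
$nfd = "cafe\u{0301}"; // "e" followed by the combining acute accent, U+0301.

var_dump( $nfc === $nfd );             // The two strings render alike but differ byte for byte.
echo demo_strip_accents( $nfc ), "\n"; // The precomposed "é" matches the table directly.
echo demo_strip_accents( $nfd ), "\n"; // Matches only after NFC normalization, so this needs intl.
```

Without the `intl` extension the guard is skipped, and the NFD string passes through with its combining accent intact, which is the behavior the changeset describes as the original issue.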

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.

Built from https://develop.svn.wordpress.org/trunk@53754


git-svn-id: http://core.svn.wordpress.org/trunk@53313 1a063a9b-81f0-0310-95a4-ce76da25c4cd
audrasjb 2022-07-21 21:11:12 +00:00
parent 0bbca2a3ab
commit f7921555ca
2 changed files with 11 additions and 1 deletions

@@ -1584,6 +1584,7 @@ function utf8_uri_encode( $utf8_string, $length = 0, $encode_ascii_characters =
  * @since 4.8.0 Added locale support for `bs_BA`.
  * @since 5.7.0 Added locale support for `de_AT`.
  * @since 6.0.0 Added the `$locale` parameter.
+ * @since 6.1.0 Added Unicode NFC encoding normalization support.
  *
  * @param string $string Text that might have accent characters.
  * @param string $locale Optional. The locale to use for accent removal. Some character
@@ -1597,6 +1598,15 @@ function remove_accents( $string, $locale = '' ) {
 	}
 	if ( seems_utf8( $string ) ) {
+		// Unicode sequence normalization from NFD (Normalization Form Decomposed)
+		// to NFC (Normalization Form [Pre]Composed), the encoding used in this function.
+		if ( function_exists( 'normalizer_normalize' ) ) {
+			if ( ! normalizer_is_normalized( $string, Normalizer::FORM_C ) ) {
+				$string = normalizer_normalize( $string, Normalizer::FORM_C );
+			}
+		}
 		$chars = array(
 			// Decompositions for Latin-1 Supplement.
 			'ª' => 'a',

@@ -16,7 +16,7 @@
  *
  * @global string $wp_version
  */
-$wp_version = '6.1-alpha-53753';
+$wp_version = '6.1-alpha-53754';
 /**
  * Holds the WordPress DB revision, increments when changes are made to the WordPress DB schema.