Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.
Unicode String Replace At Issue
Let's examine the Unicode string "🎶🔥é-":
🎶 Multiple Musical Notes (U+1F3B6) - 4 bytes in UTF-16 (a surrogate pair)
🔥 Fire (U+1F525) - 4 bytes in UTF-16 (a surrogate pair)
é Latin Small Letter E with Acute (U+00E9) - 2 bytes in UTF-16
- Hyphen-Minus (U+002D) - 2 bytes in UTF-16
😊 Smiling Face with Smiling Eyes (U+1F60A) - 4 bytes in UTF-16 (a surrogate pair); this is the replacement character
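🎶🔥é- has a length of 6, but there are ONLY 4 characters! Why isn't the length 4?
🎶 and 🔥 are not double-byte characters: they are code points above U+FFFF (U+10000 and up), so each one occupies two UTF-16 code units (4 bytes) and contributes 2 to the string's length.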
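The count of 6 can be checked directly. The following is a minimal sketch; the class and method names (LengthDemo, CodeUnits) are made up for illustration, and the string is written with escapes so that it does not depend on the source file's encoding.

```csharp
using System;
using System.Linq;

public static class LengthDemo
{
    // U+1F3B6, U+1F525, U+00E9, U+002D written as escapes.
    public const string Sample = "\U0001F3B6\U0001F525\u00E9-";

    // Returns each UTF-16 code unit of the string as a 4-digit hex value.
    public static string[] CodeUnits(string s)
    {
        return s.Select(c => ((int)c).ToString("X4")).ToArray();
    }
}
```

Here LengthDemo.Sample.Length is 6, and CodeUnits(LengthDemo.Sample) yields D83C, DFB6, D83D, DD25, 00E9, 002D: two code units for each emoji and one each for é and the hyphen.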
The examples below try to replace the space after the last character '-' (position 4) in 🎶🔥é- with a substitute (😊), using the most common techniques seen online.
The length is 6 because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as a single 2-byte code unit.
Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
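The pairing can be worked out by hand with the standard UTF-16 arithmetic: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate. The helper below is a sketch with a hypothetical name (ToSurrogates); in real code, char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a code point above U+FFFF into its UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return (high, low);
    }
}
```

For 🎶 (U+1F3B6) this gives the pair D83C DFB6, and for 🔥 (U+1F525) it gives D83D DD25, which are the code units that make up the first four positions of the sample string.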
To correctly count the characters in a string that may contain code points above U+FFFF, you can use the StringInfo class (from System.Globalization), which counts text elements rather than UTF-16 code units.
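A minimal sketch of that count, using StringInfo.LengthInTextElements; the wrapper class and method names (CountDemo, CountTextElements) are made up for illustration.

```csharp
using System.Globalization;

public static class CountDemo
{
    // Counts text elements (what a reader sees as characters),
    // not UTF-16 code units.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For the sample string "\U0001F3B6\U0001F525\u00E9-" (🎶🔥é-), CountTextElements returns 4, while the string's Length property returns 6.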
Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.
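As a rough illustration of the failure mode (a sketch only, not any of the implementations referred to above, and not UnicodeReplaceAt itself), compare a typical code-unit-based replacement with one that indexes by text element. Both method names are hypothetical. The second one relies on StringInfo.ParseCombiningCharacters, which returns the starting code-unit index of each text element.

```csharp
using System;
using System.Globalization;

public static class ReplaceAtSketch
{
    // Naive: treats index as a UTF-16 code unit offset, so it can
    // cut a surrogate pair in half.
    public static string NaiveReplaceAt(string s, int index, string replacement)
    {
        return s.Remove(index, 1).Insert(index, replacement);
    }

    // Text-element aware: index counts text elements, so a whole
    // surrogate pair is replaced as one unit.
    public static string TextElementReplaceAt(string s, int index, string replacement)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        if (index < 0 || index >= starts.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        int start = starts[index];
        // The element ends where the next one starts, or at the end of the string.
        int end = index + 1 < starts.Length ? starts[index + 1] : s.Length;
        return s.Substring(0, start) + replacement + s.Substring(end);
    }
}
```

Replacing position 1 of 🎶🔥é- with 😊 shows the difference: TextElementReplaceAt returns 🎶😊é-, whereas NaiveReplaceAt removes the low surrogate of 🎶 and inserts 😊 in the middle of the pair, leaving an unpaired surrogate and a string of 7 code units.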
I have optimized that method into UnicodeReplaceAtFastest, the fastest implementation so far; it beats the old UnicodeReplaceAt by 2 ms on average.