Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.
Unicode String Replace At Issue
Let's examine the Unicode string "🎶🔥é-" (byte sizes below are for UTF-16):
🎶 Unicode Character 'MULTIPLE MUSICAL NOTES' (U+1F3B6) - 4-byte Unicode character
🔥 Fire Emoji U+1F525 - 4-byte Unicode character
é Latin Small Letter e with Acute U+00E9 - 2-byte Unicode character
- Unicode Character 'HYPHEN-MINUS' (U+002D) - 2-byte Unicode character
😊 Smiling Face with Smiling Eyes Emoji U+1F60A - 4-byte Unicode character (replacement)
"🎶🔥é-" has a Length of 6, but there are ONLY 4 characters! Why is the length not 4?
🎶 and 🔥 are supplementary Unicode characters (U+10000 and above); each takes two UTF-16 code units, so each contributes 2 to the length.
The examples below take "🎶🔥é-" and try to replace the position after the last character '-' (position 4) with a substitute, using the most common techniques seen online.
The length is 6 because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as a single 2-byte code unit.
Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
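The surrogate ranges above follow from simple arithmetic on the code point. Here is a minimal sketch of that arithmetic; `SurrogateDemo` and `ToSurrogatePair` are hypothetical names made up for this illustration, while `char.ConvertFromUtf32`, `char.IsHighSurrogate` and `char.IsLowSurrogate` are standard .NET calls.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 high and low surrogate.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        // U+1F3B6 MULTIPLE MUSICAL NOTES
        var (high, low) = ToSurrogatePair(0x1F3B6);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // D83C DFB6

        // The built-in conversion yields the same two code units.
        string s = char.ConvertFromUtf32(0x1F3B6);
        Console.WriteLine(s.Length); // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]) && char.IsLowSurrogate(s[1])); // True
    }
}
```

This is why `Length` reports 2 for a single emoji: the string stores two `char` values, and neither one is a character on its own.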
To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).
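As a quick illustration of the difference, the sketch below compares `string.Length` with `StringInfo.LengthInTextElements` for the sample string. The class and method names (`TextElementCountDemo`, `CountTextElements`) are placeholders invented for this example, and the string is written with escape sequences so the source file encoding does not matter.

```csharp
using System;
using System.Globalization;

public static class TextElementCountDemo
{
    // Hypothetical helper: counts text elements (user-perceived characters)
    // instead of UTF-16 code units.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        // U+1F3B6, U+1F525, U+00E9, U+002D
        string sample = "\U0001F3B6\U0001F525\u00E9-";

        Console.WriteLine(sample.Length);             // 6 (UTF-16 code units)
        Console.WriteLine(CountTextElements(sample)); // 4 (text elements)
    }
}
```

Note that how `StringInfo` groups more complex sequences (for example emoji joined with zero-width joiners) depends on the .NET version, so counts for such strings may differ between runtimes.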
Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.
I have optimized this method, UnicodeReplaceAtFastest, to be the fastest implementation so far; it beats the old UnicodeReplaceAt by 2 ms (on average).
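To make the failure mode concrete, here is a rough sketch that contrasts a typical code-unit-based replace with one that indexes by text element. It is not the UnicodeReplaceAtFastest method mentioned above; `NaiveReplaceAt` and `ReplaceAtTextElement` are hypothetical names written for this illustration, built on the standard `StringInfo.ParseCombiningCharacters` call.

```csharp
using System;
using System.Globalization;

public static class ReplaceAtDemo
{
    // Typical approach seen online: treats the index as a UTF-16 code unit index.
    public static string NaiveReplaceAt(string s, int index, string replacement) =>
        s.Remove(index, 1).Insert(index, replacement);

    // Sketch of a text-element-aware replace (assumed design, not the author's code).
    public static string ReplaceAtTextElement(string s, int index, string replacement)
    {
        // Start index (in code units) of every text element in s.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        if (index < 0 || index >= starts.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        int start = starts[index];
        int end = index + 1 < starts.Length ? starts[index + 1] : s.Length;
        return s.Substring(0, start) + replacement + s.Substring(end);
    }

    public static void Main()
    {
        string sample = "\U0001F3B6\U0001F525\u00E9-"; // U+1F3B6, U+1F525, U+00E9, '-'
        string smile = "\U0001F60A";                   // U+1F60A

        // Replace the second character (U+1F525) with the smiling face.
        string good = ReplaceAtTextElement(sample, 1, smile);
        Console.WriteLine(good == "\U0001F3B6\U0001F60A\u00E9-"); // True

        // The naive version removes only the low surrogate of the first emoji,
        // leaving an orphaned high surrogate in the result.
        string bad = NaiveReplaceAt(sample, 1, smile);
        Console.WriteLine(bad == good);                 // False
        Console.WriteLine(char.IsHighSurrogate(bad[0])
            && !char.IsLowSurrogate(bad[1]));           // True
    }
}
```

This sketch favors clarity over speed: it parses the whole string on every call, so it makes no claim about matching the timings quoted above.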