Saturday, August 24, 2019

C-Sharp: A Faster Unicode ReplaceAt Method That Works with Surrogate Pairs and 4-Byte Unicode Characters

Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.

Unicode String Replace At Issue

Let's examine the Unicode string "🎶🔥é-" (the byte sizes listed are for UTF-16, the encoding .NET strings use):

🎶 Unicode Character 'MULTIPLE MUSICAL NOTES' (U+1F3B6) - 4-byte Unicode character
🔥 Fire Emoji U+1F525 - 4-byte Unicode character
é Latin Small Letter e with Acute U+00E9 - 2-byte Unicode character
- Unicode Character 'HYPHEN-MINUS' (U+002D) - 2-byte Unicode character

😊 Smiling Face with Smiling Eyes Emoji U+1F60A - 4-byte Unicode character (replacement)



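🎶🔥é- has a Length of 6, but there are ONLY 4 characters! Why not len=4?
🎶 and 🔥 are code points above U+FFFF, so each is stored as two UTF-16 code units and counts as 2 toward Length.
Take 🎶🔥é- followed by a space, and try to replace that space after the last character '-' (position 4) with a substitute. The most common techniques seen online index by UTF-16 code unit, so position 4 refers to 'é' rather than to the space.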
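The mismatch can be reproduced with a minimal sketch. NaiveReplaceAt is a hypothetical helper written for illustration in the Remove/Insert style, not one of the implementations from this post, and the trailing space on the test string is an assumption based on the description above.

```csharp
using System;

public static class NaiveReplaceDemo
{
    // Hypothetical helper in the style commonly seen online:
    // 'index' is treated as a UTF-16 code-unit offset, not a character position.
    public static string NaiveReplaceAt(string input, int index, string replacement)
    {
        return input.Remove(index, 1).Insert(index, replacement);
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9-";   // 🎶🔥é-
        Console.WriteLine(s.Length);                 // 6, not 4

        // Assumption: the test string carries a trailing space at character position 4.
        string padded = s + " ";
        string result = NaiveReplaceAt(padded, 4, "\U0001F60A"); // 😊

        // The space survives and 'é' is gone: index 4 was a code-unit offset.
        Console.WriteLine(result == "\U0001F3B6\U0001F525\U0001F60A- "); // True
    }
}
```

The helper removes 'é' because the two emoji in front of it occupy code-unit offsets 0 through 3, which pushes 'é' to offset 4.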

This is because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as a single 2-byte code unit.


Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
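To make the pairing concrete, here is a small sketch that inspects the two code units behind 🎶. StartsSurrogatePair is a hypothetical helper name; char.IsSurrogatePair and char.ConvertToUtf32 are standard .NET methods.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: true when a valid high/low surrogate pair starts at 'index'.
    public static bool StartsSurrogatePair(string s, int index)
    {
        return index + 1 < s.Length && char.IsSurrogatePair(s[index], s[index + 1]);
    }

    public static void Main()
    {
        string note = "\U0001F3B6"; // 🎶 U+1F3B6

        Console.WriteLine(note.Length);                                // 2
        Console.WriteLine(((int)note[0]).ToString("X4"));              // D83C (High Surrogate)
        Console.WriteLine(((int)note[1]).ToString("X4"));              // DFB6 (Low Surrogate)
        Console.WriteLine(StartsSurrogatePair(note, 0));               // True

        // Recombine the pair into the original code point.
        Console.WriteLine(char.ConvertToUtf32(note, 0).ToString("X")); // 1F3B6
    }
}
```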

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization), which counts text elements rather than UTF-16 code units.
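As a short sketch of that counting approach: CountTextElements is a hypothetical wrapper name, while StringInfo.LengthInTextElements and StringInfo.ParseCombiningCharacters are the standard APIs it relies on.

```csharp
using System;
using System.Globalization;

public static class CountDemo
{
    // Hypothetical wrapper: number of text elements, not UTF-16 code units.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9-"; // 🎶🔥é-

        Console.WriteLine(s.Length);             // 6
        Console.WriteLine(CountTextElements(s)); // 4

        // Code-unit offset at which each text element starts.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        Console.WriteLine(string.Join(",", starts)); // 0,2,4,5
    }
}
```

The offsets array is the useful part for a ReplaceAt: it maps a character position to the code-unit offset where that character actually begins.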

Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.

I have optimized that StringInfo-based method into UnicodeReplaceAtFastest, the fastest implementation so far; it beats the old UnicodeReplaceAt by 2 ms (on average).
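For readers who want the general shape of a StringInfo-based replacement, here is a minimal sketch. It is not the UnicodeReplaceAtFastest or UnicodeReplaceAt code, and makes no claim about their speed; ReplaceTextElementAt is a hypothetical name, and the trailing space on the test string is an assumption.

```csharp
using System;
using System.Globalization;

public static class ReplaceAtSketch
{
    // Hypothetical helper: replaces the text element at 'textIndex'
    // (a character position, not a UTF-16 code-unit offset).
    public static string ReplaceTextElementAt(string input, int textIndex, string replacement)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (replacement == null) throw new ArgumentNullException(nameof(replacement));

        // Code-unit offset where each text element begins.
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        if (textIndex < 0 || textIndex >= starts.Length)
            throw new ArgumentOutOfRangeException(nameof(textIndex));

        int start = starts[textIndex];
        int end = textIndex + 1 < starts.Length ? starts[textIndex + 1] : input.Length;

        return string.Concat(input.Substring(0, start), replacement, input.Substring(end));
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9- "; // 🎶🔥é- plus a trailing space
        string result = ReplaceTextElementAt(s, 4, "\U0001F60A"); // 😊

        // The space at character position 4 is replaced; 'é' is untouched.
        Console.WriteLine(result == "\U0001F3B6\U0001F525\u00E9-\U0001F60A"); // True
    }
}
```

Because the replacement is spliced in as a string, a 4-byte character such as 😊 can replace a 2-byte one without any special handling.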

