A Unicode "ReplaceAt" string extension method that handles Unicode strings properly

Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.

Unicode String Replace At Issue

Let's examine the Unicode string "🎶🔥é-"

🎶 Multiple Musical Notes (U+1F3B6) - 4 bytes in UTF-16 (a surrogate pair)
🔥 Fire (U+1F525) - 4 bytes in UTF-16 (a surrogate pair)
é Latin Small Letter E with Acute (U+00E9) - 2 bytes in UTF-16
- Hyphen-Minus (U+002D) - 2 bytes in UTF-16

😊 Smiling Face with Smiling Eyes (U+1F60A) - 4 bytes in UTF-16 (used as a replacement)

🎶🔥é- has a Length of 6, but there are only 4 characters! Why is the length not 4?
🎶 and 🔥 are supplementary characters (code points above U+FFFF), so each takes two UTF-16 code units and counts as 2 toward Length.
The code below tries to replace the space after the last character '-' (position 4, zero-based) with a substitute character, using the most common techniques seen online.

This is because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as single 2-byte code units.

Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
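As a minimal sketch of how such a pair is derived (the class and method names here are illustrative and not part of the original post), the code point is reduced by 0x10000 and the remaining 20 bits are split into two 10-bit halves:

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static string ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int v = codePoint - 0x10000;              // 20 significant bits remain
        char high = (char)(0xD800 + (v >> 10));   // top 10 bits -> High Surrogate
        char low = (char)(0xDC00 + (v & 0x3FF));  // low 10 bits -> Low Surrogate
        return new string(new[] { high, low });
    }

    public static void Main()
    {
        string pair = ToSurrogatePair(0x1F3B6); // MULTIPLE MUSICAL NOTES
        Console.WriteLine(((int)pair[0]).ToString("X4") + " " + ((int)pair[1]).ToString("X4")); // D83C DFB6
        // The framework's own conversion should yield the same two code units.
        Console.WriteLine(pair == char.ConvertFromUtf32(0x1F3B6)); // True
    }
}
```

In practice char.ConvertFromUtf32 already does this; the manual version is only meant to show where the two code units come from.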

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).
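A small sketch of that count, assuming the example string from this post (the CountDemo class and CountTextElements helper are illustrative names, not part of the original code):

```csharp
using System;
using System.Globalization;

public static class CountDemo
{
    // Hypothetical helper: counts text elements rather than UTF-16 code units.
    // A null input is treated as empty, because the StringInfo constructor rejects null.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s ?? string.Empty).LengthInTextElements;
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9-"; // 🎶🔥é-
        Console.WriteLine(s.Length);             // 6 (UTF-16 code units)
        Console.WriteLine(CountTextElements(s)); // 4 (text elements)
    }
}
```

The gap between the two numbers is exactly what trips up index-based ReplaceAt implementations.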

Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.

The UnicodeReplaceAt method replaces a character in a string at a specific zero-based index and handles the null char '\0' properly, by removing it from the resulting string.

using System.Linq;
using System.Diagnostics;
using System;
using System.Text;
using System.Globalization;

//Created for 
//http://metadataconsulting.blogspot.com/2019/03/A-Unicode-ReplaceAt-string-extension-method-handles-Unicode-string-properly.html

public static class Program
{

 
    const char cEMPTY = '\0';
    static readonly string EMPTY = cEMPTY.ToString();

    // Replaces the text element at 1-based position idx with the given string.
    public static string UnicodeReplaceAtFastest(this string s, int idx, string replace)
    {
        // Check the input before parsing: ParseCombiningCharacters throws on null.
        if (string.IsNullOrEmpty(s)) return s;
        if (idx < 1)
            return s;

        // Use the ParseCombiningCharacters method to
        // get the index of each real character in the string.
        int[] textElemIndex = StringInfo.ParseCombiningCharacters(s);

        if (textElemIndex.Length == 1)
        {
            // Single text element: the replacement is returned for any idx >= 1.
            return replace;
        }
        else if (idx < textElemIndex.Length)
        {
            idx = idx - 1;
            return s.Remove(textElemIndex[idx], textElemIndex[idx + 1] - textElemIndex[idx]).Insert(textElemIndex[idx], replace);
        }
        else if (idx == textElemIndex.Length)
        {
            // Last text element: remove through the end of the string.
            idx = idx - 1;
            return s.Remove(textElemIndex[idx], s.Length - textElemIndex[idx]).Insert(textElemIndex[idx], replace);
        }
        else
            return s;
    }

 
    public static string UnicodeReplaceAtFast(this string s, int idx, string replace)
    {
        // Check the input first: GetTextElementEnumerator throws on null.
        if (string.IsNullOrEmpty(s)) return s;
        if (idx < 1)
            return s;

        // This StringBuilder holds the output results.
        StringBuilder sb = new StringBuilder();

        // Use the enumerator returned from GetTextElementEnumerator
        // method to examine each real character.
        TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);

        idx = idx - 1;

        while (charEnum.MoveNext())
        {
            // Caution: ElementIndex is the UTF-16 index at which the current
            // text element starts, not its ordinal position, so idx - 1 is
            // matched against a code-unit offset here. Once a surrogate pair
            // or combining sequence precedes the target, a different element
            // is replaced (or none at all).
            if (charEnum.ElementIndex != idx)
                sb.Append(charEnum.GetTextElement());
            else
                sb.Append(replace);
        }

        return sb.ToString();
    }
 
    public static string UnicodeReplaceAt(this string str, int offset, char replaceChar)
    {
        int count = 1; //number of characters to remove at location offset
        string replaceBy = replaceChar.ToString();
        return new StringInfo(str).ReplaceByPosition(replaceBy, offset, count).String;
    }

    public static StringInfo ReplaceByPosition(this StringInfo str, string replaceBy, int offset, int count)
    {
        // A null, empty, or "\0" replacement means "remove only".
        // The original chain of conditions never reached the remove-only
        // branch for "\0", so the null char was inserted instead of dropped.
        if (string.IsNullOrEmpty(replaceBy) || replaceBy == EMPTY)
            return str.RemoveByTextElements(offset, count);

        return str.RemoveByTextElements(offset, count).InsertByTextElements(offset, replaceBy);
    }

    public static StringInfo RemoveByTextElements(this StringInfo str, int offset, int count)
    {
        // Tue 20-Aug-19 11:32am metadataconsulting.ca - if the ReplaceAt index is beyond the string length, return the original string
        if (offset > str.LengthInTextElements)
            return str;

        return new StringInfo(string.Concat(
            str.SubstringByTextElements(0, offset),
            offset + count < str.LengthInTextElements
                ? str.SubstringByTextElements(offset + count, str.LengthInTextElements - count - offset)
                : string.Empty
            ));
    }

    public static StringInfo InsertByTextElements(this StringInfo str, int offset, string insertStr)
    {
        // Tue 20-Aug-19 11:32am metadataconsulting.ca - if the ReplaceAt index is beyond the string length, return the original string
        if (offset > str.LengthInTextElements)
            return str;

        if (string.IsNullOrEmpty(str.String))
            return new StringInfo(insertStr);

        return new StringInfo(string.Concat(
            str.SubstringByTextElements(0, offset),
            insertStr,
            str.LengthInTextElements - offset > 0 ? str.SubstringByTextElements(offset, str.LengthInTextElements - offset) : ""
        ));
    }

    public static string SubsituteStringStringBuilder(this string s, int idx, char replaceChar)
    {
        if (string.IsNullOrEmpty(s) || idx >= s.Length || idx < 0)
            return s;

        return new StringBuilder(s).Remove(idx, 1).Insert(idx, replaceChar.ToString()).ToString();
    }

    public static string ReplaceAtSubstring(this string s, int idx, char replaceChar)
    {
        if (string.IsNullOrEmpty(s) || idx >= s.Length || idx < 0)
            return s;

        return s.Substring(0, idx) + replaceChar.ToString() + s.Substring(idx + replaceChar.ToString().Length, s.Length - (idx + replaceChar.ToString().Length));

    }

    public static string ReplaceAtStringManipulation(this string s, int idx, char replaceChar)
    {
        if (string.IsNullOrEmpty(s) || idx >= s.Length || idx < 0)
            return s;

        return s.Remove(idx, 1).Insert(idx, replaceChar.ToString());
    }

    public static string ReplaceAtLinq(this string value, int index, char newchar)
    {
        if (value.Length <= index)
            return value;
        else
            return string.Concat(value.Select((c, i) => i == index ? newchar : c));
    }

    public static string ReplaceAtCharArray(this string input, uint index, char newChar)
    {
        if (string.IsNullOrEmpty(input) || index >= input.Length)
            return input;

        char[] chars = input.ToCharArray();
        chars[index] = newChar;
        return new string(chars);
    }

    public static void Main()
    {
        //In .NET 4.5 and later also UTF-16 is supported
        //Console.OutputEncoding = System.Text.Encoding.Unicode;  
        
        //é  Latin Small Letter e with Acute U+00E9 - one UTF-16 code unit (2 bytes)
        //😊 Smiling Face with Smiling Eyes Emoji U+1F60A - two UTF-16 code units (4 bytes, surrogate pair)
        //🎶 Multiple Musical Notes Emoji U+1F3B6 - two UTF-16 code units (4 bytes, surrogate pair)
        //🔥 Fire Emoji U+1F525 - two UTF-16 code units (4 bytes, surrogate pair)
  
  Console.WriteLine("Unicode String Replace At Issue");
  Console.WriteLine("Lets examine string \"🎶🔥é-\"");  
        Console.WriteLine("🎶🔥é- is length of " + "🎶🔥é-".Length + ", but there are ONLY 4 characters! Why not len=4?"); 
  Console.WriteLine("🎶🔥 are double byte UNICODE characters (> \\u10000) of width or len 2 each ");
  Console.WriteLine("🎶🔥é- below will replace space after lasting character '-' (position 4) with a sub using most common techniques seen online"); 
  
  Console.WriteLine(); 
  
  Stopwatch sw = new Stopwatch();
        sw.Start();
        Console.WriteLine("🎶🔥é- using ReplaceAtCharArray".ReplaceAtCharArray(4, 'X'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Restart();
        Console.WriteLine("🎶🔥é- using ReplaceAtLinq".ReplaceAtLinq(4, 'Y'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Restart();
        Console.WriteLine("🎶🔥é- using ReplaceAtStringManipulation".ReplaceAtStringManipulation(4, 'Z'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Restart();
        Console.WriteLine("🎶🔥é- using ReplaceAtSubstring".ReplaceAtSubstring(4, 'A'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Restart();
        Console.WriteLine("🎶🔥é- using SubsituteStringStringBuilder".SubsituteStringStringBuilder(4, 'W'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

  sw.Restart();
        Console.WriteLine("🎶🔥é- using UnicodeReplaceAt".UnicodeReplaceAt(4, '4'));
        sw.Stop();
        Console.WriteLine("in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  
  Console.WriteLine(); 
  Console.WriteLine("UnicodeReplaceAt replaces properly at position 4 in zero based index string");
        Console.WriteLine(); 
  sw.Restart();
        Console.Write("🎶🔥é- using UnicodeReplaceAt(0, '0')".UnicodeReplaceAt(0, '0'));
        sw.Stop();
  Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  sw.Restart();
        Console.Write("🎶🔥é- using UnicodeReplaceAt(1, '1')".UnicodeReplaceAt(1, '1'));
        sw.Stop();
  Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  sw.Restart();
        Console.Write("🎶🔥é- using UnicodeReplaceAt(2, '2')".UnicodeReplaceAt(2, '2'));
        sw.Stop();
  Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  sw.Restart();
        Console.Write("🎶🔥é- using UnicodeReplaceAt(3, '3')".UnicodeReplaceAt(3, '3'));
        sw.Stop();
  Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  sw.Restart();
        Console.Write("🎶🔥é- using UnicodeReplaceAt(4, '4')".UnicodeReplaceAt(4, '4'));
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        // Restart the stopwatch; without this, the previous measurement is reported again.
        sw.Restart();
        Console.Write("🎶🔥é-".UnicodeReplaceAt(5, '5')+" using UnicodeReplaceAt(5, '5') - this is beyond end of string, so return orginal string");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        
  Console.WriteLine(); 
  Console.WriteLine(); 
  Console.WriteLine(" FAST testing.");
  Console.WriteLine(); 
  Console.WriteLine(); 
  
  sw.Reset();
        sw.Start();
        Console.Write("🎶🔥é-a\u0304\u0308bc\u0327".UnicodeReplaceAtFast(5, "😊") + " using UnicodeReplaceAtFast(5, '😊') - cool but still O(100) more :(");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));


        sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("a\u0304".UnicodeReplaceAtFast(1, "😊") + " using UnicodeReplaceAtFast(1, '😊') - bounds check - 1 char");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("a\u0304".UnicodeReplaceAtFast(0, "😊") + " using UnicodeReplaceAtFast(0, '😊') - bounds check - wrong index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    
  Console.WriteLine(); 
  Console.WriteLine(); 
  Console.WriteLine(" FASTEST testing.");
  Console.WriteLine(); 
  Console.WriteLine(); 
  
  sw.Reset();
        sw.Start();
        Console.Write("🎶🔥é-a\u0304\u0308bc\u0327".UnicodeReplaceAtFastest(5, "😊") + " using UnicodeReplaceAtFastest(5, '😊') - cool but still O(100) more than string.Replace :(");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));


        sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("a\u0304".UnicodeReplaceAtFastest(1, "😊") + " using UnicodeReplaceAtFastest(1, '😊') - bounds check - 1 char");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));

        sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("a\u0304".UnicodeReplaceAtFastest(0, "😊") + " using UnicodeReplaceAtFastest(0, '😊') - bounds check - 0 index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    
  sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("a\u0304".UnicodeReplaceAtFastest(5, "😊") + " using UnicodeReplaceAtFastest(5, '😊') - bounds check - after end index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    
  sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("🎶🔥é-a\u0304\u0308bc\u0327".UnicodeReplaceAtFastest(6, "😊") + " using UnicodeReplaceAtFastest(6, '😊') - bounds check - after end index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    
  sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("🎶🔥é-a\u0304\u0308bc\u0327".UnicodeReplaceAtFastest(7, "😊") + " using UnicodeReplaceAtFastest(7, '😊') - bounds check - after end index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    
  sw.Reset();
        sw.Start();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
        Console.Write("🎶🔥é-a\u0304\u0308bc\u0327".UnicodeReplaceAtFastest(8, "😊") + " using UnicodeReplaceAtFastest(8, '😊') - bounds check - after end index");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
    

  Console.WriteLine(); 
  Console.WriteLine("String.Replace works, but replaces all characters, not at specific location as above functions");
        Console.WriteLine(); 
  sw.Reset();
        sw.Start();
  Console.Write("🎶🔥é- using String.Replace".Replace("🔥", "+") + "('🔥', '+')");
        sw.Stop();
        Console.WriteLine(" in {0} ticks.", sw.ElapsedTicks.ToString("N0"));
  
    }
}

Output

Unicode String Replace At Issue
Lets examine string "🎶🔥é-"
🎶🔥é- is length of 6, but there are ONLY 4 characters! Why not len=4?
🎶🔥 are double byte UNICODE characters (> \u10000) of width or len 2 each 
🎶🔥é- below will replace space after lasting character '-' (position 4) with a sub using most common techniques seen online

🎶🔥X- using ReplaceAtCharArray
in 2,089 ticks.
🎶🔥Y- using ReplaceAtLinq
in 3,724 ticks.
🎶🔥Z- using ReplaceAtStringManipulation
in 1,411 ticks.
🎶🔥A- using ReplaceAtSubstring
in 1,768 ticks.
🎶🔥W- using SubsituteStringStringBuilder
in 1,701 ticks.
🎶🔥é-4using UnicodeReplaceAt
in 5,811 ticks.

UnicodeReplaceAt replaces properly at position 4 in zero based index string

0🔥é- using UnicodeReplaceAt(0, '0') in 31 ticks.
🎶1é- using UnicodeReplaceAt(1, '1') in 24 ticks.
🎶🔥2- using UnicodeReplaceAt(2, '2') in 23 ticks.
🎶🔥é3 using UnicodeReplaceAt(3, '3') in 76 ticks.
🎶🔥é-4using UnicodeReplaceAt(4, '4') in 22 ticks.
🎶🔥é- using UnicodeReplaceAt(5, '5') - this is beyond end of string, so return orginal string in 22 ticks.


 FAST testing.


🎶🔥😊-ā̈bç using UnicodeReplaceAtFast(5, '😊') - cool but still O(100) more :( in 2,014 ticks.
 in 3 ticks.
😊 using UnicodeReplaceAtFast(1, '😊') - bounds check - 1 char in 23 ticks.
 in 3 ticks.
ā using UnicodeReplaceAtFast(0, '😊') - bounds check - wrong index in 19 ticks.


 FASTEST testing.


🎶🔥é-😊bç using UnicodeReplaceAtFastest(5, '😊') - cool but still O(100) more than string.Replace :( in 2,155 ticks.
 in 2 ticks.
😊 using UnicodeReplaceAtFastest(1, '😊') - bounds check - 1 char in 23 ticks.
 in 3 ticks.
ā using UnicodeReplaceAtFastest(0, '😊') - bounds check - 0 index in 21 ticks.
 in 3 ticks.
😊 using UnicodeReplaceAtFastest(5, '😊') - bounds check - after end index in 63 ticks.

String.Replace works, but replaces all characters, not at specific location as above functions

🎶+é- using String.Replace('🔥', '+') in 15 ticks.

Metadata Consulting [dot] ca - Blog

Tuesday, March 19, 2019
