Wednesday, August 28, 2019

C-Sharp A Faster Unicode ReplaceAt method that works with surrogate pairs and 4-byte Unicode characters

Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.

Unicode String Replace At Issue

Let's examine the Unicode string "🎶🔥é-"

🎶 Unicode Character 'MULTIPLE MUSICAL NOTES' (U+1F3B6) - 4-byte Unicode character
🔥 Fire Emoji U+1F525 - 4-byte Unicode character
é Latin Small Letter e with Acute U+00E9 - 2-byte Unicode character
- Unicode Character 'HYPHEN-MINUS' (U+002D) - 2-byte Unicode character

😊 Smiling Face with Smiling Eyes Emoji U+1F60A - 4-byte Unicode character (replacement)

🎶🔥é- has a Length of 6, but there are ONLY 4 characters! Why not len=4?
🎶 and 🔥 are Unicode characters above U+FFFF (U+10000 and higher), so each takes two UTF-16 code units and has a width or len of 2.
The examples below try to replace the position after the last character '-' of 🎶🔥é- (position 4) with a substitute character, using the most common techniques seen online.

This is because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as a single 2-byte code unit.


Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
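The surrogate ranges above map to simple arithmetic. The following is a minimal sketch, not code from the original post; the class and method names (SurrogateDemo, ToSurrogates) are made up for illustration. It splits a code point above U+FFFF into its high and low surrogate and cross-checks the result against the framework's char.ConvertFromUtf32.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a code point above U+FFFF into its
    // UTF-16 high surrogate and low surrogate.
    public static char[] ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new[] { high, low };
    }

    public static void Main()
    {
        char[] pair = ToSurrogates(0x1F525); // Fire emoji
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D83D DD25

        // Cross-check against the framework's own conversion.
        Console.WriteLine(new string(pair) == char.ConvertFromUtf32(0x1F525)); // True
    }
}
```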

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).
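As a small illustration of the difference, the sketch below compares string.Length with StringInfo.LengthInTextElements for the sample string. The wrapper names (LengthDemo, TextElementCount) are assumptions for this example only.

```csharp
using System;
using System.Globalization;

public static class LengthDemo
{
    // Counts user-visible text elements instead of UTF-16 code units.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        // U+1F3B6, U+1F525, U+00E9 and '-' written as escapes so the
        // source file encoding does not matter.
        string s = "\U0001F3B6\U0001F525\u00E9-";

        Console.WriteLine(s.Length);             // 6 (UTF-16 code units)
        Console.WriteLine(TextElementCount(s));  // 4 (text elements)
    }
}
```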

Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.

I have optimized this method into UnicodeReplaceAtFastest, the fastest implementation so far; it beats the old UnicodeReplaceAt by 2 ms on average.
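For readers who want to see the general StringInfo approach in code, here is a rough sketch. It is not the UnicodeReplaceAtFastest method described above, and no performance claim is made for it; the names TextElementReplace and ReplaceAt are invented for this example. It uses StringInfo.ParseCombiningCharacters to find where each text element starts, then splices the replacement in at that boundary.

```csharp
using System;
using System.Globalization;

public static class TextElementReplace
{
    // Hypothetical sketch: replaces the text element at the given
    // zero-based text-element index with the replacement string.
    public static string ReplaceAt(string s, int index, string replacement)
    {
        if (s == null) throw new ArgumentNullException("s");

        // Start index (in UTF-16 code units) of every text element.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        if (index < 0 || index >= starts.Length)
            throw new ArgumentOutOfRangeException("index");

        int begin = starts[index];
        int end = (index + 1 < starts.Length) ? starts[index + 1] : s.Length;
        return s.Substring(0, begin) + replacement + s.Substring(end);
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9-";
        string smile = "\U0001F60A";

        // Replace the second character (the fire emoji) with the smiling face.
        string result = ReplaceAt(s, 1, smile);
        Console.WriteLine(result == "\U0001F3B6\U0001F60A\u00E9-"); // True
    }
}
```

The index here counts text elements, so index 3 addresses the '-' even though it sits at char index 5.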


Tuesday, August 27, 2019

C-Sharp How can I read raw (CF_HTML) clipboard data


Here's how to read raw HTML from the clipboard using P/Invoke Win32 native methods, specifying the clipboard data type CF_HTML.


Avoid Clipboard.GetText(TextDataFormat.Html) and use the P/Invoke code below instead, especially with .NET Framework 4.0 or earlier, because Clipboard.GetText introduces garbled characters.

See my blog post "How to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters".
using System;
using System.Runtime.InteropServices;
using System.Text;

//--------------------------------------------------------------------------------
// http://metadataconsulting.blogspot.com/2019/06/How-to-get-HTML-from-the-Windows-system-clipboard-directly-using-PInvoke-Win32-Native-methods-avoiding-bad-funny-characters.html
//--------------------------------------------------------------------------------

public class ClipboardHelper
{
 #region Win32 Native PInvoke
 
 [DllImport("User32.dll", SetLastError = true)]
 private static extern uint RegisterClipboardFormat(string lpszFormat);
 //or specifically - private static extern uint RegisterClipboardFormatA(string lpszFormat);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool IsClipboardFormatAvailable(uint format);

 [DllImport("User32.dll", SetLastError = true)]
 private static extern IntPtr GetClipboardData(uint uFormat);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool OpenClipboard(IntPtr hWndNewOwner);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool CloseClipboard();

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern IntPtr GlobalLock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool GlobalUnlock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern int GlobalSize(IntPtr hMem);
 
 #endregion

 public static string GetHTMLWin32Native()
 {

  string strHTMLUTF8 = string.Empty; 
  // RegisterClipboardFormat returns 0 on failure.
  uint CF_HTML = RegisterClipboardFormat("HTML Format");
  if (CF_HTML == 0)
   return null;

  if (!IsClipboardFormatAvailable(CF_HTML))
   return null;

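  // Open the clipboard before entering the try block, so that
  // CloseClipboard only runs when the clipboard was actually opened.
  if (!OpenClipboard(IntPtr.Zero))
   return null;

  try
  {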

   IntPtr handle = GetClipboardData(CF_HTML);
   if (handle == IntPtr.Zero)
    return null;

   IntPtr pointer = IntPtr.Zero;

   try
   {
    pointer = GlobalLock(handle);
    if (pointer == IntPtr.Zero)
     return null;

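    // GlobalSize is declared above as returning int, so keep the size as int.
    int size = GlobalSize(handle);
    byte[] buff = new byte[size];

    Marshal.Copy(pointer, buff, 0, size);

    // CF_HTML data is UTF-8 encoded.
    strHTMLUTF8 = System.Text.Encoding.UTF8.GetString(buff);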
   }
   finally
   {
    if (pointer != IntPtr.Zero)
     GlobalUnlock(handle);
   }
  }
  finally
  {
   CloseClipboard();
  }

  return strHTMLUTF8; 
 }
}
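
The raw CF_HTML string returned above starts with a plain-text header (Version, StartHTML, EndHTML, StartFragment, EndFragment) whose offsets are byte positions in the UTF-8 data. As an optional follow-up, the sketch below pulls out just the copied fragment. It is an illustration rather than part of the original helper: the class and method names (CfHtmlFragment, Extract) are invented, and it assumes the string round-trips to the same UTF-8 bytes that were on the clipboard, which may not hold if the data contained invalid byte sequences.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CfHtmlFragment
{
    // Hypothetical helper: returns the HTML between the StartFragment and
    // EndFragment byte offsets, or null when the header is missing or invalid.
    public static string Extract(string cfHtml)
    {
        if (string.IsNullOrEmpty(cfHtml))
            return null;

        Match start = Regex.Match(cfHtml, @"StartFragment:([0-9]+)");
        Match end = Regex.Match(cfHtml, @"EndFragment:([0-9]+)");

        int startOffset, endOffset;
        if (!start.Success || !end.Success
            || !int.TryParse(start.Groups[1].Value, out startOffset)
            || !int.TryParse(end.Groups[1].Value, out endOffset))
            return null;

        // The offsets count UTF-8 bytes, not UTF-16 chars, so slice the bytes.
        byte[] bytes = Encoding.UTF8.GetBytes(cfHtml);
        if (startOffset > endOffset || endOffset > bytes.Length)
            return null;

        return Encoding.UTF8.GetString(bytes, startOffset, endOffset - startOffset);
    }
}
```

A call such as CfHtmlFragment.Extract(ClipboardHelper.GetHTMLWin32Native()) would then yield only the copied markup, or null if nothing usable is on the clipboard.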


Saturday, August 24, 2019

C-Sharp A Faster Unicode ReplaceAt method that works with surrogate pairs and 4-byte Unicode characters

Most "ReplaceAt" methods commonly seen online fail when replacing a character at a specific position in a Unicode string.

Unicode String Replace At Issue

Let's examine the Unicode string "🎶🔥é-"

🎶 Unicode Character 'MULTIPLE MUSICAL NOTES' (U+1F3B6) - 4-byte Unicode character
🔥 Fire Emoji U+1F525 - 4-byte Unicode character
é Latin Small Letter e with Acute U+00E9 - 2-byte Unicode character
- Unicode Character 'HYPHEN-MINUS' (U+002D) - 2-byte Unicode character

😊 Smiling Face with Smiling Eyes Emoji U+1F60A - 4-byte Unicode character (replacement)



🎶🔥é- has a Length of 6, but there are ONLY 4 characters! Why not len=4?
🎶 and 🔥 are Unicode characters above U+FFFF (U+10000 and higher), so each takes two UTF-16 code units and has a width or len of 2.
The examples below try to replace the position after the last character '-' of 🎶🔥é- (position 4) with a substitute character, using the most common techniques seen online.

This is because Unicode code points outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, are represented in UTF-16 as 4-byte surrogate pairs rather than as a single 2-byte code unit.
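
To make the failure concrete, here is a hedged sketch of the kind of char-array ReplaceAt often seen online. The names NaiveReplaceDemo and NaiveReplaceAt are hypothetical stand-ins, not taken from any specific source. Because it indexes UTF-16 code units, asking for the "second character" lands in the middle of the first emoji and leaves a lone high surrogate behind.

```csharp
using System;

public static class NaiveReplaceDemo
{
    // Typical char-index approach: treats every char as one character.
    public static string NaiveReplaceAt(string s, int index, char c)
    {
        char[] chars = s.ToCharArray();
        chars[index] = c;
        return new string(chars);
    }

    public static void Main()
    {
        string s = "\U0001F3B6\U0001F525\u00E9-";

        // Intended: replace the 2nd character (the fire emoji).
        // Actual: overwrites the low surrogate of the musical notes emoji.
        string broken = NaiveReplaceAt(s, 1, 'X');

        Console.WriteLine(char.IsHighSurrogate(broken[0])); // True
        Console.WriteLine(char.IsLowSurrogate(broken[1]));  // False: the pair is now broken
    }
}
```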


Specifically, the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).

Below is a large collection of common ReplaceAt implementations available on the internet. They all fail, except for the one that uses StringInfo.

I have optimized this method into UnicodeReplaceAtFastest, the fastest implementation so far; it beats the old UnicodeReplaceAt by 2 ms on average.