Tuesday, June 11, 2019

How to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters


Grabbing HTML data from the clipboard using Clipboard.GetText in C# .NET is straightforward:

string htmlText = Clipboard.GetText(TextDataFormat.Html);

but there's an issue with the .NET 4 Framework (used for backward compatibility with WinXP) when using WinForms.


The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in the .NET 4 Framework, and I suspected that it was reading the data as Windows-1252 instead. Some of these characters are wide in UTF-8: a single character takes 2 or more bytes to represent.
You get a lot of wrong decodings, which lead to funny/bad characters such as 'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
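To make the "2 or more bytes" point concrete, here is a small sketch that counts how many bytes a few characters take in UTF-8. The class name Utf8Widths and the helper Utf8ByteCount are made up for illustration; only Encoding.UTF8.GetByteCount is a real framework call.

```csharp
using System;
using System.Text;

public static class Utf8Widths
{
    // Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));      // 1 (plain ASCII)
        Console.WriteLine(Utf8ByteCount("\u00A9")); // 2 (copyright sign)
        Console.WriteLine(Utf8ByteCount("\u20AC")); // 3 (euro sign)
    }
}
```

Any reader that treats each of those bytes as its own character will turn one wide character into two or three wrong ones.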
For example, '€' shows up as 'â‚¬' when its UTF-8 bytes (E2 82 AC) are read as Windows-1252. But trying to convert from Windows-1252 encoding does not work, as in this code snippet.

// Attempted fix: treat the clipboard string as Windows-1252 and convert it to UTF-8.
Encoding win1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;

byte[] win1252Bytes = win1252.GetBytes(htmlText);
byte[] utf8Bytes = Encoding.Convert(win1252, utf8, win1252Bytes);

htmlText = utf8.GetString(utf8Bytes);

Test string for Win-1252 conversion.
in:Unicode Character “” (U+2014)
out: Unicode Character “—” (U+2014)

But clearly the above assumption was wrong; I could not figure out the encoding.
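For readers who want to experiment, here is a hedged sketch of the general mechanism. Garble reproduces the garbling by decoding UTF-8 bytes as Windows-1252. RoundTrip repeats the encode/convert/decode sequence from the snippet above and shows that it hands back the same string, which may be one reason that attempt changed nothing. Repair reinterprets the Windows-1252 bytes as UTF-8. The class and method names are invented for this example. It assumes the .NET Framework, where code page 1252 is available out of the box; on .NET Core or .NET 5+ the code page provider would have to be registered first. It does not establish which encoding the clipboard bug actually uses.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decode UTF-8 bytes with the wrong encoding (Windows-1252) to reproduce the garbling.
    public static string Garble(string text)
    {
        Encoding win1252 = Encoding.GetEncoding(1252);
        return win1252.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Same steps as the attempted fix: encode as 1252, convert to UTF-8, decode as UTF-8.
    public static string RoundTrip(string text)
    {
        Encoding win1252 = Encoding.GetEncoding(1252);
        byte[] win1252Bytes = win1252.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(win1252, Encoding.UTF8, win1252Bytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // Reinterpret instead of convert: take the 1252 bytes and decode them as UTF-8.
    public static string Repair(string garbled)
    {
        Encoding win1252 = Encoding.GetEncoding(1252);
        return Encoding.UTF8.GetString(win1252.GetBytes(garbled));
    }

    public static void Main()
    {
        string garbled = Garble("\u20AC");                   // euro sign
        Console.WriteLine(garbled == "\u00E2\u201A\u00AC");  // True: the three characters of the garbled euro sign
        Console.WriteLine(RoundTrip(garbled) == garbled);    // True: the conversion changed nothing
        Console.WriteLine(Repair(garbled) == "\u20AC");      // True: back to the euro sign
    }
}
```

Even if reinterpretation works for a given string, it is a guess about what the framework did, which is why reading the raw bytes from the clipboard is the more dependable route.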


It turns out to be a bug in the .NET 4 Framework, as discussed in an old Google Groups thread (microsoft.public.dotnet.framework.windowsforms).



Here's how to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters.


Avoid Clipboard.GetText(TextDataFormat.Html) and call the native Win32 API instead.


using System;
using System.Runtime.InteropServices;
using System.Text;

//--------------------------------------------------------------------------------
// http://metadataconsulting.blogspot.com/2019/06/How-to-get-HTML-from-the-Windows-system-clipboard-directly-using-PInvoke-Win32-Native-methods-avoiding-bad-funny-characters.html
//--------------------------------------------------------------------------------

public class ClipboardHelper
{
 #region Win32 Native PInvoke
 
 [DllImport("User32.dll", SetLastError = true)]
 private static extern uint RegisterClipboardFormat(string lpszFormat);
 //or specifically - private static extern uint RegisterClipboardFormatA(string lpszFormat);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool IsClipboardFormatAvailable(uint format);

 [DllImport("User32.dll", SetLastError = true)]
 private static extern IntPtr GetClipboardData(uint uFormat);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool OpenClipboard(IntPtr hWndNewOwner);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool CloseClipboard();

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern IntPtr GlobalLock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool GlobalUnlock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern int GlobalSize(IntPtr hMem);
 
 #endregion

 public static string GetHTMLWin32Native()
 {

  string strHTMLUTF8 = string.Empty; 
  // Look up the id of the registered "HTML Format" clipboard format; 0 means failure.
  uint CF_HTML = RegisterClipboardFormat("HTML Format");
  if (CF_HTML == 0)
    return null;

  if (!IsClipboardFormatAvailable(CF_HTML))
   return null;

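  // Open the clipboard before entering the try block, so CloseClipboard only runs after a successful open.
  if (!OpenClipboard(IntPtr.Zero))
   return null;

  try
  {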

   IntPtr handle = GetClipboardData(CF_HTML);
   if (handle == IntPtr.Zero)
    return null;

   IntPtr pointer = IntPtr.Zero;

   try
   {
    pointer = GlobalLock(handle);
    if (pointer == IntPtr.Zero)
     return null;

    int size = GlobalSize(handle);
    byte[] buff = new byte[size];

    Marshal.Copy(pointer, buff, 0, size);

    // The data is UTF-8; the global memory block can be larger than the data, so drop trailing NULs.
    strHTMLUTF8 = Encoding.UTF8.GetString(buff).TrimEnd('\0');
   }
   finally
   {
    if (pointer != IntPtr.Zero)
     GlobalUnlock(handle);
   }
  }
  finally
  {
   CloseClipboard();
  }

  return strHTMLUTF8; 
 }
}
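The string returned by GetHTMLWin32Native is the raw "HTML Format" payload. Per Microsoft's description of that clipboard format, it starts with a plain-text header (Version, StartHTML, EndHTML, StartFragment, EndFragment) and wraps the copied markup in <!--StartFragment--> and <!--EndFragment--> comments. Below is a minimal sketch of pulling out just the copied fragment. CfHtmlFragment and Extract are hypothetical names, not part of the code above. The sketch searches for the comment markers instead of using the header offsets, because those offsets count UTF-8 bytes, not characters of the decoded string.

```csharp
using System;

public static class CfHtmlFragment
{
    private const string StartMarker = "<!--StartFragment-->";
    private const string EndMarker = "<!--EndFragment-->";

    // Returns the text between the fragment markers,
    // or the input unchanged if either marker is missing.
    public static string Extract(string cfHtml)
    {
        if (string.IsNullOrEmpty(cfHtml))
            return cfHtml;

        int start = cfHtml.IndexOf(StartMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return cfHtml;
        start += StartMarker.Length;

        int end = cfHtml.IndexOf(EndMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            return cfHtml;

        return cfHtml.Substring(start, end - start);
    }
}
```

Usage would look like: string html = CfHtmlFragment.Extract(ClipboardHelper.GetHTMLWin32Native());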


2 comments:

  1. "RegisterClipboardFormatA" is something you've forgotten in your code to declare.
