Tuesday, June 11, 2019

How to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters


Grabbing HTML data from the clipboard using Clipboard.GetText in C# .NET is straightforward:

string htmlText = Clipboard.GetText(TextDataFormat.Html);

but there's an issue when targeting .NET Framework 4 (for backward compatibility with WinXP) in WinForms.


The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4, and I had a suspicion that it was reading the data as Windows-1252. In UTF-8, every non-ASCII character takes 2 or more bytes, so when those bytes are decoded with a single-byte code page, each byte turns into a separate character.
You get a lot of wrong decodings, which lead to funny/bad characters such as 'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
For example, '€' (UTF-8 bytes E2 82 AC) shows up as 'â‚¬' when those bytes are read as Windows-1252. But trying a Windows-1252 encoding conversion does not work, as in this code snippet.

            Encoding win1252 = Encoding.GetEncoding(1252);
            Encoding utf8 = Encoding.UTF8;  

            byte[] win1252Bytes = win1252.GetBytes(htmlText);
            byte[] utf8Bytes = Encoding.Convert(win1252, utf8, win1252Bytes);

            htmlText = utf8.GetString(utf8Bytes);

Test string for the Windows-1252 conversion:
in:  Unicode Character “â€”” (U+2014)
out: Unicode Character “—” (U+2014)

But clearly the above assumption was wrong; I could not figure out the encoding.
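One likely reason the snippet above has no effect: it encodes the string as Windows-1252 and then converts those bytes from Windows-1252 back to UTF-8, which is a round trip that returns the same text. The sketch below simulates the suspected mix-up and shows the reverse direction instead. The class `MojibakeDemo` and the helper `Repair` are hypothetical names, and the sketch assumes .NET Framework, where code page 1252 is available without extra registration. It does not show that the real clipboard bug is reversible this way, only what the conversion would look like if the data really were UTF-8 bytes decoded as Windows-1252.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: take text that was decoded with the wrong
    // single-byte code page and reinterpret its bytes as UTF-8.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        byte[] originalBytes = wrongEncoding.GetBytes(garbled); // recover the raw bytes
        return Encoding.UTF8.GetString(originalBytes);          // decode them as UTF-8
    }

    public static void Main()
    {
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Simulate the suspected bug: UTF-8 bytes decoded as Windows-1252.
        string original = "\u20AC 5 \u2014 caf\u00E9";
        string garbled = win1252.GetString(Encoding.UTF8.GetBytes(original));

        // Same steps as the snippet above: 1252 bytes -> UTF-8 bytes -> string.
        byte[] converted = Encoding.Convert(win1252, Encoding.UTF8, win1252.GetBytes(garbled));
        string roundTrip = Encoding.UTF8.GetString(converted);

        Console.WriteLine(roundTrip == garbled);                 // True: nothing changed
        Console.WriteLine(Repair(garbled, win1252) == original); // True: text recovered
    }
}
```

Reading the clipboard bytes directly, as done further down, avoids the problem rather than repairing it after the fact.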


It turns out to be a bug in .NET Framework 4, as discussed in an old Google Groups thread (microsoft.public.dotnet.framework.windowsforms).



Here's how to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters.


Avoid Clipboard.GetText(TextDataFormat.Html) and call the native Win32 API instead.


using System;
using System.Runtime.InteropServices;
using System.Text;

//--------------------------------------------------------------------------------
// http://metadataconsulting.blogspot.com/2019/06/How-to-get-HTML-from-the-Windows-system-clipboard-directly-using-PInvoke-Win32-Native-methods-avoiding-bad-funny-characters.html
//--------------------------------------------------------------------------------

public class ClipboardHelper
{
 #region Win32 Native PInvoke
 
 [DllImport("User32.dll", SetLastError = true)]
 private static extern uint RegisterClipboardFormat(string lpszFormat);
 // With the default CharSet (Ansi), this declaration binds to the RegisterClipboardFormatA export.

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool IsClipboardFormatAvailable(uint format);

 [DllImport("User32.dll", SetLastError = true)]
 private static extern IntPtr GetClipboardData(uint uFormat);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool OpenClipboard(IntPtr hWndNewOwner);

 [DllImport("User32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool CloseClipboard();

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern IntPtr GlobalLock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 [return: MarshalAs(UnmanagedType.Bool)]
 private static extern bool GlobalUnlock(IntPtr hMem);

 [DllImport("Kernel32.dll", SetLastError = true)]
 private static extern int GlobalSize(IntPtr hMem);
 
 #endregion

 public static string GetHTMLWin32Native()
 {

  string strHTMLUTF8 = string.Empty; 
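  // Call the method as declared above; "HTML Format" is the registered name of CF_HTML.
  uint CF_HTML = RegisterClipboardFormat("HTML Format");
  if (CF_HTML == 0) // 0 means the format could not be registered
    return null;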

  if (!IsClipboardFormatAvailable(CF_HTML))
   return null;

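  // Open the clipboard before entering the try block, so that
  // CloseClipboard only runs when the clipboard was actually opened.
  if (!OpenClipboard(IntPtr.Zero))
   return null;

  try
  {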

   IntPtr handle = GetClipboardData(CF_HTML);
   if (handle == IntPtr.Zero)
    return null;

   IntPtr pointer = IntPtr.Zero;

   try
   {
    pointer = GlobalLock(handle);
    if (pointer == IntPtr.Zero)
     return null;

    int size = GlobalSize(handle);
    byte[] buff = new byte[size];

    Marshal.Copy(pointer, buff, 0, size);

    // GlobalSize can be larger than the payload, so drop trailing null padding.
    strHTMLUTF8 = System.Text.Encoding.UTF8.GetString(buff).TrimEnd('\0');
   }
   finally
   {
    if (pointer != IntPtr.Zero)
     GlobalUnlock(handle);
   }
  }
  finally
  {
   CloseClipboard();
  }

  return strHTMLUTF8; 
 }
}
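The string returned by GetHTMLWin32Native is the raw "HTML Format" payload, which starts with a short text header (Version, StartHTML, EndHTML, StartFragment, EndFragment) before the markup itself. As a rough sketch of how the copied fragment could be pulled out of it, the hypothetical helper `CfHtmlFragment.Extract` below reads the StartFragment and EndFragment values and slices the data by them. It assumes the offsets are byte positions in the UTF-8 data, as the HTML Clipboard Format description states, and it is not part of the original class.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CfHtmlFragment
{
    // Hypothetical helper: returns the fragment between the StartFragment
    // and EndFragment offsets, or null if the header is missing or invalid.
    public static string Extract(string cfHtml)
    {
        if (string.IsNullOrEmpty(cfHtml)) return null;

        Match start = Regex.Match(cfHtml, @"StartFragment:([0-9]+)");
        Match end = Regex.Match(cfHtml, @"EndFragment:([0-9]+)");
        if (!start.Success || !end.Success) return null;

        int from, to;
        if (!int.TryParse(start.Groups[1].Value, out from)) return null;
        if (!int.TryParse(end.Groups[1].Value, out to)) return null;

        // The offsets count bytes, not chars, so slice the UTF-8 bytes.
        byte[] bytes = Encoding.UTF8.GetBytes(cfHtml);
        if (from > to || to > bytes.Length) return null;
        return Encoding.UTF8.GetString(bytes, from, to - from);
    }

    public static void Main()
    {
        // Build a sample payload with correct byte offsets.
        string header = "Version:0.9\r\nStartHTML:{0:D10}\r\nEndHTML:{1:D10}\r\n" +
                        "StartFragment:{2:D10}\r\nEndFragment:{3:D10}\r\n";
        string prefix = "<html><body><!--StartFragment-->";
        string fragment = "<b>caf\u00E9 \u20AC5</b>";
        string suffix = "<!--EndFragment--></body></html>";

        int startHtml = string.Format(header, 0, 0, 0, 0).Length; // header is ASCII only
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(suffix);

        string sample = string.Format(header, startHtml, endHtml, startFragment, endFragment)
                        + prefix + fragment + suffix;

        Console.WriteLine(Extract(sample) == fragment); // True
    }
}
```

With the class above in the same project, a call would look like `string fragment = CfHtmlFragment.Extract(ClipboardHelper.GetHTMLWin32Native());`, which yields null when there is no HTML on the clipboard.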






2 comments:

  1. "RegisterClipboardFormatA" is something you've forgotten in your code to declare.
