Grabbing HTML data from the clipboard using Clipboard.GetText in C# .NET is straightforward:
string htmlText = Clipboard.GetText(TextDataFormat.Html);
But there's an issue with .NET Framework 4 (used for backward compatibility on WinXP) in WinForms. The DataFormats.Html specification states that the data is encoded in UTF-8. There's a bug in .NET Framework 4, though, and I had a suspicion that it was reading the data as Windows-1252. Some of these characters are wide: in UTF-8 they take 2 or more bytes to represent 1 character.
You get a lot of wrong encodings, which lead to funny/bad characters such as 'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
For example, the UTF-8 bytes of '€' (E2 82 AC) show up as 'â‚¬' when read as Windows-1252. But trying to get Windows-1252 encoding conversion does not work, as in this code snippet.
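To make the mismatch concrete, here is a minimal sketch of how the funny characters arise. It is an illustration only, not the code path inside Clipboard.GetText: the class name MojibakeDemo and its methods are hypothetical, and it assumes code page 1252 is available, which is the case on .NET Framework (on .NET Core and later, the code pages encoding provider has to be registered first).

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Number of bytes the text occupies once encoded as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Encodes the text as UTF-8, then deliberately decodes those bytes
    // with the wrong encoding (Windows-1252) to reproduce the garbling.
    public static string DecodeUtf8AsWindows1252(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        // Assumption: code page 1252 is available on this runtime.
        return Encoding.GetEncoding(1252).GetString(utf8Bytes);
    }
}
```

With this sketch, MojibakeDemo.Utf8ByteCount("€") is 3, and MojibakeDemo.DecodeUtf8AsWindows1252("€") gives "â‚¬", one wrong character per UTF-8 byte.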
Test string for Win-1252 conversion.
in: Unicode Character “—” (U+2014)
out: Unicode Character “—†(U+2014)
A full explanation is at this dedicated website: Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Even better: https://www.utf8-chartable.de/unicode-utf8-table.pl
It turns out to be a bug in .NET Framework 4, as discussed in an old Google Groups forum (microsoft.public.dotnet.framework.windowsforms).
Here's how to get HTML from the Windows system clipboard directly using PInvoke Win32 Native methods avoiding bad funny characters.
Avoid Clipboard.GetText(TextDataFormat.Html) and use the native Win32 calls instead.
using System;
using System.Runtime.InteropServices;
using System.Text;

//--------------------------------------------------------------------------------
// http://metadataconsulting.blogspot.com/2019/06/How-to-get-HTML-from-the-Windows-system-clipboard-directly-using-PInvoke-Win32-Native-methods-avoiding-bad-funny-characters.html
//--------------------------------------------------------------------------------
public class ClipboardHelper
{
    #region Win32 Native PInvoke

    // ANSI entry point; the format name "HTML Format" is plain ASCII.
    [DllImport("User32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
    private static extern uint RegisterClipboardFormatA(string lpszFormat);

    [DllImport("User32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool IsClipboardFormatAvailable(uint format);

    [DllImport("User32.dll", SetLastError = true)]
    private static extern IntPtr GetClipboardData(uint uFormat);

    [DllImport("User32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool OpenClipboard(IntPtr hWndNewOwner);

    [DllImport("User32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool CloseClipboard();

    [DllImport("Kernel32.dll", SetLastError = true)]
    private static extern IntPtr GlobalLock(IntPtr hMem);

    [DllImport("Kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool GlobalUnlock(IntPtr hMem);

    // GlobalSize returns a SIZE_T, which is pointer-sized.
    [DllImport("Kernel32.dll", SetLastError = true)]
    private static extern UIntPtr GlobalSize(IntPtr hMem);

    #endregion

    // Returns the raw "HTML Format" clipboard data decoded as UTF-8,
    // or null if the format is not available or the clipboard cannot be read.
    public static string GetHTMLWin32Native()
    {
        string strHTMLUTF8 = string.Empty;

        uint CF_HTML = RegisterClipboardFormatA("HTML Format");
        if (CF_HTML == 0) return null; // registration failed
        if (!IsClipboardFormatAvailable(CF_HTML)) return null;

        // Open outside the try block so CloseClipboard only runs after a successful open.
        if (!OpenClipboard(IntPtr.Zero)) return null;
        try
        {
            IntPtr handle = GetClipboardData(CF_HTML);
            if (handle == IntPtr.Zero) return null;

            IntPtr pointer = IntPtr.Zero;
            try
            {
                pointer = GlobalLock(handle);
                if (pointer == IntPtr.Zero) return null;

                int size = (int)GlobalSize(handle).ToUInt32();
                byte[] buff = new byte[size];
                Marshal.Copy(pointer, buff, 0, size);

                // Decode the bytes as UTF-8; the allocation may be padded with trailing NULs.
                strHTMLUTF8 = Encoding.UTF8.GetString(buff).TrimEnd('\0');
            }
            finally
            {
                if (pointer != IntPtr.Zero) GlobalUnlock(handle);
            }
        }
        finally
        {
            CloseClipboard();
        }
        return strHTMLUTF8;
    }
}
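The string returned by the native call is the full "HTML Format" clipboard payload, which begins with a short text header (Version, StartHTML, EndHTML, StartFragment, EndFragment) whose values are byte offsets into the UTF-8 data. As a possible follow-up step, the sketch below pulls out just the copied fragment. The class CfHtmlFragment and its methods are hypothetical names, not part of the code above, and the sketch assumes the payload was valid UTF-8, so that re-encoding the string reproduces the original byte offsets; slicing the original byte buffer directly would be more robust.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CfHtmlFragment
{
    // Reads a numeric header field such as "StartFragment:0000000123".
    // Returns -1 when the field is not present.
    private static int ReadOffset(string cfHtml, string field)
    {
        Match m = Regex.Match(cfHtml, field + @":(\d+)");
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }

    // Returns the text between the StartFragment and EndFragment byte offsets,
    // or null when the header is missing or inconsistent.
    public static string Extract(string cfHtml)
    {
        if (string.IsNullOrEmpty(cfHtml)) return null;

        int start = ReadOffset(cfHtml, "StartFragment");
        int end = ReadOffset(cfHtml, "EndFragment");

        // The offsets count bytes, not chars, so work on the UTF-8 bytes.
        // Assumption: re-encoding yields the same bytes the clipboard held.
        byte[] bytes = Encoding.UTF8.GetBytes(cfHtml);
        if (start < 0 || end < start || end > bytes.Length) return null;

        return Encoding.UTF8.GetString(bytes, start, end - start);
    }
}
```

A call such as CfHtmlFragment.Extract(ClipboardHelper.GetHTMLWin32Native()) would then return only the copied markup, with wide characters intact.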
"RegisterClipboardFormatA" is something you've forgotten in your code to declare.
Thank you, updated now.