Thursday, July 14, 2016

C# Code Share - StringBuilder.ToUTF8String()

Problem: Recently, I was working on a C# solution and hit a fault around StringBuilder.ToString() that raised an EncoderFallbackException:

"Unable to translate Unicode character \uxxxx at index xxx to specified code page." 

Clearly I was attempting to write characters outside the UTF-8 range (that is, characters that cannot be encoded as valid UTF-8), which I did not want. I just wanted a clean UTF-8 string. I could catch the EncoderFallbackException and clean up afterwards, but I wanted to avoid the exception altogether and was not interested in handling it.
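
To make the failure concrete, here is a minimal sketch of how this exception can arise. It assumes the offending character is an unpaired surrogate and that the text is encoded with a strict UTF-8 encoder; the original post does not say which character or which encoder was involved. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncoderFallbackDemo
{
    // Returns true when a strict UTF-8 encoder rejects the input text.
    public static bool ThrowsOnEncode(string text)
    {
        // throwOnInvalidBytes: true makes the encoder throw instead of
        // silently replacing characters it cannot encode.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetBytes(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            // Assumption: this is the same exception type described above.
            return true;
        }
    }
}
```

Calling EncoderFallbackDemo.ThrowsOnEncode("abc\uD800xyz") exercises the failing path, because \uD800 is a high surrogate with no low surrogate after it.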
   
Requirement: So I needed a way to convert the StringBuilder contents to "clean" UTF-8 and then assemble a string, before calling the StringBuilder.ToString() function. "Clean" means stripping all characters outside the UTF-8 range.

Issue: Most solutions on Stack Overflow and MSDN first convert the StringBuilder to a string using the StringBuilder.ToString() function and only then convert to UTF-8. This is very common.


Solution: Ideally I wanted stringBuilder.ToUTF8String(), but StringBuilder is sealed and cannot be inherited from. I did not find a good solution to this elsewhere and thought I would share mine.

So here is my working code for StringBuildertoUTF8String(StringBuilder sb), a method that converts a StringBuilder to a UTF-8 string.

Get the source code hosted on OneDrive Text Editor with syntax highlighting.

HTML generated using hilite.me
 // Separate file StringExtension.cs
 //-----------------------------------------------------------------------------------------------------------------   
   using System.Text; // Encoding, UTF8Encoding and the fallback classes live here
  
   /// <summary>
   /// Converts string to clean UTF-8 encoded string, stripping characters outside the UTF-8 range
   /// </summary>
   //Extension methods must be defined in a static class
   public static class StringExtension
   {
        // The replacement fallback with an empty string drops any character that cannot be encoded
        private static readonly Encoding Utf8Encoder = Encoding.GetEncoding("UTF-8", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

        // This is the extension method.
        // The first parameter takes the "this" modifier
        // and specifies the type for which the method is defined.
        public static string ToUTF8String(this string str)
        {
            return ConvertToUTF8String(str);
        }

        public static string ConvertToUTF8String(this string value)
        {
            return Utf8Encoder.GetString(Utf8Encoder.GetBytes(value));
        }
   }
//---------------------------------------------------------------------------------------------------------------------------


     //UTF-8 encoder that drops unencodable characters instead of throwing
     private static readonly Encoding Utf8Encoder = Encoding.GetEncoding("UTF-8", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
    
    //This does not compile: StringBuilder is sealed (public sealed class StringBuilder : ISerializable),
    //so it cannot be inherited from, and it has no Encoding property to override.
    //X public sealed class UTF8StringBuilder : StringBuilder
    //X {
    //X     private readonly Encoding stringBuilderEncoding;
    //X     public UTF8StringBuilder(StringBuilder builder, Encoding desiredEncoding)
    //X         : base(builder)
    //X     {
    //X         this.stringBuilderEncoding = desiredEncoding;
    //X     }
    //X
    //X     public override Encoding Encoding
    //X     {
    //X         get
    //X         {
    //X             return this.stringBuilderEncoding;
    //X         }
    //X     }
    //X }

   
   /// <summary>
   /// Converts StringBuilder and returns clean UTF-8 string, stripping characters outside the UTF-8 range
   /// </summary>
   public static string StringBuildertoUTF8String(StringBuilder sb)
   {
        //copy the StringBuilder contents to a char[] without calling ToString()
        char[] chars = new char[sb.Length];
        sb.CopyTo(0, chars, 0, sb.Length);

        //encode to UTF-8 bytes (the fallback drops unencodable characters), then decode back to a string
        return Utf8Encoder.GetString(Utf8Encoder.GetBytes(chars));
   }
   
   StringBuilder stringBuilder = new StringBuilder();
      
   //Problem : EncoderFallbackException when using stringBuilder.ToString()
   //          "Unable to translate Unicode character \uxxxx at index xxx to specified code page."
         
   //Issue : Most solutions call stringBuilder.ToString() first and then convert the result to a UTF-8 string, easy enough
   
   //Requirement: Need to convert the StringBuilder to UTF-8 and then assemble a string, ideally stringBuilder.ToUTF8String()
   
   Console.WriteLine(stringBuilder.ToString());
      
   //ideal solution --  
   //X Console.WriteLine(stringBuilder.ToUTF8String());
  
   //1st attempt, using the String extension class in the hope that the StringBuilder class might pick it up - https://msdn.microsoft.com/en-us/library/2839d5h5(v=vs.110).aspx  
   //   It does not: the extension methods are declared on string, not on StringBuilder.
   //X Console.WriteLine(stringBuilder.ToUTF8String());
      
   //2nd attempt, subclass StringBuilder as UTF8StringBuilder (does not compile, StringBuilder is sealed)
   //X UTF8StringBuilder utf8StringBuilder = new UTF8StringBuilder(stringBuilder, Utf8Encoder);  
   //X Console.WriteLine(utf8StringBuilder.ToString());
      
   //3rd attempt, working and simple (once you know how)
   Console.WriteLine(StringBuildertoUTF8String(stringBuilder));
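
The 3rd attempt can also be wrapped as an extension method declared on StringBuilder itself. Extension methods do not require inheritance, so the class being sealed is not an obstacle. Below is a minimal sketch, assuming the same encoder and fallback settings as the code above. The class name StringBuilderExtension and the null check are my own additions for illustration, not part of the original code.

```csharp
using System;
using System.Text;

// Hypothetical class name; extension methods must live in a static class.
public static class StringBuilderExtension
{
    // Same configuration as the original code: characters that cannot be
    // encoded are replaced with an empty string, i.e. dropped.
    private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
        "UTF-8",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // "this StringBuilder" attaches the method to StringBuilder instances.
    public static string ToUTF8String(this StringBuilder sb)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));

        // Copy the contents to a char[] without calling ToString().
        char[] chars = new char[sb.Length];
        sb.CopyTo(0, chars, 0, sb.Length);

        // Round-trip through UTF-8 bytes to strip unencodable characters.
        return Utf8Encoder.GetString(Utf8Encoder.GetBytes(chars));
    }
}
```

With this class in scope, a call such as Console.WriteLine(stringBuilder.ToUTF8String()); reads like the "ideal solution" line from the listing above.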
   
