Tuesday, March 22, 2022

C# Regular Expressions - Getting Substitutions Groupings Index and Length .NET Framework limitation workaround improvement

Reflecting on .NET's 20th anniversary, there are a few shortcomings in the .NET Framework that remain glaring examples of tunnel vision. The first is the lack of growth in the BCL's coverage of Win32 APIs. In my view this is the main reason C++ is still around: not all of the Win32 API is exposed through .NET and C#, and device driver developers are forced to use C++.

The second, and the focus of this post, is the lack of a correct implementation, and the stagnant vision, for dealing with regular expression substitutions programmatically. It has been this way since the inception of the C# language.

For example, we are all familiar with match groups, which give good insight into what each group captured:

match.Groups[1].Value;
match.Groups[1].Index;
match.Groups[1].Length;
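
To make the three properties concrete, here is a minimal sketch. The class name GroupInfoDemo and the method FirstGroup are made up for illustration; only Regex.Match, Match.Groups and the Group properties are real API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupInfoDemo
{
    // Hypothetical helper: returns the value, index and length of
    // capture group 1 for the first match of pattern in input.
    public static (string Value, int Index, int Length) FirstGroup(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        if (!match.Success)
            throw new ArgumentException("pattern did not match");

        Group g = match.Groups[1];
        return (g.Value, g.Index, g.Length);
    }
}
```

For the input "AAA Lorem BBB" and the pattern @"AAA (\w+)", this returns ("Lorem", 4, 5): the captured text, where it starts in the input, and how long it is.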

But wouldn't it be nice to have a similar mechanism for the substitution groups in a replacement string?

match.Replacements[1].Value = "$1";
match.Replacements[1].Index = 4;
match.Replacements[1].Length = 2; 
etc...

That way, when I call Regex.Replace with the substitution string

@"Lorem$1AAA\$1BBB$1Lorem";

there would be a data structure that enumerates every substitution group ('$1') in that string. You can code this yourself, but it takes some tedious IndexOf work.
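
As a rough idea of what that hand-rolled code could look like, here is a sketch that swaps the IndexOf loop for a second regular expression run over the replacement string. The class ReplacementTokens and its Find method are hypothetical names, not part of System.Text.RegularExpressions. The sketch assumes only $n and ${name} references matter, skips the $$ escape, and does not check that the referenced group exists in the search pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReplacementTokens
{
    // "$$" is listed first so an escaped dollar is consumed as a pair
    // and never mistaken for the start of a group reference.
    private static readonly Regex TokenPattern =
        new Regex(@"\$\$|\$\d+|\$\{\w+\}");

    // Hypothetical helper: lists every $n / ${name} token in a
    // replacement string together with its index and length.
    public static List<(string Value, int Index, int Length)> Find(string replacement)
    {
        var tokens = new List<(string Value, int Index, int Length)>();
        foreach (Match m in TokenPattern.Matches(replacement))
        {
            if (m.Value == "$$")
                continue; // literal dollar sign, not a substitution
            tokens.Add((m.Value, m.Index, m.Length));
        }
        return tokens;
    }
}
```

For @"Lorem$1AAA\$1BBB$1Lorem" this yields three "$1" tokens at indices 5, 11 and 16, each of length 2. The one after the backslash is included because .NET still treats it as a group reference.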

Here's the resultant string. .NET has behaved this way for 20 years.

.NET 6 result string (see live demo at dotnetfiddle.net)
Lorem Lorem AAA \Lorem BBB Lorem Lorem

Error/Bug: '\' appears in the result. I expected it to escape $1 and not be part of the result, but .NET does not treat a backslash as an escape character in replacement patterns; its documented escape for a literal dollar sign is $$.
Workaround: use matches and filter out the match == "\$1" from your results.

Regex101 .NET result string (see live demo at Regex101.com)
Lorem Lorem AAA Lorem BBB Lorem Lorem

Perl's implementation is the gold standard.
Regex101 Perl correct result string (see live demo at Regex101.com)
Lorem Lorem AAA $1 BBB Lorem Lorem
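
A small sketch of the difference follows. The class and method names are made up for the demo, and the pattern and input are simplified stand-ins rather than the ones from the live demos. It compares a backslash in front of $1 with the $$ escape in a .NET replacement pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DollarEscapeDemo
{
    // Backslash is copied through literally and $1 is still substituted.
    public static string WithBackslash(string input) =>
        Regex.Replace(input, @"(Lorem)", @"[\$1]");

    // "$$" emits one literal dollar sign; the following "1" is plain text.
    public static string WithDoubleDollar(string input) =>
        Regex.Replace(input, @"(Lorem)", @"[$$1]");
}

// DollarEscapeDemo.WithBackslash("Lorem")    returns "[\Lorem]"
// DollarEscapeDemo.WithDoubleDollar("Lorem") returns "[$1]"
```

So the Perl-style output shown above is reachable in .NET, but only by writing $$1 instead of \$1 in the replacement string.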

I'm also suggesting a new option:

RegexOptions.PERL

But the pièce de résistance would be to have the following structure populated with the substitution group replacement values in the final resultant string!

replacement.Groups[1].Captures[0].Value = "Lorem";
replacement.Groups[1].Captures[0].Index = 6;
replacement.Groups[1].Captures[0].Length = 5;
replacement.Groups[1].Captures[1].Value = "Lorem";
replacement.Groups[1].Captures[1].Index = 132;
replacement.Groups[1].Captures[1].Length = 5;
etc...

where
Groups[1] is $1
Captures[n] is the n-th occurrence of the $1 replacement (the string "Lorem") in the output string!

Hey .NET Language team, above are my suggestions for a totally new language design component!


Workaround: you can get the indices of substitution group replacements this way,
            but it's a coding adventure.

Using the match groups, you can pre-create an output string and overwrite
everything that does not belong to the current group with dummy characters.
Then you can find the n-th occurrence of the group substitution
[$1, which evaluates to "Lorem"] in the dummy string to get the
correct indices and lengths.
So the result string

Lorem Lorem AAA \Lorem BBB Lorem Lorem

becomes

XXXXXXLoremXXXXXXLoremXXXXXLoremXXXXXX

and you can then find the correct locations of 'Lorem' ($1) in the resultant string.

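For unnamed groups, trivially, you can use this:

// Requires: using System.Linq; using System.Text.RegularExpressions;
// Masks every numbered substitution ($1, $2, ...) in s with '⌀' characters
// of the same length, so indices in the rest of the string do not shift.
string fillUnnamedGroups = Regex.Replace(
    s, @"\$\d+", delegate(Match match)
    {
        return string.Concat(Enumerable.Repeat("⌀", match.Length));
    });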
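
If the dummy-string bookkeeping gets too adventurous, another route is to expand the replacement template by hand and record where each group value lands. This is a different technique from the masking approach above, not a rendering of it. TrackedReplace and its hits list are hypothetical names, and the sketch assumes only $n and $$ appear in the template and that groups are numbered without gaps.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TrackedReplace
{
    // Matches "$$" (escaped dollar) or "$n" (numbered group reference).
    private static readonly Regex Token = new Regex(@"\$\$|\$(?<num>\d+)");

    // Behaves like Regex.Replace for the supported tokens, and also
    // records group number, value, index and length of every substituted
    // group value as it is written to the output.
    public static string Replace(
        string input, string pattern, string replacement,
        List<(int Group, string Value, int Index, int Length)> hits)
    {
        var output = new StringBuilder();
        int last = 0; // end of the previous match in the input
        foreach (Match m in Regex.Matches(input, pattern))
        {
            output.Append(input, last, m.Index - last); // text between matches
            int pos = 0; // current position in the replacement template
            foreach (Match t in Token.Matches(replacement))
            {
                output.Append(replacement, pos, t.Index - pos); // literal text
                pos = t.Index + t.Length;
                if (t.Value == "$$") { output.Append('$'); continue; }

                int n = int.Parse(t.Groups["num"].Value);
                if (n >= m.Groups.Count) { output.Append(t.Value); continue; } // unknown group: keep as text

                string value = m.Groups[n].Value;
                hits.Add((n, value, output.Length, value.Length));
                output.Append(value);
            }
            output.Append(replacement, pos, replacement.Length - pos);
            last = m.Index + m.Length;
        }
        output.Append(input, last, input.Length - last); // tail after the last match
        return output.ToString();
    }
}
```

Calling it with input "x=Lorem;", pattern @"=(\w+);" and replacement "<$1|$$|$1>" returns "x<Lorem|$|Lorem>" and records two hits for group 1, at indices 2 and 10, each of length 5.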


See it in action with my .NET Regular Expression Test Tool,
soon to be available in Clipboard Plaintext PowerTool.
