Pages

Tuesday, February 9, 2021

C# .NET Get difference between two unordered not unique lists aka duplicates allowed within a list and across both lists

Here's how to get the difference between two lists preserving duplicates in each list.

Update: C# .NET Get difference between two unordered not unique lists using 2 BitArrays for super fast speed - handles worst case scenario

//UNORDERED, NOT UNIQUE lists, lists with duplicates 

string[] arrA = new string[] { "1", "1", "2", "3", "4", "5", "9", "7", "7", "7" };
string[] arrB = new string[] { "1", "1", "2", "2", "3", "4", "5", "5", "5", "5" };

//DESIRED RESULT

2,5,5,5,7,7,7,9 OR 2,5,5,5,9,7,7,7 OR 9,7,7,7,2,5,5,5  //UNORDERED IS SUFFICIENT

NOTE: Tried Linq Except but it's designed as a set command, and as a consequence it removes duplicates.

IEnumerable<string> AminuBdifference = arrA.Except(arrB); //removes dups
IEnumerable<string> BminuAdifference = arrB.Except(arrA); //remove dups

IEnumerable<string> Uniondifference = arrA.Union(arrb); //ideally, but no good dups removed

See https://dotnetfiddle.net/kzjUJi  for quick example.


Briefly, some highlights of set theory, since most Linq commands are designed to be set commands

In the image below, it shows a representation for sets (distinct values, can be unordered), and it demonstrates one important fact about sets.  

The order of difference ("subtraction") is important in set theory.

List1 - List2 <> List2 - List1 

See https://dotnetfiddle.net/kzjUJi  for a demonstration of this.

So hence in image below, you see difference produce in order between upper and lower sets. 
If we take union we get [3,2,1,7,8] vs [8,7,3,2,1]  reflecting the difference order choice.

This is important when you look at all the solutions that were attempted on Stack Overflow, because the generally only do 1 difference 'subtraction' only.

All the implementation on this question were wrong on Stack Overflow.

c# - Difference between two lists preserving duplicates - Stack Overflow

For the required solution, we are using lists with duplicates so order is unimportant and should yield same result, since we are getting the union of the two list differences.

Here's my 1st pass at this implementation and it's faster then all mentioned in Stack Overflow post.

Update : Massive speed improvement - see my post here.

Update 2: C# .NET Get difference between two unordered not unique lists using 2 BitArrays for super fast speed - handles worst case scenario


Source Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics; 
                    
public static class Program
{
     public static int loopcnt = 0; 
     public static IEnumerable<T> RemoveRange<T>(this IEnumerable<T> source, IEnumerable<T> second)
     {
         var tempList = source.ToList();

         foreach (var item in second)
         {
             tempList.Remove(item); //this is cause a copy like RemoveAt below
         }

         return tempList;
     }

    //RemoveAt for arrays, this is problematic copy and shrink time consuming - see my next post on how to fix this!
    public static void RemoveAt<T>(ref T[] arr, int index)
    {
        for (int a = index; a < arr.Length - 1; a++)
        {
            // moving elements downwards, to fill the gap at [index]
            arr[a] = arr[a + 1];
            loopcnt++; //not required but here for time anaylsis
        }
        // finally, let's decrement Array's size by one
        Array.Resize(ref arr, arr.Length - 1);
    }
    
    //CHALLENGE - Get difference between two unorderd, not unique lists, duplicates allowed within a list and across both lists.
    //Will demonstrate all solutions from https://stackoverflow.com/questions/15801655/difference-between-two-lists-preserving-duplicates 
    //are WRONG!
    //SOLUTION - https://metadataconsulting.blogspot.com/2021/02/CSharp-dotNet-Get-difference-between-two-unordered-not-unique-lists-aka-duplicates-allowed-within-a-list-and-across-both-lists.html
    //OPTIMIZE - Next blog post will drastically reduce increase the speed of this
    public static void Main()
    {
        
            Console.WriteLine("Get difference between two unordered, not unique (duplicates) lists!");
            Console.WriteLine("Solution by https://metadataconsulting.blogspot.com/2021/02/CSharp-dotNet-Get-difference-between-two-unordered-not-unique-lists-aka-duplicates-allowed-within-a-list-and-across-both-lists.html");
            Console.WriteLine();
        
            Stopwatch sw = new Stopwatch();
            sw.Start(); 

            //UNORDERED, NOT UNIQUE lists, lists with duplicates 
            string[] arrA = new string[] { "1", "1", "2", "3", "4", "5", "9", "7", "7", "7" };
            string[] arrB = new string[] { "1", "1", "2", "2", "3", "4", "5", "5", "5", "5" };

            //DESIRED RESULT = 2,5,5,5,7,7,7,9 OR 2,5,5,5,9,7,7,7 OR 9,7,7,7,2,5,5,5 UNORDERED IS SUFFICIENT, WE LOOKING FOR SPEED
        
            sw.Stop();
            Console.WriteLine("A  = " + string.Join(",", arrA));
            Console.WriteLine("B  = " + string.Join(",", arrB));
            Console.WriteLine("A Count  = " + arrA.Length);    
            Console.WriteLine("B Count  = " + arrB.Length);
            Console.WriteLine();
            sw.Restart();

            //MY IMPLEMENTATION PURE ARRAY SOLUTION -     
        
            string[] diff = new string[arrA.Length+arrB.Length]; //contains B-A 
                        
            int d = 0; //index for diff list
            bool nf = true; //not found 
            loopcnt = 0; 
        
            int a = 0;  
            for (int b = 0; b < arrB.Length; b++)
            {
                nf = true;

                for (a = 0; a < arrA.Length; a++)
                {
                    loopcnt++; //not required but here for time anaylsis                
                    if (arrA[a] == arrB[b])
                    {
                        nf = false;

                        RemoveAt(ref arrA, a); //magic - removes instance found, A is left with differences not found in B, mathematically A-B = diffsA 

                        break;

                    }
                    
                }

                if (nf) //not found in set B 
                {
                    diff[d] = arrB[b]; //diffs B not found in A, mathematically B-A = diffsB (order important here, driven by B, so those results appear)
                    d++;
                }
                
            }
            
            sw.Stop(); 
            Console.WriteLine("No. of loops = " + loopcnt);
            Console.WriteLine("(A-B) Result = " + string.Join(",", arrA).TrimEnd(','));
            Console.WriteLine("(B-A) Result = " + string.Join(",", diff).TrimEnd(','));
            Console.WriteLine("Result Unorderd Final = " + (string.Join(",", diff)).TrimEnd(',') + "," + string.Join(",", arrA) + " in " + sw.ElapsedTicks + " ticks.");

            List<string> C = diff.ToList();

            List<string> D = diff.ToList().Concat(arrA.ToList()).ToList();
            D.Sort();

            Console.WriteLine("Result  Ordered Final = " + string.Join(",", D).TrimStart(','));
            
            sw.Reset();
            sw.Start();
            
            Console.WriteLine(); 
            Console.WriteLine("000 StackOverflow answers all wrong 000");
            Console.WriteLine("https://stackoverflow.com/questions/15801655/difference-between-two-lists-preserving-duplicates"); 
            Console.WriteLine("000 I don't post on StackOverflow because of the idiotic moderators and comments 000");
            Console.WriteLine(); 

            //https://stackoverflow.com/questions/15801655/difference-between-two-lists-preserving-duplicates
            //Dustin Kingen
            List<string> M = new List<string>() { "1", "1", "2", "3", "4", "5", "9", "7", "7", "7" };
            List<string> N = new List<string>() { "1", "1", "2", "2", "3", "4", "5", "5", "5", "5" };
         
            var ListDiffs = M.RemoveRange(N);
            sw.Stop();  
            Console.WriteLine("RemoveRage Result = " + string.Join(",", ListDiffs) + " in " + sw.ElapsedTicks + " ticks.");

            sw.Reset();
            sw.Start();
            
            //https://stackoverflow.com/questions/15801655/difference-between-two-lists-preserving-duplicates
            //Ani
            List<string> list2 = new List<string>() { "1", "1", "2", "3", "4", "5", "9", "7", "7", "7" };
            List<string> list1 = new List<string>() { "1", "1", "2", "2", "3", "4", "5", "5", "5", "5" };
            
            var lookup2 = list2.ToLookup(str => str);

            var result = from str in list1
                         group str by str into strGroup
                         let missingCount
                              = Math.Max(0, strGroup.Count() - lookup2[strGroup.Key].Count())
                         from missingStr in strGroup.Take(missingCount)
                         select missingStr;
            sw.Stop();
            Console.WriteLine("Lookup Result = " + string.Join(",", result) + " in " + sw.ElapsedTicks + " ticks.");
            sw.Reset();
            sw.Start();
            
            List<string> AA = new List<string>() { "1", "1", "2", "3", "4", "5", "9", "7", "7", "7" };
            List<string> BB = new List<string>() { "1", "1", "2", "2", "3", "4", "5", "5", "5", "5" };
         
            //https://stackoverflow.com/questions/15801655/difference-between-two-lists-preserving-duplicates
            //Sergey Kalinichenko 
            var diffLinq = AA.Select(e => new { Key = e, Val = 1 })
                    .Concat(BB.Select(e => new { Key = e, Val = -1 }))
                    .GroupBy(e => e.Key, e => e.Val)
                    .SelectMany(g => Enumerable.Repeat(g.Key, Math.Max(0, g.Sum())))
                    .ToList();

            sw.Stop();
            Console.WriteLine("diffLinq Result = " + string.Join(",", diffLinq) + " in " + sw.ElapsedTicks + " ticks.");




    }
}

No comments:

Post a Comment