I am currently working on a project were we are collecting huge amount of data. Recently we discovered that millions of rows were not unique. Because we are using Machine Learning to find patterns in the data and to cluster it, these dublicate rows were polluting the results of the learning algoritme. Therefor we had to come up with a solution to remove all dublicates and prevent new dublicates in the future. Continue reading