A simple MinHash (original paper link) implementation that identifies similar documents based on keywords. A good explanation can be found in Stanford's Mining of Massive Datasets course: Chapter 3, up to Section 3.3, covers MinHashing and all the concepts required to understand the code.
On a large collection (10,000 documents in this case), MinHashing correctly identifies all 80 pairs of plagiarized documents.
- Parse the ground truth data to create the plagiarized document mappings
- Convert each document to a set of 3-word shingles and map each document to its shingle set
- Define the similarity matrices, using triangular matrices to reduce memory usage
- Create a MinHash signature for each document
- Compare all pairs of signatures
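
The shingling, signature, and comparison steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names (`shingles`, `make_hash_params`, `signature`, `triangle_index`, `estimated_similarities`), the use of CRC32 to hash shingles, the `(a*x + b) % PRIME` hash family, and the default of 64 hash functions are all assumptions made for the example.

```python
import random
import zlib

# Assumed modulus: a prime slightly larger than 2**32, so every CRC32 value fits below it.
PRIME = 4294967311


def shingles(text, k=3):
    """Return the set of k-word shingles in `text`, each hashed to a 32-bit integer."""
    words = text.lower().split()
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def signature(shingle_set, params):
    """MinHash signature: the minimum hash value over the set, per hash function.

    Assumes a non-empty shingle set (the document has at least k words).
    """
    return [min((a * s + b) % PRIME for s in shingle_set) for a, b in params]


def triangle_index(i, j, n):
    """Map an unordered pair of distinct documents to a slot in a flat array of n*(n-1)/2 entries."""
    if i > j:
        i, j = j, i
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def estimated_similarities(docs, num_hashes=64, k=3):
    """Estimate Jaccard similarity for every pair as the fraction of matching signature slots."""
    params = make_hash_params(num_hashes)
    sigs = [signature(shingles(d, k), params) for d in docs]
    n = len(docs)
    sims = [0.0] * (n * (n - 1) // 2)  # upper triangle only, stored flat
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(x == y for x, y in zip(sigs[i], sigs[j]))
            sims[triangle_index(i, j, n)] = matches / num_hashes
    return sims
```

Storing only the upper triangle roughly halves the memory of a full n-by-n matrix, since similarity is symmetric and the diagonal is always 1. Pairs whose estimated similarity exceeds a chosen threshold can then be checked against the ground truth mapping.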