An implementation of the simhash technique for estimating the number of duplicate pages in a given dataset. University project for Information Retrieval (Spring 2015).
The final report, written in Greek, can be found here.
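
As a rough illustration of the simhash idea, and not the code found in DataHasher.java, the sketch below builds a 64-bit fingerprint from the words of a page and compares two fingerprints by Hamming distance. The class name `SimHashSketch`, the whitespace tokenisation, the FNV-1a-style token hash and the `maxDistance` threshold are all assumptions made for this example; it avoids anything newer than Java 1.6.

```java
// Hypothetical sketch of simhash; names and choices here are illustrative only.
class SimHashSketch {

    // FNV-1a-style 64-bit hash applied per character (an assumed choice;
    // any well-mixed 64-bit hash could be used instead).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;            // FNV offset basis
        for (int i = 0; i < token.length(); i++) {
            h ^= token.charAt(i);
            h *= 0x100000001b3L;                 // FNV prime
        }
        return h;
    }

    // Each token votes +1 or -1 on every bit position; the sign of the
    // total vote decides the corresponding bit of the fingerprint.
    static long simhash(String text) {
        int[] votes = new int[64];
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int t = 0; t < tokens.length; t++) {
            if (tokens[t].length() == 0) {
                continue;                        // skip empty tokens
            }
            long h = hash64(tokens[t]);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Two pages count as near-duplicates when their fingerprints differ
    // in at most maxDistance bits (the threshold is an assumed parameter).
    static boolean nearDuplicate(String a, String b, int maxDistance) {
        return hammingDistance(simhash(a), simhash(b)) <= maxDistance;
    }
}
```

Because the votes are summed per bit, the fingerprint does not depend on word order, and pages that share most of their words tend to end up with fingerprints that differ in only a few bits. A call such as `SimHashSketch.nearDuplicate(pageA, pageB, 3)` would then flag likely duplicates, with the threshold tuned to the dataset.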
- Matlab 2012b+
- Matlab: Statistics and Machine Learning Toolbox
- Java 1.6 (Matlab 2012b needs that version)
The main program is proj.m
- In `DataHasher.java`, on lines 45 and 48, insert the path to your Desktop.
- Compile with `javac -source 1.6 -target 1.6 DataHasher.java`.
- In the Matlab workspace, run `which classpath.txt` to locate that file, then add to it the path of the directory containing `DataHasher.class`.
- Run `proj.m` and choose whether the input comes from a .csv file or from an online source.