Commit 0d40566

Added Duplicate File Finder
1 parent 9689b1f commit 0d40566

3 files changed: +201 -0 lines changed

Duplicate-File-Finder/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# Duplicate File Finder

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

Duplicate files often pile up in our directories, especially Documents and Downloads. Common reasons include:

- downloading the same file from several sources,
- automatic backups to the cloud,
- forgetting that the file was already downloaded.

Picking duplicates out by hand is a hassle, and there is no reason to do such a boring task when
automation can do the trick. This simple script compares the files in a
directory, finds the duplicates, lists them, and optionally deletes them.

**Sweet!!!**

## Setup

- Set up a `python 3.x` virtual environment.
- Activate the environment.
- Install the dependencies using `pip3 install -r requirements.txt`.
- You are all set, and the [script](file_finder.py) is ready to run.
- Follow the instructions provided in the comments at the end of the script.

### Usage

In a command-line interface, run the script using:

`python file_finder.py <path of folder1> <path of folder2> ...`

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1. folder1 - *Parent Folder*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2. folder2, folder3 .... - *Subsequent Folders*

>- The Parent Folder acts as the reference for duplicate files, i.e. it holds the original copies, so the original copy of a file is never deleted from this folder.
>- Comparisons are done within each folder, and from the Parent Folder to the Subsequent Folders.

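As a rough illustration of the keep/delete rule described above, the sketch below splits each group of same-hash paths into the one copy that is kept and the copies that would be deleted. `plan_deletions` is a hypothetical helper and the paths are made-up sample data; it only assumes that paths are recorded in the order the folders were given, Parent Folder first.

```python
def plan_deletions(paths_by_hash):
    """Split each group of same-hash paths into one kept path and the rest.

    Hypothetical helper for illustration: the first path in each group
    (found earliest, i.e. in the Parent Folder when it holds a copy) is
    kept, and every later path is marked for deletion.
    """
    keep, delete = [], []
    for paths in paths_by_hash.values():
        keep.append(paths[0])
        delete.extend(paths[1:])
    return keep, delete


# Made-up example data: hash -> paths, with the Parent Folder scanned first.
example = {
    "hash-a": ["Parent/a.png", "Duplicate/a.png", "Duplicate_1/a_copy.png"],
    "hash-b": ["Parent/b.txt"],
}
keep, delete = plan_deletions(example)
print(keep)    # ['Parent/a.png', 'Parent/b.txt']
print(delete)  # ['Duplicate/a.png', 'Duplicate_1/a_copy.png']
```

Running a dry run like this before deleting anything is a cheap way to confirm that the folder order matches your intent.
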
## Dependencies

1. python3
2. keyboard

## Detailed explanation

The script works on a simple principle: two files with the same [`md5checksum`](https://en.wikipedia.org/wiki/MD5) will
almost certainly have identical contents. So the script computes the checksum of every file, compares the checksums, and reports files that share one as duplicates.

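The idea can be sketched in a few lines. This is a minimal illustration, not the script itself: `md5_of`, `find_duplicates`, and `same_bytes` are hypothetical names, and the byte-for-byte check with `filecmp` is an optional extra step, assumed here only to rule out the unlikely case of two different files sharing a checksum.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, blocksize=65536):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def find_duplicates(folder):
    """Group files under `folder` by checksum; return groups of 2+ paths."""
    groups = defaultdict(list)
    for dir_name, _subdirs, file_names in os.walk(folder):
        for name in file_names:
            path = os.path.join(dir_name, name)
            groups[md5_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def same_bytes(first, second):
    """Optional extra check (an assumption, not in the script): compare contents directly."""
    return filecmp.cmp(first, second, shallow=False)
```

Reading in blocks keeps memory use flat even for very large files, which is why the checksum is fed incrementally instead of from one `read()` call.
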
## Output

- Running the script on a single folder, `Stand_Alone`. In this example I pressed [n] so that nothing was deleted.

![Pasting the Magnet Link](https://i.imgur.com/pcABYx4.png)

- `Stand_Alone` folder before deleting the files.

![Pasting the Magnet Link](https://i.imgur.com/PqwxrPQ.png)

- After deleting the files, i.e. pressing [y] at the prompt.

![Pasting the Magnet Link](https://i.imgur.com/34fR6w3.png)

- `Parent`, `Duplicate`, and `Duplicate_1` folders before running the script.

![Pasting the Magnet Link](https://i.imgur.com/XcJ0we3.png)

- Running the script on the folders and deleting the duplicate files.

![Pasting the Magnet Link](https://i.imgur.com/ZaEcroF.png)

![Pasting the Magnet Link](https://i.imgur.com/tfo2day.png)

- Final result. Notice that all the files in the `Parent` folder remain as they are.

>Also notice that similar files with different extensions are not deleted, because their contents, and therefore their checksums, differ.

![Pasting the Magnet Link](https://i.imgur.com/d8VXy5m.png)

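The note about extensions follows from the checksum depending only on a file's bytes, never on its name. A small sketch with in-memory bytes (made-up sample data, no real files; `md5_hex` is a hypothetical helper):

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Identical bytes give the same checksum, whatever the files are called.
report_txt = b"quarterly numbers"
report_bak = b"quarterly numbers"
print(md5_hex(report_txt) == md5_hex(report_bak))  # True

# The same text re-saved in another encoding has different bytes,
# much like one picture saved in two different image formats.
as_utf8 = "café".encode("utf-8")
as_latin1 = "café".encode("latin-1")
print(md5_hex(as_utf8) == md5_hex(as_latin1))  # False
```
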
## Author(s)

Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)

Duplicate-File-Finder/file_finder.py

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Imports
import hashlib
import os
import sys
import keyboard


def image_finder(parent_folder):
    # A dictionary to store Hash of Images corresponding to names
    """
    Sample -
    {hash:[names]}
    """
    duplicate_img = {}
    for dirName, subdirs, fileList in os.walk(parent_folder):
        # Iterating over various Sub-Folders
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Calculate hash
            file_hash = hash_file(path)
            # Add or append the file path in the dictionary
            if file_hash in duplicate_img:
                duplicate_img[file_hash].append(path)
            else:
                duplicate_img[file_hash] = [path]
    return duplicate_img


def delete_duplicate(duplicate_img):
    # For each hash, keep the first path recorded and delete every later one
    for key in duplicate_img:
        file_list = duplicate_img[key]
        while len(file_list) > 1:
            item = file_list.pop()
            os.remove(item)


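# Joins two dictionaries (dict1 is updated in place)
def join_dicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            # Skip paths already recorded (e.g. the same folder given twice),
            # otherwise the only copy of a file could be deleted
            dict1[key] = dict1[key] + [p for p in dict2[key] if p not in dict1[key]]
        else:
            dict1[key] = dict2[key]

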
# For finding Hash of various Files
# If 2 files have the same md5checksum, they most likely have the same content
def hash_file(path, blocksize=65536):
    hasher = hashlib.md5()
    # The with-block closes the file even if a read fails
    with open(path, 'rb') as img_file:
        buf = img_file.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = img_file.read(blocksize)
    # Return Hex MD5
    return hasher.hexdigest()


def print_results(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Found Duplicated Images - ')
        print('Details -')
        print('<--------------------->')
        for result in results:
            # Print Path of Files
            for subresult in result:
                print('\t%s' % subresult)
            print('<--------------------->')
    else:
        print('Unable to identify Similar Images')


if __name__ == '__main__':
80+
if len(sys.argv) > 1:
81+
duplicate = {}
82+
folders = sys.argv[1:]
83+
for i in folders:
84+
# Iterate the folders given
85+
if os.path.exists(i):
86+
# Find the duplicated files and append them to the dictionary
87+
join_dicts(duplicate, image_finder(i))
88+
else:
89+
print('%s is not a valid path, please verify' % i)
90+
sys.exit()
91+
print_results(duplicate)
92+
# Delete Duplicate Images
93+
# Comment if not required
94+
print("Do you want to delete the Duplicate Images (If Any)? Press [y] for Yes.")
95+
while True:
96+
if keyboard.read_key() == "y":
97+
print("Deleting Duplicate Files\n")
98+
delete_duplicate(duplicate)
99+
print("Thank You\n")
100+
break
101+
else:
102+
print("Nothing Deleted!!! Thank You\n")
103+
break
104+
else:
105+
print("Use Command Line Interface")
106+
print("Hint: python file_finder.py <path of folders>")
107+
print("Please Read comments for greater detailing")
108+
'''
109+
Suggestions :------
110+
Usage - python file_finder.py <path of folder1, path of folder2, .....>
111+
folder1 - Parent Folder
112+
folder2, folder3 .... - Subsequent Folders
113+
Comparisons are done with in the folder, and from Parent to Subsequent Folders.
114+
115+
No Files are deleted form Parent Folder but the files which are Duplicate to the files in Subsequent Folders are
116+
deleted. Make sure that the paths are correct
117+
118+
Be careful during Keyboard Input.
119+
'''
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
keyboard==0.13.5
