*********************************************************
* Duncan Johnson
* Eddy Zhang
*********************************************************
Purpose
Gerp takes in a directory and indexes every word in it, which then
allows the user to search for words; every line containing a searched
word is quickly written to a given output file. Users can search for
words case-sensitively or without regard for upper and lower case,
and users can change the file to which the output is sent.
---------------------------------------------------------
Program Files
main.cpp
Parses command line arguments from the user, passes the directory
to search and the output file to Gerp. Runs Gerp.
Gerp.h
Declares the Gerp class: a public constructor that takes the command line
arguments from main, private member variables for the case-sensitive and
lowercase Hash Tables, and a vector of all lines of text that could be
printed to output.
Gerp.cpp
Sets up the two different Hash Tables by traversing the directory,
indexing each file, putting the lines into a vector and each word
into the two Hash Tables.
Also runs the game loop, asking the user for commands and executing
the corresponding functionality.
HashTable.h
Declaration of the HashTable class. The HashTable is implemented as an
array of Buckets, and each Bucket contains a list of KeyValues. A KeyValue
has two elements -- a string word and a list of int indices that represent
the lines on which the word appears. We use chaining to resolve collisions.
HashTable.cpp
Implements the Hash ADT, which gives us constant-time access to each
Bucket and the ability to control when the Hash expands in order to
keep access and insertion constant time on average.
Inserts words as keys and integers as values; clients can get a list
of all integers inserted for a given word.
unit_tests.h
Unit tests for all Hash Table and Gerp functionality. File output was
written to temporary files and then read back in, with assertions to make
sure it's the correct output. Hash Table member variables were made
accessible to the tests in order to check that data was put into the
Hash Table correctly.
Makefile
Rules to build gerp, Gerp.o, HashTable.o, and the unit_test executable
for our program.
---------------------------------------------------------
Compile
make gerp
---------------------------------------------------------
Run
./gerp DirectoryToIndex OutputFile
where DirectoryToIndex is the directory the user wants to search and
OutputFile is the file to which the results will be printed.
---------------------------------------------------------
Architectural Overview
Each instance of Gerp has three important data structures:
a vector of all lines of text from the indexed directory
a Hash Table of case-sensitive words
a Hash Table of all lowercase words
Each Hash Table is an array of Buckets.
A Bucket is a struct we created which holds a Linked List of
KeyValues. This allows us to chain together collisions inside each Bucket.
KeyValue is a struct we created with two pieces of information:
a string word
a list of integers
This lets us iterate through the chain to find the exact list of
integers we want to return to the client; when the client inserts
a new pair of string and int, we append the integer to the KeyValue
with the matching key.
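In sketch form, the two structs look roughly like this (names here are
illustrative; the real declarations live in HashTable.h):

    #include <list>
    #include <string>

    struct KeyValue {
        std::string    word;     // the key
        std::list<int> indices;  // lines on which the word appears
    };

    struct Bucket {
        std::list<KeyValue> chain;  // collisions are chained in this list
    };

    // The table itself is a heap-allocated array of Buckets, e.g.
    // Bucket *table = new Bucket[capacity];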
---------------------------------------------------------
Data Structures
We chose an array to implement our hash table because an array has
constant access time, O(1).
It takes O(n) to expand the hash table and rehash all the elements.
We use chaining to resolve collisions. Since we always keep the load
factor below 70%, we can ensure that the length of each list stays short
enough to maintain approximately constant access and insertion time.
For each KeyValue, we also use a linked list to store the indices of a
word, which takes O(1) to insert; we only ever iterate through the whole
list, so index-based access is not necessary. In order to make sure we
don't insert the same line twice (in the case it has the same word twice),
we first insert all of the words on the line into a set to eliminate any
duplicates. This takes O(log w) per insertion, since the set is a BST,
where w is the number of words on the line; w is rarely larger than 20
or 30 or so, so this does not affect our constant-time insert very much.
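A minimal sketch of that deduplication step, assuming std::set and a
stringstream over the current line (variable names are illustrative):

    #include <set>
    #include <sstream>
    #include <string>

    std::set<std::string> unique;    // BST: duplicate words collapse
    std::stringstream     ss(line);  // line: the current line of text
    std::string           word;
    while (ss >> word)
        unique.insert(word);         // O(log w) per insertion
    // each word now appears at most once, so each line index is
    // inserted at most once per word into the hash tables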
Since we use an array for the hash table, each Bucket is stored on the
heap, which takes the majority of the memory.
---------------------------------------------------------
Algorithms
Tree traversal uses recursion: for each folder in our directory,
we first parse all of the files using the ParseFile() function, which
is explained below, and then recursively search all subdirectories,
passing along the path so that the vector of lines correctly stores the
filepath of each file; that way, when the user queries a word, we can
quickly print out the line.
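A minimal sketch of the traversal, written here with std::filesystem
(the project's actual directory-walking utilities may differ):

    #include <filesystem>
    namespace fs = std::filesystem;

    void traverse(const fs::path &dir) {
        for (const auto &entry : fs::directory_iterator(dir)) {
            if (entry.is_directory())
                traverse(entry.path());            // recurse into subdirectory
            else if (entry.is_regular_file())
                ParseFile(entry.path().string());  // index this file
        }
    }

Passing ParseFile() a path string is an assumption here; the real
signature is whatever Gerp.cpp declares.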
In ParseFile(), we first open the file, then we iterate through each line
of the file and call ParseLine() on it, passing along the line number.
ParseLine() turns the line into a stringstream so we can read each word
one at a time; we strip each word of its leading and trailing non-
alphanumeric characters and add all words to a duplicate-free set.
We then iterate through that set, adding each word to the case-sensitive
hash, and the lowercase version of each word to the lowercase hash.
Finally, we assemble the whole line that will be printed to the Output File,
including the file path, the line number, and the actual contents of the line.
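A minimal sketch of the stripping step (the helper name is ours, not
necessarily the project's):

    #include <cctype>
    #include <string>

    // Remove leading and trailing non-alphanumeric characters,
    // leaving interior punctuation intact.
    std::string stripNonAlnum(const std::string &s) {
        size_t start = 0, end = s.size();
        while (start < end && !isalnum((unsigned char) s[start]))   start++;
        while (end > start && !isalnum((unsigned char) s[end - 1])) end--;
        return s.substr(start, end - start);
    }

For example, "(hello!)" becomes "hello", while "can't" keeps its
interior apostrophe.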
Our second interesting algorithm is the HashTable insert function.
First we get the Bucket that we need to search through using our GetBucket
function, which uses the native hash function to compute our hash value.
Then we iterate through the list of KeyValues in the Bucket to check
whether the word already exists. If it does, we simply add the index of
the line the word appears on to the list of integers in the KeyValue with
the matching word. If the word does not exist in the Bucket, we create a
new KeyValue object with the inserted word and a list containing just the
int index currently being inserted, and add it to the Bucket.
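In sketch form, assuming GetBucket returns a reference to the right
Bucket and that numKeys and capacity are member variables (all names
illustrative):

    void HashTable::insert(const std::string &word, int index) {
        Bucket &b = GetBucket(word);             // hash word, index the array
        for (KeyValue &kv : b.chain) {
            if (kv.word == word) {               // key already present:
                kv.indices.push_back(index);     // append the new line index
                return;
            }
        }
        b.chain.push_back(KeyValue{word, {index}});  // brand-new key
        numKeys++;
        if ((double) numKeys / capacity > 0.7)   // keep load factor < 70%
            expand();
    }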
During this process, if we have created a new KeyValue object, we may
cause our load factor to exceed 70%, which could hurt our access and
insertion time. In order to expand, we create a new hash array that is
twice the size of the old one, and then we iterate through each Bucket
of the old hash array, rehashing each KeyValue into its new Bucket. We
then delete the old hash array, and our hash is now twice as large.
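A minimal sketch of expand, under the same assumed member names:

    #include <functional>  // std::hash

    void HashTable::expand() {
        int     oldCapacity = capacity;
        Bucket *oldTable    = table;
        capacity *= 2;
        table = new Bucket[capacity];            // fresh, empty Buckets
        for (int i = 0; i < oldCapacity; i++) {
            for (const KeyValue &kv : oldTable[i].chain) {
                size_t h = std::hash<std::string>{}(kv.word) % capacity;
                table[h].chain.push_back(kv);    // rehash into its new Bucket
            }
        }
        delete [] oldTable;                      // free the old array
    }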
---------------------------------------------------------
Testing
There were several phases to our testing.
First, our Week 1 testing was very simple: putting in strings and asserting
that the string that came out was correctly parsed. Testing traversal
through the directories involved creating test directories and diffing the
output.
Next, we built our Hash Table implementation and tested the insert and
getValues functions by manually iterating through every Bucket, checking
every KeyValue to see if data was inserted in the correct place. We tested
the expand function by making very small Hash Tables, inserting new
elements, and making sure data was still accessible.
Our next phase was the Gerp setup, in which Gerp traverses the directory
and places all the data into the vector and Hash Tables. We reused a lot
of the testing functionality from the Hash Table to make sure that the
Hash Tables were being set up correctly when grabbing data from an actual
directory.
Our final unit testing phase was the Gerp search phase, which is the bulk
of the Gerp game loop. We made sure that searching for words normally
and with the @insensitive flag worked both when the word existed --
the correct lines print out -- and when the word didn't -- the correct
"Not Found" line is output to the file.
We used diff testing to match our implementation against the reference
implementation, the_gerp, running the same commands on the same directory,
then sorting the lines and diffing them to make sure the output is the same.
We also ran our program under Valgrind on the medium Gutenberg dataset and
found a latent memory error in one function, which we were then able to
diagnose and fix, so Valgrind was very helpful in making sure we don't
have any hidden errors.