*********************************************************
* Duncan Johnson
* Eddy Zhang
*********************************************************
Purpose
Gerp takes in a directory and indexes every word in it, which then
allows the user to search for words; every line containing a searched
word is quickly written to a given output file. Users can search for
words case-sensitively or without regard for upper and lower case,
and users can change the file to which the output is sent.
---------------------------------------------------------
Program Files
main.cpp
Parses command line arguments from the user, passes the directory
to search and the output file to Gerp. Runs Gerp.
Gerp.h
Declares the Gerp class: a public constructor that takes the command line
arguments from main, private member variables for the case-sensitive and
lowercase Hash Tables, and a vector of all lines of text that could be
printed to output.
Gerp.cpp
Sets up the two different Hash Tables by traversing the directory,
indexing each file, putting the lines into a vector and each word
into the two Hash Tables.
Also runs the game loop, asking the user for commands and executing
the corresponding functionality.
HashTable.h
Declaration of the HashTable class. The HashTable is implemented as an
array of Buckets, and each Bucket contains a list of KeyValues. A KeyValue
has two elements -- a string word and a list of int indices that represent
the lines on which the word appears. We use chaining to resolve collisions.
HashTable.cpp
Implements the Hash ADT, which gives us constant-time access to each
Bucket and the ability to control when the Hash expands in order to
keep access and insertion constant time on average.
Inserts words as keys and integers as values; clients can get a list
of all integers inserted for a given word.
unit_tests.h
Unit tests for all Hash Table and Gerp functionality. File output was
written to temporary files and then read back in, with assertions to make
sure it's the correct output. Hash Table member variables were made
accessible to the tests in order to check that data was put into the
Hash Table correctly.
Makefile
Rules to build gerp, Gerp.o, HashTable.o, and the unit_test executable
for our program.
---------------------------------------------------------
Compile
make gerp
---------------------------------------------------------
Run
./gerp DirectoryToIndex OutputFile
where DirectoryToIndex is the directory the user wants to search and
OutputFile is the file to which the results will be printed.
---------------------------------------------------------
Architectural Overview
Each instance of Gerp has three important data structures:
a vector of all lines of text from the indexed directory
a Hash Table of case-sensitive words
a Hash Table of all lowercase words
Each Hash Table is an array of Buckets.
A Bucket is a struct we created which holds a Linked List of
KeyValues. This allows us to chain together collisions inside each Bucket.
KeyValue is a struct we created with two pieces of information:
a string word
a list of integers
This lets us iterate through the chain to find the exact list of
integers we want to return to the client; when the client inserts
a new pair of string and int, we append the integer to the KeyValue
with the matching key.
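In sketch form, the two structs look roughly like this (names here are
illustrative; the real declarations live in HashTable.h):

    #include <list>
    #include <string>

    struct KeyValue {
        std::string    word;     // the key
        std::list<int> indices;  // lines on which the word appears
    };

    struct Bucket {
        std::list<KeyValue> chain;  // collisions are chained in this list
    };

    // The table itself is a heap-allocated array of Buckets, e.g.
    // Bucket *table = new Bucket[capacity];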
---------------------------------------------------------
Data Structures
We chose an array to implement our hash table because an array has
constant access time, O(1).
It takes O(n) to expand the hash table and rehash all the elements.
We use chaining to resolve collisions. Since we always keep the load
factor below 70%, we can ensure that the length of each list stays short
enough to maintain approximately constant access and insertion time.
For each KeyValue, we also use a linked list to store the indices of a
word, which takes O(1) to insert; we only ever iterate through the whole
list, so index-based access is not necessary. In order to make sure we
don't insert the same line twice (in the case it has the same word twice),
we first insert all of the words on the line into a set to eliminate any
duplicates. This takes O(log w) per insertion, since the set is a BST,
where w is the number of words on the line; w is rarely larger than 20
or 30 or so, so this does not affect our constant-time insert very much.
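A minimal sketch of that deduplication step, assuming std::set and a
stringstream over the current line (variable names are illustrative):

    #include <set>
    #include <sstream>
    #include <string>

    std::set<std::string> unique;    // BST: duplicate words collapse
    std::stringstream     ss(line);  // line: the current line of text
    std::string           word;
    while (ss >> word)
        unique.insert(word);         // O(log w) per insertion
    // each word now appears at most once, so each line index is
    // inserted at most once per word into the hash tables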
Since we use an array for the hash table, each Bucket is stored on the
heap, which takes the majority of the memory.
---------------------------------------------------------
Algorithms
Tree traversal uses recursion: for each folder in our directory,
we first parse all of the files using the ParseFile() function, which
is explained below, and then recursively search all subdirectories,
passing along the path so that the vector of lines correctly stores the
filepath of each file; that way, when the user queries a word, we can
quickly print out the line.
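A minimal sketch of the traversal, written here with std::filesystem
(the project's actual directory-walking utilities may differ):

    #include <filesystem>
    namespace fs = std::filesystem;

    void traverse(const fs::path &dir) {
        for (const auto &entry : fs::directory_iterator(dir)) {
            if (entry.is_directory())
                traverse(entry.path());            // recurse into subdirectory
            else if (entry.is_regular_file())
                ParseFile(entry.path().string());  // index this file
        }
    }

Passing ParseFile() a path string is an assumption here; the real
signature is whatever Gerp.cpp declares.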
In ParseFile(), we first open the file, then we iterate through each line
of the file and call ParseLine() on it, passing along the line number.
ParseLine() turns the line into a stringstream so we can read each word
one at a time; we strip each word of its leading and trailing non-
alphanumeric characters and add all words to a duplicate-free set.
We then iterate through that set, adding each word to the case-sensitive
hash, and the lowercase version of each word to the lowercase hash.
Finally, we assemble the whole line that will be printed to the Output File,
including the file path, the line number, and the actual contents of the line.
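A minimal sketch of the stripping step (the helper name is ours, not
necessarily the project's):

    #include <cctype>
    #include <string>

    // Remove leading and trailing non-alphanumeric characters,
    // leaving interior punctuation intact.
    std::string stripNonAlnum(const std::string &s) {
        size_t start = 0, end = s.size();
        while (start < end && !isalnum((unsigned char) s[start]))   start++;
        while (end > start && !isalnum((unsigned char) s[end - 1])) end--;
        return s.substr(start, end - start);
    }

For example, "(hello!)" becomes "hello", while "can't" keeps its
interior apostrophe.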
Our second interesting algorithm is the HashTable insert function.
First we get the Bucket that we need to search through using our GetBucket
function, which uses the native hash function to compute our hash value.
Then we iterate through the list of KeyValues in the Bucket to check
whether the word already exists. If it does, we simply add the index of
the line the word appears on to the list of integers in the KeyValue with
the matching word. If the word does not exist in the Bucket, we create a
new KeyValue object with the inserted word and a list containing just the
int index currently being inserted, and add it to the Bucket.
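In sketch form, assuming GetBucket returns a reference to the right
Bucket and that numKeys and capacity are member variables (all names
illustrative):

    void HashTable::insert(const std::string &word, int index) {
        Bucket &b = GetBucket(word);             // hash word, index the array
        for (KeyValue &kv : b.chain) {
            if (kv.word == word) {               // key already present:
                kv.indices.push_back(index);     // append the new line index
                return;
            }
        }
        b.chain.push_back(KeyValue{word, {index}});  // brand-new key
        numKeys++;
        if ((double) numKeys / capacity > 0.7)   // keep load factor < 70%
            expand();
    }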
During this process, if we have created a new KeyValue object, we may
cause our load factor to exceed 70%, which could hurt our access and
insertion time. In order to expand, we create a new hash array that is
twice the size of the old one, and then we iterate through each Bucket
of the old hash array, rehashing each KeyValue into its new Bucket. We
then delete the old hash array, and our hash is now twice as large.
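A minimal sketch of expand, under the same assumed member names:

    #include <functional>  // std::hash

    void HashTable::expand() {
        int     oldCapacity = capacity;
        Bucket *oldTable    = table;
        capacity *= 2;
        table = new Bucket[capacity];            // fresh, empty Buckets
        for (int i = 0; i < oldCapacity; i++) {
            for (const KeyValue &kv : oldTable[i].chain) {
                size_t h = std::hash<std::string>{}(kv.word) % capacity;
                table[h].chain.push_back(kv);    // rehash into its new Bucket
            }
        }
        delete [] oldTable;                      // free the old array
    }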
---------------------------------------------------------
Testing
There were several phases to our testing.
First, our Week 1 testing was very simple: putting in strings and asserting
that the string that came out was correctly parsed. Testing traversal
through the directories involved creating test directories and diffing the
output.
Next, we built our Hash Table implementation and tested the insert and
getValues functions by manually iterating through every Bucket, checking
every KeyValue to see if data was inserted in the correct place. We tested
the expand function by making very small Hash Tables, inserting new
elements, and making sure data was still accessible.
Our next phase was the Gerp setup, in which Gerp traverses the directory
and places all the data into the vector and Hash Tables. We reused a lot
of the testing functionality from the Hash Table to make sure that the
Hash Tables were being set up correctly when grabbing data from an actual
directory.
Our final unit testing phase was the Gerp search phase, which is the bulk
of the Gerp game loop. We made sure that searching for words normally
and with the @insensitive flag worked both when the word existed --
the correct lines print out -- and when the word didn't -- the correct
"Not Found" line is output to the file.
We used diff testing to match our implementation against the reference
implementation, the_gerp, running the same commands on the same directory,
then sorting the lines and diffing them to make sure the output is the same.
We also ran our program under Valgrind on the medium Gutenberg dataset and
found a latent memory error in one function, which we were then able to
diagnose and fix, so Valgrind was very helpful in making sure we don't
have any hidden errors.