Index the corpus on disk, then present results to the user according to the search mode the user selects:
- Boolean search
- Ranked search
- Text classification
- Inexact retrieval
- Create a token processor
- Create an in-memory positional inverted index
- Boolean query processing
- And query
- Or query
- Not query
- Phrase query
- N-grams and wildcard queries
- Author query (a special query using Soundex search)
- Disk-based positional inverted index (with weights)
- Ranked Retrieval
- Spelling correction
- Text classification
- Inexact retrieval (vocabulary elimination)
At its core, the search engine maps keys to values: each key is a tokenized word, and each value is a list of the documents that contain it, together with other information.
Keys for information retrieval are created with the Porter2 stemmer plus a few additional rules.
Each document in the corpus is assigned a docId in increasing order. Each tokenized word in the corpus is paired with the docIds of the documents that contain it.
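As a rough illustration of the mapping described above, the following Python sketch builds a small in-memory positional inverted index. The `tokenize` helper is a hypothetical stand-in for the project's token processor: it only lowercases and splits on non-alphanumeric characters, and it does not apply Porter2 stemming or the additional rules.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Assumption: a simplified stand-in for the real token processor (no stemming).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(documents):
    """Map each term to {doc_id: [positions]}; doc IDs are assigned in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


corpus = [
    "The park is quiet in the evening",
    "An evening walk",
    "The park opens early",
]
index = build_positional_index(corpus)
print(index["park"])     # {0: [1], 2: [1]}
print(index["evening"])  # {0: [6], 1: [1]}
```

Storing positions alongside each docId is what later makes phrase queries possible.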
In boolean query mode, a document is returned only if it satisfies every criterion of the boolean query entered by the user. A query can take the following forms:
Term literal -> a single word
Phrase literal -> "term literal (term literal)+"
Wildcard literal -> term literal with leading, trailing or embedded * characters
Search Token -> term literal | phrase literal | wildcard literal
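One way a phrase literal could be evaluated against a positional index is sketched below in Python. The index layout (`term -> {doc_id: [positions]}`) and the function name `phrase_match` are assumptions made for this example, not the project's actual code.

```python
def phrase_match(index, terms):
    """Return the sorted doc IDs in which `terms` occur at consecutive positions.

    `index` maps term -> {doc_id: [positions]} (an assumed layout).
    """
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        later = [set(index[term][doc_id]) for term in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # The k-th following term must sit exactly k positions after `start`.
            if all(start + k in positions for k, positions in enumerate(later, start=1)):
                matches.append(doc_id)
                break
    return matches


toy_index = {
    "national": {0: [0, 5], 1: [3]},
    "park": {0: [1], 1: [0], 2: [2]},
}
print(phrase_match(toy_index, ["national", "park"]))  # [0]
```

Document 1 contains both words but not next to each other, so it is not returned.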
AND queries -> ([Search Token]+[Search Token])+
park+evening
"+" denotes the AND operation; it looks for documents containing both park and evening.
OR queries -> [Search Token] [Search Token]
park evening
" " (a space) denotes the OR operation; it looks for documents containing either park or evening.
NOT query -> -[Search Token]
park -evening
"-" denotes the NOT operation; it looks for documents containing park but not evening.
AUTHOR queries -> :author
E.g., :author Stewart
Author queries use the Soundex algorithm, so the query above returns results for both Stewart and Stuart.
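To show why Stewart and Stuart end up in the same bucket, here is a minimal Python sketch of the standard American Soundex encoding. It is an illustration of the algorithm, not the project's own implementation; the function name `soundex` is chosen for this example.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character American Soundex code of `name`."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Keep a digit only if it differs from the previous coded letter.
        if digit and digit != prev:
            code += digit
        # H and W do not separate repeated codes; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


print(soundex("Stewart"))  # S363
print(soundex("Stuart"))   # S363
```

An author index keyed by these codes would return documents for both spellings from a single lookup.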
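Up to this point, the key-value pairs were kept in memory. From here on, a disk-based index is used to reduce RAM usage. Each key is stored in a B+ tree that gives access to its postings on disk, stored in the form (d, w_{d,t}, tf_{t,d}, p1, p2, p3, ...)+: the docId, the weight of the term in that document, the term frequency, and the positions.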
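The record layout above can be sketched in Python with the standard `struct` module. Everything specific here is an assumption for illustration: the field widths (4-byte integers, 8-byte weight), the big-endian byte order, the weight values, and the plain dictionary `offsets`, which stands in for the B+ tree that maps a term to its byte offset.

```python
import io
import struct


def write_postings(out, postings):
    """Append one term's postings to `out`; return the byte offset where they start.

    `postings` is a list of (doc_id, weight, positions); tf is len(positions).
    """
    start = out.tell()
    out.write(struct.pack(">I", len(postings)))
    for doc_id, weight, positions in postings:
        out.write(struct.pack(">IdI", doc_id, weight, len(positions)))
        out.write(struct.pack(f">{len(positions)}I", *positions))
    return start


def read_postings(src, offset):
    """Read back the postings that start at `offset`."""
    src.seek(offset)
    (count,) = struct.unpack(">I", src.read(4))
    postings = []
    for _ in range(count):
        doc_id, weight, tf = struct.unpack(">IdI", src.read(16))
        positions = list(struct.unpack(f">{tf}I", src.read(4 * tf)))
        postings.append((doc_id, weight, positions))
    return postings


disk = io.BytesIO()  # stands in for a real file opened in binary mode
offsets = {}         # assumption: a dict standing in for the B+ tree
offsets["park"] = write_postings(disk, [(0, 1.0, [1]), (2, 1.69, [1, 7])])
offsets["evening"] = write_postings(disk, [(1, 1.0, [4])])
print(read_postings(disk, offsets["evening"]))  # [(1, 1.0, [4])]
```

Only the term-to-offset mapping needs fast lookup; the postings themselves are read from disk on demand.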
In ranked retrieval mode, the user enters a list of words as the search query, and results are displayed in order of decreasing relevance to the query.
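A term-at-a-time scoring loop of the kind commonly used for ranked retrieval is sketched below in Python. The weighting formulas are assumptions, since the text does not state them: query weight ln(1 + N/df), document weight 1 + ln(tf), and division by a per-document length `doc_lengths[d]`. The toy index and length values are invented.

```python
import heapq
import math
from collections import defaultdict


def rank(query_terms, index, doc_lengths, k=10):
    """Return the top-k (score, doc_id) pairs, best first.

    `index` maps term -> {doc_id: tf}; `doc_lengths` maps doc_id -> normalizer.
    """
    n_docs = len(doc_lengths)
    accumulators = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        w_qt = math.log(1 + n_docs / len(postings))  # assumed query weight
        for doc_id, tf in postings.items():
            w_dt = 1 + math.log(tf)                  # assumed document weight
            accumulators[doc_id] += w_qt * w_dt
    scored = ((score / doc_lengths[d], d) for d, score in accumulators.items())
    return heapq.nlargest(k, scored)


toy_index = {"park": {0: 2, 2: 1}, "evening": {0: 1, 1: 3}}
toy_lengths = {0: 2.0, 1: 2.5, 2: 1.0}  # illustrative values only
for score, doc_id in rank(["park", "evening"], toy_index, toy_lengths):
    print(doc_id, round(score, 3))
```

Using a heap to select the top k avoids sorting every scored document when the corpus is large.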
In ranked retrieval mode only, the user is offered an alternative query if any single query term returns zero results or fewer results than a threshold.
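The text does not say how the alternative query is chosen, so the following Python sketch assumes one simple approach: a term whose document frequency falls below the threshold is replaced by the vocabulary term with the smallest Levenshtein edit distance. The function names, the threshold value, and the toy frequencies are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(term, doc_freq, threshold=5):
    """Return `term` unchanged if it is frequent enough, else the closest frequent term."""
    if doc_freq.get(term, 0) >= threshold:
        return term
    candidates = [t for t, df in doc_freq.items() if df >= threshold and t != term]
    if not candidates:
        return term
    # Prefer the smallest edit distance, then the more frequent term.
    return min(candidates, key=lambda t: (edit_distance(term, t), -doc_freq[t], t))


def suggest_query(terms, doc_freq, threshold=5):
    """Apply `suggest` to every term of the query."""
    return [suggest(term, doc_freq, threshold) for term in terms]


toy_doc_freq = {"park": 40, "part": 12, "evening": 25}  # invented counts
print(suggest_query(["prk", "evning"], toy_doc_freq))  # ['park', 'evening']
```

Comparing against the whole vocabulary is slow for large corpora; a k-gram index is a common way to narrow the candidates first.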
Using two vector-based algorithms, kNN classification and Rocchio classification, we tackled the well-known problem of the disputed Federalist Papers. Our results are similar to those reported by other researchers.
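The two classifiers can be sketched in Python as follows. Documents are represented as sparse term-weight dictionaries, and both classifiers use Euclidean distance; these choices, the function names, and the toy vectors are assumptions for illustration only and do not reproduce the project's data or results.

```python
import math
from collections import Counter, defaultdict


def distance(u, v):
    """Euclidean distance between two sparse vectors (term -> weight)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def centroids(training):
    """Average the vectors of each class; `training` is a list of (label, vector)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for label, vector in training:
        counts[label] += 1
        for term, weight in vector.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def rocchio_classify(vector, class_centroids):
    """Assign the class whose centroid is nearest."""
    return min(class_centroids, key=lambda label: distance(vector, class_centroids[label]))


def knn_classify(vector, training, k=3):
    """Assign the majority class among the k nearest training vectors."""
    nearest = sorted(training, key=lambda item: distance(vector, item[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]


# Toy, invented vectors; not taken from the actual papers.
toy_training = [
    ("HAMILTON", {"upon": 0.9, "whilst": 0.1}),
    ("HAMILTON", {"upon": 0.8, "whilst": 0.0}),
    ("MADISON", {"upon": 0.1, "whilst": 0.9}),
    ("MADISON", {"upon": 0.2, "whilst": 0.7}),
]
unknown = {"upon": 0.15, "whilst": 0.8}
print(rocchio_classify(unknown, centroids(toy_training)))  # MADISON
print(knn_classify(unknown, toy_training, k=3))            # MADISON
```

Rocchio compares against one centroid per class, while kNN compares against every training document, so Rocchio is cheaper at classification time.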
By taking a few metrics into account, inexact retrieval speeds up search at the cost of dropping a few results. The trade-off is acceptable because the loss in precision is very small.
CECS 529 (Search Engine Technology) - Fall 2020
- Haritha Nimmagadda
- Abhinay Kacham
- Varun Lingabathini
The development of this project is closed.
Demo video: YouTube