Information Retrieval (Spring 2018)
- Pre-process the documents by removing all HTML tags and converting everything to lower case.
- Implement a stop list and a stemmer to pre-process the documents.
- Build an inverted index (including the dictionary and the posting lists) for the documents. Please make sure to keep all the frequency information.
- 1. Calculate the length of the corresponding document vector for each document.
- 2. Pre-process the query and calculate the length of the query vector.
- 3. Compute the tf-idf similarity scores between the query and the documents.
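
The indexing and scoring steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the stop list is a tiny hypothetical one, `naive_stem` is a crude stand-in for a real stemmer such as Porter's, and the weighting (raw term frequency times a base-10 idf, ranked by cosine similarity) is only one common tf-idf variant. All function names are assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop list (a real one would be much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def naive_stem(token):
    # Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Strip HTML tags, lower-case, tokenize, drop stop words, then stem.
    text = re.sub(r"<[^>]+>", " ", text).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    # Dictionary: term -> posting list, stored as {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    # Document frequency is the length of the term's posting list.
    return math.log10(n_docs / len(index[term]))

def doc_lengths(index, n_docs):
    # Step 1: Euclidean length of each document's tf-idf vector.
    squares = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, term, n_docs)
        for doc_id, tf in postings.items():
            squares[doc_id] += (tf * weight) ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in squares.items()}

def search(query, index, lengths, n_docs):
    # Step 2: pre-process the query and compute its vector length.
    q_tf = Counter(t for t in preprocess(query) if t in index)
    q_weights = {t: tf * idf(index, t, n_docs) for t, tf in q_tf.items()}
    q_len = math.sqrt(sum(w * w for w in q_weights.values()))
    # Step 3: accumulate dot products, then normalize to cosine similarity.
    scores = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index[term].items():
            scores[doc_id] += q_w * tf * idf(index, term, n_docs)
    ranked = {d: s / (q_len * lengths[d])
              for d, s in scores.items() if q_len and lengths[d]}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```

To use it, build the index once with `build_index(docs)`, compute `doc_lengths(index, len(docs))`, and pass both to `search` for every query. Keeping the raw term frequencies in the posting lists is what lets the document lengths be computed ahead of time instead of at query time.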
- A multi-threaded spider that fetches and parses web pages.
- The URL frontier, which stores the URLs that are still to be crawled.
- The URL repository, which stores the URLs that have already been crawled.
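
A rough sketch of how these three components could fit together, using only the Python standard library. It is an illustration under stated assumptions rather than the project's crawler: `crawl`, `LinkParser`, `fetch_http`, and the `max_pages` limit are hypothetical names, the repository here also keeps the page text, and politeness rules such as robots.txt handling and rate limiting are left out. The `fetch` function is a parameter so the crawler can be tried without network access.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects absolute, fragment-free URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url, _ = urldefrag(urljoin(self.base_url, value))
                    self.links.append(url)

def fetch_http(url):
    # Default fetcher; any callable mapping a URL to HTML text can replace it.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seeds, fetch=fetch_http, n_threads=4, max_pages=50):
    frontier = queue.Queue()   # URL frontier: URLs still to be crawled
    repository = {}            # URL repository: crawled URL -> page HTML
    seen = set(seeds)          # every URL ever placed on the frontier
    lock = threading.Lock()
    for seed in seeds:
        frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            try:
                if url is None:        # sentinel: no more work
                    return
                try:
                    page = fetch(url)
                except Exception:
                    continue           # skip pages that fail to download
                parser = LinkParser(url)
                parser.feed(page)
                with lock:
                    repository[url] = page
                    for link in parser.links:
                        # Cap the total number of URLs ever enqueued.
                        if link not in seen and len(seen) < max_pages:
                            seen.add(link)
                            frontier.put(link)
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_threads)]
    for t in threads:
        t.start()
    frontier.join()                    # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)             # then tell each worker to exit
    for t in threads:
        t.join()
    return repository
```

The `seen` set prevents a URL from being queued twice, and `frontier.join()` returns only once every queued URL has been handled, so the workers can be shut down cleanly with sentinels.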
Please feed the collected documents to the search engine that you implemented in step 2. Please implement a Web-based interface that takes user queries and returns answers (document names, snapshots with the search terms highlighted, and URLs) to the user. You only need to provide a reasonable (not so fancy) interface; you can use WYSIWYG editors to generate the HTML. Keep this version of your search engine, since it will be compared with two future versions.
- Used the Flask framework to render the UI and to host the server.
- Used Bootstrap to lay out and style the UI.
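
The snapshot with highlighted search terms could be produced by a small helper like the one below before the result is handed to a template. This is a sketch with assumed names (`make_snippet`, `width`) and an assumed choice of the `<mark>` tag. It matches the raw query terms as case-insensitive substrings, so it does not account for stemming.

```python
import html
import re

def make_snippet(text, terms, width=80):
    """Return an HTML-escaped excerpt of `text` with query terms wrapped in <mark>."""
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(text[:width])
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    # Center the excerpt on the first match, if there is one.
    first = pattern.search(text)
    start = max(0, first.start() - width // 2) if first else 0
    excerpt = text[start:start + width]
    # Escape the text between matches piece by piece so the <mark> tags survive.
    parts, last = [], 0
    for m in pattern.finditer(excerpt):
        parts.append(html.escape(excerpt[last:m.start()]))
        parts.append("<mark>" + html.escape(m.group(0)) + "</mark>")
        last = m.end()
    parts.append(html.escape(excerpt[last:]))
    return "".join(parts)
```

Because the helper already escapes the page text, a template would need to output its return value as trusted HTML, not escape it a second time.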