A simple Java implementation of a Boolean Retrieval System. This is Project #1 of the Information Retrieval university course.
Write an IR system able to answer:
- Boolean queries with AND, OR and NOT;
- Wildcard and phrase queries;
- Provide a way to save and load the entire index from disk, to avoid re-indexing when the program starts;
- Some normalization or stemming can be performed;
- Spelling correction can be implemented;
- Evaluate the system on a set of test queries.
The CMU Movie Summary Corpus consists of 42,306 movie plot summaries extracted from Wikipedia and aligned metadata extracted from Freebase.
To answer queries, the system must first index the entire corpus. Because this operation can take a long time, a pre-built index located inside the data folder is used by default for faster start-up. To start the system, just build and run the application. If you want the complete indexing operation to be performed instead, you need to:
- Download the movie corpus here;
- Extract the MovieSummaries folder and place it in the project root folder;
- Delete the pre-built index or move it to another location;
- Build and run the application.
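The load-or-rebuild behaviour described above could be sketched roughly as follows. This is only an illustration that assumes standard Java serialization and a plain term-to-document-IDs map; the class name IndexStoreSketch, its methods and the tokenization rule are hypothetical and are not taken from the project.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not the project's actual class.
class IndexStoreSketch {

    // Build a simple inverted index: term -> sorted set of document IDs.
    static TreeMap<String, TreeSet<Integer>> build(List<String> docs) {
        TreeMap<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String tok : docs.get(id).toLowerCase().split("\\W+")) {
                if (!tok.isEmpty()) {
                    index.computeIfAbsent(tok, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    // Write the whole index to disk with Java serialization.
    static void save(TreeMap<String, TreeSet<Integer>> index, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a previously saved index back from disk.
    @SuppressWarnings("unchecked")
    static TreeMap<String, TreeSet<Integer>> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (TreeMap<String, TreeSet<Integer>>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Index file has an unexpected format", e);
        }
    }

    // Use the saved index when it exists; otherwise index the corpus and save the result.
    static TreeMap<String, TreeSet<Integer>> loadOrBuild(Path file, List<String> docs) {
        if (Files.exists(file)) {
            return load(file);
        }
        TreeMap<String, TreeSet<Integer>> index = build(docs);
        save(index, file);
        return index;
    }
}
```

With this layout, deleting or moving the saved file is enough to force a full re-indexing on the next start, which mirrors the steps listed above.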
As required, the system supports Boolean queries, single-wildcard queries and two-term phrase queries.
Both single- and multiple-term Boolean queries are supported. The system expects the Boolean operators to be specified in uppercase. By default, if no operators are specified, the query is converted to an AND query. In a NOT query, when multiple terms are specified, the system computes the negation of the OR-ed list of terms.
indiana AND jones
rocky balboa adriana
yoda AND luke AND darth
yoda luke darth
yoda OR skywalker
yoda OR luke OR darth OR skywalker
NOT the
NOT and is or be are the on at in
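The Boolean examples above could be evaluated against an inverted index along these lines. This is a minimal sketch, not the project's code: the class BooleanQuerySketch and its methods are hypothetical, the index is assumed to be a map from lowercase terms to document-ID sets, and a query is assumed to use only one kind of operator, as in the examples.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical evaluator, not the project's actual class.
class BooleanQuerySketch {

    static Set<Integer> evaluate(String query,
                                 Map<String, Set<Integer>> index,
                                 Set<Integer> allDocs) {
        String[] tokens = query.trim().split("\\s+");

        // NOT t1 t2 ... = all documents minus the union (OR) of the terms' postings.
        if (tokens[0].equals("NOT")) {
            Set<Integer> result = new TreeSet<>(allDocs);
            for (int i = 1; i < tokens.length; i++) {
                result.removeAll(postings(index, tokens[i]));
            }
            return result;
        }

        // Operators are only recognized in uppercase; without any OR the query is an AND query.
        boolean isOr = Arrays.asList(tokens).contains("OR");
        Set<Integer> result = null;
        for (String token : tokens) {
            if (token.equals("AND") || token.equals("OR")) {
                continue;
            }
            Set<Integer> p = postings(index, token);
            if (result == null) {
                result = new TreeSet<>(p);      // first term seeds the result
            } else if (isOr) {
                result.addAll(p);               // union
            } else {
                result.retainAll(p);            // intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Postings of a term, or an empty set when the term is not in the index.
    static Set<Integer> postings(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Under these assumptions, `yoda luke darth` and `yoda AND luke AND darth` take the same intersection path, and `NOT and is or be` treats the lowercase words after NOT as ordinary terms.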
cat*
catastroph*
c*lly
*atastrophically
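Wildcard patterns such as the ones above can be expanded into matching dictionary terms before looking up their postings. The sketch below uses a simple prefix/suffix check over a sorted term dictionary; this is an assumption chosen for brevity, and the project may rely on a different structure such as a permuterm or k-gram index. The class WildcardSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical helper, not the project's actual class.
class WildcardSketch {

    // Expand a pattern containing exactly one '*' into the dictionary terms it matches.
    static List<String> expand(String pattern, NavigableSet<String> dictionary) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*'");
        }
        String prefix = pattern.substring(0, star);
        String suffix = pattern.substring(star + 1);

        List<String> matches = new ArrayList<>();
        // In a sorted dictionary, all terms sharing the prefix are contiguous,
        // so the scan can start at the prefix and stop at the first mismatch.
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            // The length check keeps prefix and suffix from overlapping inside the term.
            if (term.length() >= prefix.length() + suffix.length()
                    && term.endsWith(suffix)) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

The postings of the expanded terms would then be OR-ed together to answer the wildcard query. A leading wildcard such as `*atastrophically` has an empty prefix, so this sketch falls back to scanning the whole dictionary.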
"forrest gump"
"darth maul"
"order 66"
"rocky balboa"
"indiana jones"
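Two-term phrase queries like the ones above are commonly answered with a positional index, where each term stores the positions at which it occurs in every document. The following is a minimal sketch under that assumption; the class PhraseQuerySketch, its methods and the tokenization rule are hypothetical and may differ from what the project does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not the project's actual class.
class PhraseQuerySketch {

    // Positional index: term -> (document ID -> positions of the term in that document).
    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int pos = 0;
            for (String tok : docs.get(docId).toLowerCase().split("\\W+")) {
                if (tok.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(tok, k -> new TreeMap<>())
                     .computeIfAbsent(docId, k -> new ArrayList<>())
                     .add(pos++);
            }
        }
        return index;
    }

    // Documents in which the second term appears immediately after the first.
    static Set<Integer> phrase(String quoted, Map<String, Map<Integer, List<Integer>>> index) {
        String[] terms = quoted.replace("\"", "").trim().toLowerCase().split("\\s+");
        if (terms.length != 2) {
            throw new IllegalArgumentException("expected exactly two terms");
        }
        Map<Integer, List<Integer>> first = index.getOrDefault(terms[0], Collections.emptyMap());
        Map<Integer, List<Integer>> second = index.getOrDefault(terms[1], Collections.emptyMap());

        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            List<Integer> secondPositions = second.get(entry.getKey());
            if (secondPositions == null) {
                continue;   // the second term never occurs in this document
            }
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

Because adjacency is checked in order, `"forrest gump"` does not match a document that only contains "gump ... forrest". Numeric tokens such as the `66` in `"order 66"` are kept as terms by the tokenization assumed here.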