Homework and projects for SUTD 50.040 Natural Language Processing, taught by Professor Lu Wei. For more information, refer to https://istd.sutd.edu.sg/undergraduate/courses/50040-natural-language-processing.
Word embeddings are dense vectors that represent words and can capture semantic and syntactic similarity as well as relations between words. This homework learns word embeddings with two kinds of methods: count-based (co-occurrence matrices) and prediction-based (Word2Vec with the CBOW and Skip-gram models). The dataset used is "text8", which consists of a single line of text.
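As a rough illustration of what the two approaches consume, the sketch below builds windowed co-occurrence counts (the raw material for a count-based embedding) and the (center, context) pairs a Skip-gram model would be trained on. It is not the homework solution: the function names, the window size, and the toy sentence are assumptions made for this example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each context word appears within `window` positions of each center word."""
    counts = defaultdict(Counter)
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][tokens[j]] += 1
    return counts


def skipgram_pairs(tokens, window=2):
    """List the (center, context) training pairs a Skip-gram model would see."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy input (hypothetical); text8 itself would be read as one long whitespace-separated line.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
pairs = skipgram_pairs(tokens, window=1)
```

Each row of `counts` can be read as a sparse count vector for one word. A count-based method would typically reweight and reduce these rows (for example with SVD), while Skip-gram learns dense vectors by predicting the context word of each pair.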
Constituency parsing aims to extract a constituency-based parse tree from a sentence, representing its syntactic structure according to a phrase structure grammar. This homework implements a constituency parser based on probabilistic context-free grammars (PCFGs) and evaluates its performance. The dataset used is a version of the "Penn Treebank" released in the NLTK corpora.
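A common way to parse with a PCFG is the CKY algorithm over a grammar in Chomsky normal form, keeping the highest-probability derivation for every span. The sketch below shows that idea on a hypothetical three-rule toy grammar; the grammar, the rule probabilities, and the function names are assumptions for illustration and are not taken from the homework or from the Penn Treebank.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (probabilities are made up).
LEXICAL = {  # (A, word) -> P(A -> word)
    ("NP", "she"): 0.5,
    ("NP", "fish"): 0.5,
    ("V", "eats"): 1.0,
}
BINARY = {  # (A, B, C) -> P(A -> B C)
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}


def cky(words):
    """Fill a chart where best[(i, j)][A] = (probability, backpointer) for span words[i:j]."""
    n = len(words)
    best = defaultdict(dict)
    # Spans of length 1: apply lexical rules.
    for i, w in enumerate(words):
        for (label, word), p in LEXICAL.items():
            if word == w:
                best[(i, i + 1)][label] = (p, w)
    # Longer spans: try every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    if b in best[(i, k)] and c in best[(k, j)]:
                        prob = p * best[(i, k)][b][0] * best[(k, j)][c][0]
                        if prob > best[(i, j)].get(a, (0.0, None))[0]:
                            best[(i, j)][a] = (prob, (k, b, c))
    return best


def build_tree(best, i, j, label):
    """Follow backpointers to rebuild the best tree as nested tuples."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):  # lexical rule: backpointer is the word itself
        return (label, back)
    k, b, c = back
    return (label, build_tree(best, i, k, b), build_tree(best, k, j, c))


def parse(words, root="S"):
    """Return (tree, probability), or (None, 0.0) if no parse is rooted in `root`."""
    best = cky(words)
    if root not in best[(0, len(words))]:
        return None, 0.0
    return build_tree(best, 0, len(words), root), best[(0, len(words))][root][0]


tree, prob = parse(["she", "eats", "fish"])
```

With a treebank, the rule probabilities would instead be estimated from relative frequencies of the observed rules, after binarizing the trees so that the grammar fits the CKY chart.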
3. HW3 [Written] - Language Model, Dependency Parsing, Context Free Grammar, Transition-based Parsing
Part 1: IBM Model 1 using hard and soft expectation-maximization (EM) algorithm.
Part 2: Seq2Seq Attention Model using a Bidirectional-LSTM Encoder and a Unidirectional-LSTM Decoder.
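For Part 1, the soft EM procedure for IBM Model 1 can be sketched as follows: the E-step distributes each source word's count fractionally over the target words in the same sentence pair, and the M-step renormalizes those expected counts into translation probabilities. This is a simplified sketch, not the project code: it omits the NULL word, and the function name, the iteration count, and the three-sentence toy corpus are assumptions for illustration.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Soft EM for IBM Model 1 (no NULL word).

    pairs: list of (source_tokens, target_tokens).
    Returns t, where t[(f, e)] approximates P(source word f | target word e).
    """
    src_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignments
        total = defaultdict(float)  # expected count of e being aligned to anything
        # E-step: fractional counts proportional to the current t(f | e).
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / z
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts per target word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Hypothetical toy corpus: German source, English target.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = ibm_model1(corpus, iterations=10)
```

A hard EM variant would change only the E-step: instead of spreading fractional counts over all target words, each source word would add a count of 1 to its single most probable target word under the current `t`.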
5. HW5 [Written] - Phrase-based Machine Translation, Synchronous CFG, Word Alignment Model, Attention
- Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA: May 1999
- Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft), 2018
- Yoav Goldberg, Neural Network Methods for Natural Language Processing, 2017
- D. Poole, Linear Algebra: A Modern Introduction. 3rd edition, 2010.
- J. L. Devore, Probability and Statistics for Engineering and the Sciences. 8th edition, 2011