- Visualizing and exploring data points is a great (and fun) way to gain insight into data.
- What if we could turn each movie into a single representation vector?
- If the representation is good enough, similar movies will be clustered together, so recommending a movie becomes almost trivial.
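
To make the "recommendation becomes trivial" idea concrete, here is a minimal sketch of nearest-neighbor lookup over movie vectors. The function name `most_similar`, the use of cosine similarity, and the toy vectors are all assumptions for illustration; the project itself may rank neighbors differently.

```python
import numpy as np

def most_similar(vectors: np.ndarray, query_idx: int, k: int = 3) -> list:
    """Return indices of the k movies closest to `query_idx` by cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never recommend the query movie itself
    return np.argsort(-sims)[:k].tolist()

# Toy 2-D vectors (made up): movies 0 and 1 point in nearly the same direction.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = most_similar(toy, query_idx=0, k=2)
print(neighbors)  # [1, 2]
```

With real representation vectors, the same lookup would return the movies whose learned features sit closest to the query movie.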
- Scrape the movie list from the movie review website Watcha
- Get the 5,200+ movies rated by critic Lee Dongjyn.
- Features processed: ["title", "plot", "date", "genre", "director", "country", "user_rating", "critic_rating"]
- Make a representation vector for each movie
- Encode each plot into a single sentence vector using KoBERT, an open-source Korean NLP model
- Use the metadata features alongside the plot vector: [director, date, genre, country, user_rating]
- Train a movie rating prediction model (forward pass: read the layers below from bottom to top)
- [Loss] (output, critic rating)
- [MLP]
- [plot vector] + [genre embs concat] + [director emb] + [country emb] + [date emb] + [user rating]
- [KoBERT (partially frozen)]
- [tokenized plot text]
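
The stack above could be sketched in PyTorch roughly as follows. This is a hedged sketch, not the project's actual code: the KoBERT encoder is left out and the plot vector is assumed to be precomputed (768 dimensions, the usual BERT-base size), and the class name `RatingModel`, all vocabulary sizes, `emb_dim`, `max_genres`, `hidden_dim`, the 0 to 5 rating range, and the choice of MSE as the loss are assumptions.

```python
import torch
import torch.nn as nn

class RatingModel(nn.Module):
    """Predicts a critic rating from a plot vector plus metadata features."""

    def __init__(self, plot_dim=768, n_genres=30, n_directors=3000, n_countries=60,
                 n_dates=120, emb_dim=16, max_genres=3, hidden_dim=128):
        super().__init__()
        # Genre index 0 is treated as padding for movies with fewer genres (assumption).
        self.genre_emb = nn.Embedding(n_genres, emb_dim, padding_idx=0)
        self.director_emb = nn.Embedding(n_directors, emb_dim)
        self.country_emb = nn.Embedding(n_countries, emb_dim)
        self.date_emb = nn.Embedding(n_dates, emb_dim)
        in_dim = plot_dim + max_genres * emb_dim + 3 * emb_dim + 1
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, 1)

    def represent(self, plot_vec, genres, director, country, date, user_rating):
        """Hidden-layer activation, usable later as the movie representation."""
        genre_vec = self.genre_emb(genres).flatten(start_dim=1)  # concat genre embs
        x = torch.cat([plot_vec, genre_vec, self.director_emb(director),
                       self.country_emb(country), self.date_emb(date),
                       user_rating.unsqueeze(1)], dim=1)
        return self.hidden(x)

    def forward(self, *features):
        return self.head(self.represent(*features)).squeeze(1)

# Random toy batch standing in for real scraped features.
torch.manual_seed(0)
B = 4
batch = (torch.randn(B, 768), torch.randint(0, 30, (B, 3)),
         torch.randint(0, 3000, (B,)), torch.randint(0, 60, (B,)),
         torch.randint(0, 120, (B,)), torch.rand(B) * 5)
critic_rating = torch.rand(B) * 5

model = RatingModel()
pred = model(*batch)
loss = nn.functional.mse_loss(pred, critic_rating)  # MSE is an assumed loss
loss.backward()
hidden = model.represent(*batch)  # shape (B, hidden_dim)
```

Exposing `represent` separately from `forward` keeps the hidden activation easy to extract once training is done.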
- After training, take each movie's representation vector from the hidden layer of the MLP
- Reduce dimensionality using t-SNE (hidden_dim -> 2 dimensions)
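
As a rough illustration of the reduction step, the sketch below runs scikit-learn's `TSNE` on random stand-in vectors. The array `hidden_vectors`, its size (200 x 128), and the `perplexity`, `init`, and `random_state` settings are assumptions; the project's actual settings are not stated.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the hidden-layer vectors of 200 movies (random, for illustration only).
rng = np.random.default_rng(0)
hidden_vectors = rng.normal(size=(200, 128)).astype(np.float32)

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
coords = tsne.fit_transform(hidden_vectors)

print(coords.shape)  # (200, 2)
```

Each row of `coords` is the 2-D position of one movie, ready to be plotted.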
- Make an interactive plot on the web
- Used Nomic AI's deepscatter library (https://github.com/nomic-ai/deepscatter) for efficiency and speed.
- You can see the result on this page (still WIP):
- https://scatterfilm.web.app
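
Before a scatter plot library can display the points, the 2-D coordinates and movie metadata have to be written out as a table. Below is a minimal sketch of that export using only the standard library. The helper `write_points`, the column names, and the sample rows are hypothetical, and the conversion into whatever format deepscatter loads is not shown here.

```python
import csv
import io

FIELDS = ["x", "y", "title", "critic_rating"]  # hypothetical column names

def write_points(rows, out):
    """Write one CSV row per movie: 2-D position plus metadata for tooltips."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Made-up sample rows; real rows would come from the t-SNE output and scraped data.
rows = [
    {"x": 0.12, "y": -1.5, "title": "Movie A", "critic_rating": 4.0},
    {"x": 2.3, "y": 0.7, "title": "Movie B", "critic_rating": 2.5},
]

buf = io.StringIO()  # swap for open("points.csv", "w", newline="") to write a file
write_points(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```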