Viberary is a project that recommends books based not on genre or title, but vibe. It works by performing semantic search across a set of embeddings learned on a dataset of books from Goodreads and their metadata.
The idea is pretty simple: return book recommendations based on the vibe of the book that you put in. So instead of typing "I want science fiction", you'd type "atmospheric, female lead, worldbuilding, funny" as a prompt, and get back a list of books.
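Mechanically, "search by vibe" is nearest-neighbor lookup over vectors. A minimal sketch of that core step, with stand-in data (the real embeddings and query encoder come later in the project):

```python
import numpy as np

# Stand-in inputs: one learned embedding per book, plus matching titles
book_vectors = np.random.rand(1000, 100).astype(np.float32)  # (n_books, dim)
book_titles = [f"book_{i}" for i in range(1000)]

def top_k_by_vibe(query_vector: np.ndarray, k: int = 10) -> list[str]:
    """Return the k books whose embeddings are most cosine-similar to the query."""
    norms = np.linalg.norm(book_vectors, axis=1) * np.linalg.norm(query_vector)
    scores = book_vectors @ query_vector / norms
    return [book_titles[i] for i in np.argsort(-scores)[:k]]

# "atmospheric, female lead, worldbuilding, funny" -> embed -> search
query_vector = np.random.rand(100).astype(np.float32)  # stand-in for an embedded query
print(top_k_by_vibe(query_vector))
```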
My approach is:
- Explore the data [Done]
  - Post 0: Working with the data in BigQuery
  - Post 1: Working with the data in Pandas
  - Post 2: Doing research with ChatGPT
- Build a baseline model in Word2Vec [In progress]; see the training sketch after this list
- Deploy the baseline model to "prod" (aka a single server) and test it out [In progress]
- Build a model using base BERT (or DistilBERT, etc.), deploy it, and evaluate the two models against each other
- At the same time, write a document about what embeddings are and how they fit into modern machine learning workflows
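The baseline training step itself is small. A sketch with Gensim, assuming each book's metadata (title, shelf labels, description) has already been tokenized into a list of tokens; the corpus here is a toy stand-in:

```python
from gensim.models import Word2Vec

# Toy corpus: one token list per book, built from titles, shelves, and descriptions
corpus = [
    ["taoist", "philosophy", "winnie", "the", "pooh", "spirituality", "non-fiction"],
    ["atmospheric", "female", "lead", "worldbuilding", "funny", "science-fiction"],
]

# Train a small skip-gram model; the dimensions and window here are arbitrary choices
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Inspect words near "worldbuilding" in the learned space
print(model.wv.most_similar("worldbuilding", topn=5))
```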
Since the project is actively in exploration and development, there are a lot of winding codepaths, experiments, and dead ends in the codebase. It is not production-grade for ANY definition of production. I'll let you know when it's ready.
For now, there are a couple of key directories:
notebooks
- Exploration and development of the input data, various concepts, algorithms, etc. The best resource there is this notebook, which covers the end-to-end workflow of starting with raw data, processing it in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I'm working towards for the first baseline production model.
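The Redis Search half of that workflow looks roughly like this; a minimal sketch, assuming a Redis instance with the RediSearch module, 100-dimensional vectors, and a made-up index name and key layout:

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Create an index with a vector field (requires the RediSearch module)
r.ft("books").create_index([
    TextField("title"),
    VectorField("embedding", "FLAT", {"TYPE": "FLOAT32", "DIM": 100, "DISTANCE_METRIC": "COSINE"}),
])

# Store a book: the embedding goes in as raw float32 bytes
vec = np.random.rand(100).astype(np.float32)
r.hset("book:89371", mapping={"title": "The Te Of Piglet", "embedding": vec.tobytes()})

# KNN query: the 5 books nearest to an embedded query vector
q = (
    Query("*=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("books").search(q, query_params={"vec": vec.tobytes()})
for doc in results.docs:
    print(doc.title, doc.score)
```

Storing embeddings as raw float32 bytes in hashes keeps writes cheap; note the KNN syntax needs query dialect 2.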
flask_server
- A model learned in Word2Vec and fastText from the code here (https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb), deployed on a tiny Flask server on a DigitalOcean droplet. This is not production-grade, but it allows for model serving and evaluation; a sketch of what a serving endpoint might look like follows the demo.
Demo here: word2vec_viberary.mov
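For flavor, a sketch of what a tiny serving endpoint like that could look like; the route, model path, and response shape here are assumptions for illustration, not the repo's actual code:

```python
import numpy as np
from flask import Flask, jsonify, request
from gensim.models import Word2Vec

app = Flask(__name__)
model = Word2Vec.load("word2vec_books.model")  # hypothetical model path

@app.route("/recommend")
def recommend():
    """Embed the query string and return the nearest items in the model's vocabulary."""
    query = request.args.get("q", "")
    tokens = [t for t in query.lower().split() if t in model.wv]
    if not tokens:
        return jsonify({"results": []})
    query_vec = np.mean([model.wv[t] for t in tokens], axis=0)
    results = model.wv.similar_by_vector(query_vec, topn=10)
    return jsonify({"results": [{"item": w, "score": float(s)} for w, s in results]})

if __name__ == "__main__":
    app.run(port=5000)
```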
word2vec
- Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. Annotated output is here.
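The heart of that PyTorch version is just two embedding tables scored by a dot product. A bare-bones sketch (not the annotated implementation itself):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Skip-gram with separate input/output embeddings, scored by dot product."""

    def __init__(self, vocab_size: int, dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)
        self.out_embed = nn.Embedding(vocab_size, dim)

    def forward(self, center: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Higher score = center and context words predicted to co-occur
        return (self.in_embed(center) * self.out_embed(context)).sum(dim=-1)

model = SkipGram(vocab_size=10_000)
center = torch.tensor([1, 2, 3])
context = torch.tensor([4, 5, 6])
scores = model(center, context)
# In full training, the positives above are contrasted with negative samples
loss = nn.functional.binary_cross_entropy_with_logits(scores, torch.ones(3))
```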
docs
- This serves and rebuilds viberary.pizza
api
- Me starting to learn Go for what will eventually be the production-grade server (ported from Flask).

Relevant literature:
- "Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning"
- "PinnerSage"
- "Making Machine Learning Easy with Embeddings"
- "Research Rabbit Collection"
The input data is the UCSD Book Graph, with the critical part being the user-generated shelf labels. Sample row below; note that all values are encoded as strings!

```json
{
"isbn": "0413675106",
"text_reviews_count": "2",
"series": [
"1070125"
],
"country_code": "US",
"language_code": "",
"popular_shelves": [
{
"count": "2979",
"name": "to-read"
},
{
"count": "291",
"name": "philosophy"
},
{
"count": "187",
"name": "non-fiction"
},
{
"count": "80",
"name": "religion"
},
{
"count": "76",
"name": "spirituality"
},
{
"count": "76",
"name": "nonfiction"
}
],
"asin": "",
"is_ebook": "false",
"average_rating": "3.81",
"kindle_asin": "",
"similar_books": [
"888460",
"734023",
"147311",
"219106",
"313972",
"238866",
"196325",
"200137",
"588008",
"112774",
"2355135",
"336248",
"520437",
"421044",
"870160",
"534289",
"64794",
"276697"
],
"description": "Taoist philosophy explained using examples from A A Milne's Winnie-the-Pooh.",
"format": "",
"link": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
"authors": [
{
"author_id": "27397",
"role": ""
}
],
"publisher": "",
"num_pages": "",
"publication_day": "",
"isbn13": "9780413675101",
"publication_month": "",
"edition_information": "",
"publication_year": "",
"url": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
"image_url": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png",
"book_id": "89371",
"ratings_count": "11",
"work_id": "41333541",
"title": "The Te Of Piglet",
"title_without_series": "The Te Of Piglet"
}
```
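Because every value arrives as a string, the first processing step is casting. A small sketch of loading the JSON-lines dump and coercing the fields that matter most (the filename is a placeholder):

```python
import json

def parse_book(raw: dict) -> dict:
    """Cast the string-encoded fields of one Goodreads row to usable types."""
    return {
        "book_id": int(raw["book_id"]),
        "title": raw["title"],
        "average_rating": float(raw["average_rating"]) if raw["average_rating"] else None,
        "ratings_count": int(raw["ratings_count"] or 0),
        "is_ebook": raw["is_ebook"] == "true",
        # Shelf labels: the signal we care most about, as (name, count) pairs
        "shelves": [(s["name"], int(s["count"])) for s in raw["popular_shelves"]],
        "description": raw["description"],
    }

with open("goodreads_books.json") as f:  # placeholder filename
    books = [parse_book(json.loads(line)) for line in f]
```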