Viberary

Viberary is a project that recommends books based not on genre or title, but on vibe. It does this by performing semantic search across a set of learned embeddings built from a dataset of Goodreads books and their metadata.

The idea is pretty simple: return book recommendations based on the vibe you put in. So instead of "I want science fiction", you'd enter a prompt like "atmospheric, female lead, worldbuilding, funny" and get back a list of books.
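To make the idea concrete, here is a minimal sketch of vibe search, assuming every query word and every book already has an embedding in the same vector space. The vectors, book titles, and helper names (`embed_query`, `cosine`, `recommend`) are made up for illustration; the real project learns its embeddings from the Goodreads data, and its ranking logic may differ.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, not learned from data).
WORD_VECTORS = {
    "atmospheric": [0.9, 0.1, 0.0],
    "funny": [0.0, 0.9, 0.1],
    "worldbuilding": [0.7, 0.0, 0.6],
}

# Toy book embeddings in the same space (hypothetical titles).
BOOK_VECTORS = {
    "Moody Space Opera": [0.8, 0.1, 0.5],
    "Office Comedy": [0.0, 1.0, 0.0],
    "Dry Textbook": [0.0, 0.0, 1.0],
}


def embed_query(prompt):
    """Average the vectors of the known words in a comma-separated prompt."""
    words = [w.strip().lower() for w in prompt.split(",")]
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        raise ValueError("no known words in prompt")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(prompt, k=2):
    """Return the k book titles whose vectors are closest to the prompt."""
    query = embed_query(prompt)
    ranked = sorted(
        BOOK_VECTORS,
        key=lambda title: cosine(query, BOOK_VECTORS[title]),
        reverse=True,
    )
    return ranked[:k]


print(recommend("atmospheric, worldbuilding", k=1))
```

Averaging word vectors is only one simple way to embed a multi-word prompt; it is used here because it needs nothing beyond the standard library.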

Reference implementation:

Actual Architecture:

My approach is:

  1. Explore the data [Done]
  2. Build a baseline model in Word2Vec [In progress]
  3. Deploy the baseline model to "prod" (aka a single server) and test it out [In progress]
  4. Build a model using base BERT (or DistilBERT, etc.), deploy that as well, and evaluate the two against each other.
  5. At the same time, write a document about what embeddings are and how they fit into modern machine learning workflows

Repo Structure

Since the project is actively in exploration and development, there are a lot of winding codepaths, experiments, and dead ends in the codebase. It is not production-grade for ANY definition of production. I'll let you know when it's ready.

For now, there are a couple key directories:

  • notebooks - Exploration and development of the input data, various concepts, algorithms, etc. The best resource there is this notebook, which covers the end-to-end workflow of starting with raw data, processing in DuckDB, learning a Word2Vec embeddings model, and storing and querying those embeddings in Redis Search. This is the solution I'm working towards for the first baseline production model.
  • flask_server - A model learned in Word2Vec and fastText from the code here (https://github.com/veekaybee/viberary/blob/main/notebooks/05_duckdb_0.7.1.ipynb) and deployed on a tiny Flask server on a DigitalOcean droplet. This is not production-grade, but allows for model serving and evaluation.

Demo here:

word2vec_viberary.mov
  • word2vec - Word2Vec implemented in PyTorch. I did this before I implemented Word2Vec in Gensim to learn about PyTorch idioms and paradigms. Annotated output is here.

  • docs - This serves and rebuilds viberary.pizza

  • api - Me starting to learn Go for what will eventually be the production-grade server (ported from Flask).

Relevant Literature and Bibliography

Input Data Sample

UCSD Book Graph, with the critical part being the user-generated shelf labels. Note that all values are encoded as strings! Sample row:

{
  "isbn": "0413675106",
  "text_reviews_count": "2",
  "series": [
    "1070125"
  ],
  "country_code": "US",
  "language_code": "",
  "popular_shelves": [
    {
      "count": "2979",
      "name": "to-read"
    },
    {
      "count": "291",
      "name": "philosophy"
    },
    {
      "count": "187",
      "name": "non-fiction"
    },
    {
      "count": "80",
      "name": "religion"
    },
    {
      "count": "76",
      "name": "spirituality"
    },
    {
      "count": "76",
      "name": "nonfiction"
    }
  ],
  "asin": "",
  "is_ebook": "false",
  "average_rating": "3.81",
  "kindle_asin": "",
  "similar_books": [
    "888460",
    "734023",
    "147311",
    "219106",
    "313972",
    "238866",
    "196325",
    "200137",
    "588008",
    "112774",
    "2355135",
    "336248",
    "520437",
    "421044",
    "870160",
    "534289",
    "64794",
    "276697"
  ],
  "description": "Taoist philosophy explained using examples from A A Milne's Winnie-the-Pooh.",
  "format": "",
  "link": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
  "authors": [
    {
      "author_id": "27397",
      "role": ""
    }
  ],
  "publisher": "",
  "num_pages": "",
  "publication_day": "",
  "isbn13": "9780413675101",
  "publication_month": "",
  "edition_information": "",
  "publication_year": "",
  "url": "https://www.goodreads.com/book/show/89371.The_Te_Of_Piglet",
  "image_url": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png",
  "book_id": "89371",
  "ratings_count": "11",
  "work_id": "41333541",
  "title": "The Te Of Piglet",
  "title_without_series": "The Te Of Piglet"
}
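Because every value arrives as a string, numeric fields need casting before any analysis, and empty strings need to be treated as missing. The sketch below shows one way to do that with the standard library. The helper names (`to_int`, `to_float`, `clean_row`) and the choice of fields are illustrative assumptions, not the project's actual preprocessing code.

```python
import json


def to_int(value):
    """Convert a string like "2979" to int; return None for an empty string."""
    return int(value) if value != "" else None


def to_float(value):
    """Convert a string like "3.81" to float; return None for an empty string."""
    return float(value) if value != "" else None


def clean_row(raw_json):
    """Parse one JSON row and cast a few string fields to proper types."""
    row = json.loads(raw_json)
    return {
        "book_id": to_int(row["book_id"]),
        "title": row["title"],
        "average_rating": to_float(row["average_rating"]),
        "ratings_count": to_int(row["ratings_count"]),
        "num_pages": to_int(row["num_pages"]),
        "is_ebook": row["is_ebook"] == "true",
        # Map each shelf label to its integer count.
        "shelves": {s["name"]: to_int(s["count"]) for s in row["popular_shelves"]},
    }


# A trimmed copy of the sample row above.
sample = """{
  "book_id": "89371",
  "title": "The Te Of Piglet",
  "average_rating": "3.81",
  "ratings_count": "11",
  "num_pages": "",
  "is_ebook": "false",
  "popular_shelves": [
    {"count": "2979", "name": "to-read"},
    {"count": "291", "name": "philosophy"}
  ]
}"""

cleaned = clean_row(sample)
print(cleaned)
```

Returning `None` for empty strings keeps "missing" distinct from zero, which matters for fields like `num_pages` that are blank in this row.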

Embeddings Sample

Screen Shot 2023-02-18 at 2 10 15 PM

About

Good books, good vibes
