
brendenpelkie/LLM_organic_synthesis

 
 


Extracting Structured Data from Free-form Organic Synthesis Text

Organic synthesis procedures are traditionally represented by free-form texts. This project explores how large language models can convert such unstructured texts to structured data, so they can be used for downstream data science or machine learning applications.

For more details, see

OpenAI models

Data processing and inference scripts for OpenAI models can be found in the folder models_openai. These models are fine-tuned on 300 data points and evaluated on a separate set of 50 data points.
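As an illustration of how such a fine-tuning dataset might be laid out, the sketch below splits paired records into 300 training and 50 evaluation examples and serializes them as prompt/completion JSON lines, the layout used by OpenAI's legacy fine-tuning for completion-style models. The function names, the separator and stop markers, and the toy records are assumptions made for illustration; they are not taken from models_openai.

```python
import json
import random

# Hypothetical separator and stop markers; the actual scripts may use others.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = " END"


def to_finetune_record(procedure_text, structured):
    """Build one prompt/completion pair from free text and its structured form."""
    return {
        "prompt": procedure_text.strip() + PROMPT_END,
        # Leading space plus a fixed stop marker on the completion side.
        "completion": " " + json.dumps(structured, sort_keys=True) + COMPLETION_END,
    }


def to_jsonl(rows):
    """Serialize (text, structured) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_finetune_record(t, s)) for t, s in rows)


def split_and_serialize(pairs, n_train=300, n_test=50, seed=0):
    """Shuffle the pairs and return JSONL strings for the train and test sets."""
    if len(pairs) < n_train + n_test:
        raise ValueError("not enough pairs for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    return to_jsonl(train), to_jsonl(test)


# Toy data standing in for real procedure/record pairs.
pairs = [(f"Procedure {i}: stir at room temperature.", {"id": i}) for i in range(350)]
train_jsonl, test_jsonl = split_and_serialize(pairs)
```

Keeping the evaluation examples disjoint from the training examples, as the fixed-seed shuffle above does, is what makes the 50-point evaluation meaningful.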

Demo page: OpenAI inference

The app in demo_apps/dash_app shows inference results from fine-tuned OpenAI models. An OpenAI API key is required in the deployment script.

Demo page: inferences on the test set

The app in demo_apps/github_page shows precomputed inference results from an OpenAI davinci model. It is a static page generated from Dash using Epix Zhang's code, and is synced to the github_page branch.

Synthesis procedure data

Throughout this project, organic synthesis procedures, free text or structured, are extracted from the Open Reaction Database. Related scripts can be found in the folder ord_data.
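One plausible way to pair free text with structured data is sketched below. It assumes a reaction has already been exported to a plain Python dict with the Open Reaction Database field names preserved, and that the free-text procedure lives under notes.procedure_details, as in the ORD schema. The split_reaction helper and the toy record are hypothetical and are not taken from ord_data.

```python
def split_reaction(reaction):
    """Separate the free-text procedure from the structured fields of one reaction dict."""
    notes = reaction.get("notes", {})
    procedure_text = notes.get("procedure_details", "")
    # Everything except the notes is treated as the structured target.
    structured = {key: value for key, value in reaction.items() if key != "notes"}
    return procedure_text, structured


# Toy record shaped loosely like an ORD reaction; the values are made up.
example_reaction = {
    "inputs": {
        "solvent": {
            "components": [
                {"identifiers": [{"type": "NAME", "value": "ethanol"}]}
            ]
        }
    },
    "conditions": {"temperature": {"setpoint": {"value": 25.0, "units": "CELSIUS"}}},
    "notes": {"procedure_details": "The mixture was stirred in ethanol at 25 C."},
}

text, structured = split_reaction(example_reaction)
```

Here text would serve as the model input and structured as the target the model is asked to reproduce.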

About

The current team (06/2023) includes:

  • Qianxiang Ai
  • Stefan Bringuier
  • Hassan Harb
  • Brenden Pelkie
  • Jacob N Sanders
  • Marcus Schwarting
  • Jiale Shi

This project was conceived during the LLM Hackathon on 2023/03/29. We thank Ben Blaiszik for his generous financial support of this project.
