Python 3 interface used to extract data from PubMed publications using LLMs, part of the PubLLican project.
Setup
Configuration
Running the workflow
-
Create and activate a virtual environment if your IDE does not do so automatically
-
Install package dependencies by running
pip install -r requirements.txt
-
Create a .env file by running
cp .env.example .env
Be careful, as this will overwrite your current .env file if you already have one set up
-
Add any API keys or other environment variables to the .env file
-
Create a config file by running
cp config.json.example config.json
Be careful, as this will overwrite your current config.json file if you already have one set up
-
Run the setup script by running
python setup.py
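The two cp commands above overwrite an existing .env or config.json. As a minimal sketch (not part of the project; the helper name copy_if_missing is hypothetical), the same copy can be done from Python without clobbering a file you have already edited:

```python
import shutil
from pathlib import Path


def copy_if_missing(example: str, target: str) -> bool:
    """Copy `example` to `target` only when `target` does not exist yet.

    Returns True if a copy was made, False if the existing file was kept.
    """
    target_path = Path(target)
    if target_path.exists():
        return False  # keep the user's existing file untouched
    shutil.copyfile(example, target_path)
    return True


if __name__ == "__main__":
    for example, target in [(".env.example", ".env"),
                            ("config.json.example", "config.json")]:
        if not Path(example).exists():
            print(f"{example} not found, skipping")
        elif copy_if_missing(example, target):
            print(f"created {target}")
        else:
            print(f"kept existing {target}")
```

Run from the repository root, this creates whichever of the two files is missing and leaves existing ones alone.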
Most settings can be configured in config.json if desired. The fields are largely self-explanatory.
In the config file, there is a field called "llm", which looks something like this:
{
    "llm": {
        "current": {
            "type": "anthropic",
            "model": "claude-3-haiku-20240307"
        }
    },
    "rest of config.json file..."
}
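As an illustration of how this field might be read (a sketch only; the project's own loading code may differ, and the function name current_llm is hypothetical):

```python
import json


def current_llm(config_text: str) -> tuple[str, str]:
    """Return (type, model) from the "llm" -> "current" entry of a config."""
    config = json.loads(config_text)
    current = config["llm"]["current"]
    return current["type"], current["model"]


# Example config containing only the "llm" field shown above.
example = """
{
    "llm": {
        "current": {
            "type": "anthropic",
            "model": "claude-3-haiku-20240307"
        }
    }
}
"""

llm_type, model = current_llm(example)
print(llm_type, model)  # anthropic claude-3-haiku-20240307
```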
-
The type parameter tells the llms package what kind of model it is and which code to run to work with that model. Here are the currently supported types:

| Type | Description | Requirements |
| --- | --- | --- |
| anthropic | Anthropic's language models, e.g. https://www.anthropic.com/claude | $ANTHROPIC_API_KEY environment variable must be set |
| openai | OpenAI's language models, e.g. ChatGPT | $OPENAI_API_KEY environment variable must be set |

-
The model parameter tells the API which specific model to use (if applicable). See the documentation for more details.
PRs adding support for more LLMs are welcome.
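The requirements listed above can be checked before running anything. A minimal sketch, assuming only the two supported types; the mapping and the function name missing_api_key are illustrative and not taken from the llms package:

```python
import os

# Environment variable required by each supported type (from the list above).
REQUIRED_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(llm_type, environ=None):
    """Return the name of the unset variable for `llm_type`, or None if set.

    Raises ValueError for a type that is not in REQUIRED_ENV.
    """
    environ = os.environ if environ is None else environ
    if llm_type not in REQUIRED_ENV:
        raise ValueError(f"unsupported LLM type: {llm_type!r}")
    name = REQUIRED_ENV[llm_type]
    return None if environ.get(name) else name


# The key value below is a placeholder, not a real credential.
print(missing_api_key("openai", {"OPENAI_API_KEY": "sk-example"}))  # None
print(missing_api_key("anthropic", {}))  # ANTHROPIC_API_KEY
```

Passing the environment as a dictionary keeps the check easy to try out; with no second argument it reads os.environ.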
-
Download the paper. There are two options:
- To get the paper JSON (preferred), run
python getPaperJSON.py <pmid>
- To get the paper PDF, run:
python getPaperPDF.py <pmid>
Note that not every publication will have a downloadable PDF, in which case getPaperJSON can be used instead
-
Convert the paper into plaintext
-
If getPaperJSON was used, run
python getTextFromJSON.py <pmid>
-
If getPaperPDF was used, run
python getTextFromPDF.py <pmid>
-
Query the LLM for the paper's species by running
python getPaperSpecies.py <pmid>
-
Query the LLM for the paper's genes by running
python getPaperGenes.py <pmid>
-
Query the LLM for the paper's GO terms by running
python getPaperGOTerms.py <pmid>
-
Validate the GO terms by running
python validateGOTermDescriptions.py <pmid>
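The steps above can be chained for a single PMID. A sketch, assuming the JSON route and that each script takes the PMID as its only argument, as in the commands above. The helper names build_commands and run_workflow are hypothetical, and the gene script is assumed to be named getPaperGenes.py:

```python
import subprocess
import sys

# Script order for the JSON route, following the steps above.
# "getPaperGenes.py" is an assumed spelling; adjust it to the actual file name.
JSON_ROUTE = [
    "getPaperJSON.py",
    "getTextFromJSON.py",
    "getPaperSpecies.py",
    "getPaperGenes.py",
    "getPaperGOTerms.py",
    "validateGOTermDescriptions.py",
]


def build_commands(pmid, scripts=JSON_ROUTE, python=sys.executable):
    """Return one argv list per workflow step for the given PMID."""
    return [[python, script, str(pmid)] for script in scripts]


def run_workflow(pmid):
    """Run each step in order, stopping at the first one that fails."""
    for command in build_commands(pmid):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(command, check=True)


# Preview the commands for a placeholder PMID without running them.
for command in build_commands("12345678", python="python"):
    print(" ".join(command))
```

Calling run_workflow with a real PMID from the repository root would execute the six scripts in sequence; for the PDF route, swap the first two entries for getPaperPDF.py and getTextFromPDF.py.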