Python 3 interface used to extract data from PubMed publications using LLMs, part of the PubLLican project.
See the Experimental branch for the latest updates, although it may not run without configuration changes.
Setup
Configuration
Running the workflow
-
Create and activate a virtual environment if your IDE does not do so automatically
-
Install package dependencies by running
pip install -r requirements.txt
-
Create a .env file by running
cp .env.example .env
Be careful, as this will overwrite your current .env file if you already have one set up
-
Add any API keys or other environment variables to the .env file
-
Create a config file by running
cp config.json.example config.json
Be careful, as this will overwrite your current config.json file if you already have one set up
-
Run the setup script by running
python setup.py
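Because both cp commands above overwrite an existing file, a small guard can help. The following is a minimal sketch, not part of the project: copy_if_missing is a hypothetical helper that copies each example file only when the target does not exist yet.

```python
import shutil
from pathlib import Path


def copy_if_missing(example: Path, target: Path) -> bool:
    """Copy example to target unless target already exists.

    Returns True if a copy was made, False if the target was left untouched.
    """
    if target.exists():
        return False
    shutil.copy(example, target)
    return True


if __name__ == "__main__":
    # File names are the ones used in the setup steps above.
    pairs = [(".env.example", ".env"), ("config.json.example", "config.json")]
    for example, target in pairs:
        if not Path(example).exists():
            print(f"{example}: not found, skipping")
            continue
        made = copy_if_missing(Path(example), Path(target))
        print(f"{target}: {'created' if made else 'already exists, left unchanged'}")
```

Run it from the repository root in place of the two cp commands; an existing .env or config.json is left as it is.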
Most settings can be changed in config.json, and the field names are meant to be self-explanatory.
The config file contains a field called "llm", which looks something like this:
{
  "llm": {
    "current": {
      "type": "anthropic",
      "model": "claude-3-haiku-20240307"
    }
  },
  "rest of config.json file..."
}
-
The type parameter tells the llms package which kind of model is being used and which code to run for it. The currently supported types are:

| Type | Description | Requirements |
| --- | --- | --- |
| anthropic | Anthropic's language models, e.g. Claude | $ANTHROPIC_API_KEY environment variable must be set |
| openai | OpenAI's language models, e.g. ChatGPT | $OPENAI_API_KEY environment variable must be set |
-
The model parameter tells the API which specific model to use (if applicable). See the documentation for more details.
PRs adding support for more LLMs are welcome
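As a quick sanity check of the settings described above, the sketch below reads config.json and reports whether the environment variable required by the selected type is set. It is an illustration only: check_llm_config and REQUIRED_ENV are hypothetical names, not part of the llms package, and it assumes the variables from .env have already been exported into the environment.

```python
import json
import os

# Environment variable required by each supported "type" (from the table above).
REQUIRED_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def check_llm_config(config: dict, environ=os.environ) -> list:
    """Return a list of problems found in the "llm" section (empty if none)."""
    problems = []
    current = config.get("llm", {}).get("current", {})
    llm_type = current.get("type")
    if llm_type not in REQUIRED_ENV:
        problems.append(f"unsupported llm type: {llm_type!r}")
    elif not environ.get(REQUIRED_ENV[llm_type]):
        problems.append(f"${REQUIRED_ENV[llm_type]} is not set")
    return problems


if __name__ == "__main__":
    if os.path.exists("config.json"):
        with open("config.json") as fh:
            issues = check_llm_config(json.load(fh))
        print("\n".join(issues) if issues else "llm configuration looks usable")
    else:
        print("config.json not found")
```

The model field is not validated here, since the text above notes it only applies to some types.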
-
Download the paper. There are two options:
- To get the paper JSON (preferred), run
python getPaperJSON.py <pmid>
- To get the paper PDF, run:
python getPaperPDF.py <pmid>
Note that not every publication has a downloadable PDF; in that case, use getPaperJSON instead.
-
Convert the paper into plaintext
-
If getPaperJSON was used, run
python getTextFromJSON.py <pmid>
-
If getPaperPDF was used, run
python getTextFromPDF.py <pmid>
-
Query the LLM for the paper's species by running
python getPaperSpecies.py <pmid>
-
Query the LLM for the paper's genes by running
python getPaperGenes.py <pmid>
-
Query the LLM for the paper's GO terms by running
python getPaperGOTerms.py <pmid>
-
Validate the GO terms by running
python validateGOTermDescriptions.py <pmid>
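The steps above can be chained in a small driver script. This is a hedged sketch rather than part of the project: workflow_commands and run_workflow are hypothetical helpers, run_workflow.py is a hypothetical file name, and the script names are taken from the steps listed here. It assumes each script is run from the repository root, in the order given, and exits with a non-zero status on failure.

```python
import subprocess
import sys


def workflow_commands(pmid: str, source: str = "json") -> list:
    """Build the command for each workflow step, in the order listed above."""
    if source not in ("json", "pdf"):
        raise ValueError("source must be 'json' or 'pdf'")
    suffix = "JSON" if source == "json" else "PDF"
    scripts = [
        f"getPaper{suffix}.py",       # download the paper
        f"getTextFrom{suffix}.py",    # convert it into plaintext
        "getPaperSpecies.py",
        "getPaperGenes.py",           # filename assumed; confirm the exact spelling in the repository
        "getPaperGOTerms.py",
        "validateGOTermDescriptions.py",
    ]
    return [[sys.executable, script, pmid] for script in scripts]


def run_workflow(pmid: str, source: str = "json") -> None:
    """Run every step in sequence, stopping at the first one that fails."""
    for cmd in workflow_commands(pmid, source):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    if len(sys.argv) >= 2:
        run_workflow(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "json")
    else:
        print("usage: python run_workflow.py <pmid> [json|pdf]")
```

Passing pdf as the second argument switches the first two steps to the PDF scripts; json is the default because the JSON route is the preferred one.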