A tool for crawling and downloading PubMed literature information, built on aiohttp, pandas, and XPath (lxml). It retrieves literature records matching the parameters you provide and downloads the corresponding PDF originals, at a speed of roughly one article per second. It extracts most of the available literature information and automatically generates an Excel file, including titles, abstracts, keywords, author lists, author affiliations, whether the article is free, whether it is a review, and so on.
After downloading completes, some information is stored in local text files for reference, and search data is stored in an sqlite3 database. When execution finishes, all information is exported automatically and an Excel file is generated.
Based on Python 3.9; higher versions are also supported. Mainly uses pandas, XPath (lxml), asyncio, and aiohttp.
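As a rough illustration of the kind of XPath-based extraction this stack enables (this is not the project's actual parsing code; the HTML snippet and field names are made up for the example):

from lxml import etree

# Hypothetical HTML standing in for a fetched PubMed result page
html = "<html><head><title>Example article</title></head><body><div class='abstract'>Example abstract text.</div></body></html>"

tree = etree.HTML(html)
title = tree.xpath("//title/text()")                      # ['Example article']
abstract = tree.xpath("//div[@class='abstract']/text()")  # ['Example abstract text.']
print(title[0], abstract[0])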
In the Releases section on the right side of the project page there is an executable packaged with Nuitka; it bundles all dependencies and can be downloaded and run directly from the Windows command line.
If you need to run it in a Python environment, please install the dependencies listed in the requirements.txt file:
asyncio~=3.4.3
aiohttp~=3.11.8
requests~=2.32.3
lxml~=5.3.0
pandas~=2.2.3
openpyxl~=3.1.5
colorlog~=6.9.0
- Clone the project and install the required dependencies from the command line. A Python virtual environment tool such as anaconda or miniconda3 is recommended.
git clone https://github.com/hiddenblue/Pubmedsoso.git
cd Pubmedsoso
pip install -r requirements.txt
If installing the git tool is inconvenient, you can instead download the ZIP from the Releases section on the right side of the page and unzip it before running.
- Switch to the project folder in the terminal and run python main.py with keyword parameters, or run the executable pubmedsoso.exe followed by the same arguments. For example:
python main.py -k headache -n 5 -d 10 -y 5
-k headache is the search keyword for this run. If your keyword contains spaces, enclose the whole keyword in double quotes. PubMed advanced queries such as "headache AND toothache" are supported, including the logical operators AND, NOT, and OR.
-n takes a number giving how many result pages to fetch, with 50 articles per page.
-d takes a number giving how many articles to download.
-y is an optional parameter giving the year range, in years, to restrict the search to. For example, -y 5 means literature from the last five years.
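Combining these options, an advanced query with a quoted keyword might look like this (the numbers are arbitrary examples):

python main.py -k "headache AND toothache" -n 2 -d 5 -y 3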
Each result page contains 50 search results, so there is no need to request a large page count; doing so only makes the run take longer.
Then enter the number of articles you need to download. The program finds the free PMC articles among the search results and downloads them automatically, at a speed that depends on your network condition.
Any single article download that exceeds 30 seconds times out automatically and the program skips to the next one.
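A per-article timeout like this could be built with asyncio's built-in timeout support; the following is only a sketch of the general pattern, not the project's actual code (download_pdf and fetch_one are placeholder names):

import asyncio
import aiohttp

async def download_pdf(session, url):
    # Placeholder downloader: fetch the response body for one article
    async with session.get(url) as resp:
        return await resp.read()

async def fetch_one(url):
    async with aiohttp.ClientSession() as session:
        try:
            # Give each article at most 30 seconds, then move on
            return await asyncio.wait_for(download_pdf(session, url), timeout=30)
        except asyncio.TimeoutError:
            print("Timed out, skipping:", url)
            return None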
PS > python main.py --help
usage: python main.py -k keyword
pubmedsoso is a python program for crawling article information and downloading pdf files
optional arguments:
--help, -h show this help message and exit
--version, -v use --version to show the version
--keyword, -k specify the keywords to search pubmed
--pagenum, -n add --pagenum or -n to specify the number of pages to search
--year, -y add --year or -y to specify the year scale you want
--downloadnum, -d a digit number to specify the number of articles to download
--directory, -D use a valid dir path to specify the pdf save directory
--output, -o add --output filename to specify the name of the pdf file
--loglevel, -l set the console log level, e.g. debug
If you are familiar with IDEs, you can also run main.py in a Python environment such as PyCharm or VSCode.
- When prompted, enter y or n to decide whether to run the program with the given parameters. pubmedsoso will then crawl and download in normal search order.
- The literature is downloaded automatically into the "document/pub/" folder mentioned earlier, and a txt file with the raw crawl information is generated. Finally, an Excel file is generated once the program finishes.
- WARNING
- Please do not crawl the PubMed website excessively. Because this project uses asynchronous I/O, it is capable of high concurrency. The parameters governing access speed can be set in config.py; the defaults are deliberately modest.
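For illustration only, such rate-limiting settings typically look something like the following; the names here are hypothetical, so check config.py for the real options:

# Hypothetical option names, not the actual contents of config.py
MAX_CONCURRENT_REQUESTS = 5    # upper bound on simultaneous connections
REQUEST_DELAY_SECONDS = 0.5    # pause inserted between successive requests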
This module makes it easy to export historical information to Excel after a crawl. It can be run separately, for example in an IDE or from the command line with python ExcelHelper.py.
Following the prompts, you can choose which historical records to export from the sqlite3 database; the exported file is generated locally and automatically. Excel files with duplicate names are not allowed, so delete the old file when prompted.
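Under the hood, this kind of export amounts to reading the sqlite3 database into pandas and writing it out via openpyxl. A minimal sketch (the database filename and table name below are assumptions for illustration, not the project's actual ones):

import sqlite3
import pandas as pd

# Assumed filename and table name; the project's real ones may differ
conn = sqlite3.connect("pubmedsoso.db")
df = pd.read_sql_query("SELECT * FROM literature", conn)
df.to_excel("export.xlsx", index=False)  # writing .xlsx requires openpyxl
conn.close()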
- ❏ Precise search and download (this is still a bit difficult)
- ✓ Custom keyword download, which required figuring out PubMed's search parameter URL generation rules (already implemented)
- ❏ Automatically completing downloads of non-free literature via SciHub, perhaps letting users write their own adapters
- ❏ A usable GUI
- ❏ Ideally, a free Baidu translation plugin, which might be useful sometimes
- ✓ Refactor the project using OOP and more modern tools
- ✓ Refactor the code using asynchronous methods to improve execution efficiency
- ✓ A usable logging system may also be needed
- ❏ A subscription-based, proactive mechanism to push the latest literature to users
Due to the peculiarities of the asyncio module, some issues may arise when debugging on Windows. If you need to develop and debug the code, two places need to be modified.
In GetSearchResult.py:
try:
    if platform.system() == "Windows":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    html_list = asyncio.run(WebHelper.getSearchHtmlAsync(param_list))
When debugging on Windows, please comment out the conditional statement above; otherwise it takes effect during debugging and causes errors.
Additionally, asyncio.run() is used in several places in the project. During debugging, its debug parameter needs to be enabled; otherwise the run gets stuck and raises a TypeError: 'Task' object is not callable error.
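Concretely, enabling debug mode is just a matter of passing the flag to asyncio.run (getSearchHtmlAsync here stands in for whichever coroutine is being run):

html_list = asyncio.run(WebHelper.getSearchHtmlAsync(param_list), debug=True)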
2022.5.16 Added automatic creation of the document/pub folder; there is no need to create it manually, as the program checks for it and creates it.
2023.08.05 Fixed the bug where abstract crawling failed; users no longer need to manually copy and paste webpage parameters.
2024.11.23 The author unexpectedly remembered this somewhat embarrassing project and quietly updated it: "Is this really the code I wrote? How could it be so bad?"
2024.12.02 Refactored the entire codebase around OOP, XPath, and asyncio; removed the runtime speed limit, making it roughly 100 times faster than before: "I'm so tired after writing this."