yuchen-lea/pdfhelper

Chinese documentation (中文文档)

Changelog

About

pdfhelper is a command-line tool for processing PDF files so that they fit into a variety of note-taking workflows. It currently offers the following capabilities:

  1. Managing the TOC: Export the table of contents from a PDF to a plain text list, edit it, and import it back into the PDF. The TOC file is easy to read and easy to modify. The format is described under "TOC format".
  2. Annotations Management:
    • Export formatted text annotations: Extract highlights, text notes, squares, and other annotation types from a PDF, capture the relevant regions of the document as images, and optionally run OCR on those images. Mako templates control the output format, so the exported annotations can be imported into your preferred note-taking system.
    • Manage XFDF Annotations: Export the annotations of a PDF to XFDF and import them back into the PDF. XFDF files can also be imported by PDF readers such as PDF-XChange. Because some OCR software flattens annotations during the OCR process, you can export to XFDF before OCR and import the XFDF afterwards to keep the annotations fully functional.
    • Delete annotations from the PDF: Remove all annotations so that you can share a clean copy of the original PDF with others.
  3. Page Label and Number Conversion: Convert page labels to page numbers and vice versa. Data is often stored as a physical page number, while readers navigate by the page label printed on the page. This feature bridges that gap.
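
To make the difference between page numbers and page labels concrete, here is a minimal Python sketch. It is not pdfhelper's actual implementation. It assumes the three label ranges used in the sample TOC file in this README: uppercase letters from page 1, uppercase Roman numerals with the prefix "p-" starting at II from page 8, and Arabic numerals from page 16. The names `RULES`, `page_label`, and `page_number` are hypothetical.

```python
def to_roman(n):
    """Format a positive integer as an uppercase Roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """PDF-style letter labels: A..Z, then AA..ZZ, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


FORMATTERS = {"letters": to_letters, "roman": to_roman, "decimal": str}

# Hypothetical rule table: (first physical page, style, prefix, first value).
# It mirrors "@label 1=A", "@label 8=[p-]II" and "@label 16=1".
RULES = [
    (1, "letters", "", 1),
    (8, "roman", "p-", 2),
    (16, "decimal", "", 1),
]


def page_label(page_number):
    """Return the label shown for a 1-based physical page number."""
    rule = None
    for candidate in RULES:  # RULES is sorted by first physical page
        if candidate[0] <= page_number:
            rule = candidate
    if rule is None:
        raise ValueError("no label rule covers page %d" % page_number)
    start, style, prefix, first = rule
    return prefix + FORMATTERS[style](first + page_number - start)


def page_number(label, page_count):
    """Return the first physical page with this label, or None."""
    for n in range(1, page_count + 1):
        if page_label(n) == label:
            return n
    return None


if __name__ == "__main__":
    for n in (1, 7, 8, 9, 16, 17):
        print(n, "->", page_label(n))
```

Searching page by page in `page_number` keeps the sketch short. A real tool would more likely read the label ranges stored in the PDF itself.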

Start

git clone --depth 1 https://github.com/yuchen-lea/pdfhelper.git
cd pdfhelper
make install
pip install -r requirements.txt

Then you can use the pdfhelper command-line tool!

For subsequent updates, simply pull the latest code from git and run make update.

Usage

pdfhelper -h

Some useful functions to process a PDF file.

positional arguments:
  {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,export-info,import-info,page-label-to-number,page-number-to-label}
    export-toc            Export the TOC of the PDF.
    import-toc            Import TOC from a file into the PDF.
    delete-annot          Delete annotations from the PDF.
    export-xfdf-annot     Export XFDF annotations of the PDF.
    import-xfdf-annot     Import XFDF annotations of the PDF.
    export-annot          Export formatted text annotations of the PDF.
    export-info           Export information of the PDF.
    import-info           Import information of the PDF.
    page-label-to-number  Convert page label to page number.
    page-number-to-label  Convert page number to page label.
  INFILE                  PDF file to process

options:
  -h, --help              show this help message and exit
  --version, -v           show program's version number and exit

TOC format

Sample TOC file:

@label 1=A
@label 8=[p-]II
@label 16=1
- Cover#1
- The Ten Commandments#2
- The Five Rules#3
- Contents#8
- Foreword#10
- Preface#12
# 1 = 16
- 1. Toys#2
- 2. Do It, Do It Again, and Again, and Again ...#14
- 3. Cons the Magnificent#32
# +2
- 4. Numbers Games#58

The sample shows four kinds of customization:

  1. Defining a Page Number: To bookmark page 3 with the title “The Five Rules”:
    - The Five Rules#3
        
    • 🙋 Note: The indentation of a list item determines the nesting level of its TOC entry.
  2. Setting the First Page: To bookmark physical page 17 with the title “1. Toys” (physical page 16 is treated as page 1, so page 2 resolves to 2 + 16 - 1 = 17):
    # 1 = 16
    - 1. Toys#2
        
    • 🙋 Note: The pattern “# number1=number2” treats physical page number2 of the PDF as page number1. Any page number x given afterwards points to physical page x + number2 - number1. Use it to set the first page number, for example “# 1=19”, or the starting page number of a second volume, for example “# 250=5”.
  3. Accounting for Page Gaps: To bookmark the title “4. Numbers Games” on page 75 (calculated as 58 + (16-1) + 2):
    # +2
    - 4. Numbers Games#58
        
    • This is useful when pages are missing or have been added. Where pages are missing (for instance, blank pages that count toward the pagination but were removed from the PDF), add “# -[number of missing pages]”. Where pages were added (for instance, illustration pages that do not count toward the pagination), add “# +[number of added pages]”.
  4. Setting the Labels: The “@label” lines at the top of the sample define how page labels are displayed:
    • Starting from page 1, the labels are uppercase letters. Page 1 displays as “A”, page 2 as “B”, and so on, up to page 7.
    • Starting from page 8, the labels switch to uppercase Roman numerals with the prefix “p-”, beginning at II. Page 8 displays as “p-II”, page 9 as “p-III”, and so on, up to page 15.
    • Starting from page 16, the labels are Arabic numerals. Page 16 displays as “1”, page 17 as “2”, and so on, until the end of the document.
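
The page arithmetic described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in this section, not pdfhelper's own parser. The function name `resolve_toc` is hypothetical, and the sketch assumes that a new “# number1=number2” line resets any earlier “# +n” or “# -n” adjustment, which the text above does not spell out.

```python
import re

SAMPLE = """\
@label 1=A
@label 8=[p-]II
@label 16=1
- Cover#1
- The Ten Commandments#2
- The Five Rules#3
- Contents#8
- Foreword#10
- Preface#12
# 1 = 16
- 1. Toys#2
- 2. Do It, Do It Again, and Again, and Again ...#14
- 3. Cons the Magnificent#32
# +2
- 4. Numbers Games#58
"""


def resolve_toc(text):
    """Return (indent, title, physical_page) for every TOC entry."""
    offset = 0  # added to every page number that follows
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("@label"):
            continue  # label rules do not change bookmark targets
        m = re.fullmatch(r"#\s*(\d+)\s*=\s*(\d+)", stripped)
        if m:  # "# number1=number2": physical page number2 counts as number1
            offset = int(m.group(2)) - int(m.group(1))
            continue
        m = re.fullmatch(r"#\s*([+-]\d+)", stripped)
        if m:  # "# +n" or "# -n": pages added or missing from here on
            offset += int(m.group(1))
            continue
        m = re.fullmatch(r"-\s*(.+)#(\d+)", stripped)
        if m:  # "- Title#page"; leading whitespace gives the nesting depth
            indent = len(line) - len(line.lstrip())
            entries.append((indent, m.group(1).strip(), int(m.group(2)) + offset))
    return entries


if __name__ == "__main__":
    for indent, title, page in resolve_toc(SAMPLE):
        print(" " * indent + title, "-> physical page", page)
```

With the sample file, “1. Toys” resolves to physical page 17 and “4. Numbers Games” to physical page 75, matching the calculations in items 2 and 3.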

Export Annotations

Currently, the following annotation types are supported:

| Type | Result |
| --- | --- |
| Text | comment |
| FreeText | comment |
| Square | comment + picture (set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.) |
| Highlight | comment + text (extract from the PDF) |
| Underline | comment + text (extract from the PDF) |
| Squiggly | comment + text (extract from the PDF) |
| StrikeOut | comment + text (extract from the PDF) |
| Ink | comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.) |
| Line | comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by --image-zoom) + text (extract from the PDF, or use the --ocr-service and --ocr-language to recognize text within images.) |

You can customize the note format with the following options:

  • --with-toc
  • --toc-list-item-format
  • --annot-list-item-format

Credits

This project is inspired by the following tool:
