Added docs content
Balearica committed Nov 23, 2024
1 parent d5d6bb0 commit a0cf6a2
Showing 16 changed files with 200 additions and 1 deletion.
Binary file added img/data_table_adv2_1.png
Binary file added img/data_table_adv2_2.png
Binary file added img/data_table_adv_1.png
Binary file added img/data_table_adv_2.png
Binary file added img/data_table_adv_3.png
Binary file added img/data_table_adv_4.png
Binary file added img/data_table_adv_5.png
Binary file added img/download_1.png
Binary file added img/edit_layout_1.png
Binary file added img/edit_layout_2.png
Binary file added img/edit_text_1.png
Binary file added img/recognize_prompt_1.png
2 changes: 1 addition & 1 deletion index.md
@@ -5,4 +5,4 @@ nav_order: 1
---

# Overview
Alch.io is a web application for extracting tabular data from scanned documents and PDF files. After importing a series of images or PDF document to alch.io, users can recognize text (if needed), select and edit regions containing tables, and export those tables as an Excel file.
Alch is a web application for extracting tabular data from scanned documents and PDF files. After importing a series of images or a PDF document to Alch, users can recognize text (if needed), select and edit regions containing tables, and export those tables as an Excel file.
45 changes: 45 additions & 0 deletions layout-controls.md
@@ -0,0 +1,45 @@
---
layout: default
title: Layout Controls
nav_order: 4
---

# Edit Table Layout

### Adding and Modifying Tables
- Add table.
    - Add a table by clicking the `Add Data Table` button and selecting the area where the table should be inserted.
- Delete table.
    - Delete a table by selecting the entire table (including all columns), right clicking, and selecting `Delete Table`.
- Resize table.
    - A table can be resized by dragging the controls that appear when the table is selected.
    - When the table is resized to exclude a column, that column is automatically deleted.

### Adding and Modifying Columns
- Adjusting column bounds.
    - Column boundaries can be adjusted by clicking the column separator and dragging it to the left or right.
- Combining columns.
    - Neighboring columns can be combined by selecting the columns, right clicking, and selecting `Combine Columns` from the context menu.
- Splitting columns.
    - A single column can be split into multiple columns by selecting the column, right clicking, and selecting `Split Column` from the context menu.
- Adding/deleting columns.
    - There are no "add column" or "delete column" buttons.
    - Columns can be added/deleted through a combination of resizing the table and splitting/combining existing columns.
    - When the table is resized to exclude a column, that column is automatically deleted.
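
The column operations above can be pictured with a small Python sketch. It assumes a table is just a left edge, a right edge, and a sorted list of separator positions; the `Table` class and its method names are hypothetical and are not taken from Alch itself. In particular, the sketch assumes a column counts as "excluded" once its separator falls outside the resized table.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical model: table bounds plus the separators between columns."""
    left: float
    right: float
    separators: list[float]  # x-positions between columns, sorted ascending

    def columns(self) -> list[tuple[float, float]]:
        """Return each column as a (left, right) pair."""
        edges = [self.left, *self.separators, self.right]
        return list(zip(edges[:-1], edges[1:]))

    def split(self, x: float) -> None:
        """Split the column containing x by adding a separator at x."""
        if not self.left < x < self.right or x in self.separators:
            raise ValueError("split position must fall strictly inside a column")
        self.separators = sorted([*self.separators, x])

    def combine(self, first: int, last: int) -> None:
        """Combine columns first..last (inclusive) by dropping the separators between them."""
        # Separator i sits between column i and column i + 1.
        self.separators = [
            s for i, s in enumerate(self.separators) if not first <= i < last
        ]

    def resize(self, left: float, right: float) -> None:
        """Resize the table; separators outside the new bounds are dropped (assumption)."""
        self.left, self.right = left, right
        self.separators = [s for s in self.separators if left < s < right]
```

Under this model, "adding" a column is a split and "deleting" one is either a combine or a resize that leaves the column outside the table, which matches the absence of dedicated add/delete buttons.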

### Set Default Page Layout
It is possible to set a default layout, which is applied to all pages that have not been edited manually. Setting a default makes it easy to process documents such as invoices or reports, where dozens of pages may share the same layout.

1. To make the layout from the current page the default, click `Save As Default`.
2. To discard all edits made to an individual page, reverting it to the default, click `Revert To Default`.
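
The relationship between the default layout and per-page edits can be sketched as a simple fallback lookup. This is an illustration only: the `Document` class, its method names, and the use of a dictionary for manual edits are assumptions, not a description of how Alch stores layouts.

```python
class Document:
    """Hypothetical model of default and per-page layouts."""

    def __init__(self):
        self.default_layout = None
        self.page_layouts = {}  # page index -> manually edited layout

    def edit_page(self, page, layout):
        """Record a manual edit for one page."""
        self.page_layouts[page] = layout

    def save_as_default(self, page):
        """Make the layout currently shown on `page` the default."""
        self.default_layout = self.layout_for(page)

    def revert_to_default(self, page):
        """Discard manual edits so the page falls back to the default."""
        self.page_layouts.pop(page, None)

    def layout_for(self, page):
        """A manually edited page keeps its own layout; every other page uses the default."""
        return self.page_layouts.get(page, self.default_layout)
```

In this sketch, pages without an entry in `page_layouts` pick up any later change to the default automatically, while manually edited pages are left alone until they are reverted.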

# Text Assignment to Columns
By default, individual words are assigned to the column they overlap the most with. While this behavior is generally correct, users can modify how words are assigned to columns by right clicking column(s) and selecting options in the `Overlap Rules` drop-down menu. Specifically, the following properties can be modified.

- Is text assigned to columns on a word-by-word basis, or should entire lines be assigned to columns?
    - Select `word` to assign text to column by word; select `line` to assign entire lines to the same column.
- Is text assigned to columns based on where the text starts, or based on where the majority of the text is found?
    - Select `left` to assign text to the column where the text starts; select `majority` to assign text to the column it overlaps the most with.
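
The two rules above can be illustrated with a minimal Python sketch. It assumes every word and column is reduced to a horizontal `(left, right)` extent, and it applies one rule set to a whole line for simplicity, whereas in Alch the rules are set per column. The function names and the `unit`/`anchor` parameters are hypothetical; this is not Alch's implementation.

```python
def overlap(a_left, a_right, b_left, b_right):
    """Width of the horizontal overlap between two extents (0 if disjoint)."""
    return max(0.0, min(a_right, b_right) - max(a_left, b_left))


def assign_column(box, columns, anchor="majority"):
    """Return the index of the column that `box` belongs to, or None.

    box: (left, right) extent of a word or line.
    columns: list of (left, right) column extents.
    anchor: "left" uses where the text starts; "majority" uses the largest overlap.
    """
    left, right = box
    if anchor == "left":
        for i, (c_left, c_right) in enumerate(columns):
            if c_left <= left < c_right:
                return i
        return None
    overlaps = [overlap(left, right, c_l, c_r) for c_l, c_r in columns]
    best = max(range(len(columns)), key=overlaps.__getitem__, default=None)
    if best is None or overlaps[best] == 0:
        return None
    return best


def assign_line(words, columns, unit="word", anchor="majority"):
    """Assign the words of one line to columns.

    words: list of (text, left, right) tuples for a single line.
    unit: "word" assigns each word on its own; "line" assigns the whole line together.
    """
    if unit == "line":
        line_box = (min(w[1] for w in words), max(w[2] for w in words))
        col = assign_column(line_box, columns, anchor)
        return [(w[0], col) for w in words]
    return [(w[0], assign_column((w[1], w[2]), columns, anchor)) for w in words]
```

With `unit="line"` and `anchor="left"`, a long line that starts inside a narrow first column is kept together in that column even though most of its width lies over other columns.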

# Downloading Data
To download tables in a tabular format, navigate to the `Download` tab, and then set the format to `.xlsx`. Excel (`.xlsx`) is currently the only supported format for writing tabular data.
81 changes: 81 additions & 0 deletions tables-walkthrough.md
@@ -0,0 +1,81 @@
---
layout: default
title: Table Extraction - Walkthrough
nav_order: 2
---

# Overview
This page provides a walkthrough for new users who want to extract tables from PDFs. For more detailed reference material, see the [Text Controls]({{ site.baseurl }}/text-controls.html) and [Layout Controls]({{ site.baseurl }}/layout-controls.html) pages.

# Example: Simple Table
In this example, we will extract a single table from an Amazon 8K filing. This table is simple, so extracting the data only takes a few steps.

### Step 1: Upload Document, Create Text Layer
Start by uploading your document. Documents may either be (1) a single PDF file or (2) a series of PNG or JPEG images containing scanned pages.

If the uploaded document is a series of images or an image-native PDF, you will be prompted to run OCR to recognize text. This step is optional when uploading an image-native PDF document that already contains a text layer, and is skipped entirely for text-native PDF documents.

![recognize_prompt_1.png]({{ site.baseurl }}/img/recognize_prompt_1.png)

### Step 2: Proofread Text Layer
If text recognition was suggested or required during the import step, the input is image-native rather than text-native. This means the text layer may contain errors. If accuracy is critical for your application, you should review the text layer and correct any errors before proceeding. If text recognition was not suggested during the import step, the input is a text-native PDF, and this step can be skipped.

To proofread the text layer, open the `Edit Text` tab. Text is not editable when the `Edit Layout` tab is open. Next, review the colored text layer that is printed over the document. Special attention should be paid to text printed in red, as this was flagged as low-confidence by the built-in recognition program.

![edit_text_1.png]({{ site.baseurl }}/img/edit_text_1.png)

Basic controls for editing are listed below.

- Edit text by double-clicking a word to enable editing.
- Delete text by selecting word(s), right clicking, and selecting `Delete Words`.
- Recognize additional words by clicking `Edit Text` > `Recognize Word` and then selecting the area around the word.
- Select `Edit Text` > `Recognize Area` if the region contains multiple words.

A full list of controls can be found on the [Text Controls]({{ site.baseurl }}/text-controls.html) page.

### Step 3: Add Table Layout
Once a high-quality text layer exists, tables can be identified and extracted. Open the `Edit Layout` tab to add new tables. After identifying a table in the document, select `Edit Layout` > `Add Data Table` and draw a rectangle over the entire table. The table is represented by a colored rectangle, with different shades representing different columns.

![edit_layout_1.png]({{ site.baseurl }}/img/edit_layout_1.png)

Next, edit the table layout until the column bounds are correct. Basic controls for editing a table layout are listed below.
- Split a column by right clicking where it should be split and selecting `Split Column`.
- Combine multiple columns by selecting both, right clicking, and selecting `Merge Columns`.
- Resize the table and columns by clicking and dragging the table or column bounds.

![edit_layout_2.png]({{ site.baseurl }}/img/edit_layout_2.png)

A full list of controls can be found on the [Layout Controls]({{ site.baseurl }}/layout-controls.html) page.

### Step 4: Export
Select `Download` > `Download` to export the tables identified in previous steps as a `.xlsx` workbook.

![download_1.png]({{ site.baseurl }}/img/download_1.png)

# Common Special Cases

## Values that Span Multiple Columns
Some tables include single entries that span the width of multiple columns. For example, below is part of a different table from the Amazon 8K filing showing shareholder votes by proposal. The description of each proposal spans the width of all columns.
![data_table_adv_1.png]({{ site.baseurl }}/img/data_table_adv_1.png)

By default, individual words are assigned to the column they overlap the most with. However, this behavior is undesirable in this case, as it results in the proposal descriptions being split up and assigned to the same columns as the vote totals.
![data_table_adv_2.png]({{ site.baseurl }}/img/data_table_adv_2.png)

To handle this case, start by creating a new column that includes only the start of the proposal description. Next, select the new column, open the `Set Overlap Rules` drop-down menu, and set the rules to `Left` and `Line`. This tells Alch to assign to the selected column every line whose left bound falls inside it.
![data_table_adv_3.png]({{ site.baseurl }}/img/data_table_adv_3.png)

We can confirm this change worked as expected by checking the viewer. All proposal descriptions are now highlighted the same color as the first column, indicating they are all being assigned to the first column.
![data_table_adv_4.png]({{ site.baseurl }}/img/data_table_adv_4.png)

The resulting `.xlsx` file is shown below. Basic cleaning steps in a program such as Excel, R, or Python can be used to produce a dataset where each row contains a proposal description and vote totals.
![data_table_adv_5.png]({{ site.baseurl }}/img/data_table_adv_5.png)
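
As one possible version of the cleaning step in Python, the sketch below carries each proposal description down onto the vote rows that follow it. It assumes the exported rows have already been read into a list of lists, that a description row has text in its first cell and nothing else, and that vote rows leave the first cell empty; the function name is hypothetical and the layout of a real export may differ.

```python
def attach_descriptions(rows):
    """Prefix each data row with the most recent description row above it.

    Assumption: a description row has a non-empty first cell and all other
    cells empty; every other row is treated as a data row whose first cell
    is empty and can be replaced by the description.
    """
    cleaned = []
    description = None
    for first, *rest in rows:
        if first and all(cell in ("", None) for cell in rest):
            description = first  # remember it for the rows that follow
        else:
            cleaned.append([description, *rest])
    return cleaned
```

For example, calling `attach_descriptions` on rows such as `["Proposal A", "", ""]` followed by `["", "10", "2"]` (placeholder values, not figures from the filing) yields one row per proposal holding the description and its totals.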

## Layouts that Span Multiple Pages
When a single table layout applies to most or all pages within a document, it is not necessary to re-draw the layout on every page. Instead, the current layout can be set as "default" by clicking `Save As Default`. The default layout is applied to all pages that have not been edited manually.
![data_table_adv2_1.png]({{ site.baseurl }}/img/data_table_adv2_1.png)

### Applying Layouts to a Subset of Pages
It is not currently possible to automatically apply a layout to a subset of pages. The only way to apply a layout to multiple pages is by setting it as default, which applies it to all pages which have not been individually edited. However, as data can be subset to a specific page range during the download step, applying the default layout to unneeded pages is generally not problematic.

For example, say that a 100-page document contains tables on pages 40-60, and all tables have the same layout. This document could be processed by setting a single default layout and restricting the output to pages 40-60.
![data_table_adv2_2.png]({{ site.baseurl }}/img/data_table_adv2_2.png)
73 changes: 73 additions & 0 deletions text-controls.md
@@ -0,0 +1,73 @@
---
layout: default
title: Text Controls
nav_order: 3
---

# Navigation
### Pan
Users can pan using the following methods.
- Holding down `Mouse middle` and dragging the mouse.
- Using a 2 finger "pan" gesture (on touch pads).
- Using a 1 finger "pan" gesture (on mobile devices).

### Zoom
Users can zoom in/out using the following methods.
- Using `Ctrl + Scroll wheel`.
- Using pinch gesture (touch pads and mobile devices only).
- Using the `+` and `-` buttons on the interface (next to `prev`/`next`).
- Using `Ctrl + +` and `Ctrl + -` keyboard shortcuts.

# Words
### Select
Individual words can be selected by clicking or tapping them. Groups of words can be selected by clicking and dragging to create a selection box. Words can be added to an existing selection by holding down `Ctrl` when selecting them. There is currently no way to select multiple words on mobile devices.

Words can also be selected using the keyboard arrow keys. For a full list of shortcuts for selecting words, see the [Keyboard Shortcut Cheat Sheet section](#select-words).

### Split Words
A single word can be split into two words by positioning the cursor where the word should be split, right clicking, and selecting `Split Word`.

### Combine Words
Multiple adjacent words can be combined into a single word by selecting the words, right clicking, and selecting `Merge Words`.

# Keyboard Shortcut Cheat Sheet
### General

| Shortcut | Action |
|---------------------------|---------------------------------------------|
| `Ctrl + +` | Zoom in |
| `Ctrl + -` | Zoom out |
| `PageUp` | Previous page |
| `PageDown` | Next page |

### Select Words

| Shortcut | Action |
|---------------------------|---------------------------------------------|
| `Tab` | Select next word[^next-word] |
| `Shift + Tab` | Select previous word |
| `ArrowRight` | Select word to right |
| `ArrowLeft` | Select word to left |
| `ArrowUp` | Select word above selected word |
| `ArrowDown` | Select word beneath selected word |
| `Shift + ArrowRight` | Expand selection to right |
| `Shift + ArrowLeft` | Expand selection to left |

### Edit Words
Once words have been selected, shortcuts using the `Ctrl` modifier can be used to edit them.

| Shortcut | Action |
|---------------------------|---------------------------------------------|
| `Ctrl + i` | Toggle italic font style |
| `Ctrl + b` | Toggle bold font style |
| `Ctrl + Alt + +` | Increase word font size |
| `Ctrl + Alt + -` | Decrease word font size |
| `Ctrl + ArrowLeft` | Move the word's left bound to the left |
| `Ctrl + ArrowRight` | Move the word's left bound to the right |
| `Ctrl + Alt + ArrowLeft` | Move the word's right bound to the left |
| `Ctrl + Alt + ArrowRight` | Move the word's right bound to the right |
| `Enter` | Start/stop editing word text (from start) |
| `Alt + Enter` | Start/stop editing word text (from end) |
| `Ctrl + Delete` | Delete word(s) |

[^next-word]: The "next word" (selected with `Tab`) is selected based on what Alch believes the reading order is. After reaching the end of a line, the `Tab` shortcut will jump to whatever it believes the next line in the document is, whereas `ArrowRight` will go to the word visually to the right (if any). The "next word" can be unpredictable for documents without an unambiguous reading order--for example, data tables or documents with many floating elements.
