Use Azure Form Recognizer as document preprocessing to extract text, tables, and document layout (Azure-Samples#37)

* Add Form Recognizer integration. Keep local PDF parser option.

* Add conversion of Form Recognizer tables into HTML tables understandable by ChatGPT
Add table-splitting logic so that tables are split across sections less often.

* Add FormRecognizer service into bicep deployments
Add Cognitive Services User role to make Form Recognizer work with DefaultAzureCredential.

* Add Form Recognizer service as parameters for prepdocs script.

* Add image of the table with health plan costs to Benefit_Options.pdf. Users can now ask questions such as:
- What is cost difference between plans?
- I don't have any dependents. What would be savings if I switch to Standard?

* Add additional prompt to return tabular data as html table.
Add table format for answer in CSS.

* Update ReadMe with information about Form Recognizer cost.

* Fix spellings

* Add html escaping inside html table generation
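
The table conversion and escaping described above can be sketched as follows. This is a minimal illustration, not the commit's code: `Cell` is a hypothetical stand-in for the table cell objects returned by Form Recognizer, `cells_to_html` is an illustrative name, and the sample plan data is made up.

```python
import html
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for a Form Recognizer table cell (assumption).
    row_index: int
    column_index: int
    content: str
    kind: str = "content"
    row_span: int = 1
    column_span: int = 1


def cells_to_html(cells, row_count):
    """Render cells as an HTML table, escaping the cell text."""
    parts = ["<table>"]
    for r in range(row_count):
        # Collect the cells of this row in left-to-right order.
        row = sorted((c for c in cells if c.row_index == r),
                     key=lambda c: c.column_index)
        parts.append("<tr>")
        for c in row:
            tag = "th" if c.kind in ("columnHeader", "rowHeader") else "td"
            spans = ""
            if c.column_span > 1:
                spans += f" colSpan={c.column_span}"
            if c.row_span > 1:
                spans += f" rowSpan={c.row_span}"
            # html.escape keeps characters such as "<" and "&" in the cell
            # text from being read as markup.
            parts.append(f"<{tag}{spans}>{html.escape(c.content)}</{tag}>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Made-up sample data for illustration only.
cells = [
    Cell(0, 0, "Plan", kind="columnHeader"),
    Cell(0, 1, "Cost", kind="columnHeader"),
    Cell(1, 0, "Standard"),
    Cell(1, 1, "<$50 & up"),
]
print(cells_to_html(cells, row_count=2))
```

Without the escaping step, a cell such as `<$50 & up` would leave a stray `<` inside the generated markup.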
anatolip authored Mar 22, 2023
1 parent 1273a21 commit 6ac7c90
Showing 13 changed files with 206 additions and 32 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -22,7 +22,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap

> **IMPORTANT:** In order to deploy and run this example, you'll need an **Azure subscription with access enabled for the Azure OpenAI service**. You can request access [here](https://aka.ms/oaiapply). You can also visit [here](https://azure.microsoft.com/free/cognitive-search/) to get some free Azure credits to get you started.
> **AZURE RESOURCE COSTS** by default this sample will create Azure App Service and Azure Cognitive Search resources that have a monthly cost. You can switch them to free versions of each of them if you want to avoid this cost by changing the parameters file under the infra folder (though there are some limits to consider; for example, you can have up to 1 free Cognitive Search resource per subscription.)
> **AZURE RESOURCE COSTS** by default this sample will create Azure App Service and Azure Cognitive Search resources that have a monthly cost, as well as a Form Recognizer resource that has a cost per document page. You can switch them to free versions of each of them if you want to avoid this cost by changing the parameters file under the infra folder (though there are some limits to consider; for example, you can have up to 1 free Cognitive Search resource per subscription, and the free Form Recognizer resource only analyzes the first 2 pages of each document.)
### Prerequisites

3 changes: 2 additions & 1 deletion app/backend/approaches/chatreadretrieveread.py
@@ -10,7 +10,8 @@
class ChatReadRetrieveReadApproach(Approach):
prompt_prefix = """<|im_start|>system
Assistant helps the company employees with their healthcare plan questions, and questions about the employee handbook. Be brief in your answers.
Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below. If asking a clarifying question to the user would help, ask the question.
Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below. If asking a clarifying question to the user would help, ask the question.
For tabular information return it as an html table. Do not return markdown format.
Each source has a name followed by colon and the actual information, always include the source name for each fact you use in the response. Use square brackets to reference the source, e.g. [info1.txt]. Don't combine sources, list each source separately, e.g. [info1.txt][info2.pdf].
{follow_up_questions_prompt}
{injected_prompt}
1 change: 1 addition & 0 deletions app/backend/approaches/readretrieveread.py
@@ -21,6 +21,7 @@ class ReadRetrieveReadApproach(Approach):
template_prefix = \
"You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions. " \
"Answer the question using only the data provided in the information sources below. " \
"For tabular information return it as an html table. Do not return markdown format. " \
"Each source has a name followed by colon and the actual data, quote the source name for each piece of data you use in the response. " \
"For example, if the question is \"What color is the sky?\" and one of the information sources says \"info123: the sky is blue whenever it's not cloudy\", then answer with \"The sky is blue [info123]\" " \
"It's important to strictly follow the format where the name of the source is in square brackets at the end of the sentence, and only up to the prefix before the colon (\":\"). " \
1 change: 1 addition & 0 deletions app/backend/approaches/retrievethenread.py
@@ -13,6 +13,7 @@ class RetrieveThenReadApproach(Approach):
"You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions. " + \
"Use 'you' to refer to the individual asking the questions even if they ask with 'I'. " + \
"Answer the following question using only the data provided in the sources below. " + \
"For tabular information return it as an html table. Do not return markdown format. " + \
"Each source has a name followed by colon and the actual information, always include the source name for each fact you use in the response. " + \
"If you cannot answer using the sources below, say you don't know. " + \
"""
10 changes: 10 additions & 0 deletions app/frontend/src/components/Answer/Answer.module.css
@@ -19,6 +19,16 @@
white-space: pre-line;
}

.answerText table {
border-collapse: collapse;
}

.answerText td,
.answerText th {
border: 1px solid;
padding: 5px;
}

.selected {
outline: 2px solid rgba(115, 118, 225, 1);
}
Binary file modified data/Benefit_Options.pdf
Binary file not shown.
26 changes: 26 additions & 0 deletions infra/core/ai/formrecognizer.bicep
@@ -0,0 +1,26 @@
param name string
param location string = resourceGroup().location
param tags object = {}

param customSubDomainName string = name
param kind string = 'FormRecognizer'
param publicNetworkAccess string = 'Enabled'
param sku object = {
name: 'S0'
}

resource account 'Microsoft.CognitiveServices/accounts@2022-10-01' = {
name: name
location: location
tags: tags
kind: kind
properties: {
customSubDomainName: customSubDomainName
publicNetworkAccess: publicNetworkAccess
}
sku: sku
}

output endpoint string = account.properties.endpoint
output id string = account.id
output name string = account.name
37 changes: 37 additions & 0 deletions infra/main.bicep
@@ -31,6 +31,13 @@ param openAiResourceGroupName string = ''
param openAiResourceGroupLocation string = location

param openAiSkuName string = 'S0'

param formRecognizerServiceName string = ''
param formRecognizerResourceGroupName string = ''
param formRecognizerResourceGroupLocation string = location

param formRecognizerSkuName string = 'S0'

param gptDeploymentName string = 'davinci'
param gptModelName string = 'text-davinci-003'
param chatGptDeploymentName string = 'chat'
@@ -54,6 +61,10 @@ resource openAiResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' exi
name: !empty(openAiResourceGroupName) ? openAiResourceGroupName : resourceGroup.name
}

resource formRecognizerResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(formRecognizerResourceGroupName)) {
name: !empty(formRecognizerResourceGroupName) ? formRecognizerResourceGroupName : resourceGroup.name
}

resource searchServiceResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(searchServiceResourceGroupName)) {
name: !empty(searchServiceResourceGroupName) ? searchServiceResourceGroupName : resourceGroup.name
}
@@ -140,6 +151,19 @@ module openAi 'core/ai/cognitiveservices.bicep' = {
}
}

module formrecognizer 'core/ai/formrecognizer.bicep' = {
name: 'formrecognizer'
scope: formRecognizerResourceGroup
params: {
name: !empty(formRecognizerServiceName) ? formRecognizerServiceName : '${abbrs.cognitiveServicesFormRecognizer}${resourceToken}'
location: formRecognizerResourceGroupLocation
tags: tags
sku: {
name: formRecognizerSkuName
}
}
}

module searchService 'core/search/search-services.bicep' = {
name: 'search-service'
scope: searchServiceResourceGroup
@@ -194,6 +218,16 @@ module openAiRoleUser 'core/security/role.bicep' = {
}
}

module formRecognizerRoleUser 'core/security/role.bicep' = {
scope: formRecognizerResourceGroup
name: 'formrecognizer-role-user'
params: {
principalId: principalId
roleDefinitionId: 'a97b65f3-24c7-4388-baec-2e87135dc908'
principalType: 'User'
}
}

module storageRoleUser 'core/security/role.bicep' = {
scope: storageResourceGroup
name: 'storage-role-user'
@@ -274,6 +308,9 @@ output AZURE_OPENAI_RESOURCE_GROUP string = openAiResourceGroup.name
output AZURE_OPENAI_GPT_DEPLOYMENT string = gptDeploymentName
output AZURE_OPENAI_CHATGPT_DEPLOYMENT string = chatGptDeploymentName

output AZURE_FORMRECOGNIZER_SERVICE string = formrecognizer.outputs.name
output AZURE_FORMRECOGNIZER_RESOURCE_GROUP string = formRecognizerResourceGroup.name

output AZURE_SEARCH_INDEX string = searchIndexName
output AZURE_SEARCH_SERVICE string = searchService.outputs.name
output AZURE_SEARCH_SERVICE_RESOURCE_GROUP string = searchServiceResourceGroup.name
9 changes: 9 additions & 0 deletions infra/main.parameters.json
@@ -20,6 +20,15 @@
"openAiSkuName": {
"value": "S0"
},
"formRecognizerServiceName": {
"value": "${AZURE_FORMRECOGNIZER_SERVICE}"
},
"formRecognizerResourceGroupName": {
"value": "${AZURE_FORMRECOGNIZER_RESOURCE_GROUP}"
},
"formRecognizerSkuName": {
"value": "S0"
},
"searchServiceName": {
"value": "${AZURE_SEARCH_SERVICE}"
},
2 changes: 1 addition & 1 deletion scripts/prepdocs.ps1
@@ -32,4 +32,4 @@ Start-Process -FilePath $venvPythonPath -ArgumentList "-m pip install -r ./scrip

Write-Host 'Running "prepdocs.py"'
$cwd = (Get-Location)
Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --tenantid $env:AZURE_TENANT_ID -v" -Wait -NoNewWindow
Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE --tenantid $env:AZURE_TENANT_ID -v" -Wait -NoNewWindow
144 changes: 116 additions & 28 deletions scripts/prepdocs.py
@@ -1,6 +1,7 @@
import os
import argparse
import glob
import html
import io
import re
import time
@@ -11,6 +12,7 @@
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import *
from azure.search.documents import SearchClient
from azure.ai.formrecognizer import DocumentAnalysisClient

MAX_SECTION_LENGTH = 1000
SENTENCE_SEARCH_LIMIT = 100
@@ -32,6 +34,9 @@
parser.add_argument("--searchkey", required=False, help="Optional. Use this Azure Cognitive Search account key instead of the current user identity to login (use az login to set current user for Azure)")
parser.add_argument("--remove", action="store_true", help="Remove references to this document from blob storage and the search index")
parser.add_argument("--removeall", action="store_true", help="Remove all blobs from blob storage and documents from the search index")
parser.add_argument("--localpdfparser", action="store_true", help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Form Recognizer service to extract text, tables and layout from the documents")
parser.add_argument("--formrecognizerservice", required=False, help="Optional. Name of the Azure Form Recognizer service which will be used to extract text, tables and layout from the documents (must exist already)")
parser.add_argument("--formrecognizerkey", required=False, help="Optional. Use this Azure Form Recognizer account key instead of the current user identity to login (use az login to set current user for Azure)")
parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
args = parser.parse_args()

@@ -41,24 +46,42 @@
search_creds = default_creds if args.searchkey == None else AzureKeyCredential(args.searchkey)
if not args.skipblobs:
storage_creds = default_creds if args.storagekey == None else args.storagekey
if not args.localpdfparser:
# check if Azure Form Recognizer credentials are provided
if args.formrecognizerservice == None:
print("Error: Azure Form Recognizer service is not provided. Please provide formrecognizerservice or use --localpdfparser for local pypdf parser.")
exit(1)
formrecognizer_creds = default_creds if args.formrecognizerkey == None else AzureKeyCredential(args.formrecognizerkey)

def blob_name_from_file_page(filename, page):
return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".pdf"
def blob_name_from_file_page(filename, page = 0):
if os.path.splitext(filename)[1].lower() == ".pdf":
return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".pdf"
else:
return os.path.basename(filename)

def upload_blobs(pages):
def upload_blobs(filename):
blob_service = BlobServiceClient(account_url=f"https://{args.storageaccount}.blob.core.windows.net", credential=storage_creds)
blob_container = blob_service.get_container_client(args.container)
if not blob_container.exists():
blob_container.create_container()
for i in range(len(pages)):
blob_name = blob_name_from_file_page(filename, i)
if args.verbose: print(f"\tUploading blob for page {i} -> {blob_name}")
f = io.BytesIO()
writer = PdfWriter()
writer.add_page(pages[i])
writer.write(f)
f.seek(0)
blob_container.upload_blob(blob_name, f, overwrite=True)

# if file is PDF split into pages and upload each page as a separate blob
if os.path.splitext(filename)[1].lower() == ".pdf":
reader = PdfReader(filename)
pages = reader.pages
for i in range(len(pages)):
blob_name = blob_name_from_file_page(filename, i)
if args.verbose: print(f"\tUploading blob for page {i} -> {blob_name}")
f = io.BytesIO()
writer = PdfWriter()
writer.add_page(pages[i])
writer.write(f)
f.seek(0)
blob_container.upload_blob(blob_name, f, overwrite=True)
else:
blob_name = blob_name_from_file_page(filename)
with open(filename,"rb") as data:
blob_container.upload_blob(blob_name, data, overwrite=True)

def remove_blobs(filename):
if args.verbose: print(f"Removing blobs for '{filename or '<all>'}'")
@@ -74,18 +97,74 @@ def remove_blobs(filename):
if args.verbose: print(f"\tRemoving blob {b}")
blob_container.delete_blob(b)

def split_text(pages):
def table_to_html(table):
    table_html = "<table>"
    rows = [sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index) for i in range(table.row_count)]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
            cell_spans = ""
            if cell.column_span > 1: cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span > 1: cell_spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html +="</tr>"
    table_html += "</table>"
    return table_html

def get_document_text(filename):
    offset = 0
    page_map = []
    if args.localpdfparser:
        reader = PdfReader(filename)
        pages = reader.pages
        for page_num, p in enumerate(pages):
            page_text = p.extract_text()
            page_map.append((page_num, offset, page_text))
            offset += len(page_text)
    else:
        if args.verbose: print(f"Extracting text from '{filename}' using Azure Form Recognizer")
        form_recognizer_client = DocumentAnalysisClient(endpoint=f"https://{args.formrecognizerservice}.cognitiveservices.azure.com/", credential=formrecognizer_creds, headers={"x-ms-useragent": "azure-search-chat-demo/1.0.0"})
        with open(filename, "rb") as f:
            poller = form_recognizer_client.begin_analyze_document("prebuilt-layout", document = f)
        form_recognizer_results = poller.result()

        for page_num, page in enumerate(form_recognizer_results.pages):
            tables_on_page = [table for table in form_recognizer_results.tables if table.bounding_regions[0].page_number == page_num + 1]

            # mark all positions of the table spans in the page
            page_offset = page.spans[0].offset
            page_length = page.spans[0].length
            table_chars = [-1]*page_length
            for table_id, table in enumerate(tables_on_page):
                for span in table.spans:
                    # replace all table spans with "table_id" in table_chars array
                    for i in range(span.length):
                        idx = span.offset - page_offset + i
                        if idx >=0 and idx < page_length:
                            table_chars[idx] = table_id

            # build page text by replacing characters in table spans with table html
            page_text = ""
            added_tables = set()
            for idx, table_id in enumerate(table_chars):
                if table_id == -1:
                    page_text += form_recognizer_results.content[page_offset + idx]
                elif not table_id in added_tables:
                    page_text += table_to_html(tables_on_page[table_id])
                    added_tables.add(table_id)

            page_text += " "
            page_map.append((page_num, offset, page_text))
            offset += len(page_text)

    return page_map

def split_text(page_map):
SENTENCE_ENDINGS = [".", "!", "?"]
WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
if args.verbose: print(f"Splitting '{filename}' into sections")

page_map = []
offset = 0
for i, p in enumerate(pages):
text = p.extract_text()
page_map.append((i, offset, text))
offset += len(text)

def find_page(offset):
l = len(page_map)
for i in range(l - 1):
@@ -125,14 +204,24 @@ def find_page(offset):
if start > 0:
start += 1

yield (all_text[start:end], find_page(start))
start = end - SECTION_OVERLAP
section_text = all_text[start:end]
yield (section_text, find_page(start))

last_table_start = section_text.rfind("<table")
if (last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table")):
# If the section ends with an unclosed table, we need to start the next section with the table.
# If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
# If last table starts inside SECTION_OVERLAP, keep overlapping
if args.verbose: print(f"Section ends with unclosed table, starting next section with the table at page {find_page(start)} offset {start} table start {last_table_start}")
start = min(end - SECTION_OVERLAP, start + last_table_start)
else:
start = end - SECTION_OVERLAP

if start + SECTION_OVERLAP < end:
yield (all_text[start:end], find_page(start))

def create_sections(filename, pages):
for i, (section, pagenum) in enumerate(split_text(pages)):
def create_sections(filename, page_map):
for i, (section, pagenum) in enumerate(split_text(page_map)):
yield {
"id": re.sub("[^0-9a-zA-Z_-]","_",f"{filename}-{i}"),
"content": section,
@@ -219,9 +308,8 @@ def remove_from_index(filename):
remove_blobs(None)
remove_from_index(None)
else:
reader = PdfReader(filename)
pages = reader.pages
if not args.skipblobs:
upload_blobs(pages)
sections = create_sections(os.path.basename(filename), pages)
upload_blobs(filename)
page_map = get_document_text(filename)
sections = create_sections(os.path.basename(filename), page_map)
index_sections(os.path.basename(filename), sections)
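
The rule that `split_text` now applies when a section ends inside a table can be isolated as a small function. This is a sketch under stated assumptions rather than the script's code: `next_section_start` is a hypothetical helper name, and `SECTION_OVERLAP = 100` is an assumed value, since the constant's definition is not visible in this diff (`SENTENCE_SEARCH_LIMIT = 100` is).

```python
SENTENCE_SEARCH_LIMIT = 100  # value shown in the diff
SECTION_OVERLAP = 100        # assumed value; not visible in the diff


def next_section_start(section_text, start, end):
    """Return the offset in the full text where the next section begins.

    start and end are the offsets of section_text within the full text.
    """
    last_table_start = section_text.rfind("<table")
    unclosed = last_table_start > section_text.rfind("</table")
    if last_table_start > 2 * SENTENCE_SEARCH_LIMIT and unclosed:
        # The section ends inside a table: restart at the table's opening
        # tag so the next section holds the table from its beginning.
        return min(end - SECTION_OVERLAP, start + last_table_start)
    # Otherwise keep the usual fixed overlap between sections.
    return end - SECTION_OVERLAP


# A section whose last table opens at offset 600 and is never closed.
open_table = "x" * 600 + "<table>" + "y" * 393
print(next_section_start(open_table, 0, len(open_table)))

# A section whose table is closed uses the plain overlap.
closed_table = "x" * 600 + "<table><tr><td>a</td></tr></table>" + "y" * 300
print(next_section_start(closed_table, 0, len(closed_table)))
```

The `2 * SENTENCE_SEARCH_LIMIT` guard mirrors the comment in the diff: a table that opens very near the start of a section is ignored, because restarting there would not advance the split and could loop forever on tables longer than `MAX_SECTION_LENGTH`.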
2 changes: 1 addition & 1 deletion scripts/prepdocs.sh
@@ -18,4 +18,4 @@ echo 'Installing dependencies from "requirements.txt" into virtual environment'
./scripts/.venv/bin/python -m pip install -r scripts/requirements.txt

echo 'Running "prepdocs.py"'
./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" --tenant "$AZURE_TENANT_ID" -v
./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" --formrecognizerservice "$AZURE_FORMRECOGNIZER_SERVICE" --tenant "$AZURE_TENANT_ID" -v
1 change: 1 addition & 0 deletions scripts/requirements.txt
@@ -1,4 +1,5 @@
pypdf==3.5.0
azure-identity==1.13.0b3
azure-search-documents==11.4.0b3
azure-ai-formrecognizer==3.2.1
azure-storage-blob==12.14.1