Use Azure Form Recognizer as document preprocessing to extract text, tables, and document layout (Azure-Samples#37)

* Add Form Recognizer integration. Keep local PDF parser option.
* Add conversion of Form Recognizer tables into HTML tables understandable by ChatGPT. Add table splitting logic so that tables are split across sections less often.
* Add Form Recognizer service to the Bicep deployments. Add the Cognitive Services User role so that Form Recognizer works with DefaultAzureCredential.
* Add Form Recognizer service as parameters for the prepdocs script.
* Add an image of the table with health plan costs to Benefit_Options.pdf. Users can now ask questions such as:
  - What is cost difference between plans?
  - I don't have any dependents. What would be savings if I switch to Standard?
* Add an additional prompt to return tabular data as an HTML table. Add table formatting for answers in CSS.
* Update the README with information about Form Recognizer cost.
* Fix spellings
* Add HTML escaping inside HTML table generation
rikvermeer · Mar 22, 2023 · 6ac7c90
1 parent 1273a21
commit 6ac7c90

Showing 13 changed files with 206 additions and 32 deletions.
diff --git a/README.md b/README.md
@@ -22,7 +22,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
 
 > **IMPORTANT:** In order to deploy and run this example, you'll need an **Azure subscription with access enabled for the Azure OpenAI service**. You can request access [here](https://aka.ms/oaiapply). You can also visit [here](https://azure.microsoft.com/free/cognitive-search/) to get some free Azure credits to get you started.
 
-> **AZURE RESOURCE COSTS** by default this sample will create Azure App Service and Azure Cognitive Search resources that have a monthly cost. You can switch them to free versions of each of them if you want to avoid this cost by changing the parameters file under the infra folder (though there are some limits to consider; for example, you can have up to 1 free Cognitive Search resource per subscription.)
+> **AZURE RESOURCE COSTS** by default this sample will create Azure App Service and Azure Cognitive Search resources that have a monthly cost, as well as Form Recognizer resource that has cost per document page. You can switch them to free versions of each of them if you want to avoid this cost by changing the parameters file under the infra folder (though there are some limits to consider; for example, you can have up to 1 free Cognitive Search resource per subscription, and the free Form Recognizer resource only analyzes the first 2 pages of each document.)
 
 ### Prerequisites
 

diff --git a/app/backend/approaches/chatreadretrieveread.py b/app/backend/approaches/chatreadretrieveread.py
@@ -10,7 +10,8 @@
 class ChatReadRetrieveReadApproach(Approach):
     prompt_prefix = """<|im_start|>system
 Assistant helps the company employees with their healthcare plan questions, and questions about the employee handbook. Be brief in your answers.
-Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below. If asking a clarifying question to the user would help, ask the question. 
+Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below. If asking a clarifying question to the user would help, ask the question.
+For tabular information return it as an html table. Do not return markdown format.
 Each source has a name followed by colon and the actual information, always include the source name for each fact you use in the response. Use square brakets to reference the source, e.g. [info1.txt]. Don't combine sources, list each source separately, e.g. [info1.txt][info2.pdf].
 {follow_up_questions_prompt}
 {injected_prompt}

diff --git a/app/backend/approaches/readretrieveread.py b/app/backend/approaches/readretrieveread.py
@@ -21,6 +21,7 @@ class ReadRetrieveReadApproach(Approach):
     template_prefix = \
 "You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions. " \
 "Answer the question using only the data provided in the information sources below. " \
+"For tabular information return it as an html table. Do not return markdown format. " \
 "Each source has a name followed by colon and the actual data, quote the source name for each piece of data you use in the response. " \
 "For example, if the question is \"What color is the sky?\" and one of the information sources says \"info123: the sky is blue whenever it's not cloudy\", then answer with \"The sky is blue [info123]\" " \
 "It's important to strictly follow the format where the name of the source is in square brackets at the end of the sentence, and only up to the prefix before the colon (\":\"). " \

diff --git a/app/backend/approaches/retrievethenread.py b/app/backend/approaches/retrievethenread.py
@@ -13,6 +13,7 @@ class RetrieveThenReadApproach(Approach):
 "You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions. " + \
 "Use 'you' to refer to the individual asking the questions even if they ask with 'I'. " + \
 "Answer the following question using only the data provided in the sources below. " + \
+"For tabular information return it as an html table. Do not return markdown format. "  + \
 "Each source has a name followed by colon and the actual information, always include the source name for each fact you use in the response. " + \
 "If you cannot answer using the sources below, say you don't know. " + \
 """

diff --git a/app/frontend/src/components/Answer/Answer.module.css b/app/frontend/src/components/Answer/Answer.module.css
@@ -19,6 +19,16 @@
     white-space: pre-line;
 }
 
+.answerText table {
+    border-collapse: collapse;
+}
+
+.answerText td,
+.answerText th {
+    border: 1px solid;
+    padding: 5px;
+}
+
 .selected {
     outline: 2px solid rgba(115, 118, 225, 1);
 }

diff --git a/data/Benefit_Options.pdf b/data/Benefit_Options.pdf
diff --git a/infra/core/ai/formrecognizer.bicep b/infra/core/ai/formrecognizer.bicep
@@ -0,0 +1,26 @@
+param name string
+param location string = resourceGroup().location
+param tags object = {}
+
+param customSubDomainName string = name
+param kind string = 'FormRecognizer'
+param publicNetworkAccess string = 'Enabled'
+param sku object = {
+  name: 'S0'
+}
+
+resource account 'Microsoft.CognitiveServices/accounts@2022-10-01' = {
+  name: name
+  location: location
+  tags: tags
+  kind: kind
+  properties: {
+    customSubDomainName: customSubDomainName
+    publicNetworkAccess: publicNetworkAccess
+  }
+  sku: sku
+}
+
+output endpoint string = account.properties.endpoint
+output id string = account.id
+output name string = account.name
diff --git a/infra/main.bicep b/infra/main.bicep
@@ -31,6 +31,13 @@ param openAiResourceGroupName string = ''
 param openAiResourceGroupLocation string = location
 
 param openAiSkuName string = 'S0'
+
+param formRecognizerServiceName string = ''
+param formRecognizerResourceGroupName string = ''
+param formRecognizerResourceGroupLocation string = location
+
+param formRecognizerSkuName string = 'S0'
+
 param gptDeploymentName string = 'davinci'
 param gptModelName string = 'text-davinci-003'
 param chatGptDeploymentName string = 'chat'
@@ -54,6 +61,10 @@ resource openAiResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' exi
   name: !empty(openAiResourceGroupName) ? openAiResourceGroupName : resourceGroup.name
 }
 
+resource formRecognizerResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(formRecognizerResourceGroupName)) {
+  name: !empty(formRecognizerResourceGroupName) ? formRecognizerResourceGroupName : resourceGroup.name
+}
+
 resource searchServiceResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(searchServiceResourceGroupName)) {
   name: !empty(searchServiceResourceGroupName) ? searchServiceResourceGroupName : resourceGroup.name
 }
@@ -140,6 +151,19 @@ module openAi 'core/ai/cognitiveservices.bicep' = {
   }
 }
 
+module formrecognizer 'core/ai/formrecognizer.bicep' = {
+  name: 'formrecognizer'
+  scope: formRecognizerResourceGroup
+  params: {
+    name: !empty(formRecognizerServiceName) ? formRecognizerServiceName : '${abbrs.cognitiveServicesFormRecognizer}${resourceToken}'
+    location: formRecognizerResourceGroupLocation
+    tags: tags
+    sku: {
+      name: formRecognizerSkuName
+    }
+  }
+}
+
 module searchService 'core/search/search-services.bicep' = {
   name: 'search-service'
   scope: searchServiceResourceGroup
@@ -194,6 +218,16 @@ module openAiRoleUser 'core/security/role.bicep' = {
   }
 }
 
+module formRecognizerRoleUser 'core/security/role.bicep' = {
+  scope: formRecognizerResourceGroup
+  name: 'formrecognizer-role-user'
+  params: {
+    principalId: principalId
+    roleDefinitionId: 'a97b65f3-24c7-4388-baec-2e87135dc908'
+    principalType: 'User'
+  }
+}
+
 module storageRoleUser 'core/security/role.bicep' = {
   scope: storageResourceGroup
   name: 'storage-role-user'
@@ -274,6 +308,9 @@ output AZURE_OPENAI_RESOURCE_GROUP string = openAiResourceGroup.name
 output AZURE_OPENAI_GPT_DEPLOYMENT string = gptDeploymentName
 output AZURE_OPENAI_CHATGPT_DEPLOYMENT string = chatGptDeploymentName
 
+output AZURE_FORMRECOGNIZER_SERVICE string = formrecognizer.outputs.name
+output AZURE_FORMRECOGNIZER_RESOURCE_GROUP string = formRecognizerResourceGroup.name
+
 output AZURE_SEARCH_INDEX string = searchIndexName
 output AZURE_SEARCH_SERVICE string = searchService.outputs.name
 output AZURE_SEARCH_SERVICE_RESOURCE_GROUP string = searchServiceResourceGroup.name

diff --git a/infra/main.parameters.json b/infra/main.parameters.json
@@ -20,6 +20,15 @@
     "openAiSkuName": {
       "value": "S0"
     },
+    "formRecognizerServiceName": {
+      "value": "${AZURE_FORMRECOGNIZER_SERVICE}"
+    },
+    "formRecognizerResourceGroupName": {
+      "value": "${AZURE_FORMRECOGNIZER_RESOURCE_GROUP}"
+    },
+    "formRecognizerSkuName": {
+      "value": "S0"
+    },
     "searchServiceName": {
       "value": "${AZURE_SEARCH_SERVICE}"
     },

diff --git a/scripts/prepdocs.ps1 b/scripts/prepdocs.ps1
@@ -32,4 +32,4 @@ Start-Process -FilePath $venvPythonPath -ArgumentList "-m pip install -r ./scrip
 
 Write-Host 'Running "prepdocs.py"'
 $cwd = (Get-Location)
-Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --tenantid $env:AZURE_TENANT_ID -v" -Wait -NoNewWindow
+Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE --tenantid $env:AZURE_TENANT_ID -v" -Wait -NoNewWindow
diff --git a/scripts/prepdocs.py b/scripts/prepdocs.py
@@ -1,6 +1,7 @@
 import os
 import argparse
 import glob
+import html
 import io
 import re
 import time
@@ -11,6 +12,7 @@
 from azure.search.documents.indexes import SearchIndexClient
 from azure.search.documents.indexes.models import *
 from azure.search.documents import SearchClient
+from azure.ai.formrecognizer import DocumentAnalysisClient
 
 MAX_SECTION_LENGTH = 1000
 SENTENCE_SEARCH_LIMIT = 100
@@ -32,6 +34,9 @@
 parser.add_argument("--searchkey", required=False, help="Optional. Use this Azure Cognitive Search account key instead of the current user identity to login (use az login to set current user for Azure)")
 parser.add_argument("--remove", action="store_true", help="Remove references to this document from blob storage and the search index")
 parser.add_argument("--removeall", action="store_true", help="Remove all blobs from blob storage and documents from the search index")
+parser.add_argument("--localpdfparser", action="store_true", help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Form Recognizer service to extract text, tables and layout from the documents")
+parser.add_argument("--formrecognizerservice", required=False, help="Optional. Name of the Azure Form Recognizer service which will be used to extract text, tables and layout from the documents (must exist already)")
+parser.add_argument("--formrecognizerkey", required=False, help="Optional. Use this Azure Form Recognizer account key instead of the current user identity to login (use az login to set current user for Azure)")
 parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
 args = parser.parse_args()
 
@@ -41,24 +46,42 @@
 search_creds = default_creds if args.searchkey == None else AzureKeyCredential(args.searchkey)
 if not args.skipblobs:
     storage_creds = default_creds if args.storagekey == None else args.storagekey
+if not args.localpdfparser:
+    # check if Azure Form Recognizer credentials are provided
+    if args.formrecognizerservice == None:
+        print("Error: Azure Form Recognizer service is not provided. Please provide formrecognizerservice or use --localpdfparser for local pypdf parser.")
+        exit(1)
+    formrecognizer_creds = default_creds if args.formrecognizerkey == None else AzureKeyCredential(args.formrecognizerkey)
 
-def blob_name_from_file_page(filename, page):
-    return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".pdf"
+def blob_name_from_file_page(filename, page = 0):
+    if os.path.splitext(filename)[1].lower() == ".pdf":
+        return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".pdf"
+    else:
+        return os.path.basename(filename)
 
-def upload_blobs(pages):
+def upload_blobs(filename):
     blob_service = BlobServiceClient(account_url=f"https://{args.storageaccount}.blob.core.windows.net", credential=storage_creds)
     blob_container = blob_service.get_container_client(args.container)
     if not blob_container.exists():
         blob_container.create_container()
-    for i in range(len(pages)):
-        blob_name = blob_name_from_file_page(filename, i)
-        if args.verbose: print(f"\tUploading blob for page {i} -> {blob_name}")
-        f = io.BytesIO()
-        writer = PdfWriter()
-        writer.add_page(pages[i])
-        writer.write(f)
-        f.seek(0)
-        blob_container.upload_blob(blob_name, f, overwrite=True)
+
+    # if file is PDF split into pages and upload each page as a separate blob
+    if os.path.splitext(filename)[1].lower() == ".pdf":
+        reader = PdfReader(filename)
+        pages = reader.pages
+        for i in range(len(pages)):
+            blob_name = blob_name_from_file_page(filename, i)
+            if args.verbose: print(f"\tUploading blob for page {i} -> {blob_name}")
+            f = io.BytesIO()
+            writer = PdfWriter()
+            writer.add_page(pages[i])
+            writer.write(f)
+            f.seek(0)
+            blob_container.upload_blob(blob_name, f, overwrite=True)
+    else:
+        blob_name = blob_name_from_file_page(filename)
+        with open(filename,"rb") as data:
+            blob_container.upload_blob(blob_name, data, overwrite=True)
 
 def remove_blobs(filename):
     if args.verbose: print(f"Removing blobs for '{filename or '<all>'}'")
@@ -74,18 +97,74 @@ def remove_blobs(filename):
             if args.verbose: print(f"\tRemoving blob {b}")
             blob_container.delete_blob(b)
 
-def split_text(pages):
+def table_to_html(table):
+    table_html = "<table>"
+    rows = [sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index) for i in range(table.row_count)]
+    for row_cells in rows:
+        table_html += "<tr>"
+        for cell in row_cells:
+            tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
+            cell_spans = ""
+            if cell.column_span > 1: cell_spans += f" colSpan={cell.column_span}"
+            if cell.row_span > 1: cell_spans += f" rowSpan={cell.row_span}"
+            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
+        table_html +="</tr>"
+    table_html += "</table>"
+    return table_html
+
+def get_document_text(filename):
+    offset = 0
+    page_map = []
+    if args.localpdfparser:
+        reader = PdfReader(filename)
+        pages = reader.pages
+        for page_num, p in enumerate(pages):
+            page_text = p.extract_text()
+            page_map.append((page_num, offset, page_text))
+            offset += len(page_text)
+    else:
+        if args.verbose: print(f"Extracting text from '{filename}' using Azure Form Recognizer")
+        form_recognizer_client = DocumentAnalysisClient(endpoint=f"https://{args.formrecognizerservice}.cognitiveservices.azure.com/", credential=formrecognizer_creds, headers={"x-ms-useragent": "azure-search-chat-demo/1.0.0"})
+        with open(filename, "rb") as f:
+            poller = form_recognizer_client.begin_analyze_document("prebuilt-layout", document = f)
+        form_recognizer_results = poller.result()
+
+        for page_num, page in enumerate(form_recognizer_results.pages):
+            tables_on_page = [table for table in form_recognizer_results.tables if table.bounding_regions[0].page_number == page_num + 1]
+
+            # mark all positions of the table spans in the page
+            page_offset = page.spans[0].offset
+            page_length = page.spans[0].length
+            table_chars = [-1]*page_length
+            for table_id, table in enumerate(tables_on_page):
+                for span in table.spans:
+                    # replace all table spans with "table_id" in table_chars array
+                    for i in range(span.length):
+                        idx = span.offset - page_offset + i
+                        if idx >=0 and idx < page_length:
+                            table_chars[idx] = table_id
+
+            # build page text by replacing characters in table spans with table html
+            page_text = ""
+            added_tables = set()
+            for idx, table_id in enumerate(table_chars):
+                if table_id == -1:
+                    page_text += form_recognizer_results.content[page_offset + idx]
+                elif not table_id in added_tables:
+                    page_text += table_to_html(tables_on_page[table_id])
+                    added_tables.add(table_id)
+
+            page_text += " "
+            page_map.append((page_num, offset, page_text))
+            offset += len(page_text)
+
+    return page_map
+
+def split_text(page_map):
     SENTENCE_ENDINGS = [".", "!", "?"]
     WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
     if args.verbose: print(f"Splitting '{filename}' into sections")
 
-    page_map = []
-    offset = 0
-    for i, p in enumerate(pages):
-        text = p.extract_text()
-        page_map.append((i, offset, text))
-        offset += len(text)
-
     def find_page(offset):
         l = len(page_map)
         for i in range(l - 1):
@@ -125,14 +204,24 @@ def find_page(offset):
         if start > 0:
             start += 1
 
-        yield (all_text[start:end], find_page(start))
-        start = end - SECTION_OVERLAP
+        section_text = all_text[start:end]
+        yield (section_text, find_page(start))
+
+        last_table_start = section_text.rfind("<table")
+        if (last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table")):
+            # If the section ends with an unclosed table, we need to start the next section with the table.
+            # If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
+            # If last table starts inside SECTION_OVERLAP, keep overlapping
+            if args.verbose: print(f"Section ends with unclosed table, starting next section with the table at page {find_page(start)} offset {start} table start {last_table_start}")
+            start = min(end - SECTION_OVERLAP, start + last_table_start)
+        else:
+            start = end - SECTION_OVERLAP
 
     if start + SECTION_OVERLAP < end:
         yield (all_text[start:end], find_page(start))
 
-def create_sections(filename, pages):
-    for i, (section, pagenum) in enumerate(split_text(pages)):
+def create_sections(filename, page_map):
+    for i, (section, pagenum) in enumerate(split_text(page_map)):
         yield {
             "id": re.sub("[^0-9a-zA-Z_-]","_",f"{filename}-{i}"),
             "content": section,
@@ -219,9 +308,8 @@ def remove_from_index(filename):
             remove_blobs(None)
             remove_from_index(None)
         else:
-            reader = PdfReader(filename)
-            pages = reader.pages
             if not args.skipblobs:
-                upload_blobs(pages)
-            sections = create_sections(os.path.basename(filename), pages)
+                upload_blobs(filename)
+            page_map = get_document_text(filename)
+            sections = create_sections(os.path.basename(filename), page_map)
             index_sections(os.path.basename(filename), sections)
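
To see what `table_to_html` from the prepdocs.py diff above produces, the sketch below runs the same logic on hand-built data. The `Cell` and `Table` dataclasses are hypothetical stand-ins that model only the attributes the function reads; they are not the Form Recognizer SDK types, and the sample table is invented for illustration.

```python
import html
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    # Hypothetical stand-in for a Form Recognizer table cell; only the
    # attributes read by table_to_html are modelled here.
    row_index: int
    column_index: int
    content: str
    kind: str = "content"
    column_span: int = 1
    row_span: int = 1


@dataclass
class Table:
    # Hypothetical stand-in for a Form Recognizer table.
    row_count: int
    cells: List[Cell]


def table_to_html(table):
    # Same logic as the function added in the diff: group cells by row,
    # order them by column, and emit <th> for header cells, <td> otherwise.
    table_html = "<table>"
    rows = [
        sorted(
            [cell for cell in table.cells if cell.row_index == i],
            key=lambda cell: cell.column_index,
        )
        for i in range(table.row_count)
    ]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if cell.kind in ("columnHeader", "rowHeader") else "td"
            cell_spans = ""
            if cell.column_span > 1:
                cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span > 1:
                cell_spans += f" rowSpan={cell.row_span}"
            # html.escape keeps characters such as & and < from breaking the markup
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html += "</tr>"
    table_html += "</table>"
    return table_html


# Invented sample: cells are deliberately listed out of column order.
sample = Table(
    row_count=2,
    cells=[
        Cell(0, 1, "Standard", kind="columnHeader"),
        Cell(0, 0, "Plan", kind="columnHeader"),
        Cell(1, 0, "Employee & spouse", column_span=2),
    ],
)
result = table_to_html(sample)
print(result)
```

The sample shows that cells are reordered by column index, that header cells become `<th>`, and that the ampersand in the cell content is escaped before it reaches the index.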
diff --git a/scripts/prepdocs.sh b/scripts/prepdocs.sh
@@ -18,4 +18,4 @@ echo 'Installing dependencies from "requirements.txt" into virtual environment'
 ./scripts/.venv/bin/python -m pip install -r scripts/requirements.txt
 
 echo 'Running "prepdocs.py"'
-./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" --tenant "$AZURE_TENANT_ID" -v
+./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" --formrecognizerservice "$AZURE_FORMRECOGNIZER_SERVICE" --tenant "$AZURE_TENANT_ID" -v
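
The Form Recognizer branch of `get_document_text` in the prepdocs.py diff replaces every character that belongs to a table span with that table's HTML, emitted once per table. A minimal sketch of that masking step follows, assuming a hypothetical `Span` dataclass and a hypothetical `build_page_text` helper that takes the pre-rendered HTML as input; neither name exists in the commit, and the sample content is invented.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical stand-in for a span: an offset and length into the
    # full document content string.
    offset: int
    length: int


def build_page_text(content, page_span, table_spans, table_htmls):
    """Return the page text with each table's characters replaced by its HTML.

    table_spans[i] lists the spans of table i; table_htmls[i] is its HTML.
    """
    # -1 marks ordinary text; any other value is the id of the covering table.
    table_chars = [-1] * page_span.length
    for table_id, spans in enumerate(table_spans):
        for span in spans:
            for i in range(span.length):
                idx = span.offset - page_span.offset + i
                if 0 <= idx < page_span.length:
                    table_chars[idx] = table_id

    page_text = ""
    added_tables = set()
    for idx, table_id in enumerate(table_chars):
        if table_id == -1:
            page_text += content[page_span.offset + idx]
        elif table_id not in added_tables:
            # Emit the HTML at the first character of the table, then skip the rest.
            page_text += table_htmls[table_id]
            added_tables.add(table_id)
    return page_text


# Invented sample: one page whose middle part was recognised as a table.
content = "Costs: Plan Price A 10 end."
table_text = "Plan Price A 10"
table_span = Span(offset=content.index(table_text), length=len(table_text))
page_span = Span(offset=0, length=len(content))

page_text = build_page_text(
    content,
    page_span,
    table_spans=[[table_span]],
    table_htmls=["<table><tr><td>A</td><td>10</td></tr></table>"],
)
print(page_text)
```

Marking characters first and emitting afterwards means the raw table text never appears next to its HTML rendering, even when a table is described by several spans.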
diff --git a/scripts/requirements.txt b/scripts/requirements.txt
@@ -1,4 +1,5 @@
 pypdf==3.5.0
 azure-identity==1.13.0b3
 azure-search-documents==11.4.0b3
+azure-ai-formrecognizer==3.2.1
 azure-storage-blob==12.14.1
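
The table-aware restart rule added to `split_text` in prepdocs.py can be isolated as a small pure function. The sketch below is an illustration only: `next_section_start` is a hypothetical helper name, and the value of `SECTION_OVERLAP` is an assumption because the diff does not show its definition.

```python
SENTENCE_SEARCH_LIMIT = 100  # value shown in the diff
SECTION_OVERLAP = 100        # assumed value; the constant is not defined in the diff


def next_section_start(section_text, start, end):
    """Return where the next section should begin in the full document text.

    section_text is all_text[start:end], the section that was just yielded.
    """
    last_table_start = section_text.rfind("<table")
    unclosed = last_table_start > section_text.rfind("</table")
    if last_table_start > 2 * SENTENCE_SEARCH_LIMIT and unclosed:
        # Restart at the unclosed table, unless the normal overlap already
        # reaches back far enough to cover it.
        return min(end - SECTION_OVERLAP, start + last_table_start)
    return end - SECTION_OVERLAP


# Unclosed table in the middle of the section: restart at the table.
unclosed_mid = "a" * 600 + "<table><tr><td>x" + "b" * 384
print(next_section_start(unclosed_mid, 0, 1000))

# Table closed inside the section: normal overlap applies.
closed = "a" * 600 + "<table></table>" + "b" * 385
print(next_section_start(closed, 0, 1000))

# Table opens at the very start: ignored, so a long table cannot stall the loop.
early = "<table>" + "a" * 993
print(next_section_start(early, 0, 1000))
```

The `2 * SENTENCE_SEARCH_LIMIT` threshold matches the comment in the diff: a table that opens near the beginning of a section is not used as a restart point, since restarting there could repeat the same section indefinitely for tables longer than `MAX_SECTION_LENGTH`.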