Document transformation framework to improve vector search results
Vector databases are useful for retrieving context for LLMs, however they struggle to find relevant information if the source documents are indexed haphazardly and information is sparse. Doctran uses LLMs and open source NLP libraries to transform raw text into clean, structured, information-dense documents that are optimized for vector space retrieval. You can think of Doctran as a black box where messy strings go in and nice, clean, labelled strings come out.
Doctran is maintained by Psychic, the data integration layer for LLMs.
Clone or download examples.ipynb
for interactive demos.
<doc type="Confidential Document - For Internal Use Only">
<metadata>
<date> J u l y   1 , 2 0 2 3 </date>
<subject> Updates and Discussions on Various Topics; </subject>
</metadata>
<body>
<p>Dear Team,</p>
<p>I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.</p>
<section>
<header>Security and Privacy Measures</header>
<p>As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.</p>
</section>
<section>
<header>HR Updates and Employee Benefits</header>
<p>Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 0 4 9 - 4 5 - 5 9 2 8) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 4 1
...
{
"topics": ["Security and Privacy", "HR Updates", "Marketing", "R&D"],
"summary": "The document discusses updates on security measures, HR, marketing initiatives, and R&D projects. It commends John Doe for enhancing network security, welcomes new team members, and recognizes Jane Smith for her customer service. It also mentions the open enrollment period for employee benefits, thanks Sarah Thompson for her social media efforts, and announces a product launch event on July 15th. David Rodriguez is acknowledged for his contributions to R&D. The document emphasizes the importance of confidentiality.",
"contact_info": [
{
"name": "John Doe",
"contact_info": {
"phone": "",
"email": "[email protected]"
}
},
{
"name": "Michael Johnson",
"contact_info": {
"phone": "418-492-3850",
"email": "[email protected]"
}
},
{
"name": "Sarah Thompson",
"contact_info": {
"phone": "415-555-1234",
"email": ""
}
}
],
"questions_and_answers": [
{
"question": "What is the purpose of this document?",
"answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
},
{
"question": "Who is commended for enhancing the company's network security?",
"answer": "John Doe from the IT department is commended for enhancing the company's network security."
}
]
}
pip install doctran
from doctran import Doctran
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")
Doctran is designed to make chaining document transformations easy. For example, you may want to first redact all PII from a document before sending it over to OpenAI to be summarized.
Ordering is important when chaining transformations - transformations that are invoked first will be executed first, and its result will be passed to the next transformation.
document = await document.redact(entities=["EMAIL_ADDRESS", "PHONE_NUMBER"]).extract(properties).summarize().execute()
Given any valid JSON schema, uses OpenAI function calling to extract structured data from a document.
from doctran import ExtractProperty
properties = ExtractProperty(
name="millenial_or_boomer",
description="A prediction of whether this document was written by a millenial or boomer",
type="string",
enum=["millenial", "boomer"],
required=True
)
document = await document.extract(properties=properties).execute()
Uses a spaCy model to remove names, emails, phone numbers and other sensitive information from a document. Runs locally to avoid sending sensitive data to third party APIs.
document = await document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]).execute()
Summarize the information in a document. token_limit
may be passed to configure the size of the summary, however it may not be respected by OpenAI.
document = await document.summarize().execute()
Remove all information from a document unless it's related to a specific set of topics.
document = await document.refine(topics=['marketing', 'meetings']).execute()
Translates text into another language
document = await document.translate(language="spanish").execute()
Convert information in a document into question and answer format. End user queries often take the form of a question, so converting information into questions and creating indexes from these questions often yields better results when using a vector database for context retrieval.
document = await document.interrogate().execute()
Doctran is open to contributions! The best way to get started is to contribute a new document transformer. Transformers that don't rely on API calls (e.g. OpenAI) are especially valuable since they can run fast and don't require any external dependencies.
Contributing new transformers is straightforward.
- Add a new class that extends
DocumentTransformer
orOpenAIDocumentTransformer
- Implement the
__init__
andtransform
methods - Add corresponding methods to the
DocumentTransformationBuilder
andDocument
classes to enable chaining