Adds document transformer abstraction, metadata tagger (langchain-ai#1945)

* Adds document transformers, add OpenAIFunctions metadata tagger

* Adds docstrings for base document transformer abstract class

* Adds docs and entrypoint for document transformers

* Type tagging chain inputs better

* Formatting
jacoblee93 authored Jul 13, 2023
1 parent c360d53 commit 37dfc77
Showing 22 changed files with 364 additions and 7 deletions.
37 changes: 37 additions & 0 deletions docs/docs/modules/indexes/document_transformers/index.mdx
@@ -0,0 +1,37 @@
---
sidebar_label: Document Transformers
sidebar_position: 2
---

# Document Transformers

Once you've loaded documents, you'll often want to transform them to better suit your application. One example is
automatically tagging the loaded documents with metadata extracted from their content.

## Metadata Tagger

Tagging ingested documents with structured metadata, such as the title, tone, or length of a document, allows for more targeted similarity search later. For large numbers of documents, however, labelling them manually can be tedious.

The `MetadataTagger` document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support.

**Note:** This document transformer works best with complete documents, so run it on whole documents before any splitting or other processing!
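
To illustrate that ordering, here is a minimal sketch of what happens after tagging: the whole document is tagged once, then split, and every chunk inherits the extracted metadata. `SimpleDoc` and `splitWithMetadata` are hypothetical names used only for this illustration (they stand in for the `Document` class and for a real text splitter), and the tagged metadata is hard-coded rather than produced by an LLM call.

```typescript
// Hypothetical minimal document shape with the same two fields as Document.
interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Naive fixed-width splitter, for illustration only.
function splitWithMetadata(doc: SimpleDoc, chunkSize: number): SimpleDoc[] {
  if (chunkSize < 1) {
    throw new Error("chunkSize must be at least 1");
  }
  const chunks: SimpleDoc[] = [];
  for (let start = 0; start < doc.pageContent.length; start += chunkSize) {
    chunks.push({
      pageContent: doc.pageContent.slice(start, start + chunkSize),
      // Shallow copy so chunks do not share one metadata object.
      metadata: { ...doc.metadata },
    });
  }
  return chunks;
}

// Pretend this document has already been through the metadata tagger.
const taggedReview: SimpleDoc = {
  pageContent:
    "Review of The Bee Movie\nBy Roger Ebert\nThis is the greatest movie ever made. 4 out of 5 stars.",
  metadata: {
    movie_title: "The Bee Movie",
    critic: "Roger Ebert",
    tone: "positive",
    rating: 4,
  },
};

const chunks = splitWithMetadata(taggedReview, 40);
for (const chunk of chunks) {
  console.log(chunk.metadata["critic"], JSON.stringify(chunk.pageContent));
}
```

Had the review been split first, only the chunk containing "By Roger Ebert" would give the tagger a chance to find the critic's name.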

### Usage

For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer as follows:

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_transformers/metadata_tagger.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

There is also a `createMetadataTagger` function that accepts a valid JSON Schema object instead of a Zod schema.
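
As a rough sketch, a hand-written JSON Schema for the movie review case might look like the following. This object is an assumption written for illustration, not the exact output of `zod-to-json-schema`, and `optionalKeys` is a hypothetical helper that only shows which fields the model may leave out. The call in the trailing comment is based on the `createMetadataTagger(schema, options)` signature; a type assertion may be needed to satisfy its schema parameter type.

```typescript
// Hand-written JSON Schema intended to mirror the Zod schema in the example.
const movieReviewSchema = {
  type: "object",
  properties: {
    movie_title: { type: "string" },
    critic: { type: "string" },
    tone: { type: "string", enum: ["positive", "negative"] },
    rating: {
      type: "number",
      description: "The number of stars the critic rated the movie",
    },
  },
  // `rating` is optional, so it is left out of `required`.
  required: ["movie_title", "critic", "tone"],
};

// Hypothetical helper: list the properties that are not required.
function optionalKeys(schema: {
  properties: Record<string, unknown>;
  required: string[];
}): string[] {
  return Object.keys(schema.properties).filter(
    (key) => schema.required.indexOf(key) === -1
  );
}

console.log(optionalKeys(movieReviewSchema)); // [ 'rating' ]

// Assumed usage:
// const metadataTagger = createMetadataTagger(movieReviewSchema, { llm });
```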

### Customization

You can pass standard `LLMChain` arguments to the underlying tagging chain through the second `options` parameter.
For example, if you wanted to ask the LLM to focus on specific details in the input documents, or extract metadata in a certain style, you could pass in a custom prompt:

import CustomExample from "@examples/document_transformers/metadata_tagger_custom_prompt.ts";

<CodeBlock language="typescript">{CustomExample}</CodeBlock>
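
One constraint to keep in mind when writing a custom prompt: the `MetadataTagger` constructor throws unless the tagging chain has exactly one input key. Assuming the chain derives its input keys from the prompt's variables, a custom template should contain a single placeholder such as `{input}`. The sketch below checks that with a simplified, hypothetical `templateVariables` helper; it is not LangChain's template parser and ignores cases such as escaped braces.

```typescript
// Simplified placeholder scan: collects unique {name} placeholders in order.
function templateVariables(template: string): string[] {
  const pattern = /\{(\w+)\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

const customTemplate = `Extract the desired information from the following passage.
Anonymous critics are actually Roger Ebert.
Passage:
{input}
`;

const variables = templateVariables(customTemplate);
if (variables.length !== 1) {
  throw new Error(
    `Expected exactly one input variable, found ${variables.length}`
  );
}
console.log(variables); // [ 'input' ]
```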
61 changes: 61 additions & 0 deletions examples/src/document_transformers/metadata_tagger.ts
@@ -0,0 +1,61 @@
import { z } from "zod";
import { createMetadataTaggerFromZod } from "langchain/document_transformers/openai_functions";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { Document } from "langchain/document";

const zodSchema = z.object({
  movie_title: z.string(),
  critic: z.string(),
  tone: z.enum(["positive", "negative"]),
  rating: z
    .optional(z.number())
    .describe("The number of stars the critic rated the movie"),
});

const metadataTagger = createMetadataTaggerFromZod(zodSchema, {
  llm: new ChatOpenAI({ modelName: "gpt-3.5-turbo" }),
});

const documents = [
  new Document({
    pageContent:
      "Review of The Bee Movie\nBy Roger Ebert\nThis is the greatest movie ever made. 4 out of 5 stars.",
  }),
  new Document({
    pageContent:
      "Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
    metadata: { reliable: false },
  }),
];
const taggedDocuments = await metadataTagger.transformDocuments(documents);

console.log(taggedDocuments);

/*
  [
    Document {
      pageContent: 'Review of The Bee Movie\n' +
        'By Roger Ebert\n' +
        'This is the greatest movie ever made. 4 out of 5 stars.',
      metadata: {
        movie_title: 'The Bee Movie',
        critic: 'Roger Ebert',
        tone: 'positive',
        rating: 4
      }
    },
    Document {
      pageContent: 'Review of The Godfather\n' +
        'By Anonymous\n' +
        '\n' +
        'This movie was super boring. 1 out of 5 stars.',
      metadata: {
        movie_title: 'The Godfather',
        critic: 'Anonymous',
        tone: 'negative',
        rating: 1,
        reliable: false
      }
    }
  ]
*/
70 changes: 70 additions & 0 deletions examples/src/document_transformers/metadata_tagger_custom_prompt.ts
@@ -0,0 +1,70 @@
import { z } from "zod";
import { createMetadataTaggerFromZod } from "langchain/document_transformers/openai_functions";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { Document } from "langchain/document";
import { PromptTemplate } from "langchain/prompts";

const taggingChainTemplate = `Extract the desired information from the following passage.
Anonymous critics are actually Roger Ebert.
Passage:
{input}
`;

const zodSchema = z.object({
  movie_title: z.string(),
  critic: z.string(),
  tone: z.enum(["positive", "negative"]),
  rating: z
    .optional(z.number())
    .describe("The number of stars the critic rated the movie"),
});

const metadataTagger = createMetadataTaggerFromZod(zodSchema, {
  llm: new ChatOpenAI({ modelName: "gpt-3.5-turbo" }),
  prompt: PromptTemplate.fromTemplate(taggingChainTemplate),
});

const documents = [
  new Document({
    pageContent:
      "Review of The Bee Movie\nBy Roger Ebert\nThis is the greatest movie ever made. 4 out of 5 stars.",
  }),
  new Document({
    pageContent:
      "Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
    metadata: { reliable: false },
  }),
];
const taggedDocuments = await metadataTagger.transformDocuments(documents);

console.log(taggedDocuments);

/*
  [
    Document {
      pageContent: 'Review of The Bee Movie\n' +
        'By Roger Ebert\n' +
        'This is the greatest movie ever made. 4 out of 5 stars.',
      metadata: {
        movie_title: 'The Bee Movie',
        critic: 'Roger Ebert',
        tone: 'positive',
        rating: 4
      }
    },
    Document {
      pageContent: 'Review of The Godfather\n' +
        'By Anonymous\n' +
        '\n' +
        'This movie was super boring. 1 out of 5 stars.',
      metadata: {
        movie_title: 'The Godfather',
        critic: 'Roger Ebert',
        tone: 'negative',
        rating: 1,
        reliable: false
      }
    }
  ]
*/
3 changes: 3 additions & 0 deletions langchain/.gitignore
@@ -289,6 +289,9 @@ document_loaders/fs/notion.d.ts
document_loaders/fs/unstructured.cjs
document_loaders/fs/unstructured.js
document_loaders/fs/unstructured.d.ts
document_transformers/openai_functions.cjs
document_transformers/openai_functions.js
document_transformers/openai_functions.d.ts
chat_models.cjs
chat_models.js
chat_models.d.ts
8 changes: 8 additions & 0 deletions langchain/package.json
@@ -301,6 +301,9 @@
"document_loaders/fs/unstructured.cjs",
"document_loaders/fs/unstructured.js",
"document_loaders/fs/unstructured.d.ts",
"document_transformers/openai_functions.cjs",
"document_transformers/openai_functions.js",
"document_transformers/openai_functions.d.ts",
"chat_models.cjs",
"chat_models.js",
"chat_models.d.ts",
@@ -1334,6 +1337,11 @@
"import": "./document_loaders/fs/unstructured.js",
"require": "./document_loaders/fs/unstructured.cjs"
},
"./document_transformers/openai_functions": {
"types": "./document_transformers/openai_functions.d.ts",
"import": "./document_transformers/openai_functions.js",
"require": "./document_transformers/openai_functions.cjs"
},
"./chat_models": {
"node": {
"types": "./chat_models.d.ts",
2 changes: 2 additions & 0 deletions langchain/scripts/create-entrypoints.js
@@ -118,6 +118,8 @@ const entrypoints = {
"document_loaders/fs/csv": "document_loaders/fs/csv",
"document_loaders/fs/notion": "document_loaders/fs/notion",
"document_loaders/fs/unstructured": "document_loaders/fs/unstructured",
// document_transformers
"document_transformers/openai_functions": "document_transformers/openai_functions",
// chat_models
chat_models: "chat_models/index",
"chat_models/base": "chat_models/base",
1 change: 1 addition & 0 deletions langchain/src/chains/index.ts
@@ -84,6 +84,7 @@ export {
createExtractionChainFromZod,
} from "./openai_functions/extraction.js";
export {
TaggingChainOptions,
createTaggingChain,
createTaggingChainFromZod,
} from "./openai_functions/tagging.js";
6 changes: 5 additions & 1 deletion langchain/src/chains/openai_functions/index.ts
@@ -2,7 +2,11 @@ export {
   createExtractionChain,
   createExtractionChainFromZod,
 } from "./extraction.js";
-export { createTaggingChain, createTaggingChainFromZod } from "./tagging.js";
+export {
+  TaggingChainOptions,
+  createTaggingChain,
+  createTaggingChainFromZod,
+} from "./tagging.js";
 export { OpenAPIChainOptions, createOpenAPIChain } from "./openapi.js";
 export {
   StructuredOutputChainInput,
19 changes: 14 additions & 5 deletions langchain/src/chains/openai_functions/tagging.ts
@@ -8,7 +8,11 @@ import {
   FunctionParameters,
   JsonOutputFunctionsParser,
 } from "../../output_parsers/openai_functions.js";
-import { LLMChain } from "../llm_chain.js";
+import { LLMChain, LLMChainInput } from "../llm_chain.js";
+
+export type TaggingChainOptions = {
+  prompt?: PromptTemplate;
+} & Omit<LLMChainInput<object>, "prompt" | "llm">;

 function getTaggingFunctions(schema: FunctionParameters) {
   return [
@@ -28,27 +32,32 @@ Passage:

 export function createTaggingChain(
   schema: FunctionParameters,
-  llm: ChatOpenAI
+  llm: ChatOpenAI,
+  options: TaggingChainOptions = {}
 ) {
+  const { prompt = PromptTemplate.fromTemplate(TAGGING_TEMPLATE), ...rest } =
+    options;
   const functions = getTaggingFunctions(schema);
-  const prompt = PromptTemplate.fromTemplate(TAGGING_TEMPLATE);
   const outputParser = new JsonOutputFunctionsParser();
   return new LLMChain({
     llm,
     prompt,
     llmKwargs: { functions },
     outputParser,
     tags: ["openai_functions", "tagging"],
+    ...rest,
   });
 }

 export function createTaggingChainFromZod(
   // eslint-disable-next-line @typescript-eslint/no-explicit-any
   schema: z.ZodObject<any, any, any, any>,
-  llm: ChatOpenAI
+  llm: ChatOpenAI,
+  options?: TaggingChainOptions
 ) {
   return createTaggingChain(
     zodToJsonSchema(schema) as JsonSchema7ObjectType,
-    llm
+    llm,
+    options
   );
 }
68 changes: 68 additions & 0 deletions langchain/src/document_transformers/openai_functions.ts
@@ -0,0 +1,68 @@
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
import type { JsonSchema7ObjectType } from "zod-to-json-schema/src/parsers/object.js";

import { Document } from "../document.js";
import { BaseChain } from "../chains/base.js";
import { BaseDocumentTransformer } from "../schema/document.js";
import {
TaggingChainOptions,
createTaggingChain,
} from "../chains/openai_functions/index.js";
import { ChatOpenAI } from "../chat_models/openai.js";

export class MetadataTagger extends BaseDocumentTransformer {
  protected taggingChain: BaseChain;

  constructor(fields: { taggingChain: BaseChain }) {
    super();
    this.taggingChain = fields.taggingChain;
    if (this.taggingChain.inputKeys.length !== 1) {
      throw new Error(
        "Invalid input chain. The input chain must have exactly one input."
      );
    }
    if (this.taggingChain.outputKeys.length !== 1) {
      throw new Error(
        "Invalid input chain. The input chain must have exactly one output."
      );
    }
  }

  async transformDocuments(documents: Document[]): Promise<Document[]> {
    const newDocuments = [];
    for (const document of documents) {
      const taggingChainResponse = await this.taggingChain.call({
        [this.taggingChain.inputKeys[0]]: document.pageContent,
      });
      const extractedMetadata =
        taggingChainResponse[this.taggingChain.outputKeys[0]];
      const newDocument = new Document({
        pageContent: document.pageContent,
        metadata: { ...extractedMetadata, ...document.metadata },
      });
      newDocuments.push(newDocument);
    }
    return newDocuments;
  }
}

export function createMetadataTagger(
  schema: JsonSchema7ObjectType,
  options: TaggingChainOptions & { llm?: ChatOpenAI }
) {
  const { llm = new ChatOpenAI({ modelName: "gpt-3.5-turbo-0613" }), ...rest } =
    options;
  const taggingChain = createTaggingChain(schema, llm, rest);
  return new MetadataTagger({ taggingChain });
}

export function createMetadataTaggerFromZod(
  schema: z.AnyZodObject,
  options: TaggingChainOptions & { llm?: ChatOpenAI }
) {
  return createMetadataTagger(
    zodToJsonSchema(schema) as JsonSchema7ObjectType,
    options
  );
}
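
One detail of `transformDocuments` worth spelling out is the spread order `{ ...extractedMetadata, ...document.metadata }`: when the model extracts a key that the document already carries, the document's existing value wins. A minimal standalone sketch of that behavior follows; `mergeMetadata` is a hypothetical name for illustration and does not exist in the commit.

```typescript
type Metadata = Record<string, unknown>;

// Same spread order as in transformDocuments: extracted fields first,
// then the document's existing metadata layered on top.
function mergeMetadata(extracted: Metadata, existing: Metadata): Metadata {
  return { ...extracted, ...existing };
}

const merged = mergeMetadata(
  // Pretend the model also produced a conflicting `reliable` field.
  { critic: "Anonymous", tone: "negative", reliable: true },
  // Metadata that was already on the document before tagging.
  { reliable: false }
);

console.log(merged);
// { critic: 'Anonymous', tone: 'negative', reliable: false }
```

This matches the example output in the commit, where `reliable: false` survives tagging alongside the extracted fields.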