Add API key support to Unstructured loaders (langchain-ai#1128)
* feat: support unstructured api key in loader

* add API key to instructions and examples

* change api key order

* update integration tests

* retrigger build

* default web path to hosted api

* moved api key to an options argument

* interface for unstructured options

* linting, linting, linting

* refactor loaders into mapping

* move webpath to options

* update function calls in test

* fix examples

* linting, linting, linting ...

* Adds shim for existing Unstructured users

* Remove bad content type header from Unstructured API call

* Fix md formatting

* Small fixes

* Move additional UnstructuredDirectoryLoader constructor options into an options object

* Remove extra unnecessary declaration

---------

Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Jacob Lee <[email protected]>
Co-authored-by: Nuno Campos <[email protected]>
5 people authored May 5, 2023
1 parent 3495a79 commit 4e22af2
Showing 8 changed files with 164 additions and 48 deletions.
11 changes: 5 additions & 6 deletions docs/docs/ecosystem/unstructured.mdx
@@ -10,14 +10,14 @@ Unstructured is an [open source](https://github.com/Unstructured-IO/unstructured

`unstructured` is a Python package and cannot be used directly with TS/JS, however Unstructured also maintains a [REST API](https://github.com/Unstructured-IO/unstructured-api) to support pre-processing pipelines written in other programming languages. The endpoint for the hosted Unstructured API is `https://api.unstructured.io/general/v0/general`, or you can run the service locally using the instructions found [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image).

Currently (as of April 26th, 2023), the Unstructured API does not require an API key. The API will begin to require an API key in the near future. The [Unstructured documentation page](https://unstructured-io.github.io/unstructured/) will include instructions on how to obtain an API key once they are available.

## Quick start

You can use Unstructured in `langchain` with the following code.
Replace the filename with the file you would like to process.
If you are running the container locally, switch the url to
`http://127.0.0.1:8000/general/v0/general`.
Check out the [API documentation page](https://api.unstructured.io/general/docs)
for additional details.
If you are running the container locally, switch the url to `http://127.0.0.1:8000/general/v0/general`.
Check out the [API documentation page](https://api.unstructured.io/general/docs) for additional details.

import SingleExample from "@examples/document_loaders/unstructured.ts";
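
As a rough illustration of the URL switch described in the Quick start text above, the sketch below picks between the hosted endpoint and a local container and builds an options object with the `apiKey` and `apiUrl` fields this commit introduces. The helper `buildUnstructuredOptions`, the type `QuickStartOptions`, and the `useLocalContainer` flag are hypothetical names for this example and are not part of `langchain`.

```typescript
// Hypothetical helper, not part of langchain. It only assembles the
// options object; the two URLs are the ones quoted in the docs above.
type QuickStartOptions = {
  apiKey?: string;
  apiUrl?: string;
};

const HOSTED_API_URL = "https://api.unstructured.io/general/v0/general";
const LOCAL_API_URL = "http://127.0.0.1:8000/general/v0/general";

function buildUnstructuredOptions(
  useLocalContainer: boolean,
  apiKey?: string
): QuickStartOptions {
  const options: QuickStartOptions = {
    apiUrl: useLocalContainer ? LOCAL_API_URL : HOSTED_API_URL,
  };
  // Only attach a key when one was actually supplied.
  if (apiKey !== undefined) {
    options.apiKey = apiKey;
  }
  return options;
}
```

The result would then be passed as the second constructor argument, for example `new UnstructuredLoader(filePath, buildUnstructuredOptions(true))`.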

@@ -42,8 +42,7 @@ Currently, the `UnstructuredLoader` supports the following document types:
- HTML (`.html`)
- Markdown Files (`.md`)

The output from the `UnstructuredLoader` will be an array of `Document` objects that looks
like the following:
The output from the `UnstructuredLoader` will be an array of `Document` objects that looks like the following:

```typescript
[
8 changes: 6 additions & 2 deletions examples/src/document_loaders/unstructured.ts
@@ -1,7 +1,11 @@
import { UnstructuredLoader } from "langchain/document_loaders/fs/unstructured";

const options = {
apiKey: "MY_API_KEY",
};

const loader = new UnstructuredLoader(
"https://api.unstructured.io/general/v0/general",
"src/document_loaders/example_data/notion.md"
"src/document_loaders/example_data/notion.md",
options
);
const docs = await loader.load();
8 changes: 6 additions & 2 deletions examples/src/document_loaders/unstructured_directory.ts
@@ -1,7 +1,11 @@
import { UnstructuredDirectoryLoader } from "langchain/document_loaders/fs/unstructured";

const options = {
apiKey: "MY_API_KEY",
};

const loader = new UnstructuredDirectoryLoader(
"https://api.unstructured.io/general/v0/general",
"langchain/src/document_loaders/tests/example_data"
"langchain/src/document_loaders/tests/example_data",
options
);
const docs = await loader.load();
8 changes: 5 additions & 3 deletions langchain/src/document_loaders/fs/directory.ts
@@ -15,12 +15,14 @@ export const UnknownHandling = {
export type UnknownHandling =
(typeof UnknownHandling)[keyof typeof UnknownHandling];

export interface LoadersMapping {
[extension: string]: (filePath: string) => BaseDocumentLoader;
}

export class DirectoryLoader extends BaseDocumentLoader {
constructor(
public directoryPath: string,
public loaders: {
[extension: string]: (filePath: string) => BaseDocumentLoader;
},
public loaders: LoadersMapping,
public recursive: boolean = true,
public unknown: UnknownHandling = UnknownHandling.Warn
) {
121 changes: 92 additions & 29 deletions langchain/src/document_loaders/fs/unstructured.ts
@@ -1,25 +1,73 @@
import type { basename as BasenameT } from "node:path";
import type { readFile as ReaFileT } from "node:fs/promises";
import { DirectoryLoader, UnknownHandling } from "./directory.js";
import type { readFile as ReadFileT } from "node:fs/promises";
import {
DirectoryLoader,
UnknownHandling,
LoadersMapping,
} from "./directory.js";
import { getEnv } from "../../util/env.js";
import { Document } from "../../document.js";
import { BaseDocumentLoader } from "../base.js";

interface Element {
const UNSTRUCTURED_API_FILETYPES = [
".txt",
".text",
".pdf",
".docx",
".doc",
".jpg",
".jpeg",
".eml",
".html",
".md",
".pptx",
".ppt",
".msg",
];

type Element = {
type: string;
text: string;
// this is purposefully loosely typed
metadata: {
[key: string]: unknown;
};
}
};

type UnstructuredLoaderOptions = {
apiKey?: string;
apiUrl?: string;
};

type UnstructuredDirectoryLoaderOptions = UnstructuredLoaderOptions & {
recursive?: boolean;
unknown?: UnknownHandling;
};

export class UnstructuredLoader extends BaseDocumentLoader {
constructor(public webPath: string, public filePath: string) {
public filePath: string;

private apiUrl = "https://api.unstructured.io/general/v0/general";

private apiKey?: string;

constructor(
filePathOrLegacyApiUrl: string,
optionsOrLegacyFilePath: UnstructuredLoaderOptions | string = {}
) {
super();
this.filePath = filePath;

this.webPath = webPath;
// Temporary shim to avoid breaking existing users
// Remove when API keys are enforced by Unstructured and existing code will break anyway
const isLegacySyntax = typeof optionsOrLegacyFilePath === "string";
if (isLegacySyntax) {
this.filePath = optionsOrLegacyFilePath;
this.apiUrl = filePathOrLegacyApiUrl;
} else {
this.filePath = filePathOrLegacyApiUrl;
this.apiKey = optionsOrLegacyFilePath.apiKey;
this.apiUrl = optionsOrLegacyFilePath.apiUrl ?? this.apiUrl;
}
}
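
The constructor shim above can be read as a small pure function. The sketch below restates that branching outside the class so the two call styles can be compared side by side. `resolveLoaderArgs`, `ShimOptions`, and `ResolvedLoaderArgs` are illustrative names that do not exist in `langchain`; the default URL is copied from the `apiUrl` field initializer in this diff.

```typescript
// Illustrative restatement of the constructor shim; not library code.
type ShimOptions = {
  apiKey?: string;
  apiUrl?: string;
};

type ResolvedLoaderArgs = {
  filePath: string;
  apiUrl: string;
  apiKey?: string;
};

const DEFAULT_API_URL = "https://api.unstructured.io/general/v0/general";

function resolveLoaderArgs(
  filePathOrLegacyApiUrl: string,
  optionsOrLegacyFilePath: ShimOptions | string = {}
): ResolvedLoaderArgs {
  if (typeof optionsOrLegacyFilePath === "string") {
    // Legacy call style: (apiUrl, filePath). No API key can be passed.
    return {
      filePath: optionsOrLegacyFilePath,
      apiUrl: filePathOrLegacyApiUrl,
    };
  }
  // New call style: (filePath, { apiKey, apiUrl }).
  return {
    filePath: filePathOrLegacyApiUrl,
    apiUrl: optionsOrLegacyFilePath.apiUrl ?? DEFAULT_API_URL,
    apiKey: optionsOrLegacyFilePath.apiKey,
  };
}
```

Because the only discriminator is whether the second argument is a string, the legacy branch has no way to carry an API key, which matches the comment that the shim can go once keys are enforced.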

async _partition() {
@@ -34,9 +82,14 @@ export class UnstructuredLoader extends BaseDocumentLoader {
const formData = new FormData();
formData.append("files", new Blob([buffer]), fileName);

const response = await fetch(this.webPath, {
const headers = {
"UNSTRUCTURED-API-KEY": this.apiKey ?? "",
};

const response = await fetch(this.apiUrl, {
method: "POST",
body: formData,
headers,
});
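
The request above always sends `UNSTRUCTURED-API-KEY`, using an empty string when no key is configured. A possible variant, sketched below, leaves the header out when there is no key. This is an alternative and not what the commit does, and the source does not say whether the API treats an empty header differently from a missing one. `buildUnstructuredHeaders` is a hypothetical helper name.

```typescript
// Hypothetical helper: builds the headers for the Unstructured API call,
// adding the key header only when a non-empty key is available.
function buildUnstructuredHeaders(apiKey?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (apiKey) {
    headers["UNSTRUCTURED-API-KEY"] = apiKey;
  }
  // Content-Type is deliberately not set: with a FormData body, fetch
  // generates the multipart/form-data header and its boundary itself.
  return headers;
}
```

Leaving `Content-Type` to `fetch` is in line with the "Remove bad content type header" item in the commit message.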

if (!response.ok) {
@@ -77,7 +130,7 @@ export class UnstructuredLoader extends BaseDocumentLoader {
}

async imports(): Promise<{
readFile: typeof ReaFileT;
readFile: typeof ReadFileT;
basename: typeof BasenameT;
}> {
try {
@@ -95,27 +148,37 @@ export class UnstructuredLoader extends BaseDocumentLoader {

export class UnstructuredDirectoryLoader extends DirectoryLoader {
constructor(
public webPath: string,
public directoryPath: string,
public recursive: boolean = true,
public unknown: UnknownHandling = UnknownHandling.Warn
directoryPathOrLegacyApiUrl: string,
optionsOrLegacyDirectoryPath: UnstructuredDirectoryLoaderOptions | string,
legacyOptionRecursive = true,
legacyOptionUnknown: UnknownHandling = UnknownHandling.Warn
) {
const loaders = {
".txt": (p: string) => new UnstructuredLoader(webPath, p),
".text": (p: string) => new UnstructuredLoader(webPath, p),
".pdf": (p: string) => new UnstructuredLoader(webPath, p),
".docx": (p: string) => new UnstructuredLoader(webPath, p),
".doc": (p: string) => new UnstructuredLoader(webPath, p),
".jpg": (p: string) => new UnstructuredLoader(webPath, p),
".jpeg": (p: string) => new UnstructuredLoader(webPath, p),
".eml": (p: string) => new UnstructuredLoader(webPath, p),
".html": (p: string) => new UnstructuredLoader(webPath, p),
".md": (p: string) => new UnstructuredLoader(webPath, p),
".pptx": (p: string) => new UnstructuredLoader(webPath, p),
".ppt": (p: string) => new UnstructuredLoader(webPath, p),
".msg": (p: string) => new UnstructuredLoader(webPath, p),
};
super(directoryPath, loaders, recursive, unknown);
let directoryPath;
let options: UnstructuredDirectoryLoaderOptions;
// Temporary shim to avoid breaking existing users
// Remove when API keys are enforced by Unstructured and existing code will break anyway
const isLegacySyntax = typeof optionsOrLegacyDirectoryPath === "string";
if (isLegacySyntax) {
directoryPath = optionsOrLegacyDirectoryPath;
options = {
apiUrl: directoryPathOrLegacyApiUrl,
recursive: legacyOptionRecursive,
unknown: legacyOptionUnknown,
};
} else {
directoryPath = directoryPathOrLegacyApiUrl;
options = optionsOrLegacyDirectoryPath;
}
const loader = (p: string) => new UnstructuredLoader(p, options);
const loaders = UNSTRUCTURED_API_FILETYPES.reduce(
(loadersObject: LoadersMapping, filetype: string) => {
// eslint-disable-next-line no-param-reassign
loadersObject[filetype] = loader;
return loadersObject;
},
{}
);
super(directoryPath, loaders, options.recursive, options.unknown);
}
}
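
The `reduce` call above assigns one shared factory to every extension in `UNSTRUCTURED_API_FILETYPES`. A minimal sketch of the same idea with a plain loop follows. `buildLoadersMapping` and `LoadersMappingSketch` are hypothetical names, and the factory returns a generic `T` in place of `BaseDocumentLoader` so the block stays self-contained.

```typescript
// Generic stand-in for the LoadersMapping interface from directory.ts.
type LoadersMappingSketch<T> = {
  [extension: string]: (filePath: string) => T;
};

// Maps every extension to the same factory function.
function buildLoadersMapping<T>(
  filetypes: string[],
  factory: (filePath: string) => T
): LoadersMappingSketch<T> {
  const loaders: LoadersMappingSketch<T> = {};
  for (const filetype of filetypes) {
    loaders[filetype] = factory;
  }
  return loaders;
}
```

Compared with `reduce`, the loop avoids reassigning a callback parameter, so it would not need the `no-param-reassign` lint exception; both produce the same mapping.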

7 changes: 5 additions & 2 deletions langchain/src/document_loaders/tests/s3.test.ts
@@ -47,13 +47,16 @@ test("Test S3 loader", async () => {
});

const result = await loader.load();
const unstructuredOptions = {
apiUrl: "http://localhost:8000/general/v0/general",
};

expect(fsMock.mkdtempSync).toHaveBeenCalled();
expect(fsMock.mkdirSync).toHaveBeenCalled();
expect(fsMock.writeFileSync).toHaveBeenCalled();
expect(UnstructuredLoaderMock).toHaveBeenCalledWith(
"http://localhost:8000/general/v0/general",
path.join("tmp", "s3fileloader-12345", "AccountingOverview.pdf")
path.join("tmp", "s3fileloader-12345", "AccountingOverview.pdf"),
unstructuredOptions
);
expect(result).toEqual(["fake document"]);
});
44 changes: 42 additions & 2 deletions langchain/src/document_loaders/tests/unstructured.int.test.ts
@@ -7,7 +7,7 @@ import {
UnknownHandling,
} from "../fs/unstructured.js";

test("Test Unstructured base loader", async () => {
test("Test Unstructured base loader legacy syntax", async () => {
const filePath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),
"./example_data/example.txt"
@@ -25,7 +25,26 @@ test("Test Unstructured base loader", async () => {
}
});

test("Test Unstructured directory loader", async () => {
test("Test Unstructured base loader", async () => {
const filePath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),
"./example_data/example.txt"
);

const options = {
apiKey: "MY_API_KEY",
};

const loader = new UnstructuredLoader(filePath, options);
const docs = await loader.load();

expect(docs.length).toBe(3);
for (const doc of docs) {
expect(typeof doc.pageContent).toBe("string");
}
});

test("Test Unstructured directory loader legacy syntax", async () => {
const directoryPath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),
"./example_data"
@@ -38,6 +57,27 @@ test("Test Unstructured directory loader", async () => {
UnknownHandling.Ignore
);
const docs = await loader.load();
expect(docs.length).toBe(619);
expect(typeof docs[0].pageContent).toBe("string");
});

test("Test Unstructured directory loader", async () => {
const directoryPath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),
"./example_data"
);

const options = {
apiKey: "MY_API_KEY",
};

const loader = new UnstructuredDirectoryLoader(
directoryPath,
options,
true,
UnknownHandling.Ignore
);
const docs = await loader.load();

expect(docs.length).toBe(619);
expect(typeof docs[0].pageContent).toBe("string");
5 changes: 3 additions & 2 deletions langchain/src/document_loaders/web/s3.ts
@@ -93,9 +93,10 @@ export class S3Loader extends BaseDocumentLoader {
}

try {
const options = { apiUrl: this.unstructuredAPIURL };
const unstructuredLoader = new this._UnstructuredLoader(
this.unstructuredAPIURL,
filePath
filePath,
options
);

const docs = await unstructuredLoader.load();
