Skip to main content

Apify Dataset

This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.

Overview

Apify is a cloud platform for web scraping and data extraction, which provides an ecosystem of more than a thousand ready-made apps called Actors for various web scraping, crawling, and data extraction use cases.

This guide shows how to load documents from an Apify Dataset — a scalable append-only storage built for storing structured web scraping results, such as a list of products or Google SERPs, and then export them to various formats like JSON, CSV, or Excel.

Datasets are typically used to save results of Actors. For example, Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and then stores the text content of webpages into a dataset, from which you can feed the documents into a vector index and answer questions from it.

Setup

You'll first need to install the official Apify client:

npm install apify-client
npm install @langchain/openai @langchain/community

You'll also need to sign up and retrieve your Apify API token.

Usage

From a New Dataset

If you don't already have an existing dataset on the Apify platform, you'll need to initialize the document loader by calling an Actor and waiting for the results.

Note: Calling an Actor can take a significant amount of time, on the order of hours, or even days for large sites!

Here's an example:

import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

/*
* datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
* In the below example, the Apify dataset format looks like this:
* {
* "url": "https://apify.com",
* "text": "Apify is the best web scraping and automation platform."
* }
*/
const loader = await ApifyDatasetLoader.fromActorCall(
"apify/website-content-crawler",
{
startUrls: [{ url: "https://js.langchain.com/docs/" }],
},
{
datasetMappingFunction: (item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
}),
clientOptions: {
token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
},
}
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
[
"system",
"Answer the user's questions based on the below context:\n\n{context}",
],
["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
llm: model,
prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
retriever: vectorStore.asRetriever(),
combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
LangChain is a framework for developing applications powered by language models.
[
'https://js.langchain.com/docs/',
'https://js.langchain.com/docs/modules/chains/',
'https://js.langchain.com/docs/modules/chains/llmchain/',
'https://js.langchain.com/docs/category/functions-4'
]
*/

API Reference:

From an Existing Dataset

If you already have an existing dataset on the Apify platform, you can initialize the document loader with the constructor directly:

import { ApifyDatasetLoader } from "langchain/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

/*
* datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
* In the below example, the Apify dataset format looks like this:
* {
* "url": "https://apify.com",
* "text": "Apify is the best web scraping and automation platform."
* }
*/
const loader = new ApifyDatasetLoader("your-dataset-id", {
datasetMappingFunction: (item) =>
new Document({
pageContent: (item.text || "") as string,
metadata: { source: item.url },
}),
clientOptions: {
token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
},
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
[
"system",
"Answer the user's questions based on the below context:\n\n{context}",
],
["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
llm: model,
prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
retriever: vectorStore.asRetriever(),
combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
LangChain is a framework for developing applications powered by language models.
[
'https://js.langchain.com/docs/',
'https://js.langchain.com/docs/modules/chains/',
'https://js.langchain.com/docs/modules/chains/llmchain/',
'https://js.langchain.com/docs/category/functions-4'
]
*/

API Reference:


Was this page helpful?


You can leave detailed feedback on GitHub.