PuppeteerWebBaseLoader
Only available on Node.js.
This notebook provides a quick overview for getting started with PuppeteerWebBaseLoader. For detailed documentation of all PuppeteerWebBaseLoader features and configurations head to the API reference.
Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium. You can use Puppeteer to automate web page interactions, including extracting data from dynamic web pages that require JavaScript to render.
If you want a more lightweight solution, and the webpages you want to load do not require JavaScript to render, you can use the CheerioWebBaseLoader instead.
Overview
Integration details
Class | Package | Local | Serializable | PY support |
---|---|---|---|---|
PuppeteerWebBaseLoader | @langchain/community | ✅ | beta | ❌ |
Loader features
Source | Web Loader | Node Envs Only |
---|---|---|
PuppeteerWebBaseLoader | ✅ | ✅ |
Setup
To access the PuppeteerWebBaseLoader document loader, you'll need to install the @langchain/community integration package, along with the puppeteer peer dependency.
Credentials
If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:
# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="your-api-key"
Installation
The LangChain PuppeteerWebBaseLoader integration lives in the @langchain/community package:
- npm: npm i @langchain/community @langchain/core puppeteer
- yarn: yarn add @langchain/community @langchain/core puppeteer
- pnpm: pnpm add @langchain/community @langchain/core puppeteer
Instantiation
Now we can instantiate our loader object and load documents:
import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";
const loader = new PuppeteerWebBaseLoader("https://langchain.com", {
// required params = ...
// optional params = ...
});
Load
const docs = await loader.load();
docs[0];
Document {
pageContent: '<div class="page-wrapper"><div class="global-styles w-embed"><style>\n' +
'\n' +
'* {\n' +
' -webkit-font-smoothing: antialiased;\n' +
'}\n' +
'\n' +
'.page-wrapper {\n' +
'overflow: clip;\n' +
' }\n' +
'\n' +
'\n' +
'\n' +
'/* Set fluid size change for smaller breakpoints */\n' +
' html { font-size: 1rem; }\n' +
' @media screen and (max-width:1920px) and (min-width:1281px) { html { font-size: calc(0.2499999999999999rem + 0.6250000000000001vw); } }\n' +
' @media screen and (max-width:1280px) and (min-width:992px) { html { font-size: calc(0.41223612197028925rem + 0.4222048475371384vw); } }\n' +
'/* video sizing */\n' +
'\n' +
'video {\n' +
' object-fit: fill;\n' +
'\t\twidth: 100%;\n' +
'}\n' +
'\n' +
'\n' +
'\n' +
'#retrieval-video {\n' +
' object-fit: cover;\n' +
' width: 100%;\n' +
'}\n' +
'\n' +
'\n' +
'\n' +
'/* Set color style to inherit */\n' +
'.inherit-color * {\n' +
' color: inherit;\n' +
'}\n' +
'\n' +
'/* Focus state style for keyboard navigation for the focusable elements */\n' +
'*[tabindex]:focus-visible,\n' +
' input[type="file"]:focus-visible {\n' +
' outline: 0.125rem solid #4d65ff;\n' +
' outline-offset: 0.125rem;\n' +
'}\n' +
'\n' +
'/* Get rid of top margin on first element in any rich text element */\n' +
'.w-richtext > :not(div):first-child, .w-richtext > div:first-child > :first-child {\n' +
' margin-top: 0 !important;\n' +
'}\n' +
'\n' +
'/* Get rid of bottom margin on last element in any rich text element */\n' +
'.w-richtext>:last-child, .w-richtext ol li:last-child, .w-richtext ul li:last-child {\n' +
'\tmargin-bottom: 0 !important;\n' +
'}\n' +
'\n' +
'/* Prevent all click and hover interaction with an element */\n' +
'.pointer-events-off {\n' +
'\tpointer-events: none;\n' +
'}\n' +
'\n' +
'/* Enables all click and hover interaction with an element */\n' +
'.pointer-events-on {\n' +
' pointer-events: auto;\n' +
'}\n' +
'\n' +
'/* Create a class of .div-square which maintains a 1:1 dimension of a div */\n' +
'.div-square::after {\n' +
'\tcontent: "";\n' +
'\tdisplay: block;\n' +
'\tpadding-bottom: 100%;\n' +
'}\n' +
'\n' +
'/* Make sure containers never lose their center alignment */\n' +
'.container-medium,.container-small, .container-large {\n' +
'\tmargin-right: auto !important;\n' +
' margin-left: auto !important;\n' +
'}\n' +
'\n' +
'/* \n' +
'Make the following elements inherit typography styles from the parent and not have hardcoded values. \n' +
'Important: You will not be able to style for example "All Links" in Designer with this CSS applied.\n' +
'Uncomment this CSS to use it in the project. Leave this message for future hand-off.\n' +
'*/\n' +
'/*\n' +
'a,\n' +
'.w-input,\n' +
'.w-select,\n' +
'.w-tab-link,\n' +
'.w-nav-link,\n' +
'.w-dropdown-btn,\n' +
'.w-dropdown-toggle,\n' +
'.w-dropdown-link {\n' +
' color: inherit;\n' +
' text-decoration: inherit;\n' +
' font-size: inherit;\n' +
'}\n' +
'*/\n' +
'\n' +
'/* Apply "..." after 3 lines of text */\n' +
'.text-style-3lines {\n' +
'\tdisplay: -webkit-box;\n' +
'\toverflow: hidden;\n' +
'\t-webkit-line-clamp: 3;\n' +
'\t-webkit-box-orient: vertical;\n' +
'}\n' +
'\n' +
'/* Apply "..." after 2 lines of text */\n' +
'.text-style-2lines {\n' +
'\tdisplay: -webkit-box;\n' +
'\toverflow: hidden;\n' +
'\t-webkit-line-clamp: 2;\n' +
'\t-webkit-box-orient: vertical;\n' +
'}\n' +
'\n' +
'/* Adds inline flex display */\n' +
'.display-inlineflex {\n' +
' display: inline-flex;\n' +
'}\n' +
'\n' +
'/* These classes are never overwritten */\n' +
'.hide {\n' +
' display: none !important;\n' +
'}\n' +
'\n' +
'@media screen and (max-width: 991px) {\n' +
' .hide, .hide-tablet {\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' @media screen and (max-width: 767px) {\n' +
' .hide-mobile-landscape{\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' @media screen and (max-width: 479px) {\n' +
' .hide-mobile{\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' \n' +
'.margin-0 {\n' +
' margin: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-0 {\n' +
' padding: 0rem !important;\n' +
'}\n' +
'\n' +
'.spacing-clean {\n' +
'padding: 0rem !important;\n' +
'margin: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-top {\n' +
' margin-right: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-top {\n' +
' padding-right: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
' \n' +
'.margin-right {\n' +
' margin-top: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-right {\n' +
' padding-top: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-bottom {\n' +
' margin-top: 0rem !important;\n' +
' margin-right: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-bottom {\n' +
' padding-top: 0rem !important;\n' +
' padding-right: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-left {\n' +
' margin-top: 0rem !important;\n' +
' margin-right: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-left {\n' +
' padding-top: 0rem !important;\n' +
' padding-right: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
'}\n' +
' \n' +
'.margin-horizontal {\n' +
' margin-top: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-horizontal {\n' +
' padding-top: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-vertical {\n' +
' margin-right: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-vertical {\n' +
' padding-right: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'/* Apply "..." at 100% width */\n' +
'.truncate-width { \n' +
'\t\twidth: 100%; \n' +
' white-space: nowrap; \n' +
' overflow: hidden; \n' +
' text-overflow: ellipsis; \n' +
'}\n' +
'/* Removes native scrollbar */\n' +
'.no-scrollbar {\n' +
' -ms-overflow-style: none;\n' +
' overflow: -moz-scrollbars-none; \n' +
'}\n' +
'\n' +
'.no-scrollbar::-webkit-scrollbar {\n' +
' display: none;\n' +
'}\n' +
'\n' +
'input:checked + span {\n' +
'color: white /* styles for the div immediately following the checked input */\n' +
'}\n' +
'\n' +
'/* styles for word-wrapping\n' +
'h1, h2, h3 {\n' +
'word-wrap: break-word;\n' +
'hyphens: auto;\n' +
'}*/\n' +
'\n' +
'[nav-theme="light"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--light--logo);\n' +
'}\n' +
'\n' +
'[nav-theme="light"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--light--button-bg);\n' +
'\t--nav--button-text: var(--light--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="light"] .button.is-nav:hover {\n' +
'\t--nav--button-bg: var(--dark--button-bg);\n' +
'\t--nav--button-text:var(--dark--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--dark--logo);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--dark--button-bg);\n' +
'\t--nav--button-text: var(--dark--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .button.is-nav:hover {\n' +
'\t--nav--button-bg: var(--light--button-bg);\n' +
'\t--nav--button-text: var(--light--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="red"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--red--logo);\n' +
'}\n' +
'\n' +
'\n' +
'[nav-theme="red"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--red--button-bg);\n' +
'\t--nav--button-text: var(--red--button-text);\n' +
'}\n' +
'\n' +
'.navbar_logo-svg.is-light, .navbar_logo-svg.is-red.is-light{\n' +
'color: #F8F7FF!important;\n' +
'}\n' +
'\n' +
'.news_button[disabled] {\n' +
'background: none;\n' +
'}\n' +
'\n' +
'.product_bg-video video {\n' +
'object-fit: fill;\n' +
'}\n' +
'\n' +
'</style></div><div data-animation="default" class="navbar_component w-nav" data-easing2="ease" fs-scrolldisable-element="smart-nav" data-easing="ease" data-collapse="medium" data-w-id="78839fc1-6b85-b108-b164-82fcae730868" role="banner" data-duration="400" style="will-change: width, height; height: 10rem;"><div class="navbar_container"><a href="/" aria-current="page" class="navbar_logo-link w-nav-brand w--current" aria-label="home"><div class="navbar_logo-svg w-embed" style="color: rgb(255, 255, 255);"><svg width="100%" height="100%" viewBox="0 0 240 41" fill="none" xmlns="http://www.w3.org/2000/svg">\n' +
'<path d="M61.5139 11.1569C60.4527 11.1569 59.4549 11.568 58.708 12.3148L55.6899 15.3248C54.8757 16.1368 54.4574 17.2643 54.5431 18.4202C54.5492 18.4833 54.5553 18.5464 54.5615 18.6115C54.6696 19.4988 55.0594 20.2986 55.6899 20.9254C56.1246 21.3589 56.6041 21.6337 57.1857 21.825C57.2163 22 57.2326 22.177 57.2326 22.3541C57.2326 23.1519 56.9225 23.9008 56.3592 24.4625L56.1735 24.6477C55.1655 24.3037 54.3247 23.8011 53.5656 23.044C52.5576 22.0386 51.8903 20.7687 51.6393 19.3747L51.6046 19.1813L51.4515 19.3055C51.3475 19.3889 51.2495 19.4785 51.1577 19.57L48.1396 22.58C46.5928 24.1226 46.5928 26.636 48.1396 28.1786C48.913 28.9499 49.9292 29.3366 50.9475 29.3366C51.9658 29.3366 52.98 28.9499 53.7534 28.1786L56.7715 25.1687C58.3183 23.626 58.3183 21.1147 56.7715 19.57C56.3592 19.159 55.8675 18.8496 55.3104 18.6502C55.2798 18.469 55.2634 18.2879 55.2634 18.1109C55.2634 17.2439 55.6063 16.4217 56.2348 15.7949C57.2449 16.1388 58.1407 16.6965 58.8978 17.4515C59.9038 18.4548 60.5691 19.7227 60.8241 21.1208L60.8588 21.3141L61.0119 21.19C61.116 21.1066 61.2139 21.017 61.3078 20.9234L64.3259 17.9135C65.8727 16.3708 65.8747 13.8575 64.3259 12.3148C63.577 11.568 62.5811 11.1569 61.518 11.1569H61.5139Z" fill="CurrentColor"></path>\n' +
'<path d="M59.8966 0.148865H20.4063C9.15426 0.148865 0 9.27841 0 20.5001C0 31.7217 9.15426 40.8513 20.4063 40.8513H59.8966C71.1486 40.8513 80.3029 31.7217 80.3029 20.5001C80.3029 9.27841 71.1486 0.148865 59.8966 0.148865ZM40.4188 32.0555C39.7678 32.1898 39.0352 32.2142 38.5373 31.6953C38.3536 32.1165 37.9251 31.8947 37.5945 31.8398C37.5639 31.9252 37.5374 32.0005 37.5088 32.086C36.4089 32.1593 35.5845 31.04 35.0601 30.1954C34.0193 29.6337 32.8378 29.2918 31.7746 28.7036C31.7134 29.6724 31.9257 30.8731 31.0012 31.4979C30.9543 33.36 33.8255 31.7177 34.0887 33.1056C33.8847 33.128 33.6582 33.073 33.4949 33.2297C32.746 33.9563 31.8869 32.6803 31.0237 33.2074C29.8646 33.7894 29.7483 34.2656 28.3137 34.3857C28.2342 34.2656 28.2668 34.1862 28.3341 34.113C28.7382 33.6449 28.7668 33.0934 29.4565 32.8939C28.7464 32.782 28.1525 33.1728 27.5546 33.4821C26.7771 33.7996 26.7833 32.7657 25.5875 33.537C25.4548 33.4292 25.5181 33.3315 25.5936 33.2481C25.8976 32.8777 26.2976 32.8227 26.7486 32.8431C24.5304 31.6098 23.4856 34.3511 22.4612 32.9876C22.1531 33.069 22.0368 33.3457 21.8429 33.5411C21.6756 33.358 21.8021 33.1361 21.8103 32.9204C21.6103 32.8268 21.3572 32.782 21.4164 32.4625C21.0246 32.3302 20.7512 32.5622 20.4594 32.782C20.1961 32.5785 20.6369 32.2814 20.7185 32.0697C20.9532 31.6627 21.4878 31.9863 21.7592 31.6932C22.5306 31.2557 23.606 31.9659 24.4876 31.8459C25.1671 31.9313 26.0078 31.2353 25.667 30.5413C24.9406 29.6154 25.0691 28.4045 25.0528 27.2974C24.963 26.6522 23.4101 25.83 22.9612 25.134C22.4061 24.5072 21.9735 23.7807 21.5409 23.0664C19.9798 20.0523 20.4716 16.1795 18.5044 13.3812C17.6147 13.8717 16.4556 13.6397 15.6884 12.9823C15.2741 13.3588 15.2557 13.8513 15.2231 14.3744C14.2293 13.3833 14.3538 11.5109 15.1476 10.4079C15.4721 9.97239 15.8598 9.61421 16.2924 9.29876C16.3903 9.22754 16.423 9.15834 16.4209 9.04844C17.2066 5.52362 22.5653 6.20335 24.259 8.70044C25.4875 10.237 25.8589 12.27 27.2526 13.6967C29.1279 15.744 31.2645 17.5471 32'... 73262 more characters,
metadata: { source: 'https://langchain.com' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://langchain.com' }
Options
Here’s an explanation of the parameters you can pass to the PuppeteerWebBaseLoader constructor using the PuppeteerWebBaseLoaderOptions interface:
type PuppeteerWebBaseLoaderOptions = {
launchOptions?: PuppeteerLaunchOptions;
gotoOptions?: PuppeteerGotoOptions;
evaluate?: (page: Page, browser: Browser) => Promise<string>;
};
- launchOptions: an optional object that specifies additional options to pass to the puppeteer.launch() method. This can include options such as the headless flag to launch the browser in headless mode, or the slowMo option to slow down Puppeteer's actions to make them easier to follow.
- gotoOptions: an optional object that specifies additional options to pass to the page.goto() method. This can include options such as the timeout option to specify the maximum navigation time in milliseconds, or the waitUntil option to specify when to consider the navigation as successful.
- evaluate: an optional function that can be used to evaluate JavaScript code on the page using the page.evaluate() method. This can be useful for extracting data from the page or interacting with page elements. The function should return a Promise that resolves to a string containing the result of the evaluation.
By passing these options to the PuppeteerWebBaseLoader constructor, you can customize the behavior of the loader and use Puppeteer's powerful features to scrape and interact with web pages.
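As a rough illustration of how the three options fit together, the sketch below builds an options object whose evaluate callback returns the page's visible text rather than raw HTML. The names PageLike, extractVisibleText and loaderOptions are hypothetical, introduced only for this example. PageLike is a structural stand-in for Puppeteer's Page type so the snippet runs without a browser, and the 30000 ms timeout is an arbitrary choice.

```typescript
// Minimal structural stand-in for the part of Puppeteer's Page used here.
// This is an assumption made for the sketch; in a real project you would
// import the `Page` type from "puppeteer" instead.
interface PageLike {
  evaluate(expression: string): Promise<unknown>;
}

// Hypothetical helper: returns the page's visible text instead of its HTML.
async function extractVisibleText(page: PageLike): Promise<string> {
  // The string expression is evaluated in the browser context.
  const text = await page.evaluate("document.body.innerText");
  // Collapse runs of three or more newlines and trim surrounding whitespace.
  return String(text ?? "").replace(/\n{3,}/g, "\n\n").trim();
}

// Options object in the shape of PuppeteerWebBaseLoaderOptions described above.
const loaderOptions = {
  // Passed to puppeteer.launch(): run the browser without a visible window.
  launchOptions: { headless: true },
  // Passed to page.goto(): stop waiting once the DOM is parsed,
  // and give up after 30 seconds (an arbitrary value for this sketch).
  gotoOptions: { waitUntil: "domcontentloaded" as const, timeout: 30000 },
  // Called with the loaded page; must resolve to a string.
  evaluate: (page: PageLike) => extractVisibleText(page),
};

// Usage (needs puppeteer and network access, so it is left commented out):
// import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";
// const loader = new PuppeteerWebBaseLoader("https://langchain.com", loaderOptions);
// const docs = await loader.load();
```

Keeping the extraction logic in a separate function makes it possible to exercise it against a fake page object before running it in a real browser.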
Screenshots
To take a screenshot of a site, initialize the loader the same as above, and call the .screenshot() method. This will return an instance of Document where the page content is a base64 encoded image, and the metadata contains a source field with the URL of the page.
import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";
const loaderForScreenshot = new PuppeteerWebBaseLoader(
"https://langchain.com",
{
launchOptions: {
headless: true,
},
gotoOptions: {
waitUntil: "domcontentloaded",
},
}
);
const screenshot = await loaderForScreenshot.screenshot();
console.log(screenshot.pageContent.slice(0, 100));
console.log(screenshot.metadata);
iVBORw0KGgoAAAANSUhEUgAACWAAAAdoCAIAAAA/Q2IJAAAAAXNSR0IArs4c6QAAIABJREFUeJzsvUuzHUeSJuaPiMjMk3nOuU88
{ source: 'https://langchain.com' }
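Since the page content is a base64 string, it usually has to be decoded before it can be viewed. The sketch below shows one possible way to do that with Node's built-in Buffer and fs modules. It assumes the image is a PNG, which the iVBORw0KGgo prefix in the output above suggests. The helper names decodePngBase64 and savePng and the file name langchain.png are hypothetical and not part of the loader's API.

```typescript
import { writeFileSync } from "fs";

// The 8-byte signature that every PNG file starts with.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Hypothetical helper: decode base64 page content into raw image bytes
// and check that the result starts with the PNG signature.
function decodePngBase64(base64: string): Buffer {
  const bytes = Buffer.from(base64, "base64");
  if (bytes.length < 8 || !bytes.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error("Decoded content does not start with a PNG signature");
  }
  return bytes;
}

// Hypothetical helper: decode the screenshot and write it to disk.
// Returns the number of bytes written.
function savePng(base64: string, filePath: string): number {
  const bytes = decodePngBase64(base64);
  writeFileSync(filePath, bytes);
  return bytes.length;
}

// Usage with the `screenshot` Document from the example above
// (commented out because it depends on that example having run):
// savePng(screenshot.pageContent, "langchain.png");
```

Checking the signature before writing is a cheap guard against saving something that is not an image, for example if the content was truncated along the way.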
API reference
For detailed documentation of all PuppeteerWebBaseLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_puppeteer.PuppeteerWebBaseLoader.html