# Playwright Web Scraper

Playwright is a powerful library for browser automation that can control Chromium, Firefox, and WebKit with a single API. This module uses Playwright to extract content from web pages, including dynamic content that requires JavaScript execution.

The scraper can:

- Load content from single or multiple web pages
- Handle JavaScript-rendered content
- Support various page load strategies
- Wait for specific elements to load
- Crawl relative links from websites
- Process XML sitemaps

## Inputs

### Visual overview in Flowise (Agentflow V2)

When capturing content from web pages that rely on JavaScript (SPA, React, Vue), the Playwright Web Scraper is indispensable. Below is an overview of its parameters on the canvas:

![Playwright Web Scraper](../../../images/document-loaders/playwright_web_scraper_config_1773434865113.png)

- **URL**: The webpage URL to scrape
- **Text Splitter** (optional): A text splitter to process the extracted content
- **Get Relative Links Method** (optional): Choose between:
  - Web Crawl: Crawl relative links from an HTML URL
  - Scrape XML Sitemap: Scrape relative links from an XML sitemap URL
- **Get Relative Links Limit** (optional): Maximum number of relative links to process (default: 10; 0 for all links)
- **Wait Until** (optional): Page load strategy:
  - Load: Wait for the load event to fire
  - DOM Content Loaded: Wait for the DOMContentLoaded event
  - Network Idle: Wait until there have been no network connections for at least 500 ms
  - Commit: Wait until the initial network response is received and the document starts loading
- **Wait for selector to load** (optional): CSS selector to wait for before scraping
- **Additional Metadata** (optional): JSON object with additional metadata to add to documents
- **Omit Metadata Keys** (optional): Comma-separated list of metadata keys to omit

## Outputs

- **Document**: Array of document objects containing metadata and pageContent
- **Text**: Concatenated string from pageContent of documents

## Features

- Multi-browser engine support (Chromium, Firefox, WebKit)
- JavaScript execution support
- Configurable page load strategies
- Element wait capabilities
- Web crawling functionality
- XML sitemap processing
- Headless browser operation
- Sandbox configuration
- Error handling for invalid URLs
- Metadata customization

## Notes

- Runs in headless mode by default
- Uses no-sandbox mode for compatibility
- Invalid URLs will throw an error
- Setting the link limit to 0 retrieves all available links (may take longer)
- Supports waiting for specific DOM elements before extraction

## Scrape One URL

1. _(Optional)_ Connect a **[Text Splitter](../text-splitters/)**.
2. Enter the URL to be scraped.

## Crawl & Scrape Multiple URLs

Use **Web Crawl** in *Get Relative Links Method* to scrape multiple pages, or point the loader at an XML sitemap.

## Resources

* [LangChain JS Playwright](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/web_playwright)
* [Playwright](https://playwright.dev/)
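
## Example: Input Semantics in Code

The inputs described above can be read as a few small rules: a *Wait Until* label selects a page load strategy, a *Get Relative Links Limit* of 0 means "all links", and *Additional Metadata* / *Omit Metadata Keys* reshape each document's metadata. The sketch below illustrates those rules only; it is not Flowise's actual implementation. The names `WAIT_UNTIL_BY_LABEL`, `applyLinkLimit`, `customizeMetadata`, and `ScrapedDocument` are hypothetical, and the choice to merge additional metadata before omitting keys is an assumption.

```typescript
// Hypothetical types and helpers for illustration; not Flowise's real code.
type WaitUntil = "load" | "domcontentloaded" | "networkidle" | "commit";

interface ScrapedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// The labels shown in the node, mapped to the `waitUntil` values
// that Playwright's navigation options accept.
const WAIT_UNTIL_BY_LABEL: Record<string, WaitUntil> = {
  "Load": "load",
  "DOM Content Loaded": "domcontentloaded",
  "Network Idle": "networkidle",
  "Commit": "commit",
};

// A limit of 0 keeps every link; a positive limit truncates the list.
function applyLinkLimit(links: string[], limit: number): string[] {
  return limit > 0 ? links.slice(0, limit) : links;
}

// Assumption: additional metadata is merged first, then the keys named in
// the comma-separated `omitKeys` string are dropped from the result.
function customizeMetadata(
  doc: ScrapedDocument,
  additional: Record<string, unknown>,
  omitKeys: string
): ScrapedDocument {
  const omit = omitKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
  const merged: Record<string, unknown> = { ...doc.metadata, ...additional };
  const metadata: Record<string, unknown> = {};
  for (const key of Object.keys(merged)) {
    if (omit.indexOf(key) === -1) {
      metadata[key] = merged[key];
    }
  }
  return { pageContent: doc.pageContent, metadata };
}
```

In a script that drives Playwright directly, the mapped value would typically be passed as `page.goto(url, { waitUntil })`, with `page.waitForSelector(selector)` covering the *Wait for selector to load* input.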