
Web Scraper

Fetches and aggregates content from one or more web pages, with optional recursive crawling up to a limited depth. It can also run one or multiple natural-language queries against the scraped pages and return formatted answers alongside the collected documents.

Usage

Use this node to gather information from specified URLs, optionally follow links to related pages, and produce concise answers to research questions based on the scraped content. Typical workflows include competitive research, technical documentation review, or compiling references where you provide multiple URLs and one or more queries to be answered from the collected data.

Inputs

| Field | Required | Type | Description | Example |
|---|---|---|---|---|
| urls | True | STRING | One or multiple starting URLs to scrape. Provide each URL on its own line. | https://example.com https://docs.example.com/guide |
| max_depth | True | INT | Maximum crawl depth for following links from the provided pages. Depth 0 scrapes only the listed URLs; higher values follow links recursively within the limit. | 1 |
| queries | True | STRING | Optional research questions to answer from the scraped pages. Provide each query on its own line. Leave empty if you only want documents. | What are the key features? Summarize the installation steps. |
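As a sketch, the inputs above can be assembled programmatically before invoking the node. The field names come from the table; the `build_scraper_inputs` helper itself is hypothetical and only illustrates the newline-separated format and the 0–3 depth range:

```python
def build_scraper_inputs(urls, max_depth=1, queries=()):
    """Hypothetical helper: assemble the node's input fields.

    Multiple URLs and queries are newline-separated; max_depth
    must be between 0 and 3 (see Important Notes).
    """
    urls = [u.strip() for u in urls if u.strip()]
    if not urls:
        raise ValueError("Provide at least one non-empty URL")
    if not 0 <= max_depth <= 3:
        raise ValueError("max_depth must be between 0 and 3")
    return {
        "urls": "\n".join(urls),
        "max_depth": max_depth,
        "queries": "\n".join(q.strip() for q in queries if q.strip()),
    }

inputs = build_scraper_inputs(
    ["https://example.com", "https://docs.example.com/guide"],
    max_depth=1,
    queries=["What are the key features?"],
)
```

Validating the URL list and depth up front mirrors the node's own error behavior described under Troubleshooting.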

Outputs

| Field | Type | Description | Example |
|---|---|---|---|
| answers | STRING | A formatted text block containing each query and its corresponding answer based on the scraped content. If a query fails, the response notes the exception. | **Query** What are the key features? **Answer** The product offers A, B, and C with integrations for X and Y. |
| documents | STRING | A JSON-formatted string array of scraped documents with content and metadata. Entries may include notes if a page was skipped due to an error. | `[{"name": "https://example.com", "content": "...scraped text..."}, {"name": "Error", "content": "Page was skipped due to exception: ..."}]` |

Important Notes

  • Depth limit: max_depth accepts values from 0 to 3. Depth 0 scrapes only the provided URLs.
  • Timeouts: Very large pages or deep crawls can lead to timeouts. Keep depth conservative for large sites.
  • Queries are optional: Although queries appears as an input field, you can leave it empty to collect documents only.
  • Multiple items: Separate multiple URLs or queries by line breaks.
  • Error propagation: If a page or query fails, the outputs include explanatory messages embedded in the corresponding document or answer.
  • Access restrictions: Pages that require authentication, present a CAPTCHA, or are blocked by robots.txt may not be scraped successfully.
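To illustrate how the depth limit bounds a crawl, here is a minimal breadth-first sketch. This is not the node's actual implementation; `extract_links` is a hypothetical callback standing in for fetching a page and extracting its links:

```python
from collections import deque

def crawl(start_urls, max_depth, extract_links):
    """Breadth-first crawl bounded by max_depth.

    Depth 0 visits only start_urls; each additional depth level
    follows links one hop further. `extract_links` is a hypothetical
    callback that returns the links found on a page.
    """
    visited, order = set(), []
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue  # skip pages already scraped via another path
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            for link in extract_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return order
```

The `visited` set prevents re-scraping pages that are linked from multiple places, which is also why crawl cost grows quickly with depth on heavily interlinked sites.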

Troubleshooting

  • Empty URLs error: If you see an error about missing URLs, ensure at least one non-empty URL is provided (one per line).
  • Timeouts: Reduce max_depth, limit the number of starting URLs, or target smaller pages to avoid timeouts.
  • Blocked or private pages: If documents show errors or missing content, verify the pages are publicly accessible and not behind login or anti-bot protection.
  • Broken links: At higher depths, linked pages might be unavailable. Lower the depth or remove problematic starting URLs.
  • Unhelpful answers: Provide more specific queries or add more relevant URLs so the node has enough context to answer well.
  • Malformed output JSON: If downstream parsing fails, ensure you pass the 'documents' output (a JSON string) to a step expecting JSON, or parse it before use.
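Following the last point, the documents output is a JSON string, so a downstream step might parse it defensively before use. A sketch, assuming the error-entry format shown in the Outputs table (`parse_documents` is a hypothetical helper, not part of the node):

```python
import json

def parse_documents(documents_output):
    """Parse the node's documents output (a JSON string) into a list,
    separating scraped pages from error placeholders whose name is "Error".
    """
    try:
        docs = json.loads(documents_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"documents output is not valid JSON: {exc}") from exc
    pages = [d for d in docs if d.get("name") != "Error"]
    errors = [d for d in docs if d.get("name") == "Error"]
    return pages, errors
```

Splitting out the error entries lets a pipeline log skipped pages while still processing the content that was successfully scraped.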

Example Pipelines
