# Sitemap & Automated Crawling

The Sitemap Management page lets you point Lyntaris at public websites or internal domains and convert their HTML DOM trees into an autonomous, self-updating Knowledge Base. By combining headless browser scraping with a specialized HTML-to-Markdown LLM prompt, you avoid manual data entry.

## Domains and Discovery

The autonomous pipeline starts by fetching raw HTML.

  • Add Domain: Enter a base URL (e.g., lyntaris.com). This instructs the Flowise backend to initialize a Playwright mapping instance.
  • Discovery Job: Clicking Discover New Pages dispatches a background HTTP web crawler. The crawler extracts all nested <a href> links up to a configured depth.
  • Domain Files: Once a page is discovered, Flowise writes an empty .md (Markdown) file into the C:\Deployments\_crawl_data local directory. This file acts as a placeholder for the future scraped content.
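The discovery step described above can be pictured as a depth-limited, breadth-first walk over `<a href>` links. The following is only a minimal sketch using the Python standard library, not the Flowise crawler itself: the names `LinkParser`, `discover`, `fetch`, and `max_depth` are hypothetical, and the example uses an in-memory dictionary of pages instead of real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover(start_url, fetch, max_depth=2):
    """Breadth-first link discovery that stays on the start URL's host.

    `fetch` is any callable that maps a URL to an HTML string, and
    `max_depth` stands in for the "configured depth" (assumed name).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" suffixes
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)


# Usage with an in-memory site instead of a live domain
pages = {
    "https://example.com/": '<a href="/docs">Docs</a> <a href="https://other.org/">Ext</a>',
    "https://example.com/docs": '<a href="/docs/api#top">API</a>',
    "https://example.com/docs/api": '<a href="/docs/deep">Deep</a>',
}
found = discover("https://example.com/", lambda url: pages.get(url, ""), max_depth=2)
print(found)
```

With `max_depth=2`, the page at `/docs/api` is discovered but its own links are not followed, so `/docs/deep` never enters the list. The external link to `other.org` is skipped because it is on a different host.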

## The Scrape Tree

Not all pages are useful for an LLM's context window (e.g., /contact, /login, or /cart).

  • Toggles: The UI provides an interactive DOM-style folder tree. You can toggle specific routes to "Use" (crawl and index), "Ignore" (skip), or "Delete" (purge from the filesystem).
  • Process Images Override: Forces the crawler to extract the src attributes of any <img> tags found on that specific URL and convert them into multimodal metadata, so the Avatar can display the images to the user later.
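As a rough illustration of what extracting `<img>` sources from a page involves, here is a minimal sketch using Python's standard `html.parser`. The metadata shape (a list of dictionaries with `src` and `alt` keys) and the names `ImageParser` and `extract_images` are assumptions made for this example; the format Lyntaris actually stores is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageParser(HTMLParser):
    """Collects src/alt pairs from <img> tags (illustrative helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:  # ignore <img> tags that carry no usable source
            self.images.append({
                # Resolve relative paths against the page URL
                "src": urljoin(self.page_url, src),
                "alt": attributes.get("alt") or "",
            })


def extract_images(page_url, html):
    """Return image metadata for one page as a list of dictionaries."""
    parser = ImageParser(page_url)
    parser.feed(html)
    return parser.images


# Usage on a small HTML fragment
html = (
    '<p><img src="/img/logo.png" alt="Logo">'
    '<img alt="no source">'
    '<img src="https://cdn.example.com/a.jpg"/></p>'
)
images = extract_images("https://example.com/products/x", html)
print(images)
```

Resolving each `src` against the page URL matters because relative paths such as `/img/logo.png` are meaningless once the metadata is detached from the page it came from.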

## Vectorization Pipelines

Once you have curated the target list, you dispatch the heavy-lifting indexing jobs:

  • Run Pipeline: This triggers the headless browser to fetch the raw HTML of all "Use" pages. Crucially, the raw HTML is then passed through an LLM sequence (e.g., gpt-5.2) running in Python. The LLM's job is to read the messy HTML (navbars, footers, ads) and output a clean, unified Markdown document containing only the core article text. This clean markdown is saved into the previously created .md placeholder.
  • Q&A / Documents Vector Sync: After the markdown files are generated, Flowise chunks them using a RecursiveCharacterTextSplitter. These chunks are passed through the E5 model to generate vectors, which are then injected into the local Weaviate database. You can manually inspect or edit these chunks in the Data Processing tab.
  • Delta Hashes: To avoid spending LLM API credits on unchanged content, Lyntaris computes an MD5 delta hash of the raw HTML stream. The next time you click Run Pipeline, the scraper skips any webpage whose raw HTML hash matches the one recorded during the last execution.
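The delta-hash check can be sketched in a few lines with Python's `hashlib`. This is an illustration under stated assumptions rather than the Lyntaris implementation: where the hashes are persisted is not described here, so a plain dictionary keyed by URL stands in for that store, and `md5_of` and `pages_to_process` are hypothetical names.

```python
import hashlib


def md5_of(html: str) -> str:
    """Hex MD5 digest of a page's raw HTML."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def pages_to_process(fetched, previous_hashes):
    """Return (changed_urls, updated_hashes).

    `fetched` maps URL -> raw HTML from the current run;
    `previous_hashes` maps URL -> digest from the last run (assumed store).
    """
    changed = []
    updated = dict(previous_hashes)
    for url, html in fetched.items():
        digest = md5_of(html)
        if previous_hashes.get(url) != digest:
            changed.append(url)  # new or modified page: needs LLM conversion
            updated[url] = digest
    return changed, updated


# First run: nothing has been hashed yet, so every page is processed
run1 = {
    "https://example.com/": "<h1>Home</h1>",
    "https://example.com/docs": "<h1>Docs</h1>",
}
changed1, hashes = pages_to_process(run1, {})

# Second run: only the docs page was edited
run2 = dict(run1)
run2["https://example.com/docs"] = "<h1>Docs v2</h1>"
changed2, hashes = pages_to_process(run2, hashes)
print(changed1, changed2)
```

Because the hash covers the raw HTML, any byte-level change (a rotating ad, a timestamp in the footer) counts as a modification, even if the article text is identical.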

## File Browser

This tab acts as a basic editor for the scraped .md files.

  • If the LLM failed to format a table correctly from a specific corporate webpage, you can open the extracted .md file directly in Flowise and fix the Markdown by hand. Because the Vector Sync reads directly from these files, the correction reaches the AI's RAG memory on the next sync.
