AlekSystem Workflow Detail

Domain-Specific Web Content Crawler with Depth Control & Text Extraction Workflow Solution

Domain-Specific Web Content Crawler with Depth Control & Text Extraction

This template implements a recursive web crawler inside AlekSystem.

Rank 54 Verified workflow

Workflow overview

Why this workflow matters

Useful for software delivery and engineering operations. Supports knowledge capture and document intelligence use cases.

This template implements a recursive web crawler inside AlekSystem. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook. 🚀 How It Works 1) Webhook Trigger Accepts a JSON body with a url field. Example payload: { "url": "https://example.com" } 2) Initialization Sets crawl parameters: url, domain, maxDepth = 3, and depth = 0. Initializes global static data (pending, visited, queued, pages). 3) Recursive Crawling Fetches each page (HTTP Request). Extracts body text and links (HTML node). Cleans and deduplicates links. Filters out: External domains (only same-site is followed) Anchors (#), mailto/tel/javascript links Non-HTML files (.pdf, .docx, .xlsx, .pptx) 4) Depth Control & Queue Tracks visited URLs Stops at maxDepth to prevent infinite loops Uses SplitInBatches to loop the queue 5) Data Collection Saves each crawled page (url, depth, content) into pages[] When pending = 0, combines results 6) Output Responds via the Webhook node with: combinedContent (all pages concatenated) pages[] (array of individual results) Large results are chunked when exceeding ~12,000 characters 🛠️ Setup Instructions 1) Import Template Load from AlekSystem Community Templates. 2) Configure Webhook Open the Webhook node Copy the Test URL (development) or Production URL (after deploy) You’ll POST crawl requests to this endpoint 3) Run a Test Send a POST with JSON: curl -X POST https://<your-AlekSystem>/webhook/<id> \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}' 4) View Response The crawler returns a JSON object containing combinedContent and pages[]. ⚙️ Configuration maxDepth** Default: 3. Adjust in the Init Crawl Params (Set) node. Timeouts** HTTP Request node timeout is 5 seconds per request; increase if needed. Filtering Rules** Only same-domain links are followed (apex and www treated as same-site) Skips anchors, mailto:, tel:, javascript: Skips document links (.pdf, .docx, .xlsx, .pptx) You can tweak the regex and logic in Queue & Dedup Links (Code) node 📌 Limitations No JavaScript rendering (static HTML only) No authentication/cookies/session handling Large sites can be slow or hit timeouts; chunking mitigates response size ✅ Example Use Cases Extract text across your site for AI ingestion / embeddings SEO/content audit and internal link checks Build a lightweight page corpus for downstream processing in AlekSystem ⏱️ Estimated Setup Time ~10 minutes (import → set webhook → test request)

Best fit

Categories

AI/MLDevOpsMarketingDocument Ops

Services

Use cases

content automationengineering workflow automationdocument intelligence