
Overview

The AnySite Web Parser node provides powerful web scraping capabilities within your n8n workflows. Extract data from any website, parse HTML content, and convert unstructured web data into structured information for analysis and automation.

Node Configuration

Authentication

AnySite API Credentials (credential, required)
Select your AnySite API credentials from the dropdown or create new ones.

Available Operations

  • Parse URL
  • Bulk URL Parse
  • Smart Extraction
  • Monitor Changes

Parse URL

Extract data from a specific web page URL.

Parameters:
  • URL (required): Web page URL to scrape
  • Wait For Load: Wait time for dynamic content (0-30 seconds)
  • Extract Images: Include image URLs in the output
  • Extract Links: Include all links found on the page
  • Custom Selectors: CSS selectors for specific elements
Example Output:
{
  "page": {
    "url": "https://example.com/article",
    "title": "How to Build Scalable Web Applications",
    "description": "A comprehensive guide to building web applications...",
    "author": "John Developer",
    "publishDate": "2024-08-26",
    "content": "Building scalable web applications requires...",
    "images": [
      "https://example.com/images/architecture.png",
      "https://example.com/images/diagram.jpg"
    ],
    "links": [
      {
        "text": "Related Article",
        "url": "https://example.com/related"
      }
    ],
    "metadata": {
      "wordCount": 1250,
      "readingTime": "5 minutes",
      "tags": ["web development", "scalability", "architecture"]
    }
  }
}

Workflow Examples

Competitor Price Monitoring

  1. Monitor Competitor Pages - Set up monitoring for competitor pricing pages and product announcements.
  2. Detect Changes - Get automatic notifications when competitors change prices or launch new products.
  3. Analysis & Alerts - Analyze pricing changes and send alerts to your team with actionable insights.
  4. Strategy Updates - Use the data to adjust your own pricing strategy and competitive positioning.
Example Workflow:
{
  "nodes": [
    {
      "name": "Monitor Competitor Pricing",
      "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
      "operation": "monitorChanges",
      "parameters": {
        "url": "https://competitor.com/pricing",
        "checkInterval": 60,
        "changeThreshold": 5,
        "monitorElements": ["#pricing-table", ".product-price"]
      }
    },
    {
      "name": "Filter Significant Changes",
      "type": "n8n-nodes-base.filter",
      "parameters": {
        "conditions": [
          {
            "field": "changes[0].changePercent",
            "operation": "greaterThan", 
            "value": 10
          }
        ]
      }
    },
    {
      "name": "Analyze Price Change",
      "type": "n8n-nodes-base.function",
      "parameters": {
        "functionCode": `
          const change = items[0].json.changes[0];
          const analysis = {
            competitor: "Competitor Inc",
            product: "Enterprise Plan",
            oldPrice: change.oldValue,
            newPrice: change.newValue,
            changeAmount: change.newValue - change.oldValue,
            changePercent: change.changePercent,
            recommendation: change.changePercent > 0 ? 
              "Consider promotional pricing" : 
              "Review our pricing strategy"
          };
          return [{ json: analysis }];
        `
      }
    },
    {
      "name": "Alert Team",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#competitive-intel",
        "text": "🚨 Competitor Price Change Alert\\n📊 {{ $json.competitor }} changed {{ $json.product }} from {{ $json.oldPrice }} to {{ $json.newPrice }} ({{ $json.changePercent }}%)\\n💡 Recommendation: {{ $json.recommendation }}"
      }
    }
  ]
}

Content Research & Analysis

Automatically research and analyze content from multiple sources (a bulk-parse sketch follows this list):
  1. Industry News Monitoring - Track news sites for industry developments
  2. Competitor Content Analysis - Monitor competitor blogs and announcements
  3. Trend Research - Extract trending topics from various publications
  4. Content Gap Analysis - Find content opportunities in your niche
  5. SEO Research - Analyze top-ranking pages for target keywords
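
For example, the news monitoring and trend research steps above can start from a single Bulk URL Parse call across your source list. A minimal sketch (the source URLs are placeholders, and the downstream analysis nodes are omitted):
{
  "name": "Monitor Industry News Sources",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "bulkUrlParse",
  "parameters": {
    "urls": [
      "https://news-site-one.example/technology",
      "https://news-site-two.example/industry",
      "https://industry-blog.example/articles"
    ],
    "batchSize": 3,
    "maxRetries": 2
  }
}
The parsed items can then flow into the AI Content Analysis and database storage nodes shown later on this page.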

Lead Generation from Websites

Extract leads and contact information from business websites (a selector-based sketch follows this list):
  1. Directory Scraping - Extract business listings from directories
  2. Contact Page Parsing - Get contact information from company websites
  3. Team Page Analysis - Extract employee information and roles
  4. Technology Detection - Identify technologies used by target companies
  5. CRM Integration - Automatically add qualified leads to your CRM
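
As a starting point, the contact page parsing step above can be a single Parse URL call with custom selectors. In this sketch, the contactPageUrl field and the CSS selectors are placeholders you will need to adapt to each target site:
{
  "name": "Parse Contact Page",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "{{ $json.contactPageUrl }}",
    "extractLinks": true,
    "customSelectors": {
      "companyName": "h1",
      "contactDetails": ".contact-info",
      "teamMembers": ".team-member .name"
    }
  }
}
With extractLinks enabled, a downstream Function node can filter the returned links for mailto: entries before passing qualified leads to your CRM node.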

Advanced Parsing

Custom CSS Selectors

Extract specific elements using CSS selectors:
{
  "name": "Custom Data Extraction",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "https://news.ycombinator.com",
    "customSelectors": {
      "headlines": ".titleline > a",
      "scores": ".score",
      "comments": ".subtext a[href*='item']:last-child",
      "authors": ".hnuser"
    }
  }
}

Dynamic Content Handling

Handle JavaScript-heavy websites:
{
  "name": "Parse SPA Website",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl", 
  "parameters": {
    "url": "https://spa-website.com",
    "waitForLoad": 10,
    "waitForSelector": "#dynamic-content",
    "executeJavaScript": "document.querySelector('#load-more').click()"
  }
}

Data Transformation

Transform extracted data into structured format:
// Clean and structure scraped data
{
  "name": "Transform Data",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      const cleanText = (text) => text?.trim().replace(/\\s+/g, ' ');
      const extractPrice = (text) => {
        const match = text.match(/\\$([\\d,]+(?:\\.\\d{2})?)/);
        return match ? parseFloat(match[1].replace(',', '')) : null;
      };
      
      const transformed = items.map(item => ({
        json: {
          title: cleanText(item.json.title),
          price: extractPrice(item.json.priceText),
          description: cleanText(item.json.description),
          url: item.json.url,
          extractedAt: new Date().toISOString()
        }
      }));
      
      return transformed;
    `
  }
}

Error Handling

Common Issues

Error: 408 - Page load timeout
Solution:
  • Increase wait time for slow-loading pages
  • Check if the website is experiencing issues
  • Consider parsing the page in multiple steps
Error: 403 - Forbidden
Solution:
  • Website may be blocking automated access
  • Try using different user agents
  • Respect robots.txt and terms of service
  • Consider reaching out to site owners
Error: 429 - Too many requests
Solution:
  • Add delays between requests
  • Reduce concurrent parsing operations
  • Implement exponential backoff (see the backoff sketch at the end of this section)
  • Consider upgrading your API plan
Error: 404 - Element not found
Solution:
  • Website structure may have changed
  • Update CSS selectors
  • Add fallback selectors
  • Implement graceful degradation
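
One way to implement exponential backoff for 429 responses is to route failed items through a Function node that doubles a delay on each attempt, followed by a Wait node that pauses before looping back to the parser. This is a minimal sketch: it assumes the standard n8n Wait node, an attempt counter carried on the item, and illustrative values for the base delay and cap.
{
  "nodes": [
    {
      "name": "Compute Backoff Delay",
      "type": "n8n-nodes-base.function",
      "parameters": {
        "functionCode": `
          // Double the delay on every failed attempt, capped at 5 minutes
          const attempt = items[0].json.attempt || 1;
          const delaySeconds = Math.min(300, 5 * Math.pow(2, attempt - 1));
          return [{ json: { ...items[0].json, attempt: attempt + 1, delaySeconds } }];
        `
      }
    },
    {
      "name": "Wait Before Retry",
      "type": "n8n-nodes-base.wait",
      "parameters": {
        "resume": "timeInterval",
        "amount": "={{ $json.delaySeconds }}",
        "unit": "seconds"
      }
    }
  ]
}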

Robust Parsing

{
  "name": "Robust Web Parser",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000,
  "parameters": {
    "operation": "parseUrl",
    "url": "{{ $json.targetUrl }}",
    "fallbackSelectors": {
      "title": ["h1", ".title", ".headline", "title"],
      "content": [".content", ".article-body", "main", ".post"]
    }
  }
}

Data Quality & Validation

Content Validation

Validate extracted data quality:
// Data quality checks
{
  "name": "Validate Data Quality",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      const validateData = (data) => {
        const quality = {
          score: 0,
          issues: [],
          valid: true
        };
        
        // Check title
        if (!data.title || data.title.length < 10) {
          quality.issues.push('Title too short or missing');
          quality.valid = false;
        } else {
          quality.score += 25;
        }
        
        // Check content
        if (!data.content || data.content.length < 100) {
          quality.issues.push('Content too short or missing');
          quality.valid = false;
        } else {
          quality.score += 25;
        }
        
        // Check for duplicate content
        if (data.title === data.description) {
          quality.issues.push('Title and description are identical');
          quality.score -= 10;
        }
        
        // Check for extraction artifacts
        if (data.content.includes('javascript:') || data.content.includes('void(0)')) {
          quality.issues.push('Content contains JavaScript artifacts');
          quality.score -= 15;
        }
        
        quality.score = Math.max(0, quality.score);
        return { ...data, quality };
      };
      
      return items.map(item => ({ json: validateData(item.json) }));
    `
  }
}

Duplicate Detection

Remove duplicate content:
{
  "name": "Remove Duplicates",
  "type": "n8n-nodes-base.removeDuplicates",
  "parameters": {
    "compare": "selectedFields",
    "fieldsToCompare": ["title", "url"]
  }
}

Integration Examples

Database Storage

Store parsed data in database:
{
  "name": "Store Parsed Data",
  "type": "n8n-nodes-base.postgres",
  "parameters": {
    "operation": "insert",
    "table": "scraped_content",
    "columns": [
      "url",
      "title",
      "content", 
      "author",
      "publish_date",
      "scraped_at"
    ],
    "values": [
      "={{ $json.url }}",
      "={{ $json.title }}",
      "={{ $json.content }}",
      "={{ $json.author }}",
      "={{ $json.publishDate }}",
      "={{ new Date().toISOString() }}"
    ]
  }
}

Content Management

Add to CMS or knowledge base:
{
  "name": "Add to Notion",
  "type": "n8n-nodes-base.notion",
  "parameters": {
    "operation": "create",
    "resource": "page",
    "databaseId": "your-database-id",
    "properties": {
      "Title": "={{ $json.title }}",
      "URL": "={{ $json.url }}",
      "Content": "={{ $json.content }}",
      "Source": "Web Scraping",
      "Date": "={{ new Date().toISOString() }}"
    }
  }
}

AI Analysis

Analyze extracted content with AI:
{
  "name": "AI Content Analysis",
  "type": "n8n-nodes-base.openAi",
  "parameters": {
    "operation": "analyze",
    "prompt": "Analyze this article and provide: 1) Main topics, 2) Key insights, 3) Sentiment, 4) Target audience. Article: {{ $json.title }} - {{ $json.content }}"
  }
}

Performance Optimization

Parallel Processing

Process multiple URLs simultaneously:
{
  "name": "Parallel URL Processing",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "bulkUrlParse",
  "parameters": {
    "urls": [
      "https://site1.com",
      "https://site2.com", 
      "https://site3.com"
    ],
    "batchSize": 3,
    "maxRetries": 2
  }
}

Selective Parsing

Only parse essential elements to improve speed:
{
  "name": "Fast Essential Parsing",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "{{ $json.url }}",
    "extractImages": false,
    "extractLinks": false,
    "customSelectors": {
      "title": "h1",
      "price": ".price",
      "availability": ".stock-status"
    }
  }
}

Best Practices

Ethical Scraping

  • Always respect robots.txt files
  • Don’t overload servers with too many requests
  • Follow website terms of service
  • Consider reaching out to site owners for API access
  • Store only necessary data and respect privacy
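
To make the robots.txt rule practical inside a workflow, you can fetch the file with an HTTP Request node and filter URLs in a Function node before they reach the parser. The sketch below is deliberately small: it only checks Disallow rules in the "User-agent: *" group and assumes the raw robots.txt text has already been fetched upstream into the robotsTxt field.
{
  "name": "Check robots.txt Rules",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      // Drop URLs whose path matches a Disallow rule in the
      // "User-agent: *" group of the upstream robots.txt text.
      const allowed = [];
      for (const item of items) {
        const { robotsTxt = '', targetUrl } = item.json;
        const path = new URL(targetUrl).pathname;
        let inStarGroup = false;
        let disallowed = false;
        for (const line of robotsTxt.split('\\n')) {
          const [key, ...rest] = line.split(':');
          const value = rest.join(':').trim();
          if (/^user-agent$/i.test(key.trim())) inStarGroup = value === '*';
          if (inStarGroup && /^disallow$/i.test(key.trim()) && value && path.startsWith(value)) {
            disallowed = true;
          }
        }
        if (!disallowed) allowed.push(item);
      }
      return allowed;
    `
  }
}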

Performance Tips

  • Use batch operations for multiple URLs
  • Implement proper error handling and retries
  • Add appropriate delays between requests
  • Cache frequently accessed data
  • Monitor your API usage and quotas
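
For the caching tip above, workflow static data can act as a lightweight cache so recently parsed URLs are skipped. A minimal sketch, assuming the legacy Function node where getWorkflowStaticData is available as a built-in helper (the newer Code node exposes it as $getWorkflowStaticData), and noting that static data only persists for trigger-started workflow runs:
{
  "name": "Skip Recently Parsed URLs",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      // Use workflow static data as a simple URL cache with a 1-hour TTL.
      const cache = getWorkflowStaticData('global');
      cache.parsedPages = cache.parsedPages || {};
      const oneHourAgo = Date.now() - 60 * 60 * 1000;
      const fresh = [];
      for (const item of items) {
        const url = item.json.url;
        if (!cache.parsedPages[url] || cache.parsedPages[url] < oneHourAgo) {
          cache.parsedPages[url] = Date.now();
          fresh.push(item);
        }
      }
      return fresh;
    `
  }
}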

Data Quality

  • Validate extracted data before using it
  • Implement fallback extraction methods
  • Clean and normalize text content
  • Remove duplicate entries
  • Handle encoding and special characters properly
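
The normalization and encoding tips above can be handled in a single Function node before storage or duplicate removal. A minimal sketch; the entity list covers only the most common cases and is easy to extend:
{
  "name": "Normalize Text Encoding",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      // Decode common HTML entities, normalize unicode, and collapse
      // whitespace so duplicate checks and comparisons stay consistent.
      const entities = { '&amp;': '&', '&quot;': '"', '&#39;': "'", '&lt;': '<', '&gt;': '>', '&nbsp;': ' ' };
      const normalize = (text) =>
        (text || '')
          .replace(/&(amp|quot|#39|lt|gt|nbsp);/g, (match) => entities[match])
          .normalize('NFC')
          .replace(/\\s+/g, ' ')
          .trim();

      return items.map(item => ({
        json: {
          ...item.json,
          title: normalize(item.json.title),
          content: normalize(item.json.content)
        }
      }));
    `
  }
}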

Next Steps
