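Web Parser Node

Overview

The Anysite Web Parser node adds web scraping to your n8n workflows. It extracts data from any website, parses HTML content, and turns unstructured web data into structured information for analysis and automation.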

Node Configuration

Authentication

credential (Anysite API Credentials, required): Select your Anysite API credentials from the dropdown menu, or create a new credential.

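Available Operations

Parse URL
Bulk URL Parse
Smart Extract
Monitor Changes

Parse URL

Extract data from a specific web page URL. Parameters:

URL (required): URL of the web page to scrape
Wait For Load: Time to wait for dynamic content (0-30 seconds)
Extract Images: Include image URLs in the output
Extract Links: Include all links found on the page
Custom Selectors: CSS selectors for specific elements

Example output: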

{
  "page": {
    "url": "https://example.com/article",
    "title": "How to Build Scalable Web Applications",
    "description": "A comprehensive guide to building web applications...",
    "author": "John Developer",
    "publishDate": "2024-08-26",
    "content": "Building scalable web applications requires...",
    "images": [
      "https://example.com/images/architecture.png",
      "https://example.com/images/diagram.jpg"
    ],
    "links": [
      {
        "text": "Related Article",
        "url": "https://example.com/related"
      }
    ],
    "metadata": {
      "wordCount": 1250,
      "readingTime": "5 minutes",
      "tags": ["web development", "scalability", "architecture"]
    }
  }
}
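
Downstream nodes often want a flat record rather than the nested `page` object. The following is a minimal sketch of that reshaping, assuming the output has exactly the shape of the sample above; `flattenPage` and the flat field names are hypothetical and not part of the node.

```javascript
// Hypothetical helper: flattens the nested "page" object from the sample
// output into a single-level record. Missing fields fall back to null,
// an empty string, or a count of 0.
function flattenPage(output) {
  const page = output.page ?? {};
  const metadata = page.metadata ?? {};
  return {
    url: page.url ?? null,
    title: page.title ?? null,
    author: page.author ?? null,
    publishDate: page.publishDate ?? null,
    imageCount: (page.images ?? []).length,
    linkCount: (page.links ?? []).length,
    wordCount: metadata.wordCount ?? null,
    tags: (metadata.tags ?? []).join(", "),
  };
}

// Trimmed copy of the sample output shown above.
const samplePage = {
  page: {
    url: "https://example.com/article",
    title: "How to Build Scalable Web Applications",
    author: "John Developer",
    publishDate: "2024-08-26",
    images: [
      "https://example.com/images/architecture.png",
      "https://example.com/images/diagram.jpg",
    ],
    links: [{ text: "Related Article", url: "https://example.com/related" }],
    metadata: {
      wordCount: 1250,
      tags: ["web development", "scalability", "architecture"],
    },
  },
};

const flatPage = flattenPage(samplePage);
console.log(flatPage);
```

A flat record like this is easier to map onto database columns or spreadsheet rows later in the workflow.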

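Bulk URL Parse

Parse multiple URLs in a single request. Parameters:

URLs (required): Array of URLs to parse
Batch Size: Number of URLs to process concurrently
Fail on Error: Stop processing if any URL fails
Include Screenshots: Capture page screenshots

Example output: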

{
  "results": [
    {
      "url": "https://site1.com",
      "status": "success",
      "title": "Site 1 Title",
      "content": "Page content...",
      "loadTime": 1200
    },
    {
      "url": "https://site2.com",
      "status": "error",
      "error": "Page not found",
      "loadTime": 800
    }
  ],
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1,
    "avgLoadTime": 1000
  }
}
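
The `summary` block can be reproduced from the `results` array, which is useful when you filter results and need fresh totals. This is a sketch that assumes the field names in the sample above (`status`, `loadTime`); `summarizeResults` is a hypothetical helper, and rounding the average is an assumption.

```javascript
// Hypothetical helper: rebuilds the summary object from a results array.
function summarizeResults(results) {
  const successful = results.filter((r) => r.status === "success").length;
  const totalLoadTime = results.reduce((sum, r) => sum + (r.loadTime ?? 0), 0);
  return {
    total: results.length,
    successful,
    failed: results.length - successful,
    // Avoid dividing by zero when the array is empty.
    avgLoadTime:
      results.length > 0 ? Math.round(totalLoadTime / results.length) : 0,
  };
}

// The two entries from the sample output above.
const sampleResults = [
  { url: "https://site1.com", status: "success", loadTime: 1200 },
  { url: "https://site2.com", status: "error", error: "Page not found", loadTime: 800 },
];

const bulkSummary = summarizeResults(sampleResults);
console.log(bulkSummary);
```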

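Smart Extract

Automatically detect and extract structured data from a web page. Parameters:

URL (required): URL of the web page
Data Type: "article", "product", "event", "person", or "organization"
Language: Expected content language
Include Schema: Extract schema.org structured data

Example output: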

{
  "extracted": {
    "type": "article",
    "title": "The Future of AI Development",
    "author": {
      "name": "Dr. Sarah Chen",
      "bio": "AI researcher and author",
      "social": {
        "twitter": "@sarahchen_ai",
        "linkedin": "sarah-chen-ai"
      }
    },
    "article": {
      "headline": "The Future of AI Development",
      "summary": "Exploring trends and innovations in AI...",
      "content": "Full article content...",
      "publishDate": "2024-08-26T09:00:00Z",
      "category": "Technology",
      "tags": ["AI", "Machine Learning", "Future Tech"]
    },
    "schema": {
      "@type": "Article",
      "author": "Dr. Sarah Chen",
      "datePublished": "2024-08-26"
    }
  }
}

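Monitor Changes

Monitor a web page for changes and get notified. Parameters:

URL (required): Web page to monitor
Check Interval: How often to check for changes (minutes)
Change Threshold: Minimum percentage change that triggers an alert
Monitor Elements: Specific CSS selectors to monitor
Notification Method: "webhook", "email", or "return_data"

Example output: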

{
  "monitoring": {
    "url": "https://competitor.com/pricing",
    "lastChecked": "2024-08-26T15:30:00Z",
    "changes": [
      {
        "element": "#pricing-table",
        "changeType": "content",
        "oldValue": "$99/month",
        "newValue": "$89/month",
        "changePercent": 11.1,
        "timestamp": "2024-08-26T15:30:00Z"
      }
    ],
    "screenshot": {
      "before": "https://cdn.hdw.ai/screenshots/before_123.png",
      "after": "https://cdn.hdw.ai/screenshots/after_123.png"
    }
  }
}
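
When the notification method returns data to the workflow, you may want to apply a stricter threshold in a later step. The sketch below assumes the `monitoring.changes` shape from the sample above; `significantChanges` is a hypothetical helper, and the second change entry is made up for illustration.

```javascript
// Hypothetical helper: keeps only the changes whose percentage meets the
// threshold. Math.abs treats increases and decreases the same way.
function significantChanges(monitoring, thresholdPercent) {
  return (monitoring.changes ?? []).filter(
    (change) => Math.abs(change.changePercent) >= thresholdPercent
  );
}

const sampleMonitoring = {
  url: "https://competitor.com/pricing",
  changes: [
    { element: "#pricing-table", oldValue: "$99/month", newValue: "$89/month", changePercent: 11.1 },
    // Illustrative entry, not from the sample output.
    { element: ".footer-note", oldValue: "v1", newValue: "v2", changePercent: 2.0 },
  ],
};

const alerts = significantChanges(sampleMonitoring, 5);
console.log(alerts.map((c) => c.element));
```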

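Workflow Examples

Competitor Price Monitoring

Monitor competitor pages

Set up monitoring for competitor pricing pages and product announcements.

Detect changes

Get automatic notifications when a competitor changes prices or launches a new product.

Analyze and alert

Analyze price changes and send alerts with actionable insights to your team.

Update strategy

Use the data to adjust your own pricing strategy and competitive positioning.

Workflow example: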

{
  "nodes": [
    {
      "name": "Monitor Competitor Pricing",
      "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
      "operation": "monitorChanges",
      "parameters": {
        "url": "https://competitor.com/pricing",
        "checkInterval": 60,
        "changeThreshold": 5,
        "monitorElements": ["#pricing-table", ".product-price"]
      }
    },
    {
      "name": "Filter Significant Changes",
      "type": "n8n-nodes-base.filter",
      "parameters": {
        "conditions": [
          {
            "field": "changes[0].changePercent",
            "operation": "greaterThan",
            "value": 10
          }
        ]
      }
    },
    {
      "name": "Analyze Price Change",
      "type": "n8n-nodes-base.function",
      "parameters": {
        "functionCode": `
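          // Values arrive as display strings such as "$99/month", so parse
          // the numeric part before doing arithmetic on them.
          const parsePrice = (text) => {
            const match = String(text).match(/[\\d,]+(?:\\.\\d+)?/);
            return match ? parseFloat(match[0].replace(/,/g, '')) : NaN;
          };
          const change = items[0].json.changes[0];
          const analysis = {
            competitor: "Competitor Inc",
            product: "Enterprise Plan",
            oldPrice: change.oldValue,
            newPrice: change.newValue,
            changeAmount: parsePrice(change.newValue) - parsePrice(change.oldValue),
            changePercent: change.changePercent,
            recommendation: change.changePercent > 0 ?
              "Consider promotional pricing" :
              "Review our pricing strategy"
          };
          return [{ json: analysis }];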
        `
      }
    },
    {
      "name": "Alert Team",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#competitive-intel",
        "text": "🚨 Competitor Price Change Alert\\n📊 {{ $json.competitor }} changed {{ $json.product }} from {{ $json.oldPrice }} to {{ $json.newPrice }} ({{ $json.changePercent }}%)\\n💡 Recommendation: {{ $json.recommendation }}"
      }
    }
  ]
}

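Content Research and Analysis

Automatically research and analyze content from multiple sources:

Industry news monitoring - Track industry developments on news sites
Competitor content analysis - Monitor competitor blogs and announcements
Trend research - Extract trending topics from a range of publications
Content gap analysis - Find content opportunities in your niche
SEO research - Analyze top-ranking pages for target keywords

Lead Generation from Websites

Extract leads and contact information from business websites:

Directory scraping - Extract business listings from directories
Contact page parsing - Get contact information from company websites
Team page analysis - Extract employee information and roles
Technology detection - Identify the technologies a target company uses
CRM integration - Automatically add qualified leads to your CRM

Advanced Parsing

Custom CSS Selectors

Use CSS selectors to extract specific elements: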

{
  "name": "Custom Data Extraction",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "https://news.ycombinator.com",
    "customSelectors": {
      "headlines": ".titleline > a",
      "scores": ".score",
      "comments": ".subtext a[href*='item']:last-child",
      "authors": ".hnuser"
    }
  }
}
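
If each custom selector comes back as an array of matched values (an assumption, since the output shape for `customSelectors` is not shown here), the parallel arrays can be combined into one record per row. `zipColumns` and the sample values below are hypothetical.

```javascript
// Hypothetical helper: turns { headlines: [...], scores: [...] } into
// [{ headlines, scores }, ...], padding shorter columns with null.
function zipColumns(columns) {
  const keys = Object.keys(columns);
  const length = Math.max(0, ...keys.map((key) => columns[key].length));
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const key of keys) {
      row[key] = columns[key][i] ?? null;
    }
    rows.push(row);
  }
  return rows;
}

// Made-up values standing in for selector matches.
const selectorOutput = {
  headlines: ["First story", "Second story"],
  scores: ["120 points"],
  authors: ["alice", "bob"],
};

const stories = zipColumns(selectorOutput);
console.log(stories);
```

Zipping by position only works when every row contains every element. If some rows lack a score, for example, the columns drift out of alignment, so a per-row container selector is the safer choice where the page allows it.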

Dynamic Content Handling

Handle JavaScript-heavy websites:

{
  "name": "Parse SPA Website",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "https://spa-website.com",
    "waitForLoad": 10,
    "waitForSelector": "#dynamic-content",
    "executeJavaScript": "document.querySelector('#load-more').click()"
  }
}

Data Transformation

Transform extracted data into a structured format:

// Clean and structure the scraped data
{
  "name": "Transform Data",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      const cleanText = (text) => text?.trim().replace(/\\s+/g, ' ');
      const extractPrice = (text) => {
        // Tolerate a missing priceText and strip every thousands separator
        const match = text?.match(/\\$([\\d,]+(?:\\.\\d{2})?)/);
        return match ? parseFloat(match[1].replace(/,/g, '')) : null;
      };

      const transformed = items.map(item => ({
        json: {
          title: cleanText(item.json.title),
          price: extractPrice(item.json.priceText),
          description: cleanText(item.json.description),
          url: item.json.url,
          extractedAt: new Date().toISOString()
        }
      }));

      return transformed;
    `
  }
}

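Error Handling

Common Issues

Page Load Timeout

Error: 408 - Page load timeout
Solutions:

Increase the wait time for slow-loading pages
Check whether the website is experiencing problems
Consider parsing the page in multiple steps

Access Denied

Error: 403 - Forbidden
Solutions:

The website may block automated access
Try a different user agent
Respect robots.txt and the terms of service
Consider contacting the website owner

Rate Limiting

Error: 429 - Too many requests
Solutions:

Add delays between requests
Reduce concurrent parsing operations
Implement exponential backoff
Consider upgrading your API plan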
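
Exponential backoff, mentioned in the list above, can be sketched as follows. This is an illustration rather than the node's own retry logic: `backoffDelay` and `withBackoff` are hypothetical names, and the code assumes a failed request throws an error whose `status` property holds the HTTP status code.

```javascript
// Sleeps for the given number of milliseconds.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt,
// capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Calls fn until it succeeds, retrying only on status 429 and giving up
// after maxTries calls. Any other error is rethrown immediately.
async function withBackoff(fn, options = {}) {
  const {
    maxTries = 5,
    baseMs = 1000,
    maxMs = 30000,
    sleep = defaultSleep, // injectable so the delays can be observed in tests
  } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const lastTry = attempt >= maxTries - 1;
      if (err.status !== 429 || lastTry) throw err;
      await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}

// Delays for the first five retries with the default settings.
console.log([0, 1, 2, 3, 4].map((attempt) => backoffDelay(attempt)));
```

Adding a small random jitter to each delay is a common refinement when many workflows retry against the same API at once.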

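Element Not Found

Error: 404 - Element not found
Solutions:

The website structure may have changed
Update the CSS selectors
Add fallback selectors
Implement graceful degradation

Robust Parsing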

{
  "name": "Robust Web Parser",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000,
  "parameters": {
    "operation": "parseUrl",
    "url": "{{ $json.targetUrl }}",
    "fallbackSelectors": {
      "title": ["h1", ".title", ".headline", "title"],
      "content": [".content", ".article-body", "main", ".post"]
    }
  }
}
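
The idea behind `fallbackSelectors` is to try each selector in order and keep the first one that yields text. How the node resolves them internally is not documented here, so the following is only a sketch of the concept. `resolveWithFallbacks` is hypothetical, and the `query` callback stands in for a real DOM lookup.

```javascript
// Hypothetical helper: for each field, returns the trimmed text of the first
// selector that yields a non-empty string, or null if none does.
function resolveWithFallbacks(fallbackSelectors, query) {
  const result = {};
  for (const [field, selectors] of Object.entries(fallbackSelectors)) {
    result[field] = null; // graceful degradation: the field is always present
    for (const selector of selectors) {
      const value = query(selector);
      if (typeof value === "string" && value.trim() !== "") {
        result[field] = value.trim();
        break;
      }
    }
  }
  return result;
}

// Stand-in for a parsed page: maps selectors to the text they would match.
const fakePage = { ".title": "  Release Notes  ", main: "Body text" };
const lookup = (selector) => fakePage[selector] ?? null;

const resolved = resolveWithFallbacks(
  {
    title: ["h1", ".title", ".headline", "title"],
    content: [".content", ".article-body", "main", ".post"],
    author: [".byline"],
  },
  lookup
);
console.log(resolved);
```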

Data Quality and Validation

Content Validation

Validate the quality of extracted data:

// Data quality checks
{
  "name": "Validate Data Quality",
  "type": "n8n-nodes-base.function",
  "parameters": {
    "functionCode": `
      const validateData = (data) => {
        const quality = {
          score: 0,
          issues: [],
          valid: true
        };

        // Check the title
        if (!data.title || data.title.length < 10) {
          quality.issues.push('Title too short or missing');
          quality.valid = false;
        } else {
          quality.score += 25;
        }

        // Check the content
        if (!data.content || data.content.length < 100) {
          quality.issues.push('Content too short or missing');
          quality.valid = false;
        } else {
          quality.score += 25;
        }

        // Check for duplicated content (skip when the title is missing)
        if (data.title && data.title === data.description) {
          quality.issues.push('Title and description are identical');
          quality.score -= 10;
        }

        // Check for extraction artifacts (guard against missing content)
        if (data.content && (data.content.includes('javascript:') || data.content.includes('void(0)'))) {
          quality.issues.push('Content contains JavaScript artifacts');
          quality.score -= 15;
        }

        quality.score = Math.max(0, quality.score);
        return { ...data, quality };
      };

      return items.map(item => ({ json: validateData(item.json) }));
    `
  }
}

Duplicate Detection

Remove duplicate content:

{
  "name": "Remove Duplicates",
  "type": "n8n-nodes-base.removeDuplicates",
  "parameters": {
    "compare": "selectedFields",
    "fieldsToCompare": ["title", "url"]
  }
}
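
The same comparison can be done inside a function node when you need custom logic, such as normalizing URLs first. This sketch keeps the first item for each combination of the selected fields. `dedupeByFields` is a hypothetical helper and does not describe how the Remove Duplicates node is implemented.

```javascript
// Hypothetical helper: keeps the first item for each unique combination of
// the given fields. Items use the n8n shape { json: {...} }.
function dedupeByFields(items, fields) {
  const seen = new Set();
  return items.filter((item) => {
    // JSON.stringify on an array gives an unambiguous composite key.
    const key = JSON.stringify(fields.map((f) => item.json[f] ?? null));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const scraped = [
  { json: { title: "Post A", url: "https://example.com/a", scrapedAt: 1 } },
  { json: { title: "Post A", url: "https://example.com/a", scrapedAt: 2 } },
  { json: { title: "Post B", url: "https://example.com/b", scrapedAt: 3 } },
];

const unique = dedupeByFields(scraped, ["title", "url"]);
console.log(unique.map((item) => item.json.title));
```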

Integration Examples

Database Storage

Store parsed data in a database:

{
  "name": "Store Parsed Data",
  "type": "n8n-nodes-base.postgres",
  "parameters": {
    "operation": "insert",
    "table": "scraped_content",
    "columns": [
      "url",
      "title",
      "content",
      "author",
      "publish_date",
      "scraped_at"
    ],
    "values": [
      "={{ $json.url }}",
      "={{ $json.title }}",
      "={{ $json.content }}",
      "={{ $json.author }}",
      "={{ $json.publishDate }}",
      "={{ new Date().toISOString() }}"
    ]
  }
}

Content Management

Add to a CMS or knowledge base:

{
  "name": "Add to Notion",
  "type": "n8n-nodes-base.notion",
  "parameters": {
    "operation": "create",
    "resource": "page",
    "databaseId": "your-database-id",
    "properties": {
      "Title": "={{ $json.title }}",
      "URL": "={{ $json.url }}",
      "Content": "={{ $json.content }}",
      "Source": "Web Scraping",
      "Date": "={{ new Date().toISOString() }}"
    }
  }
}

AI Analysis

Analyze extracted content with AI:

{
  "name": "AI Content Analysis",
  "type": "n8n-nodes-base.openAi",
  "parameters": {
    "operation": "analyze",
    "prompt": "Analyze this article and provide: 1) Main topics, 2) Key insights, 3) Sentiment, 4) Target audience. Article: {{ $json.title }} - {{ $json.content }}"
  }
}

Performance Optimization

Parallel Processing

Process multiple URLs at the same time:

{
  "name": "Parallel URL Processing",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "bulkUrlParse",
  "parameters": {
    "urls": [
      "https://site1.com",
      "https://site2.com",
      "https://site3.com"
    ],
    "batchSize": 3,
    "maxRetries": 2
  }
}
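
For URL lists longer than one batch, a preceding function node can split the list into groups of `batchSize`. This is a sketch of that step; `chunkUrls` is a hypothetical helper, and whether the node also batches internally is not covered here.

```javascript
// Hypothetical helper: splits a URL list into consecutive batches of at most
// batchSize entries each.
function chunkUrls(urls, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

// Seven placeholder URLs: https://site1.com through https://site7.com.
const allUrls = Array.from({ length: 7 }, (_, i) => `https://site${i + 1}.com`);

const batches = chunkUrls(allUrls, 3);
console.log(batches.map((batch) => batch.length));
```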

Selective Parsing

Parse only the elements you need to improve speed:

{
  "name": "Fast Essential Parsing",
  "type": "@horizondatawave/n8n-nodes-anysite.WebParser",
  "operation": "parseUrl",
  "parameters": {
    "url": "{{ $json.url }}",
    "extractImages": false,
    "extractLinks": false,
    "customSelectors": {
      "title": "h1",
      "price": ".price",
      "availability": ".stock-status"
    }
  }
}

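Best Practices

Ethical Scraping

Always respect robots.txt files
Do not overload servers with excessive requests
Follow each website's terms of service
Consider contacting the website owner for API access
Store only the data you need, and respect privacy

Performance Tips

Use bulk operations for multiple URLs
Implement proper error handling and retries
Add appropriate delays between requests
Cache frequently accessed data
Monitor your API usage and quota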
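
Caching frequently accessed data can be as simple as an in-memory map with an expiry time. The sketch below is one possible approach, not a feature of the node. `TtlCache` is a hypothetical class, and the injectable `now` function exists only so that expiry can be tested without waiting.

```javascript
// Hypothetical in-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Returns the cached value, or undefined if it is absent or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }
}

// Usage with a fake clock: the entry expires once 1000 ms have passed.
let fakeNow = 0;
const pageCache = new TtlCache(1000, () => fakeNow);
pageCache.set("https://example.com/article", { title: "Cached page" });
fakeNow = 999;
console.log(pageCache.get("https://example.com/article")); // still cached
fakeNow = 1000;
console.log(pageCache.get("https://example.com/article")); // undefined
```

An in-memory cache like this lives only as long as the process that holds it, so a persistent store is needed if results must survive across executions.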

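Data Quality

Validate extracted data before using it
Implement fallback extraction methods
Clean and normalize text content
Remove duplicate entries
Handle encoding and special characters correctly

Next Steps

LinkedIn Node - LinkedIn data extraction
Twitter Node - Twitter/X monitoring
Instagram Node - Instagram analytics
Workflows - Prebuilt workflow templates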
