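The Anysite Web Parser node brings web scraping to your n8n workflows. Use it to extract data from websites, parse HTML content, and turn unstructured web data into structured information for analysis and automation.
Node Configuration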
credential
Anysite API Credentials
Required
Select your Anysite API credentials from the dropdown, or create a new credential.
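Available Operations
Parse URL
Bulk URL Parse
Smart Extract
Monitor Changes
Parse URL: extracts data from a specific web page URL. Parameters:
- URL (required): URL of the web page to scrape
- Wait For Load: how long to wait for dynamic content (0-30 seconds)
- Extract Images: include image URLs in the output
- Extract Links: include all links found on the page
- Custom Selectors: CSS selectors for specific elements
Example output:
{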
"page": {
"url": "https://example.com/article",
"title": "How to Build Scalable Web Applications",
"description": "A comprehensive guide to building web applications...",
"author": "John Developer",
"publishDate": "2024-08-26",
"content": "Building scalable web applications requires...",
"images": [
"https://example.com/images/architecture.png",
"https://example.com/images/diagram.jpg"
],
"links": [
{
"text": "Related Article",
"url": "https://example.com/related"
}
],
"metadata": {
"wordCount": 1250,
"readingTime": "5 minutes",
"tags": ["web development", "scalability", "architecture"]
}
}
}
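As a rough illustration of consuming this output in a downstream step, the sketch below reduces a Parse URL result to a few summary fields. `summarizePage` is a hypothetical helper, not part of the Anysite node, and the field names are taken from the example output above rather than from a published schema.

```javascript
// Sample shaped like the Parse URL example output (values shortened).
const parseUrlOutput = {
  page: {
    url: "https://example.com/article",
    title: "How to Build Scalable Web Applications",
    images: ["https://example.com/images/architecture.png"],
    links: [{ text: "Related Article", url: "https://example.com/related" }],
    metadata: { wordCount: 1250, readingTime: "5 minutes", tags: ["web development"] }
  }
};

// Hypothetical helper: tolerates missing fields so a sparse page does not throw.
function summarizePage(output) {
  const page = output?.page ?? {};
  return {
    url: page.url ?? null,
    title: page.title ?? null,
    imageCount: (page.images ?? []).length,
    linkUrls: (page.links ?? []).map((link) => link.url),
    wordCount: page.metadata?.wordCount ?? 0
  };
}

const pageSummary = summarizePage(parseUrlOutput);
console.log(pageSummary);
```

Defaulting each field keeps later nodes from failing when Extract Images or Extract Links is turned off and the corresponding arrays are absent.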
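Bulk URL Parse: parses multiple URLs in a single request. Parameters:
- URLs (required): array of URLs to parse
- Batch Size: number of URLs processed concurrently
- Fail on Error: stop processing if any URL fails
- Include Screenshots: capture page screenshots
Example output:
{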
"results": [
{
"url": "https://site1.com",
"status": "success",
"title": "Site 1 Title",
"content": "Page content...",
"loadTime": 1200
},
{
"url": "https://site2.com",
"status": "error",
"error": "Page not found",
"loadTime": 800
}
],
"summary": {
"total": 2,
"successful": 1,
"failed": 1,
"avgLoadTime": 1000
}
}
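One way to work with a bulk result is to split it into successes and failures before passing it on. The sketch below assumes the `results` shape shown in the example above; `summarizeBatch` is a hypothetical helper that recomputes the same counters as the `summary` object so they can be cross-checked.

```javascript
// Sample copied from the Bulk URL Parse example output.
const bulkResults = [
  { url: "https://site1.com", status: "success", title: "Site 1 Title", content: "Page content...", loadTime: 1200 },
  { url: "https://site2.com", status: "error", error: "Page not found", loadTime: 800 }
];

// Hypothetical helper: partitions results by status and recomputes the summary.
function summarizeBatch(results) {
  const successful = results.filter((r) => r.status === "success");
  const failed = results.filter((r) => r.status !== "success");
  const totalLoadTime = results.reduce((sum, r) => sum + (r.loadTime ?? 0), 0);
  return {
    successful,
    failed,
    summary: {
      total: results.length,
      successful: successful.length,
      failed: failed.length,
      avgLoadTime: results.length ? totalLoadTime / results.length : 0
    }
  };
}

const batch = summarizeBatch(bulkResults);
console.log(batch.summary);
console.log(batch.failed.map((r) => `${r.url}: ${r.error}`));
```

For the sample data the recomputed summary matches the one in the example output (2 total, 1 successful, 1 failed, average load time 1000).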
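Smart Extract: automatically detects and extracts structured data from a web page. Parameters:
- URL (required): URL of the web page
- Data Type: "article", "product", "event", "person", or "organization"
- Language: expected content language
- Include Schema: extract schema.org structured data
Example output:
{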
"extracted": {
"type": "article",
"title": "The Future of AI Development",
"author": {
"name": "Dr. Sarah Chen",
"bio": "AI researcher and author",
"social": {
"twitter": "@sarahchen_ai",
"linkedin": "sarah-chen-ai"
}
},
"article": {
"headline": "The Future of AI Development",
"summary": "Exploring trends and innovations in AI...",
"content": "Full article content...",
"publishDate": "2024-08-26T09:00:00Z",
"category": "Technology",
"tags": ["AI", "Machine Learning", "Future Tech"]
},
"schema": {
"@type": "Article",
"author": "Dr. Sarah Chen",
"datePublished": "2024-08-26"
}
}
}
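Monitor Changes: watches a web page for changes and sends notifications. Parameters:
- URL (required): web page to monitor
- Check Interval: how often to check for changes (minutes)
- Change Threshold: minimum percentage change that triggers an alert
- Monitor Elements: specific CSS selectors to monitor
- Notification Method: "webhook", "email", or "return_data"
Example output:
{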
"monitoring": {
"url": "https://competitor.com/pricing",
"lastChecked": "2024-08-26T15:30:00Z",
"changes": [
{
"element": "#pricing-table",
"changeType": "content",
"oldValue": "$99/month",
"newValue": "$89/month",
"changePercent": 11.1,
"timestamp": "2024-08-26T15:30:00Z"
}
],
"screenshot": {
"before": "https://cdn.hdw.ai/screenshots/before_123.png",
"after": "https://cdn.hdw.ai/screenshots/after_123.png"
}
}
}
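Because `oldValue` and `newValue` arrive as display strings such as "$99/month", any arithmetic on them needs a parsing step first. The sketch below shows one possible approach. `parsePrice` and `describeChange` are hypothetical helpers, and the percentage here is computed relative to the old price, which is an assumption and may not match how the node derives its own `changePercent`.

```javascript
// Hypothetical helper: pulls the first number out of a price string.
function parsePrice(text) {
  const match = String(text).match(/\d[\d,]*(?:\.\d+)?/);
  return match ? parseFloat(match[0].replace(/,/g, "")) : null;
}

// Hypothetical helper: signed difference and percentage relative to the old price.
function describeChange(change) {
  const oldPrice = parsePrice(change.oldValue);
  const newPrice = parsePrice(change.newValue);
  if (oldPrice === null || newPrice === null || oldPrice === 0) return null;
  const delta = newPrice - oldPrice;
  return {
    oldPrice,
    newPrice,
    delta,
    // Rounded to one decimal place.
    percentOfOld: Math.round((delta / oldPrice) * 1000) / 10,
    direction: delta < 0 ? "decrease" : delta > 0 ? "increase" : "unchanged"
  };
}

// Values taken from the Monitor Changes example output.
const described = describeChange({ oldValue: "$99/month", newValue: "$89/month" });
console.log(described);
```

Keeping the sign of the difference makes it possible to tell a price cut from a price increase, which an unsigned percentage alone does not reveal.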
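Workflow Examples
Competitor Price Monitoring
Monitor competitor pages
Set up monitoring for competitor pricing pages and product announcements.
Detect changes
Get notified automatically when a competitor changes prices or launches a new product.
Analyze and alert
Analyze the price change and send your team an alert with actionable insights.
Example workflow: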
{
"nodes": [
{
"name": "Monitor Competitor Pricing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "monitorChanges",
"parameters": {
"url": "https://competitor.com/pricing",
"checkInterval": 60,
"changeThreshold": 5,
"monitorElements": ["#pricing-table", ".product-price"]
}
},
{
"name": "Filter Significant Changes",
"type": "n8n-nodes-base.filter",
"parameters": {
"conditions": [
{
"field": "changes[0].changePercent",
"operation": "greaterThan",
"value": 10
}
]
}
},
{
"name": "Analyze Price Change",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
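// The Monitor Changes example output nests changes under "monitoring"; fall back to the top level otherwise.
const source = items[0].json.monitoring ?? items[0].json;
const change = source.changes[0];
// oldValue and newValue are strings such as "$99/month", so extract the numeric part before subtracting.
const toNumber = (value) => parseFloat(String(value).replace(/[^0-9.]/g, ''));
const changeAmount = toNumber(change.newValue) - toNumber(change.oldValue);
const analysis = {
competitor: "Competitor Inc",
product: "Enterprise Plan",
oldPrice: change.oldValue,
newPrice: change.newValue,
changeAmount,
changePercent: change.changePercent,
// A negative amount means the competitor lowered its price.
recommendation: changeAmount < 0 ?
"Consider promotional pricing" :
"Review our pricing strategy"
};
return [{ json: analysis }];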
`
}
},
{
"name": "Alert Team",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#competitive-intel",
"text": "🚨 Competitor Price Change Alert\\n📊 {{ $json.competitor }} changed {{ $json.product }} from {{ $json.oldPrice }} to {{ $json.newPrice }} ({{ $json.changePercent }}%)\\n💡 Recommendation: {{ $json.recommendation }}"
}
}
]
}
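Content Research and Analysis
Automatically research and analyze content from multiple sources:
- Industry news monitoring - track industry developments on news sites
- Competitor content analysis - monitor competitor blogs and announcements
- Trend research - extract trending topics from a range of publications
- Content gap analysis - find content opportunities in your niche
- SEO research - analyze the top-ranking pages for your target keywords
Lead Generation from Websites
Extract leads and contact information from business websites:
- Directory scraping - extract business listings from directories
- Contact page parsing - get contact information from company websites
- Team page analysis - extract employee names and roles
- Technology detection - identify the technologies a target company uses
- CRM integration - automatically add qualified leads to your CRM
Advanced Parsing
Custom CSS Selectors
Use CSS selectors to extract specific elements: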
{
"name": "Custom Data Extraction",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "https://news.ycombinator.com",
"customSelectors": {
"headlines": ".titleline > a",
"scores": ".score",
"comments": ".subtext a[href*='item']:last-child",
"authors": ".hnuser"
}
}
}
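If each custom selector comes back as a list of matches, a common next step is to combine the lists into one record per row. The sketch below assumes exactly that: the response shape (one array of strings per selector key) is an assumption for illustration, not documented behavior, and `zipSelectorResults` is a hypothetical helper.

```javascript
// Assumed shape: one array of matched text per selector key, in document order.
const extracted = {
  headlines: ["Show HN: A tiny parser", "Ask HN: Favorite tools?"],
  scores: ["120 points"],
  authors: ["alice", "bob"]
};

// Hypothetical helper: builds one record per index, filling gaps with null.
function zipSelectorResults(data, keys) {
  const length = Math.max(0, ...keys.map((key) => (data[key] ?? []).length));
  return Array.from({ length }, (_, i) =>
    Object.fromEntries(keys.map((key) => [key, data[key]?.[i] ?? null]))
  );
}

const rows = zipSelectorResults(extracted, ["headlines", "scores", "authors"]);
console.log(rows);
```

Pairing by index only works when every row contains every element. When some rows lack a match (for example a post without a score), the lists drift out of alignment, so selecting a per-row container and reading its children is the safer design where the parser supports it.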
Dynamic Content Handling
Handle JavaScript-heavy websites:
{
"name": "Parse SPA Website",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "https://spa-website.com",
"waitForLoad": 10,
"waitForSelector": "#dynamic-content",
"executeJavaScript": "document.querySelector('#load-more').click()"
}
}
Data Transformation
Transform extracted data into a structured format:
// Clean and structure the scraped data
{
"name": "Transform Data",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
const cleanText = (text) => text?.trim().replace(/\\s+/g, ' ');
const extractPrice = (text) => {
const match = text?.match(/\\$([\\d,]+(?:\\.\\d{2})?)/);
return match ? parseFloat(match[1].replace(/,/g, '')) : null;
};
const transformed = items.map(item => ({
json: {
title: cleanText(item.json.title),
price: extractPrice(item.json.priceText),
description: cleanText(item.json.description),
url: item.json.url,
extractedAt: new Date().toISOString()
}
}));
return transformed;
`
}
}
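To see how the two helpers behave on typical inputs, here is a standalone version that can be run outside n8n. The regular expressions are written without the doubled backslashes because they are no longer embedded in a string. This version uses a global comma replace and optional chaining so that multi-comma prices and missing fields are handled; the sample inputs are made up for illustration.

```javascript
// Collapse runs of whitespace and trim; returns undefined for missing input.
const cleanText = (text) => text?.trim().replace(/\s+/g, " ");

// Extract a dollar amount such as "$1,234.56"; returns null when none is found.
const extractPrice = (text) => {
  const match = text?.match(/\$([\d,]+(?:\.\d{2})?)/);
  return match ? parseFloat(match[1].replace(/,/g, "")) : null;
};

console.log(cleanText("  Wireless\n  Mouse  "));  // "Wireless Mouse"
console.log(extractPrice("Now $1,234,567.89!"));  // 1234567.89
console.log(extractPrice("Call for price"));      // null
console.log(extractPrice(undefined));             // null
```

Returning `null` rather than `NaN` for unparseable prices makes it straightforward to filter out incomplete items in a later node.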
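Error Handling
Common Issues
Error: 408 - Page load timeout
Solutions:
- Increase the wait time for slow-loading pages
- Check whether the website is experiencing problems
- Consider parsing the page in several steps
Error: 403 - Forbidden
Solutions:
- The website may be blocking automated access
- Try a different user agent
- Respect robots.txt and the terms of service
- Consider contacting the website owner
Error: 429 - Too many requests
Solutions:
- Add a delay between requests
- Reduce the number of concurrent parse operations
- Implement exponential backoff
- Consider upgrading your API plan
Error: 404 - Element not found
Solutions:
- The website structure may have changed
- Update your CSS selectors
- Add fallback selectors
- Implement graceful degradation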
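The exponential backoff suggested for 429 errors can be sketched as follows. This is a generic retry wrapper under stated assumptions, not an Anysite or n8n API: `backoffDelays` and `retryWithBackoff` are hypothetical names, the delays double from a base value up to a cap, and no jitter is applied.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delays between consecutive attempts: base, 2x base, 4x base, ... capped at capMs.
function backoffDelays(maxTries, baseMs = 1000, capMs = 30000) {
  return Array.from({ length: Math.max(0, maxTries - 1) }, (_, i) =>
    Math.min(capMs, baseMs * 2 ** i)
  );
}

// Runs task until it succeeds or maxTries attempts have failed.
async function retryWithBackoff(task, { maxTries = 3, baseMs = 1000, sleep = defaultSleep } = {}) {
  const delays = backoffDelays(maxTries, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      if (attempt >= delays.length) throw error; // out of retries
      await sleep(delays[attempt]);
    }
  }
}

// Demo with a task that fails twice, using a no-op sleep so it finishes instantly.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error("429 - Too many requests");
  return "ok";
};
retryWithBackoff(flaky, { maxTries: 3, sleep: async () => {} })
  .then((result) => console.log(result, calls)); // logs: ok 3
```

Injecting `sleep` keeps the wrapper testable without real waiting. In production, adding a small random jitter to each delay is a common refinement so that parallel workers do not retry in lockstep.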
Robust Parsing
{
"name": "Robust Web Parser",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"continueOnFail": true,
"retryOnFail": true,
"maxTries": 3,
"waitBetweenTries": 5000,
"parameters": {
"operation": "parseUrl",
"url": "{{ $json.targetUrl }}",
"fallbackSelectors": {
"title": ["h1", ".title", ".headline", "title"],
"content": [".content", ".article-body", "main", ".post"]
}
}
}
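The idea behind a fallback selector list is to try each selector in order and keep the first one that yields text. The sketch below models that logic only. How the node itself resolves `fallbackSelectors` is not described here, so the ordering rule is an assumption, and `resolveWithFallbacks` and the fake page lookup are hypothetical.

```javascript
// Hypothetical helper: for each field, use the first selector that returns non-empty text.
function resolveWithFallbacks(fallbackSelectors, queryText) {
  const resolved = {};
  for (const [field, selectors] of Object.entries(fallbackSelectors)) {
    resolved[field] = null; // stays null when no selector matches
    for (const selector of selectors) {
      const text = queryText(selector)?.trim();
      if (text) {
        resolved[field] = { selector, text };
        break;
      }
    }
  }
  return resolved;
}

// Stand-in for a real page: maps a selector to the text it would match.
const fakePage = { ".headline": "  Breaking News ", main: "Body text" };

const resolved = resolveWithFallbacks(
  {
    title: ["h1", ".title", ".headline", "title"],
    content: [".content", ".article-body", "main", ".post"]
  },
  (selector) => fakePage[selector]
);
console.log(resolved);
```

Recording which selector matched makes it easier to notice when a site redesign has pushed extraction onto a later, less specific fallback.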
Data Quality and Validation
Content Validation
Validate the quality of extracted data:
// Data quality checks
{
"name": "Validate Data Quality",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
const validateData = (data) => {
const quality = {
score: 0,
issues: [],
valid: true
};
// Check the title
if (!data.title || data.title.length < 10) {
quality.issues.push('Title too short or missing');
quality.valid = false;
} else {
quality.score += 25;
}
// Check the content
if (!data.content || data.content.length < 100) {
quality.issues.push('Content too short or missing');
quality.valid = false;
} else {
quality.score += 25;
}
// Check for duplicated text (only when a title is present)
if (data.title && data.title === data.description) {
quality.issues.push('Title and description are identical');
quality.score -= 10;
}
// Check for extraction artifacts (optional chaining guards against missing content)
if (data.content?.includes('javascript:') || data.content?.includes('void(0)')) {
quality.issues.push('Content contains JavaScript artifacts');
quality.score -= 15;
}
quality.score = Math.max(0, quality.score);
return { ...data, quality };
};
return items.map(item => ({ json: validateData(item.json) }));
`
}
}
Duplicate Detection
Remove duplicate content:
{
"name": "Remove Duplicates",
"type": "n8n-nodes-base.removeDuplicates",
"parameters": {
"compare": "selectedFields",
"fieldsToCompare": ["title", "url"]
}
}
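For cases where deduplication has to happen inside a code step instead, the same idea can be sketched in a few lines. This is not the implementation of the n8n Remove Duplicates node. `removeDuplicates` is a hypothetical helper that keeps the first item for each combination of the compared fields.

```javascript
// Hypothetical helper: keeps the first item seen for each combination of field values.
function removeDuplicates(items, fields) {
  const seen = new Set();
  return items.filter((item) => {
    // JSON.stringify of the value list gives an unambiguous composite key.
    const key = JSON.stringify(fields.map((field) => item.json[field] ?? null));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const scraped = [
  { json: { title: "Post A", url: "https://example.com/a", scrapedAt: "09:00" } },
  { json: { title: "Post A", url: "https://example.com/a", scrapedAt: "10:00" } },
  { json: { title: "Post A", url: "https://example.com/a-2", scrapedAt: "10:00" } }
];

const unique = removeDuplicates(scraped, ["title", "url"]);
console.log(unique.map((item) => item.json.url));
```

Comparing on `title` and `url` only means fields that differ between runs, such as a scrape timestamp, do not prevent two copies of the same page from being recognized as duplicates.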
Integration Examples
Database Storage
Store parsed data in a database:
{
"name": "Store Parsed Data",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "scraped_content",
"columns": [
"url",
"title",
"content",
"author",
"publish_date",
"scraped_at"
],
"values": [
"={{ $json.url }}",
"={{ $json.title }}",
"={{ $json.content }}",
"={{ $json.author }}",
"={{ $json.publishDate }}",
"={{ new Date().toISOString() }}"
]
}
}
Content Management
Add to a CMS or knowledge base:
{
"name": "Add to Notion",
"type": "n8n-nodes-base.notion",
"parameters": {
"operation": "create",
"resource": "page",
"databaseId": "your-database-id",
"properties": {
"Title": "={{ $json.title }}",
"URL": "={{ $json.url }}",
"Content": "={{ $json.content }}",
"Source": "Web Scraping",
"Date": "={{ new Date().toISOString() }}"
}
}
}
AI Analysis
Analyze extracted content with AI:
{
"name": "AI Content Analysis",
"type": "n8n-nodes-base.openAi",
"parameters": {
"operation": "analyze",
"prompt": "Analyze this article and provide: 1) Main topics, 2) Key insights, 3) Sentiment, 4) Target audience. Article: {{ $json.title }} - {{ $json.content }}"
}
}
Performance Optimization
Parallel Processing
Process multiple URLs concurrently:
{
"name": "Parallel URL Processing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "bulkUrlParse",
"parameters": {
"urls": [
"https://site1.com",
"https://site2.com",
"https://site3.com"
],
"batchSize": 3,
"maxRetries": 2
}
}
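When the URL list is longer than one request should carry, it can be split on the workflow side before it reaches the node. The sketch below shows plain client-side chunking. `chunkUrls` is a hypothetical helper, and how the node schedules work within a batch is not covered here.

```javascript
// Hypothetical helper: splits a URL list into consecutive groups of at most batchSize.
function chunkUrls(urls, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

const urlBatches = chunkUrls(
  ["https://site1.com", "https://site2.com", "https://site3.com", "https://site4.com", "https://site5.com"],
  2
);
console.log(urlBatches);
```

Each inner array can then be supplied as the `urls` parameter of a separate bulk parse call, which keeps individual requests small and makes a failed batch cheaper to retry.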
Selective Parsing
Parse only the elements you need to improve speed:
{
"name": "Fast Essential Parsing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "{{ $json.url }}",
"extractImages": false,
"extractLinks": false,
"customSelectors": {
"title": "h1",
"price": ".price",
"availability": ".stock-status"
}
}
}
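Best Practices
Ethical Scraping
- Always respect robots.txt files
- Do not overload servers with excessive requests
- Follow each website's terms of service
- Consider contacting the website owner for API access
- Store only the data you need and respect privacy
Performance Tips
- Use bulk operations for multiple URLs
- Implement proper error handling and retries
- Add appropriate delays between requests
- Cache frequently accessed data
- Monitor your API usage and quota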
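The caching tip above can be illustrated with a small time-to-live cache keyed by URL. This is a minimal sketch, not a feature of the node: `createTtlCache` is a hypothetical helper, and the clock is injectable so that expiry can be demonstrated without waiting.

```javascript
// Hypothetical helper: in-memory cache whose entries expire after ttlMs milliseconds.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // drop the stale entry
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    }
  };
}

// Demo with a fake clock so expiry is deterministic.
let fakeTime = 0;
const cache = createTtlCache(1000, () => fakeTime);
cache.set("https://example.com/article", { title: "Cached page" });

fakeTime = 999;
console.log(cache.get("https://example.com/article")); // still cached
fakeTime = 1000;
console.log(cache.get("https://example.com/article")); // expired, so undefined
```

An in-memory `Map` lives only as long as the process that holds it, so sharing cached pages across separate workflow executions would need an external store such as a database or key-value service.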
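Data Quality
- Validate extracted data before using it
- Implement fallback extraction methods
- Clean and normalize text content
- Remove duplicate entries
- Handle encodings and special characters correctly
Next Steps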