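The Anysite Web Parser node adds web scraping to your n8n workflows. It extracts data from websites, parses HTML content, and turns unstructured web data into structured information for analysis and automation.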
Node Configuration
credential
Anysite API Credentials
Required
Select your Anysite API credentials from the dropdown menu, or create new credentials.
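Available Operations
Parse URL
Bulk URL Parse
Smart Extract
Monitor Changes
Parse URL: Extract data from a specific web page URL. Parameters:
- URL (required): URL of the web page to scrape
- Wait For Load: time to wait for dynamic content (0-30 seconds)
- Extract Images: include image URLs in the output
- Extract Links: include all links found on the page
- Custom Selectors: CSS selectors for specific elements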
Example output:
{
"page": {
"url": "https://example.com/article",
"title": "How to Build Scalable Web Applications",
"description": "A comprehensive guide to building web applications...",
"author": "John Developer",
"publishDate": "2024-08-26",
"content": "Building scalable web applications requires...",
"images": [
"https://example.com/images/architecture.png",
"https://example.com/images/diagram.jpg"
],
"links": [
{
"text": "Related Article",
"url": "https://example.com/related"
}
],
"metadata": {
"wordCount": 1250,
"readingTime": "5 minutes",
"tags": ["web development", "scalability", "architecture"]
}
}
}
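Bulk URL Parse: Parse multiple URLs in a single request. Parameters:
- URLs (required): array of URLs to parse
- Batch Size: number of URLs processed concurrently
- Fail on Error: stop processing if any URL fails
- Include Screenshots: capture page screenshots
Example output:
{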
"results": [
{
"url": "https://site1.com",
"status": "success",
"title": "Site 1 Title",
"content": "Page content...",
"loadTime": 1200
},
{
"url": "https://site2.com",
"status": "error",
"error": "Page not found",
"loadTime": 800
}
],
"summary": {
"total": 2,
"successful": 1,
"failed": 1,
"avgLoadTime": 1000
}
}
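The `summary` block in the example can be recomputed from the `results` array, which is handy when results are filtered downstream. The following is a minimal sketch for an n8n Function node; `summarizeResults` is a hypothetical helper name, and the rounding of `avgLoadTime` is an assumption, not documented behavior.

```javascript
// Hypothetical helper: rebuild the summary from a results array shaped
// like the Bulk URL Parse example output.
function summarizeResults(results) {
  const total = results.length;
  const successful = results.filter((r) => r.status === "success").length;
  const totalLoadTime = results.reduce((sum, r) => sum + (r.loadTime || 0), 0);
  return {
    total,
    successful,
    failed: total - successful,
    // Assumption: the average is taken over all URLs, including failures.
    avgLoadTime: total > 0 ? Math.round(totalLoadTime / total) : 0,
  };
}

const sampleResults = [
  { url: "https://site1.com", status: "success", loadTime: 1200 },
  { url: "https://site2.com", status: "error", error: "Page not found", loadTime: 800 },
];
const summary = summarizeResults(sampleResults);
console.log(summary);
```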
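Smart Extract: Automatically detect and extract structured data from a web page. Parameters:
- URL (required): URL of the web page
- Data Type: "article", "product", "event", "person", or "organization"
- Language: expected content language
- Include Schema: extract schema.org structured data
Example output:
{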
"extracted": {
"type": "article",
"title": "The Future of AI Development",
"author": {
"name": "Dr. Sarah Chen",
"bio": "AI researcher and author",
"social": {
"twitter": "@sarahchen_ai",
"linkedin": "sarah-chen-ai"
}
},
"article": {
"headline": "The Future of AI Development",
"summary": "Exploring trends and innovations in AI...",
"content": "Full article content...",
"publishDate": "2024-08-26T09:00:00Z",
"category": "Technology",
"tags": ["AI", "Machine Learning", "Future Tech"]
},
"schema": {
"@type": "Article",
"author": "Dr. Sarah Chen",
"datePublished": "2024-08-26"
}
}
}
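Monitor Changes: Monitor a web page for changes and get notified. Parameters:
- URL (required): web page to monitor
- Check Interval: how often to check for changes (minutes)
- Change Threshold: minimum change percentage that triggers an alert
- Monitor Elements: specific CSS selectors to monitor
- Notification Method: "webhook", "email", or "return_data"
Example output:
{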
"monitoring": {
"url": "https://competitor.com/pricing",
"lastChecked": "2024-08-26T15:30:00Z",
"changes": [
{
"element": "#pricing-table",
"changeType": "content",
"oldValue": "$99/month",
"newValue": "$89/month",
"changePercent": 11.1,
"timestamp": "2024-08-26T15:30:00Z"
}
],
"screenshot": {
"before": "https://cdn.hdw.ai/screenshots/before_123.png",
"after": "https://cdn.hdw.ai/screenshots/after_123.png"
}
}
}
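When `oldValue` and `newValue` arrive as strings such as "$99/month", a downstream node has to parse them before comparing against a threshold. The sketch below assumes the percentage is measured relative to the old value; how the service itself derives `changePercent` is not documented here, so its figure may differ. `parsePrice`, `percentChange`, and `exceedsThreshold` are hypothetical helper names.

```javascript
// Pull the first number out of a price string, e.g. "$99/month" -> 99.
function parsePrice(text) {
  const match = String(text).match(/[\d,]+(?:\.\d+)?/);
  return match ? parseFloat(match[0].replace(/,/g, "")) : NaN;
}

// Absolute change relative to the old value, in percent, rounded to one decimal.
// Returns null when either value cannot be parsed or the old value is zero.
function percentChange(oldText, newText) {
  const oldPrice = parsePrice(oldText);
  const newPrice = parsePrice(newText);
  if (!Number.isFinite(oldPrice) || !Number.isFinite(newPrice) || oldPrice === 0) {
    return null;
  }
  return Math.round((Math.abs(newPrice - oldPrice) / oldPrice) * 1000) / 10;
}

// True when the change meets or exceeds the threshold percentage.
function exceedsThreshold(oldText, newText, thresholdPercent) {
  const pct = percentChange(oldText, newText);
  return pct !== null && pct >= thresholdPercent;
}

console.log(percentChange("$99/month", "$89/month"));
console.log(exceedsThreshold("$99/month", "$89/month", 5));
```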
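Workflow Examples
Competitor Price Monitoring
Monitor competitor pages
Set up monitoring for competitor pricing pages and product announcements.
Detect changes
Get notified automatically when a competitor changes prices or launches a new product.
Analyze and alert
Analyze price changes and send your team alerts with actionable insights.
Example workflow: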
{
"nodes": [
{
"name": "Monitor Competitor Pricing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "monitorChanges",
"parameters": {
"url": "https://competitor.com/pricing",
"checkInterval": 60,
"changeThreshold": 5,
"monitorElements": ["#pricing-table", ".product-price"]
}
},
{
"name": "Filter Significant Changes",
"type": "n8n-nodes-base.filter",
"parameters": {
"conditions": [
{
"field": "monitoring.changes[0].changePercent",
"operation": "greaterThan",
"value": 10
}
]
}
},
{
"name": "Analyze Price Change",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
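// The Monitor Changes output nests its data under "monitoring".
const change = items[0].json.monitoring.changes[0];
// Prices arrive as strings such as "$99/month"; keep only digits and the decimal point.
const toNumber = (value) => parseFloat(String(value).replace(/[^0-9.]/g, ''));
const oldPrice = toNumber(change.oldValue);
const newPrice = toNumber(change.newValue);
const analysis = {
competitor: "Competitor Inc",
product: "Enterprise Plan",
oldPrice: change.oldValue,
newPrice: change.newValue,
changeAmount: newPrice - oldPrice,
changePercent: change.changePercent,
// A competitor price cut suggests a promotion; an increase suggests a pricing review.
recommendation: newPrice < oldPrice ?
"Consider promotional pricing" :
"Review our pricing strategy"
};
return [{ json: analysis }];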
`
}
},
{
"name": "Alert Team",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#competitive-intel",
"text": "🚨 Competitor Price Change Alert\\n📊 {{ $json.competitor }} changed {{ $json.product }} from {{ $json.oldPrice }} to {{ $json.newPrice }} ({{ $json.changePercent }}%)\\n💡 Recommendation: {{ $json.recommendation }}"
}
}
]
}
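Content Research and Analysis
Automatically research and analyze content from multiple sources:
- Industry news monitoring - track industry developments on news sites
- Competitor content analysis - monitor competitor blogs and announcements
- Trend research - extract trending topics from a range of publications
- Content gap analysis - find content opportunities in your niche
- SEO research - analyze top-ranking pages for target keywords
Lead Generation from Websites
Extract leads and contact information from business websites:
- Directory scraping - extract business listings from directories
- Contact page parsing - get contact information from company websites
- Team page analysis - extract employee information and roles
- Technology detection - identify the technologies target companies use
- CRM integration - automatically add qualified leads to your CRM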
Advanced Parsing
Custom CSS Selectors
Extract specific elements with CSS selectors:
{
"name": "Custom Data Extraction",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "https://news.ycombinator.com",
"customSelectors": {
"headlines": ".titleline > a",
"scores": ".score",
"comments": ".subtext a[href*='item']:last-child",
"authors": ".hnuser"
}
}
}
Dynamic Content Handling
Handle JavaScript-heavy websites:
{
"name": "Parse SPA Website",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "https://spa-website.com",
"waitForLoad": 10,
"waitForSelector": "#dynamic-content",
"executeJavaScript": "document.querySelector('#load-more').click()"
}
}
Data Transformation
Convert extracted data into a structured format:
// Clean and structure the scraped data
{
"name": "Transform Data",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
const cleanText = (text) => text?.trim().replace(/\\s+/g, ' ');
const extractPrice = (text) => {
// Guard against missing text and strip every thousands separator, not only the first.
const match = text?.match(/\\$([\\d,]+(?:\\.\\d{2})?)/);
return match ? parseFloat(match[1].replace(/,/g, '')) : null;
};
const transformed = items.map(item => ({
json: {
title: cleanText(item.json.title),
price: extractPrice(item.json.priceText),
description: cleanText(item.json.description),
url: item.json.url,
extractedAt: new Date().toISOString()
}
}));
return transformed;
`
}
}
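To check the cleaning helpers outside n8n, they can be run as plain JavaScript. This is a minimal standalone sketch of the same two helpers; the regular expressions use single backslashes here because they are not embedded in a template string, and the null-safe guards are an addition for illustration.

```javascript
// Collapse runs of whitespace and trim; non-string input becomes an empty string.
const cleanText = (text) =>
  typeof text === "string" ? text.trim().replace(/\s+/g, " ") : "";

// Extract the first dollar amount as a number, or null when none is found.
const extractPrice = (text) => {
  const match =
    typeof text === "string" ? text.match(/\$([\d,]+(?:\.\d{2})?)/) : null;
  return match ? parseFloat(match[1].replace(/,/g, "")) : null;
};

console.log(cleanText("  Enterprise \n  Plan "));
console.log(extractPrice("Now $1,234,567.50 per year"));
console.log(extractPrice(undefined));
```

Replacing commas with a global pattern matters for values of a million or more, where a single string replacement would leave a second separator in place.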
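Error Handling
Common Issues
Error: 408 - Page load timeout. Solutions:
- Increase the wait time for slow-loading pages
- Check whether the website is experiencing problems
- Consider parsing the page in multiple steps
Error: 403 - Forbidden. Solutions:
- The website may be blocking automated access
- Try a different user agent
- Respect robots.txt and the terms of service
- Consider contacting the website owner
Error: 429 - Too many requests. Solutions:
- Add delays between requests
- Reduce concurrent parse operations
- Implement exponential backoff
- Consider upgrading your API plan
Error: 404 - Element not found. Solutions:
- The website structure may have changed
- Update your CSS selectors
- Add fallback selectors
- Implement graceful degradation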
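The exponential backoff suggested for 429 errors can be sketched as follows. This is an illustration under stated assumptions, not the node's built-in retry logic: `backoffDelay` and `withBackoff` are hypothetical helpers, and the `status` property on the thrown error is assumed.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run an async operation, retrying only when it fails with status 429.
// Assumption: the thrown error carries a numeric `status` property.
async function withBackoff(operation, options = {}) {
  const { maxTries = 4, baseMs = 1000, maxMs = 30000 } = options;
  const sleep =
    options.sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const retryable = Boolean(err) && err.status === 429;
      // Give up on non-retryable errors or once the last try has been used.
      if (!retryable || attempt >= maxTries - 1) throw err;
      await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}

// Delays for the first four retries with the defaults.
console.log([0, 1, 2, 3].map((n) => backoffDelay(n)));
```

Capping the delay keeps a long outage from producing waits of several minutes; adding random jitter on top is a common refinement when many workflows retry at once.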
Robust Parsing
{
"name": "Robust Web Parser",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"continueOnFail": true,
"retryOnFail": true,
"maxTries": 3,
"waitBetweenTries": 5000,
"parameters": {
"operation": "parseUrl",
"url": "{{ $json.targetUrl }}",
"fallbackSelectors": {
"title": ["h1", ".title", ".headline", "title"],
"content": [".content", ".article-body", "main", ".post"]
}
}
}
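The idea behind `fallbackSelectors` is "first selector that yields content wins". How the node resolves them internally is not described here, so the following is only a sketch of that idea in a Function node. `resolveWithFallback` is a hypothetical helper, and `extracted` is an assumed map from selector to the text found for it.

```javascript
// For each field, pick the first selector whose extracted text is non-empty.
// Returns null for a field when no selector produced content.
function resolveWithFallback(fallbackSelectors, extracted) {
  const result = {};
  for (const [field, selectors] of Object.entries(fallbackSelectors)) {
    const hit = selectors.find(
      (sel) => typeof extracted[sel] === "string" && extracted[sel].trim() !== ""
    );
    result[field] = hit ? { selector: hit, value: extracted[hit].trim() } : null;
  }
  return result;
}

const fallbackSelectors = {
  title: ["h1", ".title", ".headline", "title"],
  content: [".content", ".article-body", "main", ".post"],
};
// Assumed extraction result: "h1" matched nothing, ".title" and "main" matched.
const extracted = { h1: "", ".title": "  Launch notes ", main: "Body text" };
const resolved = resolveWithFallback(fallbackSelectors, extracted);
console.log(resolved);
```

Recording which selector matched makes it easier to notice when a site redesign has pushed extraction onto a later fallback.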
Data Quality and Validation
Content Validation
Validate the quality of extracted data:
// Data quality checks
{
"name": "Validate Data Quality",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": `
const validateData = (data) => {
const quality = {
score: 0,
issues: [],
valid: true
};
// Check the title
if (!data.title || data.title.length < 10) {
quality.issues.push('Title too short or missing');
quality.valid = false;
} else {
quality.score += 25;
}
// Check the content
if (!data.content || data.content.length < 100) {
quality.issues.push('Content too short or missing');
quality.valid = false;
} else {
quality.score += 25;
}
// Check for duplicated content (skipped when the title is missing)
if (data.title && data.title === data.description) {
quality.issues.push('Title and description are identical');
quality.score -= 10;
}
// Check for extraction artifacts (guarded so missing content does not throw)
if (data.content && (data.content.includes('javascript:') || data.content.includes('void(0)'))) {
quality.issues.push('Content contains JavaScript artifacts');
quality.score -= 15;
}
quality.score = Math.max(0, quality.score);
return { ...data, quality };
};
return items.map(item => ({ json: validateData(item.json) }));
`
}
}
Duplicate Detection
Remove duplicate content:
{
"name": "Remove Duplicates",
"type": "n8n-nodes-base.removeDuplicates",
"parameters": {
"compare": "selectedFields",
"fieldsToCompare": ["title", "url"]
}
}
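The same field-based comparison can be expressed in a Function node when more control is needed, for example to normalize values first. This is a minimal sketch, not the Remove Duplicates node's implementation; `removeDuplicates` here is a hypothetical helper that keeps the first item for each distinct combination of the compared fields.

```javascript
// Keep the first item for each distinct combination of the given fields.
function removeDuplicates(items, fields) {
  const seen = new Set();
  return items.filter((item) => {
    // JSON.stringify gives an unambiguous key; missing fields become null.
    const key = JSON.stringify(fields.map((f) => item.json[f] ?? null));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const scraped = [
  { json: { title: "Pricing", url: "https://example.com/pricing" } },
  { json: { title: "Pricing", url: "https://example.com/pricing" } },
  { json: { title: "Pricing", url: "https://example.com/plans" } },
];
const unique = removeDuplicates(scraped, ["title", "url"]);
console.log(unique.length);
```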
Integration Examples
Database Storage
Store parsed data in a database:
{
"name": "Store Parsed Data",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "scraped_content",
"columns": [
"url",
"title",
"content",
"author",
"publish_date",
"scraped_at"
],
"values": [
"={{ $json.url }}",
"={{ $json.title }}",
"={{ $json.content }}",
"={{ $json.author }}",
"={{ $json.publishDate }}",
"={{ new Date().toISOString() }}"
]
}
}
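The column mapping reads flat fields such as `$json.url`, while the Parse URL example output nests them under `page`. If the storage node receives the raw parser output, a Function node could flatten it first. The sketch below assumes that output shape; `toRow` is a hypothetical helper, and the column names are taken from the example table.

```javascript
// Flatten a Parse URL result into one row matching the scraped_content columns.
// Missing fields become null so the insert does not receive undefined values.
function toRow(output, scrapedAt = new Date()) {
  const page = output.page || {};
  return {
    url: page.url ?? null,
    title: page.title ?? null,
    content: page.content ?? null,
    author: page.author ?? null,
    publish_date: page.publishDate ?? null,
    scraped_at: scrapedAt.toISOString(),
  };
}

const row = toRow(
  {
    page: {
      url: "https://example.com/article",
      title: "How to Build Scalable Web Applications",
      author: "John Developer",
      publishDate: "2024-08-26",
    },
  },
  new Date("2024-08-26T00:00:00Z")
);
console.log(row);
```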
Content Management
Add to a CMS or knowledge base:
{
"name": "Add to Notion",
"type": "n8n-nodes-base.notion",
"parameters": {
"operation": "create",
"resource": "page",
"databaseId": "your-database-id",
"properties": {
"Title": "={{ $json.title }}",
"URL": "={{ $json.url }}",
"Content": "={{ $json.content }}",
"Source": "Web Scraping",
"Date": "={{ new Date().toISOString() }}"
}
}
}
AI Analysis
Analyze extracted content with AI:
{
"name": "AI Content Analysis",
"type": "n8n-nodes-base.openAi",
"parameters": {
"operation": "analyze",
"prompt": "Analyze this article and provide: 1) Main topics, 2) Key insights, 3) Sentiment, 4) Target audience. Article: {{ $json.title }} - {{ $json.content }}"
}
}
Performance Optimization
Parallel Processing
Process multiple URLs concurrently:
{
"name": "Parallel URL Processing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "bulkUrlParse",
"parameters": {
"urls": [
"https://site1.com",
"https://site2.com",
"https://site3.com"
],
"batchSize": 3,
"maxRetries": 2
}
}
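When a URL list is longer than one request should carry, it can be split into batches upstream and each batch passed to the bulk operation. This is a minimal sketch of that split; `chunk` is a hypothetical helper, and whether the node also batches internally via `batchSize` is not assumed either way.

```javascript
// Split a list of URLs into consecutive batches of at most batchSize entries.
function chunk(urls, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

const urls = Array.from({ length: 7 }, (_, i) => `https://site${i + 1}.com`);
const batches = chunk(urls, 3);
console.log(batches.map((b) => b.length));
```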
Selective Parsing
Parse only the elements you need to improve speed:
{
"name": "Fast Essential Parsing",
"type": "@horizondatawave/n8n-nodes-anysite.WebParser",
"operation": "parseUrl",
"parameters": {
"url": "{{ $json.url }}",
"extractImages": false,
"extractLinks": false,
"customSelectors": {
"title": "h1",
"price": ".price",
"availability": ".stock-status"
}
}
}
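Best Practices
Ethical Scraping
- Always respect robots.txt files
- Do not overload servers with excessive requests
- Follow each website's terms of service
- Consider contacting the website owner for API access
- Store only the data you need and respect privacy
Performance Tips
- Use bulk operations for multiple URLs
- Implement proper error handling and retries
- Add appropriate delays between requests
- Cache frequently accessed data
- Monitor your API usage and quota
Data Quality
- Validate extracted data before using it
- Implement fallback extraction methods
- Clean and normalize text content
- Remove duplicate entries
- Handle encodings and special characters correctly
Next Steps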