Overview
Dataset pipelines let you define multi-source data collection workflows in YAML. Sources can depend on each other, forming chains like: search → company profiles → employees → posts. The CLI handles execution order, parallelism, error handling, and data storage automatically.
Dataset pipelines require the data extra: pip install "anysite-cli[data]"
Create a Pipeline
Initialize a new dataset with anysite dataset init <name>. This creates a dataset.yaml file with a starter configuration.
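Run Collection
Run the pipeline with anysite dataset collect <yaml>. Add --dry-run to preview the execution plan without making API calls.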
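When a collection runs, the CLI works out the execution order from the dependencies between sources. The sketch below illustrates that idea with Python's standard graphlib module. It is not the CLI's actual implementation: the dependency map, the source ids, and both helper functions are hypothetical and only mirror the chain described in the overview.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each source id maps to the ids it depends on.
# The ids mirror the chain from the overview, not the CLI's real schema.
dependencies = {
    "search": set(),
    "company_profiles": {"search"},
    "employees": {"company_profiles"},
    "posts": {"employees"},
}


def execution_order(deps):
    """Return source ids so that every source comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def parallel_batches(deps):
    """Group sources into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_order(dependencies))
```

For a strict chain like this one, every batch holds a single source. Sources that share no dependency would land in the same batch, which is where parallel execution becomes possible.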
Multi-Source Pipeline Example
A more complex pipeline uses dependency chains, where the output of one source feeds the next.
Pipeline Configuration Reference
| Field | Description |
|---|---|
| name | Pipeline name (used for history and logging) |
| description | Optional description |
| sources | List of data sources (see Source Types) |
| storage.format | Output format: parquet (default), json, jsonl, csv |
| storage.path | Output directory (default: ./data/) |
| schedule.cron | Cron expression for automated runs (see Scheduling) |
| notifications | Webhook URLs for success/failure events |
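As a rough illustration of the fields in this table, the sketch below checks a parsed pipeline configuration (a plain Python dict, as you would get from loading the YAML) against the documented top-level fields and storage formats. The function check_pipeline_config is hypothetical and not part of the CLI. Treating name and sources as required, and the id key inside a source entry, are assumptions made for this example.

```python
# Top-level fields and storage formats as listed in the reference table.
KNOWN_FIELDS = {"name", "description", "sources", "storage", "schedule", "notifications"}
STORAGE_FORMATS = {"parquet", "json", "jsonl", "csv"}


def check_pipeline_config(config):
    """Return a list of problems found in a parsed pipeline config dict."""
    problems = []
    for key in sorted(set(config) - KNOWN_FIELDS):
        problems.append(f"unknown top-level field: {key}")
    # Assumption: name and sources are required; the docs do not say so.
    if not config.get("name"):
        problems.append("missing field: name")
    if not config.get("sources"):
        problems.append("missing field: sources")
    storage = config.get("storage", {})
    fmt = storage.get("format", "parquet")  # parquet is the documented default
    if fmt not in STORAGE_FORMATS:
        problems.append(f"unsupported storage.format: {fmt}")
    return problems


# Hypothetical config; the "id" key inside a source entry is an assumption.
config = {
    "name": "companies",
    "sources": [{"id": "search"}],
    "storage": {"format": "jsonl", "path": "./data/"},
}
print(check_pipeline_config(config))
```

A check like this catches typos such as an unsupported storage.format before any API call is made, which complements the --dry-run preview.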
Dataset Commands
| Command | Description |
|---|---|
| anysite dataset init <name> | Create a new dataset with starter YAML |
| anysite dataset collect <yaml> | Run the collection pipeline |
| anysite dataset collect <yaml> --dry-run | Preview execution plan without API calls |
| anysite dataset collect <yaml> --incremental | Skip previously collected inputs |
| anysite dataset collect <yaml> --source <id> | Collect a single source |
| anysite dataset status <yaml> | Check collection status |
| anysite dataset collect <yaml> --load-db <conn> | Auto-load results into a database |
| anysite dataset query <yaml> --sql "..." | Query collected data with SQL |
| anysite dataset stats <yaml> | Show collection statistics |
| anysite dataset history <name> | View run history |
| anysite dataset logs <name> --run <N> | View logs for a specific run |
| anysite dataset reset-cursor <yaml> | Reset incremental collection cursors |
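If you drive collections from a script, it can help to assemble the collect command from the flags in the table above. The sketch below only builds the argument list and does not run anything. The helper collect_command is hypothetical, the source id employees is made up, and it assumes the flags can be combined in a single invocation, which the table does not state.

```python
import shlex


def collect_command(yaml_path, dry_run=False, incremental=False, source=None, load_db=None):
    """Build the argument list for `anysite dataset collect` from documented flags."""
    args = ["anysite", "dataset", "collect", yaml_path]
    if dry_run:
        args.append("--dry-run")
    if incremental:
        args.append("--incremental")
    if source is not None:
        args += ["--source", source]
    if load_db is not None:
        args += ["--load-db", load_db]
    return args


# Assumption: --incremental and --source may be used together.
command = collect_command("dataset.yaml", incremental=True, source="employees")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True). Building a list instead of a single string avoids shell quoting problems with values such as database connection strings.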
Next Steps
Source Types
Learn about the 5 source types: independent, from_file, dependent, union, and LLM
Scheduling
Set up incremental collection, cron scheduling, and webhook notifications