## Overview
Dataset pipelines let you define multi-source data collection workflows in YAML. Sources can depend on each other, forming chains like: search → company profiles → employees → posts. The CLI handles execution order, parallelism, error handling, and data storage automatically.
Dataset pipelines require the `data` extra:

```bash
pip install "anysite-cli[data]"
```
## Create a Pipeline
Initialize a new dataset:

```bash
anysite dataset init my-dataset
```

This creates a `dataset.yaml` file with a starter configuration:
```yaml
name: my-dataset
description: My data collection pipeline

sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "software engineer"
      count: 50
    parallel: 1          # concurrent requests for this source
    rate_limit: "10/s"   # throttle to 10 requests per second
    on_error: stop       # abort the run on the first failed request

storage:
  format: parquet
  path: ./data/
```
## Run Collection
```bash
# Full collection
anysite dataset collect dataset.yaml

# Preview what will be collected (no API calls)
anysite dataset collect dataset.yaml --dry-run

# Collect a specific source only
anysite dataset collect dataset.yaml --source search_results

# Skip LLM processing steps
anysite dataset collect dataset.yaml --no-llm
```
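Incremental collection (see the command reference below) re-runs the pipeline but skips inputs that earlier runs already collected:

```bash
# Only collect inputs not seen in previous runs
anysite dataset collect dataset.yaml --incremental
```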
## Multi-Source Pipeline Example
A more complex pipeline with dependency chains:
```yaml
name: competitor-research
description: Collect and analyze competitor company data

sources:
  # Step 1: Search for companies
  - id: companies
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100
    parallel: 1

  # Step 2: Get employee list for each company (depends on Step 1)
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
      dedupe: true
      input_key: companies
    parallel: 3
    rate_limit: "10/s"

  # Step 3: Get full profiles for each employee (depends on Step 2)
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: employees
      field: urn.value
      input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip

storage:
  format: parquet
  path: ./data/
```
The CLI automatically resolves the dependency graph and executes sources in the correct order. Here, each record returned by `companies` has its `urn.value` extracted and passed to the `employees` source under the `companies` input key (with `dedupe: true` dropping duplicate inputs); each employee's URN feeds the `profiles` source the same way, and `on_error: skip` lets `profiles` skip failed lookups instead of aborting the run.
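To check the resolved order before spending API calls, preview the plan with a dry run (the filename here is simply whatever you saved this pipeline as):

```bash
# Prints the execution plan (companies → employees → profiles) without calling the API
anysite dataset collect competitor-research.yaml --dry-run
```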
## Pipeline Configuration Reference
| Field | Description |
|---|---|
| `name` | Pipeline name (used for history and logging) |
| `description` | Optional description |
| `sources` | List of data sources (see Source Types) |
| `storage.format` | Output format: `parquet` (default), `json`, `jsonl`, `csv` |
| `storage.path` | Output directory (default: `./data/`) |
| `schedule.cron` | Cron expression for automated runs (see Scheduling) |
| `notifications` | Webhook URLs for success/failure events (see the sketch below) |
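As a rough sketch of how the scheduling and notification fields fit into a pipeline file — `schedule.cron` is documented above, but the exact shape of the `notifications` block (the `on_success`/`on_failure` keys below) is an assumption for illustration:

```yaml
schedule:
  cron: "0 6 * * *"   # run daily at 06:00 (see Scheduling)

# Hypothetical layout for webhook notifications; check your CLI
# version's documentation for the exact keys it expects.
notifications:
  on_success: "https://example.com/hooks/collect-ok"
  on_failure: "https://example.com/hooks/collect-failed"
```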
## Dataset Commands
| Command | Description |
|---|---|
| `anysite dataset init <name>` | Create a new dataset with starter YAML |
| `anysite dataset collect <yaml>` | Run the collection pipeline |
| `anysite dataset collect <yaml> --dry-run` | Preview execution plan without API calls |
| `anysite dataset collect <yaml> --incremental` | Skip previously collected inputs |
| `anysite dataset collect <yaml> --source <id>` | Collect a single source |
| `anysite dataset collect <yaml> --load-db <conn>` | Auto-load results into a database |
| `anysite dataset status <yaml>` | Check collection status |
| `anysite dataset query <yaml> --sql "..."` | Query collected data with SQL |
| `anysite dataset stats <yaml>` | Show collection statistics |
| `anysite dataset history <name>` | View run history |
| `anysite dataset logs <name> --run <N>` | View logs for a specific run |
| `anysite dataset reset-cursor <yaml>` | Reset incremental collection cursors |
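As a usage sketch, the commands below combine flags from the table above; the SQL table and column names are illustrative assumptions (here the source id `profiles` is assumed to be queryable as a table):

```bash
# Summary statistics for the last collection
anysite dataset stats competitor-research.yaml

# Ad-hoc SQL over the collected data; "profiles" is the source id,
# and "headline" is an assumed column for illustration
anysite dataset query competitor-research.yaml \
  --sql "SELECT headline, COUNT(*) AS n FROM profiles GROUP BY headline ORDER BY n DESC"
```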
## Next Steps