Overview

Dataset pipelines let you define multi-source data collection workflows in YAML. Sources can depend on each other, forming chains like: search → company profiles → employees → posts. The CLI handles execution order, parallelism, error handling, and data storage automatically.
Dataset pipelines require the data extra: pip install "anysite-cli[data]"
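
End to end, the workflow is: install the extra, scaffold a dataset, then run collection. The following is a minimal sketch using only the commands documented below; it assumes the generated dataset.yaml ends up in the working directory:

# Install the data extra, scaffold a dataset, then run a first collection
pip install "anysite-cli[data]"
anysite dataset init my-dataset
anysite dataset collect dataset.yaml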

Create a Pipeline

Initialize a new dataset:
anysite dataset init my-dataset
This creates a dataset.yaml file with a starter configuration:
name: my-dataset
description: My data collection pipeline

sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "software engineer"
      count: 50
    parallel: 1
    rate_limit: "10/s"
    on_error: stop

storage:
  format: parquet
  path: ./data/

Run Collection

# Full collection
anysite dataset collect dataset.yaml

# Preview what will be collected (no API calls)
anysite dataset collect dataset.yaml --dry-run

# Collect a specific source only
anysite dataset collect dataset.yaml --source search_results

# Skip LLM processing steps
anysite dataset collect dataset.yaml --no-llm

Multi-Source Pipeline Example

A more complex pipeline with dependency chains:
name: competitor-research
description: Collect and analyze competitor company data

sources:
  # Step 1: Search for companies
  - id: companies
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100
    parallel: 1

  # Step 2: Get employee list for each company (depends on Step 1)
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
      dedupe: true
    input_key: companies
    parallel: 3
    rate_limit: "10/s"

  # Step 3: Get full profiles for each employee (depends on Step 2)
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: employees
      field: urn.value
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip

storage:
  format: parquet
  path: ./data/
The CLI automatically resolves the dependency graph and executes sources in the correct order.
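
Before committing to a full run, you can preview the resolved plan and validate the first stage in isolation. The sketch below assumes the configuration above is saved as competitor-research.yaml (a filename this page does not specify) and uses only flags documented here:

# Preview the dependency-resolved execution plan without making API calls
anysite dataset collect competitor-research.yaml --dry-run

# Run only the first stage to sanity-check results before dependent sources fan out
anysite dataset collect competitor-research.yaml --source companies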

Pipeline Configuration Reference

Field             Description
name              Pipeline name (used for history and logging)
description       Optional description
sources           List of data sources (see Source Types)
storage.format    Output format: parquet (default), json, jsonl, csv
storage.path      Output directory (default: ./data/)
schedule.cron     Cron expression for automated runs (see Scheduling)
notifications     Webhook URLs for success/failure events

Dataset Commands

Command                                            Description
anysite dataset init <name>                        Create a new dataset with starter YAML
anysite dataset collect <yaml>                     Run the collection pipeline
anysite dataset collect <yaml> --dry-run           Preview execution plan without API calls
anysite dataset collect <yaml> --incremental       Skip previously collected inputs
anysite dataset collect <yaml> --source <id>       Collect a single source
anysite dataset status <yaml>                      Check collection status
anysite dataset collect <yaml> --load-db <conn>    Auto-load results into a database
anysite dataset query <yaml> --sql "..."           Query collected data with SQL
anysite dataset stats <yaml>                       Show collection statistics
anysite dataset history <name>                     View run history
anysite dataset logs <name> --run <N>              View logs for a specific run
anysite dataset reset-cursor <yaml>                Reset incremental collection cursors
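
Taken together, a typical follow-up to a collection run might look like the sketch below. The SQL table name assumes collected sources are queryable by their source id, which this reference does not confirm:

# Re-run, skipping inputs that were already collected
anysite dataset collect dataset.yaml --incremental

# Check progress and summary statistics
anysite dataset status dataset.yaml
anysite dataset stats dataset.yaml

# Query collected data with SQL (table name "search_results" assumed to match the source id)
anysite dataset query dataset.yaml --sql "SELECT COUNT(*) FROM search_results"

# Review past runs and drill into one run's logs
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 1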

Next Steps