Overview

Dataset pipelines let you define multi-source data collection workflows in YAML. Sources can depend on each other, forming chains like: search → company profiles → employees → posts. The CLI handles execution order, parallelism, error handling, and data storage automatically.
Dataset pipelines require the data extra: pip install "anysite-cli[data]"
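
End to end, the workflow is: install the extra, scaffold a dataset, then run collection. The following is a minimal sketch using only the commands documented below; it assumes the generated dataset.yaml ends up in the working directory:

# Install the data extra, scaffold a dataset, then run a first collection
pip install "anysite-cli[data]"
anysite dataset init my-dataset
anysite dataset collect dataset.yaml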

Create a Pipeline

Initialize a new dataset:
anysite dataset init my-dataset
This creates a dataset.yaml file with a starter configuration:
name: my-dataset
description: My data collection pipeline

sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "software engineer"
      count: 50
    parallel: 1
    rate_limit: "10/s"
    on_error: stop

storage:
  format: parquet
  path: ./data/

Run Collection

# Full collection
anysite dataset collect dataset.yaml

# Preview what will be collected (no API calls)
anysite dataset collect dataset.yaml --dry-run

# Collect a specific source only
anysite dataset collect dataset.yaml --source search_results

# Skip LLM processing steps
anysite dataset collect dataset.yaml --no-llm

Multi-Source Pipeline Example

A more complex pipeline with dependency chains:
name: competitor-research
description: Collect and analyze competitor company data

sources:
  # Step 1: Search for companies
  - id: companies
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100
    parallel: 1

  # Step 2: Get employee list for each company (depends on Step 1)
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
      dedupe: true
    input_key: companies
    parallel: 3
    rate_limit: "10/s"

  # Step 3: Get full profiles for each employee (depends on Step 2)
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: employees
      field: urn.value
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip

storage:
  format: parquet
  path: ./data/
The CLI automatically resolves the dependency graph and executes sources in the correct order.
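
Before committing to a full run, you can preview the resolved plan and validate the first stage in isolation. The sketch below assumes the configuration above is saved as competitor-research.yaml (a filename this page does not specify) and uses only flags documented here:

# Preview the dependency-resolved execution plan without making API calls
anysite dataset collect competitor-research.yaml --dry-run

# Run only the first stage to sanity-check results before dependent sources fan out
anysite dataset collect competitor-research.yaml --source companies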

Pipeline Configuration Reference

Field             Description
name              Pipeline name (used for history and logging)
description       Optional description
sources           List of data sources (see Source Types)
storage.format    Output format: parquet (default), json, jsonl, csv
storage.path      Output directory (default: ./data/)
schedule.cron     Cron expression for automated runs (see Scheduling)
notifications     Webhook URLs for success/failure events

Dataset Commands

Command                                            Description
anysite dataset init <name>                        Create a new dataset with starter YAML
anysite dataset collect <yaml>                     Run the collection pipeline
anysite dataset collect <yaml> --dry-run           Preview execution plan without API calls
anysite dataset collect <yaml> --incremental       Skip previously collected inputs
anysite dataset collect <yaml> --source <id>       Collect a single source
anysite dataset status <yaml>                      Check collection status
anysite dataset collect <yaml> --load-db <conn>    Auto-load results into a database
anysite dataset query <yaml> --sql "..."           Query collected data with SQL
anysite dataset stats <yaml>                       Show collection statistics
anysite dataset history <name>                     View run history
anysite dataset logs <name> --run <N>              View logs for a specific run
anysite dataset reset-cursor <yaml>                Reset incremental collection cursors
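
Taken together, a typical follow-up to a collection run might look like the sketch below. The SQL table name assumes collected sources are queryable by their source id, which this reference does not confirm:

# Re-run, skipping inputs that were already collected
anysite dataset collect dataset.yaml --incremental

# Check progress and summary statistics
anysite dataset status dataset.yaml
anysite dataset stats dataset.yaml

# Query collected data with SQL (table name "search_results" assumed to match the source id)
anysite dataset query dataset.yaml --sql "SELECT COUNT(*) FROM search_results"

# Review past runs and drill into one run's logs
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 1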

Next Steps