
Overview

After collecting data with dataset pipelines, you can query it with SQL, powered by DuckDB. Queries run directly on the collected Parquet files, so no separate database is needed.
Requires the data extra: pip install "anysite-cli[data]"

SQL Queries

Run SQL against your collected dataset:
anysite dataset query dataset.yaml \
  --sql "SELECT name, headline FROM profiles LIMIT 10"

Query a Specific Source

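To query a single source without writing a full SQL statement, pass --source with the source ID and, optionally, --fields with the columns to return:
anysite dataset query dataset.yaml \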
  --source profiles \
  --fields "name, headline, urn.value AS id"
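
The --fields value above reads like a SQL select list: plain column names, plus a dotted path into a nested field with an AS alias. The sketch below is only an illustration of that reading, not the CLI's own parser. The project helper and the sample record are hypothetical, and it assumes that urn is a nested object with a value key.

```python
from typing import Any


def project(record: dict[str, Any], fields: str) -> dict[str, Any]:
    """Apply a comma-separated field list such as 'name, urn.value AS id'.

    Hypothetical helper: it mimics the apparent meaning of --fields,
    it is not how anysite implements it.
    """
    row: dict[str, Any] = {}
    for spec in fields.split(","):
        # Split off an optional "AS alias" suffix.
        path, _, alias = spec.strip().partition(" AS ")
        path = path.strip()
        # Walk the dotted path through nested dicts; missing keys yield None.
        value: Any = record
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        row[alias.strip() or path] = value
    return row


# Made-up record shaped like a profile with a nested urn object (assumption).
record = {
    "name": "Ada",
    "headline": "CTO at Acme",
    "urn": {"type": "profile", "value": "p1"},
}
result = project(record, "name, headline, urn.value AS id")
print(result)
```

Under these assumptions, each output row carries name and headline unchanged and exposes the nested urn value under the shorter name id.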

Complex Queries

# Aggregation
anysite dataset query dataset.yaml \
  --sql "SELECT industry, COUNT(*) as count FROM companies GROUP BY industry ORDER BY count DESC"

# Joins across sources
anysite dataset query dataset.yaml \
  --sql "
    SELECT p.name, p.headline, c.name as company
    FROM profiles p
    JOIN employees e ON p.urn_value = e.urn_value
    JOIN companies c ON e.company_id = c.urn_value
    LIMIT 20
  "

# Filtering
anysite dataset query dataset.yaml \
  --sql "SELECT * FROM profiles WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%'"
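
To see what the three queries above return, it can help to run them against a few toy rows first. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for DuckDB so that it runs without extra dependencies. The table and column names follow the examples above; the rows themselves are made up.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for DuckDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles  (urn_value TEXT, name TEXT, headline TEXT);
    CREATE TABLE employees (urn_value TEXT, company_id TEXT);
    CREATE TABLE companies (urn_value TEXT, name TEXT, industry TEXT);

    -- Made-up sample rows
    INSERT INTO profiles VALUES
        ('p1', 'Ada',   'CTO at Acme'),
        ('p2', 'Grace', 'Engineer at Acme'),
        ('p3', 'Linus', 'CEO at Initech');
    INSERT INTO employees VALUES ('p1', 'c1'), ('p2', 'c1'), ('p3', 'c2');
    INSERT INTO companies VALUES
        ('c1', 'Acme',    'Software'),
        ('c2', 'Initech', 'Software'),
        ('c3', 'Globex',  'Energy');
""")

# Aggregation: number of companies per industry, largest group first.
industry_counts = conn.execute(
    "SELECT industry, COUNT(*) AS count FROM companies "
    "GROUP BY industry ORDER BY count DESC"
).fetchall()

# Join: profiles -> employees -> companies, as in the example above.
# ORDER BY is added only to make the toy output deterministic.
joined = conn.execute("""
    SELECT p.name, p.headline, c.name AS company
    FROM profiles p
    JOIN employees e ON p.urn_value = e.urn_value
    JOIN companies c ON e.company_id = c.urn_value
    ORDER BY p.name
    LIMIT 20
""").fetchall()

# Filtering: headlines that mention CTO or CEO.
executives = conn.execute(
    "SELECT name FROM profiles "
    "WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%' ORDER BY name"
).fetchall()

print(industry_counts)
print(joined)
print(executives)
```

One difference to keep in mind when moving from the stand-in to the real thing: SQLite's LIKE ignores ASCII case by default, whereas DuckDB's LIKE is case-sensitive and offers ILIKE for case-insensitive matching. A headline such as "Director" therefore matches '%CTO%' in SQLite but not in DuckDB.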

Interactive Mode

Launch an interactive SQL shell:
anysite dataset query dataset.yaml --interactive
This opens a DuckDB shell with all your dataset sources available as tables. Type SQL queries and see results instantly.

Dataset Statistics

Get a summary of collected data:
anysite dataset stats dataset.yaml
For each source, the output shows:
  • Number of records collected
  • Collection timestamp
  • File size
  • Column list with types

Source-Level Stats

anysite dataset stats dataset.yaml --source profiles

Dataset Profiling

Generate a statistical profile of your data:
anysite dataset profile dataset.yaml
The profile includes:
  • Column-level statistics (min, max, mean, median, null count)
  • Value distributions for categorical columns
  • Data quality indicators
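
As a rough illustration of what these statistics mean, the sketch below computes them for two small columns in plain Python. It is not the CLI's implementation, and its output format does not mirror the real report. The helper names, the follower_counts column, and the sample values are all hypothetical.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Summarise a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"null_count": len(values) - len(present)}
    if present:  # min, max, mean and median are undefined for an all-null column
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return summary


def profile_categorical(values):
    """Return (value, count) pairs for non-null values, most frequent first."""
    return Counter(v for v in values if v is not None).most_common()


# Made-up sample columns
follower_counts = [120, None, 450, 300, None, 30]
industries = ["Software", "Energy", "Software", None]

numeric_profile = profile_numeric(follower_counts)
industry_distribution = profile_categorical(industries)

print(numeric_profile)
print(industry_distribution)
```

The null count doubles as a simple data quality indicator: a column that is mostly null is usually a sign that the source did not return that field.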

Output Formats

Query results support the same output formats as API calls:
# Table (default for queries)
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles LIMIT 5" --format table

# CSV export
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format csv --output report.csv

# JSON
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format json

# JSONL
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format jsonl
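
If you want to drive these commands from a script, one option is to assemble the argument list in Python and parse the JSONL output line by line. The sketch below is a minimal example under stated assumptions: build_query_command and parse_jsonl are hypothetical helpers, only flags shown on this page are used, and the sample output is made up. It also assumes that results go to standard output when --output is omitted.

```python
import json
import shlex


def build_query_command(dataset_yaml, sql, fmt=None, output=None):
    """Build the argv list for `anysite dataset query` (hypothetical helper)."""
    cmd = ["anysite", "dataset", "query", dataset_yaml, "--sql", sql]
    if fmt:
        cmd += ["--format", fmt]
    if output:
        cmd += ["--output", output]
    return cmd


def parse_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


command = build_query_command(
    "dataset.yaml", "SELECT * FROM profiles", fmt="csv", output="report.csv"
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires anysite-cli with the data extra installed):
#   import subprocess
#   done = subprocess.run(command, capture_output=True, text=True, check=True)

# Made-up JSONL text standing in for `--format jsonl` output.
sample_output = (
    '{"name": "Ada", "headline": "CTO at Acme"}\n'
    '{"name": "Linus", "headline": "CEO at Initech"}\n'
)
rows = parse_jsonl(sample_output)
print(rows)
```

Passing the arguments as a list avoids shell quoting problems when the SQL text itself contains quotes or wildcards.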

Commands Reference

Command                                       Description
anysite dataset query <yaml> --sql "..."      Run SQL query on collected data
anysite dataset query <yaml> --source <id>    Query a specific source
anysite dataset query <yaml> --interactive    Open interactive SQL shell
anysite dataset stats <yaml>                  Show dataset statistics
anysite dataset stats <yaml> --source <id>    Show stats for a specific source
anysite dataset profile <yaml>                Generate data profile with distributions

Next Steps