Overview

After collecting data with dataset pipelines, you can query it with SQL, powered by DuckDB. Queries run directly against the collected Parquet files, so no separate database is needed.
Requires the data extra: pip install "anysite-cli[data]"

SQL Queries

Run SQL against your collected dataset:
anysite dataset query dataset.yaml \
  --sql "SELECT name, headline FROM profiles LIMIT 10"

Query a Specific Source

Select fields from a single source without writing a full SQL statement:
anysite dataset query dataset.yaml \
  --source profiles \
  --fields "name, headline, urn.value AS id"

Complex Queries

# Aggregation
anysite dataset query dataset.yaml \
  --sql "SELECT industry, COUNT(*) as count FROM companies GROUP BY industry ORDER BY count DESC"

# Joins across sources
anysite dataset query dataset.yaml \
  --sql "
    SELECT p.name, p.headline, c.name as company
    FROM profiles p
    JOIN employees e ON p.urn_value = e.urn_value
    JOIN companies c ON e.company_id = c.urn_value
    LIMIT 20
  "

# Filtering
anysite dataset query dataset.yaml \
  --sql "SELECT * FROM profiles WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%'"
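To see what the aggregation, join, and filter queries above return, here is a minimal sketch on toy data. It uses Python's built-in sqlite3 module as a stand-in for DuckDB (an assumption made so the example runs without the data extra). The table and column names mirror the queries above; the rows are invented, and ORDER BY clauses are added only to make the output deterministic.

```python
import sqlite3

# In-memory SQLite stands in for DuckDB here; all rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles  (urn_value TEXT, name TEXT, headline TEXT);
    CREATE TABLE employees (urn_value TEXT, company_id TEXT);
    CREATE TABLE companies (urn_value TEXT, name TEXT, industry TEXT);

    INSERT INTO profiles VALUES
        ('p1', 'Ada', 'CTO at Acme'),
        ('p2', 'Bob', 'Engineer'),
        ('p3', 'Cy',  'CEO and founder');
    INSERT INTO employees VALUES ('p1', 'c1'), ('p2', 'c1'), ('p3', 'c2');
    INSERT INTO companies VALUES
        ('c1', 'Acme', 'Software'),
        ('c2', 'Bolt', 'Logistics'),
        ('c3', 'Core', 'Software');
""")

# Aggregation: number of companies per industry, largest group first.
by_industry = conn.execute("""
    SELECT industry, COUNT(*) AS total
    FROM companies
    GROUP BY industry
    ORDER BY total DESC
""").fetchall()

# Join: profiles -> employees -> companies.
joined = conn.execute("""
    SELECT p.name, p.headline, c.name AS company
    FROM profiles p
    JOIN employees e ON p.urn_value = e.urn_value
    JOIN companies c ON e.company_id = c.urn_value
    ORDER BY p.name
""").fetchall()

# Filtering: keep profiles whose headline mentions CTO or CEO.
executives = conn.execute("""
    SELECT name FROM profiles
    WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%'
    ORDER BY name
""").fetchall()

print(by_industry)
print(joined)
print(executives)
```

One difference to keep in mind: SQLite's LIKE is case-insensitive for ASCII by default, whereas DuckDB's LIKE is case-sensitive (ILIKE is the case-insensitive form), so the filter can match differently on mixed-case headlines.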

Interactive Mode

Launch an interactive SQL shell:
anysite dataset query dataset.yaml --interactive
This opens a DuckDB shell with all your dataset sources available as tables. Type SQL queries and see results instantly.

Dataset Statistics

Get a summary of collected data:
anysite dataset stats dataset.yaml
For each source, the output shows:
  • Number of records collected
  • Collection timestamp
  • File size
  • Column list with types

Source-Level Stats

Limit the summary to a single source:
anysite dataset stats dataset.yaml --source profiles

Dataset Profiling

Generate a statistical profile of your data:
anysite dataset profile dataset.yaml
The profile includes:
  • Column-level statistics (min, max, mean, median, null count)
  • Value distributions for categorical columns
  • Data quality indicators
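As a rough illustration of what the column-level statistics and value distributions listed above mean, the sketch below computes them by hand with Python's standard library. It is not the CLI's implementation; the helper names, the sample rows, and the `employee_count` column are hypothetical.

```python
import statistics
from collections import Counter

def profile_column(rows, column):
    """Min, max, mean, median, and null count for one numeric column.

    Assumes the column has at least one non-null value.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "null_count": len(values) - len(present),
    }

def value_distribution(rows, column):
    """Count how often each non-null value appears in a categorical column."""
    return Counter(
        row[column] for row in rows if row.get(column) is not None
    )

# Invented sample rows; `employee_count` is a hypothetical column name.
rows = [
    {"industry": "Software", "employee_count": 10},
    {"industry": "Logistics", "employee_count": 50},
    {"industry": "Software", "employee_count": None},
    {"industry": None, "employee_count": 30},
]

profile = profile_column(rows, "employee_count")
distribution = value_distribution(rows, "industry")
print(profile)
print(dict(distribution))
```

A high null count relative to the number of records is one simple data quality signal that a profile of this kind makes visible.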

Output Formats

Query results support the same output formats as API calls:
# Table (default for queries)
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles LIMIT 5" --format table

# CSV export
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format csv --output report.csv

# JSON
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format json

# JSONL
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format jsonl
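JSONL carries one JSON object per line, which makes it convenient to process record by record in a downstream script. The sketch below is one way to consume it with the standard library only. The sample text is a hypothetical stand-in for the command's output, with field names borrowed from the earlier `name, headline` query.

```python
import io
import json

def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample standing in for the output of `--format jsonl`.
sample = io.StringIO(
    '{"name": "Ada", "headline": "CTO at Acme"}\n'
    '\n'
    '{"name": "Bob", "headline": "Engineer"}\n'
)

records = list(read_jsonl(sample))
names = [record["name"] for record in records]
print(names)
```

In practice, `sample` could be replaced with `sys.stdin` and the query output piped into the script, assuming the CLI writes results to standard output when no output file is given.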

Commands Reference

Command                                     Description
anysite dataset query <yaml> --sql "..."    Run SQL query on collected data
anysite dataset query <yaml> --source <id>  Query a specific source
anysite dataset query <yaml> --interactive  Open interactive SQL shell
anysite dataset stats <yaml>                Show dataset statistics
anysite dataset stats <yaml> --source <id>  Show stats for a specific source
anysite dataset profile <yaml>              Generate data profile with distributions

Next Steps

  • Examples: see complete end-to-end workflow examples
  • Database Loading: load query results into a persistent database