Overview
Once a dataset pipeline has collected data, you can query it with SQL, powered by DuckDB. Queries run directly against the collected Parquet files, so no separate database is needed.
Requires the data extra: pip install "anysite-cli[data]"
SQL Queries
Run SQL against your collected dataset:
anysite dataset query dataset.yaml \
--sql "SELECT name, headline FROM profiles LIMIT 10"
Query a Specific Source
anysite dataset query dataset.yaml \
--source profiles \
--fields "name, headline, urn.value AS id"
Complex Queries
# Aggregation
anysite dataset query dataset.yaml \
--sql "SELECT industry, COUNT(*) as count FROM companies GROUP BY industry ORDER BY count DESC"
# Joins across sources
anysite dataset query dataset.yaml \
--sql "
SELECT p.name, p.headline, c.name as company
FROM profiles p
JOIN employees e ON p.urn_value = e.urn_value
JOIN companies c ON e.company_id = c.urn_value
LIMIT 20
"
# Filtering
anysite dataset query dataset.yaml \
--sql "SELECT * FROM profiles WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%'"
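To see what the three query shapes above return, the following is a minimal sketch that runs the same kind of SQL on a few made-up rows. It assumes Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra packages; the table contents are hypothetical, and the column names simply mirror the examples above.

```python
import sqlite3

# sqlite3 stands in for DuckDB here so the sketch needs only the standard library.
# All rows are made-up sample data, not real collected records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profiles  (urn_value TEXT, name TEXT, headline TEXT);
CREATE TABLE employees (urn_value TEXT, company_id TEXT);
CREATE TABLE companies (urn_value TEXT, name TEXT, industry TEXT);
INSERT INTO profiles VALUES
  ('p1', 'Ada',   'CTO at Acme'),
  ('p2', 'Grace', 'Engineer at Initech'),
  ('p3', 'Linus', 'CEO at Acme');
INSERT INTO employees VALUES ('p1', 'c1'), ('p2', 'c2'), ('p3', 'c1');
INSERT INTO companies VALUES
  ('c1', 'Acme',    'Software'),
  ('c2', 'Initech', 'Software'),
  ('c3', 'Globex',  'Energy');
""")

# Join: profile -> employee link -> company
join_rows = conn.execute("""
    SELECT p.name, p.headline, c.name AS company
    FROM profiles p
    JOIN employees e ON p.urn_value = e.urn_value
    JOIN companies c ON e.company_id = c.urn_value
    ORDER BY p.name
    LIMIT 20
""").fetchall()

# Aggregation: number of companies per industry, largest first
agg_rows = conn.execute("""
    SELECT industry, COUNT(*) AS total
    FROM companies
    GROUP BY industry
    ORDER BY total DESC
""").fetchall()

# Filtering: headlines mentioning CTO or CEO
filter_rows = conn.execute("""
    SELECT name FROM profiles
    WHERE headline LIKE '%CTO%' OR headline LIKE '%CEO%'
    ORDER BY name
""").fetchall()

print(join_rows)
print(agg_rows)
print(filter_rows)
```

One difference to keep in mind: SQLite's LIKE ignores ASCII case by default, while DuckDB's LIKE is case-sensitive, so a pattern such as '%CTO%' may match fewer rows in DuckDB than in this sketch.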
Interactive Mode
Launch an interactive SQL shell:
anysite dataset query dataset.yaml --interactive
This opens a DuckDB shell with all your dataset sources available as tables. Type SQL queries and see results instantly.
Dataset Statistics
Get a summary of collected data:
anysite dataset stats dataset.yaml
For each source, the output shows:
- Number of records collected
- Collection timestamp
- File size
- Column list with types
Source-Level Stats
anysite dataset stats dataset.yaml --source profiles
Dataset Profiling
Generate a statistical profile of your data:
anysite dataset profile dataset.yaml
The profile includes:
- Column-level statistics (min, max, mean, median, null count)
- Value distributions for categorical columns
- Data quality indicators
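As an illustration of what these column-level figures mean, here is a minimal sketch that computes them for two made-up columns using only Python's standard library. It is not how the profile command is implemented; the helper names `profile_numeric` and `profile_categorical` and the sample values are hypothetical.

```python
import statistics
from collections import Counter

def profile_numeric(values):
    """Summarize a numeric column; None entries are counted as nulls.

    Assumes the column has at least one non-null value.
    """
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "null_count": len(values) - len(present),
    }

def profile_categorical(values):
    """Value distribution for a categorical column, most common first."""
    return Counter(v for v in values if v is not None).most_common()

# Made-up sample columns (hypothetical field names)
follower_count = [120, 4500, None, 980, 300]
industry = ["Software", "Energy", "Software", None, "Software"]

numeric_profile = profile_numeric(follower_count)
categorical_profile = profile_categorical(industry)
print(numeric_profile)
print(categorical_profile)
```

A high null count or a distribution dominated by a single value is the kind of signal a data quality indicator would typically surface.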
Output Formats
Query results support the same output formats as API calls:
# Table (default for queries)
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles LIMIT 5" --format table
# CSV export
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format csv --output report.csv
# JSON
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format json
# JSONL
anysite dataset query dataset.yaml --sql "SELECT * FROM profiles" --format jsonl
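JSONL output is convenient for downstream scripts because each line is an independent JSON object. The sketch below shows one way to read such output in Python; the helper `read_jsonl` and the sample records are hypothetical, and the actual field names depend on your dataset.

```python
import json

def read_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Made-up sample of JSONL text; real output fields depend on the source schema.
sample = (
    '{"name": "Ada", "headline": "CTO at Acme"}\n'
    '{"name": "Linus", "headline": "CEO at Acme"}\n'
)

records = read_jsonl(sample)
names = [record["name"] for record in records]
print(names)
```

For large exports, iterating over an open file line by line instead of loading the whole text keeps memory use flat.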
Commands Reference
| Command | Description |
|---|---|
| `anysite dataset query <yaml> --sql "..."` | Run a SQL query on collected data |
| `anysite dataset query <yaml> --source <id>` | Query a specific source |
| `anysite dataset query <yaml> --interactive` | Open an interactive SQL shell |
| `anysite dataset stats <yaml>` | Show dataset statistics |
| `anysite dataset stats <yaml> --source <id>` | Show stats for a specific source |
| `anysite dataset profile <yaml>` | Generate a data profile with distributions |
Next Steps