## Overview
Dataset pipelines support five source types, each designed for a different data collection pattern. Sources can be combined to build complex multi-step workflows; see the sketch after the table below.

| Type | Purpose | Key Config |
|---|---|---|
| Independent | Single API call with static parameters | `endpoint`, `params` |
| From File | Batch calls iterating over a file | `from_file`, `input_key` |
| Dependent | Batch calls using values from a parent source | `dependency`, `input_key` |
| Union | Combine records from multiple sources | `type: union`, `sources` |
| LLM | Process data through an LLM model | `type: llm`, `llm` |
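The sketch below shows how two sources might be chained in one pipeline file. It is illustrative only: the source IDs, endpoint paths, and exact top-level layout are assumptions, while the field names (`endpoint`, `params`, `dependency`, `input_key`) come from the tables on this page.

```yaml
# Illustrative sketch — the exact schema of your pipeline file may differ.
sources:
  - id: search                 # hypothetical independent source
    endpoint: /search          # single call with static parameters
    params:
      query: "coffee shops"

  - id: details                # hypothetical dependent source
    endpoint: /details
    dependency:
      from_source: search      # parent source runs first
      field: results.id        # dot-notation path into parent records
    input_key: place_id        # request parameter fed by each extracted value
```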
## Independent Source
A single API call with static parameters. Use this for searches, listings, or any one-off data extraction.
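A minimal independent source entry might look like the following sketch; the endpoint path and parameter names are placeholders, not part of the documented schema.

```yaml
- id: companies               # an entry in the sources list
  endpoint: /company/search   # hypothetical endpoint
  params:                     # static parameters sent with the single call
    industry: software
    country: US
```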
## From File Source

Batch API calls driven by inputs from an external file. Each line or row in the file becomes a separate API request. Supported input formats (see the sketch after this list):

- TXT: one value per line
- CSV: uses the column matching `input_key`
- JSON/JSONL: uses the field matching `input_key`
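For example, a from-file source driven by a CSV might look like this sketch (the file path, endpoint, and column name are placeholders):

```yaml
- id: profiles                # an entry in the sources list
  endpoint: /profile/lookup   # hypothetical endpoint
  from_file: inputs/urls.csv  # one API request per row
  input_key: url              # CSV column (or JSON field) to read values from
```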
## Dependent Source
Batch API calls that use output from a parent source. The dependency chain is resolved automatically: the parent source runs first, and its results feed into the dependent source.

### Dependency Configuration
| Field | Description |
|---|---|
| `from_source` | ID of the parent source |
| `field` | Field path to extract from parent results (dot notation supported) |
| `dedupe` | Remove duplicate values before processing (default: `false`) |
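Putting these fields together, a dependent source might be configured like this sketch (IDs, endpoint, and field paths are placeholders):

```yaml
- id: posts                  # an entry in the sources list
  endpoint: /user/posts      # hypothetical endpoint
  dependency:
    from_source: profiles    # parent source; runs first
    field: user.id           # dot-notation path into parent results
    dedupe: true             # drop duplicate values before requesting
  input_key: user_id         # request parameter fed by each value
```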
## Union Source
Combines records from multiple parent sources into a single dataset. Optionally deduplicates records by a specified field.

### Union Configuration
| Field | Description |
|---|---|
| `type` | Must be `union` |
| `sources` | List of source IDs to combine |
| `dedupe_by` | Field to deduplicate by (optional) |
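A union source combining two hypothetical parents might look like this sketch:

```yaml
- id: all_leads                    # an entry in the sources list
  type: union
  sources: [search_us, search_eu]  # parent source IDs to combine
  dedupe_by: email                 # optional: drop records with duplicate emails
```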
## LLM Source
Processes data from a parent source through LLM operations, without making any API calls. Use this for classification, summarization, enrichment, and more.

### LLM Operations
| Operation | Description |
|---|---|
| `classify` | Categorize records into predefined categories |
| `enrich` | Extract new attributes (enums, strings, booleans, numbers) |
| `summarize` | Generate concise summaries |
| `generate` | Create text using templates with field placeholders |
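As a sketch, a classification source might look like the following. The nesting under `llm` and the `operation`/`categories` keys are assumptions; see LLM Analysis for the authoritative configuration.

```yaml
- id: categorized           # an entry in the sources list
  type: llm
  dependency:
    from_source: posts      # parent source providing the records
  llm:                      # assumed layout for LLM settings
    operation: classify     # classify, enrich, summarize, or generate
    categories: [complaint, praise, question]
```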
LLM sources require the `llm` extra: `pip install "anysite-cli[llm]"`. See LLM Analysis for detailed configuration.
## Per-Source Transform & Export

Sources can include post-collection transforms and exports:

### Transform Options
| Field | Description |
|---|---|
| `filter` | jq-style filter expression to keep matching records |
| `fields` | List of fields to include in the output |
| `add_columns` | Static columns to add to every record |
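For instance, a per-source transform might look like this sketch; the grouping under a `transform` key is an assumption:

```yaml
transform:                        # assumed grouping key
  filter: '.followers > 1000'     # jq-style: keep only matching records
  fields: [name, followers, url]  # project output down to these fields
  add_columns:
    batch: "2024-q3"              # static column added to every record
```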
### Export Options
| Field | Description |
|---|---|
| `type` | Export type: `file` or `webhook` |
| `path` | Output file path (supports the `{{date}}` template) |
| `format` | Export format: `csv`, `json`, or `jsonl` |
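A file export using the `{{date}}` template might look like this sketch; the grouping under an `export` key is an assumption:

```yaml
export:                              # assumed grouping key
  type: file
  path: "out/leads-{{date}}.csv"     # {{date}} expands per the table above
  format: csv
```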
### Database Load Options (per-source)
| Field | Description |
|---|---|
| `key` | Unique key column for incremental sync |
| `sync` | Sync mode: `full` (default, includes DELETE) or `append` (no DELETE) |
| `fields` | Fields to load into the database |
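As a sketch, an append-only load keyed on an `id` column might look like this; the grouping key (`db` here) is an assumption:

```yaml
db:                          # assumed grouping key for per-source load options
  key: id                    # unique column used for incremental sync
  sync: append               # keep existing rows; never issues DELETE
  fields: [id, name, email]  # only these fields are loaded
```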
## Input Templates
For endpoints that require complex input structures, use `input_template`. The `{value}` placeholder is replaced with each input value from the dependency.
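For example, an endpoint expecting a nested request body could be fed like this sketch (the endpoint and template structure are placeholders):

```yaml
- id: lookups                # an entry in the sources list
  endpoint: /batch/lookup    # hypothetical endpoint
  dependency:
    from_source: search
    field: id
  input_template:            # shapes each request's input
    query:
      id: "{value}"          # replaced with each value from the dependency
```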
## Common Source Options
These options apply to all API-based source types (independent, from_file, dependent):

| Option | Description | Default |
|---|---|---|
| `parallel` | Number of concurrent workers | `1` |
| `rate_limit` | Maximum request rate (e.g., `"10/s"`) | No limit |
| `on_error` | Error handling: `stop`, `skip`, or `retry` | `stop` |
| `refresh` | Incremental behavior: `auto` or `always` | `auto` |
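Combined, these options might appear on a source like this sketch (the endpoint and file path are placeholders):

```yaml
- id: details                # an entry in the sources list
  endpoint: /details         # hypothetical endpoint
  from_file: inputs/ids.txt
  input_key: id
  parallel: 4                # four concurrent workers
  rate_limit: "10/s"         # at most ten requests per second
  on_error: skip             # skip failed inputs instead of stopping the run
  refresh: auto              # default incremental behavior
```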