Overview
Dataset pipelines support 5 source types, each designed for a different data collection pattern. Sources can be combined to build complex multi-step workflows.
| Type | Purpose | Key Config |
| --- | --- | --- |
| Independent | Single API call with static parameters | `endpoint`, `params` |
| From File | Batch calls iterating over a file | `from_file`, `input_key` |
| Dependent | Batch calls using values from a parent source | `dependency`, `input_key` |
| Union | Combine records from multiple sources | `type: union`, `sources` |
| LLM | Process data through an LLM model | `type: llm`, `llm` |
Independent Source
A single API call with static parameters. Use this for searches, listings, or any one-off data extraction.
```yaml
sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "CTO"
      count: 50
    parallel: 1
    rate_limit: "10/s"
    on_error: stop
```
When to use: Starting point for pipelines, search queries, single profile lookups.
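For instance, a single profile lookup is just an independent source pointing at a detail endpoint. A minimal sketch, assuming /api/linkedin/user accepts a `user` parameter (the identifier shown is hypothetical; the parameter name is inferred from the `input_key: user` usage later on this page):

```yaml
sources:
  - id: single_profile
    endpoint: /api/linkedin/user
    params:
      user: "jane-doe"   # hypothetical profile identifier
```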
From File Source
Batch API calls driven by inputs from an external file. Each line/row in the file becomes a separate API request.
```yaml
sources:
  - id: company_profiles
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
```
Supported file formats (illustrated below):
- TXT — one value per line
- CSV — uses the column matching `input_key`
- JSON/JSONL — uses the field matching `input_key`
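For instance, each of the following hypothetical files would drive the company_profiles source above, producing one request per company value:

```text
# companies.txt (one value per line)
acme-corp
globex

# companies.csv (the column matching input_key, "company", is used)
company,region
acme-corp,US
globex,EU

# companies.jsonl (the field matching input_key, "company", is used)
{"company": "acme-corp"}
{"company": "globex"}
```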
When to use: You have a pre-existing list of URLs, IDs, or search terms.
Dependent Source
Batch API calls that use output from a parent source. The dependency chain is resolved automatically — the parent source runs first, and its results feed into the dependent source.
```yaml
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
      dedupe: true
    input_key: companies
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
    refresh: auto
```
Dependency Configuration
| Field | Description |
| --- | --- |
| `from_source` | ID of the parent source |
| `field` | Field path to extract from parent results (dot-notation supported) |
| `dedupe` | Remove duplicate values before processing (default: `false`) |
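To make the dot-notation concrete, suppose each parent record looks like the sketch below (the record shape is assumed for illustration); `field: urn.value` would extract the nested string as the input value for the dependent source:

```yaml
# Hypothetical record returned by the parent "companies" source:
name: Acme
urn:
  type: company
  value: "urn:li:company:123"   # field: urn.value resolves to this string
```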
Multi-level chains are supported — a dependent source can itself be the parent of another dependent source:
```yaml
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI", count: 50 }

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: employees, field: urn.value }
    input_key: user

  - id: posts
    endpoint: /api/linkedin/user/posts
    dependency: { from_source: profiles, field: urn.value }
    input_key: user
```
When to use: Multi-step data enrichment, going from search results to detailed profiles to activity data.
Union Source
Combines records from multiple parent sources into a single dataset. Optionally deduplicates records by a specified field.
```yaml
sources:
  - id: search_cto
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO", count: 50 }

  - id: search_vp
    endpoint: /api/linkedin/search/users
    params: { keywords: "VP Engineering", count: 50 }

  - id: all_leaders
    type: union
    sources: [ search_cto, search_vp ]
    dedupe_by: urn.value
```
Union Configuration
| Field | Description |
| --- | --- |
| `type` | Must be `union` |
| `sources` | List of source IDs to combine |
| `dedupe_by` | Field to deduplicate by (optional) |
When to use: Merging results from multiple searches, combining data from different platforms.
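Union parents don't have to be the same source type. The sketch below, assuming a pre-existing leads.txt file, merges a live search with a from_file batch and deduplicates on urn.value:

```yaml
sources:
  - id: search_live
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO", count: 50 }

  - id: known_leads
    endpoint: /api/linkedin/user
    from_file: leads.txt          # hypothetical input file of profile identifiers
    input_key: user

  - id: all_ctos
    type: union
    sources: [ search_live, known_leads ]
    dedupe_by: urn.value          # a profile present in both parents is kept once
```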
LLM Source
Processes data from a parent source through LLM operations — without making any API calls. Use this for classification, summarization, enrichment, and more.
```yaml
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user

  - id: profiles_analyzed
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "developer,recruiter,executive,other"
        output_column: role_type
      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead"
          - "is_technical:boolean"
      - type: summarize
        max_length: 50
        output_column: bio_summary
```
LLM Operations
| Operation | Description |
| --- | --- |
| `classify` | Categorize records into predefined categories |
| `enrich` | Extract new attributes (enums, strings, booleans, numbers) |
| `summarize` | Generate concise summaries |
| `generate` | Create text using templates with field placeholders |
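`generate` is the only operation not shown in the example above. A minimal sketch, assuming it takes a `template` string with `{field}` placeholders (the parameter names here are an assumption; see LLM Analysis for the confirmed schema):

```yaml
llm:
  - type: generate
    template: "Write a one-line intro for {name}."   # assumed parameter and placeholder syntax
    output_column: intro_line
```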
LLM sources require the `llm` extra: `pip install "anysite-cli[llm]"`. See LLM Analysis for detailed configuration.
When to use: Adding AI-powered enrichment to your pipeline, categorizing or summarizing collected data.
Transforms and Exports
Sources can include post-collection transforms and exports:
```yaml
sources:
  - id: companies
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    transform:
      filter: '.employee_count > 10'
      fields: [ name, url, employee_count ]
      add_columns:
        batch: "q1-2026"
    export:
      - type: file
        path: ./output/companies-{{date}}.csv
        format: csv
    db_load:
      key: _input_value
      sync: full
      fields: [ name, url, employee_count ]
```
Transform Options
| Field | Description |
| --- | --- |
| `filter` | jq-style filter expression to keep matching records |
| `fields` | List of fields to include in the output |
| `add_columns` | Static columns to add to every record |
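To trace the transform above end to end, consider one hypothetical record (the record shape is assumed for illustration):

```yaml
# Before the transform (as returned by /api/linkedin/company):
name: Acme
url: https://example.com/acme
employee_count: 42
industry: Software        # dropped by `fields`
---
# After the transform (kept because employee_count > 10):
name: Acme
url: https://example.com/acme
employee_count: 42
batch: "q1-2026"          # added by `add_columns`
```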
Export Options
| Field | Description |
| --- | --- |
| `type` | Export type: `file` or `webhook` |
| `path` | Output file path (supports `{{date}}` template) |
| `format` | Export format: `csv`, `json`, `jsonl` |
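Since `export` is a list, one source can write several artifacts. A sketch using only the documented fields (the paths are illustrative):

```yaml
export:
  - type: file
    path: ./output/companies-{{date}}.csv
    format: csv
  - type: file
    path: ./output/companies-{{date}}.jsonl
    format: jsonl
```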
Database Load Options (per-source)
| Field | Description |
| --- | --- |
| `key` | Unique key column for incremental sync |
| `sync` | Sync mode: `full` (default, includes DELETE) or `append` (no DELETE) |
| `fields` | Fields to load into the database |
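To keep every collection run rather than mirroring only the latest state, switch the sync mode to `append`; a sketch using the documented fields:

```yaml
db_load:
  key: _input_value
  sync: append          # rows are only ever inserted; no DELETE between runs
  fields: [ name, url, employee_count ]
```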
Input Templates
For endpoints that require complex input structures, use `input_template`:
```yaml
sources:
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
    input_template:
      companies:
        - type: company
          value: "{value}"
      count: 5
```
The {value} placeholder is replaced with each input value from the dependency.
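So for a dependency value of urn:li:company:123 (a hypothetical URN), the rendered input for that request would be:

```yaml
companies:
  - type: company
    value: "urn:li:company:123"   # substituted for {value}
count: 5
```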
Common Source Options
These options apply to all API-based source types (independent, from_file, dependent):
Option Description Default parallelNumber of concurrent workers 1rate_limitMaximum request rate (e.g., "10/s") No limit on_errorError handling: stop, skip, retry stoprefreshIncremental behavior: auto, always auto
Next Steps
- Scheduling: Set up incremental collection, cron scheduling, and webhooks
- Database Loading: Load pipeline results into SQLite, PostgreSQL, or ClickHouse