Overview

Dataset pipelines support 5 source types, each designed for a different data collection pattern. Sources can be combined to build complex multi-step workflows.
Type          Purpose                                         Key Config
Independent   Single API call with static parameters          endpoint, params
From File     Batch calls iterating over a file               from_file, input_key
Dependent     Batch calls using values from a parent source   dependency, input_key
Union         Combine records from multiple sources           type: union, sources
LLM           Process data through an LLM model               type: llm, llm

Independent Source

A single API call with static parameters. Use this for searches, listings, or any one-off data extraction.
sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "CTO"
      count: 50
    parallel: 1
    rate_limit: "10/s"
    on_error: stop
When to use: Starting point for pipelines, search queries, single profile lookups.

From File Source

Batch API calls driven by inputs from an external file. Each line/row in the file becomes a separate API request.
sources:
  - id: company_profiles
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
Supported file formats:
  • TXT — one value per line
  • CSV — uses the column matching input_key
  • JSON/JSONL — uses the field matching input_key
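For example, the companies.txt referenced above is just one value per line, while a CSV version supplies the same values through a column whose name matches input_key (all identifiers here are hypothetical):

companies.txt:
openai
anthropic
acme-ai

companies.csv (the company column matches input_key: company):
company,region
openai,us
anthropic,us
acme-ai,uk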
When to use: You have a pre-existing list of URLs, IDs, or search terms.

Dependent Source

Batch API calls that use output from a parent source. The dependency chain is resolved automatically — the parent source runs first, and its results feed into the dependent source.
sources:
  - id: companies              # parent source: runs first
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies   # consume results of the companies source
      field: urn.value         # dot-notation path into each parent record
      dedupe: true             # drop duplicate values before requesting
    input_key: companies
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
    refresh: auto

Dependency Configuration

Field         Description
from_source   ID of the parent source
field         Field path to extract from parent results (dot-notation supported)
dedupe        Remove duplicate values before processing (default: false)
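For instance, if each record returned by the companies source were shaped like the following (a hypothetical response shape), field: urn.value would extract "urn:li:company:1234" from every result, and dedupe: true would collapse repeated URNs before the employees requests are issued:

{
  "name": "Acme AI",
  "urn": {
    "type": "company",
    "value": "urn:li:company:1234"
  }
}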
Multi-level chains are supported — a dependent source can itself be the parent of another dependent source:
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI", count: 50 }

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: employees, field: urn.value }
    input_key: user

  - id: posts
    endpoint: /api/linkedin/user/posts
    dependency: { from_source: profiles, field: urn.value }
    input_key: user
When to use: Multi-step data enrichment, going from search results to detailed profiles to activity data.

Union Source

Combines records from multiple parent sources into a single dataset. Optionally deduplicates records by a specified field.
sources:
  - id: search_cto
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO", count: 50 }

  - id: search_vp
    endpoint: /api/linkedin/search/users
    params: { keywords: "VP Engineering", count: 50 }

  - id: all_leaders
    type: union
    sources: [search_cto, search_vp]
    dedupe_by: urn.value

Union Configuration

Field       Description
type        Must be union
sources     List of source IDs to combine
dedupe_by   Field to deduplicate by (optional)
When to use: Merging results from multiple searches, combining data from different platforms.

LLM Source

Processes data from a parent source through LLM operations — without making any API calls. Use this for classification, summarization, enrichment, and more.
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user

  - id: profiles_analyzed
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "developer,recruiter,executive,other"
        output_column: role_type

      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead"
          - "is_technical:boolean"

      - type: summarize
        max_length: 50
        output_column: bio_summary

LLM Operations

Operation   Description
classify    Categorize records into predefined categories
enrich      Extract new attributes (enums, strings, booleans, numbers)
summarize   Generate concise summaries
generate    Create text using templates with field placeholders
LLM sources require the llm extra: pip install "anysite-cli[llm]". See LLM Analysis for detailed configuration.
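The generate operation is not shown in the example above. A minimal sketch, assuming it takes a template with {field} placeholders and an output_column like the other operations (the exact keys may differ; see LLM Analysis for the authoritative schema):

    llm:
      - type: generate
        # "template" and the {field} placeholder syntax are assumed here
        template: "Write a one-line outreach opener for {name}, a {role_type}."
        output_column: opener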
When to use: Adding AI-powered enrichment to your pipeline, categorizing or summarizing collected data.

Per-Source Transform & Export

Sources can include post-collection transforms and exports:
sources:
  - id: companies
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    transform:
      filter: '.employee_count > 10'
      fields: [name, url, employee_count]
      add_columns:
        batch: "q1-2026"
    export:
      - type: file
        path: ./output/companies-{{date}}.csv
        format: csv
    db_load:
      key: _input_value
      sync: full
      fields: [name, url, employee_count]

Transform Options

Field         Description
filter        jq-style filter expression to keep matching records
fields        List of fields to include in the output
add_columns   Static columns to add to every record

Export Options

Field     Description
type      Export type: file or webhook
path      Output file path (supports the {{date}} template)
format    Export format: csv, json, jsonl

Database Load Options (per-source)

Field     Description
key       Unique key column for incremental sync
sync      Sync mode: full (default, includes DELETE) or append (no DELETE)
fields    Fields to load into the database
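As a sketch, switching the example above to append mode only requires changing sync; the DELETE pass of a full sync is skipped, so rows from earlier runs are kept:

    db_load:
      key: _input_value
      sync: append    # no DELETE pass; existing rows are kept
      fields: [name, url, employee_count]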

Input Templates

For endpoints that require complex input structures, use input_template:
sources:
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
    input_template:
      companies:
        - type: company
          value: "{value}"
      count: 5
The {value} placeholder is replaced with each input value from the dependency.
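For example, for a dependency value of "urn:li:company:1234" (a hypothetical URN), the template above renders to this request input:

companies:
  - type: company
    value: "urn:li:company:1234"
count: 5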

Common Source Options

These options apply to all API-based source types (independent, from_file, dependent):
Option       Description                            Default
parallel     Number of concurrent workers           1
rate_limit   Maximum request rate (e.g., "10/s")    No limit
on_error     Error handling: stop, skip, retry      stop
refresh      Incremental behavior: auto, always     auto
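Put together, a source tuned for a fast but fault-tolerant run might set all four (illustrative values):

sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user
    parallel: 5            # five concurrent workers
    rate_limit: "20/s"     # at most 20 requests per second
    on_error: retry        # retry failed requests instead of stopping
    refresh: always        # incremental behavior (default is auto)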
