Overview

Dataset pipelines support 5 source types, each designed for a different data collection pattern. Sources can be combined to build complex multi-step workflows.
Type          Purpose                                         Key Config
Independent   Single API call with static parameters          endpoint, params
From File     Batch calls iterating over a file               from_file, input_key
Dependent     Batch calls using values from a parent source   dependency, input_key
Union         Combine records from multiple sources           type: union, sources
LLM           Process data through an LLM model               type: llm, llm

Independent Source

A single API call with static parameters. Use this for searches, listings, or any one-off data extraction.
sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    params:
      keywords: "CTO"
      count: 50
    parallel: 1
    rate_limit: "10/s"
    on_error: stop
When to use: Starting point for pipelines, search queries, single profile lookups.

From File Source

Batch API calls driven by inputs from an external file. Each line/row in the file becomes a separate API request.
sources:
  - id: company_profiles
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
Supported file formats:
  • TXT — one value per line
  • CSV — uses the column matching input_key
  • JSON/JSONL — uses the field matching input_key
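For example, the companies.txt referenced above is just one value per line, while a CSV version supplies the same values through a column whose name matches input_key (all identifiers here are hypothetical):

companies.txt:
openai
anthropic
acme-ai

companies.csv (the company column matches input_key: company):
company,region
openai,us
anthropic,us
acme-ai,uk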
When to use: You have a pre-existing list of URLs, IDs, or search terms.

Dependent Source

Batch API calls that use output from a parent source. The dependency chain is resolved automatically — the parent source runs first, and its results feed into the dependent source.
sources:
  - id: companies              # parent source: runs first
    endpoint: /api/linkedin/search/companies
    params:
      keywords: "AI startup"
      count: 100

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies   # consume results of the companies source
      field: urn.value         # dot-notation path into each parent record
      dedupe: true             # drop duplicate values before requesting
    input_key: companies
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
    refresh: auto

Dependency Configuration

Field         Description
from_source   ID of the parent source
field         Field path to extract from parent results (dot-notation supported)
dedupe        Remove duplicate values before processing (default: false)
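For instance, if each record returned by the companies source were shaped like the following (a hypothetical response shape), field: urn.value would extract "urn:li:company:1234" from every result, and dedupe: true would collapse repeated URNs before the employees requests are issued:

{
  "name": "Acme AI",
  "urn": {
    "type": "company",
    "value": "urn:li:company:1234"
  }
}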
Multi-level chains are supported — a dependent source can itself be the parent of another dependent source:
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI", count: 50 }

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: employees, field: urn.value }
    input_key: user

  - id: posts
    endpoint: /api/linkedin/user/posts
    dependency: { from_source: profiles, field: urn.value }
    input_key: user
When to use: Multi-step data enrichment, going from search results to detailed profiles to activity data.

Union Source

Combines records from multiple parent sources into a single dataset. Optionally deduplicates records by a specified field.
sources:
  - id: search_cto
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO", count: 50 }

  - id: search_vp
    endpoint: /api/linkedin/search/users
    params: { keywords: "VP Engineering", count: 50 }

  - id: all_leaders
    type: union
    sources: [search_cto, search_vp]
    dedupe_by: urn.value

Union Configuration

Field       Description
type        Must be union
sources     List of source IDs to combine
dedupe_by   Field to deduplicate by (optional)
When to use: Merging results from multiple searches, combining data from different platforms.

LLM Source

Processes data from a parent source through LLM operations — without making any API calls. Use this for classification, summarization, enrichment, and more.
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user

  - id: profiles_analyzed
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "developer,recruiter,executive,other"
        output_column: role_type

      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead"
          - "is_technical:boolean"

      - type: summarize
        max_length: 50
        output_column: bio_summary

LLM Operations

Operation   Description
classify    Categorize records into predefined categories
enrich      Extract new attributes (enums, strings, booleans, numbers)
summarize   Generate concise summaries
generate    Create text using templates with field placeholders
LLM sources require the llm extra: pip install "anysite-cli[llm]". See LLM Analysis for detailed configuration.
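The generate operation is not shown in the example above. A minimal sketch, assuming it takes a template with {field} placeholders and an output_column like the other operations (the exact keys may differ; see LLM Analysis for the authoritative schema):

    llm:
      - type: generate
        # "template" and the {field} placeholder syntax are assumed here
        template: "Write a one-line outreach opener for {name}, a {role_type}."
        output_column: opener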
When to use: Adding AI-powered enrichment to your pipeline, categorizing or summarizing collected data.

Per-Source Transform & Export

Sources can include post-collection transforms and exports:
sources:
  - id: companies
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
    transform:
      filter: '.employee_count > 10'
      fields: [name, url, employee_count]
      add_columns:
        batch: "q1-2026"
    export:
      - type: file
        path: ./output/companies-{{date}}.csv
        format: csv
    db_load:
      key: _input_value
      sync: full
      fields: [name, url, employee_count]

Transform Options

Field         Description
filter        jq-style filter expression to keep matching records
fields        List of fields to include in the output
add_columns   Static columns to add to every record

Export Options

Field     Description
type      Export type: file or webhook
path      Output file path (supports the {{date}} template)
format    Export format: csv, json, jsonl

Database Load Options (per-source)

Field     Description
key       Unique key column for incremental sync
sync      Sync mode: full (default, includes DELETE) or append (no DELETE)
fields    Fields to load into the database
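As a sketch, switching the example above to append mode only requires changing sync; the DELETE pass of a full sync is skipped, so rows from earlier runs are kept:

    db_load:
      key: _input_value
      sync: append    # no DELETE pass; existing rows are kept
      fields: [name, url, employee_count]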

Input Templates

For endpoints that require complex input structures, use input_template:
sources:
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
    input_template:
      companies:
        - type: company
          value: "{value}"
      count: 5
The {value} placeholder is replaced with each input value from the dependency.
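For example, for a dependency value of "urn:li:company:1234" (a hypothetical URN), the template above renders to this request input:

companies:
  - type: company
    value: "urn:li:company:1234"
count: 5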

Common Source Options

These options apply to all API-based source types (independent, from_file, dependent):
Option       Description                            Default
parallel     Number of concurrent workers           1
rate_limit   Maximum request rate (e.g., "10/s")    No limit
on_error     Error handling: stop, skip, retry      stop
refresh      Incremental behavior: auto, always     auto
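Put together, a source tuned for a fast but fault-tolerant run might set all four (illustrative values):

sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user
    parallel: 5            # five concurrent workers
    rate_limit: "20/s"     # at most 20 requests per second
    on_error: retry        # retry failed requests instead of stopping
    refresh: always        # incremental behavior (default is auto)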
