Overview

The Data Agent is an AI-powered assistant that helps you collect, process, and analyze web data using natural language. Instead of writing CLI commands manually, describe what data you need — the agent handles endpoint discovery, pipeline configuration, execution, and delivery. The agent operates the anysite CLI toolkit and uses the /anysite-cli Claude Code skill for technical reference.
Requires the /anysite-cli Claude Code skill to be installed.

How It Works

The Data Agent follows these principles:
  • Start with the goal, not the tool. It understands your data need before reaching for commands. “Find me CTOs in fintech” is a data need, not a CLI instruction.
  • Choose smart defaults. It picks reasonable options (format, parallelism, error handling) without asking, unless the choice significantly impacts cost or time.
  • Show the work plan. Before executing anything non-trivial, it states what it will do and the approximate number of API calls.
  • Prefer simplicity. A single anysite api call beats a full pipeline if it solves the problem. But when scale, dependencies, or repeatability matter, it builds a proper pipeline.
  • Deliver insight, not just data. After collecting, it summarizes findings and highlights patterns and outliers.
  • Suggest next steps. “Want me to enrich these with seniority level?”, “I can set this up as a weekly pipeline”, “Should I load this into your database?”

Workflow

Step 1: Understand the Data Need

The agent parses your request to identify:
  • Entities: people, companies, posts, comments, jobs, products?
  • Attributes: names? Emails? Follower counts? Sentiment?
  • Scale: one record, tens, hundreds, thousands?
  • Outcome: a quick answer, a spreadsheet, a database table, an ongoing pipeline?
The agent asks questions when:
  • The scope is ambiguous and getting it wrong wastes significant credits
  • Multiple approaches exist with very different tradeoffs
  • You may be unaware of richer data available from the API
The agent just acts when:
  • The request is clear and small-scale
  • There is an obvious best approach
  • It can show a sample first and iterate

Step 2: Discover Endpoints

The agent always discovers endpoints before writing API calls or dataset configs:
anysite describe                             # List all available endpoints
anysite describe --search "company"          # Search by keyword
anysite describe /api/linkedin/company       # Inspect specific endpoint
It maps your data need to specific endpoints. Common chains:
  • Search → Detail — find entities, then get full profiles
  • Profile → Posts/Activity — get a person, then their content
  • Company → Employees → Profiles — organizational deep-dive
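For example, mapping a company intelligence request onto the Company → Employees → Profiles chain might involve a discovery sequence like this (the search keyword is illustrative; the endpoint paths are the ones used in the patterns later on this page):
anysite describe --search "employees"              # find candidate endpoints by keyword
anysite describe /api/linkedin/company             # check params for the company lookup
anysite describe /api/linkedin/company/employees   # check which fields are available for chaining
anysite describe /api/linkedin/user                # confirm the profile detail endpoint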

Step 3: Choose the Right Approach

The agent uses this decision tree:
One-off lookup of 1-5 items?
  → anysite api (ad-hoc call)

Batch from a known list?
  Small (< 20)  → anysite api --from-file
  Large (20+)   → Dataset pipeline with from_file source

Chaining multiple endpoints (search → detail → posts)?
  → Dataset pipeline with dependent sources

Needs to run repeatedly (daily, weekly)?
  → Dataset pipeline + schedule + incremental

One-time large collection?
  → Dataset pipeline (for progress tracking, error recovery, Parquet storage)
The agent adds LLM enrichment when:
  • You ask for subjective analysis (sentiment, categorization, scoring)
  • Structured attributes need extraction from free text
  • Generated content is needed (summaries, outreach messages)
  • Semantic deduplication is required
It sets up database loading when:
  • You want SQL querying after collection
  • Data will be updated incrementally
  • Related tables need FK relationships
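As a sketch of the small-batch branch of the decision tree (a known list of fewer than ~20 identifiers), the call could look like the following. The endpoint path is illustrative, and passing it as a positional argument to anysite api is an assumption; confirm the exact syntax with anysite describe and the /anysite-cli skill:
# Sketch only: assumes anysite api takes the endpoint positionally, like anysite describe
anysite api /api/linkedin/user --from-file usernames.txt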

Step 4: Execute

The agent follows these execution rules:
  • Always --dry-run before the first collection of a new pipeline
  • parallel: 3-5 as a safe default for batch sources
  • on_error: skip for large batches
  • --incremental for re-runs to avoid duplicate work
  • --load-db <connection> when you want database output
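Put together, a typical first collection and its re-runs look like this (dataset.yaml and the pg connection name are placeholders):
anysite dataset collect dataset.yaml --dry-run                    # validate the plan and API call count first
anysite dataset collect dataset.yaml --load-db pg                 # first full collection, loaded into the database
anysite dataset collect dataset.yaml --incremental --load-db pg   # re-runs skip already-collected work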

Step 5: Analyze and Deliver

The agent matches delivery format to your need:
  • Quick answer → summarize in conversation
  • Spreadsheet → --format csv --output results.csv
  • Visual table → --format table
  • Database → --load-db <connection>
After delivering, it suggests logical follow-ups based on the collected data.

Pipeline Patterns

The agent uses these ready-made templates as starting points and customizes them for your specific needs.

Search → Enrich

Search for entities, then get full details:
sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO fintech", count: 50 }

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: search, field: urn.value }
    input_key: user
    parallel: 3

storage:
  format: parquet
  path: ./data/

Multi-Search → Union → Enrich

Multiple searches combined, deduplicated, then enriched:
sources:
  - id: search_a
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO fintech", count: 50 }

  - id: search_b
    endpoint: /api/linkedin/search/users
    params: { keywords: "VP Engineering fintech", count: 50 }

  - id: all_results
    type: union
    sources: [search_a, search_b]
    dedupe_by: urn.value

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: all_results, field: urn.value }
    input_key: user
    parallel: 3

storage:
  format: parquet
  path: ./data/

Company → Employees → Profiles

Deep company intelligence chain:
sources:
  - id: company
    endpoint: /api/linkedin/company
    params: { company: "anthropic" }

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: company, field: urn.value }
    input_key: companies
    input_template:
      companies: [{ type: company, value: "{value}" }]
      count: 50

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: employees, field: internal_id.value }
    input_key: user
    parallel: 3

storage:
  format: parquet
  path: ./data/

From-File Batch

Process a user-provided list of identifiers:
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: usernames.txt
    input_key: user
    parallel: 5
    on_error: skip

storage:
  format: parquet
  path: ./data/

Collect + LLM Analysis

Collect data, then analyze with LLM in the same pipeline:
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: usernames.txt
    input_key: user
    parallel: 3

  - id: analyzed
    type: llm
    dependency: { from_source: profiles, field: name }
    llm:
      - type: classify
        categories: "strong_fit,moderate_fit,weak_fit"
        output_column: fit
        fields: [headline, summary, experience]
      - type: enrich
        add:
          - "seniority:junior/mid/senior/executive"
          - "key_skills:string"
        fields: [headline, experience]
    export:
      - type: file
        path: ./output/analyzed-{{date}}.csv
        format: csv

storage:
  format: parquet
  path: ./data/

Incremental Daily Pipeline

Scheduled collection that only gets new data:
sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params: { keywords: "ML engineer", count: 100 }
    refresh: always

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: search, field: urn.value, dedupe: true }
    input_key: user
    parallel: 3
    db_load:
      key: urn.value
      sync: full

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 9 * * MON-FRI"

Static Profiles → Fresh Activity

Profiles are collected once. Posts and comments are re-fetched every run, with only new records loaded into the database:
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: target_profiles.txt
    input_key: user
    parallel: 3

  - id: posts
    endpoint: /api/linkedin/user/posts
    dependency: { from_source: profiles, field: urn.value }
    input_key: urn
    input_template:
      urn: "urn:li:fsd_profile:{value}"
      count: 20
    parallel: 3
    refresh: always
    db_load:
      key: urn.value
      sync: append

  - id: comments
    endpoint: /api/linkedin/post/comments
    dependency: { from_source: posts, field: urn.value }
    input_key: urn
    input_template:
      urn: "urn:li:activity:{value}"
      count: 50
    parallel: 3
    refresh: always
    db_load:
      key: urn.value
      sync: append

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 8 * * MON-FRI"
# First run — collects profiles + posts + comments
anysite dataset collect dataset.yaml --load-db pg

# Daily runs — profiles skipped, only fresh posts & comments collected
anysite dataset collect dataset.yaml --incremental --load-db pg

Key Constraints

API parameters:
  • location, current_companies, industry accept ONE name (string) or MULTIPLE URNs (JSON array). A list of names ["Microsoft", "Google"] does NOT work — use one name or multiple URNs.
  • Always run anysite describe <endpoint> to verify exact param names and types.
Dependency field gotchas:
  • Company employees endpoint: use internal_id.value or urn.value to chain to user profiles, NOT alias or url.
  • Nested JSON in Parquet is traversed with dot-notation: urn.value, experience[0].company_urn (see the record sketch at the end of this section).
Performance defaults:
  • parallel: 3-5, on_error: skip for batch sources
  • --incremental for re-runs, --no-llm to skip expensive LLM steps
Storage:
  • Parquet snapshots at raw/<source_id>/YYYY-MM-DD.parquet
  • metadata.json tracks incremental state — use reset-cursor to clear
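To illustrate the dot-notation rule above, here is a hypothetical, simplified record shape and the paths that resolve against it (the actual fields returned by each endpoint should be checked with anysite describe):
# Hypothetical record shape, not an actual API response
urn:
  value: "ACoAAB1234"                   # reachable as urn.value
experience:
  - company_urn: "urn:li:company:789"   # reachable as experience[0].company_urn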

Quick Start Checklist

Before any data task, verify the environment:
anysite --version                    # CLI available?
anysite schema update                # Schema cache current?
anysite config get api_key           # API key configured?