Documentation Index
Fetch the complete documentation index at: https://docs.anysite.io/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The Data Agent is an AI-powered assistant that helps you collect, process, and analyze web data using natural language. Instead of writing CLI commands manually, describe what data you need — the agent handles endpoint discovery, pipeline configuration, execution, and delivery.
The agent operates the anysite CLI toolkit and uses the /anysite-cli Claude Code skill for technical reference.
How It Works
The Data Agent follows these principles:
- Start with the goal, not the tool. It understands your data need before reaching for commands. “Find me CTOs in fintech” is a data need, not a CLI instruction.
- Make smart defaults. Chooses reasonable options (format, parallelism, error handling) without asking — unless the choice significantly impacts cost or time.
- Show the work plan. Before executing anything non-trivial, states what it will do and the approximate number of API calls.
- Prefer simplicity. A single anysite api call beats a full pipeline if it solves the problem. But when scale, dependencies, or repeatability matter, it builds a proper pipeline.
- Deliver insight, not just data. After collecting, it summarizes findings and highlights patterns and outliers.
- Suggest next steps. “Want me to enrich these with seniority level?”, “I can set this up as a weekly pipeline”, “Should I load this into your database?”
Workflow
Step 1: Understand the Data Need
The agent parses your request to identify:
| Dimension | Question |
|---|---|
| Entities | People, companies, posts, comments, jobs, products? |
| Attributes | Names? Emails? Follower counts? Sentiment? |
| Scale | One record, tens, hundreds, thousands? |
| Outcome | A quick answer, a spreadsheet, a database table, an ongoing pipeline? |
The agent asks questions when:
- The scope is ambiguous and getting it wrong wastes significant credits
- Multiple approaches exist with very different tradeoffs
- You may be unaware of richer data available from the API
The agent just acts when:
- The request is clear and small-scale
- There is an obvious best approach
- It can show a sample first and iterate
Step 2: Discover Endpoints
The agent always discovers endpoints before writing API calls or dataset configs:
anysite describe # List all available endpoints
anysite describe --search "company" # Search by keyword
anysite describe /api/linkedin/company # Inspect specific endpoint
It maps your data need to specific endpoints. Common chains:
- Search → Detail — find entities, then get full profiles
- Profile → Posts/Activity — get a person, then their content
- Company → Employees → Profiles — organizational deep-dive
When the task involves loading data into a database, the agent also discovers the target database structure:
anysite db discover mydb # Schema, tables, columns, indexes, FKs
anysite db discover mydb --with-llm # Add LLM-generated descriptions
anysite db catalog mydb --json # View saved catalog as JSON
The agent can also use the built-in dataset guide for pipeline configuration reference:
anysite dataset guide --section sources # Source types reference
anysite dataset guide --example advanced # Complete example config
anysite dataset guide --json # Structured JSON for agents
Step 3: Choose the Right Approach
The agent uses this decision tree:
One-off lookup of 1-5 items?
→ anysite api (ad-hoc call)
Batch from a known list?
Small (< 20) → anysite api --from-file
Large (20+) → Dataset pipeline with from_file source
Chaining multiple endpoints (search → detail → posts)?
→ Dataset pipeline with dependent sources
Needs to run repeatedly (daily, weekly)?
→ Dataset pipeline + schedule + incremental
One-time large collection?
→ Dataset pipeline (for progress tracking, error recovery, Parquet storage)
Adds LLM enrichment when:
- You ask for subjective analysis (sentiment, categorization, scoring)
- Structured attributes need extraction from free text
- Generated content is needed (summaries, outreach messages)
- Semantic deduplication is required
Sets up database loading when:
- You want SQL querying after collection
- Data will be updated incrementally
- Related tables need FK relationships
In that case, it uses Database Discovery (Step 2) to understand the target schema before loading.
Step 4: Execute
The agent follows execution rules:
- Always --dry-run before the first collection of a new pipeline
- parallel: 3-5 as a safe default for batch sources
- on_error: skip for large batches
- --incremental for re-runs to avoid duplicate work
- --load-db <connection> when you want database output
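Taken together, these rules typically translate into a sequence like the following sketch; dataset.yaml and pg are the same placeholder config name and database connection used in the patterns later on this page:

anysite dataset collect dataset.yaml --dry-run                     # validate the plan before the first real collection
anysite dataset collect dataset.yaml --load-db pg                  # first full collection, loaded into the database
anysite dataset collect dataset.yaml --incremental --load-db pg    # re-runs skip already-collected work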
Step 5: Analyze and Deliver
The agent matches delivery format to your need:
| Need | Format |
|---|---|
| Quick answer | Summarize in conversation |
| Spreadsheet | --format csv --output results.csv |
| Visual table | --format table |
| Database | --load-db <connection> |
After delivering, it suggests logical follow-ups based on the collected data.
Pipeline Patterns
The agent uses these ready-made templates as starting points and customizes them for your specific needs.
Search → Enrich
Search for entities, then get full details:
sources:
- id: search
endpoint: /api/linkedin/search/users
params: { keywords: "CTO fintech", count: 50 }
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: search, field: urn.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
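Running a pattern is the same in every case; for example, with the config above saved as search-enrich.yaml (a placeholder filename):

anysite dataset collect search-enrich.yaml --dry-run   # preview before collecting
anysite dataset collect search-enrich.yaml             # execute the collection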
Multi-Search → Union → Enrich
Multiple searches combined, deduplicated, then enriched:
sources:
- id: search_a
endpoint: /api/linkedin/search/users
params: { keywords: "CTO fintech", count: 50 }
- id: search_b
endpoint: /api/linkedin/search/users
params: { keywords: "VP Engineering fintech", count: 50 }
- id: all_results
type: union
sources: [search_a, search_b]
dedupe_by: urn.value
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: all_results, field: urn.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
Company → Employees → Profiles
Deep company intelligence chain:
sources:
- id: company
endpoint: /api/linkedin/company
params: { company: "anthropic" }
- id: employees
endpoint: /api/linkedin/company/employees
dependency: { from_source: company, field: urn.value }
input_key: companies
input_template:
companies: [{ type: company, value: "{value}" }]
count: 50
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: employees, field: internal_id.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
From-File Batch
Process a user-provided list of identifiers:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: usernames.txt
input_key: user
parallel: 5
on_error: skip
storage:
format: parquet
path: ./data/
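The file passed to from_file is assumed here to be plain text with one identifier per line; the usernames below are invented placeholders, not real profiles:

# usernames.txt (hypothetical contents)
jane-doe-12345
john-smith-8842
acme-cto-profile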
Collect + LLM Analysis
Collect data, then analyze with LLM in the same pipeline:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: usernames.txt
input_key: user
parallel: 3
- id: analyzed
type: llm
dependency: { from_source: profiles, field: name }
llm:
- type: classify
categories: "strong_fit,moderate_fit,weak_fit"
output_column: fit
fields: [headline, summary, experience]
- type: enrich
add:
- "seniority:junior/mid/senior/executive"
- "key_skills:string"
fields: [headline, experience]
export:
- type: file
path: ./output/analyzed-{{date}}.csv
format: csv
storage:
format: parquet
path: ./data/
Incremental Daily Pipeline
Scheduled collection that only gets new data:
sources:
- id: search
endpoint: /api/linkedin/search/users
params: { keywords: "ML engineer", count: 100 }
refresh: always
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: search, field: urn.value, dedupe: true }
input_key: user
parallel: 3
db_load:
key: urn.value
sync: full
storage:
format: parquet
path: ./data/
schedule:
cron: "0 9 * * MON-FRI"
Static Profiles → Fresh Activity
Profiles are collected once. Posts and comments are re-fetched every run, with only new records loaded into the database:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: target_profiles.txt
input_key: user
parallel: 3
- id: posts
endpoint: /api/linkedin/user/posts
dependency: { from_source: profiles, field: urn.value }
input_key: urn
input_template:
urn: "urn:li:fsd_profile:{value}"
count: 20
parallel: 3
refresh: always
db_load:
key: urn.value
sync: append
- id: comments
endpoint: /api/linkedin/post/comments
dependency: { from_source: posts, field: urn.value }
input_key: urn
input_template:
urn: "urn:li:activity:{value}"
count: 50
parallel: 3
refresh: always
db_load:
key: urn.value
sync: append
storage:
format: parquet
path: ./data/
schedule:
cron: "0 8 * * MON-FRI"
# First run — collects profiles + posts + comments
anysite dataset collect dataset.yaml --load-db pg
# Daily runs — profiles skipped, only fresh posts & comments collected
anysite dataset collect dataset.yaml --incremental --load-db pg
Key Constraints
API parameters:
- location, current_companies, industry accept ONE name (string) or MULTIPLE URNs (JSON array). A list of names ["Microsoft", "Google"] does NOT work; use one name or multiple URNs, as illustrated below.
- Always run anysite describe <endpoint> to verify exact param names and types.
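A minimal illustration of the valid and invalid shapes; the URN strings are placeholders, and the exact URN format the API expects is an assumption, so confirm it with anysite describe:

params: { keywords: "CTO fintech", current_companies: "Microsoft" }                                   # OK: one name as a string
params: { keywords: "CTO fintech", current_companies: ["urn:li:company:111", "urn:li:company:222"] }  # OK: multiple URNs (placeholder values)
params: { keywords: "CTO fintech", current_companies: ["Microsoft", "Google"] }                       # NOT supported: list of names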
Dependency field gotchas:
- Company employees endpoint: use internal_id.value or urn.value to chain to user profiles, NOT alias or url.
- Nested JSON in Parquet is traversed with dot-notation: urn.value, experience[0].company_urn.
Performance defaults:
- parallel: 3-5, on_error: skip for batch sources
- --incremental for re-runs, --no-llm to skip expensive LLM steps
Storage:
- Parquet snapshots at raw/<source_id>/YYYY-MM-DD.parquet (see the layout sketch below)
- metadata.json tracks incremental state; use reset-cursor to clear it
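For a pipeline with sources named search and profiles, the on-disk result would look roughly like this; it assumes the raw/ tree lives under the configured storage path ./data/, the dates are invented, and the exact location of metadata.json is an assumption:

./data/
  raw/
    search/2025-06-02.parquet      # one snapshot per collection date (hypothetical date)
    profiles/2025-06-02.parquet
  metadata.json                    # incremental cursor state (location assumed)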
Quick Start Checklist
The agent communicates via the Agent Protocol — structured JSON output with exit codes, error codes, and next-step hints. When called from a pipe or subprocess, all output is automatically JSON.
Before any data task, verify the environment:
anysite --version # CLI available?
anysite schema update # Schema cache current?
anysite config get api_key # API key configured?
anysite db discover <name> # (Optional) Discover target DB schema