Documentation Index
Fetch the complete documentation index at: https://docs.anysite.io/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The Data Agent is an AI-powered assistant that helps you collect, process, and analyze web data using natural language. Instead of writing CLI commands manually, describe what data you need — the agent handles endpoint discovery, pipeline configuration, execution, and delivery.
The agent operates the anysite CLI toolkit and uses the /anysite-cli Claude Code skill for technical reference.
How It Works
The Data Agent follows these principles:
- Start with the goal, not the tool. It understands your data need before reaching for commands. “Find me CTOs in fintech” is a data need, not a CLI instruction.
- Make smart defaults. Chooses reasonable options (format, parallelism, error handling) without asking — unless the choice significantly impacts cost or time.
- Show the work plan. Before executing anything non-trivial, states what it will do and the approximate number of API calls.
- Prefer simplicity. A single anysite api call beats a full pipeline if it solves the problem. But when scale, dependencies, or repeatability matter, it builds a proper pipeline.
- Deliver insight, not just data. After collecting, it summarizes findings and highlights patterns and outliers.
- Suggest next steps. “Want me to enrich these with seniority level?”, “I can set this up as a weekly pipeline”, “Should I load this into your database?”
Workflow
Step 1: Understand the Data Need
The agent parses your request to identify:
| Dimension | Question |
|---|---|
| Entities | People, companies, posts, comments, jobs, products? |
| Attributes | Names? Emails? Follower counts? Sentiment? |
| Scale | One record, tens, hundreds, thousands? |
| Outcome | A quick answer, a spreadsheet, a database table, an ongoing pipeline? |
The agent asks questions when:
- The scope is ambiguous and getting it wrong wastes significant credits
- Multiple approaches exist with very different tradeoffs
- You may be unaware of richer data available from the API
The agent just acts when:
- The request is clear and small-scale
- There is an obvious best approach
- It can show a sample first and iterate
Step 2: Discover Endpoints
The agent always discovers endpoints before writing API calls or dataset configs:
anysite describe # List all available endpoints
anysite describe --search "company" # Search by keyword
anysite describe /api/linkedin/company # Inspect specific endpoint
It maps your data need to specific endpoints. Common chains:
- Search → Detail — find entities, then get full profiles
- Profile → Posts/Activity — get a person, then their content
- Company → Employees → Profiles — organizational deep-dive
When the task involves loading data into a database, the agent also discovers the target database structure:
anysite db discover mydb # Schema, tables, columns, indexes, FKs
anysite db discover mydb --with-llm # Add LLM-generated descriptions
anysite db catalog mydb --json # View saved catalog as JSON
The agent can also use the built-in dataset guide for pipeline configuration reference:
anysite dataset guide --section sources # Source types reference
anysite dataset guide --example advanced # Complete example config
anysite dataset guide --json # Structured JSON for agents
Step 3: Choose the Right Approach
The agent uses this decision tree:
One-off lookup of 1-5 items?
→ anysite api (ad-hoc call)
Batch from a known list?
Small (< 20) → anysite api --from-file
Large (20+) → Dataset pipeline with from_file source
Chaining multiple endpoints (search → detail → posts)?
→ Dataset pipeline with dependent sources
Needs to run repeatedly (daily, weekly)?
→ Dataset pipeline + schedule + incremental
One-time large collection?
→ Dataset pipeline (for progress tracking, error recovery, Parquet storage)
Adds LLM enrichment when:
- You ask for subjective analysis (sentiment, categorization, scoring)
- Structured attributes need extraction from free text
- Generated content is needed (summaries, outreach messages)
- Semantic deduplication is required
Sets up database loading when:
- You want SQL querying after collection
- Data will be updated incrementally
- Related tables need FK relationships
In that case, it uses Database Discovery (Step 2) to understand the target schema before loading.
Step 4: Execute
The agent follows execution rules:
- Always --dry-run before the first collection of a new pipeline
- parallel: 3-5 as a safe default for batch sources
- on_error: skip for large batches
- --incremental for re-runs to avoid duplicate work
- --load-db <connection> when you want database output
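Taken together, these rules typically translate into a sequence like the following sketch; dataset.yaml and pg are the same placeholder config name and database connection used in the patterns later on this page:

anysite dataset collect dataset.yaml --dry-run                     # validate the plan before the first real collection
anysite dataset collect dataset.yaml --load-db pg                  # first full collection, loaded into the database
anysite dataset collect dataset.yaml --incremental --load-db pg    # re-runs skip already-collected work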
Step 5: Analyze and Deliver
The agent matches delivery format to your need:
| Need | Format |
|---|---|
| Quick answer | Summarize in conversation |
| Spreadsheet | --format csv --output results.csv |
| Visual table | --format table |
| Database | --load-db <connection> |
After delivering, it suggests logical follow-ups based on the collected data.
Pipeline Patterns
The agent uses these ready-made templates as starting points and customizes them for your specific needs.
Search → Enrich
Search for entities, then get full details:
sources:
- id: search
endpoint: /api/linkedin/search/users
params: { keywords: "CTO fintech", count: 50 }
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: search, field: urn.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
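Running a pattern is the same in every case; for example, with the config above saved as search-enrich.yaml (a placeholder filename):

anysite dataset collect search-enrich.yaml --dry-run   # preview before collecting
anysite dataset collect search-enrich.yaml             # execute the collection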
Multi-Search → Union → Enrich
Multiple searches combined, deduplicated, then enriched:
sources:
- id: search_a
endpoint: /api/linkedin/search/users
params: { keywords: "CTO fintech", count: 50 }
- id: search_b
endpoint: /api/linkedin/search/users
params: { keywords: "VP Engineering fintech", count: 50 }
- id: all_results
type: union
sources: [search_a, search_b]
dedupe_by: urn.value
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: all_results, field: urn.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
Company → Employees → Profiles
Deep company intelligence chain:
sources:
- id: company
endpoint: /api/linkedin/company
params: { company: "anthropic" }
- id: employees
endpoint: /api/linkedin/company/employees
dependency: { from_source: company, field: urn.value }
input_key: companies
input_template:
companies: [{ type: company, value: "{value}" }]
count: 50
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: employees, field: internal_id.value }
input_key: user
parallel: 3
storage:
format: parquet
path: ./data/
From-File Batch
Process a user-provided list of identifiers:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: usernames.txt
input_key: user
parallel: 5
on_error: skip
storage:
format: parquet
path: ./data/
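The file passed to from_file is assumed here to be plain text with one identifier per line; the usernames below are invented placeholders, not real profiles:

# usernames.txt (hypothetical contents)
jane-doe-12345
john-smith-8842
acme-cto-profile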
Collect + LLM Analysis
Collect data, then analyze with LLM in the same pipeline:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: usernames.txt
input_key: user
parallel: 3
- id: analyzed
type: llm
dependency: { from_source: profiles, field: name }
llm:
- type: classify
categories: "strong_fit,moderate_fit,weak_fit"
output_column: fit
fields: [headline, summary, experience]
- type: enrich
add:
- "seniority:junior/mid/senior/executive"
- "key_skills:string"
fields: [headline, experience]
export:
- type: file
path: ./output/analyzed-{{date}}.csv
format: csv
storage:
format: parquet
path: ./data/
Incremental Daily Pipeline
Scheduled collection that only gets new data:
sources:
- id: search
endpoint: /api/linkedin/search/users
params: { keywords: "ML engineer", count: 100 }
refresh: always
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: search, field: urn.value, dedupe: true }
input_key: user
parallel: 3
db_load:
key: urn.value
sync: full
storage:
format: parquet
path: ./data/
schedule:
cron: "0 9 * * MON-FRI"
Static Profiles → Fresh Activity
Profiles are collected once. Posts and comments are re-fetched every run, with only new records loaded into the database:
sources:
- id: profiles
endpoint: /api/linkedin/user
from_file: target_profiles.txt
input_key: user
parallel: 3
- id: posts
endpoint: /api/linkedin/user/posts
dependency: { from_source: profiles, field: urn.value }
input_key: urn
input_template:
urn: "urn:li:fsd_profile:{value}"
count: 20
parallel: 3
refresh: always
db_load:
key: urn.value
sync: append
- id: comments
endpoint: /api/linkedin/post/comments
dependency: { from_source: posts, field: urn.value }
input_key: urn
input_template:
urn: "urn:li:activity:{value}"
count: 50
parallel: 3
refresh: always
db_load:
key: urn.value
sync: append
storage:
format: parquet
path: ./data/
schedule:
cron: "0 8 * * MON-FRI"
# First run — collects profiles + posts + comments
anysite dataset collect dataset.yaml --load-db pg
# Daily runs — profiles skipped, only fresh posts & comments collected
anysite dataset collect dataset.yaml --incremental --load-db pg
Key Constraints
API parameters:
- location, current_companies, industry accept ONE name (string) or MULTIPLE URNs (JSON array). A list of names ["Microsoft", "Google"] does NOT work; use one name or multiple URNs, as illustrated below.
- Always run anysite describe <endpoint> to verify exact param names and types.
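A minimal illustration of the valid and invalid shapes; the URN strings are placeholders, and the exact URN format the API expects is an assumption, so confirm it with anysite describe:

params: { keywords: "CTO fintech", current_companies: "Microsoft" }                                   # OK: one name as a string
params: { keywords: "CTO fintech", current_companies: ["urn:li:company:111", "urn:li:company:222"] }  # OK: multiple URNs (placeholder values)
params: { keywords: "CTO fintech", current_companies: ["Microsoft", "Google"] }                       # NOT supported: list of names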
Dependency field gotchas:
- Company employees endpoint: use internal_id.value or urn.value to chain to user profiles, NOT alias or url.
- Nested JSON in Parquet is traversed with dot-notation: urn.value, experience[0].company_urn.
Performance defaults:
- parallel: 3-5, on_error: skip for batch sources
- --incremental for re-runs, --no-llm to skip expensive LLM steps
Storage:
- Parquet snapshots at raw/<source_id>/YYYY-MM-DD.parquet (see the layout sketch below)
- metadata.json tracks incremental state; use reset-cursor to clear it
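For a pipeline with sources named search and profiles, the on-disk result would look roughly like this; it assumes the raw/ tree lives under the configured storage path ./data/, the dates are invented, and the exact location of metadata.json is an assumption:

./data/
  raw/
    search/2025-06-02.parquet      # one snapshot per collection date (hypothetical date)
    profiles/2025-06-02.parquet
  metadata.json                    # incremental cursor state (location assumed)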
Quick Start Checklist
The agent communicates via the Agent Protocol — structured JSON output with exit codes, error codes, and next-step hints. When called from a pipe or subprocess, all output is automatically JSON.
Before any data task, verify the environment:
anysite --version # CLI available?
anysite schema update # Schema cache current?
anysite config get api_key # API key configured?
anysite db discover <name> # (Optional) Discover target DB schema