Overview
The Data Agent is an AI-powered assistant that helps you collect, process, and analyze web data using natural language. Instead of writing CLI commands manually, describe what data you need — the agent handles endpoint discovery, pipeline configuration, execution, and delivery. The agent operates the anysite CLI toolkit and uses the `/anysite-cli` Claude Code skill for technical reference.
Requires the Claude Code Skill to be installed.
How It Works
The Data Agent follows these principles:
- Start with the goal, not the tool. It understands your data need before reaching for commands. “Find me CTOs in fintech” is a data need, not a CLI instruction.
- Choose smart defaults. It picks reasonable options (format, parallelism, error handling) without asking — unless the choice significantly impacts cost or time.
- Show the work plan. Before executing anything non-trivial, it states what it will do and the approximate number of API calls.
- Prefer simplicity. A single `anysite api` call beats a full pipeline if it solves the problem (see the sketch after this list). But when scale, dependencies, or repeatability matter, it builds a proper pipeline.
- Deliver insight, not just data. After collecting, it summarizes findings and highlights patterns and outliers.
- Suggest next steps. “Want me to enrich these with seniority level?”, “I can set this up as a weekly pipeline”, “Should I load this into your database?”
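As a sketch of the simplicity principle, a one-off question can stay a single CLI call. The endpoint name and parameter flag below are illustrative assumptions, not verified anysite syntax:

```bash
# Hypothetical one-off lookup: one API call, no pipeline, answer shown in the terminal.
# The endpoint id and the --param flag are assumed for illustration only.
anysite api <company-profile-endpoint> --param 'alias="anthropic"' --format table
```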
Workflow
Step 1: Understand the Data Need
The agent parses your request to identify:

| Dimension | Question |
|---|---|
| Entities | People, companies, posts, comments, jobs, products? |
| Attributes | Names? Emails? Follower counts? Sentiment? |
| Scale | One record, tens, hundreds, thousands? |
| Outcome | A quick answer, a spreadsheet, a database table, an ongoing pipeline? |

The agent asks a clarifying question when:
- The scope is ambiguous and getting it wrong wastes significant credits
- Multiple approaches exist with very different tradeoffs
- You may be unaware of richer data available from the API

It proceeds without asking when:
- The request is clear and small-scale
- There is an obvious best approach
- It can show a sample first and iterate
Step 2: Discover Endpoints
The agent always discovers endpoints before writing API calls or dataset configs (a discovery sketch follows the list below). Typical endpoint chains:
- Search → Detail — find entities, then get full profiles
- Profile → Posts/Activity — get a person, then their content
- Company → Employees → Profiles — organizational deep-dive
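Before any of these chains is configured, the agent verifies what each endpoint actually accepts. This minimal sketch uses only `anysite describe`, which is referenced in Key Constraints below; the endpoint identifiers are placeholders:

```bash
# Inspect the search endpoint first: exact parameter names and types.
anysite describe <search-endpoint>

# Then inspect the detail endpoint the results will be chained into.
anysite describe <profile-endpoint>
```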
Step 3: Choose the Right Approach
The agent uses this decision tree (a brief command sketch follows the lists). Add LLM processing steps when:
- You ask for subjective analysis (sentiment, categorization, scoring)
- Structured attributes need extraction from free text
- Generated content is needed (summaries, outreach messages)
- Semantic deduplication is required

Load results into a database when:
- You want SQL querying after collection
- Data will be updated incrementally
- Related tables need FK relationships
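As a rough sketch of how those branches surface in practice: the `--no-llm` and `--load-db <connection>` flags appear later in this document, while the `dataset run` subcommand and connection name are assumptions:

```bash
# Assumed runner invocation, shown only to contrast the two branches.
# Collection without the expensive LLM steps:
anysite dataset run pipeline.yaml --no-llm

# Collection plus LLM analysis, loaded into a database for later SQL querying:
anysite dataset run pipeline.yaml --load-db my_postgres
```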
Step 4: Execute
The agent follows execution rules:
- Always `--dry-run` before the first collection of a new pipeline
- `parallel: 3-5` as a safe default for batch sources
- `on_error: skip` for large batches
- `--incremental` for re-runs to avoid duplicate work
- `--load-db <connection>` when you want database output
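A minimal sketch of that sequence, assuming a hypothetical `dataset run` subcommand and connection name; the flags themselves are the ones listed above:

```bash
# 1. Validate the pipeline and estimate API calls before spending credits.
#    ("anysite dataset run" is an assumed name for the pipeline runner.)
anysite dataset run pipeline.yaml --dry-run

# 2. First real collection (parallel and on_error are set inside the config).
anysite dataset run pipeline.yaml

# 3. Later re-runs: fetch only new work and load results into the database.
anysite dataset run pipeline.yaml --incremental --load-db my_postgres
```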
Step 5: Analyze and Deliver
The agent matches delivery format to your need:

| Need | Format |
|---|---|
| Quick answer | Summarize in conversation |
| Spreadsheet | `--format csv --output results.csv` |
| Visual table | `--format table` |
| Database | `--load-db <connection>` |
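For example, the spreadsheet and table rows map to invocations like these; the flags are quoted from the table, while the command they attach to and the endpoint are assumptions:

```bash
# Spreadsheet delivery (assumed single-call syntax around the documented flags).
anysite api <endpoint> --format csv --output results.csv

# Quick visual check directly in the terminal.
anysite api <endpoint> --format table
```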
Pipeline Patterns
The agent uses these ready-made templates as starting points and customizes them for your specific needs.

Search → Enrich
Search for entities, then get full details:
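A schematic config for this pattern, written as a heredoc so it can be pasted into a shell. Every key except `parallel` and `on_error` (which appear in Key Constraints) is an assumption about the dataset schema, so treat this as the shape of the idea rather than the real format:

```bash
# Sketch of a Search → Enrich dataset config; field names are illustrative guesses.
cat > search_enrich.yaml <<'EOF'
sources:
  - id: search_ctos              # step 1: search for matching people
    endpoint: <search-endpoint>
    params:
      keywords: "CTO fintech"
  - id: profiles                 # step 2: enrich every hit with a full profile
    endpoint: <profile-endpoint>
    from: search_ctos            # chain on the previous source's output
    parallel: 3
    on_error: skip
EOF
```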
Multi-Search → Union → Enrich
Multiple searches combined, deduplicated, then enriched.
Company → Employees → Profiles
Deep company intelligence chain.
From-File Batch
Process a user-provided list of identifiers.
Collect + LLM Analysis
Collect data, then analyze with LLM in the same pipeline.
Incremental Daily Pipeline
Scheduled collection that only gets new data:
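One way such a run could be scheduled, assuming the same hypothetical runner; `--incremental`, `metadata.json`, and `reset-cursor` come from Key Constraints below:

```bash
# Daily 06:00 cron entry: fetch only new records and append them to the database.
# ("anysite dataset run" and the connection name are assumed.)
# 0 6 * * * anysite dataset run daily_posts.yaml --incremental --load-db my_postgres

# If the incremental cursor in metadata.json needs clearing, reset it and re-run
# (the exact placement of the reset-cursor command is an assumption):
anysite dataset reset-cursor daily_posts.yaml
anysite dataset run daily_posts.yaml --incremental
```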
Static Profiles → Fresh Activity
Profiles are collected once. Posts and comments are re-fetched every run, with only new records loaded into the database.

Key Constraints
- API parameters `location`, `current_companies`, `industry` accept ONE name (string) or MULTIPLE URNs (JSON array). A list of names `["Microsoft", "Google"]` does NOT work — use one name or multiple URNs (see the sketch after this list).
- Always `anysite describe <endpoint>` to verify exact param names and types.
- Company employees endpoint: use `internal_id.value` or `urn.value` to chain to user profiles, NOT `alias` or `url`.
- Nested JSON in Parquet is traversed with dot-notation: `urn.value`, `experience[0].company_urn`.
- `parallel: 3-5`, `on_error: skip` for batch sources
- `--incremental` for re-runs, `--no-llm` to skip expensive LLM steps
- Parquet snapshots at `raw/<source_id>/YYYY-MM-DD.parquet`
- `metadata.json` tracks incremental state — use `reset-cursor` to clear
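A sketch of how the single-name vs URN-array rule plays out; the parameter-passing syntax and endpoint placeholder are assumptions, only the rule itself comes from the list above:

```bash
# OK: one plain company name as a string (assumed --param syntax, for illustration).
anysite api <search-endpoint> --param 'current_companies="Microsoft"'

# OK: several companies, passed as URNs in a JSON array.
anysite api <search-endpoint> --param 'current_companies=["urn:...:1", "urn:...:2"]'

# NOT OK: a JSON array of plain names is rejected by the API.
# anysite api <search-endpoint> --param 'current_companies=["Microsoft", "Google"]'
```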