## Overview

Anysite CLI integrates with LLM providers to add AI-powered analysis to your data workflows. Six operations are available: classify, summarize, enrich, generate, match, and deduplicate.

Requires the `llm` extra:

```bash
pip install "anysite-cli[llm]"
```
## Setup

Configure your LLM provider using the interactive setup, which guides you through selecting a provider and entering your API key.
## Supported Providers

| Provider | Default Model | Configuration |
|---|---|---|
| OpenAI | `gpt-4.1-mini` | Uses JSON Schema for structured output |
| Anthropic | `claude-sonnet-4-5-20250514` | Uses system prompts with JSON schema |

Provider settings are stored in `~/.anysite/config.yaml`.
## Operations

### Classify

Categorize records into predefined categories:

```bash
anysite llm classify dataset.yaml --source profiles \
  --categories "developer,recruiter,executive,other" \
  --fields "name,headline,summary"
```
If `--categories` is omitted, the LLM auto-detects 3-7 appropriate categories based on the data.
### Summarize

Generate concise summaries:

```bash
anysite llm summarize dataset.yaml --source profiles \
  --fields "name,headline,summary,experience" \
  --max-length 50 \
  --output-column bio_summary
```
### Enrich

Extract new structured attributes from text data:

```bash
anysite llm enrich dataset.yaml --source profiles \
  --add "seniority:junior/mid/senior/lead" \
  --add "is_technical:boolean" \
  --add "years_experience:number" \
  --add "primary_skill:string"
```
Supported attribute types:

- **Enum** (predefined choices): `"seniority:junior/mid/senior"`
- **Boolean** (true/false): `"is_technical:boolean"`
- **Number** (numeric value): `"years_experience:number"`
- **String** (free text): `"primary_skill:string"`
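The `--add` specs above follow a simple `name:type` grammar, where a slash-separated type is treated as an enum. As a rough sketch (the helper and its output shape are illustrative, not Anysite internals), parsing such a spec might look like:

```python
def parse_attr_spec(spec: str) -> dict:
    """Parse an illustrative "name:type" attribute spec into a descriptor.

    A slash-separated type like "junior/mid/senior" is read as an enum;
    "boolean", "number", and "string" stay as primitive type names.
    """
    name, _, type_part = spec.partition(":")
    if "/" in type_part:
        return {"name": name, "type": "enum", "choices": type_part.split("/")}
    return {"name": name, "type": type_part}

print(parse_attr_spec("seniority:junior/mid/senior"))
# {'name': 'seniority', 'type': 'enum', 'choices': ['junior', 'mid', 'senior']}
print(parse_attr_spec("is_technical:boolean"))
# {'name': 'is_technical', 'type': 'boolean'}
```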
### Generate

Create new text using templates with field placeholders:

```bash
anysite llm generate dataset.yaml --source profiles \
  --prompt "Write a 2-sentence professional intro for {name} who works as {headline}" \
  --temperature 0.7 \
  --output-column intro_text
```
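Conceptually, each record's field values are substituted into the `{field}` placeholders before the prompt is sent to the LLM. A minimal sketch of that substitution step (not the actual implementation; the sample record is invented):

```python
def fill_template(template: str, record: dict) -> str:
    # Substitute {field} placeholders with values from the record.
    return template.format_map(record)

record = {"name": "Ada Lovelace", "headline": "Analytical Engine Programmer"}
prompt = fill_template(
    "Write a 2-sentence professional intro for {name} who works as {headline}",
    record,
)
print(prompt)
# Write a 2-sentence professional intro for Ada Lovelace who works as Analytical Engine Programmer
```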
### Match

Compare records across two sources and find best matches:

```bash
anysite llm match dataset.yaml \
  --source-a profiles \
  --source-b companies \
  --top-k 3
```
Returns the top K matches for each record in source A, with relevance scores.
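In other words, every candidate in source B gets a relevance score for a given record in A, and only the K best are kept. A toy sketch of that selection step, with made-up precomputed scores (the scoring itself is done by the LLM):

```python
def top_k_matches(scores: dict[str, float], k: int) -> list[tuple[str, float]]:
    # Keep the k highest-scoring candidates, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical relevance scores for one record in source A.
scores_for_record = {"Acme Corp": 0.91, "Globex": 0.40, "Initech": 0.77, "Umbrella": 0.12}
print(top_k_matches(scores_for_record, k=3))
# [('Acme Corp', 0.91), ('Initech', 0.77), ('Globex', 0.4)]
```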
### Deduplicate

Find and flag semantic duplicates within a source:

```bash
anysite llm deduplicate dataset.yaml --source profiles \
  --key name \
  --threshold 0.8
```
Records with similarity above the threshold are flagged as potential duplicates.
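The thresholding logic can be sketched with a stand-in similarity function, here `difflib`'s character-level ratio; the real comparison is semantic and done by the LLM, and the sample names are invented:

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_duplicates(values: list[str], threshold: float) -> list[tuple[str, str]]:
    # Flag every pair whose similarity exceeds the threshold.
    pairs = []
    for a, b in combinations(values, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold:
            pairs.append((a, b))
    return pairs

names = ["Jon Smith", "John Smith", "Alice Chen"]
print(flag_duplicates(names, threshold=0.8))
# [('Jon Smith', 'John Smith')]
```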
## Using LLM in Dataset Pipelines

Add LLM processing directly in your pipeline YAML:

```yaml
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user

  - id: profiles_enriched
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "developer,recruiter,executive,sales,other"
        output_column: role_type
      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead"
          - "is_technical:boolean"
      - type: summarize
        max_length: 50
        output_column: bio_summary
```
Multiple LLM steps can be chained within a single LLM source. They execute in order, each adding new columns to the dataset.
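That execution model behaves like a fold over the record set: each step reads the current rows and appends its output column, so later steps can see earlier results. A minimal sketch, with step functions standing in for real LLM calls:

```python
def run_llm_steps(records: list[dict], steps: list) -> list[dict]:
    # Each step maps a record to {column_name: value}; steps run in
    # order, so later steps can read columns added by earlier ones.
    for step in steps:
        for record in records:
            record.update(step(record))
    return records

records = [{"name": "Ada", "headline": "Engineer"}]
steps = [
    lambda r: {"role_type": "developer"},                          # stand-in classify step
    lambda r: {"bio_summary": f"{r['name']}: {r['role_type']}"},   # reads the prior column
]
print(run_llm_steps(records, steps))
# [{'name': 'Ada', 'headline': 'Engineer', 'role_type': 'developer', 'bio_summary': 'Ada: developer'}]
```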
## Caching

LLM results are cached in a local SQLite database (`~/.anysite/llm_cache.db`) to avoid repeated API calls and reduce costs.

```bash
# View cache statistics
anysite llm cache-stats

# Clear the cache
anysite llm cache-clear

# Bypass cache for a single run
anysite llm classify dataset.yaml --source profiles \
  --categories "dev,recruiter,exec" --no-cache
```
Caching is especially useful when iterating on pipeline configurations — you only pay for LLM calls once per unique input.
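A cache like this only works if identical inputs always map to the same key. A common scheme, shown here as an illustration of the idea rather than Anysite's actual key format, hashes the operation, its parameters, and the input text together:

```python
import hashlib
import json

def cache_key(operation: str, params: dict, input_text: str) -> str:
    # Serialize with sorted keys so equal params always hash identically.
    payload = json.dumps(
        {"op": operation, "params": params, "input": input_text},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("classify", {"categories": ["dev", "recruiter"]}, "Jane, CTO")
k2 = cache_key("classify", {"categories": ["dev", "recruiter"]}, "Jane, CTO")
print(k1 == k2)  # identical inputs produce identical keys, so this is a cache hit
```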
## Options Reference

| Option | Description | Applies To |
|---|---|---|
| `--fields` | Fields to include in LLM context (comma-separated) | classify, summarize |
| `--categories` | Comma-separated categories | classify |
| `--add` | Attribute to extract (repeatable) | enrich |
| `--prompt` | Template with `{field}` placeholders | generate |
| `--temperature` | LLM creativity (0.0-1.0) | generate |
| `--max-length` | Max words for output | summarize |
| `--output-column` | Name for the result column | all |
| `--top-k` | Number of matches per record | match |
| `--key` | Field to compare for duplicates | deduplicate |
| `--threshold` | Similarity threshold (0.0-1.0) | deduplicate |
| `--no-cache` | Skip the LLM cache | all |
## Next Steps