Overview

Anysite CLI integrates with LLM providers to add AI-powered analysis to your data workflows. Six operations are available: classify, summarize, enrich, generate, match, and deduplicate.
Requires the llm extra: pip install "anysite-cli[llm]"

Setup

Configure your LLM provider:
anysite llm setup
This guides you through selecting a provider and entering your API key.

Supported Providers

| Provider  | Default Model              | Configuration                          |
|-----------|----------------------------|----------------------------------------|
| OpenAI    | gpt-4.1-mini               | Uses JSON Schema for structured output |
| Anthropic | claude-sonnet-4-5-20250514 | Uses system prompts with JSON schema   |
Provider settings are stored in ~/.anysite/config.yaml.

Operations

Classify

Categorize records into predefined categories:
anysite llm classify dataset.yaml --source profiles \
  --categories "developer,recruiter,executive,other" \
  --fields "name,headline,summary"
If --categories is omitted, the LLM auto-detects 3-7 appropriate categories based on the data.

Summarize

Generate concise summaries:
anysite llm summarize dataset.yaml --source profiles \
  --fields "name,headline,summary,experience" \
  --max-length 50 \
  --output-column bio_summary

Enrich

Extract new structured attributes from text data:
anysite llm enrich dataset.yaml --source profiles \
  --add "seniority:junior/mid/senior/lead" \
  --add "is_technical:boolean" \
  --add "years_experience:number" \
  --add "primary_skill:string"
Supported attribute types:
  • Enum — predefined choices: "seniority:junior/mid/senior"
  • Boolean — true/false: "is_technical:boolean"
  • Number — numeric value: "years_experience:number"
  • String — free text: "primary_skill:string"
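To make the type syntax concrete, here is an illustrative Python sketch of how an attribute spec string could be translated into a JSON Schema property for a structured-output LLM call. The `spec_to_schema` helper is hypothetical, not Anysite CLI's actual internals:

```python
def spec_to_schema(spec: str) -> tuple[str, dict]:
    """Parse "name:type" where type is boolean, number, string,
    or a slash-separated enum like "junior/mid/senior"."""
    name, _, kind = spec.partition(":")
    if kind == "boolean":
        prop = {"type": "boolean"}
    elif kind == "number":
        prop = {"type": "number"}
    elif kind == "string":
        prop = {"type": "string"}
    else:  # slash-separated values become an enum
        prop = {"type": "string", "enum": kind.split("/")}
    return name, prop

specs = ["seniority:junior/mid/senior/lead", "is_technical:boolean"]
schema = {
    "type": "object",
    "properties": dict(spec_to_schema(s) for s in specs),
}
```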

Generate

Create new text using templates with field placeholders:
anysite llm generate dataset.yaml --source profiles \
  --prompt "Write a 2-sentence professional intro for {name} who works as {headline}" \
  --temperature 0.7 \
  --output-column intro_text
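Conceptually, each `{field}` placeholder is filled from the corresponding column of the record before the prompt is sent to the LLM. A minimal Python sketch of that substitution (the record values here are made up):

```python
# Fill {field} placeholders in a prompt template from one record.
record = {"name": "Ada Lovelace", "headline": "Software Engineer"}
template = ("Write a 2-sentence professional intro for {name} "
            "who works as {headline}")
prompt = template.format_map(record)
```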

Match

Compare records across two sources and find best matches:
anysite llm match dataset.yaml \
  --source-a profiles \
  --source-b companies \
  --top-k 3
Returns the top K matches for each record in source A, with relevance scores.
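The top-K selection itself is a standard operation; as a sketch, assuming the LLM has already produced a relevance score for each candidate in source B (scores below are invented):

```python
import heapq

# For one record in source A: keep the K highest-scoring matches
# from source B, highest first.
scores = {"acme": 0.91, "globex": 0.42, "initech": 0.77, "umbrella": 0.15}
top_k = heapq.nlargest(3, scores.items(), key=lambda kv: kv[1])
```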

Deduplicate

Find and flag semantic duplicates within a source:
anysite llm deduplicate dataset.yaml --source profiles \
  --key name \
  --threshold 0.8
Records with similarity above the threshold are flagged as potential duplicates.
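To illustrate the thresholding logic (not the CLI's implementation), the sketch below flags pairs whose similarity exceeds 0.8, using a cheap string ratio from `difflib` as a stand-in for the LLM's semantic similarity score:

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Jon Smith", "John Smith", "Maria Garcia"]
threshold = 0.8

# Compare every pair of values for the --key field; flag pairs
# scoring above the threshold as potential duplicates.
pairs = [
    (a, b) for a, b in combinations(names, 2)
    if SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold
]
```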

Using LLM in Dataset Pipelines

Add LLM processing directly in your pipeline YAML:
sources:
  - id: profiles
    endpoint: /api/linkedin/user
    from_file: users.txt
    input_key: user

  - id: profiles_enriched
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "developer,recruiter,executive,sales,other"
        output_column: role_type

      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead"
          - "is_technical:boolean"

      - type: summarize
        max_length: 50
        output_column: bio_summary
Multiple LLM steps can be chained within a single LLM source. They execute in order, each adding new columns to the dataset.
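The chaining behavior can be sketched in a few lines of Python: each step walks the rows and writes one new column, so a later step can read columns produced earlier. The step functions here are stand-ins for real LLM calls:

```python
rows = [{"name": "Ada", "headline": "Staff Engineer"}]

def classify(row):   # stand-in for an LLM classify call
    return "developer" if "Engineer" in row["headline"] else "other"

def summarize(row):  # stand-in; can read the earlier role_type column
    return f'{row["name"]} ({row["role_type"]})'

steps = [("role_type", classify), ("bio_summary", summarize)]
for column, fn in steps:         # steps run in order
    for row in rows:
        row[column] = fn(row)
```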

Caching

LLM results are cached in a local SQLite database (~/.anysite/llm_cache.db) to avoid repeated API calls and reduce costs.
# View cache statistics
anysite llm cache-stats

# Clear the cache
anysite llm cache-clear

# Bypass cache for a single run
anysite llm classify dataset.yaml --source profiles \
  --categories "dev,recruiter,exec" --no-cache
Caching is especially useful when iterating on pipeline configurations — you only pay for LLM calls once per unique input.
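The idea behind such a cache is simple: key each response by a hash of the request, and only call the provider on a miss. The sketch below assumes an in-memory SQLite table; the real cache's schema in ~/.anysite/llm_cache.db is not documented here, so this layout is an assumption:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, response TEXT)")

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def get_or_call(model, prompt, call):
    key = cache_key(model, prompt)
    row = db.execute(
        "SELECT response FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row:                      # cache hit: no API call, no cost
        return row[0]
    response = call(prompt)      # cache miss: pay for one LLM call
    db.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    return response

calls = []
fake_llm = lambda p: calls.append(p) or "summary"
get_or_call("gpt-4.1-mini", "Summarize Ada", fake_llm)
get_or_call("gpt-4.1-mini", "Summarize Ada", fake_llm)  # served from cache
```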

Options Reference

| Option            | Description                                      | Applies To          |
|-------------------|--------------------------------------------------|---------------------|
| `--fields`        | Fields to include in LLM context (comma-separated) | classify, summarize |
| `--categories`    | Comma-separated categories                       | classify            |
| `--add`           | Attribute to extract (repeatable)                | enrich              |
| `--prompt`        | Template with `{field}` placeholders             | generate            |
| `--temperature`   | LLM creativity (0.0-1.0)                         | generate            |
| `--max-length`    | Max words for output                             | summarize           |
| `--output-column` | Name for the result column                       | all                 |
| `--top-k`         | Number of matches per record                     | match               |
| `--key`           | Field to compare for duplicates                  | deduplicate         |
| `--threshold`     | Similarity threshold (0.0-1.0)                   | deduplicate         |
| `--no-cache`      | Skip the LLM cache                               | all                 |

Next Steps