Incremental Collection

By default, the CLI collects all inputs every time you run anysite dataset collect. With incremental mode, it tracks what has already been collected and skips those inputs on subsequent runs.

Enable Incremental Mode

# First run: collects everything
anysite dataset collect dataset.yaml

# Subsequent runs: only collects new inputs
anysite dataset collect dataset.yaml --incremental
The CLI stores cursor data in a metadata.json file alongside your dataset, tracking which inputs have been processed.
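As a minimal sketch of how this plays out, consider a file-driven dataset assembled from the fields used elsewhere on this page (the dataset name and input file are illustrative):
name: company-post-tracking
sources:
  - id: posts
    endpoint: /api/linkedin/company/posts
    from_file: competitors.txt    # append new companies here over time
    input_key: company

storage:
  format: parquet
  path: ./data/
The first run collects posts for every company in competitors.txt. Later runs with --incremental fetch only companies added since the previous run, based on the cursors recorded in metadata.json.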

Per-Source Refresh Control

Control incremental behavior for each source:
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI startup", count: 100 }
    refresh: always    # Always re-collect (time-sensitive data)

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies
    refresh: auto      # Respect --incremental flag (default)
Value     Behavior
auto      Respects the --incremental flag (default)
always    Always re-collects, even with --incremental
Use refresh: always for search results, trending content, or any other data that changes frequently. Use refresh: auto for data that changes rarely, such as user profiles.

Reset Cursors

To start fresh and re-collect everything:
anysite dataset reset-cursor dataset.yaml

Scheduling

Automate data collection with cron expressions in your pipeline configuration:
name: daily-monitoring
description: Daily competitor monitoring pipeline

sources:
  - id: competitor_posts
    endpoint: /api/linkedin/company/posts
    from_file: competitors.txt
    input_key: company
    parallel: 3

schedule:
  cron: "0 9 * * *"    # Every day at 9:00 AM

storage:
  format: parquet
  path: ./data/

Start Scheduled Collection

anysite dataset schedule dataset.yaml --incremental --load-db pg
This starts the scheduler, which runs the pipeline according to the cron expression. Each run is logged with a unique run ID.

Common Cron Expressions

Expression      Schedule
0 9 * * *       Daily at 9:00 AM
0 */6 * * *     Every 6 hours
0 9 * * 1       Every Monday at 9:00 AM
0 9 1 * *       First day of each month at 9:00 AM
*/30 * * * *    Every 30 minutes
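Cron strings use the standard five fields, read left to right as minute, hour, day of month, month, and day of week. For example, a schedule block that runs every 6 hours:
schedule:
  cron: "0 */6 * * *"    # minute hour day-of-month month day-of-week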

Webhook Notifications

Get notified when pipeline runs complete or fail:
notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx/yyy/zzz"
  on_failure:
    - url: "https://alerts.example.com/anysite-failure"
Notifications include:
  • Pipeline name and run ID
  • Collection status (success/failure)
  • Number of records collected per source
  • Execution duration and error details (on failure)
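As a rough illustration of that information, a notification body might look something like the following (field names and values are assumptions for readability, not the tool's actual schema):
pipeline: daily-monitoring       # pipeline name
run_id: 42                       # unique run ID
status: success                  # or failure
records_collected:               # records per source (illustrative counts)
  competitor_posts: 340
duration_seconds: 95             # execution duration
# on failure, error details are included as well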

Run History and Logs

View Run History

anysite dataset history my-dataset
Shows a list of all runs with status, timestamp, records collected, and duration.

View Run Logs

anysite dataset logs my-dataset --run 42
Shows detailed logs for a specific run, including per-source progress and any errors.

Auto-Load to Database

Combine scheduling with database loading for a fully automated pipeline:
anysite dataset schedule dataset.yaml \
  --incremental \
  --load-db pg
This automatically loads collected data into the specified database connection after each run. See Database Operations for database configuration.

Complete Automated Pipeline Example

name: lead-monitoring
description: Automated lead discovery and enrichment

sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params:
      keywords: "Head of Engineering"
      count: 100
    refresh: always

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: search, field: urn.value }
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip
    refresh: auto

  - id: enriched
    type: llm
    dependency: { from_source: profiles, field: name }
    llm:
      - type: classify
        categories: "high_priority,medium,low"
        output_column: lead_score

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 9 * * 1"    # Every Monday at 9 AM

notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx"
  on_failure:
    - url: "https://alerts.example.com/fail"

Next Steps