Incremental Collection

By default, the CLI collects all inputs every time you run anysite dataset collect. With incremental mode, it tracks what has already been collected and skips those inputs on subsequent runs.

Enable Incremental Mode

# First run: collects everything
anysite dataset collect dataset.yaml

# Subsequent runs: only collects new inputs
anysite dataset collect dataset.yaml --incremental
The CLI stores cursor data in a metadata.json file alongside your dataset, tracking which inputs have been processed.
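As a minimal sketch of how this plays out, consider a file-driven dataset assembled from the fields used elsewhere on this page (the dataset name and input file are illustrative):
name: company-post-tracking
sources:
  - id: posts
    endpoint: /api/linkedin/company/posts
    from_file: competitors.txt    # append new companies here over time
    input_key: company

storage:
  format: parquet
  path: ./data/
The first run collects posts for every company in competitors.txt. Later runs with --incremental fetch only companies added since the previous run, based on the cursors recorded in metadata.json.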

Per-Source Refresh Control

Control incremental behavior for each source:
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI startup", count: 100 }
    refresh: always    # Always re-collect (time-sensitive data)

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies
    refresh: auto      # Respect --incremental flag (default)
Value     Behavior
auto      Respects the --incremental flag (default)
always    Always re-collects, even with --incremental
Use refresh: always for search results, trending content, or any other data that changes frequently. Use refresh: auto for data that changes rarely, such as user profiles.

Reset Cursors

To start fresh and re-collect everything:
anysite dataset reset-cursor dataset.yaml

Scheduling

Automate data collection with cron expressions in your pipeline configuration:
name: daily-monitoring
description: Daily competitor monitoring pipeline

sources:
  - id: competitor_posts
    endpoint: /api/linkedin/company/posts
    from_file: competitors.txt
    input_key: company
    parallel: 3

schedule:
  cron: "0 9 * * *"    # Every day at 9:00 AM

storage:
  format: parquet
  path: ./data/

Start Scheduled Collection

anysite dataset schedule dataset.yaml --incremental --load-db pg
This starts the scheduler, which runs the pipeline according to the cron expression. Each run is logged with a unique run ID.

Common Cron Expressions

Expression      Schedule
0 9 * * *       Daily at 9:00 AM
0 */6 * * *     Every 6 hours
0 9 * * 1       Every Monday at 9:00 AM
0 9 1 * *       First day of each month at 9:00 AM
*/30 * * * *    Every 30 minutes
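Cron strings use the standard five fields, read left to right as minute, hour, day of month, month, and day of week. For example, a schedule block that runs every 6 hours:
schedule:
  cron: "0 */6 * * *"    # minute hour day-of-month month day-of-week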

Webhook Notifications

Get notified when pipeline runs complete or fail:
notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx/yyy/zzz"
  on_failure:
    - url: "https://alerts.example.com/anysite-failure"
Notifications include:
  • Pipeline name and run ID
  • Collection status (success/failure)
  • Number of records collected per source
  • Execution duration and error details (on failure)
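As a rough illustration of that information, a notification body might look something like the following (field names and values are assumptions for readability, not the tool's actual schema):
pipeline: daily-monitoring       # pipeline name
run_id: 42                       # unique run ID
status: success                  # or failure
records_collected:               # records per source (illustrative counts)
  competitor_posts: 340
duration_seconds: 95             # execution duration
# on failure, error details are included as well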

Run History and Logs

View Run History

anysite dataset history my-dataset
Shows a list of all runs with status, timestamp, records collected, and duration.

View Run Logs

anysite dataset logs my-dataset --run 42
Shows detailed logs for a specific run, including per-source progress and any errors.

Auto-Load to Database

Combine scheduling with database loading for a fully automated pipeline:
anysite dataset schedule dataset.yaml \
  --incremental \
  --load-db pg
This automatically loads collected data into the specified database connection after each run. See Database Operations for database configuration.

Complete Automated Pipeline Example

name: lead-monitoring
description: Automated lead discovery and enrichment

sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params:
      keywords: "Head of Engineering"
      count: 100
    refresh: always

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: search, field: urn.value }
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip
    refresh: auto

  - id: enriched
    type: llm
    dependency: { from_source: profiles, field: name }
    llm:
      - type: classify
        categories: "high_priority,medium,low"
        output_column: lead_score

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 9 * * 1"    # Every Monday at 9 AM

notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx"
  on_failure:
    - url: "https://alerts.example.com/fail"

Next Steps