## Incremental Collection

By default, the CLI collects all inputs every time you run `anysite dataset collect`. With incremental mode, it tracks what has already been collected and skips those inputs on subsequent runs.
### Enable Incremental Mode

```bash
# First run: collects everything
anysite dataset collect dataset.yaml

# Subsequent runs: only collects new inputs
anysite dataset collect dataset.yaml --incremental
```

The CLI stores cursor data in a `metadata.json` file alongside your dataset, tracking which inputs have been processed.
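If you are curious about the cursor state, the file is plain JSON and can be inspected directly. The path below assumes the dataset output lives under `./data/` as in the later examples; the exact location and layout of `metadata.json` may differ in your setup:

```python
# Sketch: print the cursor state to see which inputs were already collected.
# The file path is an assumption; adjust it to wherever your dataset is stored.
import json
from pathlib import Path

cursor_file = Path("./data/metadata.json")
print(json.dumps(json.loads(cursor_file.read_text()), indent=2))
```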
### Per-Source Refresh Control
Control incremental behavior for each source:
```yaml
sources:
  - id: companies
    endpoint: /api/linkedin/search/companies
    params: { keywords: "AI startup", count: 100 }
    refresh: always   # Always re-collect (time-sensitive data)

  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency: { from_source: companies, field: urn.value }
    input_key: companies
    refresh: auto     # Respect --incremental flag (default)
```
| Value | Behavior |
|---|---|
| `auto` | Respects the `--incremental` flag (default) |
| `always` | Always re-collects, even with `--incremental` |
Use `refresh: always` for search results, trending content, or any data that changes frequently. Use `refresh: auto` for relatively static data such as user profiles.
### Reset Cursors
To start fresh and re-collect everything:
```bash
anysite dataset reset-cursor dataset.yaml
```
## Scheduling
Automate data collection with cron expressions in your pipeline configuration:
```yaml
name: daily-monitoring
description: Daily competitor monitoring pipeline

sources:
  - id: competitor_posts
    endpoint: /api/linkedin/company/posts
    from_file: competitors.txt
    input_key: company
    parallel: 3

schedule:
  cron: "0 9 * * *"   # Every day at 9:00 AM

storage:
  format: parquet
  path: ./data/
```
### Start Scheduled Collection

```bash
anysite dataset schedule dataset.yaml --incremental --load-db pg
```
This starts the scheduler, which runs the pipeline according to the cron expression. Each run is logged with a unique run ID.
### Common Cron Expressions
| Expression | Schedule |
|---|---|
| `0 9 * * *` | Daily at 9:00 AM |
| `0 */6 * * *` | Every 6 hours |
| `0 9 * * 1` | Every Monday at 9:00 AM |
| `0 9 1 * *` | First day of each month at 9:00 AM |
| `*/30 * * * *` | Every 30 minutes |
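If you want to double-check when an expression will fire before scheduling it, one quick way (outside the CLI, using the third-party `croniter` package) is to preview the next run times:

```python
# Sketch: preview upcoming run times for a cron expression.
# Requires `pip install croniter`; the start date is arbitrary.
from datetime import datetime

from croniter import croniter

it = croniter("0 9 * * 1", datetime(2025, 1, 1))  # every Monday at 9:00 AM
for _ in range(3):
    print(it.get_next(datetime))  # first result: 2025-01-06 09:00:00
```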
## Webhook Notifications
Get notified when pipeline runs complete or fail:
```yaml
notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx/yyy/zzz"
  on_failure:
    - url: "https://alerts.example.com/anysite-failure"
```
Notifications include:
- Pipeline name and run ID
- Collection status (success/failure)
- Number of records collected per source
- Execution duration and error details (on failure)
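While wiring this up, it can help to point a notification URL at a throwaway local listener and inspect what your deployment actually sends. The sketch below is a generic HTTP receiver, not part of the CLI; the port and path are arbitrary:

```python
# Sketch: a local endpoint for inspecting notification payloads during testing.
# Point an on_complete / on_failure URL at http://localhost:8000/ and watch stdout.
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotificationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(body.decode("utf-8", errors="replace"))  # print whatever the CLI sends
        self.send_response(200)
        self.end_headers()

HTTPServer(("localhost", 8000), NotificationHandler).serve_forever()
```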
## Run History and Logs

### View Run History

```bash
anysite dataset history my-dataset
```
Shows a list of all runs with status, timestamp, records collected, and duration.
### View Run Logs

```bash
anysite dataset logs my-dataset --run 42
```
Shows detailed logs for a specific run, including per-source progress and any errors.
## Auto-Load to Database
Combine scheduling with database loading for a fully automated pipeline:
```bash
anysite dataset schedule dataset.yaml \
  --incremental \
  --load-db pg
```
This automatically loads collected data into the specified database connection after each run. See Database Operations for database configuration.
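After a run has loaded, you can sanity-check the data from any Postgres client. The sketch below is not part of the CLI; the connection string and the assumption that the loader writes one table per source id are both illustrative guesses:

```python
# Sketch: verify row counts after an auto-loaded run.
# The DSN and table name are assumptions for illustration, not documented behavior.
import psycopg2

conn = psycopg2.connect("postgresql://localhost:5432/anysite")
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM competitor_posts")  # table name assumed from the source id
    print("rows loaded:", cur.fetchone()[0])
conn.close()
```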
## Complete Automated Pipeline Example
```yaml
name: lead-monitoring
description: Automated lead discovery and enrichment

sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params:
      keywords: "Head of Engineering"
      count: 100
    refresh: always

  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: search, field: urn.value }
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip
    refresh: auto

  - id: enriched
    type: llm
    dependency: { from_source: profiles, field: name }
    llm:
      - type: classify
        categories: "high_priority,medium,low"
        output_column: lead_score

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 9 * * 1"   # Every Monday at 9 AM

notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx"
  on_failure:
    - url: "https://alerts.example.com/fail"
```
## Next Steps