Lead Enrichment Pipeline
Collect LinkedIn profiles from a search, enrich them with LLM analysis, and load results into PostgreSQL.
Create the pipeline
dataset.yaml
name: lead-enrichment
description: Find and qualify engineering leads
sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params:
      keywords: "Head of Engineering"
      count: 200
    parallel: 1
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: search
      field: urn.value
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip
  - id: qualified
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "high_priority,medium,low"
        output_column: lead_score
      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead/executive"
          - "is_technical:boolean"
          - "team_size:small/medium/large"
      - type: summarize
        max_length: 30
        output_column: quick_bio
storage:
  format: parquet
  path: ./data/
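The `dependency` entries chain the sources together: each `urn.value` found in the `search` results becomes the `user` input for `profiles`. The following is a minimal Python sketch of that lookup, assuming the dotted `field` is resolved as a path through nested keys. The helpers `get_path` and `build_inputs` and the sample records are hypothetical and are not part of anysite.

```python
from typing import Any, Iterable


def get_path(record: dict, dotted: str) -> Any:
    """Follow a dotted path such as 'urn.value' through nested dicts.

    Returns None when any key along the path is missing.
    """
    current: Any = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def build_inputs(records: Iterable[dict], field: str, input_key: str) -> list[dict]:
    """Turn parent records into request parameters for the dependent source."""
    inputs = []
    for record in records:
        value = get_path(record, field)
        if value is not None:  # records without the field produce no request
            inputs.append({input_key: value})
    return inputs


# Hypothetical search results, shaped only for illustration.
search_results = [
    {"name": "A", "urn": {"value": "urn-1"}},
    {"name": "B", "urn": {"value": "urn-2"}},
    {"name": "C"},  # no urn, so no profile request is built
]

profile_inputs = build_inputs(search_results, field="urn.value", input_key="user")
print(profile_inputs)
```

Under this reading, a search result with no `urn.value` simply produces no request for the `profiles` source.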
Competitor Monitoring
Track competitor companies, their employees, and recent posts on a weekly schedule.
Define the pipeline
competitor-monitor.yaml
name: competitor-monitor
description: Weekly competitor intelligence
sources:
  - id: companies
    endpoint: /api/linkedin/company
    from_file: competitors.txt
    input_key: company
    parallel: 2
    refresh: always
  - id: recent_posts
    endpoint: /api/linkedin/company/posts
    dependency:
      from_source: companies
      field: urn.value
    input_key: company
    parallel: 3
    rate_limit: "10/s"
    refresh: always
  - id: key_employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
    parallel: 3
    rate_limit: "10/s"
    refresh: auto
  - id: post_analysis
    type: llm
    dependency:
      from_source: recent_posts
      field: text
    llm:
      - type: classify
        categories: "product_launch,hiring,partnership,thought_leadership,other"
        output_column: post_type
      - type: summarize
        max_length: 30
        output_column: summary
storage:
  format: parquet
  path: ./data/
schedule:
  cron: "0 9 * * 1"
notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx"
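In standard five-field cron syntax (minute, hour, day of month, month, day of week), `0 9 * * 1` means 09:00 every Monday, which gives the weekly schedule. The Python sketch below computes the next matching time for that one expression. The function name `next_monday_9am` is hypothetical, and the time zone the scheduler uses is an assumption this example does not settle.

```python
from datetime import datetime, timedelta


def next_monday_9am(now: datetime) -> datetime:
    """Return the next time matching cron '0 9 * * 1' strictly after `now`."""
    candidate = now.replace(hour=9, minute=0, second=0, microsecond=0)
    # datetime.weekday() numbers Monday as 0.
    days_ahead = (0 - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        # Already at or past this Monday's 09:00, so move to next week.
        candidate += timedelta(days=7)
    return candidate


# Wednesday 2024-01-03 at noon: the next run is the following Monday.
print(next_monday_9am(datetime(2024, 1, 3, 12, 0)))  # 2024-01-08 09:00:00
```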
Start the scheduled collection
anysite dataset schedule competitor-monitor.yaml --incremental --load-db pg
Analyze the data
# What are competitors posting about?
anysite dataset query competitor-monitor.yaml --sql "
SELECT c.name as company, pa.post_type, COUNT(*) as count
FROM post_analysis pa
JOIN companies c ON pa.company_id = c.urn_value
GROUP BY c.name, pa.post_type
ORDER BY c.name, count DESC
" --format table
# Most recently collected employees (a rough proxy for new hires)
anysite dataset query competitor-monitor.yaml --sql "
SELECT name, headline, company_name
FROM key_employees
ORDER BY collected_at DESC
LIMIT 20
" --format table
Multi-Platform Research
Collect data from LinkedIn, Twitter, and GitHub for a set of people, merge the results, and export a unified dataset.
Define the pipeline
research.yaml
name: multi-platform-research
description: Cross-platform person research
sources:
  - id: linkedin_profiles
    endpoint: /api/linkedin/user
    from_file: people.txt
    input_key: user
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
  - id: twitter_profiles
    endpoint: /api/twitter/user
    from_file: twitter_handles.txt
    input_key: user
    parallel: 3
    rate_limit: "10/s"
    on_error: skip
  - id: github_profiles
    endpoint: /api/github/user
    from_file: github_users.txt
    input_key: user
    parallel: 3
    on_error: skip
  - id: all_profiles
    type: union
    sources: [linkedin_profiles, twitter_profiles, github_profiles]
storage:
  format: parquet
  path: ./data/
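The `all_profiles` source merges the three platform sources into one dataset. How anysite aligns columns that differ between platforms is not stated here, so the Python sketch below shows only one plausible reading: rows are concatenated, every column seen in any source is kept, missing values are filled with `None`, and a `source` column records the origin. The function `union_sources`, the `source` column, and the sample records are all hypothetical.

```python
def union_sources(sources: dict[str, list[dict]]) -> list[dict]:
    """Concatenate records from several sources into one uniform list of rows."""
    # Collect every column seen in any source, in first-seen order.
    columns: list[str] = []
    for records in sources.values():
        for record in records:
            for key in record:
                if key not in columns:
                    columns.append(key)

    merged = []
    for source_id, records in sources.items():
        for record in records:
            row = {"source": source_id}  # remember which source the row came from
            row.update({col: record.get(col) for col in columns})
            merged.append(row)
    return merged


# Placeholder records, shaped only for illustration.
sources = {
    "linkedin_profiles": [{"name": "Person A", "headline": "placeholder headline"}],
    "twitter_profiles": [{"name": "Person A", "followers": 0}],
    "github_profiles": [{"name": "Person A", "public_repos": 0}],
}

all_profiles = union_sources(sources)
for row in all_profiles:
    print(row)
```

Keeping a `source` column makes it possible to tell afterwards which platform each row came from, which matters when the same person appears in more than one input file.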
Quick One-Liners
Common tasks that don’t need a full pipeline:
# Enrich a single profile and save to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl | \
  anysite db insert mydb --table profiles --stdin --auto-create

# Batch process a CSV of companies
anysite api /api/linkedin/company --from-file companies.csv --input-key company \
  --parallel 5 --rate-limit "10/s" --on-error skip \
  --format csv --output company_profiles.csv

# Search and export in one command
anysite api /api/linkedin/search/users keywords="AI researcher" count=100 \
  --format csv --output ai_researchers.csv

# Quick database query
anysite db query pg --sql "SELECT name, headline FROM profiles WHERE headline LIKE '%CEO%'" \
  --format table
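The batch command above combines `--rate-limit "10/s"` with `--on-error skip`. As a rough illustration of what those two settings imply, the Python sketch below processes inputs one at a time, spaces request starts at least 1/10 s apart, and drops inputs whose request fails. It assumes that "10/s" means at most ten request starts per second and that `skip` discards the failed input. It does not model `--parallel`, and `run_batch` and `fake_fetch` are hypothetical names, not anysite functions.

```python
import time
from typing import Any, Callable, Iterable


def run_batch(
    inputs: Iterable[Any],
    fetch: Callable[[Any], Any],
    max_per_second: float = 10.0,
    on_error: str = "skip",
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list:
    """Call fetch(item) for each input, throttled and tolerant of failures."""
    min_interval = 1.0 / max_per_second
    results = []
    last_start = None
    for item in inputs:
        if last_start is not None:
            # Wait until at least min_interval has passed since the last start.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        try:
            results.append(fetch(item))
        except Exception:
            if on_error != "skip":
                raise  # any other mode re-raises the error and stops the batch
    return results


def fake_fetch(company: str) -> str:
    """Stand-in for a real API call; fails for one input on purpose."""
    if company == "bad":
        raise ValueError("lookup failed")
    return company.upper()


print(run_batch(["a", "bad", "b"], fake_fetch, max_per_second=1000))
```

The `sleep` and `clock` parameters are injectable so the throttling can be exercised in tests without real delays.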