
Lead Enrichment Pipeline

Collect LinkedIn profiles from a search, enrich them with LLM analysis, and load results into PostgreSQL.

1. Create the pipeline

dataset.yaml
name: lead-enrichment
description: Find and qualify engineering leads

sources:
  - id: search
    endpoint: /api/linkedin/search/users
    params:
      keywords: "Head of Engineering"
      count: 200
    parallel: 1

  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: search
      field: urn.value
    input_key: user
    parallel: 5
    rate_limit: "10/s"
    on_error: skip

  - id: qualified
    type: llm
    dependency:
      from_source: profiles
      field: name
    llm:
      - type: classify
        categories: "high_priority,medium,low"
        output_column: lead_score
      - type: enrich
        add:
          - "seniority:junior/mid/senior/lead/executive"
          - "is_technical:boolean"
          - "team_size:small/medium/large"
      - type: summarize
        max_length: 30
        output_column: quick_bio

storage:
  format: parquet
  path: ./data/

2. Run the collection

anysite dataset collect dataset.yaml
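
The run writes each source as Parquet under ./data/. For a quick look outside the anysite CLI, a plain pandas read is enough; the ./data/qualified.parquet path below is an assumption about the output layout, so adjust it to whatever files you actually find on disk.

# Inspect the collected output with pandas (requires pyarrow or fastparquet).
# The file name under ./data/ is an assumption; list the directory to confirm.
import pandas as pd

qualified = pd.read_parquet("./data/qualified.parquet")

# Columns added by the LLM steps: lead_score, seniority, is_technical, team_size, quick_bio.
print(qualified[["name", "lead_score", "seniority", "quick_bio"]].head(10))
print(qualified["lead_score"].value_counts())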

3. Load into PostgreSQL

anysite dataset load-db dataset.yaml -c pg

4. Query the results

anysite db query pg --sql "
  SELECT name, headline, lead_score, seniority, quick_bio
  FROM qualified
  WHERE lead_score = 'high_priority' AND is_technical = true
  ORDER BY name
" --format table

Competitor Monitoring

Track competitor companies, their employees, and recent posts on a weekly schedule.

1. Create a competitors file

competitors.txt
anthropic
openai
google-deepmind
mistral-ai

2. Define the pipeline

competitor-monitor.yaml
name: competitor-monitor
description: Weekly competitor intelligence

sources:
  - id: companies
    endpoint: /api/linkedin/company
    from_file: competitors.txt
    input_key: company
    parallel: 2
    refresh: always

  - id: recent_posts
    endpoint: /api/linkedin/company/posts
    dependency:
      from_source: companies
      field: urn.value
    input_key: company
    parallel: 3
    rate_limit: "10/s"
    refresh: always

  - id: key_employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
    parallel: 3
    rate_limit: "10/s"
    refresh: auto

  - id: post_analysis
    type: llm
    dependency:
      from_source: recent_posts
      field: text
    llm:
      - type: classify
        categories: "product_launch,hiring,partnership,thought_leadership,other"
        output_column: post_type
      - type: summarize
        max_length: 30
        output_column: summary

storage:
  format: parquet
  path: ./data/

schedule:
  cron: "0 9 * * 1"

notifications:
  on_complete:
    - url: "https://hooks.slack.com/services/xxx"

3. Start the scheduled collection

anysite dataset schedule competitor-monitor.yaml --incremental --load-db pg

4. Analyze the data

# What are competitors posting about?
anysite dataset query competitor-monitor.yaml --sql "
  SELECT c.name as company, pa.post_type, COUNT(*) as count
  FROM post_analysis pa
  JOIN companies c ON pa.company_id = c.urn_value
  GROUP BY c.name, pa.post_type
  ORDER BY c.name, count DESC
" --format table

# Most recently collected key employees (a rough proxy for new hires this week)
anysite dataset query competitor-monitor.yaml --sql "
  SELECT name, headline, company_name
  FROM key_employees
  ORDER BY collected_at DESC
  LIMIT 20
" --format table

Multi-Platform Research

Collect data from LinkedIn, Twitter, and GitHub for a set of people, merge the results, and export a unified dataset.

1. Define the pipeline

research.yaml
name: multi-platform-research
description: Cross-platform person research

sources:
  - id: linkedin_profiles
    endpoint: /api/linkedin/user
    from_file: people.txt
    input_key: user
    parallel: 3
    rate_limit: "10/s"
    on_error: skip

  - id: twitter_profiles
    endpoint: /api/twitter/user
    from_file: twitter_handles.txt
    input_key: user
    parallel: 3
    rate_limit: "10/s"
    on_error: skip

  - id: github_profiles
    endpoint: /api/github/user
    from_file: github_users.txt
    input_key: user
    parallel: 3
    on_error: skip

  - id: all_profiles
    type: union
    sources: [linkedin_profiles, twitter_profiles, github_profiles]

storage:
  format: parquet
  path: ./data/
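
The three from_file inputs are plain text lists, presumably one identifier per line in the same style as competitors.txt in the previous example. The entries below are illustrative placeholders; substitute the LinkedIn usernames, Twitter handles, and GitHub logins you actually want to research.

people.txt
satyanadella
another-linkedin-username

twitter_handles.txt
a_twitter_handle
another_twitter_handle

github_users.txt
a-github-username
another-github-username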

2. Collect and query

# Collect all sources
anysite dataset collect research.yaml

# Query the unified dataset
anysite dataset query research.yaml --sql "
  SELECT * FROM all_profiles
" --format csv --output unified_research.csv

# Get statistics
anysite dataset stats research.yaml

Quick One-Liners

Common tasks that don’t need a full pipeline:
# Enrich a single profile and save to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl | \
  anysite db insert mydb --table profiles --stdin --auto-create

# Batch process a CSV of companies
anysite api /api/linkedin/company --from-file companies.csv --input-key company \
  --parallel 5 --rate-limit "10/s" --on-error skip \
  --format csv --output company_profiles.csv
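
The batch example above reads identifiers from companies.csv. A minimal input file might look like the sketch below, assuming --input-key company refers to a column named company; the slugs reuse the companies from the competitor example.

companies.csv
company
anthropic
openai
mistral-ai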

# Search and export in one command
anysite api /api/linkedin/search/users keywords="AI researcher" count=100 \
  --format csv --output ai_researchers.csv

# Quick database query
anysite db query pg --sql "SELECT name, headline FROM profiles WHERE headline LIKE '%CEO%'" \
  --format table