/webparser/sitemap

curl --request POST \
  --url https://api.anysite.io/api/webparser/sitemap \
  --header 'Content-Type: application/json' \
  --header 'access-token: <api-key>' \
  --data '
{
  "url": "<string>",
  "timeout": 300,
  "include_patterns": [
    "^/blog/",
    "^/docs/"
  ],
  "exclude_patterns": [
    "\\.pdf$",
    "/admin/"
  ],
  "same_host_only": true,
  "respect_robots": true,
  "count": 2,
  "return_details": false
}
'

[
  {
    "urls": [
      "<string>"
    ],
    "total_found": 123,
    "sitemap_locations": [
      "<string>"
    ],
    "@type": "@sitemap_result",
    "entries": [
      {
        "loc": "<string>",
        "@type": "@sitemap_entry",
        "lastmod": "<string>",
        "changefreq": "<string>",
        "priority": 0.5
      }
    ]
  }
]

Authorizations

access-token

string

header

required

Headers

access-token

string

required

Body

application/json

url

string<uri>

required

Website URL to fetch sitemap from

Required string length: 1 - 2083

Examples:

"https://www.example.com"

"https://blog.example.com"

timeout

integer

default:300

Maximum scraping execution timeout (in seconds)

Required range: 20 <= x <= 1500

include_patterns

string[] | null

Regex patterns for URL paths to include (all URLs are included if not specified)

Example:

["^/blog/", "^/docs/"]

exclude_patterns

string[] | null

Regex patterns for URL paths to exclude

Example:

["\\.pdf$", "/admin/"]
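The following is a minimal local sketch of how include and exclude patterns on URL paths could combine, useful for previewing which URLs a given pair of pattern lists would keep. It is not the service's implementation: the helper name `filter_urls` is hypothetical, and the use of `re.search` against the path component only is an assumption based on the descriptions above.

```python
import re
from urllib.parse import urlsplit


def filter_urls(urls, include_patterns=None, exclude_patterns=None):
    """Keep URLs whose path matches any include pattern and no exclude pattern.

    Hypothetical helper: the matching semantics (re.search on the path)
    are an assumption, not confirmed API behavior.
    """
    includes = [re.compile(p) for p in include_patterns or []]
    excludes = [re.compile(p) for p in exclude_patterns or []]
    kept = []
    for url in urls:
        path = urlsplit(url).path or "/"
        # With no include patterns, every URL is a candidate.
        if includes and not any(p.search(path) for p in includes):
            continue
        # Any exclude match drops the URL.
        if any(p.search(path) for p in excludes):
            continue
        kept.append(url)
    return kept
```

Note that the JSON string `"\\.pdf$"` corresponds to the regex `\.pdf$`; in Python source it can be written as the raw string `r"\.pdf$"`.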

same_host_only

boolean

default:true

Only include URLs from the same host as the base URL
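As an illustration of what a same-host restriction means, the sketch below compares hostnames locally. The helper name `same_host_urls` is hypothetical, and exact hostname equality (so that `blog.example.com` differs from `www.example.com`) is an assumption; the API may treat subdomains differently.

```python
from urllib.parse import urlsplit


def same_host_urls(base_url, urls):
    """Return only the URLs whose hostname equals the base URL's hostname.

    Assumption: hosts are compared exactly, so subdomains count as
    different hosts.
    """
    base_host = urlsplit(base_url).hostname
    return [u for u in urls if urlsplit(u).hostname == base_host]
```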

respect_robots

boolean

default:true

Check robots.txt and respect disallowed URLs
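To show what respecting disallowed URLs looks like in practice, here is a small sketch using Python's standard `urllib.robotparser`. It only illustrates robots.txt semantics; the helper name `allowed_urls` and the `"*"` user agent are assumptions, and the service may evaluate rules under a different user agent.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text allows for user_agent.

    Hypothetical helper for local experimentation; robots_txt is the raw
    file content, already downloaded by the caller.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```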

count

integer | null

Maximum number of URLs to return

Required range: x >= 1

return_details

boolean

default:false

Return detailed sitemap entries (lastmod, changefreq, priority) instead of just URLs
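The body parameters above can be assembled into a request as in the following sketch, which uses only the Python standard library. The function name `build_sitemap_request` is hypothetical; the client-side range checks mirror the limits documented above and are a convenience, not a description of how the server validates input.

```python
import json
import urllib.request

API_URL = "https://api.anysite.io/api/webparser/sitemap"


def build_sitemap_request(api_key, url, timeout=300, count=None, **options):
    """Build (but do not send) a POST request for /webparser/sitemap.

    Extra keyword arguments (include_patterns, exclude_patterns,
    same_host_only, respect_robots, return_details) are passed through
    to the JSON body unchanged.
    """
    if not 1 <= len(url) <= 2083:
        raise ValueError("url length must be between 1 and 2083")
    if not 20 <= timeout <= 1500:
        raise ValueError("timeout must be between 20 and 1500 seconds")
    if count is not None and count < 1:
        raise ValueError("count must be at least 1")

    payload = {"url": url, "timeout": timeout, **options}
    if count is not None:
        payload["count"] = count

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "access-token": api_key},
        method="POST",
    )
```

The returned request can then be sent with `urllib.request.urlopen(request)` and the response body decoded with `json.load`.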

Response

Successful Response

urls

string[]

required

List of URLs found in sitemap(s)

total_found

integer

required

Total number of URLs found

sitemap_locations

string[]

required

Sitemap URLs that were processed

@type

string

default:@sitemap_result

entries

SitemapEntry · object[] | null

Detailed sitemap entries (returned only when return_details is true)
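A response shaped like the sample above can be read into typed objects as sketched here. The class and function names are hypothetical, and treating `lastmod`, `changefreq`, and `priority` as optional is an assumption, since the child attributes of an entry are not spelled out on this page.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SitemapEntry:
    loc: str
    # Assumed optional; the page does not list which entry fields are required.
    lastmod: Optional[str] = None
    changefreq: Optional[str] = None
    priority: Optional[float] = None


@dataclass
class SitemapResult:
    urls: List[str]
    total_found: int
    sitemap_locations: List[str]
    entries: Optional[List[SitemapEntry]] = None


def parse_sitemap_response(body):
    """Parse the JSON array returned by the endpoint into SitemapResult objects."""
    results = []
    for item in json.loads(body):
        raw_entries = item.get("entries")
        entries = None
        if raw_entries is not None:
            entries = [
                SitemapEntry(
                    loc=e["loc"],
                    lastmod=e.get("lastmod"),
                    changefreq=e.get("changefreq"),
                    priority=e.get("priority"),
                )
                for e in raw_entries
            ]
        results.append(
            SitemapResult(
                urls=item["urls"],
                total_found=item["total_found"],
                sitemap_locations=item["sitemap_locations"],
                entries=entries,
            )
        )
    return results
```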
