
How the Crawl Hierarchy Finds the Best Answer First

Understand how we prioritize datasets, site pages, and PDFs when generating answers.

When someone asks a question, our system doesn’t search everything at once. It follows a clear hierarchy to return the most accurate and relevant answer as quickly as possible.

This article explains how that hierarchy works, so you know where answers come from and how to structure your content for the best results.



How the Crawl Hierarchy Works

When a question is asked, the system searches your content in a specific order:

  1. Your datasets
  2. Your crawled website pages (on-page content)
  3. Your crawled PDFs

It moves to the next level only if it cannot confidently find the answer at the current one.

Think of it as a priority ladder. The most structured and intentional content gets checked first. Broader or less structured content gets checked later.

This approach improves answer accuracy, reduces noise, and ensures the most trusted content is used whenever possible.
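
As a rough illustration of the ladder described above, the following Python sketch checks three tiers in order and stops at the first confident match. It is not the product's actual implementation. The function names, the word-overlap score, and the 0.5 threshold are all assumptions made for the example, since this article does not say how confidence is measured.

```python
from typing import List, Optional, Tuple

# Hypothetical cutoff; the real system's confidence measure is not documented here.
CONFIDENCE_THRESHOLD = 0.5


def overlap_score(question: str, text: str) -> float:
    """Toy relevance score: fraction of question words that also appear in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & t_words) / len(q_words)


def best_match(question: str, documents: List[str]) -> Tuple[Optional[str], float]:
    """Return (text, score) for the highest-scoring document, or (None, 0.0)."""
    best_text, best_score = None, 0.0
    for text in documents:
        score = overlap_score(question, text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


def answer(question: str, datasets: List[str], pages: List[str],
           pdfs: List[str]) -> Optional[Tuple[str, str]]:
    """Check each tier in priority order and stop at the first confident match."""
    tiers = [("dataset", datasets), ("web page", pages), ("pdf", pdfs)]
    for name, documents in tiers:
        text, score = best_match(question, documents)
        if text is not None and score >= CONFIDENCE_THRESHOLD:
            return name, text  # lower tiers are never searched
    return None  # no tier produced a confident answer


# Made-up sample content for illustration only.
datasets = ["refund window is 30 days"]
pages = ["our refund window is 14 days for sale items", "shipping takes 5 days"]
pdfs = ["warranty covers parts for two years"]
```

With this sample content, a question about the refund window is answered from the dataset even though a web page also mentions refunds, because the loop returns as soon as the first tier yields a confident match.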


Step 1: Checking Your Datasets First

Datasets are the highest-priority source.

When a question comes in, we first look at any structured datasets you’ve created or uploaded. If the answer exists there, we return that answer immediately.

Why datasets come first:

  • They are structured and intentional.
  • They typically contain curated, high-confidence information.
  • They reduce ambiguity compared to raw website content.

If the answer exists in a dataset, the system does not continue searching other sources.

Important: If you want a specific answer to always take priority, add it to a dataset.
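
To make the idea of a curated dataset entry concrete, here is a minimal Python sketch of a dataset stored as question and answer pairs with a simple normalized lookup. The entry format, the field names, and the sample answers are hypothetical and chosen only for illustration. The actual dataset format and matching logic may differ.

```python
import string
from typing import Optional


def normalize(question: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


# Hypothetical dataset: each entry pairs one question with one curated answer.
DATASET = [
    {"question": "What is your refund window?",
     "answer": "Refunds are accepted within 30 days of purchase."},
    {"question": "Do you ship internationally?",
     "answer": "Yes, to most countries."},
]

# Index the entries by normalized question so small wording differences still match.
INDEX = {normalize(entry["question"]): entry["answer"] for entry in DATASET}


def lookup(question: str) -> Optional[str]:
    """Return the curated answer for a question, or None if the dataset has no entry."""
    return INDEX.get(normalize(question))
```

A `None` result here corresponds to the case where the answer is not in a dataset and the search would continue to the next level.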


Step 2: Searching Your Crawled Web Pages

If the answer is not found in a dataset, we move to your crawled website pages.

At this stage, the system analyzes:

  • On-page content from your crawled URLs
  • Headings and structured page elements
  • Relevant content sections within those pages

This includes any content publicly available on the sites you’ve added to your crawl settings.

Website content is powerful because:

  • It reflects your live, up-to-date information.
  • It captures detailed explanations and context.
  • It supports broader and more dynamic queries.

If the system finds a reliable answer within your crawled pages, it returns that response.


Step 3: Falling Back to Crawled PDFs

If the answer cannot be found in datasets or website pages, the system performs a final check: your crawled PDFs.

PDFs are treated as a last resort because:

  • They are often less structured.
  • Content can be harder to parse cleanly.
  • They may contain legacy or static information.

That said, PDFs can still be valuable for:

  • Documentation
  • Technical manuals
  • Policy documents
  • Archived resources

If the answer exists only within a PDF, the system will return it from there.


Why This Order Matters

The crawl hierarchy ensures:

  • Higher accuracy from curated content
  • More consistent answers
  • Less reliance on unstructured documents
  • Faster retrieval of high-confidence responses

It also gives you control.

By understanding the hierarchy, you can influence which content is most likely to be used when answering questions.

For example:

  • Want a definitive answer used every time? Add it to a dataset.
  • Want broader context available? Ensure your site content is well structured.
  • Relying on PDFs? Consider migrating key information to web pages or datasets.