How we choose an answer
How the Crawl Hierarchy Finds the Best Answer First
Understand how we prioritize datasets, site pages, and PDFs when generating answers.
When someone asks a question, our system doesn’t search everything at once. It follows a clear hierarchy to return the most accurate and relevant answer as quickly as possible.
This article explains how that hierarchy works, so you know where answers come from and how to structure your content for the best results.
In This Article
- How the Crawl Hierarchy Works
- Step 1: Checking Your Datasets First
- Step 2: Searching Your Crawled Web Pages
- Step 3: Falling Back to Crawled PDFs
- Why This Order Matters
How the Crawl Hierarchy Works
When a question is asked, the system searches your content in a specific order:
1. Your datasets
2. Your crawled website pages (on-page content)
3. Your crawled PDFs
The system moves to the next level only if it cannot find a confident answer in the previous one.
Think of it as a priority ladder. The most structured and intentional content gets checked first. Broader or less structured content gets checked later.
This approach improves answer accuracy, reduces noise, and ensures the most trusted content is used whenever possible.
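The ladder can be sketched in a few lines of Python. This is a minimal illustration, not the product's actual implementation: the `Answer` type, the `find_answer` function, the toy searchers, and the 0.75 confidence threshold are all assumptions introduced here to show the fallback logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Answer:
    text: str
    source: str        # "dataset", "web_page", or "pdf"
    confidence: float  # hypothetical score between 0.0 and 1.0


# A searcher takes a question and returns its best candidate, or None.
Searcher = Callable[[str], Optional[Answer]]


def find_answer(
    question: str,
    ladder: Sequence[Tuple[str, Searcher]],
    threshold: float = 0.75,  # assumed cutoff for a "confident" answer
) -> Optional[Answer]:
    """Try each source in priority order and stop at the first confident hit."""
    for _name, search in ladder:
        candidate = search(question)
        if candidate is not None and candidate.confidence >= threshold:
            return candidate  # lower-priority sources are never consulted
    return None  # no source produced a confident answer


# Toy searchers standing in for the real dataset, page, and PDF indexes.
def search_datasets(question: str) -> Optional[Answer]:
    known = {"what are your opening hours?": "We are open 9am to 5pm, Monday to Friday."}
    text = known.get(question.lower())
    return Answer(text, "dataset", 0.95) if text else None


def search_pages(question: str) -> Optional[Answer]:
    if "shipping" in question.lower():
        return Answer("Orders ship within 2 business days.", "web_page", 0.8)
    return None


def search_pdfs(question: str) -> Optional[Answer]:
    if "warranty" in question.lower():
        return Answer("The warranty lasts 12 months.", "pdf", 0.8)
    return None


# Order of this list is the priority order.
LADDER = [("datasets", search_datasets), ("pages", search_pages), ("pdfs", search_pdfs)]
```

Calling `find_answer("What are your opening hours?", LADDER)` returns the dataset answer without running the page or PDF searchers. A question that no source can answer confidently returns `None`.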
Step 1: Checking Your Datasets First
Datasets are the highest-priority source.
When a question comes in, we first look at any structured datasets you’ve created or uploaded. If the answer exists there, we return that answer immediately.
Why datasets come first:
- They are structured and intentional.
- They typically contain curated, high-confidence information.
- They reduce ambiguity compared to raw website content.
If the answer exists in a dataset, the system does not continue searching other sources.
Important: If you want a specific answer to always take priority, add it to a dataset.
Step 2: Searching Your Crawled Web Pages
If the answer is not found in a dataset, we move to your crawled web pages.
At this stage, the system analyzes:
- On-page content from your crawled URLs
- Headings and structured page elements
- Relevant content sections within those pages
This includes any content publicly available on the sites you’ve added to your crawl settings.
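To see why headings matter at this stage, the sketch below shows one way a page could be split into heading-based sections. It uses Python's standard `html.parser` module. The `SectionSplitter` class and `split_into_sections` function are hypothetical names for illustration only; the crawler's real parsing logic is not documented here and may differ.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Group a page's visible text under the nearest preceding heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}  # not visible to readers

    def __init__(self):
        super().__init__()
        # First entry collects any text that appears before the first heading.
        self.sections = [{"heading": "", "text": ""}]
        self._in_heading = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._in_heading = True
            self.sections.append({"heading": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth or not data.strip():
            return
        key = "heading" if self._in_heading else "text"
        current = self.sections[-1]
        # Text fragments are joined with a single space.
        current[key] = (current[key] + " " + data.strip()).strip()


def split_into_sections(html):
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    # Drop sections that ended up completely empty.
    return [s for s in parser.sections if s["heading"] or s["text"]]
```

A page with clear headings yields small, well-labeled sections, while a page with no headings collapses into a single large block that is harder to match against a specific question.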
Website content is powerful because:
- It reflects your live, up-to-date information.
- It captures detailed explanations and context.
- It supports broader and more dynamic queries.
If the system finds a reliable answer within your crawled pages, it returns that answer and does not go on to check your PDFs.
Step 3: Falling Back to Crawled PDFs
If the answer cannot be found in datasets or website pages, the system performs a final check: your crawled PDFs.
PDFs are treated as a last resort because:
- They are often less structured.
- Content can be harder to parse cleanly.
- They may contain legacy or static information.
That said, PDFs can still be valuable for:
- Documentation
- Technical manuals
- Policy documents
- Archived resources
If the answer exists only within a PDF, the system will return it from there.
Why This Order Matters
The crawl hierarchy ensures:
- Higher accuracy from curated content
- More consistent answers
- Less reliance on unstructured documents
- Faster retrieval of high-confidence responses
It also gives you control.
By understanding the hierarchy, you can influence which content is most likely to be used when answering questions.
For example:
- Want a definitive answer used every time? Add it to a dataset.
- Want broader context available? Ensure your site content is well structured.
- Relying on PDFs? Consider migrating key information to web pages or datasets.
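If you keep your own record of which source answered each question, a short script can point to content worth migrating. The sketch below is hypothetical: the `answer_log` records, their format, and both function names are assumptions for illustration, not a documented export or API.

```python
from collections import Counter

# Hypothetical records: (question, source that answered it).
answer_log = [
    ("What are your opening hours?", "dataset"),
    ("How long does shipping take?", "web_page"),
    ("What does the warranty cover?", "pdf"),
    ("What does the warranty cover?", "pdf"),
    ("How do I reset the device?", "pdf"),
]


def summarize_sources(log):
    """Count how many answers each source supplied."""
    return Counter(source for _question, source in log)


def migration_candidates(log, source="pdf"):
    """Return the questions answered from `source`, most frequent first."""
    counts = Counter(question for question, s in log if s == source)
    return [question for question, _count in counts.most_common()]
```

Questions that are frequently answered from PDFs are good candidates to move into a web page or a dataset, where they will be found earlier in the hierarchy.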