Robots.txt is a plain text file placed in the root directory of a website (e.g., https://www.example.com/robots.txt) that communicates crawling instructions to search engine bots. It operates on the Robots Exclusion Protocol — a voluntary standard that well-behaved crawlers, including Googlebot and Bingbot, follow by default.

How Robots.txt Works

When a compliant search engine bot visits your site, it requests the robots.txt file before crawling any pages. The file contains one or more groups of directives, each specifying which URL paths a given user-agent (bot) is allowed or disallowed from accessing. A group that begins with User-agent: * applies to every bot that has no more specific group of its own.

Example robots.txt structure:

User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
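
To see how a compliant crawler might interpret the example above, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot name ExampleBot and the tested URLs are hypothetical. One caveat: this parser applies the first matching rule in file order, whereas Googlebot applies the most specific (longest) matching rule, so the two can disagree when rules overlap.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, split into lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)  # parsed in memory; no network request is made


def is_allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may crawl `url` under the rules above."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/blog/post",
        "https://www.example.com/admin/users",
        "https://www.example.com/staging/home",
    ):
        print(url, "->", "allowed" if is_allowed(url) else "blocked")
    # site_maps() requires Python 3.8 or newer.
    print("Sitemaps:", parser.site_maps())
```

Because the Disallow lines come before Allow: / in the file, the first-match behavior of this parser and Google's longest-match behavior agree for this particular example.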

What Robots.txt Controls — and What It Doesn't

A common misconception is that robots.txt prevents pages from appearing in search results. It does not. Robots.txt controls crawling, not indexing. If other sites link to a disallowed page, Google can still index that URL (it just won't crawl its content). To prevent a page from appearing in search results, use a noindex meta tag (or an X-Robots-Tag HTTP header) instead, and leave the page crawlable so the directive can be read.

CAN control: which URLs compliant crawlers such as Googlebot crawl
CANNOT control: whether a page appears in search results (use noindex for that)
CANNOT control: malicious bots — only well-behaved bots respect the file
CANNOT control: cached versions of pages already indexed

Common Robots.txt Use Cases

In a technical SEO context, robots.txt is used strategically to conserve crawl budget and prevent low-value pages from wasting Googlebot's time:

Block /admin/, /login/, /cart/, /checkout/ paths from crawling
Block staging or development subdomains; robots.txt applies per host, so each subdomain needs its own file (access is often restricted at the DNS or server level instead)
Block parameterized URLs that create duplicate content (e.g., /search?q=)
Block internal search result pages and filter combinations
Reference sitemap location for faster discovery
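
As an illustration of the use cases above, the following sketch renders a robots.txt body from a list of paths. The function name build_robots_txt and its parameters are hypothetical, and the paths are the ones mentioned in the list; adapt them to your own site structure.

```python
def build_robots_txt(disallow_paths, sitemap_url=None, user_agent="*"):
    """Render a robots.txt body containing one group of Disallow rules."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallow_paths:
        # Rule paths are matched against the URL path, so they start with "/".
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")
        lines.append(f"Disallow: {path}")
    if sitemap_url:
        lines.append("")  # blank line before the sitemap reference, for readability
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        ["/admin/", "/login/", "/cart/", "/checkout/", "/search?q="],
        sitemap_url="https://www.example.com/sitemap.xml",
    ))
```

Generating the file from a reviewed list, rather than editing it by hand, makes it easier to keep the rules under version control and to spot an accidental broad rule such as Disallow: / before it ships.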

Critical Robots.txt Mistakes to Avoid

Misconfigured robots.txt files are one of the most common causes of SEO disasters. Blocking the wrong paths can prevent Google from crawling your entire site or specific key sections:

Never disallow Googlebot from crawling CSS/JS files — this prevents rendering and can tank rankings
Avoid using robots.txt as your only defense against duplicate content (use canonical tags too)
Don't block pages that link to important content — crawl paths matter
Regularly audit robots.txt after site migrations or restructures
Use the robots.txt report in Google Search Console (the successor to the retired robots.txt Tester) to validate your configuration
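
Part of such an audit can be scripted. The sketch below, again using Python's urllib.robotparser, checks a list of URLs that should always stay crawlable against a set of rules. The rules and URLs are hypothetical and mimic a migration that accidentally blocked a static asset directory; this parser's first-match logic is only an approximation of how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_lines, urls, agent="Googlebot"):
    """Return the subset of `urls` that `agent` may not crawl."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [u for u in urls if not parser.can_fetch(agent, u)]


# Hypothetical post-migration file that accidentally blocks static assets.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /assets/",
]

# Hypothetical URLs that should always remain crawlable.
MUST_CRAWL = [
    "https://www.example.com/",
    "https://www.example.com/assets/site.css",
    "https://www.example.com/assets/app.js",
    "https://www.example.com/products/",
]

if __name__ == "__main__":
    for url in find_blocked(ROBOTS_LINES, MUST_CRAWL):
        print("BLOCKED:", url)
```

Running a check like this after every deploy or migration can catch a blocked CSS/JS directory before it affects rendering.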

Robots.txt vs. Noindex

Technical SEOs often debate when to use a robots.txt Disallow rule versus the noindex meta tag. The general rule: use robots.txt for pages you don't want crawled (saving crawl budget), and use noindex for pages you want crawled but not indexed. Avoid combining the two on the same URL: if Google can't crawl a page, it can't read the noindex tag either, so the URL can remain in the index. For truly sensitive content, rely on authentication such as password protection rather than on either mechanism.
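
The interaction between the two mechanisms can be sketched as a small check that flags pages carrying a noindex tag that a crawler would never get to read. The function names, rules, and HTML snippets are hypothetical, only the robots meta tag is inspected (not the X-Robots-Tag header), and the standard-library parsers used here only approximate Google's behavior.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html: str) -> bool:
    """Return True if the page's robots meta tag includes noindex."""
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in meta.directives


def check_conflict(robots_lines, url, html, agent="Googlebot"):
    """Describe how the robots.txt rules and the noindex tag interact for one URL."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    crawlable = rules.can_fetch(agent, url)
    noindex = has_noindex(html)
    if noindex and not crawlable:
        return "conflict: noindex is present but the page is disallowed, so the tag cannot be read"
    if noindex:
        return "ok: crawlable and noindex, so the page can be dropped from the index"
    if not crawlable:
        return "blocked: not crawled, but the URL may still be indexed via links"
    return "ok: crawlable and indexable"


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(check_conflict(rules, "https://www.example.com/private/report", page))
```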

Sagara includes robots.txt auditing in every technical SEO engagement. A properly configured robots.txt file is one of the fastest wins in on-site SEO: it directs Googlebot's limited crawl budget toward your revenue-driving pages and away from low-value URLs.

