Robots.txt is a plain text file placed in the root directory of a website (e.g., https://www.example.com/robots.txt) that communicates crawling instructions to search engine bots. It operates on the Robots Exclusion Protocol — a voluntary standard that well-behaved crawlers, including Googlebot and Bingbot, follow by default.
How Robots.txt Works
When a search engine bot visits your site, it checks for a robots.txt file before crawling any pages. The file contains one or more groups of rules; each group names the user-agents (bots) it applies to and lists the URL paths they are allowed or disallowed from accessing. A wildcard asterisk (*) as the user-agent applies the group to every bot that has no more specific group of its own.
Example robots.txt structure:
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
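As a rough illustration of how a compliant crawler might interpret the example above, the following Python sketch feeds the same rules to the standard library's urllib.robotparser. The URLs being checked and the helper name build_parser are assumptions made for this example. Python's parser uses simple prefix matching and does not support wildcard patterns, so it only approximates how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Illustrative URLs; any bot name falls under the "*" group here.
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/users"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))    # True

# site_maps() requires Python 3.8 or newer.
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```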
What Robots.txt Controls — and What It Doesn't
A common misconception is that robots.txt prevents pages from appearing in search results. It does not. Robots.txt controls crawling, not indexing. If other sites link to a disallowed page, Google can still index that URL (it just won't crawl its content). To prevent a page from appearing in search results, use the noindex meta tag instead, and leave the page crawlable so that Google can actually read the tag.
- CAN control: which URLs Googlebot crawls
- CANNOT control: whether a page appears in search results (use noindex for that)
- CANNOT control: malicious bots — only well-behaved bots respect the file
- CANNOT control: cached versions of pages already indexed
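Because noindex lives in the page itself rather than in robots.txt, it can be useful to check whether a page actually carries the tag. The sketch below is one minimal way to do that with Python's standard html.parser; the class and function names are hypothetical, and it only inspects meta tags, not the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in meta_parser.directives


# Illustrative page fragment, not taken from a real site.
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```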
Common Robots.txt Use Cases
In a technical SEO context, robots.txt is used strategically to conserve crawl budget and prevent low-value pages from wasting Googlebot's time:
- Block /admin/, /login/, /cart/, /checkout/ paths from crawling
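- Block staging or development subdomains, each of which needs its own robots.txt file because rules apply only to the host that serves them (often handled with access controls or at DNS level instead)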
- Block parameterized URLs that create duplicate content (e.g., /search?q=)
- Block internal search result pages and filter combinations
- Reference sitemap location for faster discovery
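The use cases above could be turned into a file with a small generator. The following Python sketch is only an illustration: render_robots_txt is a hypothetical helper, and the paths and sitemap URL are example values. It then re-parses the output with urllib.robotparser as a sanity check.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(disallow_paths, sitemap_url=None, user_agent="*"):
    """Build robots.txt text from a list of path prefixes (hypothetical helper)."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    if sitemap_url:
        # A blank line separates the rule group from the sitemap reference.
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


generated = render_robots_txt(
    ["/admin/", "/login/", "/cart/", "/checkout/"],
    sitemap_url="https://www.example.com/sitemap.xml",
)
print(generated)

# Sanity check: blocked paths are refused, everything else stays crawlable.
checker = RobotFileParser()
checker.parse(generated.splitlines())
print(checker.can_fetch("Googlebot", "https://www.example.com/cart/"))
print(checker.can_fetch("Googlebot", "https://www.example.com/products/shoes"))
```

Generating the file from a reviewed list of paths, rather than editing it by hand, makes accidental site-wide blocks easier to spot in code review.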
Critical Robots.txt Mistakes to Avoid
Misconfigured robots.txt files are one of the most common causes of SEO disasters. Blocking the wrong paths can prevent Google from crawling your entire site or specific key sections; a single "Disallow: /" under "User-agent: *" is enough to block everything:
- Never disallow Googlebot from crawling CSS/JS files — this prevents rendering and can tank rankings
- Avoid using robots.txt as your only defense against duplicate content (use canonical tags too)
- Don't block pages that link to important content — crawl paths matter
- Regularly audit robots.txt after site migrations or restructures
- Use the robots.txt report in Google Search Console (which replaced the retired robots.txt Tester) to validate your configuration
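An audit like the one recommended above can be partly automated. The sketch below, which assumes a hypothetical blocked_urls helper and made-up URLs, lists which must-crawl URLs a given robots.txt would block. Python's urllib.robotparser does not implement Google's wildcard matching, so treat the result as a first-pass check rather than a replacement for Search Console.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, must_crawl, user_agent="Googlebot"):
    """Return the URLs from must_crawl that the given rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_crawl if not parser.can_fetch(user_agent, url)]


# A deliberately misconfigured example that blocks rendering assets.
risky_rules = """\
User-agent: *
Disallow: /assets/
"""

# Example URLs that should always stay crawlable (assumed for illustration).
critical = [
    "https://www.example.com/assets/site.css",
    "https://www.example.com/assets/app.js",
    "https://www.example.com/products/",
]

for url in blocked_urls(risky_rules, critical):
    print("BLOCKED:", url)
```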
Robots.txt vs. Noindex
Technical SEOs often debate when to use robots.txt Disallow versus the noindex meta tag. The general rule: use robots.txt to block pages you don't want crawled (saving crawl budget), and use noindex for pages you want crawled but not indexed. Avoid combining the two on the same URL: if Google can't crawl a page, it can't read the noindex tag either. For truly sensitive content, rely on authentication rather than on either mechanism.
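The interaction between the two mechanisms can be summarized as a small decision table. The Python sketch below is an illustration only: describe_outcome and its return strings are hypothetical, and the page's noindex status is passed in as a flag rather than fetched.

```python
from urllib.robotparser import RobotFileParser


def describe_outcome(robots_txt, url, page_has_noindex, user_agent="Googlebot"):
    """Describe the likely result of combining Disallow rules and noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    if crawlable and page_has_noindex:
        return "crawled, kept out of the index"
    if crawlable:
        return "crawled, eligible for indexing"
    if page_has_noindex:
        # The crawler never fetches the page, so the tag has no effect.
        return "not crawled, noindex never seen"
    return "not crawled, URL may still be indexed from links"


# Example rules and URLs, assumed for illustration.
rules = "User-agent: *\nDisallow: /private/\n"
print(describe_outcome(rules, "https://www.example.com/thanks", True))
print(describe_outcome(rules, "https://www.example.com/private/report", True))
```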
Sagara includes robots.txt auditing in every technical SEO engagement. A properly configured robots.txt file is one of the fastest wins in on-site SEO: it directs Googlebot's limited crawl budget toward your revenue-driving pages and keeps crawlers away from low-value URLs. Keeping those URLs out of the index is a separate job for noindex.