What is an llms.txt file, and why will it soon be as important as robots.txt?
llms.txt is currently a proposal, not an official standard. For SMEs, the file is primarily useful as a supplementary guide when combined with robots.txt, Google-Extended, X-Robots-Tag, and technical monitoring.

For SMEs today, the llms.txt proposal is primarily one thing: a useful orientation tool, but not an official standard. If you want to understand what you can realistically control with an llms.txt file, the short answer is: not all AI usage of your content, but above all the clear communication of your wishes within a multi-layered strategy of robots.txt, provider-specific signals, headers, technology, and monitoring.

In projects with small businesses from South Tyrol and the entire DACH region, I'm currently seeing two typical mistakes. Either llms.txt is being sold as a miracle cure, or the topic is completely ignored, even though law firms, hotels, tradespeople, and consulting companies have long been publishing content that AI crawlers can evaluate. For SMEs, this is not about hype, but about making clear decisions: What should be open, what should remain blocked, and where do you need real technical security?

TL;DR: An llms.txt file is currently a voluntary community approach with clear proposal status. It can improve visibility, citation quality, and governance, but it does not replace a robots.txt for AI, an X-Robots-Tag, authentication, copyright law, or a GDPR review.

If you want to act pragmatically today, combine a lean llms.txt with a clean robots.txt, targeted headers, clear policy pages and robust monitoring.

Correctly classifying the llms.txt proposal

The llms.txt file is currently not a ratified IETF or W3C standard. The official project page describes the file as a community proposal and not as a formal web standard; see llmstxt.org. That is precisely why a sober assessment is important: An llms.txt is not a universal control file for all language models, but rather an additional, human- and machine-readable orientation aid.

In practice, an llms.txt file can still be useful. This is especially true if you want to curate your most important content, describe your brand more clearly, and increase the likelihood that systems will categorize your website more accurately. Crucially, this involves managing expectations: An llms.txt file can help formulate wishes and priorities, but it doesn't guarantee that these expectations will be met.

An llms.txt file is, today, more orientation than control.

I would therefore never consider an llms.txt file in isolation. At Berger+Team, we always use such files, when they make sense, within the context of the entire web architecture: information structure, curated content, technical access rules, and machine-readable signals must all fit together. Anyone wanting to delve deeper into such setups will find the next logical step in our article on ADF and curated AI architecture.

llms.txt, robots.txt, Google-Extended and X-Robots-Tag: What does what?

The greatest confusion arises when different control levels are mixed. For SMEs, clear demarcation is more important than any discussion about new file names.

1. llms.txt

The llms.txt file primarily describes how a system should understand your website and which content you want to highlight. It can also describe usage preferences. However, these preferences are not automatically enforceable. You should therefore understand directives such as Train, Generate, Cache, or Attribution as examples, not as a universally implemented standard.

2. robots.txt

The robots.txt file is the classic file for crawl rules per user agent. The important clarification is that the Robots Exclusion Protocol has been documented as RFC 9309 on the IETF Standards Track since 2022. At the same time, the RFC clarifies that robots rules are not access control, but rules that crawlers should respect; see RFC 9309. This is relevant for legitimate bots. It is not a hard ban for bad actors.

3. Google-Extended

Google-Extended is not a replacement for Googlebot, nor is it a general search toggle. Google documents Google-Extended as a separate product token in the robots.txt file. According to Google, this allows you to control how your crawled content is used for Gemini models and certain grounding scenarios, separate from Googlebot for search. According to Google, Google-Extended does not affect inclusion in Google Search or rankings; see Google for Developers.
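To make the separation between search crawling and AI usage concrete, here is a minimal sketch that checks a sample robots.txt with Python's standard urllib.robotparser. The rules, paths, and the example.com domain are assumptions chosen for illustration, not a recommendation for any specific site. Note that the standard-library parser only approximates how individual crawlers interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: classic search stays open, AI usage is restricted.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /downloads/

User-agent: *
Disallow: /login/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each tuple: (user agent token, URL to check). All URLs are placeholders.
CHECKS = [
    ("Googlebot", "https://example.com/services/"),
    ("Google-Extended", "https://example.com/services/"),
    ("GPTBot", "https://example.com/downloads/whitepaper.pdf"),
    ("GPTBot", "https://example.com/blog/"),
    ("SomeOtherBot", "https://example.com/login/"),
]

for agent, url in CHECKS:
    verdict = "allowed" if parser.can_fetch(agent, url) else "disallowed"
    print(f"{agent:16} {url} -> {verdict}")
```

A check like this is useful before going live, because a single misplaced rule in the Googlebot group would affect classic search, while the Google-Extended group only concerns the AI-related usage described above.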

4. X-Robots-Tag and meta tags

For files like PDFs, images, or other non-HTML resources, the X-Robots-Tag is often the more practical level. If you want to control downloads, datasheets, or protected media, this is usually more relevant than an llms.txt. At the same time, you need to stay precise: signals like noai or noimageai are not uniformly supported by all providers. In Google's official specification for robots meta tags and X-Robots-Tag, these values do not appear as officially supported directives; see Google Search Central.
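As a rough illustration of file-level control, the following sketch decides which X-Robots-Tag header a server could attach to a response based on the file extension. The function name, the extension list, and the choice of noindex are assumptions for this example. noai is included only as an optional, non-standard hint, in line with the caveat above.

```python
from pathlib import PurePosixPath

# Extensions treated as non-HTML resources here; an assumption for this sketch.
PROTECTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp"}


def x_robots_headers(url_path: str, add_noai_hint: bool = False) -> dict[str, str]:
    """Return the extra response headers for a given URL path.

    HTML pages and unknown file types get no extra header.
    """
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix not in PROTECTED_EXTENSIONS:
        return {}
    directives = ["noindex"]  # documented by Google for X-Robots-Tag
    if add_noai_hint:
        directives.append("noai")  # supplementary signal, not uniformly supported
    return {"X-Robots-Tag": ", ".join(directives)}


print(x_robots_headers("/downloads/datasheet.pdf", add_noai_hint=True))
```

In practice the same decision is usually made in the web server or CDN configuration. The function only shows the logic, not a particular server setup.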

  • llms.txt: Orientation, context and curated entry file
  • robots.txt: Crawl rules per user-agent
  • Google-Extended: Google-specific product token for AI use, separate from search
  • X-Robots-Tag: fine-grained control of individual resources such as PDFs or images
  • Authentication, WAF and rate limits: technical protective layer with real enforcement

What actually works today

If an SME asks me what works today not just in theory, but in practice, my answer is clear: The effective strategy is always multi-layered. This is precisely what many companies overlook when they focus solely on a new file.

  • Control known AI crawlers in the robots.txt file: This means differentiating by user-agent and excluding sensitive areas.
  • Treat Google-Extended separately: so you can control Google AI usage without accidentally breaking your classic search.
  • Secure the file level: Mark PDFs, images and download areas with the X-Robots-Tag and, where appropriate, additionally with noai or noimageai as a supplementary signal.
  • Truly protect sensitive areas: Customer portal, application area, extranet and internal search belong behind a login, not just behind a text file.
  • Document rights and wishes: through an understandable AI policy, terms of use and a clear contact for license inquiries.
  • Set up monitoring: Log files, reverse DNS checks, unusual request patterns, rate limits, and WAF rules.
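The reverse DNS check from the monitoring step can be sketched as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname against an expected domain suffix, then resolve the hostname back and confirm that it points to the same IP. The function name and the suffix list are assumptions for illustration. The resolvers are injectable so the logic can be tested without network access.

```python
import socket
from typing import Callable, Iterable


def default_reverse(ip: str) -> str:
    """Reverse lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(host: str) -> list[str]:
    """Forward lookup: hostname -> IPv4 addresses (IPv6 is not covered here)."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = default_reverse,
    forward: Callable[[str], list[str]] = default_forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a single client IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    # A leading dot in each suffix prevents matches like "evilgooglebot.com".
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False


# Example suffixes for Google crawlers; check the provider's current
# documentation before relying on any such list.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")
```

A request that claims to be a known crawler in its user agent but fails this check is a candidate for rate limiting or a WAF rule.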

Internally and in client projects, we use not just an llms.txt file, but a complete governance set, depending on the requirements. This can include a clean robots.txt file, curated discovery files, structured brand signals, and, in some cases, additional files such as /ai.json or /robots-ai.txt. The goal is never technical decoration, but a controllable system with less friction and clear rules.

Especially for visibility, it's also worth taking a look at structured data on the website, because an llms.txt file is of little use if your content is open but remains difficult for machines to categorize.

A compact minimal example for SMEs

Many small businesses don't need a complex file at the beginning, but rather a significantly reduced version. The purpose of this file is not to cover every scenario legally, but to provide the most important public context.

Minimal structure of an llms.txt file:

1. A brief description of the company: Who you are, what you offer, and who you work for.

2. A small selection of key public URLs: Homepage, Services, Contact, FAQ and Knowledge Base.

3. A clear indication that customer areas, logins, internal areas and sensitive downloads are not intended for free use.

4. A link to an AI policy or usage page plus contact information for questions and license requests.

That's often all you need to get started. You can add the frequently mentioned directives like Train, Generate, Cache, or Attribution, but please do so with the right expectations in mind. These are not universally guaranteed switches today. For SMEs, a concise, clear file is usually better than a long pseudo-rulebook that no provider ultimately interprets uniformly.
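The four-part minimal structure can be sketched as a small generator. The layout (a title line, a short summary in a blockquote, then sections with link lists) follows the Markdown format described on llmstxt.org. The company name, URLs, section titles, and helper names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    note: str = ""


def format_link(page: Page) -> str:
    """Render one list entry as '- [Title](url): note'."""
    suffix = f": {page.note}" if page.note else ""
    return f"- [{page.title}]({page.url}){suffix}"


def build_llms_txt(
    name: str, summary: str, pages: list[Page], restrictions: str, policy: list[Page]
) -> str:
    """Assemble a lean llms.txt: description, key URLs, restrictions, policy."""
    lines = [f"# {name}", "", f"> {summary}", "", restrictions, ""]
    lines += ["## Key pages", ""] + [format_link(p) for p in pages] + [""]
    lines += ["## Policy and contact", ""] + [format_link(p) for p in policy]
    return "\n".join(lines) + "\n"


# Hypothetical example values for a small business site.
llms_txt = build_llms_txt(
    name="Example Crafts GmbH",
    summary="Family-run carpentry business serving private and hotel clients.",
    pages=[
        Page("Services", "https://example.com/services/", "What we offer"),
        Page("FAQ", "https://example.com/faq/"),
        Page("Contact", "https://example.com/contact/"),
    ],
    restrictions=(
        "Customer areas, logins, internal areas and sensitive downloads "
        "are not intended for free use."
    ),
    policy=[
        Page("AI policy", "https://example.com/ai-policy/", "Usage and license requests"),
    ],
)
print(llms_txt)
```

The restriction sentence in the file is a stated wish, not a block. The areas it names still need the technical protection described elsewhere in this article.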

Three realistic setups for small business websites

Corporate site of a craft business or hotel

Public service pages, testimonials, contact pages, and FAQs are often unproblematic. An llms.txt file can clearly highlight these core pages. However, in the robots.txt file, I would block the backend, booking account, checkout, staging area, and internal search. Hotels often also have an external booking process that needs to be considered separately from a technical and contractual perspective.

  • Open: Services, team, location, opening hours and frequently asked questions
  • Blocked: Login, booking account, admin, test environment and internal price lists
  • In addition: Monitoring for aggressive bot access to images and PDFs

Consulting firm or law firm with blog and downloads

This is a typical case for a curated llms.txt file. Public technical articles should be visible if they are current and concise. I would treat white papers, client information, internal templates, or download packages much more restrictively. In my experience, this mixed area is the most critical for SMEs because useful expert content and sensitive information are often too closely intertwined.

  • Open: Blog, press, basic FAQs, and law firm or consultancy profile
  • Targeted restrictions: PDFs, checklists, premium downloads and internal knowledge documents
  • Important: Clearly state copyright and licensing information; use noai only as a supplementary signal.

Protected customer area or extranet

The matter is simple here: No llms.txt file in the world can replace login, role permissions, or access control. When personal data, contracts, project documents, or support cases are involved, the first layer of protection must be technical. Anything else would be negligent.

  • Mandatory: Authentication, role permissions, and no public access to sensitive URLs
  • Sensible: robots.txt and headers as additional signals
  • Indispensable: Monitoring, incident process and regular audits

Law, copyright, GDPR and the real limits

The legal aspects are often oversimplified. Therefore, let's be perfectly clear: An llms.txt file does not replace a legal basis or technical safeguards. It can document usage preferences and support compliance processes. However, it does not automatically operationalize your GDPR obligations.

In copyright law, the communication of rights remains important in the European text and data mining (TDM) context. The W3C community documentation on the TDM Reservation Protocol refers to Article 4 of Directive 2019/790: TDM may be permitted for lawfully accessible content, provided that rights have not been reserved in an appropriate manner, such as in machine-readable form; see the W3C Community Group's TDM Reservation Protocol. This is precisely where opt-out communication becomes relevant. But the following also applies here: A legal reservation is not the same as a technical block.

For the GDPR, this means in practical terms that public content containing personal information, application forms, customer data, support tickets, or protected documents should never be "protected" solely by a policy file. You need data minimization, access control, clear responsibilities, and, if necessary, contractual agreements with third-party providers. The llms.txt file can help document your governance. Nothing more.

Bad actors ignore rules. Historical snapshots remain in circulation. Content can end up in data repositories via third parties. That's precisely why, in addition to signals, you always need technology, processes, and evidence.

When llms.txt is useful, when robots.txt is sufficient, and when more technology is needed

Many SMEs do not need a maximal solution, but good prioritization.

An llms.txt file is useful if…

  • you want to curate your most important public content,
  • you want to present your brand in a more machine-readable and consistent way,
  • you want to give AI systems clear entry pages, contact points and context,
  • you want to build an additional governance level.

A well-designed robots.txt file is often sufficient to start with, if…

  • you are primarily concerned with known bots and crawl access,
  • your website is small and has only a few sensitive areas,
  • you first need clean user-agent rules for AI crawlers and search engines.

Additional technology is mandatory if…

  • personal data is involved,
  • you operate a customer area, shop, extranet or API endpoints,
  • you offer exclusive content, downloads or protected media,
  • you must prove violations and limit attacks.

If you want to set up such topics properly from the start, you should think about web design & development and strategic advice closely together. Otherwise, the topic quickly becomes fragmented: a file here, a plugin there, but no robust system.

FAQ: The most important questions answered briefly

Does every SME immediately need an llms.txt file?

No. If your website is small and you don't yet have a clean robots.txt file, clearly defined sections, and monitoring, I would start there. An llms.txt file is only useful if you want to curate public content and make it easier for machines to understand and categorize it.

Does Google-Extended jeopardize my SEO?

According to Google's documentation, no. Google-Extended is separate from Googlebot for search and, according to Google, does not affect search inclusion or rankings. Nevertheless, you should thoroughly test the rules before going live.

Is noai sufficient in the X-Robots-Tag?

No. noai can be a useful additional signal, but it's not a universally supported standard. For sensitive content, you always need additional technology like login, file protection, rate limits, or WAF rules.

How do you check in projects whether AI crawlers adhere to rules?

We never just look at the user agent. Log files, request patterns, reverse DNS, IP address analysis, affected paths, and recurring anomalies are crucial. This kind of monitoring is precisely what separates sound governance from mere hope.
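As a starting point for the log analysis, the following sketch counts requests per crawler token and path in access logs. It assumes the common "combined" log format and an illustrative list of user agent tokens, both of which you would adapt to your own server and to the crawlers you care about. Because user agents can be spoofed, the result is only a first filter before the IP and DNS checks mentioned above.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative tokens of known AI crawlers; not a complete or authoritative list.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

# Combined log format: ip ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_crawler_hits(log_lines: Iterable[str]) -> Counter:
    """Count requests per (crawler token, path); unparsable lines are skipped."""
    hits: Counter = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        agent = match["agent"].lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                hits[(token, match["path"])] += 1
                break
    return hits


def report(log_path: str, top: int = 10) -> None:
    """Print the most frequently requested paths per crawler token."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for (token, path), count in count_ai_crawler_hits(handle).most_common(top):
            print(f"{count:6}  {token:14} {path}")
```

Paths that should be blocked according to robots.txt but still show up here are the cases worth examining more closely.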

Conclusion

The most important classification is simple: llms.txt is currently a proposal, not an official standard. For SMEs, the file can still be valuable if you use it as part of a system: for better classification, clearer communication and more citable content.

The real work, however, happens elsewhere: in a good robots.txt, in a differentiated approach to Google-Extended, in a clean X-Robots-Tag, in access control, rights, contracts and genuine monitoring. Those who understand this make better decisions and are less likely to be misled by new file names.

If you want to build your website so that it remains clear for people, understandable for machines, and manageable for your company, then it's also worth taking a look at our article on Entity SEO for SMEs. Long-term visibility arises where branding, information architecture, and AI governance fit together, not in a single text file.

Sources

  1. The /llms.txt file — llmstxt.org (2024)
  2. RFC 9309: Robots Exclusion Protocol — rfc-editor.org (2022)
  3. List of Google's common crawlers — developers.google.com (2026)
  4. Robots meta tag, data-nosnippet, and X-Robots-Tag specifications — developers.google.com (2025)
  5. TDM Reservation Protocol — w3c.github.io (2025)
Florian Berger