llms.txt: What you need to know
✍️ Re-published November 3, 2025 · 📝 Updated November 3, 2025 · 🕔 8 min read
🐲 Kurt Fischman, Founder @ Growth Marshal
What is llms.txt and why should decision-makers care?
Leaders want leverage. llms.txt gives it to you. The file is a simple, public, machine-readable guide that tells language models what your site is about, what to use, and where to fetch clean context in markdown. Think of it as a fact sheet for machines that are allergic to your bloated HTML. The proposal, introduced by Jeremy Howard and collaborators, proposes a single, standard endpoint at /llms.txt that links to LLM-ready resources and instructions.¹ ²
Executives should care because AI systems are already summarizing brands without asking. If you do not supply the facts in a format they can ingest, they will improvise. Improvisation is charming at a dinner party and expensive on an earnings call. llms.txt gives you a direct channel to models that increasingly shape customer perception. It does not block crawlers. It feeds them concise, high-signal guidance so they cite you correctly and often.¹ ²
How does llms.txt differ from robots.txt and other “control” files?
Publishers know robots.txt. That file governs crawling and indexing behavior for traditional bots. It says what not to fetch. llms.txt aims at a different goal. It says what to fetch, how to understand it, and where to find structured, markdown twins of key pages. It is a positive signal rather than a negative control.¹ ²
This distinction matters. The major AI companies already accept some form of robots.txt directives for their training or search bots, such as GPTBot for OpenAI and Google-Extended for Google’s training pipeline. Those are opt-out levers.³ ⁴ With llms.txt, you create an opt-in content map for LLMs that are actively trying to answer user questions. It complements robots.txt rather than replacing it. Use robots.txt to set boundaries. Use llms.txt to stage the set.¹ ² ⁴
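As an illustrative sketch of the opt-out side, a robots.txt might set boundaries like this; the tokens and paths below are examples, and you should mirror the bots you actually see and each vendor's current documentation.

```
# robots.txt — illustrative opt-out boundaries for AI bots
# GPTBot is OpenAI's crawler; these rules keep it out of one directory.
User-agent: GPTBot
Disallow: /private/

# Google-Extended is a control token, not a separate crawler; disallowing it
# signals that your content should not be used for Google's AI training.
User-agent: Google-Extended
Disallow: /

# Everyone else keeps normal access.
User-agent: *
Disallow:
```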
What exactly goes into an llms.txt file?
Site owners place a small markdown document at /llms.txt. The document starts with brief brand context in plain language, then points to canonical, LLM-ready markdown pages that carry the heavy load. The format favors short sections, labeled links, and predictable headings that parsers can consume with deterministic logic. If you want models to cite your pricing policy, you link to pricing.md. If you want them to respect licensing or attribution requirements, you say it and link to the policy. The proposal intentionally keeps the grammar simple so even thin clients and regex can parse it repeatably.¹ ²
The operative idea is to offload noisy HTML. Models waste tokens on navigation, tags, and scripts. They prefer compact text with clear headings and stable anchors. llms.txt acts like the lobby directory that sends them straight to the right floor.²
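As a sketch of the grammar the proposal describes: an H1 title, a blockquote summary, optional free-form notes, then H2 sections holding link lists, with an "Optional" section that agents may skip when context is tight. The placeholders below are illustrative, not prescribed names.

```markdown
# Project or brand name

> One-to-two sentence summary in plain language.

Free-form notes a model should keep in mind when answering about you.

## Docs

- [Link title](https://example.com/page.md): Short description of what it covers

## Optional

- [Background](https://example.com/history.md): Safe to skip when context is tight
```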
Where does llms.txt fit inside AI Search Optimization?
AI Search Optimization focuses on making a brand’s identity, facts, and content machine-readable so external models can reliably discover, ground, and cite it. llms.txt is a Trust and Authority Signal in that stack. It lives next to your machine-readable identity, your schema graph, and your fact files. It does not replace JSON-LD, knowledge graph mappings, or sitemaps. It stitches them into a coherent entry point that models can find in one hop. When an LLM encounters your domain, it can check /llms.txt, grab the brand definition, and hop to your clean sources without crawling the entire site.¹ ²
This small change compounds. Less crawl friction raises the odds that your content becomes the cited ground truth in answer surfaces. In a world of zero-click answers, citations are distribution. Distribution is survival.
Do LLM companies actually honor publisher signals?
Some honor them consistently. Some honor them only sometimes. Realpolitik still applies. Google introduced Google-Extended to let sites opt out of model training via robots.txt, which shows formal recognition of publisher control.⁴ OpenAI documents GPTBot and how to allow or disallow access.³ Anthropic now runs a search capability and documents multiple user agents for search and user-initiated fetches. Compliance has improved, but enforcement varies by company, product, and context.⁵ ⁶ ⁷
The field has also seen friction. News reports and community posts have tracked incidents where crawlers behaved aggressively or ignored preferences until blocked explicitly. That inconsistency is precisely why proactive, explicit machine guidance has value. If you cannot guarantee defensive compliance, increase the chance of offensive citation by giving models better inputs than your competitors.⁸
How does llms.txt improve grounding quality and reduce hallucinations?
Models hallucinate when they lack crisp, accessible facts. llms.txt reduces that likelihood by driving models to sources that are short, semantically dense, and scoped to common questions. A one-screen markdown that defines your company, your products, your data license, and your policy on derivative use is easier to ingest and cite than a ten-module SPA. LLMs also prefer consistent headings, stable anchors, and minimal noise. llms.txt points them to exactly that.²
The effect is not theoretical. Any retrieval system performs better when the corpus has high signal-to-noise and predictable structure. If you want a model to answer “What are your refund terms?” without rewriting your legal text into fan fiction, give it a short, versioned refunds.md, link it from llms.txt, and keep it current. Citations follow clarity.
What does a minimal viable llms.txt look like?
A pragmatic starter includes six elements. You open with a two-to-four sentence brand summary. You link to identity artifacts. You surface clean fact pages. You publish policy signals. You note contact and licensing. You stamp a last-updated date.
Brand definition with mission, primary products, and audience.¹ ²
Identity and graph links such as canonical IDs and About page.
Core docs in markdown: pricing, features, reviews policy, security, compliance, and FAQs.
Data and usage policy for AI systems including attribution and redistribution terms.¹ ²
Contact for licensing or corrections and a machine-friendly mailto.
Update cadence and checksum or version number for change control.
The file stays short. The linked markdown pages carry detail. This division keeps /llms.txt scannable and keeps your facts modular for updates.¹ ²
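A hypothetical file covering the six elements above, for a fictional brand; every name, URL, and policy line is a placeholder rather than prescribed vocabulary.

```markdown
# Acme Analytics

> Acme Analytics builds privacy-first product analytics for mid-market SaaS teams.
> We sell three products: Events, Funnels, and Warehouse Sync.

## Identity

- [About](https://acme.example/about.md): Company, leadership, and canonical IDs
- [Organization graph](https://acme.example/org.jsonld): JSON-LD entity data

## Core docs

- [Pricing](https://acme.example/pricing.md): Plans, billing, and refund terms
- [Security](https://acme.example/security.md): Certifications and data handling
- [FAQ](https://acme.example/faq.md): The ten most asked customer questions

## Policy

- [AI usage policy](https://acme.example/ai-policy.md): Attribution required; no redistribution of full text without a license

## Contact

- Licensing and corrections: mailto:press@acme.example

Last updated: 2025-11-03 (v1.4)
```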
How should we publish markdown “twins” for key pages?
Models can consume HTML, but they pay a tax. You lower that tax with markdown twins of your highest-intent pages. Each twin should mirror the canonical page’s topic, not its layout. Use simple headings that match natural queries such as “What is our warranty policy?” and “How do customers contact support?” Use stable fragment anchors so answers can deep link. Keep paragraphs compact and declarative. Store the twins in a predictable path and link them from /llms.txt.²
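A hypothetical twin for a support and warranty page, assuming question-shaped headings whose text doubles as the anchor; the details are invented for illustration.

```markdown
# Warranty and support

Last updated: 2025-11-03

## What is our warranty policy?

Hardware ships with a 24-month limited warranty covering manufacturing defects.
Warranty claims require proof of purchase and the device serial number.

## How do customers contact support?

Email support@acme.example or open a ticket at https://acme.example/support.
First response within one business day, Monday through Friday.
```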
This practice also helps your human workflow. Markdown is fast to version, review, and diff. Legal can redline a policy twin without touching a CMS template. Engineering can automate checksums and dates. Your llms.txt becomes a living table of contents for every answer you want echoed back to the market.
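A minimal sketch of that automation in Python, assuming twins live in a docs/ directory and that you link the generated manifest from /llms.txt; the paths and manifest format are assumptions, not part of the proposal.

```python
import hashlib
from datetime import date
from pathlib import Path

DOCS = Path("docs")              # assumed location of the markdown twins
MANIFEST = DOCS / "versions.md"  # manifest you can link from /llms.txt

def checksum(path: Path) -> str:
    """Short SHA-256 digest of a file's contents, for change control."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

def write_manifest() -> None:
    """Rebuild a dated manifest listing every twin and its checksum."""
    rows = [f"# Twin manifest ({date.today().isoformat()})", ""]
    for twin in sorted(DOCS.glob("*.md")):
        if twin == MANIFEST:
            continue  # skip the manifest itself
        rows.append(f"- {twin.name}: sha256 {checksum(twin)}")
    MANIFEST.write_text("\n".join(rows) + "\n", encoding="utf-8")

if __name__ == "__main__":
    write_manifest()
```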
How does llms.txt interact with JSON-LD and sitemaps?
You do not pick a favorite child. You orchestrate them. Sitemaps still declare discoverable URLs. JSON-LD still defines entities, relationships, and claims for structured understanding. llms.txt sits above both and tells LLMs which sources to read first, in compact form, with links back to the canonical web pages and graph nodes. This layered approach matches how retrieval works in practice. Crawlers discover. Indexers normalize. Answer engines ground. You help each layer with the asset it prefers.¹ ²
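For orientation, a bare-bones JSON-LD Organization node of the kind an llms.txt identity section can point back to; the identifiers below are placeholders.

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Acme Analytics",
  "url": "https://acme.example",
  "description": "Privacy-first product analytics for mid-market SaaS teams.",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q00000000",
    "https://www.linkedin.com/company/acme-analytics"
  ]
}
```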
The result is fewer mis-grounded answers and higher odds that your canonical text is what the model lifts and cites. In a contested category, that margin decides who becomes the default reference.
What are the legal or policy angles we should consider?
The compliance frontier is shifting. Robots directives for AI training are evolving, and companies continue to change how their bots behave.³ ⁴ ⁵ ⁶ ⁷ ⁸ Your job is to keep two lanes paved. First, keep up-to-date allow and disallow rules in robots.txt for each relevant user agent such as GPTBot and Google-Extended.³ ⁴ Second, publish clear data usage and attribution guidance in your llms.txt and link to your legal policy. If you license content, say so. If you require attribution, say so. If you prohibit derivative training without consent, say so. Models that aim to be good citizens need signals to follow.
You should also separate enforcement from enablement. robots.txt is your guardrail. llms.txt is your invitation. Use both.
What are the risks or limits of adopting llms.txt today?
There are three honest caveats. First, llms.txt is a proposal. It is not an IETF standard. Adoption is growing among practitioners and vendors, but universal compliance is not guaranteed.¹ ² ⁵ Second, bad actors can ignore your signals. That has always been true with robots.txt. Your defense is rate limiting, auth, legal posture, and a business model that values citation over hoarding. Third, maintenance matters. If your markdown twins drift from the canonical truth, you will propagate inconsistency. Keep one fact registry, one source of record, and automate effective dating so models read the latest version.¹ ²
These are operational challenges, not strategic blockers. The market rewards the brands that make themselves easy to ground and hard to misrepresent.
How do we measure whether llms.txt is working?
You measure visibility, grounding quality, and citation performance. Visibility means your brand appears as a referenced source in AI answers for your target queries. Grounding quality means the wording in answers matches your canonical text and policy. Citation performance means the answer links to your specified sources. Build an evaluation panel of high-intent prompts, run them across major assistants, and log whether the outputs reference your markdown twins, your website, or competitors. Track changes after you ship llms.txt and after each update. Pair this with server logs for AI user agents to confirm fetches of /llms.txt and linked twins.⁵ ⁶ ⁷
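A minimal sketch of the server-log check in Python, assuming a combined-format access log; the log path and the AI user-agent substrings are assumptions to adjust against vendor documentation and what you actually observe in your own logs.

```python
import re
from collections import Counter
from pathlib import Path

LOG = Path("/var/log/nginx/access.log")  # assumed log location and format

# Example substrings for AI-related fetchers; verify against vendor docs
# and your own logs before relying on this list.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

# Combined log format: "METHOD /path HTTP/x.y" ... "user agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"$')

def count_ai_fetches(paths=("/llms.txt", "/docs/")) -> Counter:
    """Count hits to llms.txt and twin paths, keyed by (agent, path)."""
    hits = Counter()
    for line in LOG.read_text(errors="ignore").splitlines():
        m = LINE.search(line)
        if not m:
            continue
        agent = next((a for a in AI_AGENTS if a in m["ua"]), None)
        if agent and any(m["path"].startswith(p) for p in paths):
            hits[(agent, m["path"])] += 1
    return hits

if __name__ == "__main__":
    for (agent, path), n in count_ai_fetches().most_common():
        print(f"{agent:15} {path:30} {n}")
```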
If you observe more correct quotes, more branded links, and fewer invented claims, the asset is doing its job. If not, inspect which chunk is missing from llms.txt and add it.
What is a practical rollout plan for a mid-market brand?
You can ship this in two sprints. Sprint one sets the backbone. Sprint two fills the top ten answer surfaces.
Sprint one: publish the spine.
Create /llms.txt with a tight brand definition, identity links, legal policy, and three markdown twins: About, Products, and Contact. Add robots.txt entries that reflect your current training and crawling posture for GPTBot, Google-Extended, and any others you recognize in logs.³ ⁴
Sprint two: cover the money questions.
Add markdown twins for Pricing, Security, Data Processing, Refunds, Reviews Policy, and FAQ. Create short, unambiguous sections with question-shaped headings and stable anchors. Link all of them from /llms.txt. Add a change log line to llms.txt with the date and a terse note such as “Added Security and DPA twins.” Run your evaluation panel monthly and expand the twins based on answer gaps.¹ ²
This plan is boring and fast. Boring and fast wins.
How does llms.txt play with emerging AI search and web-search features?
Vendors are accelerating web search features that route assistants to live sources and then cite them. Anthropic announced web search for Claude across API and product experiences. That means an llms.txt at your root becomes even more valuable because it is the shortest path from your domain to your best source material. When assistants fetch, you want to control what they see first, not leave it to crawl heuristics.⁵ ⁷
As more assistants expose source links and model cards emphasize provenance, publishers with crisp machine endpoints will get the lion’s share of citations. llms.txt is the on-ramp.
What should an executive sponsor do this quarter?
Set a clear mandate. Tell your team to make the brand easy to ground. Fund a small program with three outputs: a published /llms.txt, a set of markdown twins for your ten most asked questions, and a monthly citation report across top assistants. Add a legal review pass for policy language and licensing terms. Add an engineering pass for robots.txt hygiene. Set quarterly KPIs: increase cited presence for money queries, decrease hallucinated claims, and reduce time-to-correction for policy changes.³ ⁴ ⁵ ⁶ ⁷ ⁸
Your competitors will eventually do this. You just need to do it first and better.
What does “great” look like for an llms.txt implementation?
Great looks clean and specific. The file reads like a concise press kit for machines. Each link goes to a short markdown file with a single job to do. The anchors match real user questions. The identity section links to your canonical IDs and graph entries. The policy section states how AI systems may use your content and how to request a license. The change note shows you actually maintain the file. The robots.txt aligns with the policy. The evaluation report shows rising citations with text that mirrors your language.¹ ² ³ ⁴
This is not theater. This is infrastructure for the distribution layer you cannot buy and cannot ignore.
Sources
1. Jeremy Howard, “/llms.txt—a proposal to provide information to help language models,” Answer.AI, 2024.
2. “The /llms.txt file,” llms-txt project site, 2024.
3. “GPTBot – OpenAI’s Web Crawler,” OpenAI Platform documentation, 2024–2025.
4. Jon Henshaw, “Google allows sites to opt out of training its LLMs for GenAI,” Coywolf, 2023.
5. “Claude can now search the web,” Anthropic, 2025.
6. “Introducing web search on the Anthropic API,” Anthropic, 2025.
7. “Does Anthropic crawl data from the web, and how can site owners block the crawler?” Anthropic Support (Claude Help Center), 2025.
8. Umar Shakir, “Anthropic’s crawler is ignoring websites’ anti-AI scraping policies,” The Verge, 2024.
Frequently asked questions
What is llms.txt in one sentence?
llms.txt is a lightweight markdown index at your site root that tells language models what your site is about and where to retrieve clean, LLM-ready sources.¹ ²
Does llms.txt block training or access?
No. It is not a blocking mechanism. Use robots.txt rules for AI user-agent tokens such as GPTBot and Google-Extended to control access and training. Use llms.txt to guide models to the right content for grounding and citation.³ ⁴
Will every AI company honor llms.txt?
Not guaranteed. Adoption is growing, and vendors are shipping web-search features that make such guidance attractive. Some crawlers have misbehaved historically, which underscores the need to pair llms.txt with clear robots directives and monitoring.⁵ ⁶ ⁷ ⁸
What should I publish first if I have limited time?
Publish /llms.txt with brand context, identity links, and three twins: About, Products, and Pricing. Add Security and Policy next. Keep it short, current, and linked.¹ ²