arXiv one-year bans show the trust cost of AI citations

arXiv's scrutiny of AI-generated manuscripts is not a blanket LLM ban. It is a warning about hallucinated citations entering research infrastructure.

AI summary

What happened: arXiv is reported to be taking a harder line on AI-generated manuscripts and hallucinated citations. The official policy targets deceptive or automated submissions, while enforcement documents describe warnings, submission limits, and account suspensions.
The core issue: The problem is less about using LLMs at all and more about unverified fake references entering the scholarly record.
Why it matters: A paper citation is not decorative link text. It is part of the trust layer for search, review, and future research.
Builder lesson: AI product teams need to design for source verification, logs, and accountability before generated text reaches a trust system.

arXiv is now in the middle of a very practical AI trust story. TechCrunch and Ars Technica reported that the preprint repository can restrict authors from submitting for up to one year when they submit AI-generated manuscripts with fabricated references, wrong author names, or citations to papers that do not exist. It is tempting to read this as "arXiv is banning AI papers." That misses the more useful lesson for developers and AI product teams. The issue is not the mere use of LLMs. The issue is the trust cost created when unverified generated material enters research infrastructure.

References in a paper are not decoration. They tell readers which prior work supports a claim, where the contribution differs, and where verification should begin. When an LLM invents a plausible title, author list, venue, and year, the error is often hard to catch at a glance. That is especially true in fast-moving or interdisciplinary fields where no single reviewer knows the full literature. One bad citation is not just a typo. A reviewer may spend time looking for a nonexistent paper. A researcher may follow a false lineage. Search engines and citation databases may absorb noise.

That is why the arXiv discussion is a useful case study in how AI quality problems show up inside knowledge infrastructure. A hallucinated link in a blog post is bad, but a hallucinated reference in a preprint has a different weight. The preprint can become input for other papers, review workflows, search indexes, literature-mapping tools, summarizers, and research assistants. If AI-generated errors are then reused by AI systems, the mistake stops being an isolated slip and becomes contamination in the knowledge supply chain.

What arXiv is actually targeting

arXiv's official code of conduct describes baseline expectations for a research community. It covers false identities, deceptive conduct, inaccurate or misleading content, and abusive automated submissions. Its enforcement document explains that violations can lead to warnings, submission restrictions, or account suspension. The starting point is therefore not "did the author use an LLM?" It is "does the final submission meet the trust standards of the community?"

That distinction matters. Researchers already use LLMs in many parts of their workflow. They polish prose, shorten abstracts, explore related work, draft reviewer responses, and generate code for experiments. A blanket ban on all of that would not match how research is now done. But the opposite position, that a bad output is acceptable because "AI made it," does not work either. The submitter is still the author of the final manuscript. Responsibility for citations, claims, and data remains with the human authors and submitting account.

The danger begins when verification responsibility gets blurred. If a researcher copies an LLM-suggested bibliography into a manuscript, does not check DOI records, does not confirm authors and titles, and does not search arXiv or Crossref for the cited work, the paper carries the risk of an automated output even if a human touched the file. What arXiv is policing is closer to responsibility evasion than tool usage.

Key figures: 111M references audited; 2.5M papers analyzed; 146,932 estimated hallucinated citations in 2025.

The uncomfortable question from a 111M-reference audit

The wider debate was sharpened by the May 8, 2026 arXiv paper "LLM hallucinations in the wild: Large-scale evidence from non-existent citations." The authors analyzed 2.5 million papers and 111 million references across arXiv, bioRxiv, SSRN, and PubMed Central. They found that nonexistent references rose sharply after LLMs became widely used and conservatively estimated 146,932 hallucinated citations in 2025 alone.

This is not just a collection of embarrassing examples. The more important question is what kind of verification cost LLMs have added to the research workflow. LLMs can produce fluent research prose quickly. They can also produce outputs that look like related work sections, method summaries, and bibliographies. Fluency, however, is not correctness. Whether a paper exists, whether the title and author list match, and whether the cited claim appears in the source are separate questions.

The better the LLM output looks, the more careful the verifier has to be. Awkward text invites suspicion. Polished text often pushes verification further down the workflow. That is the paradox of AI research assistance. LLMs may reduce drafting time, but once the draft enters an external trust system, it needs verification time. If the references are wrong, prose quality does not matter. If the prior-work section is wrong, the novelty claim weakens with it. The writing cost saved by AI comes back as verification cost somewhere else.

Why hallucinated citations are worse than ordinary errors

Hallucinated citations are harder to manage than many factual mistakes. First, they look plausible. LLMs can imitate title patterns, author names, conference names, dates, and DOI-like strings. Second, they are expensive to disprove. Confirming that a paper does not exist may require several searches, and there may be similar titles or real papers by similar authors. Third, automated systems can amplify the mistake. Literature search tools, review assistants, and summarizers may ingest the bad citation and repeat it, so it looks more established each time it is reused.

For developers, this resembles a data validation problem. A value can be well typed without being true. A JSON object can match a schema without proving that its external identifier exists. An LLM-generated bibliography may look like valid BibTeX, but semantically it is an unverified foreign key. In software, failing to validate foreign keys creates broken references. In research, failing to validate citations creates a broken knowledge graph.
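The gap between "well formed" and "true" can be sketched in a few lines. This is a minimal illustration, not a production validator: the DOI pattern is a loose shape check, `KNOWN_DOIS` is a hypothetical in-memory stand-in for a real registry lookup, and the DOIs are made-up placeholders.

```python
import re

# Loose syntactic shape of a DOI: "10.", a registrant code, a slash, a suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical stand-in for a registry. A real check would query a DOI
# registry or a local mirror instead of this in-memory set.
KNOWN_DOIS = {"10.1000/example.2021.001"}


def is_well_formed(doi: str) -> bool:
    """True if the string looks like a DOI. Says nothing about existence."""
    return bool(DOI_SHAPE.match(doi))


def exists_in_registry(doi: str, registry: set[str]) -> bool:
    """True only if the registry actually knows this identifier."""
    return doi.lower() in registry


def check(doi: str, registry: set[str] = KNOWN_DOIS) -> str:
    """Classify an identifier as malformed, unverified, or verified."""
    if not is_well_formed(doi):
        return "malformed"
    if not exists_in_registry(doi, registry):
        return "well-formed but unverified"
    return "verified"
```

The middle result is the interesting one: a string can pass every format check and still point at nothing.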

The bigger issue is accountability. Human authors have always made citation mistakes, but LLMs can generate mistakes at scale, quickly, and in a confident voice. When one paper includes a few fabricated references, the cost falls on reviewers and readers. When automated manuscripts arrive repeatedly, the platform's overall moderation and review burden rises. arXiv's advantage has always been fast preprint distribution. That advantage depends on a baseline of submitter trust and community verification.

This is closer to "no unverified submissions" than "no AI"

Reading the incident as an AI-tool ban misses the operational point. arXiv needs to know whether the record hosted on its servers is trustworthy, not which editor produced the first draft. If an author used an LLM to polish wording but verified citations, data, and claims, the risk profile is different. If a human manually typed a nonexistent reference, the manuscript still has a problem.

The same distinction applies to enterprise AI products. Whether a customer-support answer was first drafted by an agent or a human may be an internal implementation detail. But if the answer cites a nonexistent policy, invents a refund rule, or points to a fake legal requirement, the service provider owns the outcome. AI usage is not a liability shield. It is part of the operating design.

Area	Acceptable AI assistance	Risky submission pattern
Writing	Style edits where authors verify meaning and evidence	Unsupported assertions and inflated novelty claims
Citations	Separate checks for DOI, arXiv ID, author names, and titles	Copying LLM-generated references into the manuscript unchanged
Responsibility	Authors and submitters retain responsibility for the final paper	Skipping verification because "the AI wrote it"

A more accurate shorthand is therefore "no unverified submissions." Platforms can still detect automated-content patterns and restrict repeat offenders. But the long-term answer is not a single AI detector. Submission workflows need to verify DOI records and arXiv IDs, flag whether references exist, track repeated violations, and preserve revision history. Individual vigilance alone is not enough when generation speed is so high.
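A first step in such a workflow might be pulling checkable identifiers out of free-text references, so that entries with nothing to check are routed to a human. The sketch below is an assumption about how that could look, not a description of arXiv's tooling. The patterns cover only the modern arXiv ID format and a common DOI shape, and the function names are hypothetical.

```python
import re
from typing import Optional

# Modern arXiv identifiers look like YYMM.NNNNN, optionally with a version.
ARXIV_ID = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
# Common DOI shape; trailing punctuation is trimmed after matching.
DOI = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')


def extract_identifiers(reference: str) -> dict[str, Optional[str]]:
    """Return the arXiv ID and DOI found in a reference string, if any."""
    arxiv = ARXIV_ID.search(reference)
    doi = DOI.search(reference)
    return {
        "arxiv_id": arxiv.group(1) if arxiv else None,
        "doi": doi.group(1).rstrip(".,;") if doi else None,
    }


def needs_manual_check(reference: str) -> bool:
    """True when a reference carries no identifier a registry could confirm."""
    ids = extract_identifiers(reference)
    return ids["arxiv_id"] is None and ids["doi"] is None
```

Extraction only says what could be checked. Confirming that each identifier exists is a separate lookup against the registry.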

Research infrastructure previews the future of AI products

arXiv's problem is not limited to academia. The same issue will repeat anywhere AI systems quote external knowledge. Legal AI cites cases and statutes. Medical AI cites papers and guidelines. Financial AI cites rules and contracts. Developer tools cite API docs, GitHub issues, package advisories, and changelogs. When the citation is wrong, the problem is no longer just answer quality. It becomes operational risk.

The first lesson for builders is to treat sources as first-class data, not as trailing text. Store not only the answer, but also the document ID, version, retrieval result, and evidence span used to generate it. The second lesson is to separate generation from verification. Asking the same model that invented a citation to certify that the citation is valid is weak. When possible, verification should hit structured sources such as DOI registries, document stores, code repositories, package registries, and policy databases.
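The first two lessons could be sketched as follows. This is illustrative only: `Evidence`, `Answer`, and `DOCUMENT_STORE` are hypothetical names, and the store is a plain dictionary standing in for whatever document system a product actually uses. The point is that the verifier reads the store directly and never asks the generator to vouch for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    document_id: str
    version: str
    span: str  # exact text the answer claims to quote from the source


@dataclass(frozen=True)
class Answer:
    text: str
    evidence: tuple[Evidence, ...]


# Hypothetical document store keyed by (document_id, version).
DOCUMENT_STORE = {
    ("refund-policy", "v3"): "Refunds are available within 30 days of purchase.",
}


def unverified_evidence(answer: Answer, store: dict = DOCUMENT_STORE) -> list[Evidence]:
    """Return the evidence items the store cannot confirm."""
    failures = []
    for item in answer.evidence:
        source_text = store.get((item.document_id, item.version))
        # Fail if the document/version is unknown or the span is not in it.
        if source_text is None or item.span not in source_text:
            failures.append(item)
    return failures
```

An empty list means every quoted span was found in the stated version of the stated document. Anything else is a reason to hold the answer back.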

The third lesson is to design failure into the user experience. If the system cannot verify a source, it should say so instead of inventing one. That can feel less impressive as a product moment, but in trust-sensitive domains it is essential. A system that clearly says it cannot find the paper is more useful than one that confidently fabricates a reference.
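In code, designing for failure can be as small as making "not found" an explicit return path. The sketch below assumes a hypothetical verified index, with placeholder titles and IDs. A real system would back `lookup_verified` with a registry or document store.

```python
from typing import Callable, Optional

# Hypothetical verified index; the title and ID are placeholders.
VERIFIED_INDEX = {
    "a placeholder title": {"id": "doc-001", "title": "A Placeholder Title"},
}


def lookup_verified(title: str) -> Optional[dict]:
    """Return the verified record for a title, or None if there is none."""
    return VERIFIED_INDEX.get(title.strip().lower())


def cite_or_decline(
    title: str,
    lookup: Callable[[str], Optional[dict]] = lookup_verified,
) -> str:
    """Cite only what the lookup confirms; otherwise say so plainly."""
    record = lookup(title)
    if record is None:
        return f'No verified source found for "{title}".'
    return f'{record["title"]} [{record["id"]}]'
```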

The community reaction is not simple

The research and developer community broadly understands why arXiv needs enforcement. Repeated hallucinated citations waste reviewer time and raise the cost of running a trust-based preprint system. Many researchers scan abstracts and reference lists before deciding what to read. A fake citation in that path is not harmless noise. It changes the cost of literature discovery.

At the same time, many people warn against collapsing "AI use" and "false citation submission" into one category. Modern research workflows already combine search engines, grammar tools, code assistants, summarizers, and reference managers. The boundary between AI assistance and automated manuscript production is not always clean. In practice, behavior-based criteria will probably matter more: citation accuracy, response to correction, repeated violation patterns, and whether the submitter can verify the final work.

This resembles the developer debate around AI-generated code. A project should not reject a patch simply because AI helped draft it. But it should reject code that lacks tests, misunderstands dependencies, calls nonexistent APIs, or breaks security boundaries. Papers are similar. The key question is not whether AI touched the prose. It is whether unverified claims and fabricated references entered the record.

Citation CI is coming

Software teams put builds, tests, linting, type checks, and security scans into CI. Manuscript submission is likely to move in a similar direction. If a reference includes an arXiv ID, the workflow can check whether it exists. If it includes a DOI, the workflow can compare it against Crossref metadata. If a title and author list do not match the registry record, the system can flag it. If a cited paper has been retracted, the author can see that before submission. Human reviewers should spend more time on scholarly judgment and less time disproving broken references.
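A citation check of this kind might look like the sketch below. It is a rough illustration under stated assumptions: the registry is a plain dictionary rather than a live Crossref or arXiv query, `RegistryRecord` and its `retracted` flag are hypothetical, and the 0.9 title-similarity threshold is an arbitrary starting point rather than a recommended value.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Citation:
    identifier: str
    title: str
    authors: list[str]


@dataclass
class RegistryRecord:
    title: str
    authors: list[str]
    retracted: bool = False


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def check_citation(
    citation: Citation,
    record: Optional[RegistryRecord],
    min_title_similarity: float = 0.9,
) -> list[str]:
    """Return a list of problems; an empty list means the citation passed."""
    if record is None:
        return ["identifier not found in registry"]
    problems = []
    if title_similarity(citation.title, record.title) < min_title_similarity:
        problems.append("title does not match registry record")
    if {n.lower() for n in citation.authors} != {n.lower() for n in record.authors}:
        problems.append("author list does not match registry record")
    if record.retracted:
        problems.append("cited work is marked as retracted")
    return problems


def run_checks(
    citations: list[Citation],
    registry: dict[str, RegistryRecord],
) -> dict[str, list[str]]:
    """Map each failing identifier to its problems, like a CI report."""
    report = {}
    for citation in citations:
        problems = check_citation(citation, registry.get(citation.identifier))
        if problems:
            report[citation.identifier] = problems
    return report
```

As with a failing test suite, a non-empty report blocks submission until the author fixes or removes the flagged entries. Exact author matching is deliberately strict here. A real pipeline would need to handle initials, name order, and transliteration.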

AI research assistants should evolve in the same direction. A good assistant is not merely one that writes more paragraphs. It should expose verification status: this citation is confirmed, this title is similar but has different authors, this claim was not found in the source text. The next competitive edge for generative interfaces may be less about prettier prose and more about traceable sources and clear responsibility.

arXiv's enforcement shift is therefore not a small moderation footnote. It is a signal that knowledge infrastructure is under pressure now that AI sits closer to the start of the writing pipeline. Models will write faster, imitate citations more naturally, and suggest more research directions. But the research community does not need more plausibility. It needs knowledge that can be verified.

The same question now applies to every knowledge-heavy AI product. Can your system create a source that does not exist? If it does, who catches it, where, and when? What warning or verification path appears before a user carries that output into a trust system? The cost arXiv is paying today is the cost other AI services will face next. In the AI writing era, the bottleneck is not generation. It is proving that the generated work can be trusted.