
In 14 AI coding sessions, zero first prompts asked for security

A SOUPS 2026 paper observed that AI coding assistants push security from upfront requirements into after-the-fact review.

AI summary
  • What happened: A SOUPS 2026 paper studied how security requirements disappear from the first prompt when developers use AI coding assistants.
    • The researchers interviewed 15 professional software engineers and observed 14 coding sessions with Gemini CLI and Gemini 2.5 Pro.
  • The numbers: 0 participants included security requirements in the first prompt, and only 2 caught hidden security issues on their own.
  • Builder impact: AI coding security is becoming a problem of prompt templates, review checklists, inline warnings, and tool defaults, not only model quality.
    • The paper proposes automatic security-review passes, pre-generation checks for sensitive code, and OWASP-class inline warnings.

The paper, "From Preventive to Reactive: How AI Coding Assistants Transform Developers' Security Awareness," was posted to arXiv on May 22, 2026 and accepted to SOUPS 2026. It does not frame AI coding security as a simple vulnerability-rate benchmark. The researchers interviewed 15 professional software engineers, then watched 14 of them use Gemini CLI with Gemini 2.5 Pro on security-relevant coding tasks. Their finding is sharper than "developers forgot security." Developers often knew the security concepts, but the AI workflow pushed that knowledge out of the initial requirement and into post-generation review.

SOUPS, the Symposium on Usable Privacy and Security, is a venue for usable privacy and security research, so the paper looks at people, tools, and workflow rather than only generated code. That makes the research useful for the current coding-agent market. Claude Code, Codex, Copilot, Cursor, Gemini CLI, and similar tools are moving beyond autocomplete into file edits, command execution, and task loops. In that environment, the assumption that a developer will "review it later" becomes more expensive. The first prompt can steer the entire agent loop toward functional correctness while non-functional requirements stay implicit.

Figure: AI coding security awareness shifts from upfront requirements to post-generation review.

The study was not a broad survey. From November 2025 to January 2026, the researchers recruited 15 professional developers across full-stack, machine learning, frontend, R&D, and general software engineering roles. Students and university employees were excluded. Each 45-to-60-minute session started with a semi-structured interview about security practice and AI coding-tool experience. Participants then worked through think-aloud coding tasks in a self-hosted VS Code server with Gemini CLI and Gemini 2.5 Pro preinstalled.

The tasks were chosen to surface security behavior without spelling out security requirements. One task asked for an /update-profile HTTP endpoint that accepts a JSON payload, where input validation, authorization, and sanitization naturally matter. A second task asked participants to initialize a new backend project, where secrets, secure defaults, and environment separation matter. A third task asked them to debug a file-upload race condition. The point was not to trick participants with obscure vulnerabilities. The point was to see whether developers would bring security into their AI prompts when the product task itself did not say "make this secure."

The most striking number in the paper is zero. Across the 14 observed coding sessions, no participant included security requirements in the first prompt. Some participants had just discussed authorization, sensitive data, or vulnerability categories in the interview portion. Once they sat down with Gemini CLI, the initial request focused on the functional goal: implement the endpoint, create the project, or fix the bug. The paper calls this a prompting gap: security knowledge and security prompting behavior split apart.

The second number is two. Only two participants identified hidden security problems without a separate prompt from the researchers. P10 noticed that a generated profile-update endpoint read the user ID from the POST body instead of a secure session store. That creates a direct access-control failure: an authenticated user can change the request body and update another account's profile. P12 succeeded in a different way. After making an initial fix, P12 asked the AI whether any additional security concerns remained, and that question surfaced a DDoS-related risk the participant had not initially raised.

| Observed item | Paper number | Practical reading |
| --- | --- | --- |
| Professional developer interviews | 15 | The study first checked AI attitudes and existing security habits. |
| Observed coding sessions | 14 | One participant declined the coding task; the rest worked in Gemini CLI. |
| Security requirements in first prompts | 0 | Security moved from functional specification to after-the-fact review. |
| Participants who caught hidden issues | 2 | Success depended on individual security instincts and question habits. |

That result is different from the familiar story that novice developers blindly trust AI. The researchers grouped participants into Pre-AI, AI-era, and AI-native cohorts, but experience cohort did not reliably predict security outcomes. P10, who caught the access-control bug, was a relatively less experienced AI-era participant. P12 came from the Pre-AI cohort. More experienced developers also missed important issues. The paper's interpretation is that years of development experience are not enough if security knowledge does not become an action at the exact moment the AI system is being instructed.

That is directly relevant to enterprise AI coding-tool adoption. Teams often compare model quality, IDE integration, codebase indexing, agent permissions, and license cost first. This paper identifies a failure point that happens before many of those comparisons matter. When a developer asks an assistant to "build a profile update endpoint," the tool should ask where the authenticated identity comes from and who is allowed to update which resource. If the assistant generates code first and only later offers a security review, the default workflow has already treated authorization as optional context.
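A pre-generation check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the paper's design or any vendor's feature: `SENSITIVE_TOPICS`, `GateResult`, and `pre_generation_gate` are hypothetical names, and the trigger words and the upload question are assumptions chosen for the example. A real assistant would need a far better classifier than word matching.

```python
import re
from dataclasses import dataclass, field

# Hypothetical trigger words and questions, chosen for illustration only.
SENSITIVE_TOPICS = {
    "endpoint": [
        "Where does the authenticated identity come from?",
        "Who is allowed to update which resource?",
    ],
    "upload": [
        "Which file types and sizes are accepted, and where are files stored?",
    ],
}


@dataclass
class GateResult:
    proceed: bool
    questions: list[str] = field(default_factory=list)


def pre_generation_gate(prompt: str) -> GateResult:
    """Return clarifying questions when the prompt mentions a sensitive topic."""
    # Match whole words so that unrelated words containing a trigger do not fire.
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    questions = []
    for trigger, topic_questions in SENSITIVE_TOPICS.items():
        if trigger in words:
            questions.extend(topic_questions)
    # Generation proceeds only when there is nothing left to ask.
    return GateResult(proceed=not questions, questions=questions)
```

Calling `pre_generation_gate("Build a profile update endpoint")` holds generation and returns the two authorization questions, while a prompt with no trigger word passes straight through.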

The paper also connects its findings to earlier security research. Pearce and collaborators previously reported that roughly 40% of the programs GitHub Copilot generated for high-risk vulnerability scenarios were vulnerable. Perry and collaborators found that developers using an AI assistant could produce less secure code while feeling more confident that their code was secure. The SOUPS 2026 paper helps explain the behavior behind those outcomes. AI can produce functional, readable code quickly, so developers may assume they can recover security later in review instead of specifying it before generation.

Participant comments pointed in the same direction. One participant said AI output tends to focus on functionality and readability without treating security as a default. Another described how security considerations recede to the back of the mind when working with AI, because the developer stops reading line by line. The researchers do not reduce this to personal laziness. The current interaction model of many AI coding tools is "give a functional request, get code." If the user does not actively pull in non-functional requirements such as security, accessibility, reliability, and performance, those requirements can disappear from the first implementation.

The P10 case is the kind of bug most backend developers can recognize. In an /update-profile endpoint, user identity must come from the session, token, or server-side context. It should not come from a client-controlled JSON body. A traditional secure-coding lesson would state that rule plainly: do not trust client input for authorization identity. In the observed session, the AI did not automatically enforce that rule, and the tool interface did not mark the generated endpoint as security-sensitive. P10 found the issue through independent security judgment, not because the tool made the boundary visible.
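The difference is easy to see side by side. The following sketch is framework-free and is not the code generated in the study: `Session`, `PROFILES`, the two handler functions, and the 500-character limit are all hypothetical stand-ins for a real session store, database, and validation policy.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Server-side session; user_id is set at login, never by the client."""
    user_id: str


# Hypothetical in-memory store standing in for a real database.
PROFILES = {"alice": {"bio": "hi"}, "bob": {"bio": "hello"}}


def update_profile_insecure(session: Session, body: dict) -> dict:
    # Flawed: the target account is taken from the client-controlled body,
    # so any authenticated user can name another account.
    target = body["user_id"]
    PROFILES[target]["bio"] = body["bio"]
    return PROFILES[target]


def update_profile_secure(session: Session, body: dict) -> dict:
    # Identity comes only from the server-side session.
    target = session.user_id
    # Reject a body that tries to name a different account.
    if "user_id" in body and body["user_id"] != target:
        raise PermissionError("body user_id does not match the session")
    bio = body.get("bio")
    # Illustrative input validation; the limit is an assumption.
    if not isinstance(bio, str) or len(bio) > 500:
        raise ValueError("bio must be a string of at most 500 characters")
    PROFILES[target]["bio"] = bio
    return PROFILES[target]
```

With a session for `alice` and a body of `{"user_id": "bob", "bio": "pwned"}`, the first handler overwrites bob's profile and the second raises `PermissionError`. Both versions "work" for the happy path, which is why a functional-only review does not catch the difference.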

The P12 case shows a possible design direction. P12 asked whether additional security concerns remained. That simple question pulled out a DDoS-related risk. It was one of the few effective coping strategies the researchers observed across the sessions. For teams, the lesson is not merely to append "review for security" at the end of every task. A stronger pattern is to force both sides of the workflow: make the security requirements explicit before generation, then require a second security question after the diff exists.
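One way to make both sides mandatory is a thin wrapper around whatever model call a team already uses. This is a sketch under assumptions: `ask_model` is a hypothetical callable that takes a prompt string and returns text, not a real SDK function, and the wording of `SECURITY_FOLLOW_UP` is illustrative rather than P12's actual question.

```python
from typing import Callable

# Illustrative wording for the post-generation question.
SECURITY_FOLLOW_UP = "Are there any additional security concerns in this change?"


def generate_with_security(
    task: str,
    security_requirements: list[str],
    ask_model: Callable[[str], str],
) -> dict:
    """Require explicit security requirements before generation,
    then ask one security question about the generated result."""
    if not security_requirements:
        raise ValueError("state at least one security requirement before generating")
    # Side one: requirements are part of the first prompt.
    prompt = task + "\n\nSecurity requirements:\n" + "\n".join(
        f"- {r}" for r in security_requirements
    )
    diff = ask_model(prompt)
    # Side two: a second, explicit security question once the diff exists.
    review = ask_model(f"{SECURITY_FOLLOW_UP}\n\n{diff}")
    return {"prompt": prompt, "diff": diff, "review": review}
```

The wrapper refuses to generate when the requirement list is empty, so the developer cannot skip the first step, and the second model call always happens regardless of how the first output looks.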

The paper's tool-design recommendations are concrete. First, coding assistants could run an automatic security-review pass after code generation. Second, they could ask for security requirements before generating sensitive code involving authentication, file uploads, payments, PII, external URL fetching, or database writes. Third, they could attach inline warnings mapped to OWASP vulnerability classes when generated code matches a known risky pattern. Fourth, security-sensitive outputs could include alternative implementations, assumptions, and uncertainty instead of presenting a single polished answer as if it were complete.
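The third recommendation, inline warnings mapped to OWASP classes, can be illustrated with a small line scanner. This is a rough sketch, not the paper's proposal in code: `RULES`, `InlineWarning`, and `scan` are hypothetical names, the two regular expressions are deliberately naive assumptions, and a production tool would rely on real static analysis instead of pattern matching. The category labels follow the OWASP Top 10:2021 naming.

```python
import re
from typing import NamedTuple


class InlineWarning(NamedTuple):
    line: int
    owasp: str
    message: str


# Naive illustrative patterns; real tools would use proper static analysis.
RULES = [
    (
        re.compile(r"""(body|json|payload)\[["']user_?id["']\]"""),
        "A01:2021 Broken Access Control",
        "identity is read from client-controlled input",
    ),
    (
        re.compile(r"""execute\(\s*f["']"""),
        "A03:2021 Injection",
        "SQL is built with an f-string",
    ),
]


def scan(code: str) -> list[InlineWarning]:
    """Return one warning per rule match, tagged with its 1-based line number."""
    warnings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for pattern, owasp, message in RULES:
            if pattern.search(line):
                warnings.append(InlineWarning(lineno, owasp, message))
    return warnings
```

Run against generated code that contains `uid = body["user_id"]`, the scanner attaches a Broken Access Control warning to that exact line, which is the kind of visible boundary that was missing in the P10 session.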

That positioning intersects with products such as Salt Code, GitHub Copilot code review, Snyk Agent Scan, Semgrep, and SonarQube MCP-style workflows. The difference is the insertion point. SAST and DAST usually inspect code after it exists. PR review bots inspect a completed diff. The paper is concerned with the earlier gap. If the first prompt and first generated code omit security requirements, downstream tools inherit more repair work. As AI coding speed increases, that cost shows up as review queue pressure, exception approvals, delayed releases, and security teams arguing over generated code whose requirements they never saw being specified.

The organizational implication is uncomfortable. Saying "our developers know security" is not enough. In this study, developers who could discuss security concerns still left them out of the first prompt. AI coding policy therefore cannot live only in a document. The same requirements need to appear in repository templates, AGENTS.md, Copilot or Codex instruction files, Claude Code project instructions, PR checklists, pre-commit hooks, CI security scanners, and IDE warnings. Security knowledge has to be present at the point of AI interaction, not only in memory or annual training.

Teams can test a minimal version of that approach without buying a new platform. First, create prompt templates for security-sensitive work. An endpoint-generation template should ask for the authentication source, authorization rule, input validation, rate limit, audit log, and PII handling. Second, split AI-generated diff review into two questions: does the feature work, and did the implementation satisfy the security requirements from the start? Third, require the agent or assistant to run one explicit security query after making changes. These steps are basic, but they move security from an optional reviewer instinct into a repeated workflow control.
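The first step, an endpoint-generation template, might look like the following. It is a minimal sketch assuming a team keeps such a template in its own tooling: `EndpointSecuritySpec` and `render_prompt` are hypothetical names, and only the six field names come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class EndpointSecuritySpec:
    authentication_source: str
    authorization_rule: str
    input_validation: str
    rate_limit: str
    audit_log: str
    pii_handling: str


def render_prompt(task: str, spec: EndpointSecuritySpec) -> str:
    """Build the first prompt, refusing to render while any field is blank."""
    missing = [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
    if missing:
        raise ValueError(f"fill in before generating: {', '.join(missing)}")
    lines = [task, "", "Security requirements:"]
    for f in fields(spec):
        label = f.name.replace("_", " ")
        lines.append(f"- {label}: {getattr(spec, f.name)}")
    return "\n".join(lines)
```

Because every field is required, a developer who has not decided on a rate limit or an audit-log policy finds that out before the assistant writes any code, not during review.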

The study has limits. The sample size was 15 interviews and 14 observed coding sessions. The work happened in a research environment with Gemini CLI and Gemini 2.5 Pro inside a self-hosted VS Code server. Real company repositories, team review, CI failures, deployment deadlines, customer data, and long-running tasks could change behavior. One participant, P6, said the research-task result felt acceptable in the study but would not have been acceptable at work. The paper should be read as a warning signal rather than a universal law.

Even with those limits, 0 out of 14 is hard to ignore. The paper does not say AI coding tools erase security knowledge. It says they can move the moment when that knowledge is used. That interpretation matches the product direction around coding agents. Vendors are adding code review, sandboxing, automation controls, billing controls, and security integrations because generated-code governance is becoming a product category. The SOUPS 2026 paper explains the human side of that category: when the first prompt only describes the feature, the agent loop optimizes around the feature.

A better AI coding assistant will need a stricter standard than "it has a security review button." When a developer asks for an endpoint, the assistant should ask who may change the resource. When the developer asks for file upload, the assistant should raise race conditions, path traversal, MIME validation, and storage permissions. When the developer asks for an OAuth callback, the assistant should require redirect URI validation and state handling. Security cannot remain a final /security-review command. For teams handing code to agents, review-only security is already one step late.