SecurityBrief India - Technology news for CISOs & cybersecurity decision-makers
India
Cloudflare warns of AI code review prompt injection

Cloudflare warns of AI code review prompt injection

Mon, 4th May 2026 (Today)
Sean Mitchell
SEAN MITCHELL Publisher

Cloudflare has published research showing that attackers can mislead AI models used in code security reviews through indirect prompt injection. The study examined seven AI models across 18,400 API calls.

Its threat intelligence unit, Cloudforce One, analysed 100 malicious or abusive Cloudflare Workers scripts to test whether embedded text comments could steer large language models toward false benign verdicts. The research compared frontier and non-frontier models and found that success depended less on blunt manipulation than on the ratio of deceptive comments to code and the size of the file under review.

Rather than trying to break the models directly, the tactic inserts blocks of commented text into source code. These comments tell an automated reviewer that the code is harmless, legitimate or routine, even when the script is malicious.

The study identified a "bypass zone" when deceptive comments made up less than 1% of a file. In that range, average detection rates across the tested models fell to 53.3%, from a 67.3% baseline for the same malicious code without the added lures.

When the comments became too numerous, however, the tactic often backfired. At ratios above 25% of the file, detection rose to 97.9%, and above 50% it reached 98.9%, suggesting many models treated heavy repetition as a sign of fraud rather than reassurance.

Model behaviour

The data also showed uneven resilience across systems. One non-frontier model saw its detection rate drop from 90% to 67% with only 20 inserted comments. By contrast, several frontier models were broadly resistant to low-volume deception and, in some cases, became more likely to flag malicious code as comment volume rose.

One non-frontier model remained particularly vulnerable, while another treated large numbers of repeated comments as a strong indicator of abuse, pushing detection to 91% at 1,000 comments. A third non-frontier model showed little variation between moderate and extreme comment volumes, indicating a flatter response to both deception and repetition.

The findings suggest that cost-optimised models may create a wider attack surface when companies deploy AI in automated security workflows. They also indicate that defensive outcomes can vary sharply depending on how a model weighs natural language against executable code.

Context trap

The strongest effect in the study came from file size rather than wording. Detection for Workers under 500KB was close to 100%, but for files larger than 3MB it fell to between 12% and 18%.

That decline appeared when malicious logic was buried inside large bundles of legitimate framework code, including libraries such as React SDKs or TON. In those cases, the models often failed to isolate the harmful payload even when the relevant code still sat within the system's context window.

Cloudflare's researchers described this as a structural weakness rather than a linguistic one. The implication is that attackers may not need to persuade an AI reviewer that a file is safe if they can instead drown the malicious signal in enough benign material.

Some frontier models also suffered "format breakdown" under extreme comment volumes, returning garbled outputs or refusals instead of a usable malicious-or-benign verdict. In an automated pipeline, that kind of failure could stop a blocking action from being triggered at all.

Language bias

The study also found what it called linguistic stereotyping in model responses. Some models treated Russian, Chinese or Arabic comments as higher-risk signals than English, regardless of the code's actual function, while a few appeared more trusting of Estonian comments.

At high volumes, one non-frontier model increased malicious detection by 14 percentage points when comments were written in Russian. Frontier models 2 and 3 were more likely to treat Chinese and Arabic scripts as suspicious than the English baseline.

That pattern raises questions about how language bias may shape security decisions when AI systems are used to review software or automate triage. It also suggests that attackers may adapt comment language to match the assumptions of the model they are trying to evade.

The research grew out of Cloudforce One's monitoring of abuse on Cloudflare Workers, where it found a rise in VPN and proxy tunnelling scripts using the VLESS protocol. During routine analysis, the team identified scripts with thousands of lines of repetitive multilingual text aimed at influencing automated auditing systems.

To test the effect systematically, the team inserted comments at statement boundaries throughout malicious scripts rather than placing them only at the top of a file. It then measured binary verdicts, confidence scores and "unknown" responses across multiple models with context windows ranging from 376KB to 2.8MB.

The results point to several practical defences, including stripping comments before analysis, prioritising functional code over boilerplate in large files, anonymising variable names and using narrower prompts that ask whether code matches a specific abuse pattern rather than posing a general safety question.

The findings also suggest a shift in how AI systems used in security can be targeted: adversaries can influence model reasoning or overwhelm its attention without compromising the underlying software directly.

"The fact that detection accuracy plummets to 12% when payloads are buried in large library bundles suggests that adversaries no longer need to convince the AI that their code is safe-they only need to make the malicious signal too small for the AI to find."