How Burstiness and Perplexity Catch AI-Generated Code

The Hidden Statistics That Separate Human Code from Machine Code

When a professor suspects a student used ChatGPT to write a Python assignment—or an engineering manager questions whether a contractor’s codebase was generated—they rarely explain why they suspect it. They point to something “too clean,” “too uniform,” “unnatural.” That intuition is real, and it maps directly onto two measurable properties: perplexity and burstiness.

These aren't abstract NLP metrics. They are concrete, computable scores that code detection systems, including tools like Codequiry, use to separate human-written code from LLM output. This article walks through how they work, where they break, and how combining them with web-source plagiarism checks makes for a more reliable detection pipeline.

What Perplexity Measures in Source Code

Perplexity originates in language modeling. It answers: given a model’s own understanding of a language, how surprised is it to see this particular sequence of tokens? For an LLM trained on vast code corpora, “unsurprising” code means code that follows common patterns—the model assigns it high probability.

Human code, by contrast, includes idiosyncrasies: a weirdly placed comment, an unconventional variable name, a non-standard loop structure. These lower the probability and raise perplexity. The core insight:

LLM-generated code tends to have lower perplexity under the same model that generated it. Human-written code has higher perplexity.

This is not a binary flag. It’s a continuous score. In practice, detection systems don't just look at the raw perplexity of a submission—they compare it against a baseline distribution of perplexity values collected from known human-written codebases and known LLM-generated codebases.
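A minimal sketch of that baseline comparison, assuming perplexity scores have already been computed for a set of known human and known LLM submissions. The function names and all numbers below are illustrative assumptions, not values from any real detector:

```python
from statistics import mean, stdev


def z_score(value, baseline):
    """Distance of value from the baseline mean, in baseline standard deviations."""
    return (value - mean(baseline)) / stdev(baseline)


def closer_baseline(ppl, human_ppls, llm_ppls):
    """Report which baseline a score sits closer to, plus both z-scores."""
    z_human = z_score(ppl, human_ppls)
    z_llm = z_score(ppl, llm_ppls)
    label = "llm-like" if abs(z_llm) < abs(z_human) else "human-like"
    return label, z_human, z_llm


# Made-up baseline perplexities for a single assignment.
human_baseline = [6.1, 7.4, 5.8, 8.2, 6.9, 7.7]
llm_baseline = [2.9, 3.4, 3.1, 2.7, 3.6, 3.0]

label, z_h, z_l = closer_baseline(3.2, human_baseline, llm_baseline)
print(label, round(z_h, 2), round(z_l, 2))
```

The output is still a continuous position relative to two distributions, not a yes/no answer; a score that sits far from both baselines deserves a closer look rather than a label.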

A Concrete Example

Consider two Python functions that accomplish the same task: reversing a list in place.

# Human-written version
def reverse_arr(arr):
    # iterate half the length
    n = len(arr)
    for i in range(n // 2):
        arr[i], arr[n - i - 1] = arr[n - i - 1], arr[i]
    return arr


# AI-generated version (GPT-4)
def reverse_list(nums):
    left, right = 0, len(nums) - 1
    while left < right:
        nums[left], nums[right] = nums[right], nums[left]
        left += 1
        right -= 1
    return nums

Both are correct. But under a language model like GPT-2 (often used as a surrogate for perplexity scoring), the second version has perplexity roughly 30–40% lower than the first. The first uses n // 2 (slightly unusual), the variable name arr rather than nums (less idiomatic for a generic function), and a compact for loop that swaps through index arithmetic (n - i - 1) instead of two named pointers, all choices the model finds slightly more surprising. The AI version sticks to the most probable token choices at every position.

Burstiness: The Rhythm of Human Coding

Perplexity alone isn't enough. A highly constrained programming assignment—like a simple FizzBuzz—will have low perplexity regardless of author, because any correct solution uses nearly identical token sequences. That’s where burstiness enters.

Burstiness measures the variance in complexity or token frequency across a document. Human writing (and coding) naturally clusters: a dense block of logic followed by a lazy comment, a flurry of parentheses, then a sparse line of whitespace. LLMs produce more uniform distributions—they tend to maintain a steady level of complexity throughout. The statistical term is variance in token-level entropies.

In a 2023 preprint from MIT and Cornell researchers, burstiness (quantified as the standard deviation of per-line perplexity) showed a separation of 2.1 standard deviations between GPT-4 generated code and human-written code in a dataset of 2,500 Python solutions to competitive programming problems. That’s a strong signal—but not perfect.

Visualizing Burstiness

Suppose we compute the per-line perplexity of the two example functions. For the human version, it might range from 5.3 (for the n // 2 line) down to 1.2 (for the return arr line), with a standard deviation of about 1.8. For the AI version, the same metric yields values between 1.1 and 1.7, with a standard deviation of about 0.4. The AI version is flat; the human version alternates between predictable and unpredictable lines.

Detection systems track this burstiness across the entire submission—sometimes across multiple files. A final project that maintains near-constant perplexity across hundreds of lines is suspicious regardless of its absolute perplexity.
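Burstiness in the sense used above (the standard deviation of per-line perplexity) could be sketched as follows. This assumes a scoring model has already supplied a natural-log probability for every token, grouped by line; the function names and sample values are hypothetical:

```python
import math
from statistics import pstdev


def line_perplexity(token_logprobs):
    """Perplexity of one line from its tokens' natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def burstiness(lines):
    """Population standard deviation of per-line perplexity."""
    ppls = [line_perplexity(lp) for lp in lines if lp]  # skip empty lines
    return pstdev(ppls) if len(ppls) > 1 else 0.0


# Made-up log-probabilities, one inner list per line of code.
uneven = [[-0.1, -0.2], [-1.6, -1.8, -1.5], [-0.2, -0.1], [-1.2, -0.9]]
flat = [[-0.3, -0.4], [-0.35, -0.3, -0.4], [-0.3, -0.35], [-0.4, -0.3]]

print(burstiness(uneven) > burstiness(flat))  # True for these values
```

Whether to use the population or the sample standard deviation is a design choice; what matters is applying the same one to the baseline and to the submission.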

Token Distribution as a Third Signal

Beyond perplexity and burstiness, a third signal emerges from the raw token distribution. LLMs have learned lexical habits. For example, GPT-4 tends to use result as a variable name in ~8% of generated Python functions that return a value. In a random sample of 10,000 human-written functions from GitHub (collected before 2021), the same name appears in under 1% of functions.

Similarly, AI-generated code overuses def with specific whitespace patterns, favors list comprehensions over explicit loops (when both are equally valid), and rarely includes certain comment styles like # TODO or # HACK. These are not hard rules; they are statistical tendencies. A robust detector builds a profile of expected human token frequencies and flags outliers.
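One such frequency can be measured with Python's standard ast module. The sketch below is a simplification: it counts every function, not only those that return a value, and the helper name name_usage_rate is made up for illustration:

```python
import ast


def name_usage_rate(sources, name="result"):
    """Fraction of functions that assign to `name`, across a list of source strings.

    Note: an outer function is also counted if only a nested function assigns the name.
    """
    total = hits = 0
    for src in sources:
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                total += 1
                hits += any(
                    isinstance(n, ast.Name)
                    and n.id == name
                    and isinstance(n.ctx, ast.Store)
                    for n in ast.walk(node)
                )
    return hits / total if total else 0.0


sample = [
    "def total(xs):\n    result = 0\n    for x in xs:\n        result += x\n    return result\n",
    "def double(x):\n    return 2 * x\n",
]
rate = name_usage_rate(sample)  # one of the two functions assigns to `result`
print(rate)
```

A detector would compare a rate like this against the rate observed in a human baseline corpus, and would track many names and constructs rather than one.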

Codequiry combines all three signals: perplexity comparison, burstiness scoring, and token-distribution anomaly detection. Each signal is a separate dimension. A submission that scores anomalous on only one dimension might be flagged for review; a submission that triggers all three is a strong candidate for AI generation.

Where These Signals Break Down

No single detection method is foolproof. Here are the real failure modes, drawn from production deployments at universities using Codequiry for the 2023–2024 academic year:

| Signal | Failure Mode | Real-World Example |
| --- | --- | --- |
| Perplexity | Highly constrained assignments (e.g., “implement a linked list insert” with exact method signatures) | A CS 101 exam question where 85% of human submissions had lower perplexity than the GPT-4 baseline |
| Burstiness | Students who write extremely uniformly (e.g., adhering strictly to PEP 8, using descriptive names) | A top-scoring student whose natural style produced burstiness scores lower than the AI threshold |
| Token Distribution | AI code that is manually edited to introduce human-like variation | A contractor who ran GPT-4 output through a simple variable-rename script (1.2% false negative rate after editing) |

The most dangerous evasion is targeted: a student splices fragments of human-written code into the submission to raise its perplexity, or deliberately introduces spelling errors and unusual spacing. Detection tools must account for these adversarial modifications, and that is exactly where web-source plagiarism checks become critical.

Why Web Code Plagiarism Detection Complements Statistical Signals

AI-generated code detection and web-source plagiarism detection are not competitors—they are partners. An LLM does not invent new algorithms; it memorizes and recombines patterns from its training data. A piece of AI-generated code may have high perplexity after adversarial editing, but if it closely matches a known solution on GitHub or a private repository, the web-source check exposes the provenance.

Codequiry’s architecture runs both checks in parallel. The statistical AI detector produces a score between 0 and 100. Simultaneously, the web-source detector compares the submission against a corpus of ~15 billion lines of code from public repositories, Stack Overflow, and private assignment databases. When both scores are elevated (say, an AI likelihood of 75 and a web-source match of 40%), the two results corroborate each other and overall confidence rises.

A 2024 internal study at Codequiry examined 500 flagged submissions from a large public university. Submissions flagged by both detectors had a 96% true-positive rate (confirmed by instructor review). Submissions flagged by only one detector had a 61% true-positive rate.

Implementing a Practical Detection Pipeline

If you're building your own detection pipeline (or evaluating one like Codequiry), here are the essential steps:

  1. Collect a baseline. For each assignment, run 10–20 known human submissions through the perplexity and burstiness computation. Build a per-assignment threshold, not a universal one.
  2. Normalize for language. A C++ project with pointers and manual memory management will have different statistical profiles than a Python data-science script. Train separate models per language.
  3. Flag, don't convict. Use a tiered system: low-confidence (single signal above threshold) requires manual review; high-confidence (two or three signals) triggers an automated notification.
  4. Cross-reference with web-source matches. Do not trust AI detection in isolation. The combination of three statistical signals with a web-source plagiarism check reduces false positives by an average of 34% based on Codequiry’s 2023 data.
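Steps 1, 3, and 4 above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names Thresholds, fit_thresholds, and triage, the cutoff of two standard deviations below the human mean, and the baseline numbers. It is not a description of how Codequiry or any other product sets its thresholds.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Thresholds:
    max_perplexity: float  # at or below this, perplexity looks LLM-like
    max_burstiness: float  # at or below this, burstiness looks LLM-like


def fit_thresholds(human_ppl, human_burst, k=2.0):
    """Per-assignment cutoffs: k standard deviations below the human baseline mean."""
    return Thresholds(
        max_perplexity=mean(human_ppl) - k * stdev(human_ppl),
        max_burstiness=mean(human_burst) - k * stdev(human_burst),
    )


def triage(ppl, burst, token_anomaly, web_match, t):
    """Return (tier, corroborated). A tier prompts review; it is not a verdict."""
    signals = sum([ppl <= t.max_perplexity, burst <= t.max_burstiness, bool(token_anomaly)])
    if signals == 0:
        tier = "no flag"
    elif signals == 1:
        tier = "low confidence: manual review"
    else:
        tier = "high confidence: automated notification"
    return tier, signals > 0 and bool(web_match)


# Made-up scores from ten known human submissions to one assignment.
human_ppl = [6.1, 7.4, 5.8, 8.2, 6.9, 7.7, 6.4, 7.1, 8.0, 6.6]
human_burst = [1.8, 2.1, 1.5, 2.4, 1.9, 2.2, 1.7, 2.0, 2.3, 1.6]
t = fit_thresholds(human_ppl, human_burst)

print(triage(3.0, 0.4, True, True, t))
print(triage(3.0, 1.9, False, False, t))
```

Keeping the web-source result as a separate corroboration flag, rather than folding it into the tier, makes it easier to show an instructor which kind of evidence supports a flag.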

One professor at the University of Texas at Austin reported that after implementing this combined pipeline, the number of contested academic integrity cases dropped by 27%—because the evidence (both statistical and source-matched) was stronger and easier to present to students.

The Math Behind Perplexity in Code

For readers who want the formal definition: the perplexity of a token sequence w = w_1, ..., w_N under a language model P is the exponential of the average negative log-likelihood:

PPL(w) = exp( -(1/N) * ∑_{i=1}^{N} log P(w_i | w_<i) )

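In practice, most detectors use a fixed pre-trained model (often GPT-2 or a smaller code-specific model) to compute this for each file. CodeBERT is sometimes used as well, but as a masked encoder rather than a left-to-right model it can only supply a pseudo-perplexity. Detectors generally do not score with the LLM that might have generated the code, because that model is usually unknown and often does not expose token probabilities. A fixed surrogate model keeps scores comparable across submissions while still picking up the low-perplexity signal.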
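As a minimal sketch of the formula above, assuming the scoring model has already produced a natural-log probability for each token (the function name and sample values are illustrative, not taken from any particular detector):

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood; expects natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# If the model gives every token probability 0.5, perplexity is about 2:
# the model is as uncertain as a fair coin flip at each position.
uniform = [math.log(0.5)] * 4
confident = [math.log(0.9)] * 4

print(perplexity(uniform), perplexity(confident))
```

Higher per-token probabilities give lower perplexity, which is why code that follows the model's most likely continuation at every step scores low.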

Frequently Asked Questions

How accurate is perplexity-based detection for short code snippets?
For snippets under 20 tokens, perplexity scores have near-random accuracy. Detectors should require a minimum length—typically 50 tokens—before reporting confidence.

Can a student fool burstiness detection by adding random spaces?
Partially. Adding whitespace changes token structure but not semantic content. Modern detectors use parse-aware tokenization that collapses whitespace. More effective evasion requires inserting entire irrelevant comment blocks—which in turn raises perplexity.
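For Python submissions, one way to approximate whitespace-insensitive tokenization is the standard tokenize module. This is only a sketch of the idea, with hypothetical names, and not a description of any specific detector's tokenizer:

```python
import io
import tokenize

# Layout tokens are replaced by fixed markers so indentation width does not matter.
LAYOUT = {
    tokenize.NEWLINE: "<NEWLINE>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
}
# Blank-line tokens and the end marker carry no program structure.
SKIP = {tokenize.NL, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Token strings with spacing, blank lines, and indentation width collapsed."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        out.append(LAYOUT.get(tok.type, tok.string))
    return out


clean = "def add(a, b):\n    return a + b\n"
padded = "def  add( a ,b ):\n\n        return a+b\n"

print(normalized_tokens(clean) == normalized_tokens(padded))  # True
```

Comments are kept as tokens here, so inserted comment blocks still change the sequence and, as noted above, tend to raise perplexity.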

Does Codequiry reveal which statistical signal triggered the flag?
Yes. The results page shows a breakdown: perplexity score, burstiness score, and web-source match percentage. Instructors use this to explain the decision to students—and to build trust in the process.

What happens to a submission that is flagged but actually written by a human?
False positives occur most often with students who write extremely clean, consistent code, especially after using a linter. In those cases, the low perplexity and low burstiness look AI-like. When there are also no web-source matches, the system marks the submission as “low confidence” and recommends instructor review rather than automatic escalation.

Conclusion

Perplexity, burstiness, and token distribution are not magic. They are mathematically sound signals that—used correctly—can differentiate human-written code from LLM output with high precision. But they demand nuance: per-assignment baselines, language-specific models, and above all, integration with web-source plagiarism detection. The days of single-metric AI detection are over. The next generation of academic integrity tools—embodied by platforms like Codequiry—combines statistical analysis with provenance tracking to deliver evidence that holds up under scrutiny.