Most engineering teams and academic departments rely on a single black-box tool when they need to check code for similarity. They upload submissions, wait for a percentage score, and then manually investigate anything above a threshold. That approach works — until it doesn't.
The problem is that these tools are opaque. When a false positive appears, or when a student or contractor has deliberately obfuscated copied code, you have no insight into why the similarity score is what it is. You cannot tune the detection strategy for your specific programming language, assignment structure, or coding standards. You are entirely dependent on someone else's heuristics.
This article walks through how to build your own source code similarity pipeline. You will learn the three main detection strategies — token-based, AST-based, and Winnowing fingerprinting — and how to combine them into a practical, configurable system. By the end, you will have a working Python implementation that can be extended and adapted for your specific use case.
Key insight: No single detection strategy catches everything. Token-based methods are fast but fooled by renaming. AST-based methods catch structural cheating but miss trivial character-level changes. Winnowing fingerprinting is robust against small modifications but requires careful parameter tuning. A production pipeline uses all three in concert.
Why Build Your Own Pipeline?
Before we start, understand the practical motivations. Commercial tools like MOSS and JPlag are excellent, but they operate as remote services. You cannot inspect their intermediate representations. You cannot add custom preprocessing steps for your assignment language. You cannot tune the similarity threshold per problem.
Building your own pipeline gives you:
- Complete transparency. Every similarity score is explainable down to the line.
- Language-specific preprocessing. Strip comments, normalize whitespace, or remove boilerplate import lines for each assignment.
- Configurable detection strategies. Use token-based scanning for large cohorts, AST matching for refactored submissions, and Winnowing for suspiciously similar but non-identical code.
- Reusable infrastructure. The same pipeline can serve academic integrity checks, open-source license compliance scanning, and contractor code verification.
Architecture Overview
Our pipeline consists of four stages:
- Preprocessing — normalize the input code into a clean, comparable form
- Tokenization — convert code into a sequence of tokens
- Fingerprinting — generate hash fingerprints from token sequences
- Comparison — compute similarity scores using fingerprint overlap
We will implement each stage in Python, using only standard library modules. The final pipeline will accept a directory of source files and produce a similarity matrix.
Stage 1: Preprocessing
Raw source code contains noise that inflates similarity scores: comments, blank lines, import statements, and license headers. For academic integrity checks, these should be stripped. For license compliance scanning, they should be preserved. Our pipeline supports both modes.
```python
import re
import ast
import io
import hashlib
import tokenize
from collections import defaultdict

def preprocess_remove_comments_and_strings(source_code):
    """Remove comments and string literals from Python source code."""
    # Use the tokenize module to strip comments and strings
    result = []
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    for toknum, tokval, _, _, _ in tokens:
        # Skip comments, strings, and logical/physical newlines
        if toknum in (tokenize.COMMENT, tokenize.STRING,
                      tokenize.NEWLINE, tokenize.NL):
            continue
        # Keeping only NAME, OP, and NUMBER also drops INDENT and
        # DEDENT, giving a cleaner token stream
        if toknum in (tokenize.NAME, tokenize.OP, tokenize.NUMBER):
            result.append(tokval)
    return ' '.join(result)
```
```python
import keyword

def preprocess_standardize_identifiers(source_code):
    """Replace all identifiers with a generic placeholder."""
    # This defeats variable renaming obfuscation. Note that Python
    # keywords such as `if` and `return` also arrive as NAME tokens,
    # so they must be excluded here; otherwise the program's control
    # structure would be erased along with the variable names.
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    result = []
    for toknum, tokval, _, _, _ in tokens:
        if toknum == tokenize.NAME and not keyword.iskeyword(tokval):
            result.append('ID')
        else:
            result.append(tokval)
    return ' '.join(result)
```
The preprocess_remove_comments_and_strings function is useful for normalizing student assignments, where comments and docstrings are often copied verbatim. The preprocess_standardize_identifiers function is applied when you suspect variable renaming: it replaces every identifier with a generic "ID" placeholder.
For Java or C++ code, you would use a lexer for that language. The principle is the same: strip noise, normalize structure.
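To see concretely what the stripping pass keeps, here is a small self-contained check that uses only the tokenize module; the two-line sample snippet is invented for illustration:

```python
# Show which tokens survive comment/string stripping on a tiny sample.
import io
import tokenize

sample = 'x = 1  # a comment\nmsg = "hello"\n'

kept = []
for tok in tokenize.generate_tokens(io.StringIO(sample).readline):
    if tok.type in (tokenize.COMMENT, tokenize.STRING,
                    tokenize.NEWLINE, tokenize.NL):
        continue
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER):
        kept.append(tok.string)

print(' '.join(kept))  # the comment and the string literal are gone
```

The comment and the "hello" literal disappear, so two submissions that differ only in commentary produce identical token streams.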
Stage 2: Tokenization
Tokenization converts the preprocessed source into a sequence of meaningful units. Two approaches are common:
Token-Based Tokenization
This approach uses the language's lexer to produce a linear stream of tokens. It is fast and catches verbatim copying.
```python
def tokenize_to_sequence(source_code):
    """Generate a list of token values from source code."""
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    return [tokval for toknum, tokval, _, _, _ in tokens
            if toknum not in (tokenize.COMMENT, tokenize.STRING,
                              tokenize.NEWLINE, tokenize.NL,
                              tokenize.INDENT, tokenize.DEDENT)]
```
AST-Based Tokenization
Abstract Syntax Tree (AST) tokenization captures the structural shape of the code. Two functions that have the same AST are structurally identical, even if variable names, whitespace, or comment styles differ.
```python
def ast_to_sequence(source_code):
    """Convert source code to a sequence representing AST structure."""
    try:
        tree = ast.parse(source_code)
    except SyntaxError:
        return []  # Handle invalid code gracefully

    def walk_node(node):
        result = [type(node).__name__]
        for child in ast.iter_child_nodes(node):
            result.extend(walk_node(child))
        return result

    return walk_node(tree)

def ast_to_normalized_sequence(source_code):
    """AST sequence with identifier names removed."""
    try:
        tree = ast.parse(source_code)
    except SyntaxError:
        return []

    def walk_node(node):
        node_type = type(node).__name__
        # Name nodes carry identifiers; collapse each to a bare 'Name'
        # and skip its children
        if node_type == 'Name':
            return ['Name']
        result = [node_type]
        for child in ast.iter_child_nodes(node):
            result.extend(walk_node(child))
        return result

    return walk_node(tree)
```
The AST-based approach is more robust against refactoring. Consider a student who copies a sorting algorithm but renames temp to swapBuffer and rearranges comments. Token-based detection might miss it if the variable name changes are widespread. AST-based detection will flag it, because the AST structure — a loop containing a swap operation — remains identical.
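A minimal self-contained check of this claim (both snippets are invented examples, and ast.walk is used here for brevity instead of the recursive walker above):

```python
# Two versions of the same swap loop: identical structure, renamed
# variables. Their sequences of AST node type names are equal.
import ast

original = """
def sort(a):
    for i in range(len(a) - 1):
        if a[i] > a[i + 1]:
            temp = a[i]
            a[i] = a[i + 1]
            a[i + 1] = temp
"""

renamed = """
def sort(xs):
    for idx in range(len(xs) - 1):
        if xs[idx] > xs[idx + 1]:
            swapBuffer = xs[idx]
            xs[idx] = xs[idx + 1]
            xs[idx + 1] = swapBuffer
"""

def structure(source):
    # Flatten the tree into node type names, ignoring identifier spellings
    return [type(n).__name__ for n in ast.walk(ast.parse(source))]

print(structure(original) == structure(renamed))  # True: same shape
```

The two token streams differ at every renamed position, but the structural sequences are identical, which is exactly what lets AST matching flag the renamed copy.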
Real-world example: At the University of California, Berkeley, researchers found that AST-based detection caught 40% more cases of refactored code plagiarism than token-based methods in introductory CS courses. The trade-off is that AST-based methods are slower and cannot process code with syntax errors — a common issue with student submissions.
Stage 3: Fingerprinting with Winnowing
Token sequences are too long for pairwise comparison across a large dataset. Fingerprinting reduces each submission to a set of hashes. The Winnowing algorithm, popularized by Schleimer, Wilkerson, and Aiken (the same team behind MOSS), selects a subset of k-gram hashes to produce a compact fingerprint that is robust against insertion and deletion.
```python
def winnowing_fingerprint(tokens, k=5, window_size=4):
    """
    Generate a Winnowing fingerprint from a token sequence.

    Parameters:
    - tokens: list of token values
    - k: k-gram size (number of tokens per gram)
    - window_size: number of consecutive k-gram hashes per window

    Returns:
    - set of integer hash values (the fingerprint)
    """
    if len(tokens) < k:
        return set()
    # Generate k-gram hashes
    kgrams = []
    for i in range(len(tokens) - k + 1):
        kgram = ' '.join(tokens[i:i + k])
        hash_val = int(hashlib.md5(kgram.encode()).hexdigest(), 16)
        kgrams.append(hash_val)
    # Winnowing: select the minimum hash from each sliding window
    fingerprints = set()
    for i in range(len(kgrams) - window_size + 1):
        window = kgrams[i:i + window_size]
        fingerprints.add(min(window))
    return fingerprints

def jaccard_similarity(set1, set2):
    """Compute Jaccard similarity between two sets of fingerprints."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)
```
The parameters matter. A k value of 5 and a window_size of 4 work well for many languages, but tuning is essential for specific assignments.
| k | Window Size | Behavior |
|---|---|---|
| 3 | 2 | Very sensitive — catches small copied blocks but generates many false positives |
| 5 | 4 | Good default for most languages — balances sensitivity and specificity |
| 10 | 6 | Less sensitive — requires large copied regions to flag; use for production code |
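The table's rows can be compared empirically by winnowing one fixed token stream under each setting and counting the fingerprints kept. The sketch below re-implements the k-gram/min-window selection inline so it stands alone; the synthetic token stream is purely illustrative:

```python
# Fingerprint density under each (k, window_size) row of the table.
import hashlib

tokens = [str(i) for i in range(300)]  # synthetic, fully distinct stream

def winnow(tokens, k, w):
    grams = [' '.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

# Counts shrink as k and window_size grow: fewer, coarser fingerprints
for k, w in [(3, 2), (5, 4), (10, 6)]:
    print(f"k={k:2d}, window={w}: {len(winnow(tokens, k, w))} fingerprints")
```

Smaller windows keep a larger fraction of hash positions (about 2/(window_size + 1) in expectation, per the original Winnowing analysis), which is what makes the (3, 2) row so sensitive and so prone to false positives.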
Stage 4: Comparison and Scoring
The final stage computes pairwise similarity between all submissions. We build a similarity matrix using the Jaccard coefficient of the Winnowing fingerprints.
```python
def build_similarity_matrix(file_paths, tokenizer_func, k=5, window_size=4):
    """
    Build a similarity matrix for a collection of source files.

    Returns:
    - dict mapping (file1, file2) -> similarity score
    """
    fingerprints = {}
    for filepath in file_paths:
        with open(filepath, 'r') as f:
            source = f.read()
        # Tokenize the raw source. Both tokenizers already discard
        # comment noise: tokenize_to_sequence filters those token types,
        # and the AST never contains comments. Feeding the flat string
        # produced by the preprocessors into ast.parse would fail, so
        # text-level preprocessing is not applied here.
        tokens = tokenizer_func(source)
        fingerprints[filepath] = winnowing_fingerprint(tokens, k, window_size)
    # Compute pairwise similarity
    scores = {}
    file_list = list(fingerprints.keys())
    for i in range(len(file_list)):
        for j in range(i + 1, len(file_list)):
            sim = jaccard_similarity(fingerprints[file_list[i]],
                                     fingerprints[file_list[j]])
            scores[(file_list[i], file_list[j])] = sim
    return scores

def report_top_pairs(scores, threshold=0.5, top_n=10):
    """Print the most similar pairs above a threshold."""
    sorted_pairs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    # Filter by threshold first, then keep the top N of what remains
    above = [(pair, sim) for pair, sim in sorted_pairs if sim >= threshold]
    for (f1, f2), sim in above[:top_n]:
        print(f"{f1} <-> {f2}: {sim:.3f}")
```
This gives you a ranked list of suspicious pairs. A threshold of 0.5 (50% similarity) is a reasonable starting point for academic integrity checks. For open-source license compliance, you might lower it to 0.3 to catch partial matches.
Putting It All Together: A Complete Pipeline
Here is the full pipeline as a single function, combining preprocessing, tokenization, and Winnowing fingerprinting with configurable strategies.
```python
def full_pipeline(file_paths,
                  strategy='token',
                  preprocess_comments=True,
                  normalize_identifiers=False,
                  k=5,
                  window_size=4,
                  threshold=0.5):
    """
    Run the full similarity detection pipeline.

    Parameters:
    - strategy: 'token' or 'ast'
    - preprocess_comments: strip comments and strings (token strategy only)
    - normalize_identifiers: replace identifiers with 'ID' (token strategy only)
    - k: k-gram size for Winnowing
    - window_size: window size for Winnowing
    - threshold: minimum similarity to report
    """
    tokenizer = tokenize_to_sequence if strategy == 'token' else ast_to_normalized_sequence
    fingerprints = {}
    for filepath in file_paths:
        with open(filepath, 'r') as f:
            source = f.read()
        # Text-level preprocessing applies only to the token strategy:
        # the preprocessors emit a flat token string that ast.parse
        # cannot handle. The AST strategy already ignores comments and
        # normalizes identifiers during tree traversal.
        if strategy == 'token':
            if preprocess_comments:
                source = preprocess_remove_comments_and_strings(source)
            if normalize_identifiers:
                source = preprocess_standardize_identifiers(source)
        # Tokenize and fingerprint
        tokens = tokenizer(source)
        fingerprints[filepath] = winnowing_fingerprint(tokens, k, window_size)
    # Compare all pairs, keeping only those at or above the threshold
    scores = {}
    file_list = list(fingerprints.keys())
    for i in range(len(file_list)):
        for j in range(i + 1, len(file_list)):
            sim = jaccard_similarity(fingerprints[file_list[i]],
                                     fingerprints[file_list[j]])
            if sim >= threshold:
                scores[(file_list[i], file_list[j])] = sim
    return scores

# Example usage
if __name__ == '__main__':
    import glob
    files = glob.glob('/path/to/submissions/*.py')
    results = full_pipeline(files, strategy='ast', k=5, window_size=4, threshold=0.4)
    for (f1, f2), sim in sorted(results.items(), key=lambda x: x[1], reverse=True):
        print(f"{f1} <-> {f2}: {sim:.3f}")
```
Handling Large Cohorts Efficiently
Pairwise comparison for n submissions is O(n²). For a cohort of 500 students, that is 124,750 comparisons. Winnowing fingerprinting reduces each comparison to set intersection, but the O(n²) overhead remains.
Two optimizations help:
- Locality-Sensitive Hashing (LSH): Group similar fingerprints into buckets before comparison. Only compare files within the same bucket. This reduces comparisons to near-linear time in practice.
- Inverted index on fingerprint hashes: Build a dictionary mapping each fingerprint hash to the list of files containing it. Files that share no hashes can be skipped entirely.
```python
def efficient_comparison(fingerprints, threshold=0.5):
    """
    Use an inverted index to skip pairs with no shared fingerprints.
    """
    # Build inverted index: fingerprint hash -> files containing it
    inverted = defaultdict(set)
    for filepath, fp_set in fingerprints.items():
        for hash_val in fp_set:
            inverted[hash_val].add(filepath)
    # Compute similarity only for pairs that share at least one hash.
    # Track examined pairs explicitly, so a pair sharing many hashes
    # is scored once rather than re-scored for every shared bucket.
    scores = {}
    seen = set()
    for file_set in inverted.values():
        file_list = sorted(file_set)
        for i in range(len(file_list)):
            for j in range(i + 1, len(file_list)):
                pair = (file_list[i], file_list[j])
                if pair in seen:
                    continue
                seen.add(pair)
                sim = jaccard_similarity(fingerprints[pair[0]],
                                         fingerprints[pair[1]])
                if sim >= threshold:
                    scores[pair] = sim
    return scores
```
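The first optimization, LSH, can be sketched with MinHash signatures and banding. The parameters below (64 hash functions, 8 bands) are illustrative and would need the same calibration as k and window_size:

```python
# MinHash + banding LSH: files that collide in at least one band
# become candidate pairs; everything else is never compared.
import hashlib
from collections import defaultdict

def minhash_signature(fp_set, num_hashes=64):
    # A cheap hash family: salt each fingerprint with the row index
    return tuple(
        min(int(hashlib.md5(f"{seed}:{h}".encode()).hexdigest(), 16)
            for h in fp_set)
        for seed in range(num_hashes))

def candidate_pairs(fingerprints, num_hashes=64, bands=8):
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for path, fp in fingerprints.items():
        if not fp:
            continue
        sig = minhash_signature(fp, num_hashes)
        for b in range(bands):
            # Files with an identical band land in the same bucket
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(path)
    pairs = set()
    for group in buckets.values():
        members = sorted(group)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

After bucketing, jaccard_similarity runs only on the candidate pairs, which in practice reduces the quadratic scan to something close to linear for cohorts where most submissions are dissimilar.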
Calibrating for Your Use Case
There is no universal threshold or k-gram size. Calibration requires a labeled dataset. Here is a process that works for university courses:
- Collect a hand-verified set: Take 50 submissions from a previous semester. Have TAs manually verify which pairs are plagiarized.
- Run the pipeline with different parameters: Try k = 3, 5, 7 and window sizes 2, 4, 6. Record the precision and recall for each configuration.
- Select the best configuration: For academic integrity, err on the side of higher recall (catching more cases) even if precision drops slightly. For code review in production, prioritize precision to avoid false accusations.
- Document and reuse: Once calibrated, the same configuration can be used for similar assignments. Different problem types may require different settings — a recursion assignment behaves differently from a file I/O assignment.
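The parameter sweep above needs a scoring function. A minimal sketch, assuming the TA-verified plagiarized pairs are available as a set of tuples:

```python
def precision_recall(flagged_pairs, true_pairs):
    """Score one pipeline configuration against verified ground truth."""
    def norm(pairs):
        # Treat (a, b) and (b, a) as the same pair
        return {tuple(sorted(p)) for p in pairs}
    flagged, true = norm(flagged_pairs), norm(true_pairs)
    tp = len(flagged & true)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

# Two flagged pairs, one verified case (order-reversed), one hit:
p, r = precision_recall({('s1', 's2'), ('s3', 's4')}, {('s2', 's1')})
# p == 0.5 (one of two flags correct), r == 1.0 (the one true case caught)
```

Running this over a grid of (k, window_size, threshold) values and picking the configuration with the best recall at acceptable precision is the whole calibration loop.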
Practical note: Platforms like Codequiry automate this calibration process across multiple detection strategies — including token-based, AST-based, and Winnowing fingerprints — and present results in a unified dashboard. For many teams, this removes the operational burden of maintaining a custom pipeline while providing the same level of transparency through detailed similarity reports.
Limitations and Next Steps
This pipeline is functional but has limitations. It does not handle:
- Cross-language plagiarism: A student translating C++ code to Python will not be detected by AST matching because the ASTs are structurally different.
- Dead code insertion: Adding dummy functions or irrelevant variable declarations dilutes the fingerprint and lowers similarity scores.
- Control flow obfuscation: Changing if-else chains to switch statements, or restructuring loops, changes the AST structure.
These limitations can be addressed with additional techniques: cross-language normalization (converting both languages to a common IR like three-address code), dead code removal heuristics, and control flow graph comparison. Each of these deserves its own article.
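One inexpensive partial mitigation for dead-code dilution is worth noting here: replacing Jaccard with containment similarity, which divides the fingerprint overlap by the smaller set instead of the union, so padding one file with junk cannot drag the score down. A sketch:

```python
def containment_similarity(set1, set2):
    """Overlap relative to the smaller fingerprint set."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / min(len(set1), len(set2))

# A copied core diluted with dummy fingerprints: Jaccard drops sharply,
# containment does not.
core = set(range(50))
padded = core | set(range(1000, 1150))  # 150 junk fingerprints added

jaccard = len(core & padded) / len(core | padded)   # 50 / 200 = 0.25
containment = containment_similarity(core, padded)  # 50 / 50  = 1.0
```

Reporting both scores side by side makes the padding pattern itself visible: a pair with low Jaccard but high containment is a strong signal that one submission is a diluted copy of the other.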
The pipeline described here provides a solid foundation. It is transparent, configurable, and extensible. Whether you are a TA tired of manually checking submissions against MOSS output, or an engineering manager wanting to verify contractor code originality, building your own similarity pipeline gives you control over the detection process. You understand exactly what is being compared, how the fingerprints are generated, and why two files received a particular similarity score.
That understanding is the difference between a black-box tool and a genuine integrity system.