When AI Writes Code You Can't Review
There's a growing assumption in the AI discourse that humans should always be able to review and verify AI-generated code. That a developer should read every line, understand every algorithm, and sign off on every commit. I want to challenge that assumption with a concrete experiment.
I took a research paper I don't fully understand, gave it to Claude as a specification, and asked it to build a working application. The result is a WebGPU-accelerated image compressor running in the browser — implementing mathematics that's beyond my ability to verify line by line. And yet, I'm confident the code works correctly. Here's why, and what this means for the future of software development.
The Experiment
The paper is "Switching Games for Image Compression" by Marko Huhtanen (IEEE Signal Processing Letters, 2025). It describes a compression technique using diagonal matrix scaling, alternating least-squares optimization, and 2D DCT transforms. The mathematical framework involves double orthogonality conditions, Cholesky decomposition of Gram matrices, and Stockham auto-sort FFT implementations.
I can follow the paper's intuition: you start with a rough approximation of an image using DCT coefficients, then iteratively refine it by finding optimal "switches" (diagonal scaling matrices) from both sides. But the precise mathematics of the ALS solver, the FFT-based IDCT normalization factors, and the GPU shader implementations? Those are beyond what I can confidently review.
The code Claude produced spans roughly 4,000 lines across seven files: WebGPU compute shaders in WGSL (a language I've never written), a Stockham FFT with mixed radix-2/radix-3 stages, a batched Cholesky solver running on the GPU, histogram-based top-k selection, and a complete binary file format encoder/decoder. Not exactly code you can skim and approve.
The Verification Problem
Traditional code review assumes the reviewer can evaluate correctness by reading. But several things make that impractical here:
- Domain expertise gap — The mathematics (Makhoul 1980 IDCT normalization, Cholesky decomposition stability in f32) requires signal processing knowledge I don't have.
- Language unfamiliarity — WGSL compute shaders have unique semantics (workgroup barriers, storage buffer access patterns) that aren't obvious from reading.
- System interactions — WebGPU buffer alignment, maxStorageBufferBindingSize limits, and browser module caching created bugs that only manifest at runtime, not in code review.
- Scale — 4,000 lines of interconnected GPU code is a lot to hold in your head, especially when one normalization constant being wrong breaks everything silently.
How I Built Confidence Anyway
If you can't review the code, how do you trust it? I used four complementary strategies:
1. Feature as Proof
The strongest evidence is that the application works. You can load an 8192×6144 image, compress it to 38% of its original size, decompress it in 1 second on the GPU, and visually compare it with the original. The PSNR numbers match or exceed the Python reference implementation. The files round-trip correctly between browser and Python.
This isn't a toy demo — the compression involves real linear algebra running on the GPU, and producing correct results is strong evidence that the underlying mathematics is implemented correctly. A wrong normalization factor doesn't produce a slightly degraded image; it produces garbage (as we discovered three times during development).
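The "does it work" check can be made quantitative. A minimal sketch of the kind of PSNR comparison involved, in the style of the Python reference implementation (the function and the toy images are illustrative, not from the project):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images; higher is better."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A reconstruction is compared against the original; the GPU result should
# land within a small margin of the reference implementation's number.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
approx = np.clip(img + rng.normal(0, 2.0, img.shape), 0, 255)
print(round(psnr(img, approx), 1))
```

Matching PSNR against the reference on the same input is what turns "the image looks right" into a checkable claim.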
2. AI-Assisted Security Scanning
I asked Claude to perform an intensive security audit of the SWG3 binary parser. It found 7 vulnerabilities:
| Severity | Finding | Status |
|---|---|---|
| Critical | Zlib decompression bomb (no payload size limit) | Fixed |
| Critical | Out-of-bounds GPU write via unchecked scatter indices | Fixed |
| High | Integer overflow in n×k pixel count (65536²) | Fixed |
| High | Payload offset reads past end without bounds checking | Fixed |
| Medium | compressedSize not validated against file size | Fixed |
| Medium | NaN/Infinity in float16 diagonals propagate to GPU | Fixed |
| Low | TypedArray alignment assumption in delta decoder | Documented |
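The decompression-bomb fix maps to a standard pattern: inflate with a hard output cap instead of a one-shot decompress. A hedged Python sketch (the real parser is JavaScript, and the 64 MB cap here is illustrative, not the project's actual limit):

```python
import zlib

MAX_DECOMPRESSED = 64 * 1024 * 1024  # illustrative cap, not the project's limit

def safe_decompress(payload: bytes, limit: int = MAX_DECOMPRESSED) -> bytes:
    """Inflate `payload`, raising if the output would exceed `limit` bytes."""
    d = zlib.decompressobj()
    out = d.decompress(payload, limit)  # stop producing output at `limit`
    if not d.eof or d.unconsumed_tail:
        # Either the cap was hit with input left over, or the stream is truncated.
        raise ValueError("decompression bomb or truncated stream")
    return out

bomb = zlib.compress(b"\x00" * 1_000_000)  # ~1 MB of zeros compresses to ~1 KB
print(len(safe_decompress(bomb)))          # fine under a 64 MB cap
```

The key detail is the `limit` argument: without it, a few kilobytes of crafted input can expand to gigabytes before any size check runs.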
3. Cross-AI Review (GitHub Copilot)
I ran GitHub Copilot's code review on the repository. It found an additional XSS vulnerability that Claude's audit missed: the info panel was injecting user-supplied filenames via `innerHTML`. A file named `<img src=x onerror=alert(1)>.swg` would execute arbitrary JavaScript.
Copilot also found encoding inconsistencies between the JavaScript and Python implementations, dead code, and missing type annotations. Using multiple AI reviewers catches different classes of issues — much like having multiple human reviewers with different expertise.
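The class of fix is the same in any language: never interpolate untrusted strings into markup. In the app the fix was to assign the filename via `textContent` rather than `innerHTML`; the equivalent escaping step, shown in Python for illustration since the real UI code is JavaScript:

```python
import html

# The hostile filename from the audit.
filename = "<img src=x onerror=alert(1)>.swg"

# Escaping turns markup metacharacters into inert entities before display.
safe = html.escape(filename)
print(safe)  # &lt;img src=x onerror=alert(1)&gt;.swg
```

Either approach (escaping, or DOM APIs that treat input as text) closes the hole; what matters is that the filename never reaches an HTML parser as markup.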
4. Automated Testing with Browser Automation
We used Playwright to run automated end-to-end tests:
- DCT ↔ IDCT round-trip (max error < 0.0001)
- Compress → download → re-decode pixel identity (maxDiff = 0)
- Browser ↔ Python round-trip compatibility (PSNR ≈ 49 dB after f16 quantization)
- WebGPU validation error monitoring (0 errors in final build)
- Large image decode performance (8192×6144 in ~1 second)
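The DCT round-trip invariant is easy to reproduce outside the browser. A numpy sketch using an explicit orthonormal DCT-II matrix (the project's GPU path is FFT-based, but the property being tested is the same):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: C @ C.T == I, so C.T is the inverse (DCT-III)."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    C = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2.0 / n)
    C[0, :] /= np.sqrt(2.0)  # DC row scaled to make the matrix orthonormal
    return C

n = 64
C = dct_matrix(n)
signal = np.random.default_rng(1).standard_normal(n)
round_trip = C.T @ (C @ signal)  # IDCT(DCT(x))
max_err = np.max(np.abs(round_trip - signal))
print(max_err < 1e-4)  # the same tolerance as the browser test
```

For an orthonormal transform the round-trip error should sit near machine precision, so a `1e-4` bound leaves generous headroom while still catching any normalization mistake.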
What Broke Along the Way
The bugs we found are instructive about the kinds of errors AI-generated code contains:
IDCT Normalization (Makhoul 1980)
The FFT-based inverse DCT pre-twiddle was multiplying by N instead of dividing by N, and missing a factor-of-2 correction for AC terms. This is a subtle mathematical error — the kind that passes a cursory code review because the formula looks reasonable. The IDCT produced values 1000x too large, but the ALS solver compensated by producing correspondingly tiny diagonal matrices, masking the bug until we measured PSNR against the reference implementation.
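The masking effect is easy to demonstrate: in a product D₁·B·D₂ with diagonal D₁ and D₂, any uniform scale error in B is absorbed by the diagonals. A small numpy illustration (the matrices are random stand-ins, not the actual DCT basis):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
d1 = rng.uniform(1, 2, 4)
d2 = rng.uniform(1, 2, 4)

correct = np.diag(d1) @ B @ np.diag(d2)

# Suppose the IDCT mistakenly scales B by 1000. An ALS solver fitting the
# diagonals simply finds values 1000x smaller, and the product is identical,
# so the bug stays invisible until intermediates are compared to a reference.
buggy = np.diag(d1 / 1000.0) @ (1000.0 * B) @ np.diag(d2)

print(np.allclose(correct, buggy))  # True
```

This is why end-to-end output alone wasn't enough here: only measuring PSNR against the reference implementation, stage by stage, exposed the wrong normalization.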
Cholesky Regularization
The batch Cholesky solver used fixed regularization of 1e-6, which was insufficient for Gram matrices with diagonal values in the tens of thousands (from squaring pixel values in range 0-255). After 2-3 ALS iterations, the solver produced NaN values that propagated through everything. The fix was to scale regularization proportionally to the diagonal magnitude.
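The fix generalizes to a one-line rule: make the jitter proportional to the diagonal, not absolute. A Python sketch in float32 (the actual solver is a WGSL shader; the exactly singular rank-1 Gram matrix is a contrived worst case for illustration):

```python
import numpy as np

def cholesky_ok(G: np.ndarray) -> bool:
    """True if the Cholesky factorization succeeds (matrix is positive definite)."""
    try:
        np.linalg.cholesky(G)
        return True
    except np.linalg.LinAlgError:
        return False

# A rank-1 Gram matrix v v^T with pixel-scale values: diagonal entries in the
# tens of thousands, and the matrix is exactly singular.
v = np.array([100.0, 200.0, 255.0], dtype=np.float32)
G = np.outer(v, v)

# Absolute jitter of 1e-6 vanishes when added to ~1e4 in f32 precision;
# jitter scaled by the diagonal magnitude survives rounding.
fixed = G + 1e-6 * np.eye(3, dtype=np.float32)
scaled = G + 1e-6 * np.max(np.diag(G)) * np.eye(3, dtype=np.float32)

print(cholesky_ok(fixed), cholesky_ok(scaled))
```

The float32 unit of least precision at 10,000 is about 0.001, so an absolute `1e-6` addend rounds away entirely, while `1e-6 * max(diag)` is large enough to register and restore positive definiteness.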
WebGPU Buffer Limits
The FFT needed 384 MB complex buffers for 8192×6144 images, but `maxStorageBufferBindingSize` is typically 128 MB (distinct from `maxBufferSize`, typically 256 MB). This required implementing row-batched FFT processing with WebGPU bind group buffer offsets, a non-obvious solution that wouldn't be caught by code review alone.
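The workaround reduces to simple arithmetic: split the rows into batches whose slice of the big buffer fits under the binding limit, then bind each slice at an offset. A sketch of that batch-size computation using the numbers from the post (each batch offset is a multiple of 64 KiB, so WebGPU's 256-byte storage-buffer offset alignment is satisfied automatically):

```python
# Row-batched FFT dispatch: each row of an 8192x6144 complex float32 image
# needs width * 2 floats * 4 bytes = 64 KiB, and the full buffer is 384 MiB,
# which exceeds the typical 128 MiB maxStorageBufferBindingSize.
WIDTH, HEIGHT = 8192, 6144
BYTES_PER_ROW = WIDTH * 2 * 4                    # one complex64 value per pixel
BINDING_LIMIT = 128 * 1024 * 1024                # typical maxStorageBufferBindingSize

rows_per_batch = BINDING_LIMIT // BYTES_PER_ROW  # rows that fit in one binding
num_batches = -(-HEIGHT // rows_per_batch)       # ceiling division

print(rows_per_batch, num_batches)               # 2048 rows per batch, 3 batches
```

Each dispatch then binds the same buffer with `offset = batch_index * rows_per_batch * BYTES_PER_ROW`, keeping every binding under the limit while still processing the whole image.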
The Uncomfortable Truth
Here's what I think this experiment demonstrates:
Code review is not the only, or even the best, way to verify correctness. For mathematically complex code, behavioral testing (does it produce the right output?), cross-validation (does it match a reference implementation?), and adversarial analysis (what happens with malicious input?) are often more effective than reading source code.
This has always been true — few developers can verify a TLS implementation by reading it — but AI makes it more visible because the volume and complexity of AI-generated code can exceed any individual's review capacity.
The practical approach isn't "never trust AI code" or "always review every line." It's to build a verification pipeline that combines:
- Feature-level proof — does it actually work?
- Reference comparison — does it match known-good implementations?
- Multi-AI review — different models catch different bugs
- Security scanning — dedicated adversarial analysis of trust boundaries
- Eventually, human expert review — targeted at the critical paths, informed by the automated findings
The key shift is from "I reviewed the code and it looks right" to "I tested the behavior, scanned for vulnerabilities, cross-validated with references, and had multiple AI systems review it." The second approach is arguably more rigorous than traditional code review, even if no human read every line.
Try It Yourself
The live demo runs entirely in your browser using WebGPU. Load the demo image (8192×6144, compressed to 38% of the original JPEG), or compress your own photos using the Compress tab. The source code is on GitHub, including the original paper, the Python reference implementation, and all 12 task files documenting what was built and verified.
If you're a signal processing expert and want to review the mathematical correctness of the GPU shaders, I'd love to hear from you. That's the one verification step we haven't done yet — and it would complete the pipeline.