When AI Writes Code You Can't Review
There's a growing assumption in the AI discourse that humans should always be able to review and verify AI-generated code. That a developer should read every line, understand every algorithm, and sign off on every commit. I want to challenge that assumption with a concrete experiment.
I took a research paper I don't fully understand, gave it to Claude as a specification, and asked it to build a working application. The result is a WebGPU-accelerated image compressor running in the browser — implementing mathematics that's beyond my ability to verify line by line. And yet, I'm confident the code works correctly. Here's why, and what this means for the future of software development.
The Experiment
The paper is "Switching Games for Image Compression" by Marko Huhtanen (IEEE Signal Processing Letters, 2025). It describes a compression technique using diagonal matrix scaling, alternating least-squares optimization, and 2D DCT transforms. The mathematical framework involves double orthogonality conditions, Cholesky decomposition of Gram matrices, and Stockham auto-sort FFT implementations.
I can follow the paper's intuition: you start with a rough approximation of an image using DCT coefficients, then iteratively refine it by finding optimal "switches" (diagonal scaling matrices) from both sides. But the precise mathematics of the ALS solver, the FFT-based IDCT normalization factors, and the GPU shader implementations? Those are beyond what I can confidently review.
The code Claude produced spans roughly 4,000 lines across seven files: WebGPU compute shaders in WGSL (a language I've never written), a Stockham FFT with mixed radix-2/radix-3 stages, a batched Cholesky solver running on the GPU, histogram-based top-k selection, and a complete binary file format encoder/decoder. Not exactly code you can skim and approve.
The Verification Problem
Traditional code review assumes the reviewer can evaluate correctness by reading. But several things make that impractical here:
- Domain expertise gap — The mathematics (Makhoul 1980 IDCT normalization, Cholesky decomposition stability in f32) requires signal processing knowledge I don't have.
- Language unfamiliarity — WGSL compute shaders have unique semantics (workgroup barriers, storage buffer access patterns) that aren't obvious from reading.
- System interactions — WebGPU buffer alignment, maxStorageBufferBindingSize limits, and browser module caching created bugs that only manifest at runtime, not in code review.
- Scale — 4,000 lines of interconnected GPU code is a lot to hold in your head, especially when one normalization constant being wrong breaks everything silently.
How I Built Confidence Anyway
If you can't review the code, how do you trust it? I used four complementary strategies:
1. Feature as Proof
The strongest evidence is that the application works. You can load an 8192×6144 image, compress it to 38% of its original size, decompress it in 1 second on the GPU, and visually compare it with the original. The PSNR numbers match or exceed the Python reference implementation. The files round-trip correctly between browser and Python.
This isn't a toy demo — the compression involves real linear algebra running on the GPU, and producing correct results is strong evidence that the underlying mathematics is implemented correctly. A wrong normalization factor doesn't produce a slightly degraded image; it produces garbage (as we discovered three times during development).
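The "does it work" check can be made quantitative. A minimal sketch of the kind of PSNR comparison involved, in the style of the Python reference implementation (the function and the toy images are illustrative, not from the project):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images; higher is better."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A reconstruction is compared against the original; the GPU result should
# land within a small margin of the reference implementation's number.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
approx = np.clip(img + rng.normal(0, 2.0, img.shape), 0, 255)
print(round(psnr(img, approx), 1))
```

Matching PSNR against the reference on the same input is what turns "the image looks right" into a checkable claim.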
2. AI-Assisted Security Scanning
I asked Claude to perform an intensive security audit of the SWG3 binary parser. It found 7 vulnerabilities:
| Severity | Finding | Status |
|---|---|---|
| Critical | Zlib decompression bomb (no payload size limit) | Fixed |
| Critical | Out-of-bounds GPU write via unchecked scatter indices | Fixed |
| High | Integer overflow in n×k pixel count (65536²) | Fixed |
| High | Payload offset reads past end without bounds checking | Fixed |
| Medium | compressedSize not validated against file size | Fixed |
| Medium | NaN/Infinity in float16 diagonals propagate to GPU | Fixed |
| Low | TypedArray alignment assumption in delta decoder | Documented |
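The decompression-bomb fix maps to a standard pattern: inflate with a hard output cap instead of a one-shot decompress. A hedged Python sketch (the real parser is JavaScript, and the 64 MB cap here is illustrative, not the project's actual limit):

```python
import zlib

MAX_DECOMPRESSED = 64 * 1024 * 1024  # illustrative cap, not the project's limit

def safe_decompress(payload: bytes, limit: int = MAX_DECOMPRESSED) -> bytes:
    """Inflate `payload`, raising if the output would exceed `limit` bytes."""
    d = zlib.decompressobj()
    out = d.decompress(payload, limit)  # stop producing output at `limit`
    if not d.eof or d.unconsumed_tail:
        # Either the cap was hit with input left over, or the stream is truncated.
        raise ValueError("decompression bomb or truncated stream")
    return out

bomb = zlib.compress(b"\x00" * 1_000_000)  # ~1 MB of zeros compresses to ~1 KB
print(len(safe_decompress(bomb)))          # fine under a 64 MB cap
```

The key detail is the `limit` argument: without it, a few kilobytes of crafted input can expand to gigabytes before any size check runs.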
3. Cross-AI Review (GitHub Copilot)
I ran GitHub Copilot's code review on the repository. It found an additional XSS vulnerability that Claude's audit missed: the info panel was injecting user-supplied filenames via `innerHTML`. A file named `<img src=x onerror=alert(1)>.swg` would execute arbitrary JavaScript.
Copilot also found encoding inconsistencies between the JavaScript and Python implementations, dead code, and missing type annotations. Using multiple AI reviewers catches different classes of issues — much like having multiple human reviewers with different expertise.
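The class of fix is the same in any language: never interpolate untrusted strings into markup. In the app the fix was to assign the filename via `textContent` rather than `innerHTML`; the equivalent escaping step, shown in Python for illustration since the real UI code is JavaScript:

```python
import html

# The hostile filename from the audit.
filename = "<img src=x onerror=alert(1)>.swg"

# Escaping turns markup metacharacters into inert entities before display.
safe = html.escape(filename)
print(safe)  # &lt;img src=x onerror=alert(1)&gt;.swg
```

Either approach (escaping, or DOM APIs that treat input as text) closes the hole; what matters is that the filename never reaches an HTML parser as markup.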
4. Automated Testing with Browser Automation
We used Playwright to run automated end-to-end tests:
- DCT ↔ IDCT round-trip (max error < 0.0001)
- Compress → download → re-decode pixel identity (maxDiff = 0)
- Browser ↔ Python round-trip compatibility (PSNR ≈ 49 dB after f16 quantization)
- WebGPU validation error monitoring (0 errors in final build)
- Large image decode performance (8192×6144 in ~1 second)
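The DCT round-trip invariant is easy to reproduce outside the browser. A numpy sketch using an explicit orthonormal DCT-II matrix (the project's GPU path is FFT-based, but the property being tested is the same):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: C @ C.T == I, so C.T is the inverse (DCT-III)."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    C = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2.0 / n)
    C[0, :] /= np.sqrt(2.0)  # DC row scaled to make the matrix orthonormal
    return C

n = 64
C = dct_matrix(n)
signal = np.random.default_rng(1).standard_normal(n)
round_trip = C.T @ (C @ signal)  # IDCT(DCT(x))
max_err = np.max(np.abs(round_trip - signal))
print(max_err < 1e-4)  # the same tolerance as the browser test
```

For an orthonormal transform the round-trip error should sit near machine precision, so a `1e-4` bound leaves generous headroom while still catching any normalization mistake.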
What Broke Along the Way
The bugs we found are instructive about the kinds of errors AI-generated code contains:
IDCT Normalization (Makhoul 1980)
The FFT-based inverse DCT pre-twiddle was multiplying by N instead of dividing by N, and missing a factor-of-2 correction for AC terms. This is a subtle mathematical error — the kind that passes a cursory code review because the formula looks reasonable. The IDCT produced values 1000x too large, but the ALS solver compensated by producing correspondingly tiny diagonal matrices, masking the bug until we measured PSNR against the reference implementation.
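The masking effect is easy to demonstrate: in a product D₁·B·D₂ with diagonal D₁ and D₂, any uniform scale error in B is absorbed by the diagonals. A small numpy illustration (the matrices are random stand-ins, not the actual DCT basis):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
d1 = rng.uniform(1, 2, 4)
d2 = rng.uniform(1, 2, 4)

correct = np.diag(d1) @ B @ np.diag(d2)

# Suppose the IDCT mistakenly scales B by 1000. An ALS solver fitting the
# diagonals simply finds values 1000x smaller, and the product is identical,
# so the bug stays invisible until intermediates are compared to a reference.
buggy = np.diag(d1 / 1000.0) @ (1000.0 * B) @ np.diag(d2)

print(np.allclose(correct, buggy))  # True
```

This is why end-to-end output alone wasn't enough here: only measuring PSNR against the reference implementation, stage by stage, exposed the wrong normalization.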
Cholesky Regularization
The batch Cholesky solver used fixed regularization of 1e-6, which was insufficient for Gram matrices with diagonal values in the tens of thousands (from squaring pixel values in range 0-255). After 2-3 ALS iterations, the solver produced NaN values that propagated through everything. The fix was to scale regularization proportionally to the diagonal magnitude.
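The fix generalizes to a one-line rule: make the jitter proportional to the diagonal, not absolute. A Python sketch in float32 (the actual solver is a WGSL shader; the exactly singular rank-1 Gram matrix is a contrived worst case for illustration):

```python
import numpy as np

def cholesky_ok(G: np.ndarray) -> bool:
    """True if the Cholesky factorization succeeds (matrix is positive definite)."""
    try:
        np.linalg.cholesky(G)
        return True
    except np.linalg.LinAlgError:
        return False

# A rank-1 Gram matrix v v^T with pixel-scale values: diagonal entries in the
# tens of thousands, and the matrix is exactly singular.
v = np.array([100.0, 200.0, 255.0], dtype=np.float32)
G = np.outer(v, v)

# Absolute jitter of 1e-6 vanishes when added to ~1e4 in f32 precision;
# jitter scaled by the diagonal magnitude survives rounding.
fixed = G + 1e-6 * np.eye(3, dtype=np.float32)
scaled = G + 1e-6 * np.max(np.diag(G)) * np.eye(3, dtype=np.float32)

print(cholesky_ok(fixed), cholesky_ok(scaled))
```

The float32 unit of least precision at 10,000 is about 0.001, so an absolute `1e-6` addend rounds away entirely, while `1e-6 * max(diag)` is large enough to register and restore positive definiteness.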
WebGPU Buffer Limits
The FFT needed 384 MB complex buffers for 8192×6144 images, but `maxStorageBufferBindingSize` is typically 128 MB (distinct from `maxBufferSize`, typically 256 MB). This required implementing row-batched FFT processing with WebGPU bind group buffer offsets, a non-obvious solution that wouldn't be caught by code review alone.
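The workaround reduces to simple arithmetic: split the rows into batches whose slice of the big buffer fits under the binding limit, then bind each slice at an offset. A sketch of that batch-size computation using the numbers from the post (each batch offset is a multiple of 64 KiB, so WebGPU's 256-byte storage-buffer offset alignment is satisfied automatically):

```python
# Row-batched FFT dispatch: each row of an 8192x6144 complex float32 image
# needs width * 2 floats * 4 bytes = 64 KiB, and the full buffer is 384 MiB,
# which exceeds the typical 128 MiB maxStorageBufferBindingSize.
WIDTH, HEIGHT = 8192, 6144
BYTES_PER_ROW = WIDTH * 2 * 4                    # one complex64 value per pixel
BINDING_LIMIT = 128 * 1024 * 1024                # typical maxStorageBufferBindingSize

rows_per_batch = BINDING_LIMIT // BYTES_PER_ROW  # rows that fit in one binding
num_batches = -(-HEIGHT // rows_per_batch)       # ceiling division

print(rows_per_batch, num_batches)               # 2048 rows per batch, 3 batches
```

Each dispatch then binds the same buffer with `offset = batch_index * rows_per_batch * BYTES_PER_ROW`, keeping every binding under the limit while still processing the whole image.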
The Uncomfortable Truth
Here's what I think this experiment demonstrates:
Code review is not the only, or even the best, way to verify correctness. For mathematically complex code, behavioral testing (does it produce the right output?), cross-validation (does it match a reference implementation?), and adversarial analysis (what happens with malicious input?) are often more effective than reading source code.
This has always been true — few developers can verify a TLS implementation by reading it — but AI makes it more visible because the volume and complexity of AI-generated code can exceed any individual's review capacity.
The practical approach isn't "never trust AI code" or "always review every line." It's to build a verification pipeline that combines:
- Feature-level proof — does it actually work?
- Reference comparison — does it match known-good implementations?
- Multi-AI review — different models catch different bugs
- Security scanning — dedicated adversarial analysis of trust boundaries
- Eventually, human expert review — targeted at the critical paths, informed by the automated findings
The key shift is from "I reviewed the code and it looks right" to "I tested the behavior, scanned for vulnerabilities, cross-validated with references, and had multiple AI systems review it." The second approach is arguably more rigorous than traditional code review, even if no human read every line.
Try It Yourself
The live demo runs entirely in your browser using WebGPU. Load the demo image (8192×6144, compressed to 38% of the original JPEG), or compress your own photos using the Compress tab. The source code is on GitHub, including the original paper, the Python reference implementation, and all 12 task files documenting what was built and verified.
If you're a signal processing expert and want to review the mathematical correctness of the GPU shaders, I'd love to hear from you. That's the one verification step we haven't done yet — and it would complete the pipeline.