1 month before PSC (Performance Summary Cycle, Meta's biannual review), I was drowning in desperation.

My manager had been crystal clear during our last one-on-one:

"Your impact numbers look great, but your Better Engineering diffs are concerning. You need to pump up those BE diffs, or we're going to have a problem in calibrations"

For those outside Meta's performance machine, Better Engineering is one of four axes every engineer gets evaluated on every 6 months. It's not about shipping features or moving metrics; it's about leaving the codebase healthier than you found it. Writing tests that actually catch bugs. Refactoring gnarly functions into something readable. Fixing those papercut issues that make everyone's day a little worse. The unglamorous work that doesn't show up in impact, but keeps the entire engineering org from collapsing under its own complexity.

And I'd been completely ignoring it.

For months, I'd been laser-focused on shipping. Product launches, user growth, user-session impact: all the shiny metrics that look good in promo packets. Meanwhile, my Better Engineering contributions were embarrassingly thin. A few token test additions here, a minor refactor there. Nothing that would impress a calibration committee. Especially not at the E6 level.

The math was brutal: I had 3 weeks to demonstrate months' worth of codebase stewardship, or risk getting feedback that would torpedo my trajectory.

That's when the stress hit me like a deployment gone wrong.

The Glucose Spike That Changed Everything

The irony wasn't lost on me… here I was, stressed about code quality metrics, sitting in a doctor's office getting lectured about actual health metrics.

"Your glucose is elevated"

My doctor said, stylus hovering over her tablet with the kind of concerned precision that makes you suddenly very aware of your mortality.

"But here's what's interesting, everything else is within healthy range. In the upper range, but healthy… Cholesterol, blood pressure, liver function, inflammation markers. All textbook normal."

She leaned back, and I could see the pattern recognition kicking in… ironically, the same look engineers get when they're connecting dots across seemingly unrelated symptoms.

"If glucose was the only thing off, I'd tell you to maybe don't have a hearty dinner before your next test. But let me ask you something… have you been stressed lately? Working long hours? Skipping meals?"

The questions hit like a perfectly timed code review comment. Of course I'd been stressed. Three weeks until PSC, scrambling to manufacture months of Better Engineering contributions. Energy drinks for breakfast, frantically refactoring legacy modules until midnight, stress levels that could power Meta's entire data center.

"Here's what I think is happening… Your body is responding to chronic stress. Glucose is just the canary in the coal mine. The other markers aren't elevated yet, but they're trending. If we don't address the underlying patterns now, in 6 months your entire panel will be a mess."

The epiphany hit me harder than a production outage at peak traffic.

I'd been thinking about code quality, about Better Engineering itself, completely backwards.

Sitting in that sterile office, staring at my glucose spike while my doctor talked about pattern recognition and early warning systems, I realized I'd been making the same mistake with code metrics that I was making with my own health: obsessing over individual numbers instead of reading the story they told together.

The Metrics That Fooled 10,000 Engineers

Back at Meta, I pulled up our internal tests site. Thousands of tests, all color-coded in neat green-yellow-red charts. Test coverage here, test flakiness there… Each metric living in its own little box, each one telling its own little story.

Just like my glucose reading.

But what if the story wasn't in the individual metrics? What if it was in how they moved together?

I started digging into the modules that had caused the most production fires over the past year. The ones that made engineers groan during code reviews, the ones that spawned those dreaded "can someone help me understand this?" messages.

  • Module A: Cyclomatic complexity 12 (yellow), but everything else green.

  • Module B: High fan-out (red), but manageable complexity and decent coverage.

  • Module C: Low test coverage (red), but simple logic and clean interfaces.

In isolation, none of these looked catastrophic. Our tooling would flag them, sure, but not as high-priority concerns. Reviewers would glance, shrug, and approve.

Then I started looking at the combinations.

The Pattern That Predicts Disasters

Module X had been the source of three separate outages in Q2. When I mapped its metrics, the story became crystal clear:

  • Cyclomatic complexity: high

  • Argument count: 6 average (red)

  • Nesting depth: ~5 levels (red)

  • Halstead volume: very high (~9000) (red)

  • Test coverage: 50% (yellow)

  • Fan-out coupling: 24 (red)

This wasn't just bad code. This was a disaster waiting to happen.

High cognitive load (complexity + nesting + Halstead volume), brittle interfaces (high argument count + fan-out coupling), and an insufficient safety net (low coverage), all in the same module.

Building the Diagnostic Framework

That night, I started sketching what would become the framework that changed how I thought about code quality.

The insight was borrowed straight from medical diagnostics: cluster analysis.

Instead of treating metrics as independent variables, I started grouping them into diagnostic clusters:

  • The Cognitive Load Cluster: Cyclomatic complexity + argument count + nesting depth + Halstead volume.
    When these spike together, you're looking at code that will break your brain.

  • The Brittleness Cluster: Fan-out coupling + instability + RFC (Response For Class) + dependency cycles.
    When these align, changes ripple unpredictably.

  • The Safety Net Cluster: Test coverage + mutation score + assertion strength.
    When these drop together, you're flying blind.

  • The Architecture Integrity Cluster: LCOM4 cohesion + CBO coupling + file size + naming consistency.
    When these degrade together, your design is crumbling.
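
To make that concrete, here's a minimal sketch of the clusters as plain data. This isn't Meta's internal tooling; the metric keys are hypothetical placeholders for whatever your static-analysis stack already emits.

```python
# Hypothetical metric keys, grouped into the four diagnostic clusters.
# Swap these for whatever names your static-analysis tools actually produce.
DIAGNOSTIC_CLUSTERS = {
    "cognitive_load": [
        "cyclomatic_complexity", "argument_count",
        "nesting_depth", "halstead_volume",
    ],
    "brittleness": [
        "fan_out", "instability",
        "response_for_class", "dependency_cycles",
    ],
    "safety_net": [
        "test_coverage", "mutation_score", "assertion_strength",
    ],
    "architecture_integrity": [
        "lcom4_cohesion", "cbo_coupling",
        "file_size", "naming_consistency",
    ],
}
```

A cluster only matters when most of its members degrade together, which is exactly what the scoring below is built to capture.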

But here's the crucial part, and this is where most frameworks fail: I created scoring bands, not hard cutoffs.

Instead of "cyclomatic complexity must be under 10" my framework uses contextual ranges:

  • Green: 1-10 (most code)

  • Yellow: 11-15 (acceptable for complex business logic)

  • Red: 16-19 (refactor immediately)

  • Hard stop: 20+ (do not merge)
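
As a sketch, the banding for cyclomatic complexity might look like the snippet below; the thresholds mirror the ranges above and are meant to be tuned per codebase, not treated as universal truths.

```python
def complexity_band(cc: int) -> str:
    """Map a cyclomatic complexity value to a scoring band."""
    if cc >= 20:
        return "hard_stop"  # do not merge
    if cc >= 16:
        return "red"        # refactor immediately
    if cc >= 11:
        return "yellow"     # acceptable for complex business logic
    return "green"          # most code
```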

Each module gets a composite score across all clusters, weighted by risk and role.

  • Domain models can handle higher fan-in (they're supposed to be central).

  • Feature code can tolerate higher instability (it changes fast).

  • Infrastructure code needs bulletproof coverage (failures cascade).
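
A rough sketch of that weighting, with numbers invented purely for illustration: bands convert to severity points, and each cluster's points are scaled by how much that cluster matters for the module's role.

```python
# Illustrative only: band severity points and per-role cluster weights.
BAND_POINTS = {"green": 0, "yellow": 1, "red": 3, "hard_stop": 10}

ROLE_WEIGHTS = {
    # Infrastructure leans hardest on its safety net; feature code is
    # allowed more brittleness because it churns by design.
    "domain_model":   {"brittleness": 0.7, "safety_net": 1.0},
    "feature":        {"brittleness": 0.5, "safety_net": 0.8},
    "infrastructure": {"brittleness": 1.0, "safety_net": 1.5},
}

def composite_score(cluster_bands: dict[str, str], role: str) -> float:
    """Weighted sum of band severity across clusters; higher means riskier.

    Clusters without an explicit weight for the role default to 1.0.
    """
    weights = ROLE_WEIGHTS.get(role, {})
    return sum(
        BAND_POINTS[band] * weights.get(cluster, 1.0)
        for cluster, band in cluster_bands.items()
    )

# Example: red brittleness plus a yellow safety net in infrastructure code
# scores 3 * 1.0 + 1 * 1.5 = 4.5, higher than the same bands in feature code.
```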

I built a rule that plugged into our team's existing diff workflow. It would automatically leave comments when certain patterns showed up:

"Module X shows severe cognitive load symptoms (CC: 18, Args: 6, Nesting: 5). Combined with insufficient test coverage (50%), this is a change with high risk."

"Module Y exhibits brittleness markers (Fan-out: 24, Cycles: 3, Instability: 0.89). Architecture review recommended before next major feature."

But the real validation came during my major refactor.

The Proof in Production

After writing the rule for my team, I did what any self-respecting engineer does when they think they've built something revolutionary: I tested it against the disasters I already knew about.

I pulled up 6 months of incident postmortems… every production fire, every cascading failure, every "how did this pass code review?" moment that had made our on-call rotation feel like a particularly cruel form of performance art. Then I traced each incident back to its source diff and ran my diagnostic framework against the offending code.

The results were almost comically vindicating. The pattern was unmistakable. Every diff that had eventually caused us pain would have been flagged by the framework months before it exploded. Not flagged politely, either… screaming, sirens-blaring, "are you absolutely sure you want to deploy this nightmare?" flagged.

But here's the beautiful part: I'd just solved my Better Engineering problem with a single rule.

One cleverly designed lint check, properly configured and integrated into our CI pipeline, was now catching more potential issues than months of manual code reviews and ad-hoc refactoring sessions. The framework didn't just improve our code quality; it automated the very behavior that PSC evaluates engineers on.

I'd accidentally built a Better Engineering multiplier, and suddenly those three weeks of panic before PSC felt like a distant memory from someone else's career.

The Framework That Scales

The core insight is that code quality isn't about perfect individual metrics. It's about healthy patterns across multiple dimensions.

The practical application: Build diagnostic clusters that match your failure modes. Weight them by impact. Score them in bands, not binary pass/fail. Most importantly, trust the combinations more than the components.

The scaling secret: Automate the pattern detection, but keep human judgment in the loop. Metrics identify candidates for attention. Engineers make the final call.

Here's the framework I use (the complete scoring model is in the bonus section below):

  • Define your clusters based on your actual failure patterns

  • Set contextual ranges that match your codebase and team maturity

  • Weight by role and risk: not all modules are created equal

  • Score in bands to acknowledge uncertainty and context

  • Track deltas: trend matters more than absolute position

  • Combine automated detection with human review

The Blood Panel for Code

Six months after implementing this framework, I had another physical. This time, my doctor smiled at the results.

"Everything looks better" she said.

"Glucose is normal, but more importantly, all the patterns are healthy. Whatever changes you made, keep doing them."

The same week, our team's production incidents dropped to near zero. Code review times fell by a double-digit percentage (I no longer remember the exact number). Most tellingly, engineers stopped avoiding certain modules during feature development.

We'd learned to read the patterns, not just the individual metrics.

Your code has a blood panel too. The question is: are you reading the glucose number, or the full story?

Ready to build your own diagnostic framework?

Next week's installment gets even more tactical: I'm sharing the exact prompt and workflow we use at Torta Studios to ground our AI agents so they generate code that's indistinguishable from senior-level handwritten work. No more reviewing AI-generated spaghetti that technically compiles but reads like it was written by a very confident intern with a StackOverflow addiction.

The diagnostic patterns that saved my Meta career now power how we teach AI agents to write like the engineers we actually want on our teams.

Currently, we have capacity to help teams with product and growth engineering work. If you're building something interesting and could use hands that think in systems rather than just syntax, drop me a note… let's set up time to talk.
