Simpson's Paradox on the Shop Floor: Segment Before You Decide | Material Model Blog

Same data, opposite conclusions. The roll-up blames Night; the segments show Night ahead. How Simpson's Paradox leads industrial engineers to draw the wrong conclusions.

Ilya Valmianski
[Image: Homer Simpson looks at aggregate vs. group charts illustrating Simpson's paradox in manufacturing yields.]

It’s Monday, and you stare at a neat set of bars rolling up last week’s performance. The roll‑up says “Night underperformed,” and you’re about to draft some action items and a coaching plan. It feels decisive because the picture is clean, but it’s also wrong.

The floor doesn’t make one kind of unit one kind of way: it makes easy builds and hard builds, it runs on fixtures that behave differently as they warm up, and it rotates operators of different skill levels. This leads to Simpson’s paradox for factories: the aggregate trend flips when you split the data along the dimensions that govern difficulty, equipment, or skill.

Plants that ignore it waste coaching on the wrong crews, escalate with the wrong suppliers, and approve the wrong capital. The fix is simple: segment before you decide.

What Actually Happened Last Week

The line missed the weekly target, overtime is already tight, and your inbox has a slide that rolls up the last 2,000 units by shift with a red box around Night. You have to tell the ops lead whether to coach the crew, add another inspection, or change the sequence. Here’s what the roll‑up shows:

Aggregate (what the dashboard shows)

| Shift | Units | Good | FPY |
| --- | --- | --- | --- |
| Day | 1,000 | 930 | 93.0% |
| Night | 1,000 | 819 | 81.9% |

If you stop here, you fall for Simpson’s paradox. If you split by build difficulty, a new pattern emerges:

Within segments (what actually happened)

| Segment | Day FPY | Night FPY | Day Volume | Night Volume |
| --- | --- | --- | --- | --- |
| Easy builds | 95.0% | 99.0% | 900 | 100 |
| Hard builds | 75.0% | 80.0% | 100 | 900 |

Within each segment, Night outperformed Day; it’s just that Night did most of the hard mix. If you reweight to a common mix, the illusion goes away:

Mix‑adjusted view (reweighting)

| Scenario | FPY |
| --- | --- |
| Actual Day | 93.0% |
| Actual Night | 81.9% |
| Day with Night’s mix | 77.0% |
| Night with Day’s mix | 97.1% |

The confounder (here, mix difficulty) influences who gets what work and how that work performs. Looking only at the roll‑ups gives a false impression; you need to segment to see the work as it happened. Looking at the segments can shift the conversation from “why are you worse?” to “what made this segment hard, and how do we balance it?”
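As a quick consistency check, the roll-up and the segment view can be rebuilt from the same unit counts. The sketch below is illustrative only: the good-unit counts are derived from the tables above as volume times FPY (for example, 900 × 95.0% = 855), and the names `counts`, `fpy`, and `rollup` are placeholders rather than anything from a real system.

```python
# (good units, total units) per shift and segment.
# Good counts are derived from the tables above as volume x FPY.
counts = {
    "Day":   {"easy": (855, 900), "hard": (75, 100)},
    "Night": {"easy": (99, 100),  "hard": (720, 900)},
}

def fpy(good, total):
    """First-pass yield as a fraction."""
    return good / total

rollup = {}
for shift, segs in counts.items():
    # The roll-up simply adds good and total units across segments.
    good = sum(g for g, _ in segs.values())
    total = sum(t for _, t in segs.values())
    rollup[shift] = (good, total)
    print(f"{shift} roll-up: {good}/{total} = {fpy(good, total):.1%}")
    for seg, (g, t) in segs.items():
        print(f"    {seg}: {g}/{t} = {fpy(g, t):.1%}")
```

The same four cells produce both stories: summed, Day leads; read cell by cell, Night leads in each segment.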

[Interactive chart: Simpson’s Paradox Revealed. Aggregate view: Night appears worse overall. Segment view: Night outperforms Day in both segments. Series: Day Shift, Night Shift.]

Fixtures Hide Inside Aggregates

Cycle times can have the same trap. Team A looks slow in the roll‑up. But before you intervene, ask a fundamental question: “Did the teams run the same assets under the same conditions?” They didn’t. The busiest asset in the area, Fixture 12, runs a little fast when cold and adds dwell as it warms. The blended average hides this behavior and makes Team A look like the problem because they ran most of the volume on 12.

Aggregate cycle time comparison

| Team | Cycles | Avg CT (s) |
| --- | --- | --- |
| A | 1,200 | 31.0 |
| B | 1,100 | 29.5 |

Cycle time by fixture

| Fixture | Team A CT (s) | Team B CT (s) | Team A Vol | Team B Vol |
| --- | --- | --- | --- | --- |
| Fixture 12 (warms up, drifts) | 32.0 | 34.0 | 900 | 200 |
| Fixture 7 (stable) | 28.0 | 28.5 | 300 | 900 |

Within fixtures, the story flips. On both fixtures, Team A is faster. The fix shifts from blaming Team A to introducing cool‑downs after changeovers, pulling the heaviest options off the first hour after lunch, and scheduling a quick maintenance task tied to the drift. The average improves because the asset does, not because the team is told to “own the number.”
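The same check works for cycle times. The sketch below recomputes each team's blended average from the fixture table, then puts both teams on a common fixture mix. The pooled mix (both teams' cycles added per fixture) is an assumption made for illustration; any shared mix would serve, and the names `cycle_data`, `weighted_ct`, and `pooled` are hypothetical.

```python
# (cycles, average cycle time in seconds) per team and fixture, from the tables above.
cycle_data = {
    "A": {"Fixture 12": (900, 32.0), "Fixture 7": (300, 28.0)},
    "B": {"Fixture 12": (200, 34.0), "Fixture 7": (900, 28.5)},
}

def weighted_ct(team_data, weights=None):
    """Average cycle time weighted by cycle counts (the team's own volumes by default)."""
    if weights is None:
        weights = {fx: n for fx, (n, _) in team_data.items()}
    total = sum(weights.values())
    return sum(weights[fx] * ct for fx, (_, ct) in team_data.items()) / total

# Assumed common mix: pooled volume of both teams on each fixture.
pooled = {
    fx: cycle_data["A"][fx][0] + cycle_data["B"][fx][0]
    for fx in cycle_data["A"]
}

for team, data in cycle_data.items():
    own = weighted_ct(data)
    common = weighted_ct(data, pooled)
    print(f"Team {team}: own mix {own:.1f} s, common mix {common:.1f} s")
```

On its own mix Team A looks slower; on the shared mix Team A comes out faster, which matches the per-fixture rows.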

Suppliers Aren’t Always the Story Your Roll-Up Tells

Quality escalations are the most expensive place to let the roll‑up drive the narrative. Supplier lot Q17 gets flagged, and the weekly report shows a clean FPY gap between Line A and Line B. The next step on the escalation checklist is usually a vendor call. Before you make it, split the rows by who actually ran the work.

Supplier lot FPY comparison

| Supplier Lot | Line A FPY | Line B FPY |
| --- | --- | --- |
| Lot Q17 | 91.5% | 96.0% |

FPY by operator experience

| Operator Band | Line A FPY | Line B FPY | Line A Units | Line B Units |
| --- | --- | --- | --- | --- |
| A‑band (experienced) | 97.8% | 97.5% | 300 | 120 |
| B‑band (new) | 88.9% | 95.7% | 700 | 480 |

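Experienced operators post nearly identical yields on both lines (97.8% vs 97.5%); the gap sits almost entirely with Line A’s newer operators (88.9% vs 95.7%), who ran 70% of Line A’s units on Q17. The same lot performs about equally well in experienced hands on both lines, so read the within‑band numbers as a staffing and training signal, not a supplier defect. Drop the escalation. Instead, review the exact inspection step that failed, standardize a quick in‑station check to remove the common misread, and pair Line A’s B‑band operators with a mentor on that SKU for a week. If that diagnosis holds, the FPY gap should close without a vendor call.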
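One way to read the operator-band table is to compute the line-to-line gap inside each band and see where it concentrates. This is a minimal sketch using the FPY values from the table; the dictionary layout and the names `band_fpy`, `gaps`, and `widest` are illustrative assumptions.

```python
# FPY per operator band and line, from the table above.
band_fpy = {
    "A-band (experienced)": {"Line A": 0.978, "Line B": 0.975},
    "B-band (new)":         {"Line A": 0.889, "Line B": 0.957},
}

# Gap within each band: positive means Line B is ahead of Line A.
gaps = {band: fpy["Line B"] - fpy["Line A"] for band, fpy in band_fpy.items()}

# The band with the largest absolute gap is where to look first.
widest = max(gaps, key=lambda band: abs(gaps[band]))

for band, gap in gaps.items():
    print(f"{band}: Line B minus Line A = {gap * 100:+.1f} pts")
print(f"Gap concentrates in: {widest}")
```

A gap that lives in one operator band and nearly vanishes in the other points at how the work was staffed rather than at the lot itself.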

The Discipline That Prevents Bad Decisions

Factories juggle descriptive and causal questions. “What happened?” is descriptive. “Why did it happen and what should we do?” is causal. Simpson’s paradox shows up when you answer the causal question with a descriptive roll‑up.

Here is a simple 6-step guide on avoiding this mistake:

  1. Write the exact decision in one line (e.g., “Coach Night or not?” “Fix Fixture 12 first?” “Escalate lot Q17?”).
  2. List what changes difficulty or assignment (mix, fixture/asset ID, operator band, time of day).
  3. Compare apples to apples: segment by those factors and compare inside each segment.
  4. Put both sides on the same mix and compare again (reweight to a common mix).
  5. If segments and the roll‑up disagree, trust the segments.
  6. Act on the cause you found (training, fixture control, staffing, sequence).
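Steps 3 and 5 can be sketched as a small check that compares the roll-up leader against the leader inside each segment. This is an illustration under stated assumptions, not a prescribed implementation: the function names `rollup` and `segment_check` are hypothetical, a higher rate is assumed to be better, and ties go to the second group.

```python
def rollup(segs):
    """Volume-weighted rate across segments; segs[segment] = (units, rate)."""
    total = sum(n for n, _ in segs.values())
    return sum(n * r for n, r in segs.values()) / total

def segment_check(data, a, b):
    """Compare groups a and b in the roll-up and inside each segment.

    data[group][segment] = (units, rate). Returns the roll-up leader,
    the leader per segment, and whether the two views contradict each other.
    """
    rollup_leader = a if rollup(data[a]) > rollup(data[b]) else b
    by_segment = {
        seg: (a if data[a][seg][1] > data[b][seg][1] else b)
        for seg in data[a]
    }
    # Reversal: every segment names the same leader, and the roll-up names the other group.
    leaders = set(by_segment.values())
    reversal = len(leaders) == 1 and rollup_leader not in leaders
    return rollup_leader, by_segment, reversal

# Shift example from earlier in the article: (units, FPY) per segment.
shifts = {
    "Day":   {"easy": (900, 0.95), "hard": (100, 0.75)},
    "Night": {"easy": (100, 0.99), "hard": (900, 0.80)},
}
leader, per_segment, reversal = segment_check(shifts, "Day", "Night")
print(leader, per_segment, reversal)
```

When `reversal` is true, step 5 applies: the segments, not the roll-up, describe what the crews actually did.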

How to Reweight (Step 4)

Variable definitions

| Variable | Description |
| --- | --- |
| $\text{FPY}_{\text{easy}}$ | First-pass yield for easy builds |
| $\text{FPY}_{\text{hard}}$ | First-pass yield for hard builds |
| $n_{\text{easy}}$ | Number of easy build units |
| $n_{\text{hard}}$ | Number of hard build units |
| $n_{\text{total}}$ | Total units (1,000 in our example) |

The aggregate yield for any shift is the weighted average across segments:

$$\text{FPY}_{\text{aggregate}} = \frac{\text{FPY}_{\text{easy}} \times n_{\text{easy}} + \text{FPY}_{\text{hard}} \times n_{\text{hard}}}{n_{\text{total}}}$$

To compute the counterfactual “What would Day’s FPY be if it had Night’s mix?”, use Night’s volume weights with Day’s segment performance:

$$\text{FPY}_{\text{Day, Night's mix}} = \frac{\text{FPY}_{\text{Day,easy}} \times n_{\text{Night,easy}} + \text{FPY}_{\text{Day,hard}} \times n_{\text{Night,hard}}}{n_{\text{total}}}$$

For the example from the within-segments table (the second table in this article):

  • Day on easy: 95%, Night on easy: 99%
  • Day on hard: 75%, Night on hard: 80%
  • Night’s mix: 100 easy, 900 hard (out of 1,000)

Day with Night’s mix = $(95 \times 100 + 75 \times 900) / 1000 = 77.0\%$

Night with Day’s mix = $(99 \times 900 + 80 \times 100) / 1000 = 97.1\%$

Because Night is ahead in both segments, it comes out ahead of Day at any common mix, even though the raw aggregate (81.9% vs 93.0%) says the opposite.
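The reweighting formula can be sketched as a short function. The values come from the within-segments table; the names `fpy`, `volume`, and `reweighted_fpy` are placeholders, and the sweep over easy-build shares in 10-point steps is an added illustration rather than part of the original analysis.

```python
# Segment FPY and segment volumes per shift, from the within-segments table.
fpy = {
    "Day":   {"easy": 0.95, "hard": 0.75},
    "Night": {"easy": 0.99, "hard": 0.80},
}
volume = {
    "Day":   {"easy": 900, "hard": 100},
    "Night": {"easy": 100, "hard": 900},
}

def reweighted_fpy(segment_fpy, mix):
    """FPY a shift would post if it ran `mix` (units per segment) at its own segment yields."""
    total = sum(mix.values())
    return sum(segment_fpy[seg] * n for seg, n in mix.items()) / total

day_on_night_mix = reweighted_fpy(fpy["Day"], volume["Night"])
night_on_day_mix = reweighted_fpy(fpy["Night"], volume["Day"])
print(f"Day with Night's mix: {day_on_night_mix:.1%}")
print(f"Night with Day's mix: {night_on_day_mix:.1%}")

# Sweep the easy-build share from 0% to 100% and check that Night stays ahead.
night_always_ahead = all(
    reweighted_fpy(fpy["Night"], {"easy": e, "hard": 100 - e})
    > reweighted_fpy(fpy["Day"], {"easy": e, "hard": 100 - e})
    for e in range(0, 101, 10)
)
print(f"Night ahead at every tested mix: {night_always_ahead}")
```

Passing a shift its own volumes returns its actual aggregate, so the same function covers both the observed and the counterfactual rows of the mix-adjusted table.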

Only after the segment view is clear do you show the aggregate for context. The payoff from matching interventions to actual root causes shows up fast: less overtime chasing noise, fewer expedites, and calmer ramps because fixes land on the right fixtures, training, and sequence.

How Material Model Enables Segment‑First Analysis

Segment‑first only works when the raw observations are detailed enough to explain why one slice runs differently from another. Collecting this data is tedious and time‑intensive. This is where Material Model comes in. We help transform your work videos, no matter how they’re recorded, into elemental timelines: worksteps, MODAPTS motions, or machine states like clamp, transfer, and dwell. With those timelines, you can see differences in the work itself: extra reaches on one shift that don’t appear on the other, a longer “align” element on a specific fixture after lunch, a quality check that sometimes lands inside the cycle and sometimes floats outside. Those are the differences that justify segmenting, and they are the levers you can actually pull.

If you want to see the gaps on your own line, record a few clips from the station in question, run them through Material Model at app.materialmodel.com, and compare the elemental timelines side by side.
