Fault Detection — Innovations, Mahalanobis Distance, and Sub-Filter Isolation#

This demo extends the 5-state tight-coupled EKF from Block 7 with GPS fault injection and three layers of chi-squared innovation monitoring. The scenario is the same canonical 2D GPS setup, but now one satellite gets a steadily-growing pseudorange bias that the filter has no way to know about. The job is to detect the fault from the symptom — the filter’s own innovation residuals — and isolate which satellite is responsible.

The demo is built around three coupled monitoring views, each addressing one stage of the detection-and-isolation pipeline:

  1. Single \(D^2_k\) — the Mahalanobis distance per satellite per measurement, against a 1-DOF chi-squared threshold \(\gamma_1\). Catches large faults instantly but struggles with slow ramps.

  2. Running \(M\)-window \(\sum D^2_k\) — accumulates evidence across \(M\) samples for a more powerful slow-ramp test, against a \(\chi^2_M\) threshold \(\gamma_M\).

  3. Sub-filter isolation matrix\(N \times N\) grid of running-window tests across \(N\) sub-filters, each excluding one satellite. Identifies which satellite is faulty by exclusion.

The headline pedagogical contrast is that the per-satellite tests in views ① and ② can detect a fault but cannot tell which sat is the source, because once the fault corrupts the filter state, every satellite’s \(h(\hat{\mathbf{x}})\) is wrong and every innovation grows. The sub-filter matrix in view ③ is the only way to isolate cleanly.

The detection statistic#

For each scalar pseudorange update at satellite \(i\), the EKF produces an innovation \(\nu_i\) and an innovation variance \(S_i = \mathbf{H}_i \mathbf{P}^- \mathbf{H}_i^\top + \sigma_\rho^2\). The Mahalanobis distance is

\[ D^2_i = \frac{\nu_i^2}{S_i}. \]

Under the null hypothesis (no fault), \(D^2_i \sim \chi^2(1)\) with \(\gamma_{1,\,0.95} \approx 3.84\). Under a ramp fault \(b_f(t) = \dot{b}_f \cdot (t - t_0)\) on satellite \(i\), the innovation picks up a growing bias \(\nu_i \approx v_i + b_f(t)\) while \(S_i\) stays unchanged (the filter has no idea anything is wrong), so \(D^2_i\) climbs.

The single-sample test \(D^2_i > \gamma_1\) is sensitive to large step faults but slow on ramps — at low \(\dot{b}_f\), the noise term in \(\nu_i\) dominates for many samples before the bias lifts the test statistic over the threshold. Aggregating \(M\) samples in a sliding window pools the evidence:

\[ \Lambda_M(t_k) = \sum_{j = k-M+1}^{k} D^2_i(t_j) \ \sim\ \chi^2(M) \]

with \(\gamma_{M,\,0.95}\) from the \(M\)-DOF chi-squared distribution. The threshold scales with \(M\), but so does the accumulated bias contribution under a ramp — so the windowed test is much more sensitive to slow faults.

Why the per-satellite tests are not enough#

When a fault corrupts the main filter’s state, the next prediction \(\hat{\mathbf{x}}^-\) is wrong. The predicted measurement \(h_j(\hat{\mathbf{x}}^-)\) is therefore wrong for every satellite, not just the faulted one. Innovations grow on healthy satellites too, and the per-sat windowed test starts firing on multiple sats. Detection works, but isolation fails.

The fix: run \(N\) sub-filters in parallel, each excluding one satellite from its updates. Sub-filter \(f\) never sees satellite \(f\)’s pseudorange. If satellite \(f\) is the faulted one, sub-filter \(f\)’s state stays clean — and so do all of its innovations. Every other sub-filter still ingests the bad sat’s measurement and gets corrupted. The clean column in an \(N \times N\) matrix of running \(D^2\) sums identifies the culprit.

Interactive demo#

Open in full screen

The default scenario injects a ramp fault on SAT 3 at \(t = 60\) s with \(\dot{b}_f = 3\) m/s on top of a 5-satellite, evenly-spread constellation with \(\sigma_\rho = 5\) m and \(M = 30\) s. The trajectory panel and three monitoring views play forward synchronously. The faulted satellite is set in the controls (you can change it), but none of the three monitoring views reveals the answer through coloring or labels — students should figure it out by reading the chi-squared structure.

Walkthrough#

  1. Watch the pre-fault behavior. For \(t < 60\) s, all five sub-filter paths cluster around the main filter and truth. Single \(D^2_k\) panels occasionally spike above \(\gamma_1 = 3.84\) (the 5% false-alarm rate); the running-window \(\sum D^2_k\) panels stay well below \(\gamma_M\). The isolation matrix is uniformly clean.

  2. The fault injects at \(t_0 = 60\) s. The trajectory header switches from pre-fault to FAULT ACTIVE · bias = X.X m, with X growing linearly. The main filter starts drifting away from truth.

  3. Open Tab ① · Single \(D^2_k\). SAT 3’s panel shows the faulted satellite climbing — but slowly, and the threshold \(\gamma_1 = 3.84\) is already crossed by the natural noise occasionally. The single-sample test struggles on this ramp until the bias is several sigma over the noise floor.

  4. Switch to Tab ② · Running window \(\sum D^2_k\). Same five panels, now showing the 30-sample accumulated chi-squared. SAT 3 climbs cleanly above \(\gamma_M \approx 43.8\) within roughly 30–60 s of fault start. Detection wins. But notice the other satellites also start climbing past the threshold a bit later — once the main filter is corrupted, every sat’s innovations are wrong. Detection works, isolation fails.

  5. Switch to Tab ③ · Sub-filter isolation matrix. \(5 \times 5\) grid: row = satellite being monitored, column = sub-filter that excluded sat \(f\). The threshold here is the stricter \(\gamma_\text{iso} = \chi^2_{M,\,0.999}\) to keep the false-alarm rate low when comparing \(N\) columns simultaneously. Most cells climb after \(t_0\). Find the column where every cell stays below the threshold — that’s the sub-filter that excluded the bad sat. The faulted satellite is the one that this clean column excludes.

  6. Verify your answer. Reload the trajectory tab. Toggle “show sub-filter tracks”. One of the colored sub-filter paths stays close to truth while the others drift with the main filter; that path’s color matches the satellite you identified as faulted. (Toggle “center trajectory on truth” if you’d rather watch the main filter drift away from a fixed truth, or leave it on the main filter to watch truth drift to the edge of the window.)

  7. Slide the ramp rate. At \(\dot{b}_f = 10\) m/s the single-sample test catches it almost immediately — fast faults are easy. At \(\dot{b}_f = 0.5\) m/s the single-sample test struggles for the entire run; only the windowed test reliably detects it. The sub-filter matrix still isolates correctly in both cases.

  8. Slide the window \(M\). \(M = 5\) s is too short — the windowed threshold is too low, the test gets noisy. \(M = 60\) s is slow to react and washes out the start-of-fault transient. There is no universally optimal \(M\); this is the operational tuning trade.

  9. Change the faulted satellite to a different SV. The clean column in the isolation matrix follows it. The trajectory’s drifting sub-filters reshuffle accordingly.

Key observations#

  • Single-sample \(D^2\) catches step faults; the windowed \(\sum D^2\) catches ramps. The former trades slow-fault sensitivity for instant response on big jumps; the latter trades instant response for robustness on slow drifts. Real systems run both in parallel and OR the alarms.

  • The filter’s covariance becomes inconsistent under an undetected fault. Watch the 95% main-filter covariance ellipse on the trajectory plot stay small while the actual position error blows up. The filter is “confidently wrong” — its \(\mathbf{P}\) is no longer a valid uncertainty estimate. This is the essence of HMI from the Block 8 reading.

  • Per-sat innovations are corrupted across the board. Once the fault gets into the main filter, every satellite’s predicted pseudorange \(h_j(\hat{\mathbf{x}})\) is biased. The single and windowed tests on healthy sats start firing too. There is no way to tell from the main filter alone which channel is the actual source.

  • Sub-filter exclusion is the cleanest isolation strategy. Each sub-filter’s state stays uncorrupted only when it has excluded the faulty satellite. The matrix layout makes the answer visually obvious: scan the columns, find the one that stays uniformly low, read off the excluded satellite.

  • The thresholds are tunable knobs against false-alarm rate. \(\gamma_1\) at 5% gives one false alarm every ~20 samples (per sat); \(\gamma_\text{iso}\) at 0.1% gives essentially none. The whole game in fault-detection design is choosing where to sit on the false-alarm-vs-time-to-detect curve.

From this demo to the F-47 capstone#

The Block 8 reading frames integrity through Hazardous Misleading Information (HMI): the true error grows past the protection level while the fault is still undetected. This demo gives you the chi-squared machinery that powers HMI prevention in practice — the per-satellite test for fast detection, the windowed test for slow ramps, and sub-filter isolation for naming the offender so the system can exclude it and continue with the remaining sensors. The F-47 ANS capstone in Block 10 will require you to measure the time-to-detect \(T_D\) and the HMI exposure window for a 30-event spoof dataset; the same three-layer architecture above is what gets you those numbers cleanly.

Source#

MATLAB · code/FaultDetectionDemo.m

With the 2D covariance ellipse helper in code/navutils/Draw2DErrorBounds.m. The MATLAB version produces five static figures (per-state errors, single \(D^2\), windowed \(\sum D^2\), sub-filter matrix, animated trajectory); the demo above unifies all of them into a single live-playback view with synchronized scrubber.