# Fault Detection: Innovations, Mahalanobis Distance, and Sub-Filter Isolation

This demo extends the 5-state tightly coupled EKF from Block 7 with **GPS fault injection** and three layers of **chi-squared innovation monitoring**. The scenario is the same canonical 2D GPS setup, but now one satellite gets a steadily growing pseudorange bias that the filter has no way to know about. The job is to detect the fault from its symptom, the filter's own innovation residuals, and to isolate which satellite is responsible.

The demo is built around three coupled monitoring views, each addressing one stage of the detection-and-isolation pipeline:

1. **Single $D^2_k$**: the Mahalanobis distance per satellite per measurement, tested against a 1-DOF chi-squared threshold $\gamma_1$. Catches large faults instantly but struggles with slow ramps.
2. **Running $M$-window $\sum D^2_k$**: accumulates evidence across $M$ samples for a more powerful slow-ramp test, against a $\chi^2_M$ threshold $\gamma_M$.
3. **Sub-filter isolation matrix**: an $N \times N$ grid of running-window tests across $N$ sub-filters, each excluding one satellite. Identifies *which* satellite is faulty by exclusion.

The headline pedagogical contrast is that the per-satellite tests in views ① and ② can detect a fault but **cannot tell which sat is the source**, because once the fault corrupts the filter state, every satellite's $h(\hat{\mathbf{x}})$ is wrong and *every* innovation grows. The sub-filter matrix in view ③ is the only one of the three views that isolates cleanly.

## The detection statistic

For each scalar pseudorange update at satellite $i$, the EKF produces an innovation $\nu_i$ and an innovation variance $S_i = \mathbf{H}_i \mathbf{P}^- \mathbf{H}_i^\top + \sigma_\rho^2$. The Mahalanobis distance is

$$
D^2_i = \frac{\nu_i^2}{S_i}.
$$

Under the null hypothesis (no fault), $D^2_i \sim \chi^2(1)$ with $\gamma_{1,\,0.95} \approx 3.84$.
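
As a rough illustration of the statistic above, the following pure-Python sketch computes $S_i$ and $D^2_i$ for one scalar update and compares the result against $\gamma_1$. The function names and the list-based matrix layout are hypothetical and are not taken from the course source code; only the formulas and the 3.84 threshold come from the text.

```python
# Minimal sketch of the single-sample innovation test (hypothetical helper names).
GAMMA_1 = 3.84  # 95% threshold for chi-squared with 1 DOF, as quoted in the text


def innovation_variance(H, P, sigma_rho):
    """Return S = H P H^T + sigma_rho^2 for one scalar measurement.

    H is a length-n list (one row of the measurement Jacobian) and
    P is an n x n list of lists (the prior covariance P^-).
    """
    n = len(H)
    hph = sum(H[r] * P[r][c] * H[c] for r in range(n) for c in range(n))
    return hph + sigma_rho ** 2


def mahalanobis_sq(nu, S):
    """Return D^2 = nu^2 / S for a scalar innovation nu with variance S."""
    return nu * nu / S


def single_sample_alarm(nu, S, gamma=GAMMA_1):
    """True when the single-sample statistic exceeds the threshold."""
    return mahalanobis_sq(nu, S) > gamma
```

For example, with $S_i = 29$ m² an innovation of 12 m gives $D^2 \approx 4.97$ and trips the test, while an innovation of 5 m gives $D^2 \approx 0.86$ and does not.
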
Under a ramp fault $b_f(t) = \dot{b}_f \cdot (t - t_0)$ on satellite $i$, the innovation picks up a growing bias, $\nu_i \approx v_i + b_f(t)$, where $v_i$ is the fault-free noise component of the innovation. Meanwhile $S_i$ stays unchanged (the filter has no idea anything is wrong), so $D^2_i$ climbs.

The single-sample test $D^2_i > \gamma_1$ is sensitive to large step faults but slow on ramps: at low $\dot{b}_f$, the noise term in $\nu_i$ dominates for many samples before the bias lifts the test statistic over the threshold. Aggregating $M$ samples in a sliding window pools the evidence. Under the null hypothesis, and treating successive innovations as independent,

$$
\Lambda_M(t_k) = \sum_{j = k-M+1}^{k} D^2_i(t_j) \ \sim\ \chi^2(M)
$$

with the threshold $\gamma_{M,\,0.95}$ taken from the $M$-DOF chi-squared distribution. The threshold grows with $M$, but its margin above the no-fault mean $M$ grows only roughly like $\sqrt{M}$, while the accumulated bias contribution under a ramp grows at least linearly in $M$. The windowed test is therefore much more sensitive to slow faults.

## Why the per-satellite tests are not enough

When a fault corrupts the *main filter's* state, the next prediction $\hat{\mathbf{x}}^-$ is wrong. The predicted measurement $h_j(\hat{\mathbf{x}}^-)$ is therefore wrong for **every** satellite, not just the faulted one. Innovations grow on healthy satellites too, and the per-sat windowed test starts firing on multiple sats. Detection works, but isolation fails.

The fix: run $N$ **sub-filters** in parallel, each excluding one satellite from its updates. Sub-filter $f$ never sees satellite $f$'s pseudorange. If satellite $f$ is the faulted one, sub-filter $f$'s state stays clean, and so do all of its innovations. Every other sub-filter still ingests the bad sat's measurement and gets corrupted. The clean column in an $N \times N$ matrix of running $D^2$ sums identifies the culprit.

## Interactive demo
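
Independent of the demo itself, the running-window test from the detection-statistic section can be sketched offline in a few lines of Python. The class name, its interface, and the choice to suppress alarms until the window holds a full $M$ samples are assumptions made for illustration; the threshold is passed in rather than computed.

```python
from collections import deque


class WindowedChiSquare:
    """Sliding-window sum of D^2 samples for one satellite (hypothetical helper)."""

    def __init__(self, window, threshold):
        self.window = window        # M, number of samples pooled
        self.threshold = threshold  # gamma_M, e.g. about 43.8 for M = 30 at 95%
        self._buf = deque(maxlen=window)  # oldest sample drops out automatically

    def update(self, d2):
        """Push one D^2 sample and return (window sum, alarm flag).

        Assumption: no alarm is raised until the window is full, because
        gamma_M is only the right threshold for a sum of exactly M terms.
        """
        self._buf.append(d2)
        total = sum(self._buf)
        full = len(self._buf) == self.window
        return total, full and total > self.threshold
```

A monitor for the default scenario would be constructed as `WindowedChiSquare(window=30, threshold=43.8)` and fed one $D^2_i$ value per measurement epoch.
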

The default scenario injects a ramp fault on **SAT 3** at $t = 60$ s with $\dot{b}_f = 3$ m/s on top of a 5-satellite, evenly spread constellation with $\sigma_\rho = 5$ m and $M = 30$ s. The trajectory panel and three monitoring views play forward synchronously. The faulted satellite is set in the controls (you can change it), but **none of the three monitoring views reveals the answer through coloring or labels**: students should figure it out by reading the chi-squared structure.

## Walkthrough

1. **Watch the pre-fault behavior.** For $t < 60$ s, all five sub-filter paths cluster around the main filter and truth. Single $D^2_k$ panels occasionally spike above $\gamma_1 = 3.84$ (the 5% false-alarm rate); the running-window $\sum D^2_k$ panels stay well below $\gamma_M$. The isolation matrix is uniformly clean.
2. **The fault injects at $t_0 = 60$ s.** The trajectory header switches from `pre-fault` to `FAULT ACTIVE · bias = X.X m`, with X growing linearly. The main filter starts drifting away from truth.
3. **Open Tab ① · Single $D^2_k$.** SAT 3's panel shows the faulted satellite climbing, but slowly, and the threshold $\gamma_1 = 3.84$ is already crossed occasionally by natural noise. The single-sample test struggles on this ramp until the bias is several sigma over the noise floor.
4. **Switch to Tab ② · Running window $\sum D^2_k$.** Same five panels, now showing the 30-sample accumulated chi-squared. SAT 3 climbs cleanly above $\gamma_M \approx 43.8$ within roughly 30–60 s of fault start. **Detection wins.** But notice the *other* satellites also start climbing past the threshold a bit later: once the main filter is corrupted, every sat's innovations are wrong. **Detection works, isolation fails.**
5. **Switch to Tab ③ · Sub-filter isolation matrix.** $5 \times 5$ grid: row = satellite being monitored, column $f$ = sub-filter that excluded sat $f$.
The threshold here is the stricter $\gamma_\text{iso} = \chi^2_{M,\,0.999}$ to keep the false-alarm rate low when comparing $N$ columns simultaneously. Most cells climb after $t_0$. **Find the column where every cell stays below the threshold**: that is the sub-filter that excluded the bad sat. The faulted satellite is the one that this clean column excludes.
6. **Verify your answer.** Reload the trajectory tab. Toggle "show sub-filter tracks". One of the colored sub-filter paths stays close to truth while the others drift with the main filter; that path's color matches the satellite you identified as faulted. (Toggle "center trajectory on truth" if you'd rather watch the main filter drift away from a fixed truth, or leave it on the main filter to watch truth drift to the edge of the window.)
7. **Slide the ramp rate.** At $\dot{b}_f = 10$ m/s the single-sample test catches it almost immediately: fast faults are easy. At $\dot{b}_f = 0.5$ m/s the single-sample test struggles for the entire run; only the windowed test reliably detects it. The sub-filter matrix still isolates correctly in both cases.
8. **Slide the window $M$.** $M = 5$ s is too short: the windowed threshold is too low and the test gets noisy. $M = 60$ s is slow to react and washes out the start-of-fault transient. There is no universally optimal $M$; this is the operational tuning trade.
9. **Change the faulted satellite to a different SV.** The clean column in the isolation matrix follows it. The trajectory's drifting sub-filters reshuffle accordingly.

## Key observations

- **Single-sample $D^2$ catches step faults; the windowed $\sum D^2$ catches ramps.** The former trades slow-fault sensitivity for instant response on big jumps; the latter trades instant response for robustness on slow drifts. Real systems run both in parallel and OR the alarms.
- **The filter's covariance becomes inconsistent under an undetected fault.** Watch the 95% main-filter covariance ellipse on the trajectory plot stay small while the actual position error blows up. The filter is "confidently wrong": its $\mathbf{P}$ is no longer a valid uncertainty estimate. This is the essence of Hazardous Misleading Information (HMI) from the Block 8 reading.
- **Per-sat innovations are corrupted across the board.** Once the fault gets into the main filter, every satellite's predicted pseudorange $h_j(\hat{\mathbf{x}})$ is biased. The single and windowed tests on healthy sats start firing too. There is no way to tell from the main filter alone which channel is the actual source.
- **Sub-filter exclusion is the cleanest isolation strategy.** Each sub-filter's state stays uncorrupted *only* when it has excluded the faulty satellite. The matrix layout makes the answer visually obvious: scan the columns, find the one that stays uniformly low, read off the excluded satellite.
- **The thresholds are tunable knobs against false-alarm rate.** $\gamma_1$ at 5% gives on average one false alarm every ~20 samples per satellite; $\gamma_\text{iso}$ at 0.1% gives essentially none. The whole game in fault-detection design is choosing where to sit on the false-alarm-vs-time-to-detect curve.

## From this demo to the F-47 capstone

The Block 8 reading frames integrity through HMI: the true error grows past the protection level while the fault is still undetected. This demo gives you the chi-squared machinery that powers HMI prevention in practice: the per-satellite test for fast detection, the windowed test for slow ramps, and sub-filter isolation for naming the offender so the system can exclude it and continue with the remaining sensors. The F-47 ANS capstone in Block 10 will require you to measure the time-to-detect $T_D$ and the HMI exposure window for a 30-event spoof dataset; the same three-layer architecture above is what gets you those numbers cleanly.
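
The column scan from the key observations and the time-to-detect measurement mentioned for the capstone can both be sketched in plain Python. This is an illustrative sketch, not the capstone's required interface: the function names are hypothetical, the matrix is assumed to be indexed as `matrix[row][col]` (row = monitored satellite, column = sub-filter excluding that satellite), and the diagonal cell is skipped on the assumption that sub-filter $f$ does not process satellite $f$ at all.

```python
def isolate_faulty_satellite(matrix, gamma_iso):
    """Return the index of the single clean column, or None.

    matrix[row][col] is the windowed D^2 sum for satellite `row` inside the
    sub-filter that excludes satellite `col`. A column is "clean" when every
    off-diagonal cell is at or below gamma_iso. With no fault every column is
    clean, so a decision is only made when exactly one column qualifies.
    """
    n = len(matrix)
    clean = [
        col for col in range(n)
        if all(matrix[row][col] <= gamma_iso for row in range(n) if row != col)
    ]
    return clean[0] if len(clean) == 1 else None


def time_to_detect(times, alarms, t_fault):
    """Return the delay from t_fault to the first alarm at or after it.

    Alarms before t_fault are treated as false alarms and ignored.
    Returns None if no alarm follows the fault onset.
    """
    for t, alarm in zip(times, alarms):
        if t >= t_fault and alarm:
            return t - t_fault
    return None
```

For $M = 30$, the 99.9% chi-squared quantile used as $\gamma_\text{iso}$ is approximately 59.7. Returning `None` when several columns are clean keeps the pre-fault case (a uniformly clean matrix) from being reported as an isolation.
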
## Source

MATLAB · `code/FaultDetectionDemo.m`, with the 2D covariance ellipse helper in `code/navutils/Draw2DErrorBounds.m`. The MATLAB version produces five static figures (per-state errors, single $D^2$, windowed $\sum D^2$, sub-filter matrix, animated trajectory); the demo above unifies all of them into a single live-playback view with a synchronized scrubber.