Reliability Evaluation of PepperChat’s Progress Note Builder
This study evaluates how reliably PepperChat’s Progress Note Builder (PNB) performs three core tasks that support therapist documentation: rewriting freehand notes into structured previews (PNB-1), generating suggestions to complete minimal notes (PNB-2), and extracting structured category fields from narrative text (PNB-11).
Because these features sit close to the clinical record, we focused on accuracy, consistency, and the types of errors that occur when the system is run repeatedly with the same inputs.
Across all three test suites, we ran 60 total trials (20 per suite). Each trial was graded pass/fail using predefined rubrics.
Overall, 57 of 60 trials passed (95.0% accuracy), and every failure was an omission. In other words, when the system failed, it did so by leaving out a required detail rather than adding incorrect or invented clinical information. No trial in any suite produced a hallucinated diagnosis, risk, quote, intervention, or other fabricated clinical content.
This error profile is important clinically: omission errors can be identified and corrected by the therapist (or by additional validation checks), whereas hallucinated content can quietly corrupt the record. The observed pattern (high accuracy with omission-only failures) indicates conservative and safety-oriented behavior.
We evaluated three production endpoints:
PNB-1: Freehand Notes → Preview
Rewrites a freehand progress note into a structured “preview” plus suggested documentation lines (added to the note immediately).
PNB-2: Minimal Notes → Suggestions
Takes a short or partial note and generates clinically appropriate suggestions to help complete the documentation (added to the note only after explicit user confirmation).
PNB-11: Category Detection
Extracts structured fields (e.g., symptoms, interventions, homework, topics) from a short paragraph of narrative text.
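For concreteness, a minimal sketch of the kind of structured output PNB-11 produces is shown below. The field names mirror the examples above (symptoms, interventions, homework, topics), but the exact response schema is an assumption; it is not reproduced in this report.

```python
# Hypothetical shape of a PNB-11 extraction result. Field names follow the
# examples in the endpoint description; the real production schema may differ.
example_extraction = {
    "symptoms": ["low mood", "poor sleep"],
    "interventions": ["cognitive restructuring"],
    "homework": ["complete a daily thought record"],
    "topics": ["work stress"],
}
```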
Each suite contained 20 trials, and all tests were run against PepperChat’s actual production API. Full JSON responses were logged and then graded manually using binary scoring (1 = pass, 0 = fail) according to predefined rubrics.
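As an illustration of this harness, a trial runner along the following lines would call the production endpoint and persist the full JSON response for later manual grading. The base URL, endpoint path, and payload shape here are hypothetical; only the log-then-grade workflow reflects the methodology described above.

```python
import json
import pathlib

import requests  # standard HTTP client; all endpoint details below are hypothetical

API_BASE = "https://api.pepperchat.example/v1"  # placeholder, not the real base URL

def run_trial(suite: str, trial_id: int, payload: dict, log_dir: pathlib.Path) -> dict:
    """Run one trial against the production API and log the full JSON response."""
    resp = requests.post(f"{API_BASE}/{suite}", json=payload, timeout=60)
    resp.raise_for_status()
    body = resp.json()
    log_path = log_dir / f"{suite}_trial{trial_id:02d}.json"
    log_path.write_text(json.dumps(body, indent=2))  # saved for manual rubric grading
    return body
```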
For PNB-1 (Freehand Notes) and PNB-2 (Minimal Notes → Suggestions), each trial was scored against that suite’s rubric dimensions, which are listed with the grading tables in Appendix A.
For PNB-11 (Category Detection), each trial was scored on three dimensions: all_categories_present, no_hallucinations, and no_omissions.
A trial was marked pass only if it met all rubric criteria for that suite. We calculated accuracy for each suite and dimension, along with 95% confidence intervals (CIs) for the suite-level accuracy rates.
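The report does not name the interval method, but the reported PNB-11 lower bound (69.35% for 17/20 passes) matches the normal-approximation (Wald) interval exactly, and the same formula yields the degenerate [1.00, 1.00] intervals for the perfect suites. A minimal sketch under that assumption:

```python
import math

def wald_ci(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) CI for a binomial pass rate."""
    p_hat = passed / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

print(wald_ci(17, 20))  # (0.6935..., 1.0) -- matches the PNB-11 row
print(wald_ci(20, 20))  # (1.0, 1.0) -- degenerate at zero observed failures
```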
Full prompts, outputs, and grading tables are provided in Appendix A.
Global performance across all suites: 57 of 60 trials passed, for an overall accuracy of 95.00%.
Suite-level results:
| Test Suite | Total Tests | Passed Tests | Failed Tests | Accuracy % | CI Lower (95%) |
|---|---|---|---|---|---|
| PNB-1 (Freehand Notes) | 20 | 20 | 0 | 100.00% | 100.00% |
| PNB-2 (Suggestions) | 20 | 20 | 0 | 100.00% | 100.00% |
| PNB-11 (Category Detection) | 20 | 17 | 3 | 85.00% | 69.35% |
Figure 1. Passed vs. failed tests by suite (20 trials per suite). PNB‑1 and PNB‑2 show 20/20 passes; PNB‑11 shows 17/20 passes with 3 omission-only failures.
PNB-1 and PNB-2 achieved perfect reliability under repeated trials, with no variability across runs. PNB-11 also performed strongly overall but showed some inconsistency driven entirely by missing content in one field (e.g., an empty “topics” or “interventions” entry when it should have been populated).
All dimensions of PNB-1 and PNB-2 scored 100% accuracy, with confidence intervals of [1.00, 1.00].
For PNB-11:
- all_categories_present: 100%
- no_hallucinations: 100%
- no_omissions: 85% (95% CI: 0.6935–1.00)

The critical finding is that every PNB-11 failure occurred in the no_omissions dimension: a category that should have been detected was not.
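A trial-level grader for these three dimensions could look like the sketch below. The set-based representation of gold-standard and extracted category items is illustrative, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class Pnb11Grade:
    all_categories_present: bool
    no_hallucinations: bool
    no_omissions: bool

    @property
    def passed(self) -> bool:
        # Binary scoring: a trial passes only if every rubric dimension is met.
        return self.all_categories_present and self.no_hallucinations and self.no_omissions

def grade_pnb11(expected: dict[str, set[str]],
                extracted: dict[str, set[str]]) -> Pnb11Grade:
    """Grade one PNB-11 trial against a gold-standard annotation.

    Both arguments map category names (e.g. "symptoms", "topics") to the
    items found in the narrative; this representation is illustrative.
    """
    return Pnb11Grade(
        # Every expected category field appears in the output.
        all_categories_present=set(expected) <= set(extracted),
        # Nothing was extracted that the gold standard does not support.
        no_hallucinations=all(items <= expected.get(field, set())
                              for field, items in extracted.items()),
        # Every gold-standard item was actually extracted.
        no_omissions=all(items <= extracted.get(field, set())
                         for field, items in expected.items()),
    )
```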
The model is highly reliable when handling narrative text (PNB-1 and PNB-2). In every test, the output preserved the clinical content of the input, added no hallucinated diagnoses, risks, quotes, or interventions, and omitted no required details.
With zero observed failures across 20 trials per suite, the reported intervals collapse to [1.00, 1.00], reflecting fully consistent behavior across repeated runs.
The category-detection task (PNB-11) performed well overall but showed the only meaningful source of variation. The system never fabricated a category, but it occasionally left a field empty (e.g., “topics” or “interventions”) when the narrative supported populating it.
In other words, the model is safe and conservative but occasionally incomplete. This aligns with typical model behavior in structured extraction tasks, where omission is favored over hallucination.
Across all suites, failures were rare (3 of 60 trials), confined to omissions, and never involved hallucinated or fabricated clinical content.
This reliability profile is a good foundation for production use, particularly in areas where completeness can be improved through either model refinement or post-processing checks.
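One such post-processing check, sketched below, flags empty or missing category fields for therapist review before a note is finalized. The required-field list is an assumption based on the PNB-11 examples above.

```python
# Assumed required fields, based on the PNB-11 examples in this report.
REQUIRED_FIELDS = ("symptoms", "interventions", "homework", "topics")

def flag_possible_omissions(extracted: dict[str, list[str]]) -> list[str]:
    """Return category fields that came back empty or missing.

    Because the observed failures were omission-only, surfacing empty fields
    for therapist review directly targets the failure mode seen in testing.
    """
    return [field for field in REQUIRED_FIELDS if not extracted.get(field)]
```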
These tests covered three production endpoints with 20 trials per suite, used fixed inputs under repeated runs, and relied on manual binary grading against predefined rubrics.
PepperChat will conduct reliability testing in several additional areas of the platform.
These evaluations will help ensure consistent behavior across the full documentation pipeline and will include additional statistical measures and stress-testing scenarios.
Under controlled testing across 60 trials, PepperChat’s Progress Note Builder demonstrated 95% overall accuracy, perfect reliability on both narrative-text suites (PNB-1 and PNB-2), and omission-only failures in category detection (PNB-11).
In practice, this means the system is conservative by default: when it errs, it leaves something out rather than inventing clinical content, so its errors remain visible and correctable by the therapist.
Overall, the system is accurate, stable, and clinically appropriate. Continued testing across additional components of the platform will further strengthen confidence in PepperChat’s AI-assisted documentation tools.
The appendix below includes example prompts, responses, and grading summaries for reference. It is not intended as an exhaustive listing of all 60 trials but is representative of the test methodology used.