Reliability Evaluation of PepperChat’s Progress Note Builder

1. Introduction

This study evaluates how reliably PepperChat’s Progress Note Builder (PNB) performs three core tasks that support therapist documentation:

  • Rewriting freehand notes into structured previews and suggestions
  • Expanding minimal notes into clinically appropriate suggestions
  • Detecting structured clinical categories from short narrative text

Because these features sit close to the clinical record, we focused on accuracy, consistency, and the types of errors that occur when the system is run repeatedly with the same inputs.

2. Global Reliability Summary

Across all three test suites, we ran 60 total trials (20 per suite). Each trial was graded pass/fail using predefined rubrics.

  • 57/60 tests passed → 95.0% overall accuracy
  • All 3 failures (5%) came from a single suite (PNB-11, category detection)
  • All failures were due to omissions, not hallucinations or contradictions

In other words, when the system failed, it did so by leaving out a required detail rather than adding incorrect or invented clinical information. No trial in any suite produced a hallucinated diagnosis, risk, quote, intervention, or other fabricated clinical content.

This error profile is important clinically: omission errors can be identified and corrected by the therapist (or by additional validation checks), whereas hallucinated content can quietly corrupt the record. The observed pattern (high accuracy with omission-only failures) indicates conservative and safety-oriented behavior.

3. Methods

3.1 Test Suites

We evaluated three production endpoints:

  • PNB-1: Freehand Notes → Preview
    Rewrites a freehand progress note into a structured “preview” plus suggested documentation lines (added to the note immediately).

  • PNB-2: Minimal Notes → Suggestions
    Takes a short or partial note and generates clinically appropriate suggestions to help complete the documentation (added to the note only after the therapist prompts for them).

  • PNB-11: Category Detection
    Extracts structured fields (e.g., symptoms, interventions, homework, topics) from a short paragraph of narrative text.
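
For illustration, a PNB-11 response can be pictured as a small structured record along the lines of the sketch below. The field names follow the examples above; the exact production schema is an assumption, not a published interface.

    # Illustrative output shape only -- the production schema is assumed,
    # not documented in this report. Field names follow the examples above.
    example_detection = {
        "symptoms":      ["reported low mood", "trouble sleeping"],
        "interventions": ["cognitive restructuring"],
        "homework":      ["daily thought record"],
        "topics":        ["work stress"],
    }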

Each suite contained 20 trials. For each suite:

  • We selected two input examples (e.g., two different notes).
  • For each input, we ran the same request 10 times against the live production endpoint.
  • Inputs were held constant within a suite so that any differences in output reflected variability in the AI’s behavior, not changes in the prompt.

3.2 Execution and Scoring

All tests were run against PepperChat’s actual production API. Full JSON responses were logged and then graded manually using binary scoring (1 = pass, 0 = fail) according to predefined rubrics.
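
A minimal harness for this repeated-trial protocol is sketched below in Python. The endpoint URL, request payload, and output path are illustrative assumptions; the production API's actual route, authentication, and schema are not shown in this report.

    import json
    import requests  # third-party HTTP client (pip install requests)

    # Hypothetical endpoint and payload -- placeholders, not production values.
    ENDPOINT = "https://api.example.com/pnb/freehand-preview"
    PAYLOAD = {"note": "Client reported low mood; practiced grounding exercise."}

    def run_trials(endpoint: str, payload: dict, n_trials: int = 10) -> list[dict]:
        """Send the identical request n_trials times and collect full JSON responses."""
        responses = []
        for _ in range(n_trials):
            resp = requests.post(endpoint, json=payload, timeout=30)
            resp.raise_for_status()
            responses.append(resp.json())
        return responses

    if __name__ == "__main__":
        # Log every response verbatim for later manual rubric grading.
        with open("pnb_trials.jsonl", "w") as f:
            for r in run_trials(ENDPOINT, PAYLOAD):
                f.write(json.dumps(r) + "\n")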

For PNB-1 (Freehand Notes), each trial was scored on:

  • Content preservation
  • No contradictions
  • No hallucinations
  • Clinical tone
  • Tense consistency
  • Formatting

For PNB-2 (Minimal Notes → Suggestions), each trial was scored on:

  • Alignment with input
  • No contradictions
  • No hallucinations
  • Clinical appropriateness
  • Context awareness

For PNB-11 (Category Detection), each trial was scored on:

  • All required categories present
  • No hallucinations
  • No omissions

A trial was marked pass only if it met all rubric criteria for that suite. We calculated accuracy for each suite and dimension, along with 95% confidence intervals (CIs) for the suite-level accuracy rates.
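
The pass/fail aggregation and interval computation can be reproduced with the sketch below. The reported bounds (e.g., the 69.35% lower bound for 17/20) are consistent with a normal-approximation (Wald) interval clipped to [0, 1]; this is a reconstruction inferred from the reported numbers, not the exact analysis script. The dimension key names mirror the rubric lists above and are illustrative.

    import math

    def trial_passes(dimension_scores: dict[str, int]) -> bool:
        """A trial passes only if every rubric dimension scored 1."""
        return all(score == 1 for score in dimension_scores.values())

    def wald_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% normal-approximation CI for a proportion, clipped to [0, 1]."""
        p = passes / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return max(0.0, p - half_width), min(1.0, p + half_width)

    # A PNB-1 trial with all six rubric dimensions passing:
    print(trial_passes({"content_preservation": 1, "no_contradictions": 1,
                        "no_hallucinations": 1, "clinical_tone": 1,
                        "tense_consistency": 1, "formatting": 1}))  # -> True

    # PNB-11: 17 of 20 trials passed.
    low, high = wald_ci(17, 20)
    print(f"accuracy = {17 / 20:.2%}, 95% CI = [{low:.2%}, {high:.2%}]")
    # -> accuracy = 85.00%, 95% CI = [69.35%, 100.00%]

    # At 20/20 the interval collapses to zero width.
    print(wald_ci(20, 20))  # -> (1.0, 1.0)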

Full prompts, outputs, and grading tables are provided in Appendix A.

4. Results

4.1 Global and Suite-Level Performance

Global performance across all suites:

  • 60 total trials
  • 57 passes, 3 failures
  • Overall accuracy: 95.0%
  • All 3 failures were omission-only errors in PNB-11

Suite-level results:

  Test Suite                    Total Tests  Passed Tests  Failed Tests  Accuracy %  CI Lower (95%)
  PNB-1 (Freehand Notes)        20           20            0             100.00%     100.00%
  PNB-2 (Suggestions)           20           20            0             100.00%     100.00%
  PNB-11 (Category Detection)   20           17            3             85.00%      69.35%

Figure 1. Passed vs. failed tests by suite (20 trials per suite). PNB-1 and PNB-2 show 20/20 passes; PNB-11 shows 17/20 passes with 3 omission-only failures.

PNB-1 and PNB-2 achieved perfect reliability under repeated trials, with no variability across runs. PNB-11 also performed strongly overall but showed some inconsistency driven entirely by missing content in one field (e.g., an empty “topics” or “interventions” entry when it should have been populated).

4.2 Dimension-Level Performance

All dimensions of PNB-1 and PNB-2 scored 100% accuracy, with confidence intervals of [1.00, 1.00].

For PNB-11:

  • all_categories_present: 100%
  • no_hallucinations: 100%
  • no_omissions: 85% (CI: 0.6935–1.00)

The critical finding is that all PNB-11 failures fell on the “no omissions” dimension: the expected category fields were returned, but content that should have populated one of them was missing.

5. Discussion

5.1 Strengths

The model is highly reliable when handling narrative text (PNB-1 and PNB-2). In every test:

  • It preserved the meaning of the input
  • It avoided contradictions and hallucinations
  • It produced clinically appropriate suggestions
  • It maintained consistent tone, formatting, and tense

The zero-width intervals at [1.00, 1.00] reflect zero observed failures across repeated trials; because a normal-approximation interval collapses at 100%, these bounds indicate observed consistency rather than a statistical guarantee.

5.2 Category Detection

The category-detection task (PNB-11) performed well overall but showed the only meaningful source of variation. The system:

  • Always produced the right set of categories
  • Never invented clinical information
  • Sometimes failed to include all information from the input

In other words, the model is safe and conservative but occasionally incomplete. This aligns with typical model behavior in structured extraction tasks, where omission is favored over hallucination.

5.3 Reliability Pattern

Across all suites, failures were:

  • Predictable
  • Limited to missing details
  • Never related to clinical inappropriateness or fabrication

This reliability profile is a good foundation for production use, particularly in areas where completeness can be improved through either model refinement or post-processing checks.
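
As one example of such a post-processing check, a lightweight completeness guard could flag empty category fields for therapist review. The field names below reuse the illustrative schema from Section 3.1 and are assumptions, not the production interface.

    REQUIRED_CATEGORIES = ["symptoms", "interventions", "homework", "topics"]

    def flag_possible_omissions(detection: dict) -> list[str]:
        """Return category fields that came back missing or empty.

        An empty field is not always an error (a note may genuinely
        contain no homework), so flagged fields are surfaced for
        therapist review rather than auto-corrected.
        """
        return [c for c in REQUIRED_CATEGORIES if not detection.get(c)]

    # Example: empty fields are flagged for review.
    print(flag_possible_omissions({
        "symptoms": ["low mood"], "interventions": ["grounding"],
        "homework": [], "topics": [],
    }))  # -> ['homework', 'topics']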

6. Limitations

These tests:

  • Covered only three PNB endpoints
  • Used inputs that were intentionally controlled and do not represent the full variety of therapist writing
  • Used binary scoring that may classify acceptable variation as failure
  • Included a structured extraction task (PNB-11) that is inherently more sensitive to wording and nuance

7. Future Work

There are several additional areas where PepperChat will conduct reliability testing:

  • Remaining Progress Note Builder endpoints
  • Treatment Plan Builder (goals, objectives, interventions)
  • Clinical Summaries
  • Chat-based therapist support features
  • Intake assessments

These evaluations will help ensure consistent behavior across the full documentation pipeline and will include additional statistical measures and stress-testing scenarios.

8. Conclusion

Under controlled testing across 60 trials, PepperChat’s Progress Note Builder demonstrated:

  • 95.0% overall accuracy across three test suites
  • 100% accuracy for narrative rewriting and suggestion generation (PNB-1, PNB-2)
  • 85% accuracy for category detection (PNB-11), with all errors due to omissions
  • 0% hallucination rate across all tests

In practice, this means the system is:

  • Accurate and stable for rewriting freehand notes and expanding minimal notes
  • Clinically conservative, favoring omission over invention in structured extraction tasks
  • Well-suited to therapist-in-the-loop workflows, where clinicians review and finalize all output

Overall, the system is accurate, stable, and clinically appropriate. Continued testing across additional components of the platform will further strengthen confidence in PepperChat’s AI-assisted documentation tools.


Appendix A.1 – Endpoint Prompts and Responses

The appendix below includes example prompts, responses, and grading summaries for reference. It is not intended as an exhaustive listing of all 60 trials but is representative of the test methodology used.