Report-Only CI For LLMs With PromptProof
Hey everyone! I'd like to propose a report-only CI check for LLM applications using PromptProof. It gives reviewers an at-a-glance HTML report showing whether recorded answers still match the expected shape and pass basic guardrails, all without making any live model calls and without blocking merges. Here's why it's useful and how to set it up.
Why This Report-Only CI is a Game Changer
When building LLM applications, output quality can degrade without anyone noticing. PromptProof's report-only check guards against this in a few concrete ways. First, it ensures that recorded example responses don't silently drift during refactors, so regressions are caught early instead of reaching production. Second, it makes drive-by PRs safer by giving reviewers schema checks, PII checks, and a cost summary in a single artifact, so the impact of a change is visible at a glance.
The checks are also deterministic: by default they use a fixed seed and repeat each check several times (e.g., runs=3) to avoid flakes. No secrets are required, which keeps setup and maintenance simple. Determinism matters here because reproducible CI results are what let you tell a real regression apart from noise.
Preventing Silent Drifting of Example Responses
The most significant advantage of this report-only CI is that it stops recorded example responses from silently drifting during refactors. In a complex LLM application, a code change can inadvertently alter the shape of the model's output; for instance, a refactor might rename a field or drop a key from the response. PromptProof compares outputs against the recorded fixtures and flags any deviation from the expected shape, so reviewers see the discrepancy immediately.
Catching regressions at review time keeps them out of production, where they would be more expensive to fix, and gives developers working on different parts of the codebase a safety net: their changes can't silently break the recorded output contract. The result is a more predictable development process with less risk of shipping format-breaking changes.
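To make "drift" concrete, here is a minimal, hypothetical Python illustration (the function name and data are invented for this example, not PromptProof API): a refactor accidentally renames an output key, and comparing the fresh output against the recorded fixture catches it.

```python
# Recorded fixture output, captured before the refactor.
recorded = {"answer": "A CI gate for LLM outputs.", "latency_ms": 42}

def refactored_format(text, latency):
    # The refactor accidentally renamed "answer" to "response".
    return {"response": text, "latency_ms": latency}

fresh = refactored_format("A CI gate for LLM outputs.", 42)

# Keys present in one dict but not the other reveal the drift.
drifted_keys = set(recorded) ^ set(fresh)
assert drifted_keys == {"answer", "response"}
```

A schema check over recorded fixtures is exactly this comparison, generalized to types and required fields.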
Enhancing the Safety of Drive-By PRs
PromptProof's report-only CI also makes drive-by pull requests safer. PRs from external contributors, or from developers less familiar with the codebase, can introduce unintended issues. Because every PR produces an artifact containing schema checks, PII checks, and a cost summary, reviewers can assess the impact of a change without running the model live or digging through extensive logs.
If a drive-by PR violates the output schema or exposes sensitive information such as an email address, the report highlights the problem before merge, heading off security vulnerabilities and functional regressions. The cost summary likewise flags changes that would push a run past the configured budget. PromptProof thus adds a valuable layer of protection that makes drive-by PRs a safer part of your workflow.
Ensuring Deterministic Checks
Determinism is what keeps these checks trustworthy. By default, PromptProof uses a fixed seed and repeats each check multiple times (e.g., runs=3) to avoid flakes, so the same fixtures always produce the same report. This is particularly important for LLM tooling, where nondeterminism elsewhere in the stack can make CI results hard to interpret.
Because the checks replay recorded fixtures rather than calling a live model, the seed governs any randomness in the evaluation itself, and the repeated runs guard against flaky false positives or negatives. PromptProof also doesn't require any secrets, which simplifies setup and avoids exposing credentials in CI. This combination of determinism and security makes it a good fit for LLM development workflows.
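PromptProof's internals aren't shown here, but the principle is easy to sketch in Python (`run_check` is a made-up stand-in, not a PromptProof API): drive any randomness from a local, seeded RNG so that repeated runs agree.

```python
import random

def run_check(records, seed):
    """Toy stand-in for a seeded check: any sampling inside it is
    driven by a local RNG, so a fixed seed gives identical results
    on every run."""
    rng = random.Random(seed)          # local RNG; global state untouched
    order = list(range(len(records)))  # e.g. the order fixtures are evaluated
    rng.shuffle(order)
    return order

records = ["ae-hello-001", "ae-hello-002", "ae-hello-003"]

# Three repeated runs with the same seed must agree; a disagreement
# would signal real nondeterminism rather than flaky noise.
results = [run_check(records, seed=1337) for _ in range(3)]
assert results[0] == results[1] == results[2]
```

Using `random.Random(seed)` rather than the module-level functions keeps the determinism local to the check, so unrelated code can't perturb it.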
Files to Add for Implementation
To get this up and running, you’ll need to add a few files to your project. Don't worry, it’s straightforward! Here’s what you need:
1. .github/workflows/promptproof.yml
This YAML file defines the GitHub Actions workflow for PromptProof. It runs PromptProof on pull requests that touch the workflow itself, the PromptProof config, or the fixtures.
```yaml
name: PromptProof
on:
  pull_request:
    paths:
      - ".github/workflows/promptproof.yml"
      - "promptproof.yaml"
      - "fixtures/promptproof/**"
jobs:
  proof:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: geminimir/promptproof-action@v0
        with:
          config: promptproof.yaml
          runs: 3
          seed: 1337
          max-run-cost: 0.75
          report-artifact: promptproof-report
          mode: report-only
```
Key points here:

- `name`: the name of the workflow.
- `on`: triggers the workflow on pull requests that modify specific paths.
- `jobs`: defines the `proof` job, which runs on `ubuntu-latest`.
- `steps`:
  - `actions/checkout@v4`: checks out your repository.
  - `geminimir/promptproof-action@v0`: runs the PromptProof GitHub Action.
- `with`: configures the action:
  - `config`: specifies the `promptproof.yaml` file.
  - `runs`: sets the number of runs to 3 for deterministic checks.
  - `seed`: sets a seed value for reproducibility.
  - `max-run-cost`: sets a maximum cost for the run.
  - `report-artifact`: names the report artifact.
  - `mode`: set to `report-only` so it doesn't block merges.
2. promptproof.yaml
This YAML file configures PromptProof itself, defining the checks and budgets for your LLM outputs.
```yaml
mode: fail
format: html
fixtures:
  - path: fixtures/promptproof/answer_engine.json
checks:
  - id: answer_schema
    type: schema
    json_schema:
      type: object
      properties:
        output:
          type: object
          properties:
            answer: { type: string, minLength: 1 }
            citations: { type: array, items: { type: string }, nullable: true }
            latency_ms: { type: number, minimum: 0 }
          required: [answer]
      required: [output]
  - id: forbid_emails
    type: regex_forbid
    pattern: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
budgets:
  max_run_cost: 0.75
stability:
  runs: 3
  seed: 1337
```
Let’s break it down:

- `mode`: set to `fail` so checks fail when something goes wrong (merges still aren't blocked, thanks to `report-only` mode in the workflow).
- `format`: specifies the report format as HTML.
- `fixtures`: defines the path to your fixture file.
- `checks`:
  - `answer_schema`: a schema check to ensure the output has a specific structure.
  - `forbid_emails`: a regex check to prevent email addresses in the output.
- `budgets`: sets a maximum run cost.
- `stability`: configures the runs and seed for deterministic checks.
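To show what the `answer_schema` check enforces, here is a hand-rolled, stdlib-only Python approximation (PromptProof uses its own validator; this sketch just mirrors the schema above):

```python
def validate_output(record):
    """Minimal hand-rolled version of the answer_schema check.
    Real JSON Schema validators cover far more; this mirrors only
    the fields declared in promptproof.yaml."""
    out = record.get("output")
    if not isinstance(out, dict):
        return ["output: required and must be an object"]
    errors = []
    answer = out.get("answer")
    if not isinstance(answer, str) or len(answer) < 1:
        errors.append("output.answer: required non-empty string")
    citations = out.get("citations")
    if citations is not None and (
        not isinstance(citations, list)
        or not all(isinstance(c, str) for c in citations)
    ):
        errors.append("output.citations: must be an array of strings")
    latency = out.get("latency_ms")
    if latency is not None and (
        isinstance(latency, bool)
        or not isinstance(latency, (int, float))
        or latency < 0
    ):
        errors.append("output.latency_ms: must be a number >= 0")
    return errors

good = {"output": {"answer": "A CI gate for LLM outputs.",
                   "citations": ["https://example.com/doc"],
                   "latency_ms": 42}}
assert validate_output(good) == []
assert validate_output({"output": {"answer": ""}})  # minLength violation
```

In a real setup you'd let the schema check do this work; the point is just that each declared constraint maps to a concrete test on the recorded output.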
3. fixtures/promptproof/answer_engine.json
This JSON file contains example inputs and outputs for your LLM application. PromptProof uses these fixtures to run the checks.
```json
{
  "record_id": "ae-hello-001",
  "input": { "question": "What is PromptProof?" },
  "output": {
    "answer": "A CI gate for LLM outputs.",
    "citations": ["https://example.com/doc"],
    "latency_ms": 42
  }
}
```
Key elements:

- `record_id`: a unique identifier for the record.
- `input`: the input to your LLM application.
- `output`: the expected output, including `answer`, `citations`, and `latency_ms`.
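To see the `forbid_emails` check in spirit, here is a small Python sketch applying the same pattern to a fixture's output (`passes_forbid` is an illustrative helper, not PromptProof API):

```python
import json
import re

# The exact pattern from the forbid_emails check in promptproof.yaml
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def passes_forbid(record, pattern):
    """True when the serialized output contains no match for the
    forbidden pattern."""
    return pattern.search(json.dumps(record["output"])) is None

clean = {"output": {"answer": "A CI gate for LLM outputs."}}
leaky = {"output": {"answer": "Contact alice@example.com for access."}}

assert passes_forbid(clean, EMAIL)      # no email: check passes
assert not passes_forbid(leaky, EMAIL)  # leaked email: check fails
```

Serializing the whole `output` object before matching means the check catches an email leaked into any field, not just `answer`.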
Benefits for Maintainers
Maintainers get a lot of value from this setup. Each PR produces a single, comprehensive HTML report artifact covering schema results, regex checks, and cost, which makes reviewing the impact of changes fast. And since the check makes zero live calls and needs no secrets, it's cheap to keep around and trivial to delete if you decide you don't need it.
References and Further Exploration
For a sample report, you can check out https://geminimir.github.io/promptproof-action/reports/before.html. It gives you a clear idea of what the HTML report looks like and the information it provides. If you’re curious about the PromptProof Action, you can find it on the GitHub Marketplace. For a more hands-on example, the PromptProof Demo Project is a great resource to see it in action.
Next Steps
If this sounds good, I’m happy to open a 3-file PR with these changes, and we can tweak the checks and paths to fit your preferences. Let me know what you think!