Report-Only CI For LLMs With PromptProof
Hey everyone! I'd like to propose a report-only CI check for LLM applications using PromptProof. It gives reviewers an at-a-glance HTML report showing whether recorded answers still match the expected shape and pass basic guardrails, all without making any live model calls and without blocking merges. Here's why it's useful and how to set it up.
Why This Report-Only CI is a Game Changer
When building LLM applications, output quality can degrade without anyone noticing. PromptProof's report-only check guards against this in a few concrete ways. First, it ensures that recorded example responses don't silently drift during refactors, so regressions are caught early instead of reaching production. Second, it makes drive-by PRs safer by giving reviewers schema checks, PII checks, and a cost summary in a single artifact, so the impact of a change is visible at a glance.
The checks are also deterministic: by default they use a fixed seed and repeat each check several times (e.g., runs=3) to avoid flakes. No secrets are required, which keeps setup and maintenance simple. Determinism matters here because reproducible CI results are what let you tell a real regression apart from noise.
Preventing Silent Drifting of Example Responses
The most significant advantage of this report-only CI is that it stops recorded example responses from silently drifting during refactors. In a complex LLM application, a code change can inadvertently alter the shape of the model's output; for instance, a refactor might rename a field or drop a key from the response. PromptProof compares outputs against the recorded fixtures and flags any deviation from the expected shape, so reviewers see the discrepancy immediately.
Catching regressions at review time keeps them out of production, where they would be more expensive to fix, and gives developers working on different parts of the codebase a safety net: their changes can't silently break the recorded output contract. The result is a more predictable development process with less risk of shipping format-breaking changes.
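To make "drift" concrete, here is a minimal, hypothetical Python illustration (the function name and data are invented for this example, not PromptProof API): a refactor accidentally renames an output key, and comparing the fresh output against the recorded fixture catches it.

```python
# Recorded fixture output, captured before the refactor.
recorded = {"answer": "A CI gate for LLM outputs.", "latency_ms": 42}

def refactored_format(text, latency):
    # The refactor accidentally renamed "answer" to "response".
    return {"response": text, "latency_ms": latency}

fresh = refactored_format("A CI gate for LLM outputs.", 42)

# Keys present in one dict but not the other reveal the drift.
drifted_keys = set(recorded) ^ set(fresh)
assert drifted_keys == {"answer", "response"}
```

A schema check over recorded fixtures is exactly this comparison, generalized to types and required fields.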
Enhancing the Safety of Drive-By PRs
PromptProof's report-only CI also makes drive-by pull requests safer. PRs from external contributors, or from developers less familiar with the codebase, can introduce unintended issues. Because every PR produces an artifact containing schema checks, PII checks, and a cost summary, reviewers can assess the impact of a change without running the model live or digging through extensive logs.
If a drive-by PR violates the output schema or exposes sensitive information such as an email address, the report highlights the problem before merge, heading off security vulnerabilities and functional regressions. The cost summary likewise flags changes that would push a run past the configured budget. PromptProof thus adds a valuable layer of protection that makes drive-by PRs a safer part of your workflow.
Ensuring Deterministic Checks
Determinism is what keeps these checks trustworthy. By default, PromptProof uses a fixed seed and repeats each check multiple times (e.g., runs=3) to avoid flakes, so the same fixtures always produce the same report. This is particularly important for LLM tooling, where nondeterminism elsewhere in the stack can make CI results hard to interpret.
Because the checks replay recorded fixtures rather than calling a live model, the seed governs any randomness in the evaluation itself, and the repeated runs guard against flaky false positives or negatives. PromptProof also doesn't require any secrets, which simplifies setup and avoids exposing credentials in CI. This combination of determinism and security makes it a good fit for LLM development workflows.
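PromptProof's internals aren't shown here, but the principle is easy to sketch in Python (`run_check` is a made-up stand-in, not a PromptProof API): drive any randomness from a local, seeded RNG so that repeated runs agree.

```python
import random

def run_check(records, seed):
    """Toy stand-in for a seeded check: any sampling inside it is
    driven by a local RNG, so a fixed seed gives identical results
    on every run."""
    rng = random.Random(seed)          # local RNG; global state untouched
    order = list(range(len(records)))  # e.g. the order fixtures are evaluated
    rng.shuffle(order)
    return order

records = ["ae-hello-001", "ae-hello-002", "ae-hello-003"]

# Three repeated runs with the same seed must agree; a disagreement
# would signal real nondeterminism rather than flaky noise.
results = [run_check(records, seed=1337) for _ in range(3)]
assert results[0] == results[1] == results[2]
```

Using `random.Random(seed)` rather than the module-level functions keeps the determinism local to the check, so unrelated code can't perturb it.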
Files to Add for Implementation
To get this up and running, you’ll need to add a few files to your project. Don't worry, it’s straightforward! Here’s what you need:
1. .github/workflows/promptproof.yml
This YAML file defines the GitHub Actions workflow for PromptProof. It runs PromptProof on pull requests that touch the workflow itself, the PromptProof config, or the fixtures.
```yaml
name: PromptProof
on:
  pull_request:
    paths:
      - ".github/workflows/promptproof.yml"
      - "promptproof.yaml"
      - "fixtures/promptproof/**"
jobs:
  proof:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: geminimir/promptproof-action@v0
        with:
          config: promptproof.yaml
          runs: 3
          seed: 1337
          max-run-cost: 0.75
          report-artifact: promptproof-report
          mode: report-only
```
Key points here:

- `name`: the name of the workflow.
- `on`: triggers the workflow on pull requests that modify specific paths.
- `jobs`: defines the `proof` job, which runs on `ubuntu-latest`.
- `steps`:
  - `actions/checkout@v4`: checks out your repository.
  - `geminimir/promptproof-action@v0`: runs the PromptProof GitHub Action.
- `with`: configures the action:
  - `config`: specifies the `promptproof.yaml` file.
  - `runs`: sets the number of runs to 3 for deterministic checks.
  - `seed`: sets a seed value for reproducibility.
  - `max-run-cost`: sets a maximum cost for the run.
  - `report-artifact`: names the report artifact.
  - `mode`: set to `report-only` so it doesn't block merges.
2. promptproof.yaml
This YAML file configures PromptProof itself, defining the checks and budgets for your LLM outputs.
```yaml
mode: fail
format: html
fixtures:
  - path: fixtures/promptproof/answer_engine.json
checks:
  - id: answer_schema
    type: schema
    json_schema:
      type: object
      properties:
        output:
          type: object
          properties:
            answer: { type: string, minLength: 1 }
            citations: { type: array, items: { type: string }, nullable: true }
            latency_ms: { type: number, minimum: 0 }
          required: [answer]
      required: [output]
  - id: forbid_emails
    type: regex_forbid
    pattern: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
budgets:
  max_run_cost: 0.75
stability:
  runs: 3
  seed: 1337
```
Let’s break it down:

- `mode`: set to `fail` so checks fail when something goes wrong (merges still aren't blocked, thanks to `report-only` mode in the workflow).
- `format`: specifies the report format as HTML.
- `fixtures`: defines the path to your fixture file.
- `checks`:
  - `answer_schema`: a schema check to ensure the output has a specific structure.
  - `forbid_emails`: a regex check to prevent email addresses in the output.
- `budgets`: sets a maximum run cost.
- `stability`: configures the runs and seed for deterministic checks.
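To show what the `answer_schema` check enforces, here is a hand-rolled, stdlib-only Python approximation (PromptProof uses its own validator; this sketch just mirrors the schema above):

```python
def validate_output(record):
    """Minimal hand-rolled version of the answer_schema check.
    Real JSON Schema validators cover far more; this mirrors only
    the fields declared in promptproof.yaml."""
    out = record.get("output")
    if not isinstance(out, dict):
        return ["output: required and must be an object"]
    errors = []
    answer = out.get("answer")
    if not isinstance(answer, str) or len(answer) < 1:
        errors.append("output.answer: required non-empty string")
    citations = out.get("citations")
    if citations is not None and (
        not isinstance(citations, list)
        or not all(isinstance(c, str) for c in citations)
    ):
        errors.append("output.citations: must be an array of strings")
    latency = out.get("latency_ms")
    if latency is not None and (
        isinstance(latency, bool)
        or not isinstance(latency, (int, float))
        or latency < 0
    ):
        errors.append("output.latency_ms: must be a number >= 0")
    return errors

good = {"output": {"answer": "A CI gate for LLM outputs.",
                   "citations": ["https://example.com/doc"],
                   "latency_ms": 42}}
assert validate_output(good) == []
assert validate_output({"output": {"answer": ""}})  # minLength violation
```

In a real setup you'd let the schema check do this work; the point is just that each declared constraint maps to a concrete test on the recorded output.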
3. fixtures/promptproof/answer_engine.json
This JSON file contains example inputs and outputs for your LLM application. PromptProof uses these fixtures to run the checks.
```json
{
  "record_id": "ae-hello-001",
  "input": { "question": "What is PromptProof?" },
  "output": {
    "answer": "A CI gate for LLM outputs.",
    "citations": ["https://example.com/doc"],
    "latency_ms": 42
  }
}
```
Key elements:

- `record_id`: a unique identifier for the record.
- `input`: the input to your LLM application.
- `output`: the expected output, including `answer`, `citations`, and `latency_ms`.
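To see the `forbid_emails` check in spirit, here is a small Python sketch applying the same pattern to a fixture's output (`passes_forbid` is an illustrative helper, not PromptProof API):

```python
import json
import re

# The exact pattern from the forbid_emails check in promptproof.yaml
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def passes_forbid(record, pattern):
    """True when the serialized output contains no match for the
    forbidden pattern."""
    return pattern.search(json.dumps(record["output"])) is None

clean = {"output": {"answer": "A CI gate for LLM outputs."}}
leaky = {"output": {"answer": "Contact alice@example.com for access."}}

assert passes_forbid(clean, EMAIL)      # no email: check passes
assert not passes_forbid(leaky, EMAIL)  # leaked email: check fails
```

Serializing the whole `output` object before matching means the check catches an email leaked into any field, not just `answer`.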
Benefits for Maintainers
Maintainers get a lot of value from this setup. Each PR produces a single, comprehensive HTML report artifact covering schema results, regex checks, and cost, which makes reviewing the impact of changes fast. And since the check makes zero live calls and needs no secrets, it's cheap to keep around and trivial to delete if you decide you don't need it.
References and Further Exploration
For a sample report, you can check out https://geminimir.github.io/promptproof-action/reports/before.html. It gives you a clear idea of what the HTML report looks like and the information it provides. If you’re curious about the PromptProof Action, you can find it on the GitHub Marketplace. For a more hands-on example, the PromptProof Demo Project is a great resource to see it in action.
Next Steps
If this sounds good, I’m happy to open a 3-file PR with these changes, and we can tweak the checks and paths to fit your preferences. Let me know what you think!