Charter
What a Charter is
Not a project plan. Not a requirements document that executes once and collects dust. A Charter is the memory that survives the chaos. Its value is the decision log: when someone asks six months later why photo capture won over voice-only, or why we did not rip out the ERP, the answer is here, with the alternatives that were weighed and the evidence that settled it. The Architect keeps it current, same-day.
Metadata
| Field |
Value |
| Project |
Field-to-billing data capture |
| Client |
The HVAC and field services company (anonymized) |
| Charter Keeper |
The Architect |
| Dates |
Held in ../../../Clients/; relative markers used here |
| Current canon |
Refine. The system is live across locations and governed. |
| Version |
End-of-first-project state |
Positions
The work was held together by clear accountabilities, not an org chart.
| Position |
Who held it |
Tension owned |
| Sponsor |
Operations leadership |
Authority. Owned the objective and cleared the way. |
| Guide |
First Strategy senior practitioner |
Translation. Carried the method and kept the Charter honest. |
| Architect |
First Strategy |
Curiosity and stewardship. Close observation, design, and Charter Keeper. |
| Sage |
A long-tenured insider |
Context. Opened doors and supplied history. |
| Scout |
A respected field technician |
Empathy. Validated whether a change would actually be adopted. |
| Builder |
First Strategy developer |
Execution. Moved fast and discarded what failed. |
| Finance lead |
Client finance |
Safety. Watched the numbers and caught the drift. |
| Billing validator |
The billing manager |
Integrity. Became the human validation layer. |
On a small team one person can hold several Positions. As the system proved reliable, a Position could be augmented by an AI agent inside documented constraints, with the human shifting from doing to directing and reviewing.
Objectives and constraints
The build specification: what the project set out to do and the lines it would not cross.
Scope
In scope: capture job data at the point of service and get it to billing without a human re-keying paper. Out of scope: scheduling, inventory, routing, procurement. Those are later moves on the roadmap and were not touched in this build.
Objective and success criteria
Capture accurate job data at the point of service with under fifteen seconds of technician effort, and prove it against the baseline.
| Measure |
Baseline |
Target |
Result |
| Invoice error rate |
~38% |
Single digits |
Residential 9%, commercial 12% |
| Billing reconciliation |
3 days |
Same-day |
Same-day on the large majority of jobs |
| Technician capture time |
5 to 10 min on paper |
Under 15 sec, 8 the goal |
~11 sec average |
| Paper usage |
100% |
Down 80%, rest a fallback |
Down ~80%, remainder the fallback path |
Constraints
- Faster than paper or the field rejects it. Under fifteen seconds is the hard line.
- Works in real field conditions: background noise, weather, gloves, a customer waiting.
- Tolerant of lost connectivity in parts of the service area.
- Does not disturb the running systems. It replaces one handoff, nothing else.
Architecture and human-in-the-loop design
A capture client in the technician's hand. An extraction layer that turns photo and voice into structured fields, trained on the operation's own historical invoices so it knows the local vocabulary and voices. A human-validation step. A flow into the billing system. A dual-path design: voice primary, photographed paper fallback, both handled by the same extraction layer. Every failed transcription becomes training data.
During the pilot, the billing reviewer validated every submission: photo, transcription, and each extracted field, checked against the job record. Misses were logged with the pattern that caused them, part names confused, phrases misread, unclear photos, and fed back to improve the model. That log became the checklist for the next location's reviewer. The validation grip loosened only later, under the governance tiers in the Hierarchy of Agency, never before the evidence supported it. A human stays accountable for every billing decision.
Current state at the start
Carried from the Day One Audit. The dispatcher schedules from judgment the software cannot hold. The real friction is the paper handoff from field to billing. The baseline: a 38 percent invoice correction rate, six hours of daily rework per location, and a three-day reconciliation lag. Roughly 80 percent of invoice errors originate at the paper-capture step.
Decision log
The decisions that shaped the build, each with the alternatives weighed and the evidence that settled it. This is the part of the Charter that answers "why did we do it this way."
| When |
Decision |
Alternatives rejected |
Rationale |
Evidence |
| Interrogate, wk 1 |
Rule scheduling out as the first move |
Build the AI scheduling the vendors pitched |
The thing leadership and three vendors believed was never tested on the floor |
Experiment 1: the visibility board went all but unused |
| Interrogate, wk 2 |
Target the field-to-billing handoff |
Keep looking; replace the billing system |
The errors cluster at one step, not across the system |
Experiment 2: error tracing put ~80% at paper capture |
| Interrogate, wk 2 |
Treat the dispatcher as a later move, not the first |
Automate dispatch now |
A third of her work is judgment that should stay human |
Experiment 3: a day mapping her decision logic, ~70/30 |
| Interrogate, wk 3 |
Kill the digital form |
Roll a digital form out to the field |
Slower than paper and failed on low signal |
Experiment 4: technicians abandoned it within days |
| Solve, wk 1 |
Build dual-path capture: voice primary, photo-of-paper fallback |
Voice only; chase noise cancellation or better microphones |
Voice fails in field noise, but the field needs a path that always works |
First field test returned garbage on a noisy rooftop |
| Solve, wk 3 |
Human reviews every submission during the pilot |
Trust AI output and spot-check |
AI hallucinated on a meaningful share of submissions; humans must direct AI |
Pilot review caught duration and part-name errors before billing |
| Solve, exec review |
Replace one component at a time |
Rip out the ERP and rebuild from scratch |
Clean-slate replacements overrun for years and still fail |
Two prior failures cited; the pilot proved a single handoff can be replaced and validated |
| Expand |
Build a commercial version with milestone billing |
Force the same-day model onto commercial work |
Commercial jobs span phases; firing an invoice on every capture billed for unfinished work |
The third location broke to a 55% error rate |
| Refine |
Monitor accuracy by segment, alert at a 5% drop |
Trust the aggregate dashboard |
Aggregates hide a single failing segment |
The drift incident: one service line fell while the headline held |
| Refine |
Graduate routine residential to Tier 1 oversight |
Keep full human review everywhere |
Autonomy is earned on evidence, not granted |
90 days above 95% accuracy, disputes below baseline, zero boundary violations |
The decision and experiment record
The supporting narrative behind the log. The project ran the full WISER method. Witness had already found where AI fit; the project picked up at Interrogate and ran through Refine.
Interrogate
Three cheap experiments tested the inherited diagnosis, then three more tested how to capture data without slowing the technician. The scheduling theory died when a visibility board went unused and became, by the end of the week, a place to post a fantasy football league. Error tracing put roughly 80 percent of invoice errors at the paper-capture step. A day mapping the dispatcher's logic showed about 70 percent pattern-based and 30 percent judgment. Then: a digital form, killed for being slower than paper; a photo of the paper plus AI transcription, which dropped errors without changing technician behavior and was kept as the fallback path; and photo plus voice, which raised the question of removing paper entirely and set the Solve target.
Solve
The build target was photo plus voice, under fifteen seconds, eight the goal. The first field test exposed the jagged frontier of AI: it parsed complex part numbers without trouble and choked on wind noise on a commercial rooftop. The hard part was the noise, not the vocabulary. The team added the dual path and retrained the voice model on real field audio, compressors and truck engines and wind. Voice recognition in noisy conditions climbed [from about 60 to about 85 percent], with the paper-photo fallback catching the rest.
The pilot ran in one location: twelve technicians, four weeks, a human in the loop on every submission. The billing manager reviewed each one and logged every miss. The misses were not random; the model struggled with one trade's vocabulary and with certain spoken numbers, which is the same weakness that would later resurface as drift. The pilot held: invoice errors at 9 percent down from 38, clarification calls near zero, same-day reconciliation, eleven-second average capture, paper down 80 percent. One location recovered about 47,000 dollars in a single month.
At the executive review, leadership wanted to rip out the whole back-office system. The recommendation held: replace one component at a time. Ripping out everything at once is how a project becomes a multi-year overrun that still does not work.
Expand
The second location, residential, matched the first and proved it was not a fluke. The third was fully commercial and broke. The system had learned that a capture means a finished job, because every job in the first two locations was same-day. Commercial work captures at every phase, so the system fired invoices for unfinished jobs and the error rate climbed to 55 percent, worse than paper. The team pulled it within days, returned that location to paper, observed commercial work for three days, and rebuilt for milestone billing. Building it surfaced a constraint: the billing system did not support milestone billing, which forced an intermediary layer and a slipped timeline that the Builder owned directly to the client. The commercial version recovered the location from 55 to 11 percent. All locations went live, two to a wave, residential at 9 percent and commercial at 12.
Refine
Going live was not the end. Stable is not governed. The team moved from building to watching.
Hierarchy of Agency
Three tiers of human oversight by job risk. The technician's field confirmation is the first gate; the tier governs what happens after that confirmation.
| Tier |
Oversight |
Applies to |
| 1: Auto-approve with spot-checks |
No review of individual invoices. Random spot-checks on a small share, plus monthly red-team testing |
Low-risk routine jobs the AI has proven reliable on |
| 2: Oversight on key details |
A one-screen summary highlights price, parts, and duration. One click to approve if the highlighted fields look right |
Medium-risk jobs where specific fields need a check |
| 3: Full review |
Every invoice gets human eyes before it goes out. No exceptions |
High-risk jobs: complex repairs, commercial projects |
A job type moves to less oversight only on evidence: accuracy above 95 percent for 90 consecutive days, disputes below the pre-system baseline, zero boundary violations. Routine residential graduated to Tier 1 on this evidence. If a tier drifts, it falls back to more oversight. A human is accountable for every billing decision, at every tier. Autonomy is grown, not granted.
Risk register
| Risk |
Mitigation |
Status |
| Field rejects anything slower than paper |
Hard fifteen-second constraint; validated with a respected technician before building |
Held; capture ran ~11 sec |
| Voice fails in noise |
Dual-path fallback to photographed paper; retrained on real field audio |
Resolved; noise accuracy climbed and the fallback catches the rest |
| Lost connectivity |
Offline-tolerant capture with deferred sync |
Designed in |
| AI hallucination on edge cases |
Human review on every submission during pilot; hallucination log; segment monitoring later |
Active control |
| A new context breaks assumptions |
Sequence the rollout; test a genuinely different context before scaling |
Realized at the commercial location; recovered |
| Drift hidden by aggregate metrics |
Monitor by segment, alert on any type dropping >5% from baseline, weekly drift review |
Added after the drift incident |
Drift and incident record
After routine residential graduated to Tier 1, the system drifted. Extraction accuracy for one service line quietly dropped while the headline accuracy still looked fine. The training data had been heavier on the trade that adopted first, so the model learned that trade deeply and the other shallowly, and disputes for the weaker line rose before anyone saw it in the aggregate. It was caught by watching dispute trends by service type, not by the headline metric.
The response:
| Action |
Detail |
| Retrain |
Balanced data for the under-served line, including fresh sample voice notes |
| Segment monitoring |
Track accuracy by service type; alert if any type drops more than 5% from baseline |
| Weekly drift review |
A standing 30-minute review of segment accuracy, dispute trends, and anomalies, led by the Finance lead |
| Remediation |
Credits issued to customers billed wrong during the drift window |
The weaker line recovered. The lesson logged: AI optimizes for what you measure, so measure segments, not just aggregates, from the start. Any drift past threshold triggers a fall-back to more oversight under the Hierarchy of Agency until resolved.
Evolution history
How the oversight posture changed over time, and why.
| When |
Change |
Trigger |
| Pilot |
100% human review on every submission |
Trust not yet earned |
| Post-pilot rollout |
Same posture carried location to location; each wave monitored |
Expansion by evidence |
| After Tier 1 graduation |
Routine residential auto-approved with spot-checks and monthly red-team |
Graduation criteria met |
| After the drift incident |
Added segment monitoring, 5% alerts, weekly drift review, and the fall-back rule |
One service line drifted under the aggregate |
Current status and the autonomy transfer
Three projects delivered. The client now runs the method largely on their own, and we are on a low monthly advisory retainer. The relationship moved through tiers the same way a job type graduates to less oversight: on evidence.
| Tier |
Project |
We held |
Client held |
| Lead |
Project one: the data-capture build |
The Guide role and the build |
Watched the process work, learned the method |
| Co-guide |
Project two (unnamed) |
Co-guiding, technical oversight, trained their Guide |
Led the doing |
| Oversight |
Project three |
Oversight and judgment calls |
Led the project |
| Advisory |
Now |
A monthly check-in and a second set of eyes |
Owns the capability |
The goal was never to be permanently needed. It was to build the muscle and step back. The client continues to apply the same process to new components. If a problem arises that is worth more than oversight, we scope it the same way we scope any engagement.
Outcomes
- Invoice errors: residential 9 percent and commercial 12 percent, down from 38.
- Billing reconciliation: same-day, down from three days.
- Capture time: about eleven seconds per job, faster than paper.
- Recovered billing: about 247,000 dollars per month, roughly 2.96 million annually.
- No layoffs. Billing clerks who chased data-entry errors became exception handlers who catch what the AI misses and feed corrections back. The same team now carries more locations than before. Capacity grew instead of headcount.
Plays
The WISER plays this engagement ran, instantiated with the client's specifics. This is the index and what each produced. The high-value plays are held as standalone documents; the rest were applied inline in this Charter. | Canon | Play | What it produced | Source |
|-------|------|------------------|--------|
| Witness | Friction Mapping | The friction table and root-cause read at the field-to-billing handoff | Standalone play |
| Witness | Field Observation, User Flow Mapping, Documenting Current State | The end-to-end job trace and systems landscape | Inline in the Day One Audit |
| Interrogate | Assumption Auditing | The register of inherited beliefs to test, scheduling theory first | Standalone play |
| Interrogate | Experiment Selection, Logging, Rapid Prototyping | The six experiments and what each ruled in or out | Standalone play |
| Solve | Human-in-the-Loop Design | The billing-reviewer validation layer and miss log | Standalone play |
| Solve | Quality Objective Setting, Pilot Planning, Value Validation | The success criteria, the one-location pilot, the measured recovery | Inline above |
| Expand | Readiness Check, Sequencing, Context Fit, Deployment Gating | The wave plan and the commercial-context pivot | Inline above |
| Refine | Hierarchy of Agency Design | The three oversight tiers and graduation evidence | Standalone play |
| Refine | Drift Monitoring, Incident Response | The segment-monitoring plan and the drift fix | Standalone play |
| Refine | Graduation Decision Making, Red Team Testing | The evidence thresholds and tier-one spot-checks | Inline above |
The first problem is solved, and it will not be the last. The operation now has a list of candidates for the same treatment: inventory, scheduling support, technician routing, parts procurement. The difference is that the team knows how to do this now. They built the muscle.