CASE STUDY

HVAC and Field Services.

A First Strategy case study.

Company name is held in confidence.

The story

The diagnosis everyone agreed on was wrong

A multi-location HVAC and field services operator brought us in to answer one question: where AI actually fits in a business like theirs. Three vendors before us had pitched scheduling, the demos had impressed, and every pilot had died once dispatchers tried to use it. The Director of Operations stopped asking for a scheduling tool and started asking a better question.

What was at stake

Margin was bleeding through a back office nobody had flagged as the problem. Invoices were going out wrong, billing was running days behind the work, and the field-to-office handoff was eating hours a day in rework. Another failed AI attempt would cost the field's trust in any attempt after it, and the operator had no appetite for a multi-year rebuild that might not work. The next move had to be small enough to be reversible and clear enough to prove out in weeks, not quarters.

What the floor revealed

We asked for a day on the floor before proposing anything. The morning with leadership lined up with the inherited diagnosis: dispatch was the bottleneck, scheduling was the move. The rest of the day did not.

The lead dispatcher was not scheduling in the dispatch software. It sat minimized on her screen while she worked from a hand color-coded spreadsheet refined over years. The software knew where technicians were; it did not know which one runs long on certain days and why, or which customers always add work once the truck is in the driveway. "This is how I schedule," she said. "The software is for billing." Automating her would have ripped out the part of the operation that worked.

The real friction sat downstream. At the billing desk we watched a clerk pull a paper ticket from a basket, squint at handwriting, guess at a part number, and dial a technician who would not pick up. By late morning, three invoices were stuck waiting for a callback and two had already gone out with errors. This was a normal morning. More than a third of invoices came back needing correction. About 40 percent of jobs triggered a clarification call. Billing ran three days behind the work. About six hours a day, per location, went to cleaning up a handoff that should have taken seconds. When we traced the errors back, roughly eighty percent originated at one step: the paper form in the technician's hand. Not scheduling. A piece of paper.

The strategic shift

Instead of automating where every vendor wanted to sell, we fixed the handoff that was actually carrying the loss. The dispatcher's judgment stayed. The paper-first handoff at the point of service went. The trade-off was visible: a smaller first move than leadership had been pitched, but one we could prove in weeks instead of years, and reverse without harm if it did not work.

The second shift was how we would build. A confident team would have started construction. We ran cheap experiments first, designed to kill our own ideas before they got expensive. Evidence on the floor would force the next decision, not the inherited theory and not our own.

How we built it

Three experiments came first, days each. A scheduling visibility board in front of technicians lasted a week before becoming a place to post a fantasy football league; nobody used it, because nobody returns to a board when their next job arrives by text. A digital form was slower than paper and failed the moment a technician lost signal in a basement. The third experiment kept the paper, photographed each finished ticket, let AI read it into billing, and texted the technician to confirm anything unclear. Invoice errors fell from thirty-eight percent to nine with technicians doing nothing different. The signal was clear: capture had to be lighter than paper, not heavier.

The hard line for the build was fifteen seconds of capture in a customer's driveway, eight the target. Voice was the primary path, photographed paper the fallback, the same AI reading both. The first field test taught the lesson no specification would have: the system parsed a complicated part number cleanly, then returned pure garbage on a commercial rooftop, defeated by wind across the microphone. The team retrained on real field audio, compressors and truck engines and wind, instead of recordings from a quiet office.

During the pilot the billing manager reviewed every submission. One afternoon she stopped on a routine water-heater swap: a standard replacement, about two hours, that the AI had logged as eight. She pulled the photo, confirmed the job, screenshotted the error, and walked it to the developer's desk. The system was working as designed: not an AI that is never wrong, but an AI whose mistakes a human sees before a customer does. Her error log became the training data for the next week and the checklist the next location's reviewer would use.

We expanded one location at a time. The second, residential like the first, went smoothly. The third broke on purpose. It was fully commercial, where jobs run across days instead of hours, and the system had learned a false rule from residential: that a capture means a finished job. Errors climbed to fifty-five percent. We pulled it within days, put that location back on paper, and spent three days watching commercial crews. Commercial jobs have milestones; billing should fire on a milestone, not on every capture. The fix surfaced a constraint nobody had flagged, a billing system that did not support milestone billing, which forced an intermediary layer and a slipped timeline. The team owned the missed estimate directly with the client. The commercial version brought that location from fifty-five percent down to eleven, and the rollout finished, two locations to a wave.

Stable was not the same as governed. Once the system was live everywhere, extraction accuracy on plumbing jobs drifted while the headline dashboard stayed green. Catching it took watching disputes by service type, not by total. The team retrained on balanced data, added an alert whenever any segment fell more than five percent from baseline, credited the customers billed wrong during the drift window, and built three tiers of human oversight by job risk. A job type earned less oversight only on evidence: ninety days above ninety-five percent accuracy and zero boundary violations. Routine residential earned it. Nothing earned it by decree.
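The two governance rules above can be sketched in a few lines. This is a minimal illustration, not the system that was built: the record type, field names, and sample accuracies are hypothetical, and reading "five percent from baseline" as a relative drop is an assumption the text does not settle.

```python
from dataclasses import dataclass

# Thresholds named in the text. Treating "five percent" as a relative drop
# from baseline is an assumption; it could equally mean percentage points.
DRIFT_THRESHOLD = 0.05
MIN_DAYS_ABOVE_TARGET = 90
TARGET_ACCURACY = 0.95


@dataclass
class SegmentStats:
    """Hypothetical per-service-type record; field names are illustrative."""
    name: str
    baseline_accuracy: float
    current_accuracy: float
    days_above_target: int    # consecutive days with accuracy above TARGET_ACCURACY
    boundary_violations: int


def drift_alert(seg: SegmentStats) -> bool:
    """True when the segment has fallen more than DRIFT_THRESHOLD from its baseline."""
    drop = (seg.baseline_accuracy - seg.current_accuracy) / seg.baseline_accuracy
    return drop > DRIFT_THRESHOLD


def earns_less_oversight(seg: SegmentStats) -> bool:
    """Evidence rule: ninety days above the accuracy target and zero boundary violations."""
    return (
        seg.days_above_target >= MIN_DAYS_ABOVE_TARGET
        and seg.boundary_violations == 0
    )


# Illustrative numbers only, not measurements from the engagement.
segments = [
    SegmentStats("routine residential", 0.97, 0.97, 120, 0),
    SegmentStats("plumbing", 0.96, 0.89, 0, 0),
]

for seg in segments:
    status = "alert" if drift_alert(seg) else "ok"
    tier = "reduce oversight" if earns_less_oversight(seg) else "keep review"
    print(f"{seg.name}: {status}, {tier}")
```

Checking each segment on its own is the point of the design: a total across all job types can stay green while one service type drifts.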

What changed

Across the operation, recovered billing reached about two hundred forty-seven thousand dollars a month, on the order of three million dollars a year. One location alone recovered about forty-seven thousand dollars in a single month, billing that would otherwise have been disputed, delayed, or written off.

Billing went from a three-day lag to same-day. The billing team's job inverted. Clerks who had spent their mornings deciphering handwriting became exception handlers, catching what the AI missed and feeding each correction back so it missed less the next time. Nobody was laid off. The same team now carries more locations than it could before. Capacity grew where headcount would have.

Invoice errors settled at nine percent on residential work and twelve on commercial, down from thirty-eight. Capture averaged about eleven seconds, faster than paper. Paper usage fell by roughly eighty percent, with the remainder kept as the fallback path. Clarification calls landed near zero.

What stands as proof

The Day One Audit recorded the baseline before any build. The pre-build arithmetic and the measured result agreed, which is the point: the number was earned, not promised. The pilot held in one residential location, then a second, then broke on purpose in a commercial context and was rebuilt to fit. The Audit and the Playbook stand as the engagement record.

We then handed the capability over the way we built it. We led the first project and built it ourselves. We handed the second to their team, trained a guide on their side, co-guided alongside them, and held the technical calls. We oversaw the third while they led it outright. They now run the process on their own, with a monthly advisory check-in.

The point was never to be permanently needed. It was to fix the right problem first, prove the approach in the operator's own building, and build the muscle to find the next one. The team knows the first problem will not be the last. They now know how to find it.


The deliverables

Day One Proposal

Day One

Prepared for the operator, on referral.

What this is

One day in your business with your leadership. Real work, not slides.

In the morning, we sit with you and your leadership. What is running. What is stuck. Where AI already shows up in the business, and where it does not. What you have already tried, including the AI scheduling that impressed in a demo and never got used.

The rest of the day, we walk the work with the people who do it. The dispatcher. A technician through a full job. The billing desk. What their day actually looks like, what slows them down, where they would want help. On the floor, not in a side room. Because you run across [several] locations, the day follows the real work end to end rather than sitting in one office. This is the part of the day that does the work.

We come ready to listen and to think on our feet. No prepared deck.

What you walk away with

A playbook, within two weeks. Not a deck. Not a recommendation memo hiding in a PDF. A written read for operators.

The day pulls signal from your floor. The two weeks after are when we test what we heard and decide what would matter most. That work turns the signal into a sequenced plan you can run.

The playbook answers three questions:

  • Where AI fits in your business, and where it does not.
  • The highest-leverage moves we saw, sequenced so you can act on them in order. A roadmap, not a list.
  • What it would take to run the sequence: with your own team, with another firm, or with us.

The playbook is yours. Run it however makes sense.

What we need from you

  • Your leadership for the morning, including the Director of Operations who brought us in.
  • Access to the people doing the work for the rest of the day: dispatch, the field, and billing.
  • A day on the floor, in the business, not on video calls.

The terms

A flat fee of [flat fee] for the day and the playbook. Travel and expenses billed at cost, on top.

No retainer. No commitment beyond the day itself. If we are the right fit for what comes next, we will already have been talking about what that looks like. If we are not, the playbook is still yours to run.

What happens next

After the playbook, you decide. Run it with your own team, hand it to another firm, or build it with us. If the work points to a build we are right for, we will scope it in a separate proposal once the playbook has shown what is worth building.


Day One Audit

The one-line finding

Your AI opportunity is not where you were told to look. Leadership and three prior vendors pointed at scheduling. A day on your floor found the friction somewhere else entirely: the handoff that carries job data from the field to billing. That handoff sends more than a third of invoices back for correction, loses roughly six hours per day per location to rework, and runs a three-day billing reconciliation lag. The waste is large, contained to one handoff, measurable against a clear baseline, and reversible. It is the right place for the first AI work. Scheduling, where every vendor wanted to sell, is barely involved in the problem.

How we looked, and how we measured

One day in the business, with a short follow-up to quantify what we saw. The morning with leadership, capturing the diagnosis you carried without accepting it. The rest of the day on the floor, shadowing the work as it actually happens: dispatch, a technician through a full job, and the billing desk.

The numbers in this audit are measured, not assumed. We shadowed [a sample of jobs across three of the locations], timed each handoff with a stopwatch, and counted corrections against [a sample of recent invoices pulled from the billing system]. Where a figure is an extrapolation rather than a direct count, this audit says so. We did not optimize the documented process; we found the real one, and the most load-bearing parts of it are nowhere in any process document.

The systems landscape

Four disconnected tools do the work that should belong to one, and none of them is the system you think it is.

| System | Official role | Actual role |
|---|---|---|
| Dispatch software | The scheduling system of record | A passthrough kept current for billing, not used to schedule |
| Dispatcher's spreadsheet | Not official | The real scheduling system, refined over years, on one desktop |
| Paper job tickets | A field formality | The actual source of truth for everything downstream |
| Accounting and billing system | Billing of record | Fed by hand from paper, the place errors surface |

Paper is the connective tissue between all four. Paper cannot be searched, verified, or automated. Every downstream system inherits whatever the paper got wrong.

Stakeholder map

Each role defines the problem differently. None is wrong about their own pain. None sees the whole.

| Role | What they own | Where their pain is | Their definition of the problem |
|---|---|---|---|
| Operations leadership | Throughput and dispatch | Manual work everywhere | Scheduling is the bottleneck |
| Lead dispatcher | The schedule | The software cannot hold her judgment | The software is for billing, not for her |
| Field technicians | Job capture | Anything slower than paper | Leave my workflow alone |
| Billing team | Transcription and invoicing | Drowning in corrections and clarification calls | We fix more invoices than we send |
| Finance | Margin | Quiet revenue leakage from billing errors | The errors are eating margin |

The gap between leadership's definition and the billing team's definition is the whole case. Leadership looked where the vendors pointed. The people downstream had been naming the real problem for years. The billing manager put it plainly: we spend more time fixing invoices than sending them.

The work, end to end

We followed one residential job from dispatch to invoice. The path crosses six handoffs and four systems, with paper in the middle of all of them.

  1. Dispatcher assigns the job from her spreadsheet.
  2. Dispatch software updated afterward, for billing's benefit.
  3. Technician arrives, diagnoses, completes the work.
  4. Technician records the job on a paper ticket by hand, gets a signature on the carbon copy.
  5. Paper tickets pile up and are dropped at dispatch at end of day.
  6. The next morning, a billing clerk re-keys the paper into the billing system, squinting at handwriting, guessing at part numbers, calling technicians who are back in the field.

We watched this play out at the billing desk in real time. By late morning, on a normal day, [three invoices were stuck waiting for a technician callback, and two had already gone out with errors that would need correction later]. The invoice goes out days after the work. A large share comes back disputed.

Friction quantified

Measured on the floor, by timing handoffs and counting errors across the jobs and invoices we sampled:

| Friction point | Frequency | Time impact | Error rate or effect | How measured |
|---|---|---|---|---|
| Paper ticket completion | Every job | 5 to 10 min per job | 25% illegible or incomplete | Timed across shadowed jobs; legibility judged at the billing desk |
| Re-keying paper into billing | Every job | 8 min per job | 15% data-entry errors | Timed at the clerk's desk; errors counted against the source ticket |
| Billing clarification calls | ~40% of jobs | 12 min per call | Blocks the invoice 24+ hours | Counted against the sampled jobs |
| Invoice corrections | ~38% of invoices | 20 min per correction | Customer trust erodes | Counted against sampled recent invoices |

Total handoff overhead: roughly six hours per day per location. That is the sum of the per-job times above across a location's daily job volume, and it is rework time only. It does not count the invoices that are never corrected and the margin that leaks with them.

How 38 percent compares

A field-services back office in good shape keeps invoice rework in [the low single digits to low teens of percent]. Correcting 38 percent of invoices is several times a normal rate. This is not a back office that needs tightening. It is a structural defect in how data reaches billing.

The error stack and root cause

Errors compound as the job moves downstream.

| Stage | Error introduced | Type |
|---|---|---|
| Dispatch to technician | ~2% | Minor scheduling conflicts |
| Job capture on paper | ~25% | Illegible, incomplete, wrong codes |
| Paper re-keyed into billing | ~15% | Transcription errors |
| Cumulative at invoice | ~38% | Combination of the above |

About 80 percent of invoice errors originate at the paper-capture step. The paper form at the point of service is the root cause. Everything downstream is inheriting bad data and paying to clean it up. A new billing system would not fix this. Better data capture would.
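The stage rates in the table sum to about 42 percent, yet the cumulative figure is about 38. One reading that reconciles the two, offered here as an assumption rather than as the audit's stated method, is that each stage introduces errors independently and an invoice counts as wrong once no matter how many stages damaged it. A minimal sketch under that assumption:

```python
# Stage error rates from the error-stack table.
# Independence between stages is an illustrative assumption.
stage_error_rates = {
    "dispatch_to_technician": 0.02,
    "paper_capture": 0.25,
    "rekey_into_billing": 0.15,
}


def cumulative_error_rate(rates):
    """Probability that at least one stage introduces an error."""
    clean = 1.0
    for rate in rates:
        clean *= 1.0 - rate   # chance the invoice survives this stage clean
    return 1.0 - clean


total = cumulative_error_rate(stage_error_rates.values())
print(f"cumulative error rate: {total:.1%}")
```

Under this reading the compounded rate lands within a point of the measured 38 percent, which suggests the table's stages are roughly consistent with one another.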

The scheduling myth

Every vendor pointed at scheduling because that is where they had a product. The floor says scheduling is not broken. The dispatcher schedules from judgment the software cannot hold: which technician runs long on which days and why, which customers always add work on arrival, how traffic moves on a given route at a given hour. Automating that would damage a part of the business that works.

Her knowledge splits roughly into two parts: about 70 percent is pattern-based and could be supported by data over time, and about 30 percent is judgment that should stay human. That split is a later opportunity, not the first move.

Opportunity, sized

Directional, to be validated cheaply before any build, but sized with arithmetic rather than adjectives. Two buckets.

Rework cost, the floor. Six hours per day per location of pure rework, at [a loaded labor cost of about 30 dollars an hour], over [about 22 working days a month], is roughly [4,000 dollars] per location per month in labor alone, before a single dispute. Across the operation that is in the low tens of thousands per month. This bucket is real but it is not the prize.
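The rework arithmetic can be checked directly. The inputs below are the bracketed figures from the paragraph above; the function and parameter names are illustrative, and the location count is left as a parameter because the record does not state it.

```python
def monthly_rework_cost(hours_per_day=6, loaded_rate=30, working_days=22):
    """Labor cost of rework for one location in one month, in dollars."""
    return hours_per_day * loaded_rate * working_days


def operation_rework_cost(locations, per_location=None):
    """Scale the per-location figure; `locations` must be supplied by the reader."""
    if per_location is None:
        per_location = monthly_rework_cost()
    return locations * per_location


per_location = monthly_rework_cost()
print(f"rework per location per month: ${per_location:,}")
```

Six hours at 30 dollars over 22 days comes to 3,960 dollars, which the audit rounds to roughly 4,000.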

Recovered revenue, the prize. The larger loss is billing that is disputed, delayed past collection, or written off because the invoice was wrong. At [a representative volume of roughly 1,500 invoices per location per month] and [an average invoice around 450 dollars], a 38 percent error rate touches a large share of revenue, and even a few points of that revenue leaking through billing errors is tens of thousands of dollars per location per month.

Put together, these put credible per-location recovery in the tens of thousands of dollars per month, compounding across the operation into seven figures a year. We flagged this as directional and said it must be proven cheaply before any build. It was. The build later recovered about 47,000 dollars in a single location in a single month, and reached about 247,000 dollars per month across the operation, roughly 2.96 million dollars a year. The pre-build arithmetic and the measured result agree, which is the point: this number was earned, not promised.
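The recovered-revenue sizing follows the same pattern. This sketch uses the audit's bracketed volume and invoice value; the leak percentages in the loop are hypothetical stand-ins for "a few points", not measured values.

```python
INVOICES_PER_MONTH = 1_500   # representative volume per location, from the audit
AVERAGE_INVOICE = 450        # dollars, from the audit


def monthly_billed(invoices=INVOICES_PER_MONTH, average=AVERAGE_INVOICE):
    """Revenue invoiced by one location in one month, in dollars."""
    return invoices * average


def leakage(points, billed=None):
    """Dollars lost per location per month if `points` percent of billing leaks."""
    if billed is None:
        billed = monthly_billed()
    return billed * points / 100


def annualize(monthly):
    """Twelve months at a constant monthly figure."""
    return monthly * 12


for points in (2, 3, 5):     # hypothetical leak rates for illustration
    print(f"{points} points of leakage: ${leakage(points):,.0f} per location per month")

print(f"annualized recovery at $247,000 a month: ${annualize(247_000):,}")
```

Even at the low end of these illustrative rates the leak exceeds the rework labor bucket, which is why the audit calls recovered revenue the prize.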

Where AI fits, and where it does not

  • Fits: the field-to-billing data-capture handoff. Contained, measurable, reversible, and carrying the most waste.
  • Does not fit yet: scheduling. Judgment-heavy and working. Touching it first would automate the wrong thing and lose the field's trust.

Risks and constraints we observed

  • The field will reject anything slower than paper. Whatever gets built must beat paper on speed, not match it.
  • Connectivity is unreliable in parts of the service area. Any field tool needs an offline-tolerant path.
  • Critical knowledge sits in a few experienced heads. That is a resilience risk as well as an opportunity.
  • The running systems are load-bearing. They must not be ripped out. They must be improved one handoff at a time.

The signal we leave with

The first AI move is the field-to-billing data capture. Before committing to a build, three assumptions need cheap tests: that scheduling visibility is not what technicians want, that the handoff is genuinely the source of the errors, and that the dispatcher's knowledge is replicable enough to support later. Those tests are where the work goes next. The plan, sized by impact, is the Playbook and Delivery Proposal.


Playbook and Delivery Proposal

The playbook and the delivery proposal are one document because they are one act. The playbook says where AI fits and sizes the moves in order. The delivery proposal scopes the build for the move you choose to start with. The first earns the second. Nothing past the first move is committed until the first move proves the approach in your business.

Part One: The Playbook

A written read for operators, not a deck. It answers three questions: where AI fits in your business and where it does not, the highest-leverage moves in sequence, and what it takes to run them.

Where AI fits, and where it does not

It fits the field-to-billing data-capture handoff, where roughly 80 percent of invoice errors originate and where six hours a day per location are lost to rework. It does not fit scheduling, which runs on judgment your dispatcher holds better than software would. The evidence is in the Day One Audit. The short version: the paper handoff from the field to billing is the most contained, most measurable, most reversible waste in the operation, and it is where the first AI work belongs.

How to read the roadmap

Two of these moves we diagnosed on your floor and can size with measured numbers. The rest we saw the shape of but did not diagnose, and we say so rather than dress them up. Honesty about what is proven and what is a candidate is the difference between a roadmap and a sales sheet.

Each move is read across six dimensions: time, accuracy and quality, cost and recovered revenue, growth, employee experience, and risk. The first move earns the right to the next.

The roadmap at a glance

| # | Move | Status | Leverage | Containment | Why it sits here |
|---|---|---|---|---|---|
| 1 | Field-to-billing data capture | Diagnosed, sized | Highest | One handoff | Most waste, most reversible. Start here. |
| 2 | Dispatch decision support | Diagnosed, sized | High | One role | Needs the clean data move 1 produces. |
| 3 | Inventory and truck stock | Candidate, not yet diagnosed | Medium-high | One workflow | Builds on the parts data move 1 captures. |
| 4 | Technician routing density | Candidate, not yet diagnosed | Medium | One decision loop | Compounds with moves 2 and 3. |
| 5 | Parts procurement | Candidate, not yet diagnosed | Medium | One supplier loop | Last, once demand data is trustworthy. |

Move 1: Field-to-billing data capture (start here)

Capture job data at the point of service in a form that flows straight to billing, with AI doing the transcription and a human reviewing it while trust is earned. Faster than paper for the technician, or they will not use it.

  • Time: removes about 8 minutes of re-keying per job, 12-minute clarification calls on roughly 40 percent of jobs, and 20-minute corrections on roughly 38 percent of invoices. Reclaims on the order of six hours per day per location, all of it measured in the audit.
  • Accuracy and quality: invoice error rate from about 38 percent toward single digits. The handful of errors that remain get caught by a human before the invoice goes out, not by the customer after.
  • Cost and recovered revenue: recovers billing that is currently disputed, delayed, or written off. Sized in the audit at tens of thousands of dollars per location per month, compounding across the operation into seven figures a year. This is the one move where the estimate can be checked against a result: the build later recovered about 47,000 dollars in one location in one month and about 247,000 dollars per month across the operation.
  • Growth: same-day billing instead of a three-day lag improves cash flow and frees capacity without adding headcount. The billing team carries more locations than it could before.
  • Employee experience: technicians keep their fast workflow. Billing clerks move from chasing errors to handling exceptions, which is higher-value work and a better job.
  • Risk: lowest on the board. Contained to one handoff, reversible to paper at any point. The right place to prove AI in this business.

Move 2: Dispatch decision support

Once capture holds, support the dispatcher with the pattern-based part of her work while keeping judgment human. We spent a day documenting her decision logic: about 70 percent is pattern-based and replicable with data, about 30 percent is judgment that stays with her. This move needs the clean, structured job data that move 1 produces, which is why it comes second.

  • Time: targets the routine assignment work, [an estimated two to three hours a day per dispatcher] of pattern-based decisions a support tool could propose for her approval.
  • Accuracy and quality: more consistent ETAs and technician matching, drawn from the durations and job data the capture system now records rather than from memory.
  • Cost and recovered revenue: more jobs completed per technician per day from tighter assignment and fewer return trips. Even [one additional completed job per technician per day] across the fleet is a large gain, because the marginal job is close to pure margin.
  • Growth: capacity to take on more volume without proportional dispatch headcount, and resilience if the lead dispatcher is out, which today is a single point of failure.
  • Employee experience: removes the resilience risk of knowledge held in one head, and frees the dispatcher for the 30 percent only she can do.
  • Risk: medium. Judgment stays human. The system proposes, the dispatcher approves. The 30 percent is never automated.

The later moves: named, not yet diagnosed

The audit surfaced three more candidates. We name them so the roadmap is honest about where this goes, but we did not diagnose them on the floor, so we will not pretend to size them to the dollar. Each gets its own cheap validation before any build, the same way move 1 did.

  • Move 3: Inventory and truck stock. The parts data that move 1 captures on every job becomes a forecast of what each truck should carry. The waste to chase: return trips for a missing part, and holding cost from stocking everything instead of the right things. The number to beat is first-visit completion rate.
  • Move 4: Technician routing density. Tighter clustering of jobs across the day, to complete more per truck without longer hours. The waste to chase is windshield time. The number to beat is completed jobs per truck per day.
  • Move 5: Parts procurement. Once demand data is trustworthy across moves 1 and 3, buy from forecast and price history rather than job by job. The waste to chase is expedite premiums and emergency orders. The number to beat is price variance.

Each of these is a contained handoff or decision with its own measurable waste. None is committed now. They earn their turn only after the moves ahead of them prove out.

What it takes to run the moves

The discipline matters more than the technology.

  • Test cheap before building. Days, not months. Rule out the inherited scheduling theory directly, then test capture approaches against what technicians actually use in the field.
  • Build for real conditions. Faster than paper, works in a noisy driveway, tolerant of lost signal. If it is slower than paper, it fails.
  • Keep a human in the loop. Every AI output reviewed while trust is earned. Billing already knows what correct job data looks like. That becomes the validation layer.
  • Expand by evidence, one location at a time. Prove it, prove it is not a fluke, then test a genuinely different context before scaling everywhere.
  • Govern what gets built. Grow autonomy against documented thresholds, with a human accountable for every billing decision.

The plays that run each canon come from our reusable plays library. The ones selected for this engagement are instantiated in the Charter.

Who runs it

This can run with your own team, with another firm, or with us. It needs a few clear accountabilities: someone who owns the objective and clears the way, someone who does close observation and design, a builder who moves fast and discards what fails, a respected field voice whose adoption signals whether a change holds, and someone who watches the numbers for drift. You have most of these people. The build capability is the piece you would bring in.

The recommended first move and the 90-day frame

Start with the data-capture handoff. It is the highest-leverage, most contained, most measurable move on the board. The first 90 days: cheap experiments to validate the approach and rule out the wrong paths, a working build piloted in one location with a human reviewing every submission, and a measured result against the baseline that decides whether to expand. Prove AI works at one handoff and the operation gains not just a fixed handoff but the capability to fix the next one.

Part Two: The Delivery Proposal

The proposal to build the playbook's first move: the field-to-billing data-capture system. Scoped only after the playbook showed what is worth building.

What we understand

Technicians capture job data on paper. Humans re-key it into billing. That handoff produces a 38 percent invoice correction rate, clarification calls on about 40 percent of jobs, and a three-day billing reconciliation lag. Compounded, it runs to roughly six hours of rework per day per location. The scheduling software you were told to fix was never the problem. This handoff is.

What we will build

A field data-capture system that moves job data from the technician to billing without a human re-keying paper. AI does the transcription. A human reviews it while trust is earned. Technicians spend less time on capture than they do on paper, not more, or they will not use it.

How we will work

Four phases, mapped to the WISER canons. Each phase is independently valuable, priced on its own, and earns the next. The engagement can stop at any phase boundary with value already in hand.

Phase 1: Interrogate

Cheap experiments before any production build. Rule out the scheduling theory directly. Test capture approaches against what technicians will actually use in the field. Days, not months.

  • Validate where the friction really compounds.
  • Test capture options and measure them against paper on speed and accuracy.
  • End of phase: a validated approach, with the wrong paths ruled out for the cost of a few days.

Phase 2: Solve

Build the working system and pilot it in one location with a human in the loop on every submission.

  • Build the capture system for real field conditions, with a fallback path for when the primary fails.
  • Pilot in one location. Billing validates every submission and logs what the AI misses.
  • End of phase: the system proven in one location against the baseline, with the numbers to show it.

Phase 3: Expand

Roll out location by location. Residential first, then commercial, adapting to context rather than assuming it transfers.

  • Prove it in a second similar location. Then test it where the context is genuinely different.
  • Adapt the system to contexts the first locations did not have.
  • End of phase: all locations live, residential and commercial, monitored through each wave.

Phase 4: Refine

Grow autonomy as reliability is proven, under documented governance.

  • Define tiers of human oversight by job risk.
  • Set the evidence thresholds that let a job type move to less oversight.
  • Stand up the monitoring and review cadence that catches drift before it compounds.
  • End of phase: a governed system where autonomy has grown where it was earned, and a human is accountable for every billing decision.

What we need from you

  • A field validator. A respected technician who will test honestly and whose adoption signals whether the field will follow.
  • A billing validator. Someone who knows what correct job data looks like and will review the AI's output during the pilot.
  • Leadership air cover and weekly check-ins.
  • Access to the systems the capture touches.

Infrastructure

You provide the field devices, the billing system access, and the software licenses for the AI services used. We provide the build, the AI architecture, and the implementation.

Who is working on this

A senior practitioner who leads the engagement and owns the objective with your team, and a builder who does the development. Your people fill the field, billing, and operations seats. Small team, close to the work.

Investment

Phased. Each phase is priced on its own so the engagement can stop at any phase boundary with value already delivered. The fee basis and amounts are held in ../../../Clients/. We did not fabricate figures for this anonymized record.


Charter

What a Charter is

Not a project plan. Not a requirements document that executes once and collects dust. A Charter is the memory that survives the chaos. Its value is the decision log: when someone asks six months later why photo capture won over voice-only, or why we did not rip out the ERP, the answer is here, with the alternatives that were weighed and the evidence that settled it. The Architect keeps it current, same-day.

Metadata

| Field | Value |
|---|---|
| Project | Field-to-billing data capture |
| Client | The HVAC and field services company (anonymized) |
| Charter Keeper | The Architect |
| Dates | Held in ../../../Clients/; relative markers used here |
| Current canon | Refine. The system is live across locations and governed. |
| Version | End-of-first-project state |

Positions

The work was held together by clear accountabilities, not an org chart.

| Position | Who held it | Tension owned |
|---|---|---|
| Sponsor | Operations leadership | Authority. Owned the objective and cleared the way. |
| Guide | First Strategy senior practitioner | Translation. Carried the method and kept the Charter honest. |
| Architect | First Strategy | Curiosity and stewardship. Close observation, design, and Charter Keeper. |
| Sage | A long-tenured insider | Context. Opened doors and supplied history. |
| Scout | A respected field technician | Empathy. Validated whether a change would actually be adopted. |
| Builder | First Strategy developer | Execution. Moved fast and discarded what failed. |
| Finance lead | Client finance | Safety. Watched the numbers and caught the drift. |
| Billing validator | The billing manager | Integrity. Became the human validation layer. |

On a small team one person can hold several Positions. As the system proved reliable, a Position could be augmented by an AI agent inside documented constraints, with the human shifting from doing to directing and reviewing.

Objectives and constraints

The build specification: what the project set out to do and the lines it would not cross.

Scope

In scope: capture job data at the point of service and get it to billing without a human re-keying paper. Out of scope: scheduling, inventory, routing, procurement. Those are later moves on the roadmap and were not touched in this build.

Objective and success criteria

Capture accurate job data at the point of service with under fifteen seconds of technician effort, and prove it against the baseline.

| Measure | Baseline | Target | Result |
|---|---|---|---|
| Invoice error rate | ~38% | Single digits | Residential 9%, commercial 12% |
| Billing reconciliation | 3 days | Same-day | Same-day on the large majority of jobs |
| Technician capture time | 5 to 10 min on paper | Under 15 sec, 8 the goal | ~11 sec average |
| Paper usage | 100% | Down 80%, rest a fallback | Down ~80%, remainder the fallback path |

Constraints

  • Faster than paper or the field rejects it. Under fifteen seconds is the hard line.
  • Works in real field conditions: background noise, weather, gloves, a customer waiting.
  • Tolerant of lost connectivity in parts of the service area.
  • Does not disturb the running systems. It replaces one handoff, nothing else.

Architecture and human-in-the-loop design

A capture client in the technician's hand. An extraction layer that turns photo and voice into structured fields, trained on the operation's own historical invoices so it knows the local vocabulary and voices. A human-validation step. A flow into the billing system. A dual-path design: voice primary, photographed paper fallback, both handled by the same extraction layer. Every failed transcription becomes training data.

During the pilot, the billing reviewer validated every submission: photo, transcription, and each extracted field, checked against the job record. Misses were logged with the pattern that caused them, part names confused, phrases misread, unclear photos, and fed back to improve the model. That log became the checklist for the next location's reviewer. The validation grip loosened only later, under the governance tiers in the Hierarchy of Agency, never before the evidence supported it. A human stays accountable for every billing decision.
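The dual-path idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system that was built: the `Extraction` type, the `extract` callable, and the `min_confidence` threshold are all hypothetical names introduced here. The only parts taken from the Charter are the ordering (voice first, photographed paper as fallback) and the single shared extraction layer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Extraction:
    fields: dict        # structured job fields produced by the extraction layer
    source: str         # "voice" or "photo"
    confidence: float   # hypothetical 0..1 score; not described in the Charter


def capture_job(
    voice_note: Optional[bytes],
    paper_photo: Optional[bytes],
    extract: Callable[[bytes, str], Extraction],
    min_confidence: float = 0.8,  # assumed threshold, for illustration only
) -> Optional[Extraction]:
    """Try the voice path first; fall back to the photographed paper form."""
    if voice_note is not None:
        result = extract(voice_note, "voice")
        if result.confidence >= min_confidence:
            return result
    if paper_photo is not None:
        # Same extraction layer, different input path.
        return extract(paper_photo, "photo")
    return None  # nothing usable was captured; the caller decides what to do
```

Whichever path produces the result, it would still pass through the human-validation step before reaching billing.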

Current state at the start

Carried from the Day One Audit. The dispatcher schedules from judgment the software cannot hold. The real friction is the paper handoff from field to billing. The baseline: a 38 percent invoice correction rate, six hours of daily rework per location, and a three-day reconciliation lag. Roughly 80 percent of invoice errors originate at the paper-capture step.

Decision log

The decisions that shaped the build, each with the alternatives weighed and the evidence that settled it. This is the part of the Charter that answers "why did we do it this way."

| When | Decision | Alternatives rejected | Rationale | Evidence |
|---|---|---|---|---|
| Interrogate, wk 1 | Rule scheduling out as the first move | Build the AI scheduling the vendors pitched | The thing leadership and three vendors believed was never tested on the floor | Experiment 1: the visibility board went all but unused |
| Interrogate, wk 2 | Target the field-to-billing handoff | Keep looking; replace the billing system | The errors cluster at one step, not across the system | Experiment 2: error tracing put ~80% at paper capture |
| Interrogate, wk 2 | Treat the dispatcher as a later move, not the first | Automate dispatch now | A third of her work is judgment that should stay human | Experiment 3: a day mapping her decision logic, ~70/30 |
| Interrogate, wk 3 | Kill the digital form | Roll a digital form out to the field | Slower than paper and failed on low signal | Experiment 4: technicians abandoned it within days |
| Solve, wk 1 | Build dual-path capture: voice primary, photo-of-paper fallback | Voice only; chase noise cancellation or better microphones | Voice fails in field noise, but the field needs a path that always works | First field test returned garbage on a noisy rooftop |
| Solve, wk 3 | Human reviews every submission during the pilot | Trust AI output and spot-check | AI hallucinated on a meaningful share of submissions; humans must direct AI | Pilot review caught duration and part-name errors before billing |
| Solve, exec review | Replace one component at a time | Rip out the ERP and rebuild from scratch | Clean-slate replacements overrun for years and still fail | Two prior failures cited; the pilot proved a single handoff can be replaced and validated |
| Expand | Build a commercial version with milestone billing | Force the same-day model onto commercial work | Commercial jobs span phases; firing an invoice on every capture billed for unfinished work | The third location broke to a 55% error rate |
| Refine | Monitor accuracy by segment, alert at a 5% drop | Trust the aggregate dashboard | Aggregates hide a single failing segment | The drift incident: one service line fell while the headline held |
| Refine | Graduate routine residential to Tier 1 oversight | Keep full human review everywhere | Autonomy is earned on evidence, not granted | 90 days above 95% accuracy, disputes below baseline, zero boundary violations |

The decision and experiment record

The supporting narrative behind the log. The project ran the full WISER method. Witness had already found where AI fit; the project picked up at Interrogate and ran through Refine.

Interrogate

Three cheap experiments tested the inherited diagnosis, then three more tested how to capture data without slowing the technician. The scheduling theory died when a visibility board went unused and became, by the end of the week, a place to post a fantasy football league. Error tracing put roughly 80 percent of invoice errors at the paper-capture step. A day mapping the dispatcher's logic showed about 70 percent pattern-based and 30 percent judgment. Then: a digital form, killed for being slower than paper; a photo of the paper plus AI transcription, which dropped errors without changing technician behavior and was kept as the fallback path; and photo plus voice, which raised the question of removing paper entirely and set the Solve target.

Solve

The build target was photo plus voice, under fifteen seconds, eight the goal. The first field test exposed the jagged frontier of AI: it parsed complex part numbers without trouble and choked on wind noise on a commercial rooftop. The hard part was the noise, not the vocabulary. The team added the dual path and retrained the voice model on real field audio, compressors and truck engines and wind. Voice recognition in noisy conditions climbed from about 60 to about 85 percent, with the paper-photo fallback catching the rest.

The pilot ran in one location: twelve technicians, four weeks, a human in the loop on every submission. The billing manager reviewed each one and logged every miss. The misses were not random; the model struggled with one trade's vocabulary and with certain spoken numbers, which is the same weakness that would later resurface as drift. The pilot held: invoice errors at 9 percent down from 38, clarification calls near zero, same-day reconciliation, eleven-second average capture, paper down 80 percent. One location recovered about 47,000 dollars in a single month.

At the executive review, leadership wanted to rip out the whole back-office system. The recommendation held: replace one component at a time. Ripping out everything at once is how a project becomes a multi-year overrun that still does not work.

Expand

The second location, residential, matched the first and proved it was not a fluke. The third was fully commercial and broke. The system had learned that a capture means a finished job, because every job in the first two locations was same-day. Commercial work captures at every phase, so the system fired invoices for unfinished jobs and the error rate climbed to 55 percent, worse than paper. The team pulled it within days, returned that location to paper, observed commercial work for three days, and rebuilt for milestone billing. Building it surfaced a constraint: the billing system did not support milestone billing, which forced an intermediary layer and a slipped timeline that the Builder owned directly to the client. The commercial version recovered the location from 55 to 11 percent. All locations went live, two to a wave, residential at 9 percent and commercial at 12.
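The assumption that broke at the third location fits in one function. The sketch below is illustrative only: the `Capture` fields and the `milestone_reached` flag are hypothetical, and the real intermediary layer is not described in this record. It shows the difference between "every capture fires an invoice" and a rule that waits for a billable milestone on commercial work.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    job_id: str
    job_type: str            # "residential" or "commercial" (assumed labels)
    milestone_reached: bool  # hypothetical flag: a billable phase has closed


def should_invoice(capture: Capture) -> bool:
    """Decide whether a capture should trigger an invoice."""
    if capture.job_type == "residential":
        # Same-day work: in the first two locations a capture meant a finished job.
        return True
    # Commercial work captures at every phase; bill only at a milestone.
    return capture.milestone_reached
```

Under the original behavior, the commercial branch effectively returned `True` on every capture, which is how unfinished jobs were invoiced.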

Refine

Going live was not the end. Stable is not governed. The team moved from building to watching.

Hierarchy of Agency

Three tiers of human oversight by job risk. The technician's field confirmation is the first gate; the tier governs what happens after that confirmation.

| Tier | Oversight | Applies to |
|---|---|---|
| 1: Auto-approve with spot-checks | No review of individual invoices. Random spot-checks on a small share, plus monthly red-team testing | Low-risk routine jobs the AI has proven reliable on |
| 2: Oversight on key details | A one-screen summary highlights price, parts, and duration. One click to approve if the highlighted fields look right | Medium-risk jobs where specific fields need a check |
| 3: Full review | Every invoice gets human eyes before it goes out. No exceptions | High-risk jobs: complex repairs, commercial projects |

A job type moves to less oversight only on evidence: accuracy above 95 percent for 90 consecutive days, disputes below the pre-system baseline, zero boundary violations. Routine residential graduated to Tier 1 on this evidence. If a tier drifts, it falls back to more oversight. A human is accountable for every billing decision, at every tier. Autonomy is grown, not granted.

Risk register

| Risk | Mitigation | Status |
|---|---|---|
| Field rejects anything slower than paper | Hard fifteen-second constraint; validated with a respected technician before building | Held; capture ran ~11 sec |
| Voice fails in noise | Dual-path fallback to photographed paper; retrained on real field audio | Resolved; noise accuracy climbed and the fallback catches the rest |
| Lost connectivity | Offline-tolerant capture with deferred sync | Designed in |
| AI hallucination on edge cases | Human review on every submission during pilot; hallucination log; segment monitoring later | Active control |
| A new context breaks assumptions | Sequence the rollout; test a genuinely different context before scaling | Realized at the commercial location; recovered |
| Drift hidden by aggregate metrics | Monitor by segment, alert on any type dropping >5% from baseline, weekly drift review | Added after the drift incident |

Drift and incident record

After routine residential graduated to Tier 1, the system drifted. Extraction accuracy for one service line quietly dropped while the headline accuracy still looked fine. The training data had been heavier on the trade that adopted first, so the model learned that trade deeply and the other shallowly, and disputes for the weaker line rose before anyone saw it in the aggregate. It was caught by watching dispute trends by service type, not by the headline metric.

The response:

| Action | Detail |
|---|---|
| Retrain | Balanced data for the under-served line, including fresh sample voice notes |
| Segment monitoring | Track accuracy by service type; alert if any type drops more than 5% from baseline |
| Weekly drift review | A standing 30-minute review of segment accuracy, dispute trends, and anomalies, led by the Finance lead |
| Remediation | Credits issued to customers billed wrong during the drift window |

The weaker line recovered. The lesson logged: AI optimizes for what you measure, so measure segments, not just aggregates, from the start. Any drift past threshold triggers a fall-back to more oversight under the Hierarchy of Agency until resolved.

Evolution history

How the oversight posture changed over time, and why.

| When | Change | Trigger |
|---|---|---|
| Pilot | 100% human review on every submission | Trust not yet earned |
| Post-pilot rollout | Same posture carried location to location; each wave monitored | Expansion by evidence |
| After Tier 1 graduation | Routine residential auto-approved with spot-checks and monthly red-team | Graduation criteria met |
| After the drift incident | Added segment monitoring, 5% alerts, weekly drift review, and the fall-back rule | One service line drifted under the aggregate |

Current status and the autonomy transfer

Three projects delivered. The client now runs the method largely on their own, and we are on a low monthly advisory retainer. The relationship moved through tiers the same way a job type graduates to less oversight: on evidence.

| Tier | Project | We held | Client held |
|---|---|---|---|
| Lead | Project one: the data-capture build | The Guide role and the build | Watched the process work, learned the method |
| Co-guide | Project two (unnamed) | Co-guiding, technical oversight, trained their Guide | Led the doing |
| Oversight | Project three | Oversight and judgment calls | Led the project |
| Advisory | Now | A monthly check-in and a second set of eyes | Owns the capability |

The goal was never to be permanently needed. It was to build the muscle and step back. The client continues to apply the same process to new components. If a problem arises that is worth more than oversight, we scope it the same way we scope any engagement.

Outcomes

  • Invoice errors: residential 9 percent and commercial 12 percent, down from 38.
  • Billing reconciliation: same-day, down from three days.
  • Capture time: about eleven seconds per job, faster than paper.
  • Recovered billing: about 247,000 dollars per month, roughly 2.96 million annually.
  • No layoffs. Billing clerks who chased data-entry errors became exception handlers who catch what the AI misses and feed corrections back. The same team now carries more locations than before. Capacity grew instead of headcount.

Plays

The WISER plays this engagement ran, instantiated with the client's specifics. This is the index and what each produced. The high-value plays are held as standalone documents; the rest were applied inline in this Charter.

| Canon | Play | What it produced | Source |
|-------|------|------------------|--------|
| Witness | Friction Mapping | The friction table and root-cause read at the field-to-billing handoff | Standalone play |
| Witness | Field Observation, User Flow Mapping, Documenting Current State | The end-to-end job trace and systems landscape | Inline in the Day One Audit |
| Interrogate | Assumption Auditing | The register of inherited beliefs to test, scheduling theory first | Standalone play |
| Interrogate | Experiment Selection, Logging, Rapid Prototyping | The six experiments and what each ruled in or out | Standalone play |
| Solve | Human-in-the-Loop Design | The billing-reviewer validation layer and miss log | Standalone play |
| Solve | Quality Objective Setting, Pilot Planning, Value Validation | The success criteria, the one-location pilot, the measured recovery | Inline above |
| Expand | Readiness Check, Sequencing, Context Fit, Deployment Gating | The wave plan and the commercial-context pivot | Inline above |
| Refine | Hierarchy of Agency Design | The three oversight tiers and graduation evidence | Standalone play |
| Refine | Drift Monitoring, Incident Response | The segment-monitoring plan and the drift fix | Standalone play |
| Refine | Graduation Decision Making, Red Team Testing | The evidence thresholds and tier-one spot-checks | Inline above |

The first problem is solved, and it will not be the last. The operation now has a list of candidates for the same treatment: inventory, scheduling support, technician routing, parts procurement. The difference is that the team knows how to do this now. They built the muscle.


The plays

WITNESS

Friction Map

Witness play, instantiated for the HVAC services engagement. Purpose: locate and quantify where the work breaks, so the build targets a real problem rather than a theoretical one.

The handoffs

One job, dispatch to invoice. Six handoffs, four systems, paper in the middle.

  1. Dispatcher assigns from her spreadsheet.
  2. Dispatch software updated afterward, for billing.
  3. Technician completes the work.
  4. Technician records the job on paper, customer signs the carbon copy.
  5. Paper dropped at dispatch at end of day.
  6. Billing clerk re-keys paper into the billing system the next morning.

The friction, quantified

| Friction point | Frequency | Time impact | Error rate |
|---|---|---|---|
| Paper ticket completion | Every job | 5 to 10 min | 25% illegible or incomplete |
| Re-keying into billing | Every job | 8 min | 15% data-entry errors |
| Billing clarification calls | ~40% of jobs | 12 min | Blocks invoice 24+ hours |
| Invoice corrections | ~38% of invoices | 20 min | Customer trust erodes |

Aggregate: roughly six hours per day per location. Across eight locations, on the order of forty-eight hours of daily waste.

The error stack

| Stage | Error introduced | Type |
|---|---|---|
| Dispatch to technician | ~2% | Scheduling conflicts |
| Job capture on paper | ~25% | Illegible, incomplete, wrong codes |
| Re-keyed into billing | ~15% | Transcription |
| Cumulative at invoice | ~38% | Combination |
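One way to read the cumulative figure: if each stage introduces errors independently, the share of invoices that survive all three stages clean is the product of the per-stage clean rates. The independence assumption is ours, added for illustration; the audit does not state how the cumulative number was derived. Under that assumption the stage rates above land close to the observed ~38 percent.

```python
def cumulative_error(stage_rates):
    """Share of invoices carrying at least one error, assuming independent stages."""
    clean = 1.0
    for rate in stage_rates:
        clean *= (1.0 - rate)  # probability the invoice passes this stage clean
    return 1.0 - clean


# Stage rates taken from the error stack: dispatch, paper capture, re-keying.
rates = [0.02, 0.25, 0.15]
print(f"{cumulative_error(rates):.1%}")  # 37.5%
```

The errors do not simply add (2 + 25 + 15 would give 42 percent) because some invoices pick up more than one error along the way.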

Root cause

About 80 percent of invoice errors originate at the paper-capture step. The paper form at the point of service is the root cause. The hottest point on the map is the field-to-billing handoff, not scheduling.

INTERROGATE

Assumption Register

Interrogate play, instantiated for the HVAC services engagement. Purpose: surface the assumptions to test before committing to a build, including the diagnosis the client arrived with. Each gets a cheap test and a clear bar for proof.

  • Killed: Technicians need better scheduling visibility. Believed by leadership and three prior vendors. A visibility board ran for a week and went unused.
  • Confirmed: The field-to-billing handoff causes the errors. An end-to-end shadow of one day of jobs put roughly 80 percent of invoice errors at the paper-capture step.
  • Reframed: The dispatcher's knowledge is a single point of failure. About 70 percent pattern-based and replicable with data; about 30 percent judgment that stays human. A later move, not the first.

| # | Assumption | Source | Cheap test | What proves or kills it |
|---|---|---|---|---|
| 1 | Technicians need better scheduling visibility | Leadership and prior vendors | A simple visibility board in front of technicians for a week | Proven if they use it; killed if they ignore it |
| 2 | The field-to-billing handoff causes the errors | The audit's friction map | A manual end-to-end shadow of one day of jobs, tracking where errors enter | Proven if errors cluster at the paper step; killed if they cluster elsewhere |
| 3 | The dispatcher's knowledge is a single point of failure | Observation of the spreadsheet | A day documenting her decision logic, sorting it into pattern-based versus judgment | Confirmed if most is replicable with data; reframed if it is mostly judgment |

Why assumption 1 is tested at all: leadership and three prior vendors believed it for years. Ruling it out directly, cheaply, means nobody second-guesses the real finding later.

Results are recorded in the Experiment Log.

Experiment Log

Interrogate play, instantiated for the HVAC services engagement. Purpose: record each experiment, its result, and what it changed. Cheap experiments, days each, before any production build.

  1. Scheduling visibility board: Invalidated. Almost no use. Scheduling ruled out for good.
  2. End-to-end error shadow: Confirmed. ~80% of invoice errors originate at paper capture. Root cause fixed.
  3. Dispatcher decision logic: Reframed. ~70% pattern-based, ~30% judgment. A later opportunity.
  4. Digital form to replace paper: Rejected. Slower than paper; failed in low signal. Killed the form approach.
  5. Photo of paper plus AI: Worked. Errors dropped sharply with no behavior change. Kept as the fallback path.
  6. Photo plus voice: Promising. Set the Solve target: under 15 seconds, paper retired.

Hypothesis experiments

| # | Experiment | Result | What it changed |
|---|---|---|---|
| 1 | Scheduling visibility board in front of technicians for a week | Invalidated. Almost no use; technicians get their next job by text and want less input, not more output | Scheduling ruled out for good |
| 2 | Manual end-to-end shadow of one day of jobs, tracking where errors enter | Confirmed. About 80 percent of invoice errors originate at the paper-capture step | Root cause fixed: data capture at point of service |
| 3 | A day documenting the dispatcher's decision logic | Partially confirmed. Roughly 70 percent pattern-based and replicable with data, 30 percent judgment that stays human | Reframed the dispatcher as a later opportunity, not the first move |

Capture-approach experiments

Having located the handoff as the target, we tested how to capture data without slowing the technician.

| # | Experiment | Result | What it changed |
|---|---|---|---|
| 4 | A digital form to replace paper | Rejected. Slower than paper; too many taps in a customer's driveway; failed in low-signal areas | Killed the form approach |
| 5 | Photo of the paper form plus AI transcription | Worked. Errors dropped sharply without asking technicians to change behavior | Confirmed AI transcription; kept as the fallback path |
| 6 | Photo plus voice to eliminate paper entirely | Promising. Raised the hypothesis that paper could be removed, not just transcribed | Set the Solve target: under 15 seconds, photo plus voice |

Outcome

Two of three inherited hypotheses overturned or reframed. The build target set with the wrong paths ruled out for the cost of a few days. Next: Solve. The build it set up is recorded in the Charter.

SOLVE

Human-in-the-Loop Design

Solve play, instantiated for the HVAC services engagement. Purpose: define who reviews AI output, how, and what gets logged, so a human directs the AI rather than the reverse. A person stays accountable for every billing decision.

During the pilot

Technician captures (photo + voice, ~11 sec) → AI extracts (structured job data) → Human reviews (every submission during pilot) → Billing (same-day invoice)

Every miss is logged with its pattern and fed back: retraining data for the model, a checklist for the next location's reviewer.

The billing reviewer validates every submission: the photo, the voice transcription, and each extracted field. They already know what correct job data looks like from years of catching errors.

| Element | Design |
|---|---|
| Who reviews | The billing reviewer, on every submission |
| What they check | Photo, transcription, and each extracted field against the job record |
| What gets logged | Every miss, with the pattern that caused it: part names confused, phrases misread, unclear photos |
| What the log feeds | Retraining the model and a checklist for the next location's reviewer |
| Accountability | A named human owns each billing decision; the AI never sends unreviewed during pilot |
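A miss log of this kind can be very small. The sketch below is a hypothetical shape for it, not the schema the pilot used: the `Miss` record and the `reviewer_checklist` helper are names introduced here. It shows how logging the pattern behind each miss makes the "checklist for the next location's reviewer" a simple count.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Miss:
    submission_id: str
    field: str     # which extracted field was wrong (hypothetical naming)
    pattern: str   # e.g. "part name confused", "phrase misread", "unclear photo"


def reviewer_checklist(misses: List[Miss], top_n: int = 3) -> List[str]:
    """Return the most frequent miss patterns, most common first."""
    counts = Counter(m.pattern for m in misses)
    return [pattern for pattern, _ in counts.most_common(top_n)]
```

The same records, kept with the original photo and transcription, would double as labeled examples for retraining.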

How the grip loosens

Validation does not stay at 100 percent forever, and it does not loosen by decree. It loosens only under the agency tiers in the Hierarchy of Agency, and only when the evidence thresholds are met. Until then, every output gets human eyes.

Why this matters

The validation log is not overhead. It is the training data and the early-warning system. When the system later drifted, the discipline of watching outputs by segment is what made the catch possible. See Drift Monitoring.

REFINE

Drift Monitoring

Refine play, instantiated for the HVAC services engagement. Includes the incident response for the drift this engagement caught. Purpose: watch for drift, including drift that looks stable in aggregate but is failing a segment, and define how a failure is caught, contained, and fixed.

The incident

  1. Routine residential graduates to Tier 1. 90 days above 95% accuracy, disputes below baseline, zero violations.
  2. One service line drifts, quietly. Training data skewed to the trade that adopted first. The aggregate dashboard stays green while segment disputes rise.
  3. Caught by segment, not by the headline. Dispute trends watched by service type expose what the aggregate hid.
  4. The three-part fix, plus remediation. Retrain on balanced data; segment alerts at a 5% drop; a standing weekly drift review. Customers billed wrong are credited.
  5. Recovered, with the lesson logged. AI optimizes for what you measure. Measure segments, not just aggregates, from the start.

After routine residential graduated to Tier 1, the system drifted. Extraction accuracy for one service line quietly dropped while the headline accuracy still looked fine. The cause: the training data was heavier on the service line that adopted first, so the AI learned that line deeply and the other shallowly. Disputes for the weaker line rose before anyone saw it in the aggregate number.

It was caught by watching dispute trends by service type, not by the headline metric. The aggregate dashboard was measuring the wrong thing.

The fix

| Action | Detail |
|---|---|
| Retrain | Feed the model balanced data for the under-served service line, including fresh sample voice notes |
| Segment monitoring | Track accuracy by service type; alert if any type drops more than 5 percent from baseline |
| Weekly drift review | A standing 30-minute review of accuracy by segment, dispute trends, and anomalies, led by the person who watches the numbers |
| Remediation | Credits issued to customers affected during the drift window |

The weaker line recovered. The deeper lesson was logged: AI optimizes for what you measure. Measure segments, not just aggregates, from the start.
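A small numeric sketch shows why the headline can hold while a segment fails. Everything here is illustrative: the segment names, accuracies, and volumes are made up, and the 5 percent threshold is read as five percentage points below baseline, which is an assumption since the record does not say whether the drop is absolute or relative.

```python
from typing import Dict, List


def weighted_accuracy(accuracy: Dict[str, float], volume: Dict[str, int]) -> float:
    """Headline accuracy: per-segment accuracy weighted by job volume."""
    total = sum(volume.values())
    return sum(accuracy[seg] * volume[seg] for seg in accuracy) / total


def drifting_segments(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.05,  # assumed to mean percentage points, not relative change
) -> List[str]:
    """Segments whose accuracy fell more than max_drop below their baseline.

    A segment missing from `current` is treated as 0.0 and therefore flagged.
    """
    return sorted(
        seg for seg, base in baseline.items()
        if base - current.get(seg, 0.0) > max_drop
    )


# Hypothetical numbers: one high-volume line improves while a small line degrades.
baseline = {"line_a": 0.96, "line_b": 0.95}
current = {"line_a": 0.97, "line_b": 0.86}
volume = {"line_a": 900, "line_b": 100}

print(round(weighted_accuracy(baseline, volume), 3))  # 0.959
print(round(weighted_accuracy(current, volume), 3))   # 0.959
print(drifting_segments(baseline, current))           # ['line_b']
```

In this made-up case the aggregate does not move at all, yet the smaller line has lost nine points. Only the per-segment check raises the alert.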

Standing monitoring

  • Weekly drift review, by segment.
  • Segment accuracy alerts at the 5 percent threshold.
  • Red-team testing: deliberately try to break extraction with edge cases and ambiguous inputs.
  • Any drift past threshold triggers a fall-back to more oversight under the Hierarchy of Agency until resolved.

Incident response pattern

Catch by segment, not aggregate. Contain by falling back a tier. Fix the root cause, not the symptom. Document the cause, detection, and fix in the Charter so the next person does not relearn it.

Hierarchy of Agency

Refine play, instantiated for the HVAC services engagement. Purpose: define tiers of human oversight by job risk, so autonomy grows where it is earned and a human stays accountable everywhere.

The tiers

  • Tier 1: Auto-approve with spot-checks. Low-risk routine jobs the AI has proven reliable on. Random spot-checks plus monthly red-team testing.
  • Tier 2: Oversight on key details. Medium-risk jobs. A one-screen check on price, parts, and duration; one click to approve.
  • Tier 3: Full review, no exceptions. High-risk jobs: complex repairs, commercial projects. Every invoice gets human eyes.

A job type moves to less oversight only on evidence: 90 days above 95% accuracy, disputes below baseline, zero boundary violations. Drift sends it back to more oversight.

| Tier | Oversight | Applies to |
|---|---|---|
| 1: Auto-approve with spot-checks | No review of individual invoices. Random spot-checks on a small share, plus monthly red-team testing | Low-risk routine jobs the AI has proven reliable on |
| 2: Oversight on key details | A summary screen highlights price, parts, and duration. One click to approve if the highlighted fields look right | Medium-risk jobs where specific fields need a check |
| 3: Full review | Every invoice gets human eyes before it goes out. No exceptions | High-risk jobs: complex repairs, commercial projects |

Job-type mapping

  • Routine residential maintenance: Tier 1, after graduation.
  • Standard residential repairs: Tier 2.
  • Complex repairs and commercial projects: Tier 3.
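The mapping above can be expressed as a lookup. This is a sketch, not the deployed routing: the string keys are hypothetical identifiers, and defaulting unknown job types to full review is our assumption, chosen because it is the conservative direction.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTO_APPROVE = 1   # spot-checks and monthly red-team testing
    KEY_DETAILS = 2    # one-screen check on price, parts, duration
    FULL_REVIEW = 3    # every invoice gets human eyes


# Hypothetical job-type identifiers for the three categories listed above.
JOB_TYPE_TIERS = {
    "routine_residential_maintenance": Tier.AUTO_APPROVE,  # after graduation
    "standard_residential_repair": Tier.KEY_DETAILS,
    "complex_repair": Tier.FULL_REVIEW,
    "commercial_project": Tier.FULL_REVIEW,
}


def tier_for(job_type: str) -> Tier:
    """Look up the oversight tier; anything unrecognized gets full review."""
    return JOB_TYPE_TIERS.get(job_type, Tier.FULL_REVIEW)
```

Falling back a tier after drift then amounts to changing one entry in the table, which keeps the change visible and easy to log.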

Graduation criteria

A job type moves to less oversight only on evidence:

  • Accuracy above 95 percent for 90 consecutive days.
  • Disputes below the pre-system baseline.
  • Zero boundary violations.

Routine residential graduated to Tier 1 on this evidence. The technician's field confirmation is the first gate; the tier governs what happens after that confirmation.
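The graduation criteria are concrete enough to sketch as a check. The version below makes two assumptions that the record does not confirm: accuracy is measured once per day, and "90 consecutive days" is read as every one of the most recent 90 daily readings being above the bar. Function and parameter names are illustrative.

```python
from typing import Sequence


def eligible_for_less_oversight(
    daily_accuracy: Sequence[float],   # one reading per day, oldest first (assumed)
    dispute_rate: float,
    baseline_dispute_rate: float,      # the pre-system dispute rate
    boundary_violations: int,
    min_accuracy: float = 0.95,
    window_days: int = 90,
) -> bool:
    """True only if all three graduation criteria are met."""
    if len(daily_accuracy) < window_days:
        return False  # not enough evidence yet
    recent = daily_accuracy[-window_days:]
    return (
        all(a > min_accuracy for a in recent)
        and dispute_rate < baseline_dispute_rate
        and boundary_violations == 0
    )
```

A check like this only says a job type is eligible. The decision to graduate it still belongs to a named human.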

The rule that does not move

A human is accountable for every billing decision, at every tier. Autonomy is grown, not granted. If a tier drifts, it falls back to more oversight. See Drift Monitoring.

Where each of these started.

Every one of these engagements started with a day. A fixed-fee day in the business with leadership. Real work, not slides. A playbook within two weeks. Then a decision.
