Why we built software-sales-bench
A practical benchmark for testing whether AI models understand business problems before recommending software.
For our work at Platfio, we need models with good commercial instincts about the problems businesses actually face, and how those problems can be addressed with software.
Unfortunately, a lot of models have a startup-shaped reflex. Ask them about a normal business and they can drift toward the next “AI blockchain Uber-for-dog-food-pizza-delivery marketplace”, or a grand platform idea that sounds exciting in a pitch deck but does not help anyone reduce admin, collect payments, improve utilization, retain customers, or protect margin.
Of course, you can usually push a model in the right direction with heavy system prompting, examples, constraints, and careful product framing. But for this work, I wanted to know which models already have the instincts: which ones naturally understand ordinary commercial problems before being over-directed.
Most businesses do not need a fantasy marketplace. They need software that produces a return on investment: fewer manual follow-ups, cleaner operations, faster approvals, better customer experience, and clearer visibility into where time and money are leaking.
That is why I created software-sales-bench. The goal is to assess current models and inform the playbooks, model selector, and recommended models used across Platfio’s single-turn, multi-turn, and agentic workflow products.
The summary of the first results is as follows.
The benchmark is open source here: github.com/hcassar93/software-sales-bench.
Those results are not the only input into Platfio defaults, but they are one of the practical calibration sources behind them. The benchmark helps decide which models should appear as sensible defaults, which ones are worth keeping available for higher-risk work, and which playbook paths need stronger guardrails before the agent starts writing proposal logic for a real business.
Screenshots: agent model selector defaults, playbook web app interface, and public website playbook interface.
What software-sales-bench measures
software-sales-bench asks models to recommend software for ordinary businesses without spoon-feeding them the business problems.
The prompt shape is intentionally simple:
A dental clinic asks what custom software you would recommend for their business.
Give the recommendation you would actually make before a discovery call.
Keep it short and use this structure:
1. Likely business problems
2. Recommended software features
3. Why each feature matters
The model has to infer the likely business problems from the business type, recommend practical software, and explain why the features matter.
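As a minimal sketch, the prompt shape above can be templated per business type. The `build_prompt` helper and `BUSINESS_TYPES` list are hypothetical names for illustration, and it is an assumption that every task reuses the dental clinic wording verbatim; the repository's actual prompt code may differ.

```python
# Hypothetical helper: illustrates the prompt shape, not the benchmark's real code.
BUSINESS_TYPES = [
    "dental clinic",
    "gym",
    "cafe",
    "auto repair shop",
    "accounting firm",
    "construction subcontractor",
]

# Assumption: every task reuses the same wording as the dental clinic example.
PROMPT_TEMPLATE = (
    "{article} {business} asks what custom software you would recommend "
    "for their business.\n"
    "Give the recommendation you would actually make before a discovery call.\n"
    "Keep it short and use this structure:\n"
    "1. Likely business problems\n"
    "2. Recommended software features\n"
    "3. Why each feature matters"
)


def build_prompt(business: str) -> str:
    """Fill the template, choosing 'A' or 'An' from the first letter."""
    article = "An" if business[0].lower() in "aeiou" else "A"
    return PROMPT_TEMPLATE.format(article=article, business=business)


# One prompt per business type; note that no business problems are supplied.
prompts = {business: build_prompt(business) for business in BUSINESS_TYPES}
print(prompts["dental clinic"])
```

The template names only the business type, so any problem the model lists has to come from its own understanding of that business.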
That last part matters. A feature list is easy. A useful recommendation connects each feature to an operating problem:
| Weak answer | Better answer |
|---|---|
| Build a member app. | Build member self-service so gym members can book classes, buy PT packs, update subscriptions, and stop messaging staff manually. |
| Add integrations. | Sync bookings and payments so staff can see whether a class, PT session, or event has been paid for. |
| Create a dashboard. | Show the owner class fill, failed payments, PT utilization, and churn risk so they know where revenue is leaking. |
This is not a persuasion benchmark. It does not test whether the model can write a slick sales pitch. It tests whether the model has practical software-agency judgment.
The six business tasks
The first version uses six ordinary business types, each chosen because the work is familiar, messy, and easy to hand-wave:
| Task | What a strong answer should understand |
|---|---|
| Dental clinic | Chairs, no-shows, recall, treatment acceptance, forms, payments, privacy, and existing practice software. |
| Gym | Class bookings, PT bookings, events, induction forms, merch sales, subscriptions, marketing, retention, and trainer workflows. |
| Cafe | Catering orders, pre-orders, stock, deposits, loyalty, prep, rush-period usability, and repeat customers. |
| Auto repair shop | Bay utilization, digital inspections, quote approvals, parts, customer updates, payments, and repeat service. |
| Accounting firm | Document chasing, deadlines, client portals, review workflows, billing, security, and existing tax/accounting tools. |
| Construction subcontractor | Site updates, photos, variations, timesheets, materials, progress claims, field usability, and margin control. |
The hidden rubric is broad rather than built around a single ideal answer. A website is not automatically bad. An app is not automatically bad. CRM and ERP-style features are often exactly what a business needs.
The benchmark penalizes recommendations when they are generic, overbuilt, unsafe, full of jargon, or disconnected from how the business actually makes or saves money.
Scoring
Each answer is judged across seven dimensions:
| Dimension | What it asks |
|---|---|
| Problem understanding | Does the model understand what this kind of business actually struggles with? |
| Software fit | Would the software save time, reduce mistakes, increase revenue, or improve customer experience? |
| Feature reasoning | Does it explain why each main feature matters? |
| Scope discipline | Does it avoid random feature lists and overbuilt systems? |
| Implementation realism | Does it consider existing tools, staff usage, privacy, and purposeful integrations? |
| Rubbish resistance | Does it avoid shiny nonsense like vague AI, pointless marketplaces, blockchain, or apps with no reason? |
| Owner clarity | Could a normal business owner understand and care about the recommendation? |
I also record cost, input tokens, output tokens, and the judge’s rationale for every answer. The generated report includes a cost-versus-quality scatterplot, and clicking a model point opens the raw response and scoring notes.
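One way to picture a judged answer is as a small record that holds the seven dimension scores alongside the cost and token fields. The sketch below is an assumption throughout: the field names, the 0-100 scale per dimension, and the unweighted mean used for the overall score are illustrative choices, not the benchmark's confirmed schema or weighting. The example values are made up.

```python
from dataclasses import dataclass
from statistics import mean

# The seven dimensions from the scoring table, as hypothetical field keys.
DIMENSIONS = (
    "problem_understanding",
    "software_fit",
    "feature_reasoning",
    "scope_discipline",
    "implementation_realism",
    "rubbish_resistance",
    "owner_clarity",
)


@dataclass
class JudgedAnswer:
    """Hypothetical record for one judged answer (not the real run.json schema)."""

    model: str
    task: str
    scores: dict  # dimension name -> score, assumed to be on a 0-100 scale
    cost_usd: float
    input_tokens: int
    output_tokens: int
    rationale: str

    def overall(self) -> float:
        """Assumption: the overall score is the unweighted mean of all dimensions."""
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimension scores: {sorted(missing)}")
        return mean(self.scores[d] for d in DIMENSIONS)


# Made-up illustrative values, not taken from any real run.
example = JudgedAnswer(
    model="example-model",
    task="gym",
    scores=dict(zip(DIMENSIONS, [80.0, 85.0, 75.0, 70.0, 80.0, 90.0, 80.0])),
    cost_usd=0.005,
    input_tokens=120,
    output_tokens=600,
    rationale="Covers class and PT bookings; light on retention.",
)
print(example.overall())
```

Keeping the rationale next to the numbers is what makes a single point on the scatterplot inspectable later.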
Results from the first run
The first run used 15 models across 6 business tasks, for 90 judged answers. The judge was openai/gpt-5.4-mini.
| Rank | Model | Score | Avg cost / answer | Rubbish rate |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 82.5 | $0.018660 | 0% |
| 2 | GPT-5.5 High | 81.0 | $0.014395 | 0% |
| 3 | GPT-5.5 Extra High | 80.3 | $0.021845 | 0% |
| 4 | Gemini 3.1 Pro | 79.5 | $0.013772 | 0% |
| 5 | GPT-5.5 Medium | 78.3 | $0.011205 | 0% |
| 6 | DeepSeek V4 Pro | 77.8 | $0.001654 | 0% |
| 7 | Claude Opus 4.8 | 76.3 | $0.016648 | 0% |
| 8 | GPT-5.5 Low | 76.2 | $0.010190 | 0% |
| 9 | Kimi K2.6 | 72.7 | $0.006678 | 0% |
| 10 | GLM 5.1 | 72.7 | $0.005005 | 0% |
| 11 | Claude Sonnet 4.6 | 72.3 | $0.005049 | 0% |
| 12 | Gemini 3.5 Flash | 67.3 | $0.011737 | 33.3% |
| 13 | Claude Haiku 4.5 | 64.0 | $0.001764 | 0% |
| 14 | Grok 4.20 | 64.0 | $0.000890 | 16.7% |
| 15 | Grok 4.3 | 62.8 | $0.001685 | 0% |
The most interesting result was not simply who won. It was the cost-quality shape in the scatterplot at the top of the article.
Claude Opus 4.7 had the highest average score (82.5), but DeepSeek V4 Pro landed close to the top group (77.8) at roughly one-eleventh of the cost per answer ($0.0017 versus $0.0187). GPT-5.5 High and Gemini 3.1 Pro were both strong, but not dramatically ahead of cheaper options. Some models that looked strong on the dental-only smoke test fell when the benchmark added gym, cafe, accounting, auto repair, and construction.
That is the point of using boring businesses. A model can memorize that dentists need reminders and intake forms. It is harder to consistently understand class bookings, PT sessions, cafe catering, quote approvals, tax document chasing, and change-order disputes.
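The cost-quality shape can also be read directly from the results table by keeping only the models that no cheaper model beats on score. The sketch below computes that frontier from the published averages. The `pareto_frontier` function is an illustrative helper, not part of the benchmark's report code.

```python
# (model, average score, average cost per answer in USD), copied from the results table.
RESULTS = [
    ("Claude Opus 4.7", 82.5, 0.018660),
    ("GPT-5.5 High", 81.0, 0.014395),
    ("GPT-5.5 Extra High", 80.3, 0.021845),
    ("Gemini 3.1 Pro", 79.5, 0.013772),
    ("GPT-5.5 Medium", 78.3, 0.011205),
    ("DeepSeek V4 Pro", 77.8, 0.001654),
    ("Claude Opus 4.8", 76.3, 0.016648),
    ("GPT-5.5 Low", 76.2, 0.010190),
    ("Kimi K2.6", 72.7, 0.006678),
    ("GLM 5.1", 72.7, 0.005005),
    ("Claude Sonnet 4.6", 72.3, 0.005049),
    ("Gemini 3.5 Flash", 67.3, 0.011737),
    ("Claude Haiku 4.5", 64.0, 0.001764),
    ("Grok 4.20", 64.0, 0.000890),
    ("Grok 4.3", 62.8, 0.001685),
]


def pareto_frontier(results):
    """Keep each model that scores higher than every cheaper model."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, take the higher score first.
    for name, score, cost in sorted(results, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


for name, score, cost in pareto_frontier(RESULTS):
    print(f"{name:20s} score={score:5.1f} cost=${cost:.6f}")
```

On the first-run numbers this keeps six models: Grok 4.20, DeepSeek V4 Pro, GPT-5.5 Medium, Gemini 3.1 Pro, GPT-5.5 High, and Claude Opus 4.7. Every other model is matched or beaten on score by something cheaper.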
Case difficulty
The average score by business type:
| Business task | Average score |
|---|---|
| Dental clinic | 79.9 |
| Auto repair shop | 76.6 |
| Accounting firm | 76.1 |
| Construction subcontractor | 70.8 |
| Cafe | 70.5 |
| Gym | 69.3 |
The gym task was the hardest in this run. That makes sense. The hidden rubric expects the model to understand several overlapping business lines: class bookings, personal training, events, induction forms, merchandise, subscriptions, and marketing. A generic “member app” answer is not enough.
What the benchmark is really testing
The benchmark is not asking, “Can the model sell software?”
It is asking:
- Does it know what problems this business probably has?
- Does it recommend software that would actually help?
- Does it explain the business reason for each feature?
- Does it avoid startup rubbish?
- Does it know when to wrap existing systems instead of replacing them?
- Can a business owner understand the answer?
This matters because software sales is not only persuasion. A good agency wins by diagnosing the right problem and proposing something practical enough to buy, build, and use.
If the model gets the plan wrong, the pitch does not matter.
What I would improve next
This is only a first version. The next iterations should add:
- More business types.
- Multiple runs per model to reduce variance.
- A judge panel instead of one judge model.
- Separate leaderboards for quality, cost efficiency, and owner clarity.
- A public results page generated from run.json.
- A few holdout cases so models cannot overfit the public rubric.
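Two of these items, multiple runs and a judge panel, could be combined in a single aggregation step. The sketch below is one possible approach under stated assumptions, not a planned implementation: the `aggregate` function, the judge names, and the choice of a per-run median across judges followed by a mean and standard deviation across runs are all hypothetical. The scores are made up.

```python
from statistics import mean, median, stdev


def aggregate(panel_scores):
    """Combine a judge panel over repeated runs.

    panel_scores maps judge name -> list of scores, one per run, in the same
    run order for every judge. Returns (mean, sample standard deviation) of
    the per-run panel medians.
    """
    lengths = {len(scores) for scores in panel_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same number of runs")
    runs = list(zip(*panel_scores.values()))
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # The median across judges damps one unusually harsh or generous judge.
    per_run = [median(run) for run in runs]
    return mean(per_run), stdev(per_run)


# Made-up scores for one model on one task: three judges, three runs each.
panel = {
    "judge_a": [80.0, 78.0, 82.0],
    "judge_b": [76.0, 80.0, 79.0],
    "judge_c": [84.0, 77.0, 81.0],
}
avg, spread = aggregate(panel)
print(f"score={avg:.1f} +/- {spread:.1f}")
```

Reporting the spread next to the mean would show whether small gaps between adjacent ranks are larger than run-to-run noise.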
But even this first run is useful. It shows that practical business judgment is measurable, and that model quality depends heavily on the kind of ordinary commercial work you care about.