Why we built software-sales-bench

A practical benchmark for testing whether AI models understand business problems before recommending software.


For our work at Platfio, we need models with good commercial instincts about the problems businesses actually face, and how those problems can be addressed with software.

Unfortunately, a lot of models have a startup-shaped reflex. Ask them about a normal business and they can drift toward the next “AI blockchain Uber-for-dog-food-pizza-delivery marketplace”, or a grand platform idea that sounds exciting in a pitch deck but does not help anyone reduce admin, collect payments, improve utilization, retain customers, or protect margin.

Of course, you can usually push a model in the right direction with heavy system prompting, examples, constraints, and careful product framing. But for this work, I wanted to know which models already have the instincts: which ones naturally understand ordinary commercial problems before being over-directed.

Most businesses do not need a fantasy marketplace. They need software that produces a return on investment: fewer manual follow-ups, cleaner operations, faster approvals, better customer experience, and clearer visibility into where time and money are leaking.

That is why I created software-sales-bench. The goal is to assess current models and inform the playbooks, model selector, and recommended models used across Platfio’s single-turn, multi-turn, and agentic workflow products.

Here is a summary of the first results.

Scatterplot of software-sales-bench quality score against average cost per answer for fifteen models.
Higher is better. Further left is cheaper. DeepSeek V4 Pro is the cost-efficiency outlier in this run; Claude Opus 4.7 scored highest overall, but at a much higher average cost per answer.

The benchmark is open source here: github.com/hcassar93/software-sales-bench.

Screenshot of the software-sales-bench GitHub repository.
The repo includes the benchmark source, prompt shape, scoring data, and report outputs so the results can be inspected rather than treated as a black box.

Those results are not the only input into Platfio defaults, but they are one of the practical calibration sources behind them. The benchmark helps decide which models should appear as sensible defaults, which ones are worth keeping available for higher-risk work, and which playbook paths need stronger guardrails before the agent starts writing proposal logic for a real business.

Agent model selector defaults

A Platfio model selector showing default model families grouped by provider.

Playbook web app interface

A Platfio web app playbook prompt for writing a software proposal.

Public website playbook interface

A public Platfio website playbook page for generating a construction software proposal.
The benchmark informs the defaults behind these surfaces: the model selector, the internal playbook app, and the public playbook website all need models that can infer ordinary business problems before producing a software recommendation.

What software-sales-bench measures

software-sales-bench asks models to recommend software for ordinary businesses without spoon-feeding them the business problems.

The prompt shape is intentionally simple:

A dental clinic asks what custom software you would recommend for their business.
Give the recommendation you would actually make before a discovery call.

Keep it short and use this structure:
1. Likely business problems
2. Recommended software features
3. Why each feature matters
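
Only the business type changes between tasks, so the prompt can be produced from a template. The following is a minimal sketch, not the repo's actual code: the `build_prompt` helper, the `PROMPT_TEMPLATE` constant, and the simple a/an rule are assumptions made for illustration, and the business names are taken from the task list in this article.

```python
# Hypothetical template mirroring the prompt shape shown above.
PROMPT_TEMPLATE = """\
{article} {business} asks what custom software you would recommend for their business.
Give the recommendation you would actually make before a discovery call.

Keep it short and use this structure:
1. Likely business problems
2. Recommended software features
3. Why each feature matters"""

# The six business types used in the first version of the benchmark.
BUSINESS_TYPES = [
    "dental clinic",
    "gym",
    "cafe",
    "auto repair shop",
    "accounting firm",
    "construction subcontractor",
]


def build_prompt(business: str) -> str:
    """Fill the template, choosing "A" or "An" from the first letter."""
    article = "An" if business[0].lower() in "aeiou" else "A"
    return PROMPT_TEMPLATE.format(article=article, business=business)


prompts = {business: build_prompt(business) for business in BUSINESS_TYPES}

# Print the first line of one generated prompt as a quick check.
print(prompts["auto repair shop"].splitlines()[0])
```

Note that the template deliberately contains no hints about the business's problems; that omission is what the benchmark relies on.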

The model has to infer the likely business problems from the business type, recommend practical software, and explain why the features matter.

That last part matters. A feature list is easy. A useful recommendation connects each feature to an operating problem:

| Weak answer | Better answer |
| --- | --- |
| Build a member app. | Build member self-service so gym members can book classes, buy PT packs, update subscriptions, and stop messaging staff manually. |
| Add integrations. | Sync bookings and payments so staff can see whether a class, PT session, or event has been paid for. |
| Create a dashboard. | Show the owner class fill, failed payments, PT utilization, and churn risk so they know where revenue is leaking. |

This is not a persuasion benchmark. It does not test whether the model can write a slick sales pitch. It tests whether the model has practical software-agency judgment.

The six business tasks

The first version uses six ordinary business types, each chosen because the work is familiar, messy, and easy to hand-wave:

| Task | What a strong answer should understand |
| --- | --- |
| Dental clinic | Chairs, no-shows, recall, treatment acceptance, forms, payments, privacy, and existing practice software. |
| Gym | Class bookings, PT bookings, events, induction forms, merch sales, subscriptions, marketing, retention, and trainer workflows. |
| Cafe | Catering orders, pre-orders, stock, deposits, loyalty, prep, rush-period usability, and repeat customers. |
| Auto repair shop | Bay utilization, digital inspections, quote approvals, parts, customer updates, payments, and repeat service. |
| Accounting firm | Document chasing, deadlines, client portals, review workflows, billing, security, and existing tax/accounting tools. |
| Construction subcontractor | Site updates, photos, variations, timesheets, materials, progress claims, field usability, and margin control. |

The hidden rubric is broad rather than keyed to a single ideal answer. A website is not automatically bad. An app is not automatically bad. CRM and ERP-style features are often exactly what a business needs.

The benchmark penalizes recommendations when they are generic, overbuilt, unsafe, full of jargon, or disconnected from how the business actually makes or saves money.

Scoring

Each answer is judged across seven dimensions:

| Dimension | What it asks |
| --- | --- |
| Problem understanding | Does the model understand what this kind of business actually struggles with? |
| Software fit | Would the software save time, reduce mistakes, increase revenue, or improve customer experience? |
| Feature reasoning | Does it explain why each main feature matters? |
| Scope discipline | Does it avoid random feature lists and overbuilt systems? |
| Implementation realism | Does it consider existing tools, staff usage, privacy, and purposeful integrations? |
| Rubbish resistance | Does it avoid shiny nonsense like vague AI, pointless marketplaces, blockchain, or apps with no reason? |
| Owner clarity | Could a normal business owner understand and care about the recommendation? |

I also record cost, input tokens, output tokens, and the judge’s rationale for every answer. The generated report includes a cost-versus-quality scatterplot, and clicking a model point opens the raw response and scoring notes.
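
To make the shape of one judged answer concrete, here is a rough sketch of a record holding the seven dimension scores alongside the cost, token counts, and rationale. It rests on assumptions this article does not confirm: the field names, the 0 to 100 scale per dimension, and the equal weighting in `overall()` are all hypothetical, and the values in the example are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# The seven scoring dimensions, written as hypothetical field keys.
DIMENSIONS = (
    "problem_understanding",
    "software_fit",
    "feature_reasoning",
    "scope_discipline",
    "implementation_realism",
    "rubbish_resistance",
    "owner_clarity",
)


@dataclass
class JudgedAnswer:
    model: str
    task: str
    scores: dict[str, float]  # one entry per dimension, assumed 0-100
    cost_usd: float
    input_tokens: int
    output_tokens: int
    rationale: str

    def overall(self) -> float:
        """Equal-weight mean of the seven dimensions (an assumption)."""
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimension scores: {sorted(missing)}")
        return mean(self.scores[d] for d in DIMENSIONS)


# Invented example values, not taken from the benchmark run.
example = JudgedAnswer(
    model="example-model",
    task="gym",
    scores={
        "problem_understanding": 80.0,
        "software_fit": 70.0,
        "feature_reasoning": 90.0,
        "scope_discipline": 60.0,
        "implementation_realism": 80.0,
        "rubbish_resistance": 100.0,
        "owner_clarity": 80.0,
    },
    cost_usd=0.0042,
    input_tokens=120,
    output_tokens=450,
    rationale="Illustrative only.",
)
print(round(example.overall(), 1))
```

Keeping the rationale next to the numbers is what lets a reader open any point on the scatterplot and see why the judge scored it that way.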

Results from the first run

The first run used 15 models across 6 business tasks, for 90 judged answers. The judge was openai/gpt-5.4-mini.

Horizontal bar chart ranking fifteen models by their average software-sales-bench quality score.
Claude Opus 4.7 led the first run, but the top cluster was tight: GPT-5.5 High, Gemini 3.1 Pro, and DeepSeek V4 Pro all finished within five points of it.

| Rank | Model | Score | Avg cost / answer | Rubbish rate |
| --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 | 82.5 | $0.018660 | 0% |
| 2 | GPT-5.5 High | 81.0 | $0.014395 | 0% |
| 3 | GPT-5.5 Extra High | 80.3 | $0.021845 | 0% |
| 4 | Gemini 3.1 Pro | 79.5 | $0.013772 | 0% |
| 5 | GPT-5.5 Medium | 78.3 | $0.011205 | 0% |
| 6 | DeepSeek V4 Pro | 77.8 | $0.001654 | 0% |
| 7 | Claude Opus 4.8 | 76.3 | $0.016648 | 0% |
| 8 | GPT-5.5 Low | 76.2 | $0.010190 | 0% |
| 9 | Kimi K2.6 | 72.7 | $0.006678 | 0% |
| 10 | GLM 5.1 | 72.7 | $0.005005 | 0% |
| 11 | Claude Sonnet 4.6 | 72.3 | $0.005049 | 0% |
| 12 | Gemini 3.5 Flash | 67.3 | $0.011737 | 33.3% |
| 13 | Claude Haiku 4.5 | 64.0 | $0.001764 | 0% |
| 14 | Grok 4.20 | 64.0 | $0.000890 | 16.7% |
| 15 | Grok 4.3 | 62.8 | $0.001685 | 0% |

Rubbish resistance chart showing Gemini 3.5 Flash at 33.3 percent, Grok 4.20 at 16.7 percent, and thirteen other models at zero percent.
The rubbish flag stayed quiet for most models. Gemini 3.5 Flash and Grok 4.20 were the main exceptions in this run.
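
The article does not define the rubbish rate, but the published figures are consistent with it being the share of a model's six answers that the judge flagged: two flagged answers out of six gives 33.3%, and one out of six gives 16.7%. A small sketch under that assumption, with a hypothetical `rubbish_rate` helper and made-up flag lists:

```python
def rubbish_rate(flags: list[bool]) -> float:
    """Percentage of answers flagged as rubbish (assumed definition)."""
    if not flags:
        raise ValueError("need at least one judged answer")
    return 100 * sum(flags) / len(flags)


# Hypothetical per-task flags for one model across the six business tasks.
two_of_six = [True, True, False, False, False, False]
one_of_six = [True, False, False, False, False, False]

print(f"{rubbish_rate(two_of_six):.1f}%")  # 33.3%
print(f"{rubbish_rate(one_of_six):.1f}%")  # 16.7%
```

With only six answers per model, a single flagged answer moves the rate by almost 17 points, so these percentages are best read as counts.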

The most interesting result was not simply who won. It was the cost-quality shape in the scatterplot at the top of the article.

Claude Opus 4.7 had the highest average score, but DeepSeek V4 Pro landed within five points of it at roughly one-eleventh of the average cost per answer. GPT-5.5 High and Gemini 3.1 Pro were both strong, but not dramatically ahead of cheaper options. Some models that looked strong on the dental-only smoke test slipped once the benchmark added gym, cafe, accounting, auto repair, and construction.
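
The cost-quality trade-off can be read straight from the results table. The sketch below uses the published scores and average costs for four of the top models and compares each to DeepSeek V4 Pro; the `compare_to_baseline` helper is hypothetical and is not part of the benchmark's report code.

```python
# (score, average cost per answer in USD), copied from the results table.
RESULTS = {
    "Claude Opus 4.7": (82.5, 0.018660),
    "GPT-5.5 High": (81.0, 0.014395),
    "Gemini 3.1 Pro": (79.5, 0.013772),
    "DeepSeek V4 Pro": (77.8, 0.001654),
}
BASELINE = "DeepSeek V4 Pro"


def compare_to_baseline(results, baseline):
    """Return (model, score gap, cost multiple) relative to the baseline."""
    base_score, base_cost = results[baseline]
    return [
        (model, score - base_score, cost / base_cost)
        for model, (score, cost) in results.items()
    ]


for model, gap, multiple in compare_to_baseline(RESULTS, BASELINE):
    print(f"{model:<16} {gap:+5.1f} points  {multiple:5.1f}x cost")
```

On these numbers, Claude Opus 4.7 buys 4.7 extra points for about eleven times the cost per answer, which is the shape the scatterplot makes visible.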

Bar chart showing average benchmark score by model family or provider.
Grouped by family, OpenAI had the strongest average across its GPT-5.5 variants. DeepSeek was a single-model result, but it was strong enough to sit near the top while staying unusually cheap.
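
A family average like this can be rebuilt from the results table. In the sketch below, the provider assigned to each model is my own grouping and may not match the chart exactly, and the `family_averages` helper is hypothetical; the scores are the published ones.

```python
from collections import defaultdict
from statistics import mean

# (family, model, score); the family labels are an assumed grouping.
SCORES = [
    ("Anthropic", "Claude Opus 4.7", 82.5),
    ("OpenAI", "GPT-5.5 High", 81.0),
    ("OpenAI", "GPT-5.5 Extra High", 80.3),
    ("Google", "Gemini 3.1 Pro", 79.5),
    ("OpenAI", "GPT-5.5 Medium", 78.3),
    ("DeepSeek", "DeepSeek V4 Pro", 77.8),
    ("Anthropic", "Claude Opus 4.8", 76.3),
    ("OpenAI", "GPT-5.5 Low", 76.2),
    ("Moonshot AI", "Kimi K2.6", 72.7),
    ("Z.ai", "GLM 5.1", 72.7),
    ("Anthropic", "Claude Sonnet 4.6", 72.3),
    ("Google", "Gemini 3.5 Flash", 67.3),
    ("Anthropic", "Claude Haiku 4.5", 64.0),
    ("xAI", "Grok 4.20", 64.0),
    ("xAI", "Grok 4.3", 62.8),
]


def family_averages(rows):
    """Mean score per family."""
    by_family = defaultdict(list)
    for family, _model, score in rows:
        by_family[family].append(score)
    return {family: mean(scores) for family, scores in by_family.items()}


ranked = sorted(family_averages(SCORES).items(), key=lambda kv: kv[1], reverse=True)
for family, avg in ranked:
    print(f"{family:<12} {avg:.1f}")
```

Families with a single model, such as DeepSeek, are really one data point, so their position says less about the family than the multi-model averages do.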

That is the point of using boring businesses. A model can memorize that dentists need reminders and intake forms. It is harder to consistently understand class bookings, PT sessions, cafe catering, quote approvals, tax document chasing, and change-order disputes.

Case difficulty

The average score by business type:

| Business task | Average score |
| --- | --- |
| Dental clinic | 79.9 |
| Auto repair shop | 76.6 |
| Accounting firm | 76.1 |
| Construction subcontractor | 70.8 |
| Cafe | 70.5 |
| Gym | 69.3 |

Bar chart showing average benchmark score by business task, with gym as the lowest-scoring task.
Dental was the easiest case in this first run. Gym was the hardest because the model has to reason across classes, personal training, memberships, events, merchandise, and retention at the same time.

Gym finishing last makes sense. The hidden rubric expects the model to understand several overlapping business lines: class bookings, personal training, events, induction forms, merchandise, subscriptions, and marketing. A generic “member app” answer is not enough.
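
Because every model answered every task, the per-task averages and the per-model scores summarize the same 90 answers, so their grand means should agree up to rounding. The check below uses only the published figures and assumes each model's score is a plain mean over its six tasks, which the article does not state explicitly.

```python
from statistics import mean

# Per-model scores from the results table (15 models).
MODEL_SCORES = [
    82.5, 81.0, 80.3, 79.5, 78.3, 77.8, 76.3, 76.2,
    72.7, 72.7, 72.3, 67.3, 64.0, 64.0, 62.8,
]

# Per-task averages from the case difficulty table (6 tasks).
TASK_AVERAGES = {
    "Dental clinic": 79.9,
    "Auto repair shop": 76.6,
    "Accounting firm": 76.1,
    "Construction subcontractor": 70.8,
    "Cafe": 70.5,
    "Gym": 69.3,
}

mean_by_model = mean(MODEL_SCORES)
mean_by_task = mean(TASK_AVERAGES.values())

# Each published value is rounded to one decimal, so each grand mean can be
# off by up to 0.05 and the two can differ by up to 0.1.
consistent = abs(mean_by_model - mean_by_task) <= 0.1

print(f"mean of model scores:  {mean_by_model:.2f}")
print(f"mean of task averages: {mean_by_task:.2f}")
print(f"consistent within rounding: {consistent}")
```

The two means land within a few hundredths of each other, which is what a balanced design with equal weighting would produce.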

What the benchmark is really testing

The benchmark is not asking, “Can the model sell software?”

It is asking:

  • Does it know what problems this business probably has?
  • Does it recommend software that would actually help?
  • Does it explain the business reason for each feature?
  • Does it avoid startup rubbish?
  • Does it know when to wrap existing systems instead of replacing them?
  • Can a business owner understand the answer?

This matters because software sales is not only persuasion. A good agency wins by diagnosing the right problem and proposing something practical enough to buy, build, and use.

If the model gets the plan wrong, the pitch does not matter.

What I would improve next

This is only a first version. The next iterations should add:

  • More business types.
  • Multiple runs per model to reduce variance.
  • A judge panel instead of one judge model.
  • Separate leaderboards for quality, cost efficiency, and owner clarity.
  • A public results page generated from run.json.
  • A few holdout cases so models cannot overfit the public rubric.
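
Two of these items, repeated runs and a judge panel, come down to how scores get aggregated. One possible approach, sketched here with hypothetical helper names and invented numbers, is to take the median across judges for each answer and then report the mean and standard deviation across runs. This is an illustration of the idea, not a description of what the benchmark will do.

```python
from statistics import mean, median, stdev


def panel_score(judge_scores: list[float]) -> float:
    """Combine several judges' scores for one answer using the median,
    so a single outlier judge cannot drag the result."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    return median(judge_scores)


def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) across repeated runs."""
    if not run_scores:
        raise ValueError("need at least one run")
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean(run_scores), spread


# Invented numbers: three judges scoring one answer, then three runs.
print(panel_score([70.0, 80.0, 95.0]))
avg, spread = summarize_runs([78.0, 80.0, 82.0])
print(f"{avg:.1f} +/- {spread:.1f}")
```

Reporting the spread alongside the mean would show whether gaps of a point or two between neighbouring models on the leaderboard are larger than run-to-run noise.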

But even this first run is useful. It shows that practical business judgment is measurable, and that model quality depends heavily on the kind of ordinary commercial work you care about.