Guide · finance · accounting · FinOps

AI COGS vs R&D: A CFO's Decision Tree for Classifying AI Spend on the P&L

May 1, 2026 · Spendline

If 52% of your revenue is going to AI providers, where does it sit on the P&L?

ICONIQ's 2025 data on AI-native companies put that ratio in front of every CFO in the industry, and the question it raised is no longer hypothetical. AI spend has crossed the threshold where the difference between cost of goods sold (COGS) and research and development (R&D) is the difference between a 52% gross margin company and a 5% gross margin company — and a board narrative that holds up versus one that does not.

This guide is a decision tree, not a treatise. It walks through the classification choices a CFO has to make for AI spend, where the obvious answers are, where the genuinely hard cases are, and what auditors will eventually expect you to document.

It is not legal or accounting advice. Talk to your auditor. But the framework below is the one most companies are landing on as the practice settles.

Why classification matters

Three reasons, in order of how much they will affect you:

  1. Gross margin reporting. If AI inference cost gets booked to R&D, your gross margin looks great and your operating loss looks worse. If it gets booked to COGS — which is where it usually belongs — your gross margin reflects reality. Investors and analysts compare gross margins across companies; misclassification breaks that comparison and eventually gets caught.

  2. Investor perception and valuation. SaaS companies trade at multiples that assume 70–85% gross margins. AI-native companies do not have those margins, but how you report what you have determines whether your story is "high-multiple SaaS with some COGS pressure" or "low-multiple commodity reseller." The accounting choice is the narrative.

  3. Audit and disclosure. Once AI spend is material, public-company auditors will require a documented classification methodology. Private companies that intend to go public, or to be acquired by a public company, inherit that requirement on the timeline of their next major event.

The penalty for getting this wrong is not a fine. It is a restatement, an investor backlash, or a deal that collapses in due diligence because gross margin was overstated. All three have already happened to companies that booked inference cost as R&D and got challenged on it.

The default rule

Start here. Spend on AI inference for revenue-generating product features is COGS.

That sentence does most of the work. Inference is the AI equivalent of cloud compute serving customer traffic. AWS bills for production EC2 are COGS. OpenAI bills for production inference serving the same workloads are also COGS. There is no accounting principle that distinguishes them.

The default rule has two halves:

  • "AI inference" — the cost of running a trained model on a customer request, paying per token or per call. The dominant share of AI spend at most companies is inference.
  • "Revenue-generating product features" — features that exist in the customer-facing product and generate, support, or directly enable revenue. If a customer pays for the product and the AI feature is part of what they are paying for, the inference is COGS.

If both halves apply, the classification is COGS. The CFO does not need to think about it further. The interesting cases are the ones that fall outside this rule.
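
As a minimal sketch of what applying the rule takes operationally: every AI cost line needs a tag for each half of it. The field names and values below are illustrative, not a standard schema.

```python
# Minimal sketch: the two tags the default rule needs on every AI cost line.
# Field names and values are illustrative, not a standard chart-of-accounts schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class AISpendLine:
    vendor: str          # e.g. "openai", "anthropic"
    amount_usd: float
    workload: Literal["inference", "training", "fine_tuning", "eval_tooling", "retrieval"]
    traffic: Literal["paid_customer", "trial", "internal"]  # who the request ultimately served

def is_default_cogs(line: AISpendLine) -> bool:
    """The default rule: inference serving paid-customer traffic is COGS."""
    return line.workload == "inference" and line.traffic == "paid_customer"
```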

The decision tree

Here is the framework, in the order of questions to ask:

Is the spend for inference, training, or something else?

├── Inference
│   └── Is the inference serving customer-paid traffic?
│       ├── Yes → COGS
│       └── No (internal-only / employee tools) → SG&A or R&D depending on team
│
├── Training (full pretraining)
│   └── Is the model expected to drive future revenue?
│       ├── Yes → R&D (capitalizable in some cases under ASC 350-40)
│       └── No (research only, not productized) → R&D
│
├── Fine-tuning
│   └── Is it for a deployed product feature or for research?
│       ├── Deployed feature → COGS (treated like inference setup cost)
│       ├── Customer-specific fine-tune (paid feature) → COGS
│       └── Research / not yet productized → R&D
│
├── Evaluation, prompt engineering, dev tooling
│   └── Always R&D or SG&A. Never COGS.
│
└── Other (RAG indexing, embeddings for search, vector storage)
    └── Same test as inference: is it serving customer traffic?
        ├── Yes → COGS
        └── No → R&D

Every box maps to a finance line. No spend should land in a catch-all bucket called "other AI." If you cannot fit a particular workload into one of these branches, the workload is undefined enough that you should resolve the ambiguity before classifying.
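
The whole tree fits in one small function. The sketch below assumes each cost line carries the workload and traffic tags from the earlier snippet, plus illustrative team and purpose tags; it is a policy sketch to discuss with your auditor, not authoritative accounting guidance.

```python
# Sketch: the decision tree above as one pure function mapping a tagged AI cost
# line to a P&L line. Tag values (team, purpose) are illustrative.
def classify_ai_spend(workload: str, traffic: str, team: str = "", purpose: str = "") -> str:
    """Return the P&L line for one tagged AI cost line: COGS, S&M, R&D, or SG&A."""
    if workload == "inference":
        if traffic == "paid_customer":
            return "COGS"
        if traffic == "trial":
            return "S&M"  # or a separate "trial COGS" sub-line; see the edge cases below
        return "R&D" if team == "engineering" else "SG&A"  # internal-only / employee tools
    if workload == "training":
        return "R&D"  # capitalization under ASC 350-40 is the rare, auditor-reviewed exception
    if workload == "fine_tuning":
        productized = purpose in ("deployed_feature", "customer_specific")
        return "COGS" if productized else "R&D"
    if workload == "eval_tooling":  # evals, prompt engineering, dev tooling
        return "R&D" if team == "engineering" else "SG&A"
    if workload == "retrieval":     # RAG indexing, embeddings, vector storage
        return "COGS" if traffic == "paid_customer" else "R&D"
    raise ValueError(f"unrecognized workload {workload!r}: resolve the ambiguity before classifying")
```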

Walking through each branch

Inference for paid traffic → COGS

The clearest case. A customer makes a request. Your application calls OpenAI or Anthropic. You pay for the tokens. The customer pays you (some portion of) the value of the response. The provider bill is a direct cost of revenue, no different from a transactional payment processor fee or a per-API-call cost from any other usage-based vendor.

This is the bucket where most AI spend lives at most companies. It should be obvious. If it is not, you have a tagging problem (you cannot tell which inference is paid-customer traffic and which is internal), not an accounting problem.

Inference for internal use → SG&A or R&D

Inference workloads that do not touch a paying customer are not COGS. Common cases:

  • Engineers using ChatGPT Enterprise → SG&A (knowledge worker tooling)
  • AI coding assistants (Copilot, Cursor) → R&D (engineering productivity)
  • Internal AI agents for sales/marketing → SG&A (the function being supported)
  • Internal AI agents for product/engineering research → R&D

The classification follows the team using the tool, not the tool itself. The dollars are usually small relative to product inference, but they should be tagged separately because they aggregate to a different line on the P&L.

Training (full pretraining) → R&D, sometimes capitalized

Most companies do not pretrain. If you do, the cost of training a model that is not yet generating revenue is R&D under ASC 730. There is a narrow capitalization carve-out under ASC 350-40 (internal-use software) and ASC 985-20 (software to be sold), but the criteria are strict (technological feasibility, intent and ability to complete, etc.) and most early-stage AI training does not meet them. When in doubt, expense to R&D.

Fine-tuning → it depends

This is the hard one. The same activity ("fine-tune model X on dataset Y") can be either COGS or R&D depending on what the result is used for.

Fine-tuning for an already-deployed paid feature → COGS. If you fine-tune a model to improve a feature your customers pay for, that is product cost, treated as a one-time setup cost and amortized over the feature's revenue life. The pattern is similar to deploying a new EC2 fleet for an existing service.

Customer-specific fine-tuning sold as a paid feature → COGS. If a customer pays for a custom-tuned model and you fine-tune on their data, that fine-tuning cost is COGS for that customer's revenue.

Fine-tuning for research / not yet productized → R&D. If the fine-tune is exploratory and not connected to deployed customer traffic, expense to R&D.

The audit question on fine-tuning is always: what is the connection between this expense and revenue? If the connection is direct and current, COGS. If it is speculative or future, R&D.

Eval, prompt engineering, dev tooling → R&D or SG&A

These are never COGS. Evaluation harnesses, observability tools (Helicone, Langfuse, LangSmith), prompt-engineering platforms, and AI coding assistants used by engineering teams are R&D. AI tools used by GTM teams are SG&A.

It is tempting to put observability under COGS because "it supports the product." So does Datadog, and Datadog is an operating expense. Tools that support the engineering function developing and operating the product are operating expenses, not direct costs of revenue.

RAG, embeddings, vector storage → apply the inference test

Retrieval-augmented generation has both an indexing component (often one-time per corpus) and a query-time component (per-request). Treat them by the same rule:

  • Indexing for a paid-customer feature → COGS (one-time COGS, amortized over the feature life if material)
  • Indexing for an internal tool or research → R&D
  • Query-time embedding lookups serving paid traffic → COGS
  • Vector storage hosting paid-customer data → COGS (treat like any storage cost)

The edge cases that trip people up

Free-trial and freemium traffic

Inference for a customer who is not yet paying — free trial, freemium tier, or product-led growth onboarding — is technically not generating revenue, but it is a sales and marketing investment. Most companies book it to sales and marketing, not COGS, with the rationale that it is customer acquisition cost.

A defensible alternative is to book it to "trial COGS" as a separate sub-line of COGS, because the workload pattern is identical to paid-customer inference. Either is acceptable. What is not acceptable is loading trial inference cost onto paying customers' COGS, which inflates per-paying-customer cost and distorts gross margin per customer.
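
A minimal sketch of keeping the two buckets apart, assuming each request row already carries a traffic tag (row shape illustrative); whether the trial bucket lands in sales and marketing or a separate trial-COGS sub-line is the policy choice described above.

```python
# Sketch: sum paid vs. trial inference separately so trial spend never inflates
# paying customers' COGS. Row shape ({"traffic": ..., "cost_usd": ...}) is illustrative.
from collections import defaultdict

def split_inference_cost(rows: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        bucket = "cogs_paid" if row["traffic"] == "paid_customer" else "sm_trial"
        totals[bucket] += row["cost_usd"]
    return dict(totals)

# split_inference_cost([{"traffic": "paid_customer", "cost_usd": 0.42},
#                       {"traffic": "trial", "cost_usd": 0.07}])
# -> {"cogs_paid": 0.42, "sm_trial": 0.07}
```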

Multi-tenant inference where you cannot separate customer traffic

If your architecture pools inference requests in a way that prevents per-customer attribution, you have a tagging problem and possibly a chargeable-services problem. The accounting answer is: classify the whole bucket as COGS (since it is serving customer traffic), but understand that you cannot do per-customer gross margin until the tagging is fixed. See Per-customer attribution for the implementation pattern.

Agent workloads with thousands of intermediate calls

A single user request that triggers an agentic workflow may produce hundreds or thousands of intermediate LLM calls (planning, tool use, retries, evaluation). All of those calls, even though they are internal to the agent, are part of serving a paid customer request. They are COGS, in aggregate, attributed to the originating customer.

The control question is whether you can attribute them. If your agent framework does not propagate the customer dimension into nested calls, you have unattributed COGS. The accounting classification is unaffected, but the unit economics analysis is broken until the propagation is fixed.
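
One way to keep the dimension attached is to set it once at the request boundary and let every nested call read it from context. The sketch below uses Python's contextvars; record_cost and llm_call are hypothetical stand-ins, not a specific agent framework's API.

```python
# Sketch: propagate the originating customer into every nested agent call so that
# intermediate LLM spend stays attributable. Helper names are illustrative.
import contextvars

customer_id: contextvars.ContextVar[str] = contextvars.ContextVar("customer_id", default="unattributed")

def record_cost(tokens: int, rate_per_token: float) -> dict:
    # All of these calls are COGS regardless; propagation is what makes them attributable.
    return {"customer_id": customer_id.get(), "cost_usd": tokens * rate_per_token}

def llm_call(prompt: str) -> dict:
    tokens = len(prompt.split())  # placeholder for the provider's reported token count
    return record_cost(tokens, 0.00001)

def handle_request(customer: str, prompt: str) -> list[dict]:
    token = customer_id.set(customer)  # set once at the request boundary
    try:
        # Planning, tool use, and retries all inherit the same customer dimension.
        return [llm_call(prompt), llm_call("plan next step"), llm_call("retry with tool output")]
    finally:
        customer_id.reset(token)
```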

Capitalized AI assets

Some companies attempt to capitalize fine-tuned models or training runs as intangible assets under ASC 350-40 (internal-use software) or ASC 985-20 (software for sale). The criteria are tight and the audit risk is real. Before capitalizing, get a written opinion from your auditor that the specific activity meets the criteria. Most early-stage and growth-stage companies will fail the technological-feasibility test and end up expensing.

What auditors are starting to ask

Specific questions that have appeared in 2025–2026 due-diligence and audit inquiries:

  1. What percentage of total AI spend is classified as COGS, and how is the classification methodology documented?
  2. For AI spend classified as R&D, what is the connection (if any) to deployed product features?
  3. For inference workloads, can you produce a breakdown of paid-customer traffic vs. internal vs. trial?
  4. What controls prevent reclassification of historical AI spend without documentation?
  5. For any capitalized AI assets, what specific ASC criteria did you meet, and who reviewed the classification?

Companies that classify AI spend cleanly and consistently can answer all five in a meeting. Companies that have been booking AI to a single GL account cannot answer any of them, and the conversations are not pleasant.

A simple disclosure pattern

For internal management reporting and external investor communication, a clean disclosure pattern looks like:

| P&L line | AI spend component | Example |
|---|---|---|
| COGS | Production inference, RAG / embeddings serving paid traffic, customer-specific fine-tuning | OpenAI / Anthropic invoices for production traffic |
| Sales & marketing | Free trial, freemium, PLG inference | Same models, but tagged to non-paying users |
| R&D | Pretraining, exploratory fine-tuning, evals, prompt engineering, AI dev tooling | LLM-based testing infra; experimental model training |
| SG&A | Internal AI productivity tooling | ChatGPT Enterprise for non-engineering teams |

If your accounting system can produce that breakdown automatically every month, you have solved the AI classification problem. Most cannot, today, which is why the AI month close is becoming a separate workflow rather than a side note in the standard close.
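
For a sense of how small that breakdown is once the tags exist, here is a minimal sketch that rolls tagged spend up into the rows of the table above. Field names are illustrative.

```python
# Sketch: monthly roll-up of tagged AI spend into (P&L line, component) totals.
from collections import defaultdict

def monthly_disclosure(spend_lines: list[dict]) -> dict[tuple[str, str], float]:
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for line in spend_lines:
        totals[(line["pnl_line"], line["component"])] += line["amount_usd"]
    return dict(totals)

# monthly_disclosure([
#     {"pnl_line": "COGS", "component": "production inference", "amount_usd": 41_200.0},
#     {"pnl_line": "R&D",  "component": "evals and dev tooling", "amount_usd": 3_900.0},
# ])
```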

The single most important rule

If the classification of a particular AI workload is genuinely ambiguous, document the decision and apply it consistently. Auditors are far more forgiving of reasonable judgment calls applied consistently than of brilliant reasoning applied inconsistently. The classification methodology itself is the artifact that needs to survive scrutiny; once it exists, individual judgments no longer have to be defended one by one, because they are covered collectively by policy.

This is the same standard already applied to revenue recognition, capitalized software, and any other gray-zone accounting question. AI spend is just the newest one. The discipline is the same.


Spendline captures customer, workflow, and agent on every request as first-class dimensions, plus arbitrary structured tags so you can layer in cost center, intent, or any other classification your accounting policy needs. If your AI spend is currently booked to a single GL account, request a pilot and we will rebuild the classification from raw request data.