
NVIDIA AI Podcast · 2026-05-21
NVIDIA on AI Tokenomics: Maximizing Token Value and Business Impact
Hosts: Noah Kravitz
Guests: Shruti Koparkar
Why it matters
Token value is determined by the intelligence embedded in the token and the speed of token generation (interactivity).
Key claims
- Token value is determined by the intelligence embedded in the token and the speed of token generation (interactivity).
- Use cases dictate the required token value and interactivity; simpler domain-specific models may suffice for narrow tasks.
- Token demand forecasting involves user count, session frequency, tokens per session, reasoning model overhead, agentic workflows, cache hit rates, and demand variability.
- Cost per token is a critical metric combining infrastructure cost and token output, providing a true ROI measure beyond traditional input metrics like GPU cost or FLOPS per dollar.
Episode summary
In this episode of the NVIDIA AI Podcast, Shruti Koparkar from NVIDIA's accelerated computing team explains the concept of AI tokenomics, focusing on how tokens generated by AI models can be valued, supplied, and monetized to create business value. She emphasizes that token value depends on the intelligence embedded in the token and the speed of token generation, which vary by model complexity, context length, and use case requirements. Business leaders are encouraged to map use cases to appropriate token values and interactivity levels to optimize AI deployments.
Shruti also discusses the importance of measuring cost per token as a key metric that captures both infrastructure input costs and token output, highlighting NVIDIA's Blackwell platform's 35x lower token cost compared to previous generations. She elaborates on NVIDIA's extreme co-design approach, integrating hardware, software, and ecosystem optimizations to drive efficiency gains. Finally, she outlines four primary business models for monetizing tokens: direct token sales, AI-native product development, AI-enhanced existing products, and internal operational improvements, advising leaders to start with customer needs and work backward through tokenomics pillars to maximize ROI.
- NVIDIA's Blackwell platform delivers 50x more tokens per watt and 35x lower token cost compared to Hopper, enabled by extreme co-design of hardware, software, and ecosystem.
- Extreme co-design integrates compute, memory, storage, networking, software stack, and ecosystem partners to optimize token generation efficiency and latency, especially for agentic AI workloads.
- Software optimizations, including quantization, speculative decoding, disaggregated serving, and parallelism, significantly improve inference performance and reduce token cost.
- Four main business models for monetizing tokens: direct token sales, AI-native products, AI-enhanced existing products, and internal operational improvements.
Source material
Transcript
Not all tokens are created equal, and there is a way to look at token value.
There are two key factors that impact token value.
One is the intelligence embedded in the token, or how much intelligence the token carries, and the other is how fast it arrives.
Welcome to the NVIDIA AI podcast.
I'm Noah Kravitz.
I'm here with Shruti Koparkar.
Shruti is a member of the accelerated computing team here at NVIDIA, and she focuses on inference, and we're here to talk about tokenomics.
As data centers become AI factories and produce intelligence for the new industrial revolution, this word, tokenomics, has been floated about.
It's a useful term, but maybe we can break it down with your help, Shruti, so that it's really something that business leaders can understand and put into practice.
Yes, absolutely.
Well, first of all, thanks a lot for having me, Noah.
Thank you very much.
I am very excited to dig into the economics of AI or tokenomics.
And as you said, it is a term that gets used quite a bit, and I welcome the opportunity to help define it, so to speak.
So the way to think about tokenomics is it's about how tokens are valued, supplied, consumed, and monetized.
And what that essentially maps to is token utility, which is all about token value, and token supply, which is where your AI infrastructure decisions are, right?
Thinking about what infrastructure to invest in that will maximize your token output while minimizing cost; then there's token demand.
This is where customers and organizations think through what their number of users is, how many use cases, and what types of use cases.
So really sort of mapping out the volume and velocity of the tokens that they need, and then finally there's token monetization, which is taking the tokens and turning them into business value.
So those are sort of the four pillars of tokenomics, and it's super important to understand all four of those and how they relate to each other to be able to deploy AI successfully.
So let's start at the top then with utility or value.
How do you define the value of a token?
Are all tokens worth the same?
Do they have differing values?
Is there a better way to look at it?
How do you approach that?
That's a really great question, and you are right, actually, that not all tokens are created equal, and there is a way to look at token value.
There are two key factors that impact token value.
One is the intelligence embedded in the token, or how much intelligence the token carries, and the other is how fast it arrives, which is essentially the interactivity.
So to unpack that a little bit, the intelligence of the token is dependent on the model that produced the token.
So more complex, more intelligent models will produce tokens that in general have much more intelligence.
And then it also depends on the context that the models look at.
And generally speaking, the longer the context that you allow the model to look at, the better the accuracy, the better the intelligence of the tokens.
Now I say generally, because there are cases where if the context increases too much, then the model quality, the output quality, can degrade, but I don't want to rabbit-hole on that.
Generally, more context does kind of equate to better intelligence.
So that's one aspect.
And then as I mentioned, how fast the token arrives is the token interactivity, which is essentially tokens per second per user.
So it's the rate of token generation.
And so if you look at token value as a spectrum, on the one hand, you have these basic models with shorter context generating tokens at a not-that-fast speed.
And then on the other extreme is these more complex, more intelligent models, with much larger context and generating tokens that are really fast.
And across that entire spectrum is your different use cases and how you map those use cases to the token value.
So of course, there is an absolute way in which to think about the token value, but there is also a relative way, with respect to the use cases, in how to think about it.
And we can unpack that a little bit if you want.
So is it fair to say that the value of the token is tied to the task at hand as well?
Yes.
And that's exactly sort of what I was trying to get at
when I said that you have to think about mapping the right use case to the token value.
So as an example, like I said earlier, tokens generated by more complex, intelligent models are more valuable.
But that's in an absolute sense.
Relatively speaking, your use case may not require that more complex, more intelligent model.
And then that additional value is completely useless to you.
One example of this is domain-specific applications where, in a very narrow context, a post-trained, that is, fine-tuned, small language model,
so a much smaller model, can give you just the value you need.
In fact, in some cases, even better accuracy for that given task.
So you don't always need the big, large models.
And so relatively speaking, you need to map where on the spectrum of token value your use case sits.
The same thing is true for the interactivity piece, which is that agentic applications absolutely need highly interactive tokens.
Right.
But you may have applications like chat interfaces or enterprise search, which don't need that level of interactivity.
And so that is very critical when you are thinking through your sort of AI deployment decisions of where to map your use case to what token value.
So when a business leader is thinking about demand and about mapping out tokens to use cases, the different use cases have, you know, different values associated with them.
What's a good approach for someone who's, you know, looking at what their org is doing and what their different team members need?
How do they start to get a handle on it?
Well, how many tokens are you going to need to produce?
And how many of each kind?
Yeah.
So use cases are extremely important when thinking through token demand.
And there are three layers in which you can think about this with improving levels of forecasting accuracy, if you will.
So the basic sort of, you know, back-of-napkin math is to look at how many users you have, and how many requests or sessions the user is going to initiate in a given day or month.
And then how many tokens you need per request or per session.
And those three numbers put together will give you your base demand for a single day or a month or, you know, whatever your time period of analysis is.
Now that is the basic and very, very sort of simplistic look at it.
There are multipliers that you do need to account for that will dramatically change your understanding of your token requirement.
And a couple of those are number one, are you using reasoning models?
As we know, like reasoning models use thinking tokens, which never get seen by the end user.
And oftentimes when AI is deployed, you can actually set thresholds on how many thinking tokens are allowed per interaction.
And so when you are estimating demand, you do need to think through are we using reasoning models?
What are our thresholds?
What do we expect
the peak and average to be on those, you know, on that use?
So that's one.
Second is agentic.
Agentic is a huge multiplier because any use case if you are deploying it in this sort of agentic workflow context, then there are multiple sort of turns and loops that might happen that can increase your token demand significantly.
And then finally, the last factor is something called cache hit rate, or the KV cache hit rate.
And for those listeners for whom this term might be new, the KV cache is sort of like the short-term memory of a model.
And so any time an input request comes in to a model, it needs to process it.
But if it's already seen that input request before, then many times it actually gets stored in the cache.
And then when it comes in again, it doesn't need to recompute it.
Right.
It can just use those cache values.
So those are some key factors to kind of look at, to get to a higher degree of accuracy when thinking about token demand.
And then the final one is demand variability, which is how is your demand changing in a day?
Like sometimes you may have products that get used quite a lot in the morning hours, but not so much in the evening or vice versa.
Same thing with seasonal variability.
For example, retail providers or e-commerce will see a surge during the holidays when they are trying a lot of products out.
So you do need to think through those.
And then of course there is the user growth.
So you started with a base number of users, but you as a business are trying to constantly drive up user growth.
So you need to factor in how much you expect that to grow as you think through your token demand.
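The layers described above can be sketched as a small calculation. This is a minimal illustration, not a forecasting method from the episode: the class name, the field names, the way the multipliers are combined, and every number in the usage example are assumptions chosen for readability. In particular, it treats the reasoning overhead and the number of agentic LLM calls as simple multipliers, and it assumes a KV cache hit only saves recomputation of input tokens.

```python
from dataclasses import dataclass


@dataclass
class DemandAssumptions:
    users: int                               # active users in the period
    sessions_per_user_per_day: float         # requests or sessions per user per day
    tokens_per_session: float                # visible input + output tokens per session
    reasoning_multiplier: float = 1.0        # overhead from hidden "thinking" tokens
    agentic_calls_per_request: float = 1.0   # average LLM calls per user request
    kv_cache_hit_rate: float = 0.0           # fraction of input tokens found in the KV cache
    input_token_share: float = 0.5           # fraction of all tokens that are input tokens
    peak_to_average: float = 1.0             # daily or seasonal peak relative to the mean
    user_growth: float = 0.0                 # expected fractional growth in users


def estimate_daily_tokens(a: DemandAssumptions) -> dict[str, float]:
    # Layer 1: back-of-napkin base demand.
    base = a.users * a.sessions_per_user_per_day * a.tokens_per_session
    # Layer 2: multipliers for reasoning models and agentic workflows.
    with_multipliers = base * a.reasoning_multiplier * a.agentic_calls_per_request
    # Cached input tokens do not need to be recomputed (simplifying assumption).
    recompute_fraction = 1.0 - a.input_token_share * a.kv_cache_hit_rate
    computed = with_multipliers * recompute_fraction
    # Layer 3: user growth and demand variability.
    average_with_growth = computed * (1.0 + a.user_growth)
    return {
        "base": base,
        "with_multipliers": with_multipliers,
        "computed": computed,
        "average_with_growth": average_with_growth,
        "peak": average_with_growth * a.peak_to_average,
    }


if __name__ == "__main__":
    # All inputs below are hypothetical.
    example = DemandAssumptions(
        users=1000,
        sessions_per_user_per_day=5,
        tokens_per_session=2000,
        reasoning_multiplier=1.5,
        agentic_calls_per_request=4,
        kv_cache_hit_rate=0.5,
        peak_to_average=2.0,
        user_growth=0.2,
    )
    for name, value in estimate_daily_tokens(example).items():
        print(f"{name}: {value:,.0f} tokens/day")
```

Sizing infrastructure against the peak figure rather than the average is one possible design choice; a real forecast would replace each multiplier with measured data.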
So demand of course leads us to supply.
How do you start thinking about supply once you've mapped out, you know, sort of your baseline and the conditions that you just outlined, Shruti?
How do you go about then translating that into creating the supply necessary to get all these tasks done?
So when it comes to token supply, that's where a lot of the AI infrastructure decisions lie.
And when you're making that decision, what you want is maximum token availability, token output, while minimizing your token cost.
Now when you think about cost or total cost of ownership, oftentimes organizations and decision makers can gravitate towards the easily available metrics.
And these are what I like to call input metrics, such as the cost per GPU hour or the FLOPS per dollar, which is essentially how many floating point operations you are getting per dollar.
And these are input metrics because they don't tell you anything about the actual delivered token output, which is a function of much more than just FLOPS or just, you know, the memory you have.
It is a function of extreme co-design.
And so the metric that represents both your input and the output is cost per token.
It's a very simple metric that tells you the cost that you are paying to generate one token.
And it's essentially, you know, the cost of the GPU divided by how many tokens the GPU produces.
So you know, it incorporates both the input and the output.
And gives you a sense of your true ROI from the AI infrastructure.
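The definition above (GPU cost divided by tokens produced) can be written out as a short sketch. The function name, the choice of dollars per million tokens as the unit, and all prices and throughputs in the usage example are illustrative assumptions, not figures from the episode.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on one GPU.

    gpu_cost_per_hour: what you pay for the GPU (the input metric).
    tokens_per_second: what the GPU actually delivers (the output).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical systems: B costs twice as much per hour as A,
    # but delivers ten times the token throughput.
    system_a = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=500)
    system_b = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=5000)
    print(f"A: ${system_a:.2f} per million tokens")
    print(f"B: ${system_b:.2f} per million tokens")
```

Judged only on hourly price, the second hypothetical system looks worse; judged on cost per token, it is the cheaper one, which is the mismatch between input and output metrics described here.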
It's interesting to hear you explain it because it sounds so simple, and I can understand coming from the other point of view, right, if we're outlaying for the GPUs and the server racks and all the interconnectivity.
And so we, you know, we count those costs, but looking at it from the other end, as you said, the output cost just makes so much sense because that's what you're trying to get at the end: the intelligence, the token.
And so putting the price on that seems like a really clear way to think about it.
Does the cost per token metric vary at all?
Or do you have to think about it differently depending on the use cases that you were talking about before?
Cost per token is sort of the base metric.
Now of course, it will vary depending on all the other things like the model, the context, the intelligence basically and then the interactivity.
So any tokens that are generated by a more complex model or are more interactive are going to be costlier, of course.
That's just physics.
So yes, it definitely does depend on the models and the context as well as the interactivity.
But you know, you said it really well earlier: ultimately, the business runs on the output, which is the tokens, right?
It is kind of a fundamental mismatch
if you are evaluating infrastructure based on the inputs, but your business runs on the output.
And that's why cost per token starts to get at sort of the real ROI, because it measures both in many ways.
Yeah.
So Shruti, as we think through input metrics and cost per token, is there an example that comes to mind that can really kind of bring this idea to life?
Yeah, absolutely.
In fact, if you look at NVIDIA Blackwell compared to NVIDIA Hopper, and if you look at merely the input metrics, which is the hourly GPU cost, that's 2x.
And so that's Blackwell being maybe 2x more expensive than Hopper.
If you just look at FLOPS per dollar, that's also 2x.
So Blackwell does deliver 2x more FLOPS per dollar.
Right.
And that sounds like a huge advantage, which it is, but it also doesn't even scratch the surface of the true sort of benefit and value of Blackwell.
Okay.
And that's because Blackwell, when it comes to delivered output, delivers 50x more tokens per watt compared to Hopper.
50x.
50x.
Fantastic.
So with the same infrastructure footprint, the Blackwell NVL72 system delivers 50x more tokens than Hopper.
And that translates to a 35x lower token cost.
Amazing.
Yeah.
And so that really, I think, you know, brings the point home on why not to just look at the input metrics, but to look at a metric like cost per token, which represents both what you're paying and what you're truly getting.
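The relationship between the quoted ratios can be checked with a few lines of arithmetic. This sketch assumes only the definition given earlier, cost per token equals hourly cost divided by tokens per hour; the function names are hypothetical. The implied throughput ratio it computes is a back-of-envelope inference under that assumption, not a number stated in the episode, and it is per GPU-hour, whereas the 50x figure quoted above is per watt.

```python
def token_cost_reduction(price_ratio: float, throughput_ratio: float) -> float:
    """How many times cheaper a token gets when the hourly price grows by
    price_ratio and tokens per hour grow by throughput_ratio."""
    return throughput_ratio / price_ratio


def implied_throughput_ratio(price_ratio: float, cost_reduction: float) -> float:
    """Tokens-per-hour gain needed to reach a given cost-per-token reduction."""
    return price_ratio * cost_reduction


if __name__ == "__main__":
    # Ratios quoted in the conversation: about 2x hourly cost, 35x lower token cost.
    needed = implied_throughput_ratio(price_ratio=2.0, cost_reduction=35.0)
    print(f"Implied tokens per GPU-hour gain: {needed:.0f}x")
    # Sanity check: plugging the implied gain back in recovers the reduction.
    print(f"Cost reduction: {token_cost_reduction(2.0, needed):.0f}x")
```

The point of the exercise is that a 2x difference in an input metric says almost nothing about the outcome; the cost-per-token result is dominated by the delivered throughput term.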
So I'm glad you mentioned I was going to ask you to go back.
You mentioned extreme co-design.
We've talked about it before, obviously, anyone familiar with the space has heard the term.
But maybe you can dig in a little bit to what it means, particularly in this context.
Yes.
I actually welcome the opportunity to talk about extreme co-design.
Fantastic.
We get asked this question quite a lot.
And so, you know, often we get asked, why extreme co-design?
What does co-design even mean?
Like, is it just integration?
And, you know, people may think that this is just splitting hairs or just semantics.
But I do think that the distinction is important because when you think about integration, you think about different parts, different sort of, you know, independent units that are then integrated after the fact, whereas co-design is about designing from the ground up,
simultaneously, multiple parts of the same system, knowing that they are all optimized towards the same outcome, that of lowest token cost.
Right.
And so that's why the word co-design is extremely important.
And the reason it is called, or rather we call it, extreme co-design, which is what NVIDIA does, is because of the depth and breadth into which it extends.
So is it co-design across just compute?
No, it's compute, memory, storage, networking, every aspect of everything.
I mean, the Vera Rubin platform has seven chips.
But it goes even beyond that.
There's all the software that sits on top.
So everything from the CUDA kernels to the runtimes, to the serving software, as well as all the way out to the ecosystem.
Everyone from our, you know, silicon partners, our OEMs, our cloud providers that we work with, the various OSS frameworks that we work with; the co-design extends beyond just sort of, you know, what's in a system, what's in an AI factory, all the way out to the ecosystem.
And that's one of the reasons why it's extreme.
And so anyway, you asked a more specific question about what are some of the extreme co-designs that help with the cost per token.
And I think one important one, which I think you've actually discussed in a previous podcast, is mixture-of-experts models, and how the Blackwell NVL72 is such a great fit for them because it kind of helps with the inter-GPU communication.
And then all the software, in terms of Dynamo's disaggregated serving, coupled with any of the runtimes that we support, whether it is TensorRT, vLLM, or SGLang, doing a technique called wide expert parallel that greatly optimizes the inference performance, and thereby reduces the cost per token for those mixture-of-experts models.
So that's one great example.
The other really good example is actually the Vera Rubin platform itself, which is built for the age of agentic AI.
And to really understand why that extreme co-design is required, maybe we can look at what an agentic workload is like.
Okay.
So thinking about an agentic workload, let's draw the parallel to the conversational workload.
In a conversational setting, the user prompts something, say you prompt something, and then the LLM says something.
And then you say something else, and then it says something back.
So you, as a human, are taking turns with the LLM, with the AI.
In agentic, it's actually AI taking turns with AI as well as with software, because a main agent can sort of based on the user input, decide to do some reasoning, then decide, oh, I need to do a tool call, so call some software.
Then it might decide, oh, I need a sub agent or a specialized agent to go do some work.
So it's going to take a turn with the specialized agent, the specialized agent does its computation, comes back with a result, and this just keeps going.
Right.
And we love a tactic for that.
That's right.
And it's multi-turn in a way that sort of has no user involvement, other than the initial prompt, which might be something like, say, book a ticket to Miami.
And then it goes through all of this several turns to then finally produce an outcome.
And the number of turns involved in agentic is significantly higher than conversational.
So the number of LLM calls, like the number of times the large language model is called, is also higher.
And in general, the token demand for that reason is also higher.
And that's why extreme co-design is so critical, because you are using so many tokens.
So you have to lower the cost per token.
Latency is really important because on every turn, even a couple of milliseconds more add up to potentially several seconds of delay for the end result.
And then finally, coming back to the Vera Rubin platform, now that we've sort of described the agentic workload, we can clearly see why the co-design is required to accelerate sort of the LLM itself, or the reasoning and the AI itself.
You need the Rubin GPU, you need the Groq 3 LPX solution to deliver that ultra low latency, you need the Vera CPU because it's going to do all this tool calling or, you know, sandboxing for code generation and code testing.
You need the CMX platform that we've talked about, which is the BlueField DPUs together with Spectrum-X, which allow for the KV cache, or the short-term memory as we discussed, to be offloaded when needed so that it can be retrieved when required for a match with an incoming request.
And so that's sort of another example of co-design, where being able to develop all of these from the ground up helps a lot.
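The contrast between conversational and agentic workloads described above can be made concrete with a toy calculation. It assumes the LLM calls in one request run strictly one after another and that each call has the same token count and latency; the function name and every number are hypothetical.

```python
def agentic_totals(
    llm_calls: int,
    tokens_per_call: int,
    latency_per_call_s: float,
    extra_latency_ms: float = 0.0,
) -> dict[str, float]:
    """Total tokens and end-to-end latency for one user request.

    extra_latency_ms models a small per-call slowdown (for example from
    slower networking or cache misses) to show how it accumulates.
    """
    total_tokens = llm_calls * tokens_per_call
    added_delay_s = llm_calls * extra_latency_ms / 1000.0
    total_latency_s = llm_calls * latency_per_call_s + added_delay_s
    return {
        "total_tokens": total_tokens,
        "added_delay_s": added_delay_s,
        "total_latency_s": total_latency_s,
    }


if __name__ == "__main__":
    # One conversational turn versus a multi-turn agentic request (made-up values).
    chat = agentic_totals(llm_calls=1, tokens_per_call=500, latency_per_call_s=2.0)
    agent = agentic_totals(
        llm_calls=40, tokens_per_call=500, latency_per_call_s=2.0, extra_latency_ms=50
    )
    print("conversational:", chat)
    print("agentic:", agent)
```

Because both token count and delay scale with the number of calls, small per-call savings in cost or latency are multiplied across the whole request, which is the argument made here for optimizing the full system together.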
Right, right.
So we talked about extreme co-design, and you mentioned all the different pieces that go into designing and building from the ground up.
Software is a part of that, but maybe you could double-click, Shruti, into how software plays a role and how important software really is.
Yes, absolutely.
So software actually is the difference between what you get in the real world, the delivered token output and the actual token cost, versus what you see on a spec sheet.
Software makes all the difference.
Right.
All the things on the spec sheet, the system design, cannot be fully realized unless you have software that makes use of them and delivers really good output.
Right.
And the other important thing about software is that it cannot be piecemeal optimization.
You need to have a robust software stack that can enable every single optimization, so that it can do, say, NVFP4 quantization.
It can also do MTP or speculative decoding.
It can also do disaggregated serving.
It can do the wide expert parallel, the KV cache offloading, the KV-aware routing, and on and on and on.
Right.
To be able to stack all of those optimizations together is really important.
Because that is what gets you the 50x.
Right.
The 50x more throughput that we see with Blackwell and the 35x lower token cost.
And so software is a huge, huge piece of that story for sure.
The other thing about software is that it never stops; open source software, especially, never stops.
No.
And it's not just the NVIDIA team that is building the software, it's the entire ecosystem.
Right.
It's all the OSS frameworks, all of our partners, customers, the developer community, and every small optimization that they do is a drop that just keeps adding and adding to this massive ocean of advantage
that is the NVIDIA ecosystem.
And so just as an example, on both vLLM and SGLang, which are these inference runtimes, we've seen 8x more performance in just about six months.
And that's huge.
Yeah.
Because from the same infrastructure footprint, you're getting so much more token output.
And that's driving down your token cost as well.
So absolutely software is a huge huge piece of the puzzle.
So we're through three of the four pillars.
The fourth one, perhaps the big one, monetization.
How do you talk about monetization?
How does a business leader think about?
Okay.
So I understand the importance of extreme co-design.
I understand the different value of tokens in different situations and different tasks.
And intent is wonderful, regardless of all the things you've elucidated here.
How does a business leader think about monetization of the tokens?
Right.
So when it comes to monetizing tokens, there are various different ways in which you can go to market.
But one of the best proxies is to just think through it as you're generating tokens and then you're selling the tokens.
And so when you think about selling the tokens, how much do you sell them for?
And it's sort of a classic exercise in figuring out your pricing, which is that you need to think about what the cost to produce the token is, which is this lowest token cost that NVIDIA is helping with.
Right.
But you do need to understand what is your token utility and given that token utility and token value, what is your cost to produce the token?
And you obviously want to charge more than that.
Right.
So there's that.
Okay.
So that's cost-based pricing.
Right.
And then you also obviously have to think through value-based pricing, which is essentially how much is the willingness to pay?
What is this sort of token utility?
How valuable is it to the people who are going to pay for it?
Right.
So you do need to take that into account.
And then finally before you think through the pricing, you also need to think through what is the demand distribution?
Because ultimately, there are revenue goals and you know, kind of profit margin goals that you are working towards.
And so to land at a place that you like, you do need to think through where will your sort of bulk demand be?
And how will the demand taper off when it is, say, for tokens that do not have as much utility?
There may not be many takers, but in the same way, for tokens that are highly valuable, there will be fewer people who are willing to pay the premium for that.
So you do need to account for that sort of demand distribution.
And then with those three things, you can figure out what the pricing for each token can be and then deploy it successfully.
So the key thing here, though, is that, you know, pricing the tokens, again, is obviously just one proxy.
There will be customers who are building value added services on top of those tokens.
Sure.
So like a customer who's building an AI product or something like that.
And then in that case, the process is similar, but you do need to think through what additional value you are adding on top of just generating those tokens.
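The three pricing inputs discussed above (cost to produce, willingness to pay, and demand distribution) can be combined in a small sketch. This is an illustrative assumption about how they might fit together, not a pricing method from the episode: the function name is hypothetical, the demand table stands in for willingness-to-pay data, and all numbers are made up.

```python
from typing import Optional


def best_price(
    cost_per_million: float,
    demand_by_price: dict[float, float],
) -> Optional[tuple[float, float]]:
    """Pick the candidate price with the highest gross profit.

    cost_per_million: cost to produce one million tokens (the pricing floor).
    demand_by_price: candidate price per million tokens mapped to the expected
        volume sold at that price, in millions of tokens.
    Returns (price, profit), or None if no candidate price covers the cost.
    """
    best: Optional[tuple[float, float]] = None
    for price, volume in demand_by_price.items():
        if price <= cost_per_million:
            continue  # cost-based floor: never sell at or below cost
        profit = (price - cost_per_million) * volume
        if best is None or profit > best[1]:
            best = (price, profit)
    return best


if __name__ == "__main__":
    # Hypothetical demand curve: volume tapers off as the price rises.
    demand = {0.4: 1000.0, 1.0: 600.0, 2.0: 250.0, 4.0: 50.0}
    print(best_price(cost_per_million=0.5, demand_by_price=demand))
```

A lower cost per token widens the set of viable prices, which is where the supply-side work connects to monetization. A product that adds value on top of raw tokens would price that added value as well, which this sketch does not model.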
Right.
So going back, Shruti, to something I was thinking about when you were explaining extreme co-design.
When you get to a point kind of a sweet spot where you've, you know, your infrastructure is humming and the cost per token is low.
Does that mean that ultimately you won't need as many GPUs to produce the number of tokens that you really need or what happens in that kind of a scenario?
Right.
This, this is a great question.
And what we see here is the classic Jevons paradox.
Right.
Which is essentially, you know, you would think that, okay, the GPUs are way more productive.
They're, you know, generating so many more tokens, do you need less of them?
And the answer is absolutely no.
And the reason is that as you see the efficiency, new use cases get unlocked.
Sure.
And people just figure out, we have all this, you know, thriving research community, data scientists, ML engineers, they just figure out how to just use up that efficiency, how to absorb that efficiency and do more with it.
Right.
People aren't going to run away from intelligence.
They want to use it.
That's right.
And if you look at the sort of macro pattern that we've seen so far, it's, it's very telling.
So when generative AI became a thing and people were, you know, sort of generating summaries and images, that was great.
Then we lowered the cost per token.
And instead of, you know, needing fewer GPUs, they needed more GPUs and more tokens. Why?
Because of test-time scaling and reasoning.
And so our researchers figured out that by scaling at test time, we can generate better, more accurate, more intelligent responses.
And that was valuable for the use cases.
And so that happened.
And it didn't just happen once.
We are seeing that again now with agentic.
Right.
Now we've, you know, sort of figured out how to deploy these mixture-of-experts models and reasoning models efficiently, and lowered the cost per token for those significantly.
Now here comes another inflection point where it's like, hey, we've got more tokens.
Let's do more with them.
Of course.
And so that's where the agentic revolution is happening.
And so definitely, it's the Jevons paradox in action at the macro level.
And I've also seen this play out at the level of, you know, individual customers.
Right.
So that's a great question.
So Shruti, can we kind of ground this in some examples of how businesses or organizations that you've been working with are putting all of this into action and really, you know, extracting value from the tokens and using them to build?
Yeah, absolutely.
So when you think about taking tokens and turning them into business value, there are four primary, I would say, business models, so to speak.
Okay.
Number one is what we just discussed, which is selling tokens directly.
And a lot of NVIDIA customers and partners are doing this, and the examples are Fireworks, Baseten, Together AI, DeepInfra; there are just so many of them.
And all of them are helping their end customers build, you know, valuable services on top of the tokens that they're selling.
So that's number one.
Number two is AI-native companies who are building products from the ground up with AI in it, you know, sort of permeating through it from day one.
And those are customers like Perplexity or Cursor, who have a coding engine, and, you know, many, many others.
So that's sort of the second model.
Third is you might use AI to enhance your existing products and infuse AI through your existing products.
And again, lots of different examples we have Shopify, we have Airbnb, there is Adobe.
In fact, a lot of them are doing both.
They are building AI native capabilities, but they're also using AI to improve their existing products.
For example, Adobe is, you know, they've built their Firefly family of models.
And then they're using those models to infuse new capabilities into Photoshop for example.
And then the final bucket is pretty much every organization today, which is trying to improve their internal operations, their internal processes, improve employee productivity by deploying AI.
So they are not necessarily sort of deploying external customer-facing products or services, but these are internal to their own operations.
And again, NVIDIA is working pretty much with everyone on something like that as well.
So those are the four key ways.
I'm sure that there are others that are more nuanced, that I missed.
But that's a useful framework to think about how to take the tokens and turn them into business value.
Right.
So for the business leader who's listening and comes away from this with a better understanding of the pillars and how they relate and really what in tokenomics, right?
What the cost of a token is and how you describe value and everything.
How do they get started putting this into practice?
What advice would you leave them with for thinking about how to put this into action and their own organizations?
I think the best place to start is to first just think through what is the final outcome.
And usually that starts with your customers.
Whether they are external customers or your own internal employees and internal processes, that's immaterial; you have to start back from the customer need, from the use case.
Because as we discussed, the use case actually dictates a whole lot.
The user and the use case dictate what type of model you will use.
It dictates what type of context lengths you might need to support.
It dictates what type of interactivity you will need.
So the intelligence, the interactivity, and then those factors are what dictate what type of infrastructure you need.
Right.
Right.
And then of course we walk through the key metrics such as cost per token when making those infrastructure decisions.
And then that's the supply.
So you essentially walk back from token utility and token demand, think through token supply.
And then once you have a handle on all three of those you think through your monetization strategy and then go to market and then you fly.
Customer first, work back from that. Easy enough.
Shruti, thank you so much for taking the time to join the podcast and really break down tokenomics in a way that I think listeners and viewers can extract so much value from, to talk about extracting value.
It's really comprehensive and yet really easy to follow and understand, you know, start to finish, how this all comes together.
So thank you again.
Yeah.
Thank you for having me.