
Latent Space · 2025-12-18
SAM 3: The Eyes for AI — Meta's Unified Concept Segmentation Model
Hosts: Alessio (swyx) Fanelli, Joseph Nelson (Roboflow, co-host)
Guests: Nikhila Ravi (Meta Superintelligence Labs), Pengchuan Zhang (Meta Superintelligence Labs), Joseph Nelson (Roboflow)
Key claims
- SAM 3 unifies segmentation, detection, and tracking into one model using text concept prompts, replacing multiple task-specific models
- Launched as three separate models: SAM 3, SAM 3D objects, and SAM 3D body — SAM 3 is just the 2D image/video model
- Built on Meta's Perception Encoder with a novel 'presence token' that decouples recognition from localization; detector and tracker are decoupled
- New SA-Co benchmark has 200K+ unique concepts vs. 1.2K in the prior LVIS benchmark; 70%+ of training annotations are negative phrases
Episode summary
Meta's Nikhila Ravi and Pengchuan Zhang join Roboflow co-founder Joseph Nelson to discuss the launch of SAM 3, which they describe as more than a version bump — it's an entirely new interface for segmentation. SAM 3 unifies interactive segmentation, text prompting, open-vocabulary detection, and tracking into a single model using "concept prompts" (short text phrases like "yellow school bus" or "watering can"), eliminating the need to click every instance. The launch actually included three separate models: SAM 3 (image/video), SAM 3D objects, and SAM 3D body.
The team detailed the architectural decisions behind SAM 3, including decoupling the detector from the tracker to resolve the conflict between identity-agnostic and identity-preserving representations, introducing a "presence token" that explicitly separates recognition from localization, and building on Meta's Perception Encoder. The new SA-Co benchmark contains over 200,000 unique concepts (vs. 1,200 in prior benchmarks like LVIS), and a novel data engine reduced per-example annotation time from ~2 minutes to ~25 seconds through AI-assisted verification steps. SAM 3 runs at ~30ms per image with 100 objects on an H200 and scales linearly with object count on video (10/28/64 objects on 2/4/8 H200s).
A major theme was positioning SAM 3 as "the eye" for LLMs and frontier multimodal models. The team demonstrated SAM 3 Agents, where LLMs (Gemini 2.5, Llama) call SAM 3 to ground complex visual reasoning tasks SAM 3 alone can't handle. They explicitly benchmarked against Gemini 3 Pro and Florence 2, showing SAM 3 was faster, more accurate, and produced segmentation masks rather than just bounding boxes. The philosophical framing: simple visual tasks (counting <20 objects) should be native to frontier models, while complex tasks warrant hybrid tool-call approaches. Pengchuan predicted SAM 3-native integration into multimodal models is more likely than a pure tool-call paradigm.
- Data engine cut annotation time from ~2 minutes to ~25 seconds per example via AI-assisted mask verification using fine-tuned Llama 3.2
- Runs at ~30ms per image (100 objects on H200); video scales linearly (10/28/64 objects on 2/4/8 H200s); fine-tunable with as few as 10 examples
- Positioned as 'the eye' for LLMs — SAM 3 Agents demonstrated with Gemini 2.5/Llama outperforming Gemini 3 Pro and Florence 2 on grounding tasks
- Roadmap: SAM 3.X smaller/efficient models, video-specific improvements, better integration with frontier multimodal models; Pengchuan argues simple visual reasoning should be native to LLMs, not tool-called
Source material
Transcript
Yeah, yeah.
Okay, we're here in the remote studio with the grand return of the Roboflow, Latent Space, and SAM combo.
Welcome to Joseph, my sort of vision co-host, I guess.
Thanks.
Good to be here.
Welcome back.
We also have, welcome back, Nikhila Ravi, who's the lead on SAM 2, I guess just SAM in general, right?
And we have Pengchuan Zhang, who's also a researcher on SAM.
Yeah, nice to meet you guys.
So congrats on SAM 3's launch.
I mean, the demo, each time you set it up, is really amazing.
And I think my general takeaway every time I tell people about SAM is that every time you have a new release, it's like once a year you show up, you drop a banger, and then you just drop the mic and go for next year.
And you also add a dimension.
So I was really not surprised when SAM 3 had a 3D thing.
Because I'm like, well, yeah, which is the next dimension to go?
It's like 3D.
Yeah, actually, maybe just on that one, I think that's actually a common misconception.
We launched actually three separate models this time.
It was SAM 3, SAM 3D Objects, and SAM 3D Body.
Yes.
Those were two completely separate models.
And SAM 3 is just the image and video understanding model.
Which is on a DETR backbone and is fed up.
Yeah, sorry, I didn't mean to sort of preface all this.
But maybe just to remind our audience, or for people new to the SAM series of podcasts that we've done so far, maybe each of you can go around and intro your entry into computer vision, or your relationship with SAM.
Go ahead, Nikki.
Okay, cool.
Hi, everyone.
I'm Nikhila.
I'm a researcher at Meta.
I've been at Meta for eight and a half years.
So I've really been through the evolution of the field in that time. I started out working on a range of different problems in computer vision and worked briefly on 3D.
We built this library called PyTorch3D.
But I really started on Segment Anything as a project around late 2021.
So it's actually been almost four years that I've been working in this Segment Anything space.
And, you know, we started with SAM 1 in 2023.
SAM 2 last year, in July 2024.
And then now SAM 3.
So it's been, you know, a culmination of a lot of work by a lot of people over the years.
So yeah, really, really excited to be at this point and you know, get to share it with all of you.
I'll hand it over to Pengchuan.
Yeah, hello, everyone.
So I'm Pengchuan.
I'm a researcher on the SAM team.
I have been working in the computer vision field for nearly nine years, starting from 2017.
I think it's a long time. I worked at MSR for five years and then moved to Meta Reality Labs to work on egocentric foundation models on AI glasses for a while.
And then in 2023 I moved to the SAM team, and that was exactly the start of SAM 3. I think that's the experience of a lifetime that I've had on the SAM 3 team.
And I'm glad that SAM 3 is out and that I kind of achieved my original grand goal in computer vision: to reach kind of human performance in detection, segmentation, and tracking in images and videos.
I'm Joseph, co-founder and CEO at Roboflow, where our mission is to make the world programmable.
We think software should have the sense of sight and models like SAM and others are critical to unlocking that capability.
Now millions of developers and half the Fortune 100 build with Roboflow's tools and infrastructure to create and deploy models to production.
We've been big believers in the Meta family of open source models, all the way back to Mask R-CNN and Detectron2, all the way to the present with SAM 1, SAM 2, and SAM 3.
The work that the Meta team does to advance the state of the art in open source computer vision has been bedrock to enabling developers and enterprises globally to adopt AI.
So we've been big fans of the work, and I'm pleased to be joining you today, swyx, to co-host the episode on SAM 3.
And you guys shipped your own DETR model too.
Yeah, we've been doing some work to advance machine learning research too, for example on DETRs, detection transformers, which was born out of NeurIPS last year. I think swyx actually challenged us; you were like, hey, what are some of the advancements that are happening in computer vision and in visual AI?
And we had this observation that transformers had surpassed a lot of CNNs in vision tasks, but they hadn't been made to run in real time.
As in, you know, over 30 frames per second on a small edge device, and hundreds of frames per second on a T4. We did some research and published RF-DETR, Roboflow Detection Transformer, which, you know, we kind of joke is the greatest-of-all-time model for doing real-time segmentation and object detection on the edge.
Now, with RF-DETR, you have to have a fixed class list and need to know the objects that you want to segment ahead of time.
But for anyone that's running on constrained compute on an edge device and wants an Apache 2.0 model to do that, RF-DETR and its family of models are key to fulfilling that mission and that goal.
Yeah, amazing.
Okay, I think we are going to just go into a SAM 3 demo.
I think Nikki, you've prepped some stuff to show us and this is great because obviously there's nothing better than the creator of the tool showing off the tool.
So just to start with, what is SAM 3?
So SAM 3 is a model that can detect, segment, and track objects in images and videos using what we call concept prompts.
So I'm going to start with a simple image example and then we'll show you a video example.
So a concept can be anything that is a short text phrase.
So here, for example, we can use something like "watering can" and you can see the model predicts a mask for the watering can.
You can also then refine the prompts using clicks or additional visual exemplars, which I'll show you in a different image, but essentially the idea of a concept prompt opens up the ability to find all instances of an object category without having to manually click on every single instance, as you would have had to do if you were using SAM 2 or SAM 1.
Now if the model misses any of the instances, you can add visual exemplars.
So a visual exemplar is also a way to describe a concept to the model.
So here I can add a positive box here and show the model that this is also an instance of a flower that we want to detect.
So this is just on images.
But what's really cool is you can now also do this in video.
And so here I'll show you an example.
Maybe this is a football match.
You want to track all the players in white, for example. So, red jersey or white jersey: you can provide a concept prompt and the model finds the objects in the first frame, then tracks them and detects new instances that appear later on in the video.
So it's not just detecting on the first frame, but both tracking those detections and finding new instances that appear throughout the video.
And one of the things we love to do in our demos is also show some real world applications of this.
And so one idea here is you can use this for video editing or adding effects.
But here was a really simple mask effect.
But you can imagine, for example, you might want to add a trail around the players.
Yeah, you can follow them around.
Maybe you want to clone them.
So you've got multiple players running around.
You can also do background effect.
For example, spotlighting players.
And so these are just fun things you can do on top of the SAM 3 outputs.
And this is just like a way to show people like what you can do.
There's also some templates, which basically are pre populated with a text prompt and an effect.
And these are just some fun ways you can use the outputs.
But really, you know, the crux of it is in this like create from scratch, where you can upload any image or video and try SAM 3 on that.
And we'll share the link so you can try it out as well.
One of the other demos that I have is a busy scene for doing labeling, which we can do later on, but just to give you a preview.
It's like if you want to find tablecloth, and maybe back there, there's airplanes.
People do airplane, and you kind of get the ability to start to refine the confidence thresholds.
They do.
I don't know why tablecloth wasn't as good.
I've used that one in the past. Table, maybe.
Yeah, cool.
Wow, look at that.
Yeah, I think the other impressive thing that you guys emphasized in your launch is the latency.
I don't know where this particular inference is running.
But it says something like SAM 3 runs in 30 milliseconds on a single image.
That's with 100 detected objects on an H200.
Obviously this is an H200.
But it's also, like, impressively fast.
And basically you can be real time if you want.
Yeah, definitely on images, it's really fast.
And then on video, it kind of scales with the number of objects, but for a limited number of objects it's still real time.
Yeah, I'd also add that even for video, if you can afford the GPUs, we do a kind of parallel inference algorithm.
So even if you have a number of objects to track, you can still get real-time tracking performance as long as you scale up the GPUs.
So I'm reading in the paper, it's 10 objects on two H200s, 28 on four H200s, and 64 on eight H200s.
So something like that.
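As a quick check on the scaling figures just quoted, the sketch below divides each object count by its GPU count. The numbers are the ones read out from the paper in this conversation; the variable and function names are illustrative only.

```python
# Real-time tracked object counts per number of H200s, as quoted in the conversation.
realtime_objects = {2: 10, 4: 28, 8: 64}


def objects_per_gpu(table):
    """Return the number of tracked objects per GPU for each GPU count."""
    return {gpus: objects / gpus for gpus, objects in table.items()}


per_gpu = objects_per_gpu(realtime_objects)
for gpus, ratio in per_gpu.items():
    print(f"{gpus} x H200: {realtime_objects[gpus]} objects, {ratio:.1f} per GPU")
```

On these three data points the per-GPU capacity does not drop as GPUs are added (5, 7, then 8 objects per GPU), which is consistent with the claim that real-time tracking holds as long as the GPU count is scaled with the object count.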
I don't think there's an architecture.
I don't know if this is the parallelism demonstration that we're talking about.
Yeah, in fact, when you try the demo, the video uses the parallel implementation of the video grounding.
So it's already in that fast mode.
Yeah, try it with a video with like lots of objects.
And then you can notice that it's actually not very slow.
And you get the sense that we are doing the multi-GPU inference.
Yeah, everyone should try it out and see for themselves.
So okay, amazing.
So this thing about concept segmentation, I feel like you had a prototypical version of this.
And in your paper, you really talk about sort of generalizing it. I guess, what was the planning like in SAM 3?
At the start of this, is what we have today exactly what you planned for?
Or did it emerge as you discovered capabilities?
Yeah, to quickly talk about that: in SAM 1 we did have a proof of concept of text prompting, but that was just a very early exploration. It wasn't really built out, and, you know, it became the most highly requested feature since then.
And so, you know, in SAM 3, we really wanted to do it properly, and actually do this in a way that works in all different scenarios.
And so we had to really think about how to formulate the problem.
So it could have been that we took open-ended text input, and it works for all open-ended text, or we could be more focused, which is what we chose to do: really focus on these atomic visual concepts like "yellow school bus" or "purple umbrella", and really focus on nailing the problem for these atomic visual concepts.
But Pengchuan, maybe you want to talk a little bit about the benchmarks that existed previously and how we had to actually fully redefine the task and the benchmark that we wanted to solve.
Yeah, and maybe just to add to Pengchuan's point, if you look at the size of these benchmarks, the previous benchmark Pengchuan mentioned, LVIS, that everyone uses has about 1.2K unique concepts, and the benchmark that we created, which we're calling Segment Anything with Concepts, or SA-Co for short, has more than 200,000 unique concepts.
If you think about the natural language that people use, we don't just use 1,000 words; we have a very large vocabulary, and we really wanted to build a benchmark that can capture that diversity and size.
Yeah, it's really impressive and also like very formulaic, I guess or classic that every great model work starts with a lot of data work.
I think it's basically a very scaled-up version of the same process as SAM 2.
Yeah, in some ways. I think that in SAM 3 the data engine really was a very novel and critical component.
I think, to your point, the advantage in AI is not just about the models, but really about the data, and maybe even more so the data engine that generates that data, and we put a lot of effort in SAM 3 specifically into trying to automate that process a lot.
One of the things that we're really impressed by is the diversity and depth, as well as breadth of uses that we see with models like SAM in production.
Basically, when you think about computer vision, folks kind of always think about dogs and cats and simple sorts of things.
And the reality is like computer vision is where AI kind of meets the real world.
So any sort of thing that needs to be seen and understood, you need to have understanding of that thing.
So a model like SAM expanding the concepts from, like, a few thousand closed-form concepts max in a single model to tens of thousands of concepts means that you're going to see a huge acceleration in the number of fields and applications of the model.
So this is SAM 3, right?
So we've already seen and measured some of the impact of the SAM family of models and we pulled some of the updated stats on how impactful SAM is being across the Roboflow community.
I think Roboflow might maintain one of, if not the largest hosted instances of SAM.
And we've seen basically 106 million Smart Poly-created examples that are SAM 1, 2, or 3 powered.
And we estimate that that saved humanity collectively like 100, maybe 130 years depending on exactly how you want to do the calculation of time just curating data.
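To make that estimate concrete, here is a back-of-the-envelope sketch of the per-example time saving it implies. The 106 million and 100 to 130 year figures come from the conversation; using 365.25-day years of continuous time is an assumption made here, since the speaker notes the result depends on how you do the calculation.

```python
# Figures quoted in the conversation.
EXAMPLES = 106_000_000
YEARS_SAVED_LOW, YEARS_SAVED_HIGH = 100, 130

# Assumption: a "year" is 365.25 days of continuous time.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_saved_per_example(years_saved, examples=EXAMPLES):
    """Average seconds saved per labeled example implied by a total-years estimate."""
    return years_saved * SECONDS_PER_YEAR / examples


low = seconds_saved_per_example(YEARS_SAVED_LOW)
high = seconds_saved_per_example(YEARS_SAVED_HIGH)
print(f"Implied saving: {low:.0f} to {high:.0f} seconds per example")
```

Under that assumption the estimate works out to roughly half a minute saved per example.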
And each of those use cases isn't dogs and cats on the internet.
It's things like, I don't know, we see medical labs across the world that are accelerating cancer research by doing things like counting and identifying the automation of neutrophils after a given experiment.
Or we see folks that are using aerial imagery for things like helping a drone navigate through the world, or maybe counting and seeing solar panels from above, or maybe even doing like insurance estimates.
We see folks that are building underwater trash cleanup robots.
So you can imagine an autonomous underwater bot that's navigating through the Pacific Ocean, identifying and grabbing plastics and cleaning up the world's ecosystem.
Relatedly, we've seen some work with aquariums across the US, like MBARI, who are doing work to keep track of species and identify the impact of given steps that are taken, or increasing the populations of given fish, with underwater fish cameras.
We see folks in industrial settings like doing work to produce electric vehicles or get products from point A to point B.
At the time of recording this, it's like near Christmas time and it's like high time for holidays for folks that are doing gift giving.
And that ends up being really really high time for making sure goods and services show up where they're supposed to be at the given point in time.
One of the statistics that we track is the frequency with which folks cite works like SAM or Roboflow or blogs that we publish.
And there's now basically like a little over two research papers published every day citing some of the work across like the Roboflow community.
And that's folks that are publishing in Nature and ScienceDirect and a number of fairly prestigious journals.
And each of those you got to think about it.
Each one of those publications is someone's like seminal work often six, 12, 24 months of effort that's been accelerated from models like SAM.
So it's not an exaggeration to say like models like SAM are speeding up the rate at which we solve global hunger or find cures to cancer or make sure critical medical products make their way to people all across the planet.
And at the infrastructure level, we're like thrilled and surprised constantly by the breadth and depth of adoption that we see from the community.
I mean, in the first five days of SAM 3, there was like 8 million inferences of folks that were running across all diverse sets of fields.
And that's actually only increased because it was released and then there's like Thanksgiving and now it's back and folks are like hitting it pretty hard.
So it's been incredibly encouraging to see both the depth of adoption and how much the community takes and uses and relies on models like SAM in prod.
Yeah.
And maybe just to add to that from the Meta side, we don't usually get as much visibility into all of these real-world use cases.
So, you know, being able to hear that from Roboflow and having these models available on the platform is so valuable for us as well.
We get to know how these models actually work in the real world, which is, you know, ultimately the best eval for a model.
So I think, you know, it's definitely awesome to hear about all these things that we're empowering.
Nikhila, you had this comment that the best eval for a model is, like, not necessarily benchmarks.
What was it, like, if it works on real-world things?
I think it's a really good soundbite.
Probably something like: the best eval is if it works in the real world.
Yeah, sure.
And that's like the ultimate goal for all of our models like SAM 1, SAM 2, SAM 3, we want people to use it out of the box as much as possible.
And I think, you know, with language and SAM 3 specifically, there does need to be in some cases some domain adaptation.
But we have sort of tried to make that easy.
I don't know, Pengchuan, you want to talk a little bit about that, like the fine-tuning aspect.
I wanted to also endorse like the real world thing.
I was just so happily surprised when I was visiting the CZI Imaging Institute in preparation for our pod with Mark, that they were using SAM in imaging the human cell.
And they showed us how, in reality, all these sort of masses are actually really undifferentiated and it's really hard for the human eye to track.
This is actually a simpler one.
This is pretty clean here.
In reality, a lot of it is just like just gray mush.
And you have to segment individual lysosomes out of these, and they showed us how they were using SAM and fine-tuning SAM to do it.
Yeah, really, really, really complicated and also like very meaningful for basic science research.
And I was also going to mention this in the paper, the distribution: you can actually see what SA-Co does.
So a lot of animals.
And then very surprisingly few maps.
I'm like, maybe there should be more maps.
I'll say Hugging Face has been doing a lot here and other other companies.
Yeah, this is actually something we get asked a lot: what's the minimum amount of data I need to fine-tune? And, you know, being able to do that with just sort of 10 data points hopefully will unlock a lot more than we can do ourselves.
Yeah, I mean, the more the merrier, obviously. This is where ablations are really helpful.
You probably didn't have any fine-tune ablations in here.
I think this is all data and model scaling oriented.
But yeah, I mean, like very, very, very awkward.
And I just have a cheeky curious point.
Is there a ratio, what is the ratio of negative examples to positive examples, right?
So in Nikhila's example, when you were demoing just now, you only selected positive examples.
Obviously, there's going to be a lot more negative examples of not-the-class than positive examples of the class.
So there should be some exchange ratio where a negative example contributes less than a positive example.
Or is that not the case?
For positive and negative examples, I don't know that I have seen a golden ratio that works well or doesn't work well, but I can offer anecdotally that a single negative example goes a long way.
A common place where fine-tuning is really helpful is data that's out of distribution that might possibly have been in distribution.
Like, one of my favorite fine-tune examples is counting Waymos.
There's not that much data that has Waymos labeled throughout the streets of San Francisco.
But SAM does a really good job of identifying a Waymo as a vehicle.
If you prompt with "Waymo", it doesn't find anything; if you prompt "vehicle", it labels a Waymo as a vehicle, which is valid, but a Waymo is a specific type of vehicle, right?
Usually from even just a 10-second video clip, you can actually start to have SAM 3 learn what should have been seen as a Waymo versus what should have been seen as a vehicle.
And even on a single image example, we see that SAM 3 starts to adapt, because it takes the text and image prompt into account when it makes a subsequent inference. From, like, three to five negative examples alongside positive examples, you start to see the model update its priors, if you will, for where it would predict things from what the user provided.
All this is riddled with caveats, right?
Because when you talk about the visual world, the negative examples and the positive examples could be from a very different perspective or a very different type of object.
Like maybe you're labeling dog breeds and suddenly a new dog breed appears, or maybe you have a perspective where it's overhead and then suddenly you have a side-by-side view.
So usually the best way is to have these things meet the real-world data and try, but I'll offer maybe the note that a small number of negative examples goes a really long way.
Small like three to five, not like hundreds.
Yeah, the other place where negatives play a big role is just: is it in the image or not?
And that was one of the things that we did was really separate the problem into a recognition problem and a localization problem.
So first, can you answer the question is this object or is this concept in the image?
And then if it's in the image, where is it in the image?
And so to really, to really build in that capability, we had to annotate a lot of negative phrases in images.
So basically a lot of phrases that don't exist in the image in addition to the concepts that exist in the image with the corresponding mask pair.
So, you know, if you look at one of the tables in the paper, which shows the training data set distribution, I think it's Table 24, more than 70% of the annotations are these negative phrases that are not present in the image.
So we have to really train the model to not detect stuff that is not in the image.
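One way to picture the annotation scheme described here is as (image, phrase, masks) records where a negative phrase simply has no masks. The sketch below is a hypothetical schema invented for illustration, not the actual SA-Co format; the class name, field names, and toy records are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PhraseAnnotation:
    """Hypothetical record: one phrase checked against one image."""
    image_id: str
    phrase: str
    masks: list = field(default_factory=list)  # empty means the phrase is absent

    @property
    def is_negative(self):
        return len(self.masks) == 0


def negative_fraction(annotations):
    """Share of annotations whose phrase does not appear in its image."""
    if not annotations:
        return 0.0
    return sum(a.is_negative for a in annotations) / len(annotations)


# Made-up toy data: one positive phrase and three negative phrases for one image.
toy = [
    PhraseAnnotation("img_001", "watering can", masks=["mask_0"]),
    PhraseAnnotation("img_001", "yellow school bus"),
    PhraseAnnotation("img_001", "purple umbrella"),
    PhraseAnnotation("img_001", "tablecloth"),
]
print(negative_fraction(toy))  # 0.75
```

The toy ratio is arbitrary; the point is only that negatives are first-class annotations that teach the model to return nothing when a concept is absent.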
Yeah, I think that the separation of localization and it's basically precision recall, right?
But in the vision domain, we basically add this presence token to the model, which explicitly separates the task of recognition and localization.
So basically, it simplifies the task.
And so the model doesn't have to try to do everything with just the proposals in the detector.
We'll be able to have this global like sort of learned token just for the recognition part.
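The split between recognition and localization can be sketched as a scoring rule. The conversation does not spell out how the presence token is combined with the per-proposal scores, so the product used below is an assumption for illustration, and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_detections(presence_logit, query_logits, threshold=0.5):
    """Combine one global presence score with per-proposal scores.

    presence_logit: "is the concept anywhere in this image?" (recognition)
    query_logits:   "is this particular proposal a match?" (localization)
    Assumption: the final score is the product of the two probabilities.
    """
    presence = sigmoid(presence_logit)
    scores = [presence * sigmoid(q) for q in query_logits]
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return presence, scores, keep


# Concept judged absent: even confident proposals are suppressed.
print(score_detections(-4.0, [3.0, 2.0])[2])
# Concept judged present: only the confident proposal survives.
print(score_detections(4.0, [3.0, -3.0])[2])
```

Under this rule the proposals no longer have to answer the "is it here at all" question on their own; a low presence score vetoes every box at once.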
Yeah.
In general, I find that you guys did a lot of extra net-new work. You had a really nice chart in here with the yellow boxes being the new stuff.
I forget which one, the architecture diagram.
Yeah, I'm like, holy crap.
Last time it was, you know, there's the memory stuff.
This is SAM 2.
And here is all this.
Obviously, we know it's hard to cover it all.
But, you know, I wonder if there's any other interesting stories or tricks like the presence token that you might want to focus on.
Yeah, I mean, this is nice.
This diagram, I'm glad you brought it up, because SAM 3 isn't just a version bump. It's, you know, an entirely new approach to doing segmentation.
It's like this new interface for segmentation.
And it combines so many different tasks where previously you would have needed a task specific model for each of these tasks.
You know, interactive segmentation, text prompting, open-vocabulary detection, tracking: for all of these tasks, you would have needed a separate model.
And so, you know, really had to do a lot of work to bring it together.
I think one of the things we did was really decouple the detection component and the tracking component.
So you can see, you know, we still preserve the tracking components from SAM2.
But the detector is separate.
And the reason we do this is if you think about what a detector has to do and what the tracker has to do, the detector needs to be identity agnostic.
So if you have a concept dog, it needs to be able to find all instances of that dog.
And it needs to sort of have this representation of dog that is the same for all dogs.
But when you're tracking those dogs through the video, each dog needs to have a separate representation such that we're able to preserve the identities.
And so there is this kind of task conflict that emerges between the detector and the tracker.
And so we really had to, you know, we experimented a lot.
We really tried to build kind of a unified approach to do things.
But then what we found was that having a separate detector and tracker really worked.
But we use the Perception Encoder as this shared visual backbone.
And this is sort of a text and image aligned encoder.
You can see the green boxes there; where it says "from PE", that's Perception Encoder.
That was also from our group in FAIR at the time.
This was released earlier this year in April.
And so this really is bringing together components from the entire FAIR and Meta ecosystem.
We have Perception Encoder, we have a DETR detector, we use SAM 2.
We also use Llama in our data engine.
So we really like using all the components from...
Yeah, it's like any third film in a trilogy, like you always see the previous recurring characters come back.
Yeah, well, if it works, you've got to continue using it.
And to connect to something we just discussed earlier, you mentioned that in the video component, each object needs to be tracked independently.
That's why the compute scales linearly with the number of classes, right?
Because each of those instance types needs to be maintained?
It scales with the number of detected objects.
Yeah.
So for example, like each dog that appears in the video, each one of those needs to be tracked independently.
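The division of labor described above can be sketched with a toy tracker. To be clear, greedy IoU matching is a stand-in chosen here for brevity; SAM 3's tracker is the memory-based component carried over from SAM 2, and every name below is hypothetical. The sketch only shows that the detector emits identity-free boxes while the tracker owns one state entry per object.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class ToyTracker:
    """Keeps one entry per tracked object, so work grows with object count."""

    def __init__(self, match_threshold=0.3):
        self.tracks = {}  # track id -> last known box
        self.next_id = 0
        self.match_threshold = match_threshold

    def update(self, detections):
        """Assign each identity-free detection to an existing or new track id."""
        assigned = {}
        unmatched = set(self.tracks)
        for box in detections:
            best_id, best_iou = None, self.match_threshold
            for tid in sorted(unmatched):
                overlap = iou(self.tracks[tid], box)
                if overlap >= best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:  # nothing overlaps enough: a new instance appeared
                best_id = self.next_id
                self.next_id += 1
            else:
                unmatched.discard(best_id)
            assigned[best_id] = box
        self.tracks.update(assigned)
        return assigned


tracker = ToyTracker()
frame1 = tracker.update([(0, 0, 10, 10), (20, 20, 30, 30)])
frame2 = tracker.update([(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)])
print(frame2)
```

In the second frame the two shifted boxes keep their original ids even though the detector listed them in a different order, and the third box gets a fresh id, mirroring the "track existing instances, pick up new ones" behavior from the demo.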
There was something else that you started to allude to in the paper that I was hoping we would spend some time discussing, and it's the interaction of SAM 3 and LLMs, Llama and others.
So using SAM 3 to almost be like a tool call for LLMs to give them better grounding and give them better visual understanding.
And there's a table in the paper where you describe the increase in performance.
It's kind of alluding, I think, to maybe where things are going for using SAM 3 as a component part of multimodal architectures.
Do you want to describe a bit about what the introduction of that work was meant to showcase and how the interaction of SAM 3 and LLMs is envisioned to be important?
Yeah, maybe I can just do a quick intro and I'll hand over to Pengchuan to do the deep dive.
But essentially, as I mentioned, in SAM 3 we constrain the text input to these atomic visual concepts, like "yellow school bus" or "yellow watering can".
But obviously, people want to interact with the model with natural language.
And we want to enable that as well.
And so that really segues into being able to use SAM 3 as this visual agent for an MLLM.
And so I'll hand over to Pengchuan.
Maybe you can explain the SAM 3 agent setup and then talk through some of the results that we got there.
Yeah, yeah.
So as Nikhila mentioned, the big picture is that SAM 3 is focused on this kind of atomic concept.
But people definitely want to try much more complex phrases; let me show an example.
Take, for example, this lion example: what is the feature that distinguishes male and female in this picture?
That is much more complex language.
This is exactly something SAM 3 cannot do, but the SAM 3 agent is going to solve.
In this case, you can see that it needs much more advanced language understanding and reasoning.
SAM 3 currently does not have this kind of capability because it has a small language encoder.
But we know that large language models have definitely seen a lot of data and have this kind of world knowledge and reasoning capability.
The SAM 3 agent is exactly using SAM 3 as the eye for the large language model to solve this kind of complex visual grounding task.
Are there any insights or surprises that you have other than, I guess, SAM 3 is a very good tool?
Is that the main conclusion?
Go to Table 8 in the paper as you described this, if you don't mind.
Table 8, okay.
Yeah.
Yeah.
Here we go.
Yeah, please.
Maybe I can quickly reply to that quick question.
I would say that first, SAM 3 is really a good tool that provides the eye for the large language model.
The other thing we definitely found is that SAM 3 is not perfect.
It's just not as robust as the human eye.
Then the large language model also helps to correct SAM 3's errors.
They have a synergy with each other, instead of just: the large language model provides the brain and SAM 3 provides the eye.
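A minimal sketch of that two-way loop, assuming a language model that rewrites a complex query into an atomic phrase, a segmenter that takes the phrase, and a review step where the language model can reject the result. Everything here is hypothetical: `run_agent` and the three stub functions are stand-ins for illustration, not the SAM 3 agent's actual interface.

```python
def run_agent(query, image, propose_phrase, segment, review, max_rounds=3):
    """LLM proposes an atomic phrase, the segmenter returns masks, the LLM reviews them."""
    feedback = None
    for _ in range(max_rounds):
        phrase = propose_phrase(query, feedback)
        masks = segment(image, phrase)
        accepted, feedback = review(query, phrase, masks)
        if accepted:
            return phrase, masks
    return None, []  # give up after max_rounds


# Stubs standing in for the real models (all hypothetical).
def fake_llm_propose(query, feedback):
    # With feedback from a failed round, the "LLM" refines its phrase.
    return "yellow school bus" if feedback else "bus"


def fake_segmenter(image, phrase):
    fake_index = {
        "bus": ["city_bus", "school_bus"],
        "yellow school bus": ["school_bus"],
    }
    return fake_index.get(phrase, [])


def fake_llm_review(query, phrase, masks):
    if masks == ["school_bus"]:
        return True, None
    return False, "too many buses; be more specific"


phrase, masks = run_agent(
    "the bus that takes children to school",
    image=None,
    propose_phrase=fake_llm_propose,
    segment=fake_segmenter,
    review=fake_llm_review,
)
print(phrase, masks)
```

The review step is where the "brain" corrects the "eye": the first round returns too many masks, and the second round narrows the phrase.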
Interestingly, you use Llama 4.
I saw there's a mix of Llama 3 and Llama 4 here, but it looks like it does best with Gemini 2.5, which makes sense given this comparable set of MLLMs.
I think the baseline also is just: what extra does this add on top of just the MLLM?
I would maybe want to see that, but maybe you've already done it somewhere.
What do you mean by the additional thing?
Basically, without a tool call, there's some native capability inside the MLLM itself to draw.
Wow.
In fact, that's a really good question.
In fact, I was going to review or even ask a question.
You can imagine that without large language models, without a VLM, SAM 3 alone on ReasonSeg only achieves, on the validation set, if I remember correctly, a score of about 30.
Also, it's very intuitive.
You can see that ReasonSeg has short and long queries.
It has these different subsets, short and long.
Short is very close to SAM 3's training data.
It's atomic phrases, short phrases.
Long involves very complex reasoning.
You will see that for short, SAM 3 alone is very close to the SAM 3 agent.
But for long, the gap is so large, which indicates that this is exactly the capability that's not in the SAM 3 model.
I can show an example here that might be insightful too.
Go for it.
So even comparing SAM 3 and Gemini, let's say that we just want to have them do an object detection task; here we're going to prompt with a speedometer and RPMs.
And we're going to ask for things like indicator light, number, and needle.
And we run SAM 3 head to head with Gemini 3 and Florence 2, almost as a baseline of where things have been, and we see each of the results.
First things first, you'll note that the speed of inference of SAM 3 is quite quick.
This is just calling the Gemini 3 Pro API.
So whatever is provided from hosted compute is what you get on the response time.
And then the second thing you'll note, in addition to speed, is the accuracy of the results we get.
We might have a timeout error.
Let's see.
Do you have Elo scores?
What scores?
Elo scores, like.
Yeah, we had it.
You had the arena.
Okay.
I was wondering what the Elo was because you said you were blind testing this.
Yeah, that's actually interesting because we had blind tested SAM 3 before it was released, not as SAM 3, just for people to try and compare.
I think we call it like a potential sag or sag preview or something.
And we allowed users to vote and they kind of unanimously voted for what they didn't know at the time was SAM 3.
We actually got like emails of people being like, hey, like where can I use that?
And we just sort of ignored them until the model came out.
So here with the responses, you see that the grounding capabilities of SAM 3 compared to even Gemini are out ahead currently.
So not only is it doing grounding, but if you look closely, you can actually see it's making segmentation masks too.
Whereas Gemini 3 struggles to do it just as detection by comparison.
And then the other thing is just the richness of detections, like the recall is high as well as the precision.
And if we compare here, it does almost as well, right?
But you see that it misses some of the numbers and has some of these erroneous boxes that it's predicted.
And then it also doesn't do segmentation.
So it just does the detection part of the task.
So you can envision that, the same way the SAM 3 paper introduces the idea of using SAM 3 in tandem with MLLMs.
I would expect that to be the case pretty soon, and maybe the Google team is taking some notes to improve Gemini and other series of models based on what SAM 3 demonstrates here.
So in other words, not only is it faster, but it seems to be more comprehensive for concept segmentation.
And I think the speed actually is a huge factor for many use cases.
I think even at Meta we're using SAM 3 for various product use cases, and fast inference speed is very critical to enable that.
And so I think that's something that in many cases you don't even need an MLLM for.
It's just overkill to use an MLLM for some applications.
The other interesting thing is the Florence 2 results.
And Florence 2 is a little bit older of a model now.
So maybe it's not fair to put up head to head with state of the art, but it is useful as a way to just see how far we've come.
Because Florence 2 by comparison labels the entire region as a single class without individual detections of the numbers, indicator lights, and needle.
And not only that, but it actually takes about three times as long as SAM 3.
So SAM 3 again is faster, doing a task that the other models are not doing in segmentation, and more accurate, both in recall and precision of the things that it's intended to find.
Which I think really showcases the capabilities of the model.
In fact, I even got a little surprised about this because this domain is more like OCR, because recognizing numbers is nearly OCR.
We did not prioritize this domain in data collection.
It works.
So we knew that it roughly works, but I was surprised that it works so well.
That's encouraging.
Even on a task that wasn't expressly prioritized, it still does a great job.
Yeah, in fact, in our data engine, we intentionally do not sample OCR-heavy images.
Wow.
On an easier one, glass mug: SAM 3, Gemini 3, Florence 2. SAM 3 loaded first and, really impressively, it sees even this glass mug in the corner; occlusion and partial objects are something I think SAM 3 does a great job on.
Gemini 3 struggles a bit with this one, I think maybe because of the opacity of the objects by comparison.
And then Florence 2 does a good job at finding one of the glass mugs.
So again, another type of task that shows the power and veracity of the model.
Yeah, I mean, exhaustivity, like finding every instance is something we heavily prioritized and is really built into the data engine design.
Maybe Pengchuan, you want to talk about how we designed the data engine to really scale exhaustivity?
Because if a human were to annotate every single instance, it would take a really long time to annotate and verify, but we put a lot of effort into trying to automate and speed up that process, such that we could get to the data scale and diversity needed to get to a step change.
Yeah, yeah, I think definitely, I would say the data engine is the critical component that lets us achieve this performance now.
So maybe we can go to the data engine picture.
I think we have an illustration there.
Yeah, page drives.
Yeah, yeah, yeah.
You can see that this is our annotation pipeline.
So we first source the images and then generate the noun phrases.
So this is the input of this task: source images and generate noun phrases; for example, use Llama to generate a caption, and we parse the caption to get the noun phrases.
This is the input distribution.
Then we use the SAM 3 model in the loop to generate candidate masks; they are candidates, but they're not perfect, especially in the beginning.
Then the next step is verification.
So SAM 3 gives you these masks, then we first need to do mask verification to verify whether each mask is good or not.
And then after we filter out all the bad masks, there are some good masks left, and we verify whether these good masks are exhaustive or not, like your glass mug example.
So for example, if the model does not predict that partial mask, then the exhaustivity check will fail there.
And then if the exhaustivity check fails, we go to the next step.
You can see in the pipeline that it goes to this so-called human manual correction.
A human can manually annotate all the missing masks.
That makes this data point exhaustive.
So you can see that exhaustivity is a very big factor there.
And we place it at the center of this data engine.
But you can see that if we ask human annotators to annotate every mask from scratch, it will take a lot of time.
I remember each data point in the beginning would take more than about two minutes to finish.
But if you use a model in the loop, then it's reduced to about 45 seconds.
You can use the model to propose masks and then just ask a human to annotate the missing masks.
Then it's 45 seconds.
Another very key innovation in this data engine is that we found that these verification steps, verifying whether a mask is good or not, or verifying whether the good masks are exhaustive or not, can be done by AI, by a multimodal large language model.
That is a big deal.
And then we can fine-tune, for example, Llama 3.2 with our human-annotated verification data, and we get superhuman performance on these two verification tasks.
And then we do not need humans for these two tasks.
That further brings our per-data-point annotation time down to about 25 seconds.
So you can see that it goes from the original all-human process at about two minutes to finally about 25 seconds for one data point.
This is the journey of our data engine to make it super efficient.
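The pipeline just described (mask verification, then an exhaustivity check, then human correction only when that check fails) can be sketched roughly as follows. This is a toy under stated assumptions: `annotate_datapoint` and its callback parameters are hypothetical names, the verifiers are plain lambdas rather than fine-tuned models, and the timing table uses the approximate figures quoted in the conversation (about two minutes, 45 seconds, 25 seconds).

```python
def annotate_datapoint(candidate_masks, mask_is_good, is_exhaustive, human_add_missing):
    """Mask verification -> exhaustivity check -> human correction only on failure.

    Returns (final_masks, needed_human).
    """
    kept = [m for m in candidate_masks if mask_is_good(m)]
    if is_exhaustive(kept):
        return kept, False
    return kept + human_add_missing(kept), True


# Approximate per-data-point times quoted in the conversation, in seconds.
SECONDS_PER_DATAPOINT = {
    "all_human": 120,
    "model_in_loop": 45,
    "ai_verifiers": 25,
}


def speedup(stage):
    """Speedup of a stage relative to fully manual annotation."""
    return SECONDS_PER_DATAPOINT["all_human"] / SECONDS_PER_DATAPOINT[stage]


# Toy run: three mugs exist; the model proposes two of them plus one bad mask.
ground_truth = {"mug_a", "mug_b", "mug_corner"}
final, needed_human = annotate_datapoint(
    candidate_masks=["mug_a", "mug_b", "blob"],
    mask_is_good=lambda m: m in ground_truth,
    is_exhaustive=lambda kept: set(kept) == ground_truth,
    human_add_missing=lambda kept: sorted(ground_truth - set(kept)),
)
print(final, needed_human)
print(round(speedup("model_in_loop"), 2), round(speedup("ai_verifiers"), 2))
```

In this toy run the bad mask is filtered out, the exhaustivity check fails because the corner mug is missing, and only then is the human step invoked. With the quoted numbers, the final stage is a bit under five times faster than fully manual annotation.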
Did you maintain statistics on how many images were specifically hard?
For example, we had n objects that were very difficult because of occlusion, or we had some number of images where the comprehensive test was really hard?
Or did you just bet that by having a large scale, you would encompass occlusion and exhaustive cases?
In fact, we do maintain this kind of information on exhaustivity, which one is hard and which one is easy, because first, in our data engine, when humans annotate, we know exactly which data points were made exhaustive by the model and which ones needed a human to intervene.
In fact, we have that kind of metadata in our datasets.
The second, more beautiful part is that we have this exhaustivity AI annotator, so given a new data point, we can automatically decide whether it is a difficult data point or an easy data point using this AI annotator.
Yeah, I think the sort of bootstrapping annotation story was very strong last time around.
And it's, you know, it's even stronger this time.
What are you gonna do when you run out of humans?
Like, you know, next year, you're gonna have superhuman level of everything, right, like PCS and PVS?
What then?
I'm not so optimistic about this.
First, indeed, our current plan for the next project is this kind of fully automated data engine without humans.
That's our dream.
I would say that is the perfect thing, but we still need some kind of useful information.
There's no free lunch.
There's something the model just doesn't get.
No model can do well on it.
And we need humans to inject that useful information.
I would say that what we can actually do is run with minimal human intervention.
Humans only do the tasks that the model cannot do, the most difficult tasks.
So that's the first one, in terms of the data engine.
The second one is about human performance on this PCS task.
My feeling is that computer vision is going to enter this.
When we get to human performance, we will enter this RLHF domain of computer vision.
So you can see that for language models before, in the BERT age, when language models were not at human performance, SFT, really imitation learning, could do the job and get to very good performance.
But if you only do SFT and the SFT data is annotated by humans, then your performance is bounded by humans.
You cannot get superhuman performance just by this data engine approach of using human-annotated data and then learning from that.
You need to go to this RLHF domain, where humans just tell which of two data points is better.
This is exactly the philosophy: telling which one is better is easier than constructing the data point from scratch.
So you can get better performance than humans annotating from scratch.
I hope that after SAM 3 we can see new research emerge in computer vision on how we go beyond human performance.
SAM 3 is close to that, but I would say that a new learning paradigm is needed to go beyond human performance, for SAM 3 and for computer vision.
Yeah, just to add to that, Pengchuan is only talking about images.
I think video is a whole other challenging beast, and getting to that really automated data engine is something that we tried to do in SAM 2.
We actually didn't get to that fully automated approach.
In SAM 1 we did; the SA-1B dataset that we released was fully annotated automatically.
We didn't really get to that in SAM 2 for video, and in SAM 3 for video I think there's still a lot of room to push on this sort of pseudo-labeling for video and really be able to get to that same step change as we had on images.
What are the biggest changes needed to see the same step change in video that you've seen in images for the automated data pipeline?
Yeah, I would say learning a good video-language model, a video multimodal model.
So when we did SAM 3, earlier this year or last year, image multimodal models were very good, but video multimodal models, I think, only became good or practical later this year; models like Qwen got roughly okay at that stage, so we have a good base model to fine-tune on our data and to get human performance for this recognition or verification task.
I would say that we definitely need a SAM 3-like effort on the perception side, but we also need this kind of multimodal language model effort, a good foundation model on the vision-language side.
I think it's ready now.
Also, video annotation is just so much more time intensive; to get to the point where you can annotate enough data to train a verifier, video mask annotation we just found was very time intensive, so maybe there are more efficient video annotation strategies, and I think there's a lot of exploration that could be done there too.
Yeah, spending a bit of time on video, obviously last time we were focused a lot on memory attention.
I think this time there was this masklet thing that I wanted to get more ideas on.
What was it called, the masklet detection score, exactly?
It's basically smoothing within a temporal window, which a lot of computer vision models don't have, and they could simply add it and it would be a lot more stable when it comes to video, and I don't know why they don't do it.
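As a rough illustration of the temporal smoothing raised in the question, the sketch below averages per-frame detection scores for one tracked object. It is not SAM 3's masklet detection score: `trailing_mean` and `masklet_mean` are hypothetical helpers, the scores are made up, and a plain mean is swapped in for whatever aggregation the model actually uses. The two functions contrast a streaming-friendly window with a whole-masklet score that can only be computed once every frame has been seen.

```python
def trailing_mean(scores, window=3):
    """Streaming-friendly: each frame uses only the current and previous frames."""
    smoothed = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def masklet_mean(scores):
    """Offline: one score for the whole masklet; needs every frame first."""
    return sum(scores) / len(scores)


# Made-up per-frame scores for one object; frame 1 is a momentary dip.
raw = [0.9, 0.2, 0.9, 0.8]
threshold = 0.5
raw_keep = [s > threshold for s in raw]
smooth_keep = [s > threshold for s in trailing_mean(raw)]
print(raw_keep)     # [True, False, True, True]
print(smooth_keep)  # [True, True, True, True]
print(masklet_mean(raw) > threshold)
```

Thresholding the raw scores makes the object flicker out for one frame, while the smoothed scores keep it throughout. The whole-masklet score is the most stable signal but is only available after the fact.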
Yeah, maybe I can comment on this first, on why they didn't do that.
I think one big reason is the streaming requirement.
When you want to gather information across the entire masklet, you need to wait until the end of the masklet and then apply the strategy, so that will sacrifice some streaming capability.
So the streaming requirement somehow limits traditional methods from doing this.
But I would say that this is definitely beneficial.
The reason why is that I think even humans do this.
You can imagine that when something just appears at the corner of the video, like a hand appears at the corner of the frame, you just do not know whether it's a man or a woman, so humans might even make mistakes.
SAM 3 will also make these mistakes, but when you get more and more information, when the person really enters the video fully, then you get to know whether it's a man or a woman.
So gathering more information to really know whether this is the concept you're currently looking for is the idea here.
So there is a trade-off between latency and accuracy.
If you care more about accuracy, then you can use this overall information across the masklet to get a more robust signal about the concept.
But if you care about latency, then you need to make a decision at the very beginning, and then you will sacrifice some accuracy.
I think also in many video use cases, like the ones you're showing on Roboflow, users care more about detecting the objects than about having unique identities.
So in some cases it isn't required to preserve the identities throughout the video; you just want to essentially do detection per frame, like for the Roboflow Rapid examples you're sharing.
Yeah, there are cases where being able to count and
you know the objects are all going to be the same, so you don't care as much about unique classes, you just want to know the full presence; things like that matter.
But then there are other cases, like you mentioned, where, I don't know, in sport you care about individual players versus just knowing that there are 11 players on the pitch.
One thing that might be useful to discuss with some of our time: we talked a little bit about how SAM 3 and MLLMs will play nicely together, but there's probably a greater discussion about how SAM 3 fits into the broader AI ecosystem and what bigger-picture trends it might fit into.
Do you have some thoughts on what this represents about where things are headed?
Maybe I could say one point and then Pengchuan, feel free to add one.
As we mentioned before, SAM 3 isn't just a version bump; we really have a unified model that can do many different tasks in the same architecture.
And so, the same way that LLMs can do many different tasks without needing a task-specific model, with SAM 3 we're able to do image promptable concept segmentation and video promptable concept segmentation, we don't need a specialist model for counting, and we can do interactivity.
There really are multi-capability visual models that are on par with or better than the single-task state-of-the-art models.
So that's really one place in which SAM 3 fits into the AI ecosystem.
In terms of MLLMs, I don't know if, Pengchuan, you want to talk about the agent approach.
Yeah, definitely.
I would say that SAM 3 really achieves a big step change in vision, and how it fits into the general AGI or frontier model landscape is very exciting for me.
We always have this example: give a picture of a six-fingered hand and ask how many fingers there are in this picture, and then any frontier model will say five.
And you can imagine
that with SAM 3 we can first detect how many fingers there are, very robustly, six fingers, and then the multimodal model should know that this is a six-finger hand instead of a five-finger one.
You can see that the errors made by frontier models can be solved if we use SAM 3 as a tool.
But then, is SAM 3 as a tool the end of the picture, or should SAM 3 somehow be more natively embedded into these frontier models, so that the frontier models have this capability by themselves?
I would say that is another possibility.
My picture is that now we have a very good brain with these frontier models, and we have a very good eye with SAM 3.
Now let's see whether the eye works natively together with the brain, or whether the eye is really a different organ that needs to work with the brain like a tool.
I think this is a very exciting research area.
And so in your analogy, if you think about the visual cortex compared to the human brain, we have rods and cones in our eyes that do very fast, we joke, lizard-brain-level detection, simple stuff, and then you have your brain that reasons about some of the visual information that your eyes see.
In your example of SAM 3 as a tool call or SAM 3 as natively a part of the multimodal models, which future do you think is more likely?
I think at least I want to bet on them working natively together in the future, I would say for simple or even intermediate-difficulty vision tasks, for example counting with fewer than 20 objects.
I think this kind of simple task is like System 1 visual reasoning; our brain should do it by itself.
But with very, very difficult tasks, you can see that if we are counting, you
know, maybe thousands of objects in a picture that is so crowded, then we even need to draw something there.
I would say that at that point maybe we need some extra model for the difficult tasks.
You can see that this is a hybrid approach, but I'm more excited that for most cases it should be native.
The reason is that I see perception or grounding, really knowing where it is and how many there are, as a fundamental capability of our brain.
I'm just not happy that the frontier model cannot count how many fingers immediately and instead needs to call a tool to do that.
I think this should be a System 1 thing, and this should be native in our brain.
And also, if our brain cannot do this task, it means that it's definitely missing some very critical visual capability.
So my intuition is that it just feels incorrect for the model not to have this capability by itself.
So for very simple System 1 questions, things like how many fingers are on a hand, that should be native, but for more complex things that are maybe long-running tasks and long-running reasoning, then maybe there's a bit more of a tool call approach.
Yeah, exactly.
For example, you can see that in our SAM 3 agent or in our AI annotator we already demonstrate this approach.
For simple cases the model can do it by itself: okay, I can detect, for example, 10 people here, and then the AI annotator can even know that these 10 people are not exhaustive, there are more people there.
So if you want to do well, then maybe you need to do one more step, for example, to call an extra model.
So you can see that this is a very native reasoning process for more
advanced or complicated vision questions.
I have a related but maybe slightly different question.
SAM 3 is an incredibly powerful piece of work, and it's open source as a part of, now, MSL.
Is open source critical to achieving AGI?
Maybe I can comment on SAM specifically.
In SAM 3 we did leverage many of the open source contributions people have made on top of SAM 2.
There were new datasets, there were new benchmarks, there were new inference-time optimizations.
We adopted a lot of the things that the community built on top of the models and on top of the datasets, and so all those contributions helped make SAM 3.
For the SAM series we really benefited a lot from being very generous with what we open source and then leveraging what the community builds on top of that.
But that's just from the SAM perspective.
I think it's clear what the community brings and offers, and every time we do this we always shout out to the community to try it on their use cases and report weird findings, and if it doesn't do what you are trying to make it do, well, let's talk about it, right?
And then maybe implement it in the next version.
You already said what might be coming for SAM 4, which is at least a little bit more of the document and OCR work.
Any other directions that are interesting?
I guess obviously a lot more video work as well.
What is the talk of the town in the CV community, that it'd be really great, or super obvious, that next year is going to be the year of what?
Yeah, maybe I can first say something and then Nikhila can add.
First, I think it's not even SAM 4, it's SAM 3.something, like small models.
SAM 3 currently only has one model, one model size.
We need more efficient models that fit edge use cases, and also a more efficient model for video.
I think currently the video model is not efficient; you can achieve very good throughput, but you need GPUs to do that.
So first, small and efficient models, that's one big thing.
The second big thing is definitely video.
Roboflow can do that for you.
Yeah, the second thing is video.
I would say that video still has a big gap from human performance right now, and there's still a lot of research that needs to be done there.
How do we do end-to-end training with video?
We have this decoupled approach, but we do not train this model end to end, and we definitely expect it to benefit from end-to-end training.
And also on the video side, how do we scale up the data engine?
We definitely need AI annotators for video; we tried that, but I think that's something definitely worthwhile to do.
The third one, which we also discussed, is how all of this will help perception fit into AGI, this big landscape.
Now we have the eye; how does the eye work with the brain to solve real reasoning tasks, not only outputting segmentation but really answering how many there are?
Or, for example, in biology labs, the robots need to decide whether the liquid in the test tube is at the correct level or not.
You can see that this involves perception but also involves reasoning.
How to solve these more beneficial reasoning tasks with SAM is a very big direction.
On the robotics topic, it was exciting to hear from several friends that work at different robotics companies how they're immediately starting to use SAM 3.
And I think especially for the video use case, robotics is probably one of the domains where improving video performance will have a lot of impact.
So I think that's definitely an area that we could improve on further, but to Pengchuan's point, I think there's still another step change to be achieved on video PCS.
Yeah, just a quick comment on the robotics things: I know we're interviewing a bunch of robotics folks here as
well as Fei-Fei Li, who obviously started ImageNet.
A lot of people are betting on explicit world models, and SAM is not one, for better or worse, and I wonder when that crossover might happen.
That's an open question, if you guys want to take any world models discussion.
Re where things are going, based on community questions: similar to how Nikhila mentioned that after SAM 1 the almost obvious thing that people wanted was open concept prompting, because people were like, great, this model can see things, but I want to tell it what I want it to see.
And now with the introduction of SAM 3 you have this step change, which feels like the ChatGPT era for vision is arriving.
As a result, what's going to happen is that now you've provided people with an open text box and media, and so you're going to get all sorts of queries from people that maybe the model isn't primed to perform particularly well on yet.
For example, earlier we were talking about document understanding and document reasoning being a place where there are known improvements to be made, and so you'll have people that will probably prompt to try to OCR things.
Or you'll have people that will want to work with spatial reasoning, like give me the object to the left of this other object, or give me a sense of where things are in relation to one another, which is critical for robotics, like we're discussing, because that's how you navigate the real world.
You'll also have, I think, people who want action recognition and vision-language-action models, VLAs, the same kinds of tasks where people are used to providing open text prompts and getting: here's the part of the scene where the player kicked the ball, or the tennis player made the serve.
Those are interesting for the purposes of understanding and synthesizing visual inputs.
And so now that you've given this open text box for media, there's going to be a flood of the types of things
users are going to want to try to do, some of which SAM is already going to be really well adapted to do, some of which not, and I think it's going to reveal itself which types of things are obvious.
One of the things that we wanted to discuss was where to use SAM and how to build with SAM.
So in addition to the Meta team building a tremendous playground for interacting with images and video and applying effects, with a video emphasis, I think one of the things that we're pretty excited about with SAM 3 is how much it positively impacts each part of building a system for visual understanding.
So for example, the very first step, historically, is aggregating and collecting a dataset, because you think there's no model that understands the slice of the world that you want to understand, and that is where automating away lots of labeling can come in.
Basically, if you collected a bunch of data of something that is already in SAM 3's knowledge, then you can prompt SAM 3 to automatically label all that data for you.
And so we've actually made a bet on SAM 3 being a core part of Auto Label at Roboflow, giving users a first pass: if you have a new image or a new video, start by providing just a text prompt and allow SAM 3 to find and automatically label those regions of interest for you.
Downstream, I think there are areas for fine-tuning; within a week of releasing SAM 3, MedSAM3 came out for adapting SAM to medical contexts, and I think that's a harbinger of what's to come.
There will be lots of domain-specific adaptations of SAM in places where maybe there's a specific ontology that someone wants to understand, or maybe a place where the model just doesn't have great awareness yet.
And I think we're already beginning to see that with hundreds of fine-tunes that users are creating for various domains.
And then the last area is like, okay, I've
got my model, now I want to use it. One of the things that we're really proud of is being ready on launch day to showcase the infrastructure we built to burst and scale, like, infinitely large as folks have models that they want to deploy and make readily available. Having an endpoint that serves either a fine-tuned model, or a model as is, or even a model that might be able to run on edge hardware as smaller models come out or distillation comes to rise, is I think also an awesome place where we're seeing SAM 3 being impactful at each part of the computer vision lifecycle and pipeline.

That's awesome. I think especially the impact on speeding up annotation, we've seen that consistently on Roboflow, and I'm really curious to see how the introduction of SAM 3 helps speed up that process even further. Just from playing around with it, it's so much faster than having to manually annotate every single object. So we're really curious to see how that improves the experience.

One of the things that we were pretty excited about is we were able to build an entirely new product in the world of SAM 3, and we called it Rapid. Basically, there's probably a model that already understands the objects in the world that you want to see. Here I'm screen sharing an example: these are vehicles that go by next to our office in San Francisco, and you can see here's a Waymo and here are other vehicles. If I just have this 10-second clip, and let's say the first thing I want to do is count cars and get a sense of each of the vehicles, what's really awesome is I can just text prompt and say I want "vehicle", and as I toggle through different frames in my video, SAM 3 already recognizes and understands those objects. Now, one thing that I think is really interesting: there was a conversation earlier about how much you
want to rely on a model versus a human's judgment of the model's output for what you care about. For example, let's pretend in this scene maybe the only cars that we care about are the ones before the crosswalk and not the ones far in the distance. Then you'd get people who would say, hey, I actually want the objects that are most confident, and I would move my slider down to getting a fewer number of objects. Whereas others might say, hey, I want every single presence of a potential object in the scene, which even gets reflections of objects on the building. As computer vision approaches this world where we increasingly have models that can understand and improve themselves, and we rely on what human preference for the models' output is, we're going to get these funny scenarios where things aren't always immediately deterministic about what a human cares about. I think that's where tooling fills a big gap, but it's also going to be a place where it'll be really interesting to see how users start to use and apply the models, and why you need this last-mile work to put the model in context in the domain that someone is trying to solve and tackle.
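
The slider trade-off described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Roboflow's or Meta's actual API: the `Detection` record, the scores, the box coordinates, and the crosswalk position `y_line` are all hypothetical, and image coordinates are assumed to grow downward so that objects closer to the camera sit lower in the frame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one object returned by a text-prompted detector."""
    label: str
    score: float                             # model confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def filter_by_confidence(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep detections at or above the threshold (the 'slider' in the demo)."""
    return [d for d in detections if d.score >= threshold]


def nearer_than(detections: List[Detection], y_line: float) -> List[Detection]:
    """Keep detections whose bottom edge is at or below y_line.

    Assumes y grows downward, so these are objects on the camera side of the line.
    """
    return [d for d in detections if d.box[3] >= y_line]


# Made-up detections for the prompt "vehicle" on a single frame.
detections = [
    Detection("vehicle", 0.95, (100.0, 400.0, 300.0, 560.0)),  # car close to camera
    Detection("vehicle", 0.80, (350.0, 380.0, 520.0, 500.0)),  # second nearby car
    Detection("vehicle", 0.45, (600.0, 200.0, 660.0, 240.0)),  # car far in the distance
    Detection("vehicle", 0.20, (50.0, 50.0, 120.0, 100.0)),    # reflection on a building
]

strict = filter_by_confidence(detections, 0.7)       # "only the most confident objects"
permissive = filter_by_confidence(detections, 0.1)   # "every potential object"
near_camera = nearer_than(permissive, 450.0)         # hypothetical crosswalk line at y=450

print(len(strict), len(permissive), len(near_camera))
```

Neither setting is correct on its own: the strict threshold drops the distant car, and the permissive one keeps the reflection. A geometric rule such as `nearer_than` is one way a user can encode intent that a single confidence slider cannot express.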
So let me, since you're here, this is one of those things where I'm not sure the concept of labeling concepts can scale, only because I don't know if this slider between less and more is the way, if ultimately I need to tell you whether or not to include reflections. Sometimes reflections are great, that's exactly what I want; most of the time that's not going to be what I want. I don't know if some RLHF thing is going to solve any of that, because you just need more prompting. Just saying "vehicle" is not going to do it. I don't know, feel free to disagree.

You can imagine such a pipeline coming. For example, as swyx said, maybe the reflection is exactly what I want. Then you need some iterations with the interface or the model to finally get what you need, so you need to specify the concept more clearly through multiple iterations. Can a human not be involved in this iteration, with models just doing it automatically? I think that's something quite interesting. You can imagine this workflow: I want reflections, and with the default threshold the model gives an output. Then we ask another very strong perception model, like Gemini, whether there are some reflections there, and it says yes. Then we can automatically move the threshold lower and ask Gemini again to see whether the reflections are now included or not. So this process could possibly be done completely with AI.

Yeah, exactly. So for now the answer is m engine, and we can sort of tie it tighter,
closer. I think Joseph is showing us this little Waymo annotation. It's nice, now you have a Waymo model.

Yeah, I was just doing an example where maybe we want to find an object that's not already represented in the training data. I think prompting could solve the problem of reflections, because maybe you could say "vehicles on the street". But to your point, you would have to see that that's a failure case, right? If I was just setting up a camera and saying "count cars", I wouldn't anticipate that reflections could be a problem. So I think this is why, in some ways, human in the loop matters, because identifying human intention, not necessarily human knowledge, is what's going to be important for a lot of last-mile use. But yeah, I'm pretty excited about it.

Maybe I want to echo what Joseph said. It's also my experience that different people have quite different definitions of even a visual concept. For example, in some datasets, even for "hand", some people would annotate just the palm as the hand, and some people would include the arm as "hand" as well. When we first tested SAM 3 on some very customized datasets, we found the performance was not that good. When we finally looked into the performance, we found that the user just had a different definition or interpretation of the concept, but both interpretations are okay. In this case you really need a human in the loop to do few-shot fine-tuning or to adapt to the user's definition of the concept.

That's exactly right. It's not always deterministic what someone really wants, which is why I think even if you have a fully comprehensive, omniscient model, putting the model into the context of what the user's trying to do is where a lot of tooling and
infrastructure becomes really, really helpful.

Anyway, I found our Waymos. You continue to build excellent tooling for vision, and I think the world is very grateful for that. Let's get to the call-to-action section. I think we've given a good overview, and people obviously should read the paper, try out the playground, and try Roboflow if they're interested in diving deeper. Is there a call to action from each of you?

I mean, try the demo, try the code. We've got a lot of resources in the GitHub repo.

It's a very well-managed launch, by the way. Kudos. This probably takes a lot of effort just on the launch itself, even after the model's done.

Yeah, and actually just on that, maybe one shout-out to the whole team. I think SAM 3 was our biggest and most ambitious project to date, and it really took a huge team of scientists, engineers, interns, and software engineers across the company. So really huge shout-out to the entire team that made not just the model successful but also the demo and the launch and everything. It was a huge team effort. We'd definitely love to hear from people on what you're using the models for and where it's failing. Raise GitHub issues, message us on Twitter. We'd love to hear from you on where we should go next as well.

Yeah, and on top of that, definitely also try out our benchmark, the SA-Co benchmark. I would say it's likely that the benchmark will last longer than our SAM 3 model. Maybe next year there will be a stronger model, but the benchmark is the one that I hope will guide the community to get better and better models. We measure human performance on the benchmark. I think maybe we are the first to do that for this kind of segmentation and video grounding. In the past it's been very difficult to measure human
performance on this task. Hopefully this benchmark guides the community to achieve human performance on this task, and even surpass human performance.

We set out to be one of the best places, if not the best place, to build with SAM 3 and the SAM family of models, so we're going to see what people build with SAM and computer vision models to move the whole field forward. We have infrastructure for everything from deploying SAM 3 zero-shot, to making your own fine-tunes, to automating labeling of data with SAM, and we continue to see each subsequent release expand the number of use cases and the amount of use, and accelerate the time to value. So excited to see what folks can build on Roboflow with SAM.

Thank you all so much. This is really great work, and it always expands my mind as to what is possible with machine learning.

Yeah, I mean, we're not at ASI or AGI yet, but every day we're getting closer.

Awesome. Thank you so much. Thank you. Thank you.
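
As a rough illustration of the automated loop floated earlier in the conversation (run the detector at a default threshold, ask a second perception model whether the wanted concept is included, and lower the threshold until it is), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `(description, confidence)` tuples, the `tune_threshold` helper, and the `has_reflection` verifier, which is a plain stub standing in for a call to a strong vision-language model and is not a real API.

```python
from typing import Callable, List, Tuple

# Hypothetical detection: (description, confidence). A real pipeline would carry masks.
Detection = Tuple[str, float]


def tune_threshold(
    detections: List[Detection],
    concept_included: Callable[[List[Detection]], bool],
    start: float = 0.5,
    step: float = 0.1,
    floor: float = 0.0,
) -> Tuple[float, List[Detection]]:
    """Lower the confidence threshold until the verifier accepts the output.

    `concept_included` stands in for a second model that inspects the kept
    detections and answers yes or no. Stops at `floor` if it never says yes.
    """
    threshold = start
    while True:
        kept = [d for d in detections if d[1] >= threshold]
        if concept_included(kept) or threshold <= floor:
            return threshold, kept
        # Round to avoid floating-point drift such as 0.30000000000000004.
        threshold = max(floor, round(threshold - step, 10))


def has_reflection(kept: List[Detection]) -> bool:
    """Stub verifier: a real system would query a perception model here."""
    return any("reflection" in description for description, _ in kept)


# Made-up detector output for the prompt "vehicle".
detections = [
    ("car on street", 0.90),
    ("car on street", 0.70),
    ("reflection of car in window", 0.25),
]

threshold, kept = tune_threshold(detections, has_reflection)
print(threshold, len(kept))
```

The loop only automates the search over the slider. Someone still has to state that reflections are wanted in the first place, which is the human-intention point raised in the discussion.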