NVIDIA LocateAnything-3B

LocateAnything-3B is a 3B-parameter vision-language grounding model released by NVIDIA on Hugging Face. It targets the task of looking at an image and locating targets from a natural-language description, and it can output bounding boxes or points. Its core value lies in unifying object detection, referring expression grounding, GUI element grounding, OCR/layout grounding, pointing, and similar tasks in a single VLM framework, and in improving localization quality and decoding efficiency through Parallel Box Decoding (PBD). In pre-sales, it fits solution discussions around visual agents, GUI automation, industrial vision, document understanding, and robot/autonomous-driving perception. However, the model license is limited to non-commercial research and evaluation, so the model cannot be used directly for commercial delivery.

1. Model Overview

Item	Information
Models Page	nvidia/LocateAnything-3B
Official Project Page	LocateAnything
Online Demo	Hugging Face Space
Code Entry	NVlabs/Eagle/Embodied
Paper	arXiv:2605.27365
Model Type	Transformer-based Vision-Language Model
Parameter scale	About 3B; the Hugging Face safetensors listing shows about 3.83B BF16 parameters
Base model	Qwen/Qwen2.5-3B-Instruct
Vision encoder	MoonViT / MoonViT-SO-400M
Pipeline	'image-text-to-text'
Library	Transformers, requires 'trust_remote_code=True'
License	NVIDIA License, Non-Commercial Research/Evaluation Purposes
Hugging Face popularity	About 570k downloads and 2.4k likes (checked 2026-06-27)
Last Modified	2026-06-12 (checked 2026-06-27)
Release Date	The model card states that GitHub / HF / demo / webpage were released on May 26, 2026
Operating platform	Linux; NVIDIA GPU recommended; Document lists A100, H100, L40, RTX 4090, Blackwell, etc.

2. Key Diagrams

Capability overview vs. PBD

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-teaser.jpg]]

This picture works best as a pre-sales opener. The upper part shows LocateAnything covering multi-object localization, pointing, layout grounding, GUI grounding, detection, and OCR. The lower part shows how it differs from traditional coordinate-token decoding: traditional methods generate coordinates token by token, whereas PBD generates each box in parallel as an atomic unit.

Model Architecture and Structured Block Output

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-architecture.png]]

Architecture diagram description: the input is an image plus a text query. The vision encoder is MoonViT, the language side is Qwen2.5, and the two are connected by a two-layer MLP projector. The output is not ordinary natural language but a structured localization sequence containing semantic blocks, box blocks, negative blocks, and an end block.

Parallel Box Decoding method diagram

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-pbd.png]]

This diagram is suited to explaining the core technical innovation: the coordinates of a box are treated as one coupled geometric structure rather than as independent tokens. For localization tasks, this helps reduce the latency and geometric inconsistency that come with generating coordinates sequentially.

Data Coverage

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-data.png]]

The official figure shows that the training data covers natural scenes, robotics, driving, GUI, documents, OCR, open-world detection, and other domains. The model card states that the training set contains 12M unique images, about 140M natural-language queries, and 785M bounding boxes.

GUI / OCR / Detection Benchmark Charts

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-sspro-gui.png]]

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-layout-ocr.png]]

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-coco-lvis.png]]

These charts correspond to GUI grounding, layout/OCR, and general-purpose object detection. They are suitable for customer materials to show that the model is not a single-purpose detector but a general grounding model for multiple visual localization tasks.

3. What Does It Do?

The core task of LocateAnything-3B can be summarized as follows: give it an image and a natural-language instruction, and it outputs the position of the target in the image, usually as a bounding box or a point.

Capability	Example	Business Value
Open-category object detection	"Locate all person, car, bicycle"	No need to train a separate traditional detector for each category; better suited to long-tail objects
Referring expression grounding	"Locate the person in red", "the second car on the left"	Supports more natural human-computer interaction and grounded visual question answering
Dense multi-object detection	Locate a large number of objects in crowded scenes	Suitable for security, traffic, remote sensing, warehouse inventory, etc.
GUI element grounding	"Locate search button", "Click on the crop tool"	Suitable for visual GUI agents, RPA enhancement, Computer Use
OCR / text localization	"Detect all text", "Locate a street sign text"	Document understanding, bill recognition, scene text detection
Document layout grounding	Locate headings, paragraphs, formulas, tables, and other regions	Suitable for PDF/document parsing and layout restoration
Pointing	"Point to the traffic light"	Suitable for robots, remote control, and embodied-AI interaction
Robot / autonomous driving perception	Recognize objects in space and output their locations	Candidate technology for Physical AI perception modules

4. Core Technology Highlights

4.1 Parallel Box Decoding

Traditional VLM grounding usually serializes 2D box coordinates into multiple 1D tokens and then generates them autoregressively, token by token. This has two problems. First, coordinates are naturally coupled geometric structures, and token-by-token generation weakens geometric consistency. Second, strictly serial generation becomes the inference bottleneck.

LocateAnything's PBD decodes boxes or points in parallel as atomic geometric units. According to the official paper abstract, PBD improves both decoding throughput and localization accuracy, pushing the speed-accuracy frontier.

4.2 Three Inference Modes

Mode	Description	Suitable Scenarios
'fast'	MTP only, no fallback to autoregressive decoding	Simple scenarios where speed matters most
'slow'	Pure autoregressive decoding	Offline tasks that prioritize stability and accuracy
'hybrid'	Default. Decodes in parallel first and falls back to autoregressive decoding on format anomalies or spatial ambiguity.	Recommended default; balances speed and quality

The model card recommends 'max_new_tokens=8192' and 'generation_mode="hybrid"' to avoid output truncation and to balance speed and accuracy.
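
The "parallel first, autoregressive on failure" idea behind the hybrid mode can be illustrated with a small control-flow sketch. This is not the model's implementation: the two decoder callables are hypothetical stand-ins for real model calls, and the check covers only box format (the spatial-ambiguity criterion is not described in this note). The [0, 1000] grid is taken from section 7.2.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) on the [0, 1000] grid


def is_well_formed(boxes: Optional[List[Box]]) -> bool:
    """Format check only: every box is ordered and lies inside [0, 1000]."""
    if boxes is None:
        return False
    return all(
        0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000
        for x1, y1, x2, y2 in boxes
    )


def decode_with_fallback(
    parallel_decode: Callable[[], Optional[List[Box]]],
    autoregressive_decode: Callable[[], List[Box]],
) -> Tuple[List[Box], str]:
    """Try the parallel path first; fall back when its output is malformed."""
    boxes = parallel_decode()
    if boxes is not None and is_well_formed(boxes):
        return boxes, "parallel"
    return autoregressive_decode(), "autoregressive"


# Hypothetical stub decoders: the second parallel result has x1 > x2.
good = decode_with_fallback(lambda: [(10, 20, 300, 400)], lambda: [])
bad = decode_with_fallback(
    lambda: [(500, 20, 300, 400)],
    lambda: [(300, 20, 500, 400)],
)
print(good[1])  # parallel
print(bad[1])   # autoregressive
```

In a PoC, logging how often the fallback path is taken is one way to explain latency differences between 'fast', 'slow', and 'hybrid' runs.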

4.3 Large-Scale Multi-Domain Data

The model card and project page highlight the large-scale, multi-domain training data in LocateAnything-Data:

Dimension	Official Description
Image Scale	12M unique images
Query Scale	The model card states about 140M natural-language queries; the project page and paper abstract cite 138M training samples / language queries
Box Scale	785M bounding boxes
Data Domains	grounding, open-world grounding, dense detection, scene text, GUI, document layout, OCR, robotics/driving, etc.
Labeling Methods	Manual annotation, open-source annotations, model-assisted labeling, synthetic labeling, and automatic verification

4.4 High-Throughput Batch Inference Tools

The model repository provides not only the model weights but also 'batch_infer.py', 'batch_utils', and 'kernel_utils'. The 'la_flash' backend runs FlashAttention varlen over a sparse range plan; the goal is to avoid constructing a dense '[B, H, Q, K]' attention mask.

Example A100 probe on a 4K image, from the model card:

Backend	Attention Path	Time	Peak Reserved Memory
'sdpa'	Dense SDPA masks	8.2600s	35.12GB
'la_flash'	FlashAttention sparse range plan	8.0314s	11.71GB

Pre-sales takeaway: the release is more than model weights; it also includes inference engineering optimizations aimed at high-resolution, multi-target batch detection.
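
To put the probe table in perspective, the short calculation below derives the relative savings from the four reported numbers and estimates how large a dense '[B, H, Q, K]' mask can get. The batch size, head count, and sequence lengths in the second part are hypothetical values chosen for illustration, not figures from the model card.

```python
def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction saved by the candidate relative to the baseline."""
    return 1 - candidate / baseline


def dense_mask_bytes(batch: int, heads: int, q_len: int, k_len: int,
                     bytes_per_elem: int = 1) -> int:
    """Size of a dense [B, H, Q, K] mask, assuming one byte per element."""
    return batch * heads * q_len * k_len * bytes_per_elem


# Figures from the A100 4K probe table above.
sdpa_time_s, sdpa_mem_gb = 8.2600, 35.12
flash_time_s, flash_mem_gb = 8.0314, 11.71

print(f"memory saving: {relative_saving(sdpa_mem_gb, flash_mem_gb):.1%}")  # 66.7%
print(f"time saving:   {relative_saving(sdpa_time_s, flash_time_s):.1%}")  # 2.8%

# Hypothetical shape: 4 images, 16 heads, 16384 query and key positions.
mask = dense_mask_bytes(batch=4, heads=16, q_len=16384, k_len=16384)
print(f"dense mask: {mask / 2**30:.1f} GiB")  # 16.0 GiB
```

The takeaway for sizing discussions is that the reported benefit of 'la_flash' in this probe is mainly peak memory (about one third of the SDPA figure), while wall-clock time is nearly unchanged.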

5. Applicable Scenarios

5.1 GUI Agent / Computer Use

LocateAnything can ground GUI elements: given a screenshot and a natural-language instruction, it locates a button, menu, icon, or region. It can serve as the perception module of a visual GUI agent, helping the agent move from "understanding the screenshot" to "knowing where to click".

Suitable scenarios:

Scenario	Value
Desktop/Web Automation	Locate elements visually when DOM is missing or page structure is not available
Software Testing	Locate UI controls from natural-language test steps
RPA Enhancements	Upgrade RPA from coordinate recording to semantic control localization
Remote O&M	Locate operation targets in screenshots/video streams

5.2 Document Understanding and OCR/Layout

It supports layout grounding and OCR localization and can be used to locate headings, paragraphs, formulas, tables, text blocks, specific fields, and more.

Suitable scenarios:

Scenario	Value
PDF/Scan Parsing	Locate fields and regions in complex layouts
Bill/Contract Processing	Find the location of key fields to assist structured extraction
Document Review	Identify document regions and combine them with rules/Q&A
Knowledge Base Construction	Improve chunking and layout understanding for documents that mix text and graphics

5.3 Industrial Vision and Quality Inspection

Open-category detection and referring expression grounding are suitable for industrial vision PoCs, especially when the customer has many object types, many long-tail defects, and high maintenance costs for traditional detectors.

Note: The model license does not permit direct commercial deployment. Use the model for technical verification or as a selection reference, or discuss a commercial license with NVIDIA.

5.4 Robotics / Embodied AI / Autonomous Driving

Both the project page and the model card mention robotics, driving, and Physical AI. LocateAnything can serve as a bridge module from verbal instructions to visual locations, for example "grab the left cup", "locate the red button", and "point to the intersection signal".

5.5 Automatic Labeling and Data Production

It can generate candidate annotations for grounding, detection, and pointing, which are then reviewed manually or by rules. This suits training data preparation, long-tail object labeling, and GUI dataset construction.

6. Unsuitable Scenarios

Scenario	Reason
Direct commercial delivery	The NVIDIA License expressly restricts use to non-commercial research/evaluation; commercial use is not permitted, except by NVIDIA and its affiliates
CPU-only or low-end device deployment	Official materials emphasize NVIDIA GPU-accelerated systems; a 3B VLM with high-resolution input needs GPU memory and compute
Extremely low-latency edge	Although PBD and the batch tools are available, a 3B model still needs quantization, compression, distillation, or similar optimization on embedded platforms
Production systems that require strict safety certification	The model card advises iterative testing and validation on use-case-specific data
Text-only scenarios without image/screenshot input	It is a visual grounding model, not a general-purpose text LLM
High-risk automated decisions	Localization results may be wrong; manual review or system-level safety redundancy is required
Multilingual grounding	The model card lists English as the language, and prompts/queries are mainly English task phrasing

7. How to Use

7.1 Installing Dependencies

The model card gives the base dependencies:

pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5

PyTorch must be installed separately to match your CUDA version. On Hopper / Blackwell GPUs, MagiAttention can optionally be installed for faster MTP inference; when it is not installed, the model falls back to PyTorch SDPA.

7.2 Python Call Method

The model card provides a 'LocateAnythingWorker' pattern: load the tokenizer, processor, and model once at startup, then serve detection, grounding, OCR, GUI grounding, and similar requests through 'predict()' and the task-specific methods.

Simplified example:

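from PIL import Image

# LocateAnythingWorker is the helper class from the model card example;
# it is not part of PIL or Transformers and must be defined or imported first.
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Open-category detection: one call covering several category names.
result = worker.detect(img, ["person", "car", "bicycle"])
print(result["answer"])

# GUI grounding: request a point instead of a box.
result = worker.ground_gui(img, "the search button", output_type="point")
print(result["answer"])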

The box/point coordinates in the output are tokens normalized to '[0, 1000]' and must be converted to pixel coordinates before use.
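
A minimal sketch of that conversion follows, assuming the coordinates have already been parsed into numbers and that 0 and 1000 map to the image edges. The function names are illustrative and not part of the model repository; the exact token syntax is not reproduced here.

```python
from typing import Tuple

GRID = 1000  # coordinates are normalized to [0, 1000]


def to_pixels(value: float, size: int) -> int:
    """Map one normalized coordinate to pixels, clamped to [0, size]."""
    pixel = round(value * size / GRID)
    return max(0, min(size, pixel))


def box_to_pixels(box: Tuple[float, float, float, float],
                  width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert (x1, y1, x2, y2) from the [0, 1000] grid to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (to_pixels(x1, width), to_pixels(y1, height),
            to_pixels(x2, width), to_pixels(y2, height))


def point_to_pixels(point: Tuple[float, float],
                    width: int, height: int) -> Tuple[int, int]:
    """Convert (x, y) from the [0, 1000] grid to pixel coordinates."""
    x, y = point
    return (to_pixels(x, width), to_pixels(y, height))


print(box_to_pixels((100, 250, 500, 750), width=1920, height=1080))
# (192, 270, 960, 810)
print(point_to_pixels((500, 500), width=1920, height=1080))
# (960, 540)
```

For click automation, the width and height should be those of the screenshot that was actually sent to the model; if the screenshot was resized before inference, scale back to screen coordinates as a separate step.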

7.3 Supported Prompt Templates

Task	Prompt Template	Output
Object Detection	'Locate all the instances that matches the following description: [CATEGORIES].'	Box
Phrase Grounding	'Locate a single instance that matches the following description: [PHRASE].'	Single Box
Multi Phrase Grounding	'Locate all the instances that match the following description: [PHRASE].'	Multiple Boxes
Text Grounding	'Please locate the text referred as [PHRASE].'	Box
Scene Text Detection	'Detect all the text in box format.'	Box
GUI Grounding	'Locate the region that matches the following description: [PHRASE].'	Box
GUI Pointing	'Point to: [PHRASE].'	Point
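
The templates above can be kept in one place so that every caller sends identical wording. The sketch below copies the template strings verbatim from the table; the task keys and the 'build_prompt' helper are hypothetical names. How multiple categories should be joined inside '[CATEGORIES]' is not specified in this note, so the caller passes an already formatted string.

```python
PROMPT_TEMPLATES = {
    "object_detection": "Locate all the instances that matches the following description: [CATEGORIES].",
    "phrase_grounding": "Locate a single instance that matches the following description: [PHRASE].",
    "multi_phrase_grounding": "Locate all the instances that match the following description: [PHRASE].",
    "text_grounding": "Please locate the text referred as [PHRASE].",
    "scene_text_detection": "Detect all the text in box format.",
    "gui_grounding": "Locate the region that matches the following description: [PHRASE].",
    "gui_pointing": "Point to: [PHRASE].",
}

PLACEHOLDERS = ("[CATEGORIES]", "[PHRASE]")


def build_prompt(task: str, text: str = "") -> str:
    """Fill the template for a task; tasks without a placeholder ignore text."""
    template = PROMPT_TEMPLATES[task]  # KeyError for unknown tasks
    for placeholder in PLACEHOLDERS:
        if placeholder in template:
            if not text:
                raise ValueError(f"task {task!r} needs a description")
            return template.replace(placeholder, text)
    return template


print(build_prompt("gui_pointing", "the search button"))
# Point to: the search button.
print(build_prompt("scene_text_detection"))
# Detect all the text in box format.
```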

7.4 Batch inference

python batch_infer.py \
  --model nvidia/LocateAnything-3B \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image /path/to/image.jpg \
  --query "personcar"

This mode is suitable for offline batch detection, automatic labeling and evaluation, but not for the training path.

8. Pre-Sales Talking Points

Business-oriented

Customer Concern	Recommended Wording
Can visual AI understand natural language?	"LocateAnything can directly convert natural-language descriptions into positions in an image, such as locating buttons, text, objects, and document regions."
Why does UI automation need it?	"When the DOM is unavailable and the interface is a remote desktop, image, or video stream, visual grounding can tell the agent where to click."
Industrial vision involves many object types	"It is not a traditional fixed-category detector but an open-category, natural-language-driven localization model, suitable for long-tail objects and fast PoCs."
Document scenarios are complex	"It locates layout regions, OCR text, and layout elements, and can be combined with OCR/LLM extraction pipelines."
Performance highlights	"Its core technique, PBD, decodes boxes in parallel instead of generating coordinates token by token, improving throughput while maintaining geometric consistency."

Technology-oriented

Technical Question	Recommended Answer
How to integrate the model	"It loads through Transformers custom code with AutoModel/AutoProcessor; BF16 on GPU is recommended."
How to use the output	"The output is structured tokens; the coordinates need to be parsed and mapped to pixels in the original image."
How to deploy	"It can be packaged as FastAPI/gRPC workers; batch_infer, la_flash, and MagiAttention can be evaluated for high-throughput scenarios."
Commercial availability	"The current public model license does not allow commercial use. Commercial projects need to be licensed separately or use the model only as a technical verification reference."
Difference from Grounding DINO-style models	"Grounding DINO is a more detection/grounding-specific model; LocateAnything is a VLM-style unified generation framework covering GUI, OCR, layout, pointing, and more task forms."

9. PoC Recommendations

9.1 Recommended PoC Directions

PoC	Input	Output	Validation Metrics
GUI control localization	Application screenshot + operation description	Button/region box or point	Click hit rate, task success rate
Document layout localization	PDF page screenshot + field description	Field/paragraph/table location	IoU, field recall, extraction accuracy
Industrial defect/object localization	Production line image + object/defect description	Detection box	mAP, IoU, miss rate, false detection rate
Remote sensing / dense traffic detection	High-resolution image + category	Multi-target boxes	Recall, dense-scene throughput
Automatic annotation	Image to be annotated + category/description	Candidate annotations	Manual correction rate, annotation efficiency improvement

9.2 PoC Design Recommendations

Item	Recommendation
Data volume	Start with 50-200 representative images covering easy, medium, and hard scenes
Annotation	Create a small batch of manually labeled ground truth and evaluate with IoU / point-in-mask / hit rate
Mode	Use 'hybrid' by default; compare the speed and accuracy of 'fast' and 'slow'
Resources	Prefer H100/A100/L40/RTX 4090; record GPU memory, latency, and throughput
Security	Do not connect directly to production control paths; verify first in a sandbox or offline evaluation
License	Make clear that the PoC is for research/evaluation only; commercial deployment requires confirmed authorization

9.3 Example Acceptance Metrics

Metric	Proposed Target
GUI click hit rate	Common controls > 85%; complex/occluded scenes analyzed separately
Box IoU@0.5	Set per business scenario; first look at the improvement over traditional solutions
Pointing hit rate	Point falls within the target mask/box
Inference latency	Recorded separately for single image, batch, and high resolution
Manual labeling efficiency	Candidate box usability rate, reduction in manual correction time
Failure types	Categorized statistics for small targets, occlusion, reflection, dense overlap, blurred text, etc.
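
The box and point metrics in this table are simple to compute once predictions and ground truth are in pixel coordinates. The sketch below shows standard IoU for axis-aligned boxes and a point-in-box hit rate; the function names and the one-prediction-per-target pairing are assumptions made for illustration, and mask-based hit tests would need the mask data instead of a box.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Point = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point: Point, box: Box) -> bool:
    """True when the point lies inside the box, edges included."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def hit_rate(points: Sequence[Point], targets: Sequence[Box]) -> float:
    """Share of predicted points that land inside their paired target box."""
    if len(points) != len(targets):
        raise ValueError("points and targets must be paired one-to-one")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, t) for p, t in zip(points, targets))
    return hits / len(points)


preds: List[Point] = [(10, 10), (200, 200)]
truth: List[Box] = [(0, 0, 50, 50), (0, 0, 50, 50)]
print(hit_rate(preds, truth))  # 0.5
print(iou((0, 0, 100, 100), (50, 0, 150, 100)) >= 0.5)  # False
```

With these helpers, the "> 85%" target for common controls becomes a direct comparison against 'hit_rate' on the common-control subset of the PoC images.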

10. Risks and Considerations

Risk	Description	Recommendation
License restrictions	The NVIDIA License restricts use to non-commercial research/evaluation; commercial use is not allowed	State this clearly in pre-sales; commercial projects need to negotiate authorization or choose a commercially licensed model
No general production commitment in the model card	The model card clearly positions it as a research model variant that needs use-case-specific testing	Run a PoC first; do not promise production results directly
High compute requirements	A 3B VLM on high-resolution images needs substantial GPU compute and memory	Do hardware sizing; evaluate quantization, distillation, and cropping
Custom code loading	Requires 'trust_remote_code=True', which brings supply chain security review requirements	Mirror on the enterprise intranet, audit the code, and pin a fixed commit
Coordinate parsing and post-processing	The output is text tokens that need to be parsed, mapped, and filtered	Encapsulate a stable parser with exception handling
Risk of mislocalization	Visual grounding may be affected by occlusion, blur, and small targets	Manual validation, rule checks, and multi-model cross-validation
Privacy and compliance	Input images may contain faces, health information, and trade secrets	De-identification, access control, and log governance
Language range	Mainly English prompts	Chinese scenarios need to be tested, or add a prompt translation layer
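
For the custom-code risk, one common mitigation is to pin the exact repository commit that was audited. The sketch below relies on the 'revision' and 'trust_remote_code' arguments of the Transformers 'from_pretrained' methods; the helper names are hypothetical, and the use of AutoModel/AutoProcessor follows the loading path mentioned in section 8 rather than a verified recipe from the model card.

```python
import re

MODEL_ID = "nvidia/LocateAnything-3B"


def pinned_load_kwargs(commit_sha: str) -> dict:
    """Build from_pretrained kwargs that pin weights and remote code to one commit."""
    # Require a full SHA so that a moving ref such as "main" is rejected.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase commit SHA")
    return {"revision": commit_sha, "trust_remote_code": True}


def load_pinned(commit_sha: str):
    """Load processor and model at the audited commit (downloads on first use)."""
    # Imported lazily so the validation helper works without transformers installed.
    from transformers import AutoModel, AutoProcessor

    kwargs = pinned_load_kwargs(commit_sha)
    processor = AutoProcessor.from_pretrained(MODEL_ID, **kwargs)
    model = AutoModel.from_pretrained(MODEL_ID, **kwargs)
    return processor, model
```

The commit SHA to pass in is whichever revision the security review approved; it is deliberately left as a parameter here because this note does not record one.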

11. Relationship with related technologies

Technology	Relationship with LocateAnything
Grounding DINO	Classic open-vocabulary detection/grounding model; LocateAnything places more emphasis on unified VLM generation, PBD, and multi-task coverage
SAM / SAM 3	SAM focuses on segmentation; LocateAnything focuses on natural language to box/point and can serve as a prompt generator for SAM
OCR Engine	OCR handles text recognition; LocateAnything can add text region localization and layout grounding
Multimodal large model	A general VLM can understand images; LocateAnything focuses more on high-quality visual localization output
RPA / GUI Agent	LocateAnything can serve as the visual localization module, combined with an action executor and a process orchestrator
Legacy detector	Legacy detectors require training on fixed classes; LocateAnything is better suited to open classes and natural-language descriptions

12. My Pre-Sales Judgment

LocateAnything-3B is a good model for customer conversations about visual agents, Physical AI, and GUI grounding. It connects to many customer concerns: AI needs not only to understand an image but also to tell the system where the target is. It can recognize common objects and also locate GUI controls, document regions, OCR text, and dense targets through natural language.

Its pre-sales value lies in providing a strong demo: the customer enters a natural-language description, and the model locates the target directly in a complex image. For GUI automation, document intelligence, industrial quality inspection, robotics, and autonomous driving perception, these capabilities are very intuitive.

However, it is not currently suitable as an open-source model for direct commercial delivery, because the license restrictions are critical. A better positioning is research evaluation, PoC verification, solution prototyping, technical route selection, or an entry point for ecosystem/licensing cooperation with NVIDIA. Formal business scenarios need to address licensing, model deployment, hardware costs, privacy compliance, and stability validation in advance.

13. Common Customer Q & A


Question	Answer
Can it be used commercially?	The current Hugging Face model is released under the NVIDIA License, which limits it to non-commercial research/evaluation, so it cannot be used commercially as-is. Commercial projects require separate confirmation of authorization.
What is the difference between it and ordinary object detection?	Ordinary detectors usually work on fixed categories; LocateAnything accepts natural-language descriptions of objects and covers GUI, OCR, layout, pointing, etc.
What can it output?	It mainly outputs structured text tokens containing box coordinates or points, which need to be parsed into pixel coordinates for use.
Can it handle Chinese instructions?	The model card lists English as the language, so Chinese instructions need to be tested; in a project, a translation layer can first convert them to English prompts.
How much GPU do you need?	NVIDIA GPUs such as A100/H100/L40/RTX 4090 are officially listed; the actual GPU memory needed depends on resolution, batch size, mode, and backend.
Does it support TensorRT / Triton?	The model card indicates that the current runtime engine is Transformers; TensorRT, TensorRT-LLM, and Triton are not supported yet.
Can it be used for automatic GUI clicking?	It handles locating the control; it still needs to be combined with a click executor, permission control, confirmation on exceptions, and business process orchestration.

14. References

- Hugging Face model page: nvidia/LocateAnything-3B

- NVIDIA project page: LocateAnything

- Hugging Face Demo Space

- GitHub code: NVlabs/Eagle/Embodied

- Paper: LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

- Model License

- NVIDIA Nemotron

- NVIDIA Cosmos