1. Model Overview
| Item | Information |
|---|---|
| Models Page | nvidia/LocateAnything-3B |
| Official Project Page | LocateAnything |
| Online Demo | Hugging Face Space |
| Code Entry | NVlabs/Eagle/Embodied |
| Paper | arXiv:2605.27365 |
| Model Type | Transformer-based Vision-Language Model |
| Parameter scale | About 3B; Hugging Face safetensors metadata shows about 3.83B BF16 parameters |
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Vision encoder | MoonViT / MoonViT-SO-400M |
| Pipeline | `image-text-to-text` |
| Library | Transformers; requires `trust_remote_code=True` |
| License | NVIDIA License; non-commercial research/evaluation purposes only |
| Hugging Face popularity | About 570k downloads and 2.4k likes (checked 2026-06-27) |
| Last modified | 2026-06-12 (checked 2026-06-27) |
| Release date | The model card marks the GitHub, HF, demo, and webpage releases as May 26, 2026 |
| Operating platform | Linux; NVIDIA GPU recommended; the documentation lists A100, H100, L40, RTX 4090, Blackwell, etc. |
2. Key Diagrams
Capability overview and PBD comparison
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-teaser.jpg]]
This picture works best as a pre-sales opener. The upper half shows LocateAnything covering multi-object localization, pointing, layout localization, GUI grounding, detection, and OCR. The lower half contrasts it with traditional coordinate-token decoding: traditional methods generate coordinates token by token, whereas PBD generates each box in parallel as an atomic unit.
Model Architecture and Structured Block Output
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-architecture.png]]
Architecture diagram notes: the input is an image plus a text query. The vision encoder is MoonViT and the language side is Qwen2.5, connected by a two-layer MLP projector. The output is not ordinary natural language but a structured localization sequence containing semantic, box, negative, and end blocks.
Parallel Box Decoding method diagram
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-pbd.png]]
This diagram is suited to explaining the core technical innovation: treating the coordinates of a box as one coupled geometric structure rather than as independent tokens. For localization tasks, this helps reduce the latency and geometric inconsistency that come with generating coordinates sequentially.
Data Coverage
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-data.png]]
The official data figure shows that the training data covers natural scenes, robotics, driving, GUI, documents, OCR, open-world detection, and other domains. The model card states that the training set contains 12M unique images, about 140M natural-language queries, and 785M bounding boxes.
GUI / OCR / Detection Benchmark Charts
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-sspro-gui.png]]
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-layout-ocr.png]]
![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-coco-lvis.png]]
These charts cover GUI grounding, layout/OCR, and general-purpose object detection. They are suitable for customer materials to show that the model is not a single-purpose detector but a general-purpose grounding model for multiple visual localization tasks.
3. What does it do?
The core task of LocateAnything-3B can be summarized as follows: given a picture and a natural-language instruction, it outputs the position of the target in the image, usually as a bounding box or a point.
| Capabilities | Example | Business Value |
|---|---|---|
| Open-category object detection | "Locate all person, car, bicycle" | No need to train a traditional detector for each category; better suited to long-tail objects |
| Referring expression grounding | "locate the person in red", "the second car on the left" | Supports more natural human-computer interaction and practical visual question answering |
| Dense multi-object detection | Locate a large number of objects in crowded scenes | Suitable for security, traffic, remote sensing, warehouse inventory, etc. |
| GUI element grounding | "Locate search button", "Click on the crop tool" | Suitable for visual GUI agents, RPA enhancement, and Computer Use |
| OCR / text localization | "Detect all text", "Locate a street sign text" | Document understanding, bill recognition, scene text detection |
| Document layout grounding | Locate headings, paragraphs, formulas, tables, and other regions | Suitable for intelligent PDF/document parsing and layout restoration |
| Pointing | "Point to the traffic light" | Suitable for robotics, remote control, and embodied AI interaction |
| Robot / autonomous driving perception | Recognizes objects in space and outputs their locations | A candidate technology for Physical AI perception modules |
4. Core Technology Highlights
4.1 Parallel Box Decoding
Traditional VLM grounding usually serializes 2D box coordinates into multiple 1D tokens and then generates them autoregressively, token by token. This causes two problems. First, coordinates are naturally coupled geometric structures, but token-by-token generation weakens geometric consistency. Second, strictly serial generation becomes an inference bottleneck.
LocateAnything's PBD decodes boxes or points in parallel as atomic geometric units. According to the paper abstract, PBD improves both decoding throughput and localization accuracy, pushing the speed-accuracy frontier.
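To make the serial bottleneck concrete, the toy count below compares how many sequential decoding steps the coordinates alone would need under the two schemes. It assumes four coordinate tokens per box and one parallel step per box. Both figures are simplifying assumptions for illustration, not numbers from the paper, and the function names are hypothetical.

```python
def autoregressive_coord_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Sequential steps when every coordinate token is generated one at a time."""
    return num_boxes * tokens_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Sequential steps when each box is emitted as one atomic unit.

    Assumption for illustration: exactly one decoding step per box.
    """
    return num_boxes


if __name__ == "__main__":
    for n in (1, 10, 100):
        serial = autoregressive_coord_steps(n)
        parallel = parallel_box_steps(n)
        print(f"{n} boxes: {serial} serial steps vs {parallel} parallel steps")
```

The count ignores label tokens and any fallback to autoregressive decoding, so treat it as intuition for dense scenes rather than a throughput estimate.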
4.2 Three inference modes
| Mode | Description | Suitable for |
|---|---|---|
| `fast` | MTP only, no fallback to autoregressive decoding | Simple scenarios where speed matters most |
| `slow` | Pure autoregressive decoding | Offline tasks that prioritize stability and accuracy |
| `hybrid` | The default: parallel first, falling back to autoregressive decoding on format anomalies or spatial ambiguity | Recommended default; balances speed and quality |
The model card recommends `max_new_tokens=8192` and `generation_mode="hybrid"` to avoid output truncation and to balance speed and accuracy.
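A small helper can keep these recommended settings in one place and catch mistyped mode names early. The sketch below is a hypothetical wrapper: `build_generation_kwargs` is not part of the model repository, and how the resulting dictionary is passed to the model depends on the remote code, so check the model card before wiring it in.

```python
VALID_MODES = ("fast", "slow", "hybrid")  # the three modes named in the model card


def build_generation_kwargs(mode: str = "hybrid", max_new_tokens: int = 8192) -> dict:
    """Return generation settings, rejecting unknown modes and invalid budgets.

    Defaults follow the model card recommendation (hybrid, 8192 new tokens).
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown generation_mode {mode!r}; expected one of {VALID_MODES}")
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {"generation_mode": mode, "max_new_tokens": max_new_tokens}


if __name__ == "__main__":
    print(build_generation_kwargs())
    print(build_generation_kwargs("fast"))
```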
4.3 Large-scale multi-domain data
The model card and project page highlight the large-scale, multi-domain training data in LocateAnything-Data:
| Dimension | Official Description |
|---|---|
| Image scale | 12M unique images |
| Query scale | The model card states about 140M natural-language queries; the project page and paper abstract emphasize 138M training samples / language queries |
| Box scale | 785M bounding boxes |
| Data domains | grounding, open-world grounding, dense detection, scene text, GUI, document layout, OCR, robotics/driving, etc. |
| Labeling methods | Manual labeling, open-source annotations, model-assisted labeling, synthetic labeling, and automatic verification |
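For a sense of annotation density, the headline figures can be turned into per-image and per-query averages. This is plain arithmetic on the numbers quoted above (using the model card's 140M query figure); the averages are derived here and are not stated in the official materials.

```python
UNIQUE_IMAGES = 12_000_000
QUERIES = 140_000_000          # model card figure; the project page cites 138M
BOUNDING_BOXES = 785_000_000

queries_per_image = QUERIES / UNIQUE_IMAGES
boxes_per_image = BOUNDING_BOXES / UNIQUE_IMAGES
boxes_per_query = BOUNDING_BOXES / QUERIES

print(f"queries per image: {queries_per_image:.1f}")
print(f"boxes per image:   {boxes_per_image:.1f}")
print(f"boxes per query:   {boxes_per_query:.1f}")
```

On average this works out to roughly 12 queries and 65 boxes per image, which is consistent with the emphasis on dense, multi-object scenes.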
4.4 High-throughput batch inference tools
The model repository provides not only the model weights but also `batch_infer.py`, `batch_utils`, and `kernel_utils`. The `la_flash` backend runs FlashAttention with a varlen sparse range plan; the goal is to avoid constructing a dense `[B,H,Q,K]` attention mask.
A100 4K-image probe example from the model card:
| Backend | Attention Path | Time | Peak Reserved Memory |
|---|---|---|---|
| `sdpa` | Dense SDPA masks | 8.2600s | 35.12GB |
| `la_flash` | FlashAttention sparse range plan | 8.0314s | 11.71GB |
Pre-sales takeaway: this is not just a set of model weights; it also includes inference engineering optimizations for high-resolution, multi-object batch detection.
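The probe numbers are easier to quote as relative savings. The snippet below only does arithmetic on the two table rows; `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of candidate relative to baseline (0.25 means 25% lower)."""
    return (baseline - candidate) / baseline


# Figures copied from the A100 4K-image probe table.
sdpa_time_s, la_flash_time_s = 8.2600, 8.0314
sdpa_mem_gb, la_flash_mem_gb = 35.12, 11.71

time_saving = relative_reduction(sdpa_time_s, la_flash_time_s)
memory_saving = relative_reduction(sdpa_mem_gb, la_flash_mem_gb)

print(f"time reduction:   {time_saving:.1%}")
print(f"memory reduction: {memory_saving:.1%}")
```

In this single probe the gain is almost entirely in memory: peak reserved memory drops by about two thirds, while latency improves by under 3%. One probe on one GPU is not a benchmark, so re-measure on the target hardware.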
5. Applicable Scenarios
5.1 GUI Agent / Computer Use
LocateAnything can ground GUI elements: given a screenshot and a natural-language instruction, it locates a button, menu, icon, or region. It can serve as the perception module of a visual GUI agent, helping the agent move from "understanding the screenshot" to "knowing where to click".
Suitable scenarios:
| Scenario | Value |
|---|---|
| Desktop/web automation | Locate elements visually when the DOM is missing or the page structure is unavailable |
| Software testing | Locate UI controls from natural-language test steps |
| RPA enhancement | Upgrade RPA from recorded coordinates to semantic control localization |
| Remote O&M | Locate operation targets in screenshots or video streams |
5.2 Document Understanding and OCR/Layout
It supports layout grounding and OCR localization and can be used to locate headings, paragraphs, formulas, tables, text blocks, specific fields, and more.
Suitable scenarios:
| Scenario | Value |
|---|---|
| PDF/scan parsing | Locate fields and regions in complex layouts |
| Bill/contract processing | Find the location of key fields to assist structured extraction |
| Document review | Identify document regions and combine them with rules or Q&A |
| Knowledge base construction | Improve chunking and layout understanding for documents that mix text and images |
5.3 Industrial vision and quality inspection
Open-category detection and referring expression grounding suit industrial vision PoCs, especially when the customer has many object types, many long-tail defects, and high maintenance costs for traditional detectors.
Note: the model license does not permit direct commercial deployment. Use the model for technical validation or as a selection reference, or discuss a commercial license with NVIDIA.
5.4 Robotics / Embodied AI / Autonomous Driving
Both the project page and the model card mention robotics, driving, and Physical AI. LocateAnything can serve as a bridge module from language instructions to visual locations, for example "grab the left cup", "locate the red button", and "point to the intersection signal".
5.5 Automatic labeling and data production
It can generate candidate annotations for grounding, detection, and pointing, which are then reviewed manually or by rules. This suits training data preparation, long-tail object labeling, and GUI dataset construction.
6. Unsuitable Scenarios
| Scenario | Reason |
|---|---|
| Direct commercial delivery | The NVIDIA License expressly restricts use to non-commercial research/evaluation; commercial use is not permitted except by NVIDIA and its affiliates |
| CPU-only or low-end device deployment | The official materials emphasize NVIDIA GPU-accelerated systems; a 3B VLM with high-resolution input needs substantial GPU memory and compute |
| Extremely low-latency edge | Although PBD and batch tools are available, the 3B model still needs quantization, compression, distillation, or similar optimization on embedded platforms |
| Production systems that require strict safety certification | The model card advises iterative testing and validation on use-case-specific data |
| Text-only scenarios without image/screenshot input | It is a visual grounding model, not a general-purpose text LLM |
| High-risk automated decisions | Localization results may be wrong; manual review or system-level safety redundancy is required |
| Multilingual grounding | The model card lists the language as English, and prompts/queries are mainly English task expressions |
7. How to use
7.1 Installing dependencies
The model card lists the base dependencies:
pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5
PyTorch needs to be installed separately to match the CUDA version. On Hopper / Blackwell GPUs, MagiAttention can optionally be installed for faster MTP inference; without it, inference falls back to PyTorch SDPA.
7.2 Python Call Method
The model card provides a `LocateAnythingWorker` pattern: load the tokenizer, processor, and model at startup, then serve detection, grounding, OCR, GUI grounding, and other requests through `predict()` and the task methods.
Simplified example:
from PIL import Image
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")
result = worker.detect(img, ["person", "car", "bicycle"])
print(result["answer"])
result = worker.ground_gui(img, "the search button", output_type="point")
print(result["answer"])
The box/point coordinates in the output are tokens normalized to `[0, 1000]` and need to be converted to pixel coordinates.
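The conversion itself is a linear rescale by the image size. The sketch below assumes boxes arrive as `(x1, y1, x2, y2)` in the normalized range and that scaling is linear per axis; both are assumptions to confirm against the model card, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

NORMALIZED_MAX = 1000  # coordinates are normalized to [0, 1000] per the model card


def to_pixel_point(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized point to pixel coordinates, clamped to the image bounds."""
    px = round(x / NORMALIZED_MAX * width)
    py = round(y / NORMALIZED_MAX * height)
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)


def to_pixel_box(box: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    left, top = to_pixel_point(x1, y1, width, height)
    right, bottom = to_pixel_point(x2, y2, width, height)
    return left, top, right, bottom


if __name__ == "__main__":
    # A 1920x1080 screenshot with a box covering the central region.
    print(to_pixel_box((250, 250, 750, 750), 1920, 1080))
```

Clamping to `width - 1` and `height - 1` keeps a coordinate of exactly 1000 inside the image, which matters when the result is used to index pixels or drive a click.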
7.3 Supported Prompt Templates
| Task | Prompt Template | Output |
|---|---|---|
| Object Detection | 'Locate all the instances that matches the following description: [CATEGORIES].' | Box |
| Phrase Grounding | 'Locate a single instance that matches the following description: [PHRASE].' | Single Box |
| Multi Phrase Grounding | 'Locate all the instances that match the following description: [PHRASE].' | Multiple Boxes |
| Text Grounding | 'Please locate the text referred as [PHRASE].' | Box |
| Scene Text Detection | 'Detect all the text in box format.' | Box |
| GUI Grounding | 'Locate the region that matches the following description: [PHRASE].' | Box |
| GUI Pointing | 'Point to: [PHRASE].' | Point |
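In a service wrapper it helps to keep these templates in one table keyed by task. The sketch below copies the template strings verbatim from the table above; the task keys and `build_prompt` are hypothetical names, and how several categories should be joined inside `[CATEGORIES]` is not covered here, so follow the model card for that.

```python
# Template strings copied verbatim from the table; {target} replaces [PHRASE] / [CATEGORIES].
PROMPT_TEMPLATES = {
    "object_detection": "Locate all the instances that matches the following description: {target}.",
    "phrase_grounding": "Locate a single instance that matches the following description: {target}.",
    "multi_phrase_grounding": "Locate all the instances that match the following description: {target}.",
    "text_grounding": "Please locate the text referred as {target}.",
    "scene_text_detection": "Detect all the text in box format.",
    "gui_grounding": "Locate the region that matches the following description: {target}.",
    "gui_pointing": "Point to: {target}.",
}


def build_prompt(task: str, target: str = "") -> str:
    """Fill the template for a task; raise if the task is unknown or a target is missing."""
    if task not in PROMPT_TEMPLATES:
        raise KeyError(f"unknown task {task!r}")
    template = PROMPT_TEMPLATES[task]
    if "{target}" in template and not target:
        raise ValueError(f"task {task!r} needs a target phrase")
    return template.format(target=target)


if __name__ == "__main__":
    print(build_prompt("gui_pointing", "the search button"))
    print(build_prompt("scene_text_detection"))
```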
7.4 Batch inference
python batch_infer.py \
--model nvidia/LocateAnything-3B \
--attn la_flash \
--scheduler pipeline \
--batch-size 4 \
--image /path/to/image.jpg \
--query "personcar"
This mode suits offline batch detection, automatic labeling, and evaluation; it is not a training path.
8. Pre-Sales Talking Points
Business-oriented
| Customer concern | Suggested wording |
|---|---|
| Can visual AI understand natural language | "LocateAnything converts natural-language descriptions directly into positions in an image, such as buttons, text, objects, and document regions." |
| Why UI automation needs it | "When the DOM is unavailable and the interface is a remote desktop, picture, or video stream, visual grounding can tell the agent where to click." |
| Industrial vision has many object types | "It is not a traditional fixed-category detector but an open-category, natural-language-driven localization model, suited to long-tail objects and fast PoCs." |
| Complex document scenarios | "It locates layout regions, OCR text, and layout elements, and can be combined with OCR/LLM extraction pipelines." |
| Performance highlights | "Instead of generating coordinates token by token, its core PBD technique decodes boxes in parallel, improving throughput while maintaining geometric consistency." |
Technology-oriented
| Technical question | Suggested answer |
|---|---|
| How to connect the model | "Custom Transformers code loaded via AutoModel/AutoProcessor; a BF16 GPU is recommended." |
| how to use the output | "the output is a structured token, which needs to parse the' |
| How to deploy | "Can be packaged as FastAPI/gRPC workers; batch_infer, la_flash, and MagiAttention can be evaluated in high-throughput scenarios." |
| Commercially Available | "The current public model license does not allow commercial use. Commercial projects need to be licensed separately or only used as a technical verification reference." |
| Difference from Grounding DINO-style models | "Grounding DINO is more of a detection/grounding-specific model; LocateAnything is a VLM-style unified generation framework covering GUI, OCR, layout, pointing, and more task forms." |
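For the "how to connect" question, a loading sketch along the following lines can be shown. It assumes the model is exposed through `AutoModel` / `AutoProcessor` as noted above, and it pins a `revision` so the remote code cannot change silently. The revision value must be supplied by the caller; the helper names are hypothetical, and the model card's own loading snippet takes precedence.

```python
MODEL_ID = "nvidia/LocateAnything-3B"


def build_load_kwargs(revision: str) -> dict:
    """Keyword arguments shared by the model and processor loaders."""
    if not revision:
        raise ValueError("pin an explicit commit hash or tag instead of a moving branch")
    return {"trust_remote_code": True, "revision": revision}


def load_model_and_processor(revision: str):
    """Load the model in BF16 on GPU; needs torch, transformers, and a CUDA device."""
    import torch  # imported lazily so build_load_kwargs works without a GPU stack
    from transformers import AutoModel, AutoProcessor

    kwargs = build_load_kwargs(revision)
    processor = AutoProcessor.from_pretrained(MODEL_ID, **kwargs)
    model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, **kwargs)
    return model.to("cuda").eval(), processor


if __name__ == "__main__":
    # Placeholder value; replace with a commit hash that has been reviewed.
    print(build_load_kwargs("your-reviewed-commit-hash"))
```

Pinning the revision pairs naturally with a code audit: the hash that was reviewed is the hash that gets loaded.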
9. PoC Recommendations
9.1 Recommended PoC directions
| PoC | Input | Output | Validation metrics |
|---|---|---|---|
| GUI control localization | Application screenshot + operation description | Button/region box or point | Click hit rate, task success rate |
| Document layout localization | PDF page screenshot + field description | Field/paragraph/table location | IoU, field recall, extraction accuracy |
| Industrial defect/object localization | Production-line image + object/defect description | Detection box | mAP, IoU, miss rate, false positive rate |
| Remote sensing / dense traffic detection | High-resolution image + category | Multi-object boxes | Recall, dense-scene throughput |
| Automatic annotation | Images to annotate + category/description | Candidate annotations | Manual correction rate, annotation efficiency gain |
9.2 PoC Design Recommendations
| Item | Recommendation |
|---|---|
| Data volume | Start with 50-200 representative images covering easy, medium, and hard scenes |
| Annotation | Create a small batch of manually labeled ground truth and evaluate with IoU / point-in-mask / hit rate |
| Mode | Use `hybrid` by default; compare the speed and accuracy of `fast` and `slow` |
| Resources | Prefer H100/A100/L40/RTX 4090; record GPU memory, latency, and throughput |
| Security | Do not connect directly to production control paths; verify first in a sandbox or offline evaluation |
| License | Make clear that the PoC is for research/evaluation only; commercial deployment requires confirmed authorization |
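For the latency and throughput records, a small timing harness is usually enough at PoC stage. The sketch below is a generic helper built on the standard library; `measure_latency` and the `run_once` callable are hypothetical, and GPU memory is not measured here.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latency(run_once: Callable[[object], object],
                    inputs: Iterable[object],
                    warmup: int = 1) -> dict:
    """Time run_once on each input and return simple latency statistics in seconds."""
    items = list(inputs)
    if not items:
        raise ValueError("need at least one input to measure")
    for item in items[:warmup]:
        run_once(item)  # warm-up calls are executed but not timed
    timings = []
    for item in items:
        start = time.perf_counter()
        run_once(item)
        timings.append(time.perf_counter() - start)
    total = sum(timings)
    return {
        "count": len(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
        "throughput_per_s": len(timings) / total if total > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload; in a PoC, run_once would wrap one inference call.
    print(measure_latency(lambda n: sum(range(n)), [100_000] * 5))
```

Running the same harness once per mode (`fast`, `slow`, `hybrid`) on the same image set gives directly comparable numbers.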
9.3 Example Acceptance Metrics
| Metric | Proposed target |
|---|---|
| GUI click hit rate | Common controls > 85%; complex/occluded scenes analyzed separately |
| Box IoU@0.5 | Set per business scenario; first look at the improvement over traditional solutions |
| Point hit rate | Points fall within the target mask/box |
| Inference latency | Record separately for single image, batch, and high resolution |
| Manual labeling efficiency | Candidate box usability rate, reduction in manual correction time |
| Failure types | Categorized statistics for small targets, occlusion, reflection, dense overlap, blurred text, etc. |
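The box and point metrics above are simple to compute once predictions and ground truth share one coordinate system. The sketch below assumes boxes as `(x1, y1, x2, y2)` and a one-to-one pairing of predictions with ground truth; matching multiple predictions to multiple targets is out of scope, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point: Point, box: Box) -> bool:
    """True when the point lies inside the box, edges included."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def hit_rate(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Share of predicted points that land inside their paired target box."""
    if len(points) != len(boxes) or not points:
        raise ValueError("points and boxes must be non-empty and the same length")
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


if __name__ == "__main__":
    print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
    print(hit_rate([(5, 5), (20, 20)], [(0, 0, 10, 10), (0, 0, 10, 10)]))
```

A prediction counts toward IoU@0.5 when `box_iou` against its ground-truth box is at least 0.5.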
10. Risks and Considerations
| Risk | Description | Recommendation |
|---|---|---|
| License restrictions | The NVIDIA License restricts use to non-commercial research/evaluation; commercial use is not allowed | Pre-sales must state this clearly; commercial projects need to discuss authorization or a commercial model |
| No general production commitment in the model card | It is explicitly a research model variant and needs use-case-specific testing | Run a PoC first; do not promise production results directly |
| High compute requirements | A 3B VLM on high-resolution images needs substantial GPU compute and memory | Do hardware sizing; evaluate quantization, distillation, and cropping |
| Custom code loading | Requires `trust_remote_code=True`, which calls for a supply-chain security review | Mirror on the enterprise intranet, audit the code, and pin a commit |
| Coordinate parsing and post-processing | The output is text tokens that need to be parsed, mapped, and filtered | Wrap a stable parser with exception handling |
| Mislocalization risk | Visual grounding can be affected by occlusion, blur, and small targets | Manual validation, rule checks, and multi-model cross-validation |
| Privacy and compliance | Input images may contain faces, health information, and trade secrets | De-identification, access control, and log governance |
| Language coverage | Mainly English prompts | Chinese scenarios need testing or a prompt translation layer |
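For the post-processing risk, a filtering step after parsing catches malformed results before they reach downstream systems. The sketch below works on boxes that have already been extracted from the model output as four numbers in the `[0, 1000]` range; the token-level parsing is not shown because it depends on the output format in the model card, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def is_valid_box(box, lo: float = 0, hi: float = 1000) -> bool:
    """Accept a 4-number box inside [lo, hi] with positive width and height."""
    try:
        if len(box) != 4:
            return False
        x1, y1, x2, y2 = (float(v) for v in box)
    except (TypeError, ValueError):
        return False  # wrong type, wrong arity, or non-numeric entries
    in_range = all(lo <= v <= hi for v in (x1, y1, x2, y2))
    return in_range and x1 < x2 and y1 < y2


def split_valid(boxes: Iterable) -> Tuple[List, List]:
    """Separate usable boxes from rejected ones so rejects can be logged and reviewed."""
    valid, rejected = [], []
    for box in boxes:
        (valid if is_valid_box(box) else rejected).append(box)
    return valid, rejected


if __name__ == "__main__":
    candidates = [(10, 10, 200, 300), (500, 500, 400, 600), (0, 0, 1200, 10), ("a", 1, 2, 3)]
    good, bad = split_valid(candidates)
    print("valid:", good)
    print("rejected:", bad)
```

Keeping the rejected list, rather than silently dropping entries, feeds the failure-type statistics and makes parser regressions visible.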
11. Relationship with related technologies
| Technology | Relationship with LocateAnything |
|---|---|
| Grounding DINO | Classic open-vocabulary detection/grounding model; LocateAnything places more emphasis on unified VLM generation, PBD, and multi-task coverage |
| SAM / SAM 3 | SAM leans toward segmentation; LocateAnything leans toward natural language to box/point and can serve as a prompt generator for SAM |
| OCR engine | OCR handles text recognition; LocateAnything can add text region localization and layout grounding |
| Multimodal large models | General-purpose VLMs can understand images; LocateAnything focuses more on high-quality visual localization output |
| RPA / GUI agent | LocateAnything can serve as the visual localization module, combined with an action executor and a process orchestrator |
| Traditional detectors | Traditional detectors require training on fixed classes; LocateAnything is better suited to open classes and natural-language descriptions |
12. My Pre-Sales Judgment
LocateAnything-3B is a good model for customer conversations about visual agents, Physical AI, and GUI grounding. It connects several customer concerns: AI needs not only to understand an image but also to tell the system where the target is, and it can locate not only common objects but also GUI controls, document regions, OCR text, and dense targets through natural language.
Its pre-sales value lies in providing a strong demo: the customer enters a natural-language description, and the model locates the target directly in a complex image. For GUI automation, document intelligence, industrial quality inspection, robotics, and autonomous driving perception, these capabilities are very intuitive.
However, because of the license restrictions, it is not currently suitable as an open-source model for direct commercial delivery. A better positioning is research evaluation, PoC verification, solution prototyping, technical route selection, or an entry point for ecosystem/licensing cooperation with NVIDIA. Formal business scenarios need to address licensing, model deployment, hardware costs, privacy compliance, and stability validation in advance.
13. Common Customer Q & A
| Question | Answer |
|---|---|
| Can it be used commercially? | The current Hugging Face model is under the NVIDIA License, which limits it to non-commercial research/evaluation purposes, so it cannot be used commercially as-is. Commercial projects require separately confirmed authorization. |
| What is the difference between it and ordinary object detection? | Ordinary detectors usually have fixed categories; LocateAnything accepts natural-language descriptions of objects and covers GUI, OCR, layout, pointing, etc. |
| What can it output? | It mainly outputs structured text tokens, including ' |
| Can it handle Chinese instructions? | The model card lists the language as English, so Chinese instructions need to be tested; in a project, a translation layer can first convert them to English prompts. |
| How much GPU do you need? | NVIDIA GPUs such as A100/H100/L40/RTX 4090 are officially listed; the specific GPU memory depends on resolution, batch size, mode, and backend. |
| Does it support TensorRT / Triton? | The model card indicates that the current runtime engine is Transformers; TensorRT, TensorRT-LLM, and Triton are not supported yet. |
| Can it be used for automatic GUI clicking? | It is responsible for locating the control; it still needs to be combined with a click executor, permission control, exception confirmation, and business process orchestration. |