LocateAnything-3B is a 3B-parameter vision-language grounding model released by NVIDIA on Hugging Face. It targets the task of looking at an image and locating targets from a natural-language description, and it can output target boxes or points. Its core value lies in unifying object detection, referring expression grounding, GUI element grounding, OCR/layout localization, pointing, and similar tasks in a single VLM framework, and in improving localization quality and decoding efficiency through Parallel Box Decoding (PBD). For pre-sales, it suits solution discussions around visual agents, GUI automation, industrial vision, document understanding, and robot/autonomous-driving perception. However, the model license is limited to non-commercial research/evaluation purposes, so it cannot be used directly as a commercial delivery model.

1. Model Overview

| Item | Information |
| --- | --- |
| Model page | nvidia/LocateAnything-3B |
| Official project page | LocateAnything |
| Online demo | Hugging Face Space |
| Code entry | NVlabs/Eagle/Embodied |
| Paper | arXiv:2605.27365 |
| Model type | Transformer-based Vision-Language Model |
| Parameter scale | About 3B; Hugging Face safetensors shows about 3.83B BF16 parameters |
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Vision encoder | MoonViT / MoonViT-SO-400M |
| Pipeline | 'image-text-to-text' |
| Library | Transformers, requires 'trust_remote_code=True' |
| License | NVIDIA License, non-commercial research/evaluation purposes only |
| Hugging Face popularity | About 570k downloads, 2.4k likes (checked 2026-06-27) |
| Last modified | 2026-06-12 (checked 2026-06-27) |
| Release date | The model card marks GitHub / HF / demo / webpage as released on May 26, 2026 |
| Operating platform | Linux; NVIDIA GPU recommended; the documentation lists A100, H100, L40, RTX 4090, Blackwell, etc. |

2. Key Diagrams

Capability overview vs. PBD

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-teaser.jpg]]

This picture works best as a pre-sales opener. The top half shows LocateAnything covering multi-object localization, pointing, layout grounding, GUI grounding, detection, and OCR. The bottom half shows how it differs from traditional coordinate-token decoding: traditional methods generate coordinates token by token, whereas PBD generates each box in parallel as one atomic unit.

Model Architecture and Structured Block Output

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-architecture.png]]

Architecture diagram description: the input is an image plus a text query. The vision encoder is MoonViT, the language side is Qwen2.5, and the two are connected by a two-layer MLP projector. The output is not ordinary natural language but a structured localization sequence made up of semantic blocks, box blocks, negative blocks, and an end block.

Parallel Box Decoding method diagram

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-method-pbd.png]]

This diagram is suited to explaining the core technical innovation: treating the coordinates of a box as one coupled geometric structure rather than as independent tokens. For localization tasks, this helps reduce the latency and geometric inconsistency that come with generating coordinates sequentially.

Data Coverage

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/nvidia-data.png]]

The official data overview shows that the training data covers natural scenes, robotics, driving, GUI, documents, OCR, open-world detection, and other domains. The model card states that the training set contains 12M unique images, about 140M natural-language queries, and 785M bounding boxes.

GUI / OCR / Detection Benchmark Charts

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-sspro-gui.png]]

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-layout-ocr.png]]

![[17-TEMPORARY ATTACHMENT/LocateAnything-3B/hf-coco-lvis.png]]

These charts cover GUI grounding, layout/OCR, and general-purpose object detection. They are suitable for customer materials because they show that the model is not a single-purpose detector but a general-purpose grounding model for multiple visual localization tasks.

3. What does it do?

The core task of LocateAnything-3B can be summarized as follows: give it an image and a natural-language instruction, and it outputs the position of the target in the image, usually as a bounding box or a point.

| Capability | Example | Business value |
| --- | --- | --- |
| Open-category object detection | "Locate all person, car, bicycle" | No need to train a traditional detector separately for each category; better suited to long-tail objects |
| Referring expression grounding | "Locate the person in red", "the second car on the left" | Supports more natural human-computer interaction and practical visual question answering |
| Dense multi-object detection | Locate a large number of objects in crowded scenes | Suitable for security, traffic, remote sensing, warehouse inventory, etc. |
| GUI element grounding | "Locate search button", "Click on the crop tool" | Suitable for visual GUI agents, RPA enhancement, Computer Use |
| OCR / text localization | "Detect all text", "Locate a street sign text" | Document understanding, invoice recognition, scene text detection |
| Document layout grounding | Locate headings, paragraphs, formulas, tables, and other regions | Suitable for intelligent PDF/document parsing and layout restoration |
| Pointing | "Point to the traffic light" | Suitable for robots, remote control, and embodied-intelligence interaction |
| Robot / autonomous driving perception | Recognizes objects in space and outputs their locations | Candidate technology for Physical AI perception modules |

4. Core Technology Highlights

4.1 Parallel Box Decoding

Traditional VLM grounding usually serializes 2D box coordinates into multiple 1D tokens and then generates them autoregressively, token by token. This has two problems. First, the coordinates of a box are a naturally coupled geometric structure, and token-by-token generation weakens geometric consistency. Second, strictly serial generation becomes the inference bottleneck.

LocateAnything's PBD decodes boxes or points in parallel as atomic geometric units. According to the official paper abstract, PBD improves decoding throughput and localization accuracy at the same time, advancing the speed-accuracy frontier.

4.2 Three inference modes

| Mode | Description | Suitable scenario |
| --- | --- | --- |
| 'fast' | MTP only, no fallback to autoregressive decoding | Simple scenarios where speed matters most |
| 'slow' | Pure autoregressive decoding | Offline tasks that prioritize stability and accuracy |
| 'hybrid' | Default mode. Decodes in parallel first and falls back to autoregressive decoding on format anomalies or spatial ambiguity | Recommended default; balances speed and accuracy |

The model card recommends using 'max_new_tokens=8192' and 'generation_mode="hybrid"' to avoid output truncation and to balance speed and accuracy.
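
The control flow behind the 'hybrid' mode can be illustrated with a small sketch. This is not the model's actual implementation: `fast_decode` and `slow_decode` are hypothetical stand-ins for the parallel and autoregressive paths, and the validity check only covers format anomalies (malformed boxes), not the spatial-ambiguity criterion, whose details the source does not describe.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed order: (x1, y1, x2, y2), normalized to [0, 1000]


def is_well_formed(boxes: Optional[List[Box]]) -> bool:
    """Return True when every box has 4 in-range values and positive width and height."""
    if boxes is None:
        return False
    return all(
        len(b) == 4
        and all(0 <= v <= 1000 for v in b)
        and b[0] < b[2]
        and b[1] < b[3]
        for b in boxes
    )


def decode_hybrid(
    fast_decode: Callable[[str], Optional[List[Box]]],
    slow_decode: Callable[[str], List[Box]],
    query: str,
) -> Tuple[List[Box], str]:
    """Try the parallel path first; fall back to the autoregressive path if its output is malformed."""
    boxes = fast_decode(query)
    if is_well_formed(boxes):
        return boxes, "fast"
    return slow_decode(query), "slow"
```

The second element of the return value records which path produced the result, which is useful when comparing how often a workload triggers the fallback.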

4.3 Large-scale multi-domain data

The model card and project page highlight the large-scale, multi-domain training data in LocateAnything-Data:

| Dimension | Official description |
| --- | --- |
| Image scale | 12M unique images |
| Query scale | The model card states about 140M natural-language queries; the project page and paper abstract emphasize 138M training samples / language queries |
| Box scale | 785M bounding boxes |
| Data domains | grounding, open-world grounding, dense detection, scene text, GUI, document layout, OCR, robotics/driving, etc. |
| Labeling methods | Manual labeling, open-source labels, model-assisted labeling, composite labeling, and automatic verification |

4.4 High-throughput batch inference tool

The model repository provides not only the model weights but also 'batch_infer.py', 'batch_utils', and 'kernel_utils'. The 'la_flash' backend uses a FlashAttention varlen sparse range plan; the goal is to avoid constructing a dense '[B, H, Q, K]' attention mask.
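
To see why a dense '[B, H, Q, K]' mask is worth avoiding, a rough size estimate helps. The sketch below is plain arithmetic; the batch size, head count, and sequence lengths are illustrative assumptions, not values taken from the model configuration.

```python
def dense_mask_bytes(batch: int, heads: int, q_len: int, k_len: int,
                     bytes_per_element: int = 1) -> int:
    """Memory needed to materialize a dense [B, H, Q, K] attention mask."""
    return batch * heads * q_len * k_len * bytes_per_element


# Assumed example: 1 image, 16 heads, 16384 query and key tokens, 1-byte boolean mask.
size = dense_mask_bytes(batch=1, heads=16, q_len=16384, k_len=16384)
print(size / 2**30)  # 4.0 (GiB)
```

Because the cost grows with the product of query and key lengths, doubling the token count of a high-resolution input quadruples the mask size, which is the pressure a sparse range plan is meant to relieve.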

Example A100 4K image probe given by the model card:

| Backend | Attention path | Time | Peak reserved memory |
| --- | --- | --- | --- |
| 'sdpa' | Dense SDPA masks | 8.2600 s | 35.12 GB |
| 'la_flash' | FlashAttention sparse range plan | 8.0314 s | 11.71 GB |

Pre-sales explanation: the release is not just a set of model weights; it also includes inference engineering optimizations for high-resolution, multi-target batch detection.
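
The A100 probe figures above can be turned into relative savings with a few lines of arithmetic. The numbers are the ones quoted from the model card; the helper name is only illustrative.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


memory_saving = relative_reduction(35.12, 11.71)  # peak reserved memory, GB
time_saving = relative_reduction(8.2600, 8.0314)  # wall time, seconds

print(f"memory: {memory_saving:.1%}")  # memory: 66.7%
print(f"time: {time_saving:.1%}")      # time: 2.8%
```

In this probe the benefit of 'la_flash' is mainly memory (about two thirds less peak reserved memory), while the latency difference is small.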

5. Applicable Scenarios

5.1 GUI Agent / Computer Use

LocateAnything can perform GUI element grounding: given a screenshot and a natural-language instruction, it locates a button, menu, icon, or region. It can serve as the perception module of a visual GUI agent, helping the agent move from "understanding the screenshot" to "knowing where to click".

Suitable scenarios:

| Scenario | Value |
| --- | --- |
| Desktop/Web automation | Locate elements visually when the DOM is missing or the page structure is unavailable |
| Software testing | Locate UI controls from natural-language test steps |
| RPA enhancement | Upgrade RPA from coordinate recording to semantic control localization |
| Remote O&M | Locate operation targets in screenshots/video streams |

5.2 Document Understanding and OCR/Layout

It supports layout grounding and OCR localization and can be used to locate headings, paragraphs, formulas, tables, text blocks, specific fields, and more.

Suitable scenarios:

| Scenario | Value |
| --- | --- |
| PDF/scan parsing | Locate fields and regions in complex layouts |
| Invoice/contract processing | Find the location of key fields to assist structured extraction |
| Document review | Identify document regions and combine them with rules/Q&A |
| Knowledge base construction | Improve chunking and layout understanding for documents that mix text and graphics |

5.3 Industrial vision and quality inspection

Open-category detection and referring expression grounding suit industrial vision PoCs, especially when the customer has many object types, many long-tail defects, and high maintenance costs for traditional detectors.

Note: the model license does not allow direct commercial deployment. The model should be used for technical validation or as a selection reference, or a commercial license should be discussed with NVIDIA.

5.4 Robotics / Embodied Intelligence / Autonomous Driving

Both the project page and the model card mention robotics, driving, and Physical AI. LocateAnything can act as a bridge module that turns verbal instructions into visual locations, for example "grab the left cup", "locate the red button", or "point to the intersection signal".

5.5 Automatic labeling and data production

It can be used to generate candidate annotations for grounding, detection, and pointing, which are then reviewed manually or by rules. This suits training data preparation, long-tail object labeling, and GUI dataset construction.
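
As a sketch of the candidate-annotation step, the function below converts pixel boxes into COCO-style annotation records, whose 'bbox' field uses the `[x, y, width, height]` layout. The input structure (a dict of label to `(x1, y1, x2, y2)` pixel boxes) and the function name are assumptions for illustration; the model's raw output would first need to be parsed and converted to pixels.

```python
from typing import Dict, List, Tuple

PixelBox = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels


def to_coco_annotations(
    boxes_by_label: Dict[str, List[PixelBox]],
    image_id: int,
    category_ids: Dict[str, int],
    start_id: int = 1,
) -> List[dict]:
    """Build COCO-style annotation dicts, skipping degenerate boxes."""
    annotations = []
    next_id = start_id
    for label, boxes in boxes_by_label.items():
        for x1, y1, x2, y2 in boxes:
            width, height = x2 - x1, y2 - y1
            if width <= 0 or height <= 0:
                continue  # leave malformed candidates out of the review queue
            annotations.append({
                "id": next_id,
                "image_id": image_id,
                "category_id": category_ids[label],
                "bbox": [x1, y1, width, height],
                "area": width * height,
                "iscrowd": 0,
            })
            next_id += 1
    return annotations
```

Records in this shape can be loaded into common annotation tools, so reviewers correct candidates instead of drawing every box from scratch.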

6. Unsuitable Scenarios

| Scenario | Reason |
| --- | --- |
| Direct commercial delivery | The NVIDIA License expressly restricts use to non-commercial research/evaluation; commercial use is not permitted, except for NVIDIA and its affiliates |
| CPU-only or low-end device deployment | Official materials emphasize NVIDIA GPU-accelerated systems; a 3B VLM with high-resolution input requires GPU memory and compute |
| Extremely low-latency edge | Although PBD and the batch tools are available, the 3B model still needs quantization, compression, distillation, or similar optimization on embedded platforms |
| Production systems requiring strict safety certification | The model card advises iterative testing and validation on use-case-specific data |
| Text-only scenarios without image/screenshot input | It is a visual grounding model, not a general-purpose text LLM |
| High-risk automated decisions | Localization results may be wrong, so manual review or system-level safety redundancy is required |
| Multilingual grounding | The model card lists English as the language, and prompts/queries are mainly English task expressions |

7. How to Use

7.1 Installing dependencies

The model card gives the base dependency:

pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5

PyTorch needs to be installed separately to match the CUDA version. On Hopper / Blackwell GPUs, MagiAttention can optionally be installed for faster MTP inference; when it is not installed, inference falls back to PyTorch SDPA.

7.2 Python Call Method

The model card provides a 'LocateAnythingWorker' pattern: load the tokenizer, processor, and model at startup, and then serve requests for detection, grounding, OCR, GUI grounding, etc. through 'predict()' and the task methods.

Simplified example:

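# LocateAnythingWorker is the worker class from the model card example;
# define it (or import it from wherever it was saved) before running this snippet.
from PIL import Image

# Load the tokenizer, processor, and model once at startup.
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Open-category detection: one call for several categories.
result = worker.detect(img, ["person", "car", "bicycle"])
print(result["answer"])

# GUI grounding with a point (click target) instead of a box.
result = worker.ground_gui(img, "the search button", output_type="point")
print(result["answer"])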

The box and point coordinates in the output are tokens normalized to the range '[0, 1000]' and must be converted to pixel coordinates before use.
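
A minimal sketch of that conversion is shown below. It assumes the coordinates have already been parsed into numbers and that boxes are ordered `(x1, y1, x2, y2)`; the source only states the '[0, 1000]' range, so the ordering and the function names are assumptions to check against the model card.

```python
from typing import Sequence, Tuple

NORM_RANGE = 1000  # coordinates are normalized to [0, 1000] per the model card


def _scale(value: float, size: int) -> int:
    """Scale one normalized coordinate to pixels and clamp it to [0, size]."""
    return min(max(round(value * size / NORM_RANGE), 0), size)


def box_to_pixels(box: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Map an assumed (x1, y1, x2, y2) normalized box to pixel coordinates."""
    x1, y1, x2, y2 = box
    left, right = sorted((_scale(x1, width), _scale(x2, width)))
    top, bottom = sorted((_scale(y1, height), _scale(y2, height)))
    return left, top, right, bottom


def point_to_pixels(point: Sequence[float], width: int, height: int) -> Tuple[int, int]:
    """Map a normalized (x, y) point to a valid pixel index inside the image."""
    x, y = point
    return min(_scale(x, width), width - 1), min(_scale(y, height), height - 1)


print(box_to_pixels((250, 100, 750, 900), width=1920, height=1080))  # (480, 108, 1440, 972)
print(point_to_pixels((500, 500), width=1920, height=1080))          # (960, 540)
```

Use the size of the original image (not a resized copy) as `width` and `height`, so that boxes and click points land on the pixels the user actually sees.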

7.3 Supported Prompt Templates

| Task | Prompt template | Output |
| --- | --- | --- |
| Object Detection | 'Locate all the instances that matches the following description: [CATEGORIES].' | Box |
| Phrase Grounding | 'Locate a single instance that matches the following description: [PHRASE].' | Single box |
| Multi Phrase Grounding | 'Locate all the instances that match the following description: [PHRASE].' | Multiple boxes |
| Text Grounding | 'Please locate the text referred as [PHRASE].' | Box |
| Scene Text Detection | 'Detect all the text in box format.' | Box |
| GUI Grounding | 'Locate the region that matches the following description: [PHRASE].' | Box |
| GUI Pointing | 'Point to: [PHRASE].' | Point |
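
The templates above can be kept in one place so that application code never hand-writes prompts. The sketch below copies the template wording verbatim from the table; the task keys and the `build_prompt` helper are hypothetical names, and how multiple categories are joined inside '[CATEGORIES]' is left to the caller because the source does not show the separator.

```python
PROMPT_TEMPLATES = {
    "object_detection": "Locate all the instances that matches the following description: {target}.",
    "phrase_grounding": "Locate a single instance that matches the following description: {target}.",
    "multi_phrase_grounding": "Locate all the instances that match the following description: {target}.",
    "text_grounding": "Please locate the text referred as {target}.",
    "scene_text_detection": "Detect all the text in box format.",
    "gui_grounding": "Locate the region that matches the following description: {target}.",
    "gui_pointing": "Point to: {target}.",
}


def build_prompt(task: str, target: str = "") -> str:
    """Fill the template for `task`; raise if the task is unknown or a target is missing."""
    if task not in PROMPT_TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    template = PROMPT_TEMPLATES[task]
    if "{target}" in template and not target.strip():
        raise ValueError(f"task '{task}' needs a target description")
    return template.format(target=target.strip())


print(build_prompt("gui_pointing", "the search button"))  # Point to: the search button.
```

Centralizing the templates also makes it easy to add a translation layer in front of `build_prompt` when the incoming instructions are not in English.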

7.4 Batch inference

python batch_infer.py \
  --model nvidia/LocateAnything-3B \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image /path/to/image.jpg \
  --query "personcar"

This mode is suitable for offline batch detection, automatic labeling, and evaluation; it is an inference path and is not intended for training.

8. Pre-Sales Talking Points

Business-oriented

| Customer concern | Suggested talking point |
| --- | --- |
| Can visual AI understand natural language? | "LocateAnything can directly convert natural-language descriptions into positions in images, such as locating buttons, text, objects, and document regions." |
| Why does UI automation need it? | "When the DOM is not available and the interface is a remote desktop, picture, or video stream, visual grounding can tell the agent where to click." |
| Industrial vision has many object types | "It is not a traditional fixed-category detector but an open-category, natural-language-driven localization model, suitable for long-tail objects and fast PoCs." |
| Document scenarios are complex | "It locates layout regions, OCR text, and layout elements, and can be combined with OCR/LLM extraction pipelines." |
| Performance highlights | "Its core technique, PBD, decodes boxes in parallel instead of generating coordinates token by token, improving throughput while maintaining geometric consistency." |

Technology-oriented

| Technical question | Suggested answer |
| --- | --- |
| How to integrate the model | "It uses custom Transformers code loaded through AutoModel/AutoProcessor; a BF16 GPU is recommended." |
| How to use the output | "The output is structured tokens; the coordinates need to be parsed and mapped to the original image pixels." |
| How to deploy | "It can be packaged as FastAPI/gRPC workers; batch_infer, la_flash, and MagiAttention can be evaluated for high-throughput scenarios." |
| Is it commercially usable? | "The current public model license does not allow commercial use. Commercial projects need to be licensed separately or use it only as a technical validation reference." |
| Difference from Grounding DINO-class models | "Grounding DINO is more of a dedicated detection/grounding model; LocateAnything is a VLM-style unified generation framework covering GUI, OCR, layout, pointing, and more task forms." |

9. PoC Recommendations

9.1 Recommended PoC directions

| PoC | Input | Output | Validation metrics |
| --- | --- | --- | --- |
| GUI control localization | Application screenshot and operation description | Button/region box or point | Click hit rate, task success rate |
| Document layout localization | PDF page screenshot and field description | Field/paragraph/table location | IoU, field recall, extraction accuracy |
| Industrial defect/object localization | Production-line image and object/defect description | Detection box | mAP, IoU, miss rate, false detection rate |
| Remote sensing / traffic dense detection | High-resolution image and category | Multi-object boxes | Recall, dense-scene throughput |
| Automatic annotation | Images to annotate and category/description | Candidate annotations | Manual correction rate, annotation efficiency gain |

9.2 PoC Design Recommendations

| Item | Recommendation |
| --- | --- |
| Data volume | Start with 50-200 representative images covering easy, medium, and hard scenes |
| Annotation | Create a small batch of manual ground truth and evaluate with IoU / point-in-mask / hit rate |
| Mode | Use 'hybrid' by default; compare the speed and accuracy of 'fast' and 'slow' |
| Resources | Prefer H100/A100/L40/RTX 4090; record GPU memory, latency, and throughput |
| Security | Do not connect directly to the production control path; validate first in a sandbox or offline evaluation |
| License | Make clear that the PoC is for research/evaluation only; commercial deployment requires confirmed authorization |
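
For the latency and throughput records recommended above, a small timing harness is usually enough during a PoC. The sketch below uses only the standard library; `predict` is a hypothetical callable that wraps whatever inference call the PoC uses.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Sequence


def benchmark(predict: Callable[[Any], Any], inputs: Sequence[Any], warmup: int = 1) -> Dict[str, float]:
    """Time `predict` on each input and report latency and throughput."""
    for item in inputs[:warmup]:
        predict(item)  # warm-up calls are excluded from the statistics

    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        predict(item)
        latencies.append(time.perf_counter() - t0)
    total = max(time.perf_counter() - start, 1e-12)  # guard against division by zero

    return {
        "mean_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
        "throughput_per_s": len(inputs) / total,
    }
```

Run it once per mode ('fast', 'slow', 'hybrid') on the same image set so the numbers are comparable. On a CUDA GPU, peak memory can be read separately with PyTorch's `torch.cuda.max_memory_reserved()`, and GPU work should be synchronized inside `predict` so the timings reflect completed inference.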

9.3 Example Acceptance Metrics

| Metric | Suggested target |
| --- | --- |
| GUI click hit rate | Common controls > 85%; complex/occluded scenes analyzed separately |
| Box IoU@0.5 | Set per business scenario; first look at the improvement over the traditional solution |
| Click hit rate | Counted as a hit when the point falls within the target mask/box |
| Inference latency | Record separately for single image, batch, and high resolution |
| Manual labeling efficiency | Candidate box usable rate, reduction in manual correction time |
| Failure types | Categorized statistics for small targets, occlusion, reflection, dense overlap, blurred text, etc. |
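
The box and click metrics in this table can be computed with a few helper functions. This is a generic sketch of IoU and point-in-box hit rate on pixel coordinates, assuming `(x1, y1, x2, y2)` boxes and one ground-truth box per predicted point; mask-based hit testing would need the mask data instead.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
Point = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def point_hit(point: Point, box: Box) -> bool:
    """True when the point lies inside the box (edges included)."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def hit_rate(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Share of predicted points that fall inside their ground-truth box."""
    if not points:
        return 0.0
    hits = sum(point_hit(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


def accuracy_at_iou(preds: Sequence[Box], truths: Sequence[Box], threshold: float = 0.5) -> float:
    """Share of predictions whose IoU with the paired ground truth reaches the threshold."""
    if not preds:
        return 0.0
    return sum(iou(p, t) >= threshold for p, t in zip(preds, truths)) / len(preds)
```

For example, two 10x10 boxes offset by 5 pixels horizontally overlap in a 5x10 region, giving an IoU of 50/150, which is below the 0.5 threshold.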

10. Risks and Considerations

| Risk | Description | Recommendation |
| --- | --- | --- |
| License restrictions | The NVIDIA License restricts use to non-commercial research/evaluation; commercial use is not allowed | State this clearly in pre-sales; commercial projects need to negotiate authorization or use a commercial model |
| Model card makes no general production commitment | It is clearly a research model variant and requires use-case-specific testing | Run a PoC first; do not promise production results directly |
| High compute requirements | A 3B VLM with high-resolution images requires substantial GPU compute and memory | Do hardware sizing; evaluate quantization/distillation/cropping |
| Custom code loading | Requires 'trust_remote_code=True', which calls for a supply-chain security review | Mirror on the enterprise intranet, audit the code, and pin a fixed commit |
| Coordinate parsing and post-processing | The output is text tokens that must be parsed, mapped, and filtered | Encapsulate a stable parser with exception handling |
| Mislocalization risk | Visual grounding may be affected by occlusion, blur, and small targets | Manual validation, rule checks, and multi-model cross-validation |
| Privacy and compliance | Input images may contain faces, health information, and trade secrets | Data masking, access control, and log governance |
| Language coverage | Mainly English prompts | Chinese scenarios need testing or a prompt translation layer |
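
One way to act on the "fixed commit" recommendation is to pass a full commit hash as the `revision` argument of Transformers' `from_pretrained`. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the commit hash must come from your own audit of the repository, and whether `AutoModel` is the right entry class should be confirmed against the model card.

```python
def pinned_load_kwargs(commit_sha: str) -> dict:
    """Build from_pretrained kwargs that pin the repository to one audited commit."""
    is_hex = all(c in "0123456789abcdef" for c in commit_sha)
    if len(commit_sha) != 40 or not is_hex:
        raise ValueError("expected a full 40-character lowercase git commit hash")
    return {"revision": commit_sha, "trust_remote_code": True}


def load_pinned(model_id: str, commit_sha: str):
    """Load processor and model at the pinned commit (heavy imports kept local)."""
    import torch
    from transformers import AutoModel, AutoProcessor

    kwargs = pinned_load_kwargs(commit_sha)
    processor = AutoProcessor.from_pretrained(model_id, **kwargs)
    model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, **kwargs)
    return processor, model
```

Requiring a full hash rather than a branch name such as 'main' means a later push to the repository cannot silently change the remote code that gets executed.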

11. Relationship with related technologies

| Technology | Relationship with LocateAnything |
| --- | --- |
| Grounding DINO | Classic open-vocabulary detection/grounding model; LocateAnything puts more emphasis on unified VLM generation, PBD, and multi-task coverage |
| SAM / SAM 3 | SAM focuses on segmentation; LocateAnything focuses on natural language to box/point and can serve as a SAM prompt generator |
| OCR engine | The OCR engine handles text recognition; LocateAnything can supplement it with text-region localization and layout grounding |
| Multimodal large models | General-purpose VLMs can understand images; LocateAnything focuses more on high-quality visual localization output |
| RPA / GUI Agent | LocateAnything can serve as the visual localization module, combined with an action executor and a workflow orchestrator |
| Traditional detectors | Traditional detectors require training on fixed classes; LocateAnything is better suited to open classes and natural-language descriptions |

12. My Pre-Sales Judgment

LocateAnything-3B is a good model for conversations about visual agents, Physical AI, and GUI grounding. It ties together several customer concerns: AI needs not only to understand the image but also to tell the system where the target is, and it should locate not only common objects but also GUI controls, document regions, OCR text, and dense targets through natural language.

Its pre-sales value lies in enabling a strong demo: the customer types a natural-language description, and the model locates the target directly in a complex image. For GUI automation, document intelligence, industrial quality inspection, robotics, and autonomous driving perception, these capabilities are very intuitive.

However, it is not currently suitable as an open-source model for direct commercial delivery, and the license restriction is the decisive reason. A better positioning is research evaluation, PoC validation, solution prototyping, technical route selection, or an entry point for ecosystem/licensing cooperation with NVIDIA. Formal business scenarios need to resolve licensing, model deployment, hardware cost, privacy compliance, and stability validation in advance.

13. Common Customer Q & A

| Question | Answer |
| --- | --- |
| Can it be used commercially? | The current Hugging Face model is under the NVIDIA License, which limits it to non-commercial research/evaluation purposes, so it cannot be used commercially as-is. Commercial projects require separately confirmed authorization. |
| How is it different from ordinary object detection? | Ordinary detectors usually cover fixed categories; LocateAnything lets you describe objects in natural language and covers GUI, OCR, layout, pointing, etc. |
| What can it output? | It mainly outputs structured text tokens containing box coordinates or points, which need to be parsed into pixel coordinates for use. |
| Can it handle Chinese instructions? | The model card lists English as the language, so Chinese instructions need to be tested; in a project, a translation layer can first convert them into English prompts. |
| How much GPU does it need? | NVIDIA GPUs such as A100/H100/L40/RTX 4090 are officially listed; the actual GPU memory depends on resolution, batch size, mode, and backend. |
| Does it support TensorRT / Triton? | The model card indicates that the current runtime engine is Transformers; TensorRT, TensorRT-LLM, and Triton are not supported yet. |
| Can it be used for automatic GUI clicking? | It is responsible for locating the control; it still needs to be combined with a click executor, permission control, exception confirmation, and business process orchestration. |