ACL 2026 Main

GuideDog:
A Real-World Egocentric Multimodal Dataset
for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim♣* Jaewoo Park♣* Junhee Park Sangeyl Lee
Jiwan Chung Jisung Kim♢† Ji Hoon Joung♡† Youngjae Yu

Yonsei University · LG AI Research · Euler Robotics · Seoul National University

*Equal contribution  ·  †Work done while at SK Telecom

Recent Multimodal Large Language Models (MLLMs) open new possibilities for Blind and Low-Vision (BLV) assistance, but the difficulty of accessibility-aware annotation has kept benchmarks small. We propose GuideDog, a 22K-sample dataset of real-world egocentric walking scenes across 183 locations in 46 countries, with labels structured by three BLV guidance standards (S1/S2/S3) and a 2,106-example human-verified gold subset. We also provide GuideDogQA, an 818-sample benchmark for fine-grained object recognition and relative depth perception. Today's MLLMs still struggle with the spatial half of the task, especially depth-heavy obstacle guidance.

The accessibility-aware guidance task: an MLLM reads a pedestrian scene and produces an S1 description, S2 obstacle info, and S3 summary direction.
The task. Given an egocentric walking frame, an MLLM produces guidance in three parts: S1, describe the surroundings; S2, identify obstacles with their location and proximity; S3, summarize a safe direction.

How GuideDog differs from prior BLV datasets

Dataset | # | Modality | Source | Geo-diverse | Annotation | BLV involvement | Task
VizWiz | 31K | Image | User photos | — | Human | Captured by BLV | VQA
VIALM | 200 | Image | Web | — | Human | Expert manual | Guidance
Merchant et al. | 48 | Image | VizWiz | — | Human | — | Guidance
WalkVLM | 12K | Video | Web + recorded | 10 locations | Human | Survey | Guidance
EgoBlind | 1.3K | Video | Web | — | Human + AI | Partially annotated | Video QA
GuideDog (ours) | 22K | Image | Web | 183 locations | Human + AI | Standards-based | Guidance + QA

Three BLV guidance standards

These standards are distilled from BLV organizations and prior research, and serve as the scaffolding for every annotation in the dataset; a schematic example follows the list below.

S1 Describe the surroundings: orient the user.

S2 Provide obstacle information: type, location, proximity (e.g. "12 o'clock, 4 steps").

S3 Provide a summary and direction: concise, low cognitive load.
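For concreteness, here is a minimal sketch of how one S1/S2/S3 annotation could be represented. The field names (`s1_surroundings`, `Obstacle`, `clock_position`, and so on) and the example values are illustrative assumptions, not the dataset's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Obstacle:
    """S2: one obstacle with its type, clock-face direction, and step-based proximity."""
    kind: str            # e.g. "parked bicycle"
    clock_position: str  # e.g. "12 o'clock"
    steps_away: int      # e.g. 4

@dataclass
class GuidanceAnnotation:
    """One GuideDog-style label, structured by the three BLV guidance standards."""
    s1_surroundings: str                                        # S1: orient the user
    s2_obstacles: List[Obstacle] = field(default_factory=list)  # S2: obstacle information
    s3_summary_direction: str = ""                              # S3: concise, low cognitive load

example = GuidanceAnnotation(
    s1_surroundings="You are on a brick sidewalk beside a two-lane road; shops line the right side.",
    s2_obstacles=[Obstacle(kind="parked bicycle", clock_position="12 o'clock", steps_away=4)],
    s3_summary_direction="Veer slightly left and continue straight for about ten steps.",
)
```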

Structured labels, then human verification

GuideDog is built around three BLV guidance standards (S1/S2/S3). An MLLM first drafts silver labels structured by these standards, and trained annotators then verify and refine them into gold labels. This shifts human effort from free-form writing to checking accessibility-critical details, making large-scale annotation feasible while preserving the BLV-specific structure.

Four-stage GuideDog pipeline: scene image collection, scene information extraction, silver label generation, and human verification into gold labels.
Pipeline. Web-sourced walking videos → balanced frame sampling → standards-grounded silver labels → human verification.
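To illustrate the silver-label drafting step, here is a minimal sketch assuming an OpenAI-style chat API with image input. The prompt wording, the `gpt-4o` model name, and the helper name `draft_silver_label` are placeholders, not the paper's actual prompt or pipeline code; humans still verify every draft afterwards.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SILVER_PROMPT = (
    "You are drafting BLV walking guidance for this egocentric frame. "
    "Answer in three parts: S1 describe the surroundings, "
    "S2 list obstacles with clock-face position and distance in steps, "
    "S3 give one concise direction."
)

def draft_silver_label(image_path: str, model: str = "gpt-4o") -> str:
    """Draft an S1/S2/S3 silver label for one frame; annotators refine it into gold."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SILVER_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```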

Where the data lives

World map and bar chart showing the GuideDog sample distribution across 46 countries and 183 cities.
Geographic coverage. Frames span 183 cities across 46 countries, sampled with per-city/region caps to keep the distribution balanced.
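A sketch of the kind of capped sampling the caption describes, which keeps any single city from dominating the frame pool. The cap value and the record format are assumptions for illustration, not the paper's exact procedure.

```python
import random
from collections import defaultdict

def sample_with_city_cap(frames, cap_per_city=200, seed=0):
    """Randomly keep at most `cap_per_city` frames per city so no location dominates.

    `frames` is an iterable of dicts with at least a "city" key, e.g.
    {"city": "Seoul", "path": "frames/000123.jpg"}.
    """
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for frame in frames:
        by_city[frame["city"]].append(frame)

    balanced = []
    for city, items in by_city.items():
        rng.shuffle(items)
        balanced.extend(items[:cap_per_city])
    rng.shuffle(balanced)
    return balanced
```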

One annotated example

A GuideDog gold-labeled sample: pedestrian image with S1 surroundings description, S2 obstacle list with positions, and S3 summary direction.
A gold-labeled datapoint. Each annotation is structured by S1 / S2 / S3 rather than free-form prose.

Probing perception: GuideDogQA

GuideDogQA strips away the language overhead of guidance generation and tests two perceptual skills essential for BLV navigation: object recognition (which objects are actually in the scene) and relative depth (which of two marked objects is closer).

GuideDogQA tasks. Top: object recognition with four multiple-choice options; bottom: relative depth comparison between two bounding boxes.
GuideDogQA. 435 object questions (which object is in the scene?) and 383 depth questions (which bounding box is closer?), all multiple-choice and drawn from the gold split.
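A minimal sketch of how the two multiple-choice subtasks could be scored. It assumes each item carries an image, a question, four options, a gold answer letter, and a task tag; `ask_model` stands in for whatever MLLM call you evaluate and is not part of the released benchmark code.

```python
from collections import defaultdict

def score_multiple_choice(items, ask_model):
    """Compute per-task accuracy over multiple-choice items.

    Each item is assumed to look like:
    {"image": "...", "question": "...", "options": ["A) ...", ...],
     "answer": "B", "task": "object" or "depth"}.
    `ask_model(image, question, options)` should return an option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = ask_model(item["image"], item["question"], item["options"]).strip().upper()[:1]
        total[item["task"]] += 1
        correct[item["task"]] += int(pred == item["answer"])
    return {task: correct[task] / total[task] for task in total}
```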

What we found

① GPT-4o leads zero-shot guidance, with the largest gap on S2, exactly where depth references matter most.

② Open-source models recognize objects; proprietary models judge depth. On QA, several open-source MLLMs cannot beat random chance at depth comparison.

③ Fine-tuning closes the gap. LoRA fine-tuning of Qwen2.5-VL on our silver labels yields +19.3% on depth comparison with only a 1.8% drop on object recognition, and tops every model on guidance generation (a minimal setup sketch follows).
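For finding ③, here is a minimal sketch of a LoRA setup on Qwen2.5-VL using Hugging Face `transformers` and `peft`. The rank, target modules, and model checkpoint are illustrative defaults, not the paper's exact configuration.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base vision-language model (the 7B Instruct variant shown as an example).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Attach low-rank adapters to the attention projections; only these weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction of weights train
```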

Spatial grounding, especially depth and obstacle localization, remains the hard part for BLV-grade MLLM guidance.

BibTeX

@inproceedings{kim2026guidedog,
  title     = {GuideDog: A Real-World Egocentric Multimodal Dataset for
               Blind and Low-Vision Accessibility-Aware Guidance},
  author    = {Kim, Junhyeok and Park, Jaewoo and Park, Junhee and
               Lee, Sangeyl and Chung, Jiwan and Kim, Jisung and
               Joung, Ji Hoon and Yu, Youngjae},
  booktitle = {Proceedings of the 64th Annual Meeting of the
               Association for Computational Linguistics (ACL)},
  year      = {2026}
}