ACL 2026 Main

GuideDog:
A Real-World Egocentric Multimodal Dataset
for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim♣* Jaewoo Park♣* Junhee Park Sangeyl Lee
Jiwan Chung Jisung Kim♢† Ji Hoon Joung♡† Youngjae Yu

Yonsei University · LG AI Research · Euler Robotics · Seoul National University

*Equal contribution  ·  †Work done while at SK Telecom

Recent Multimodal Large Language Models (MLLMs) open new possibilities for Blind and Low-Vision (BLV) assistance, but the difficulty of accessibility-aware annotation has kept benchmarks small. We propose GuideDog, a 22K-sample dataset of real-world egocentric walking scenes across 183 locations in 46 countries, with labels structured by three BLV guidance standards (S1/S2/S3) and a 2,106-example human-verified gold subset. We also provide GuideDogQA, an 818-sample benchmark for fine-grained object recognition and relative depth perception. Today's MLLMs still struggle with the spatial half of the task, especially depth-heavy obstacle guidance.

The accessibility-aware guidance task: an MLLM reads a pedestrian scene and produces an S1 description, S2 obstacle info, and S3 summary direction.
The task. Given an egocentric walking frame, an MLLM produces guidance in three parts: S1, describe the surroundings; S2, identify obstacles with their location and proximity; S3, summarize a safe direction.

How GuideDog differs from prior BLV datasets

Dataset | # | Modality | Source | Geo-diverse | Annotation | BLV involvement | Task
VizWiz | 31K | Image | User photos | — | Human | Captured by BLV | VQA
VIALM | 200 | Image | Web | — | Human | Expert manual | Guidance
Merchant et al. | 48 | Image | VizWiz | — | Human | — | Guidance
WalkVLM | 12K | Video | Web + recorded | 10 locations | Human | Survey | Guidance
EgoBlind | 1.3K | Video | Web | — | Human + AI | Partially annotated | Video QA
GuideDog (ours) | 22K | Image | Web | 183 locations | Human + AI | Standards-based | Guidance + QA

Three BLV guidance standards

These standards are distilled from BLV organizations and prior research, and serve as the scaffolding for every annotation in the dataset; a schematic example follows the list below.

S1 Describe the surroundings: orient the user.

S2 Provide obstacle information: type, location, proximity (e.g. "12 o'clock, 4 steps").

S3 Provide a summary and direction: concise, low cognitive load.
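For concreteness, here is a minimal sketch of how one S1/S2/S3 annotation could be represented. The field names (`s1_surroundings`, `Obstacle`, `clock_position`, and so on) and the example values are illustrative assumptions, not the dataset's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Obstacle:
    """S2: one obstacle with its type, clock-face direction, and step-based proximity."""
    kind: str            # e.g. "parked bicycle"
    clock_position: str  # e.g. "12 o'clock"
    steps_away: int      # e.g. 4

@dataclass
class GuidanceAnnotation:
    """One GuideDog-style label, structured by the three BLV guidance standards."""
    s1_surroundings: str                                        # S1: orient the user
    s2_obstacles: List[Obstacle] = field(default_factory=list)  # S2: obstacle information
    s3_summary_direction: str = ""                              # S3: concise, low cognitive load

example = GuidanceAnnotation(
    s1_surroundings="You are on a brick sidewalk beside a two-lane road; shops line the right side.",
    s2_obstacles=[Obstacle(kind="parked bicycle", clock_position="12 o'clock", steps_away=4)],
    s3_summary_direction="Veer slightly left and continue straight for about ten steps.",
)
```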

Structured labels, then human verification

GuideDog is built around three BLV guidance standards (S1/S2/S3). An MLLM first drafts silver labels structured by these standards, and trained annotators then verify and refine them into gold labels. This shifts human effort from free-form writing to checking accessibility-critical details, making large-scale annotation feasible while preserving the BLV-specific structure.

Four-stage GuideDog pipeline: scene image collection, scene information extraction, silver label generation, and human verification into gold labels.
Pipeline. Web-sourced walking videos → balanced frame sampling → standards-grounded silver labels → human verification.
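To illustrate the silver-label drafting step, here is a minimal sketch assuming an OpenAI-style chat API with image input. The prompt wording, the `gpt-4o` model name, and the helper name `draft_silver_label` are placeholders, not the paper's actual prompt or pipeline code; humans still verify every draft afterwards.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SILVER_PROMPT = (
    "You are drafting BLV walking guidance for this egocentric frame. "
    "Answer in three parts: S1 describe the surroundings, "
    "S2 list obstacles with clock-face position and distance in steps, "
    "S3 give one concise direction."
)

def draft_silver_label(image_path: str, model: str = "gpt-4o") -> str:
    """Draft an S1/S2/S3 silver label for one frame; annotators refine it into gold."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SILVER_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```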

Where the data lives

World map and bar chart showing the GuideDog sample distribution across 46 countries and 183 cities.
Geographic coverage. Frames span 183 cities across 46 countries, sampled with per-city/region caps to keep the distribution balanced.
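A sketch of the kind of capped sampling the caption describes, which keeps any single city from dominating the frame pool. The cap value and the record format are assumptions for illustration, not the paper's exact procedure.

```python
import random
from collections import defaultdict

def sample_with_city_cap(frames, cap_per_city=200, seed=0):
    """Randomly keep at most `cap_per_city` frames per city so no location dominates.

    `frames` is an iterable of dicts with at least a "city" key, e.g.
    {"city": "Seoul", "path": "frames/000123.jpg"}.
    """
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for frame in frames:
        by_city[frame["city"]].append(frame)

    balanced = []
    for city, items in by_city.items():
        rng.shuffle(items)
        balanced.extend(items[:cap_per_city])
    rng.shuffle(balanced)
    return balanced
```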

One annotated example

A GuideDog gold-labeled sample: pedestrian image with S1 surroundings description, S2 obstacle list with positions, and S3 summary direction.
A gold-labeled datapoint. Each annotation is structured by S1 / S2 / S3 rather than free-form prose.

Probing perception: GuideDogQA

GuideDogQA strips away the language overhead of guidance generation and tests two perceptual skills essential for BLV navigation: object recognition (which objects are actually in the scene) and relative depth (which of two marked objects is closer).

GuideDogQA tasks. Top: object recognition with four multiple-choice options; bottom: relative depth comparison between two bounding boxes.
GuideDogQA. 435 object questions (which object is in the scene?) and 383 depth questions (which bounding box is closer?), all multiple-choice and drawn from the gold split.
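A minimal sketch of how the two multiple-choice subtasks could be scored. It assumes each item carries an image, a question, four options, a gold answer letter, and a task tag; `ask_model` stands in for whatever MLLM call you evaluate and is not part of the released benchmark code.

```python
from collections import defaultdict

def score_multiple_choice(items, ask_model):
    """Compute per-task accuracy over multiple-choice items.

    Each item is assumed to look like:
    {"image": "...", "question": "...", "options": ["A) ...", ...],
     "answer": "B", "task": "object" or "depth"}.
    `ask_model(image, question, options)` should return an option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = ask_model(item["image"], item["question"], item["options"]).strip().upper()[:1]
        total[item["task"]] += 1
        correct[item["task"]] += int(pred == item["answer"])
    return {task: correct[task] / total[task] for task in total}
```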

What we found

① GPT-4o leads zero-shot guidance, with the largest gap on S2, exactly where depth references matter most.

② Open-source models recognize objects; proprietary models judge depth. On QA, several open-source MLLMs cannot beat random chance at depth comparison.

③ Fine-tuning closes the gap. LoRA fine-tuning of Qwen2.5-VL on our silver labels yields +19.3% on depth comparison with only a 1.8% drop on object recognition, and tops every model on guidance generation (a minimal setup sketch follows).
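For finding ③, here is a minimal sketch of a LoRA setup on Qwen2.5-VL using Hugging Face `transformers` and `peft`. The rank, target modules, and model checkpoint are illustrative defaults, not the paper's exact configuration.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base vision-language model (the 7B Instruct variant shown as an example).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Attach low-rank adapters to the attention projections; only these weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction of weights train
```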

Spatial grounding, especially depth and obstacle localization, remains the hard part for BLV-grade MLLM guidance.

BibTeX

@inproceedings{kim2026guidedog,
  title     = {GuideDog: A Real-World Egocentric Multimodal Dataset for
               Blind and Low-Vision Accessibility-Aware Guidance},
  author    = {Kim, Junhyeok and Park, Jaewoo and Park, Junhee and
               Lee, Sangeyl and Chung, Jiwan and Kim, Jisung and
               Joung, Ji Hoon and Yu, Youngjae},
  booktitle = {Proceedings of the 64th Annual Meeting of the
               Association for Computational Linguistics (ACL)},
  year      = {2026}
}