💡 VGGT Architecture Review

VGGT can be viewed as a Backbone (encoder) plus several Heads (decoders).


| Component | Role |
| --- | --- |
| Aggregator (backbone, encoder) | Takes the image sequence as input and extracts high-dimensional features |
| Head (decoders) | Take those features and produce task-specific outputs such as camera pose and depth |

VGGT is a multi-head encoder-decoder structure: a shared encoder (Transformer) branches into multiple decoders (Heads).
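To make the "shared encoder, multiple heads" pattern concrete, here is a minimal pure-Python sketch. It is not VGGT's actual code: `MultiHeadModel`, `toy_encoder`, `toy_camera_head`, and `toy_depth_head` are hypothetical names, and the "features" are toy numbers standing in for Transformer tokens. The only point it illustrates is that the encoder runs once and every head consumes the same features.

```python
from typing import Callable, Dict, List

Features = List[List[float]]  # toy stand-in: one small feature vector per frame


class MultiHeadModel:
    """Shared encoder whose output is fanned out to several task heads."""

    def __init__(self, encoder: Callable, heads: Dict[str, Callable]):
        self.encoder = encoder
        self.heads = heads

    def forward(self, frames: List[List[float]]) -> Dict[str, list]:
        features = self.encoder(frames)  # the backbone is evaluated once
        # Every head receives the same shared features.
        return {name: head(features) for name, head in self.heads.items()}


def toy_encoder(frames: List[List[float]]) -> Features:
    # Hypothetical backbone: per-frame [mean, max] instead of real tokens.
    return [[sum(f) / len(f), max(f)] for f in frames]


def toy_camera_head(features: Features) -> List[float]:
    # Hypothetical head: reads the first feature channel of each frame.
    return [feat[0] for feat in features]


def toy_depth_head(features: Features) -> List[float]:
    # Hypothetical head: reads the second feature channel of each frame.
    return [feat[1] for feat in features]


model = MultiHeadModel(
    toy_encoder,
    {"camera": toy_camera_head, "depth": toy_depth_head},
)
outputs = model.forward([[1.0, 2.0, 3.0], [4.0, 6.0]])
print(outputs)
```

Adding another task in this sketch only means registering one more entry in the `heads` dictionary; the encoder is untouched.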

<aside> 📌

Input image sequence (N frames)

↓

Patch Embed

↓

[camera token + register tokens + patch tokens]

↓

Aggregator (Alternating Attention) ← backbone

↓

[Frame Feature (2 × embed_dim)]

↓

Head (task-specific prediction modules)

└─ Camera Head → predicts camera pose

└─ Point Head → predicts the point map (per-pixel 3D points)

└─ Depth Head → predicts the depth map

└─ DPT Head → multi-scale representation, etc.

</aside>
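The pipeline above can be followed as shape bookkeeping. The sketch below is a hedged illustration, not code from the repository: `TokenLayout` is a hypothetical helper, and the concrete numbers (518×518 images, patch size 14, `embed_dim` 1024, 4 register tokens, 1 camera token) are assumptions chosen for the example. It also assumes that "Alternating Attention" alternates between attention within each frame and attention across all frames, and that the `2 × embed_dim` feature comes from concatenating the outputs of the two.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TokenLayout:
    """Hypothetical helper that tracks token counts and tensor shapes."""

    num_frames: int               # N input images
    image_hw: Tuple[int, int]     # (H, W) in pixels
    patch_size: int               # side length of one square patch
    embed_dim: int                # channel width of each token
    num_register_tokens: int
    num_camera_tokens: int = 1

    @property
    def patches_per_frame(self) -> int:
        h, w = self.image_hw
        return (h // self.patch_size) * (w // self.patch_size)

    @property
    def tokens_per_frame(self) -> int:
        # [camera token + register tokens + patch tokens]
        return (self.num_camera_tokens
                + self.num_register_tokens
                + self.patches_per_frame)

    def frame_attention_shape(self, batch: int) -> Tuple[int, int, int]:
        # Assumed frame-wise step: each frame attends only to its own tokens.
        return (batch * self.num_frames, self.tokens_per_frame, self.embed_dim)

    def global_attention_shape(self, batch: int) -> Tuple[int, int, int]:
        # Assumed global step: tokens of all frames form one long sequence.
        return (batch, self.num_frames * self.tokens_per_frame, self.embed_dim)

    def output_shape(self, batch: int) -> Tuple[int, int, int, int]:
        # Assumed concatenation of frame-wise and global outputs: 2 x embed_dim.
        return (batch, self.num_frames, self.tokens_per_frame,
                2 * self.embed_dim)


# Illustrative values only; none of them are taken from the notes above.
layout = TokenLayout(num_frames=4, image_hw=(518, 518), patch_size=14,
                     embed_dim=1024, num_register_tokens=4)
print(layout.tokens_per_frame)           # 1 + 4 + 37 * 37
print(layout.frame_attention_shape(2))
print(layout.global_attention_shape(2))
print(layout.output_shape(2))
```

Under these assumptions the two attention steps see the same tokens and differ only in how they are grouped into sequences, which is why switching between them amounts to a reshape.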

Code source: https://github.com/facebookresearch/vggt/tree/main


VGGT class