Llama-3.2-11B-Vision-Instruct is an 11B-parameter multimodal model that takes images and text as input and generates text. It is suited to image captioning, visual question answering, and detailed image analysis that requires reasoning over visual and textual content together.