Visual Prompts: Images as Input
Modern AI can see. Send an image alongside your text prompt, and the AI analyses what's in it. Learn how to use visual input with multimodal models in Foundry.
Sending images to AI
You can show a picture to AI and ask questions about it — just like showing a photo to a friend.
“What’s in this image?” “Is there anything unusual?” “Read the text on this sign.” “How many people are in this photo?” The AI looks at the image and gives you an intelligent answer.
This works because multimodal models like GPT-4o can process both text AND images simultaneously.
Sending an image with your prompt
The example below assumes the `azure-ai-inference` SDK talking to a Foundry deployment; if you use a different SDK, swap in your own client setup.

```python
import base64
import os

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Create a client for your Foundry deployment (endpoint and key from env vars)
chat = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

# Read and base64-encode the local image file
with open("xray.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = chat.complete(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a medical image analysis assistant. Describe what you observe but never provide diagnoses."},
        {"role": "user", "content": [
            {"type": "text", "text": "What do you observe in this chest X-ray?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ]},
    ],
)

print(response.choices[0].message.content)
```
What’s happening:
- The user message contains BOTH text and an image
- The image is base64-encoded and embedded in the message
- GPT-4o processes both together, understanding the question AND the visual content
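The encoding step can be wrapped in a small helper that also guesses the correct MIME type, so JPEGs and PNGs both produce a valid data URL. A minimal sketch (`to_data_url` is a hypothetical name, not part of any SDK):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL for the image_url field."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "image/png" for a .png file
    if mime is None:
        mime = "application/octet-stream"  # fall back for unknown extensions
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{encoded}"
```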
Image input methods
| Method | How It Works | Best For |
|---|---|---|
| Base64 encoding | Embed the image data directly in the API call | Local files, private images |
| URL reference | Provide a public URL to the image | Publicly accessible images, web content |
```python
# Method 2: URL reference — point at a publicly accessible image
{"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}}
```
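With the URL method there is no encoding step at all — the user message simply carries the link alongside the question. A sketch of the full message shape (the chart URL is a placeholder):

```python
# A user message combining a question with a publicly hosted image.
# Any publicly reachable image link works in place of the placeholder URL.
user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What trends do you see in this chart?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
    ],
}
```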
What you can do with visual prompts
| Task | Example Prompt | Use Case |
|---|---|---|
| Describe | “What’s in this image?” | Accessibility, cataloguing |
| Analyse | ”What trends do you see in this chart?” | Business intelligence, reporting |
| Read text | ”Read all the text in this document” | OCR alternative, document processing |
| Compare | “What’s different between these two images?” | Quality control, before/after analysis |
| Count | ”How many people are in this photo?” | Event monitoring, crowd analysis |
| Classify | ”Is this a defective or normal product?” | Manufacturing quality control |
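For classification tasks like the last row, it helps to constrain the answer to a fixed label set in the system prompt, so the reply is machine-parseable. A minimal sketch (the `classification_messages` helper is hypothetical, not an SDK function):

```python
def classification_messages(image_url: str, labels: list[str]) -> list[dict]:
    """Build a message list that forces a one-word answer from a fixed label set."""
    choices = " or ".join(f"'{label}'" for label in labels)
    return [
        {"role": "system",
         "content": f"You are a quality-control classifier. Reply with exactly one word: {choices}."},
        {"role": "user", "content": [
            {"type": "text", "text": "Classify this product image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]


# Placeholder URL for illustration
messages = classification_messages("https://example.com/widget.png", ["defective", "normal"])
```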
GreenLeaf scenario: GreenLeaf farmers photograph their crops and ask the AI:
- “Are there signs of disease in this tomato plant?”
- “What type of pest damage do you see?”
- “Compare this week’s growth to last week’s photo”
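The week-over-week comparison works because a single user message can carry more than one image part. A sketch of the message shape (both URLs are placeholders):

```python
# Two image parts in one user message let the model compare them directly.
compare_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Compare this week's growth to last week's photo."},
        {"type": "image_url", "image_url": {"url": "https://example.com/field-week2.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/field-week1.jpg"}},
    ],
}
```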
Limitations of visual prompts
Visual prompts are powerful but have limitations:
- Not a medical diagnostic tool — the model can describe what it sees, but shouldn’t make diagnoses
- May misidentify fine details — small text, distant objects, or subtle differences may be missed
- No real-time video — processes individual images, not live video streams
- Token cost — images consume tokens, with higher-resolution images using more tokens
- Content filtering — harmful or sensitive images are blocked
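On the token-cost point: OpenAI-style chat APIs accept an optional `detail` setting on each image part, where `"low"` caps the image at a small fixed token cost in exchange for coarser analysis. Availability can vary by model and endpoint, so treat this as a sketch:

```python
# "detail": "low" asks the model to process a downscaled version of the image,
# trading fine-grained detail for a much smaller, fixed token cost.
cheap_image_part = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/field-photo.jpg",  # placeholder URL
        "detail": "low",
    },
}
```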
Exam tip: The exam may test your understanding of when visual prompts are appropriate vs when a dedicated vision service (Azure AI Vision) is better.
🎬 Video walkthrough
🎬 Video coming soon
Visual Prompts — AI-901 Module 20
~12 min
Knowledge Check
MediSpark wants doctors to upload X-ray images and get a description of what the AI observes. The system prompt should ensure the AI never provides diagnoses. Which implementation is correct?
GreenLeaf wants to process 10,000 field photos per day to detect crop disease. The analysis needs to be fast and cost-effective with a simple 'healthy/diseased' classification. What's the best approach?
Next up: Generating Images with AI — creating new visual content from text descriptions using GPT-image.