Visual Prompts: Images as Input
Modern AI can see. Send an image alongside your text prompt, and the AI analyses what's in it. Learn how to use visual input with multimodal models in Foundry.
Sending images to AI
You can show a picture to AI and ask questions about it — just like showing a photo to a friend.
“What’s in this image?” “Is there anything unusual?” “Read the text on this sign.” “How many people are in this photo?” The AI looks at the image and gives you an intelligent answer.
This works because multimodal models like GPT-4o can process both text AND images simultaneously.
Sending an image with your prompt
The example below assumes the `azure-ai-inference` SDK talking to a Foundry deployment; if you use a different SDK, swap in your own client setup.

```python
import base64
import os

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Create a client for your Foundry deployment (endpoint and key from env vars)
chat = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

# Read and base64-encode the local image file
with open("xray.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = chat.complete(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a medical image analysis assistant. Describe what you observe but never provide diagnoses."},
        {"role": "user", "content": [
            {"type": "text", "text": "What do you observe in this chest X-ray?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ]},
    ],
)

print(response.choices[0].message.content)
```
What’s happening:
- The user message contains BOTH text and an image
- The image is base64-encoded and embedded in the message
- GPT-4o processes both together, understanding the question AND the visual content
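The encoding step can be wrapped in a small helper that also guesses the correct MIME type, so JPEGs and PNGs both produce a valid data URL. A minimal sketch (`to_data_url` is a hypothetical name, not part of any SDK):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL for the image_url field."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "image/png" for a .png file
    if mime is None:
        mime = "application/octet-stream"  # fall back for unknown extensions
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{encoded}"
```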
Image input methods
| Method | How It Works | Best For |
|---|---|---|
| Base64 encoding | Embed the image data directly in the API call | Local files, private images |
| URL reference | Provide a public URL to the image | Publicly accessible images, web content |
```python
# Method 2: URL reference — point at a publicly accessible image
{"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}}
```
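With the URL method there is no encoding step at all — the user message simply carries the link alongside the question. A sketch of the full message shape (the chart URL is a placeholder):

```python
# A user message combining a question with a publicly hosted image.
# Any publicly reachable image link works in place of the placeholder URL.
user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What trends do you see in this chart?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
    ],
}
```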
What you can do with visual prompts
| Task | Example Prompt | Use Case |
|---|---|---|
| Describe | “What’s in this image?” | Accessibility, cataloguing |
| Analyse | ”What trends do you see in this chart?” | Business intelligence, reporting |
| Read text | ”Read all the text in this document” | OCR alternative, document processing |
| Compare | “What’s different between these two images?” | Quality control, before/after analysis |
| Count | ”How many people are in this photo?” | Event monitoring, crowd analysis |
| Classify | ”Is this a defective or normal product?” | Manufacturing quality control |
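For classification tasks like the last row, it helps to constrain the answer to a fixed label set in the system prompt, so the reply is machine-parseable. A minimal sketch (the `classification_messages` helper is hypothetical, not an SDK function):

```python
def classification_messages(image_url: str, labels: list[str]) -> list[dict]:
    """Build a message list that forces a one-word answer from a fixed label set."""
    choices = " or ".join(f"'{label}'" for label in labels)
    return [
        {"role": "system",
         "content": f"You are a quality-control classifier. Reply with exactly one word: {choices}."},
        {"role": "user", "content": [
            {"type": "text", "text": "Classify this product image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]


# Placeholder URL for illustration
messages = classification_messages("https://example.com/widget.png", ["defective", "normal"])
```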
GreenLeaf scenario: GreenLeaf farmers photograph their crops and ask the AI:
- “Are there signs of disease in this tomato plant?”
- “What type of pest damage do you see?”
- “Compare this week’s growth to last week’s photo”
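The week-over-week comparison works because a single user message can carry more than one image part. A sketch of the message shape (both URLs are placeholders):

```python
# Two image parts in one user message let the model compare them directly.
compare_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Compare this week's growth to last week's photo."},
        {"type": "image_url", "image_url": {"url": "https://example.com/field-week2.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/field-week1.jpg"}},
    ],
}
```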
Limitations of visual prompts
Visual prompts are powerful but have limitations:
- Not a medical diagnostic tool — the model can describe what it sees, but shouldn’t make diagnoses
- May misidentify fine details — small text, distant objects, or subtle differences may be missed
- No real-time video — processes individual images, not live video streams
- Token cost — images consume tokens, with higher-resolution images using more tokens
- Content filtering — harmful or sensitive images are blocked
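On the token-cost point: OpenAI-style chat APIs accept an optional `detail` setting on each image part, where `"low"` caps the image at a small fixed token cost in exchange for coarser analysis. Availability can vary by model and endpoint, so treat this as a sketch:

```python
# "detail": "low" asks the model to process a downscaled version of the image,
# trading fine-grained detail for a much smaller, fixed token cost.
cheap_image_part = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/field-photo.jpg",  # placeholder URL
        "detail": "low",
    },
}
```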
Exam tip: The exam may test your understanding of when visual prompts are appropriate vs when a dedicated vision service (Azure AI Vision) is better.
🎬 Video walkthrough
🎬 Video coming soon
Visual Prompts — AI-901 Module 20
~12 min
Knowledge Check
MediSpark wants doctors to upload X-ray images and get a description of what the AI observes. The system prompt should ensure the AI never provides diagnoses. Which implementation is correct?
GreenLeaf wants to process 10,000 field photos per day to detect crop disease. The analysis needs to be fast and cost-effective with a simple 'healthy/diseased' classification. What's the best approach?
Next up: Generating Images with AI — creating new visual content from text descriptions using GPT-image.