Prompt Engineering & Model Tuning
The difference between a good AI response and a great one is often the prompt. Learn how to tune generation behaviour, engineer effective prompts, and implement self-critique techniques like chain-of-thought and reflection.
Tuning AI behaviour
A model is like a talented musician — it can play anything, but it needs direction. Prompt engineering is the sheet music. Model parameters are the volume and tempo knobs.
The same model can give wildly different responses depending on how you ask (prompt) and what settings you use (temperature, max tokens, etc.). Mastering these controls is what separates a demo from a production AI app.
Model parameters
| Parameter | What It Controls | Range | Default | When to Adjust |
|---|---|---|---|---|
| Temperature | Randomness/creativity | 0.0 - 2.0 | ~1.0 | Lower for factual tasks, higher for creative |
| Top P | Diversity of token selection | 0.0 - 1.0 | ~1.0 | Lower to constrain vocabulary, higher for variety |
| Max tokens | Maximum response length | 1 - model limit | Varies | Set to prevent runaway responses |
| Frequency penalty | Reduces repetition of tokens | -2.0 - 2.0 | 0 | Increase if responses are repetitive |
| Presence penalty | Encourages new topics | -2.0 - 2.0 | 0 | Increase for more diverse content |
| Stop sequences | Tokens that end generation | Custom strings | None | Use to control output format |
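The parameters above typically travel together in a single request payload. A minimal sketch, assuming OpenAI-style parameter names (other providers may use different keys, so check your API reference):

```python
def generation_params(task_type: str) -> dict:
    """Return generation settings tuned for a given task type."""
    base = {
        "max_tokens": 512,         # cap response length to prevent runaway output
        "frequency_penalty": 0.0,  # raise if responses repeat themselves
        "presence_penalty": 0.0,   # raise to push the model toward new topics
        "stop": None,              # e.g. ["\n\n"] to end generation at a blank line
    }
    if task_type == "factual":
        # Extraction, classification, factual Q&A: consistency over creativity
        base.update(temperature=0.0, top_p=1.0)
    elif task_type == "creative":
        # Brainstorming, creative writing: more randomness and variety
        base.update(temperature=1.2, top_p=0.95)
    else:
        # Balanced default for most production applications
        base.update(temperature=0.5, top_p=1.0)
    return base
```

You would pass the returned dict alongside your prompt when calling your provider's SDK.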
Exam tip: Temperature is the most tested parameter
Temperature exam questions follow a pattern:
- Temperature 0 = near-deterministic, the same input gives essentially the same output. Best for: factual Q&A, extraction, classification
- Temperature 0.3-0.7 = balanced. Best for: most production applications
- Temperature 1.0+ = creative, varied. Best for: brainstorming, creative writing, diverse options
If the scenario needs consistency and accuracy, the answer is low temperature. If it needs creativity and variety, the answer is higher temperature.
Prompt engineering techniques
| Technique | What It Does | Example |
|---|---|---|
| System prompt | Sets the model's role, rules, and context | "You are a compliance analyst. Always cite regulations." |
| Few-shot | Provides example input/output pairs | "Q: What is DLP? A: Data Loss Prevention prevents…" |
| Chain-of-thought | Asks model to show reasoning steps | "Think step by step before answering." |
| Output formatting | Specifies response structure | "Respond in JSON format with fields: answer, confidence, sources" |
| Grounding instruction | Constrains model to use provided context | "Answer ONLY from the provided documents." |
| Persona | Gives the model a specific expert identity | "You are a senior Azure architect with 15 years of experience." |
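These techniques compose: a single request can carry a system prompt, few-shot examples, and an output-format instruction at once. A minimal sketch, assuming the common role/content chat-message convention (the example answer content is illustrative):

```python
def build_messages(question: str) -> list[dict]:
    """Combine system prompt, few-shot example, and the user question."""
    return [
        # System prompt: role, rules, grounding, and output format
        {"role": "system", "content": (
            "You are a compliance analyst. Always cite regulations. "
            "Answer ONLY from the provided documents. "
            "Respond in JSON format with fields: answer, confidence, sources."
        )},
        # Few-shot: one example input/output pair showing the desired shape
        {"role": "user", "content": "Q: What is DLP?"},
        {"role": "assistant", "content":
            '{"answer": "Data Loss Prevention prevents sensitive data leaks.", '
            '"confidence": 0.9, "sources": ["doc-1"]}'},
        # The actual question goes last
        {"role": "user", "content": question},
    ]
```

The few-shot pair teaches the model the expected JSON shape far more reliably than describing it in prose alone.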
Chain-of-thought and self-critique
Advanced reasoning techniques that improve output quality:
| Feature | Chain-of-Thought | Self-Critique | Reflection |
|---|---|---|---|
| What it is | Model explains its reasoning step by step | Model reviews its own response and identifies errors | Model evaluates whether it achieved the task goal |
| How to trigger | 'Think step by step' | 'Review your response. Are there any errors?' | 'Did your answer fully address the question? What did you miss?' |
| Best for | Complex reasoning, math, multi-step problems | Catching factual errors and inconsistencies | Ensuring completeness and accuracy |
| Cost | More tokens (reasoning + answer) | Double the tokens (answer + review) | Additional tokens for evaluation step |
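The trigger phrases in the table can be applied mechanically by appending them to a base prompt. A minimal sketch (the exact wording is illustrative and worth tuning per model):

```python
# Trigger phrases from the table above; tune the wording for your model.
TRIGGERS = {
    "chain_of_thought": "Think step by step before answering.",
    "self_critique": "Review your response. Are there any errors?",
    "reflection": "Did your answer fully address the question? What did you miss?",
}

def with_technique(prompt: str, technique: str) -> str:
    """Append the chosen reasoning trigger to a base prompt."""
    return f"{prompt}\n\n{TRIGGERS[technique]}"
```

Note the cost column above: each technique spends extra tokens, so apply them where reasoning quality matters, not on every call.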
Real-world example: Atlas Financial's self-critique loop
Atlas Financial’s compliance agent uses a two-pass approach:
Pass 1: Generate assessment
- Agent reviews loan application against regulations
- Produces initial compliance assessment with citations
Pass 2: Self-critique
- Same agent reviews its own assessment with the prompt: “Review your compliance assessment. Check: (1) Are all citations accurate? (2) Did you miss any applicable regulations? (3) Is the risk assessment justified?”
- Agent corrects errors and fills gaps
Result: 23% reduction in false compliance flags after adding the self-critique loop. The extra tokens are worth it for high-stakes financial decisions.
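The two-pass pattern above can be sketched as a short control loop. `call_model` below is a hypothetical stand-in for your actual LLM client call (it just echoes its input so the sketch runs standalone); swap in your provider's SDK:

```python
CRITIQUE_PROMPT = (
    "Review your compliance assessment. Check: (1) Are all citations accurate? "
    "(2) Did you miss any applicable regulations? "
    "(3) Is the risk assessment justified?"
)

def call_model(messages: list[dict]) -> str:
    # Placeholder for a real LLM call; echoes the last user message.
    return f"[assessment of: {messages[-1]['content']}]"

def assess_with_self_critique(application: str) -> str:
    # Pass 1: generate the initial compliance assessment
    history = [{"role": "user",
                "content": f"Assess compliance for: {application}"}]
    draft = call_model(history)
    # Pass 2: the same agent reviews its own output and corrects it
    history += [{"role": "assistant", "content": draft},
                {"role": "user", "content": CRITIQUE_PROMPT}]
    return call_model(history)
```

Keeping the draft in the conversation history is what lets pass 2 critique it; the final response replaces the draft rather than being appended to it.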
Key terms
Knowledge check
MediaForge's content generation tool produces the same headline every time for similar briefs. The marketing team wants more creative variety. Which parameter should they adjust?
NeuralMed's patient chatbot sometimes makes reasoning errors when answering multi-step medical questions (e.g., 'If the patient has condition A AND takes medication B, what are the risks?'). Which technique would most improve accuracy?
🎬 Video coming soon