Tell me about an AI feature you've shipped to production.

Gating question. Strong answers cover: the user problem (specifically), the model choice and why, the prompt engineering or RAG architecture you built, the evaluation methodology, at least one production incident and how you handled it, and the outcome with numbers. The kill-shot mistake is describing a demo or prototype. Panels can tell within thirty seconds whether you've operated an LLM in production. Have one strong story ready, plus a backup if asked for variety. The strongest candidates can name the feature, the model, the eval set size, the production metric, and the cost — all in the first ninety seconds.

How would you architect a RAG system for a customer-support assistant?

Strong answers cover four layers: (1) ingestion — chunking strategy, embedding model, vector DB choice; (2) retrieval — query embedding, hybrid search, reranking; (3) generation — prompt construction, structured output, judge-model verification; (4) evaluation and feedback — eval set, online instrumentation, retraining triggers. The strongest candidates name specific trade-offs — chunk size 256 vs 1024 tokens, semantic vs hybrid retrieval, when reranking is worth the latency. Mention specific tools (Pinecone vs Qdrant vs pgvector, Cohere reranker, Anthropic structured output) and explain WHY for a given trade-off. Wrong answer: listing tools without explaining trade-offs. The strongest signal: mentioning that retrieval quality is the bottleneck of any RAG system, and most candidates underestimate it until production.

How do you handle hallucination in production?

Four levers, in order of leverage: (1) structured output — constrain what the model can say; (2) RAG with high-quality retrieval — give the model the right context; (3) post-generation verification — rule-based or judge-model second pass on edge cases; (4) UI design — flag low confidence to users and let them verify. Strong candidates mention measuring hallucination rate explicitly via a labelled eval set rather than waiting for user reports. They also acknowledge the trade-off: high-stakes features need all four; creative-writing features need almost none. Wrong answer: 'we'd use a better model' — that's a non-answer. Hallucination is a product surface, not a model problem.

Walk me through your prompt engineering process.

Panels are testing whether you treat prompts as production code, not afterthoughts. Strong answers cover: prompt versioning (each prompt has an ID, a version, an author, an eval-set score), prompt A/B testing (compare two prompts on the same eval set), prompt templating (separate the system prompt, the few-shot examples, the user input), and the discipline of NOT touching a prompt without re-running the eval set. Mention specific tooling — LangSmith, Helicone, custom prompt-management. Strong candidates acknowledge the trap: prompts feel like writing, but the moment you treat them like writing, you stop being able to ship safely. Wrong answer: 'I iterate until it works.' Right answer: 'I version every prompt, every change goes through the eval set, and I have a rollback plan for every prompt deployed.'

Tell me about a time you reduced AI inference cost in production.

Strong answers describe the specific lever and the outcome. The cost levers in order of impact: model routing (small model on easy queries, frontier on hard), prompt caching, response caching, shorter outputs via prompt engineering, fine-tuning a smaller model once volume justifies it, batch inference. Strong candidates name numbers — 'reduced cost-per-conversation from $0.18 to $0.04 by routing 70% of traffic to Haiku.' Weak candidates list techniques without numbers. The strongest candidates also describe the eval-set discipline that let them confidently route — without an eval set, model routing is gambling, not engineering.

How do you decide between calling an API and self-hosting an open-source model?

Cost-quality-latency-control trade-off. API wins when: latency-sensitive features, frontier capability needed, low-volume usage, or the team doesn't have ML platform headcount. Self-hosting wins when: high-volume usage (hundreds of thousands of inferences per day), data residency or compliance requirements (private hosted), specific fine-tuned model needs, or you're spending more than ~£15k/month on API inference. Strong candidates mention specific volumes where the math flips. They also acknowledge the hidden cost of self-hosting: GPU procurement, engineer time, observability stack, model updates. Wrong answer: 'self-hosting is always cheaper' — only true at very high volume.

Walk me through how you'd evaluate a complex AI feature.

Strong answers separate three layers: (1) offline eval — a curated set of representative inputs scored against rubrics, run on every prompt or model change; (2) online eval — instrumentation on production traffic, tracking task-completion rate, user-flagged issues, safety triggers; (3) continuous eval — eval set re-runs on a schedule, with regression alerts. Mention specific failure modes — eval set staleness, judge-model contamination, Goodhart's law. The strongest candidates also mention HOW they keep the eval set fresh — sampling production failures, weekly review of low-confidence outputs, regular human re-rating. Wrong answer: 'we'd run accuracy benchmarks.' That's research framing. Production AI evaluation is a continuous engineering discipline.

Tell me about a time you killed an AI feature that was technically working.

Highest-signal question in 2026 because most AI features are technically interesting and product-marginal. Strong answers describe a feature that passed eval, shipped to a small cohort, and either failed to drive the metric you cared about OR created a worse user experience for unexpected reasons. The detail that lands well: what you LEARNED — that the underlying user problem wasn't solved by AI, that latency made it feel worse than the manual alternative, that users didn't trust the AI output enough to act on it. Candidates who can't give an example get scored as having low product judgment — everyone in this field has shipped AI features that didn't work.

How do you think about AI safety in production features?

Three layers of practical safety: (1) input filtering — content moderation, rate limiting, abuse detection; (2) output filtering — content policy, harmful-output detection, jailbreak defence; (3) downstream impact — what does this feature change in user behaviour, are there harms we're not modelling. Mention specific frameworks: red-teaming, structured policy taxonomies, incident response runbooks. The wrong answer is 'safety is engineering's problem' — that's a tell. Equally wrong is performative concern-trolling — 'we shouldn't ship until we've solved alignment' — that signals you can't ship. The middle ground: take it seriously while still shipping product.

What's your view on agents in 2026?

Loaded question. Panels are testing whether you can navigate hype vs reality. Strong answers acknowledge the maturity gap: simple tool-use (single-step retrieval, function calls) works reliably; multi-step autonomous agents that operate over long tasks are still unreliable for most production use cases. Strong candidates name specific successful agent patterns (well-scoped tasks with verification gates, agents with strong human-in-the-loop oversight) and specific failure patterns (open-ended autonomous agents that drift, agents without verification that compound errors). Avoid blanket optimism or blanket dismissal. The strongest signal: a candidate who can name a specific agent pattern they've shipped vs avoided, and explain why.

How would you handle scaling an AI feature from internal alpha to public launch?

Strong answers cover the full progression: (1) internal alpha — small group, full instrumentation, daily review; (2) limited beta — 5-10% of traffic, eval set re-running on production samples, on-call rotation for incidents; (3) public launch — full traffic, automated rollback if metrics regress, formal incident response. Mention specific go/no-go criteria — eval set scores must hold across launch, p99 latency must stay below threshold, hallucination rate must stay below ceiling. Strong candidates also describe the failure modes that derail launches: traffic patterns that don't match alpha, prompt injection at scale, edge cases that only appear at volume. Wrong answer: 'we'd ship if it works in alpha' — panels score that as someone who hasn't seen a public launch fail.

Interview Q's · Tech · UK 2026

AI Engineer Interview Questions UK

AI Engineer interviews in 2026 are shaped by how new the role is — most panels haven't standardised what they're testing for, and the bar varies wildly between companies. The candidates I see succeed share one trait: they sound like they've actually shipped LLM features into production, not built them in a notebook. Panels at OpenAI London, Anthropic London, Cohere, Synthesia, ElevenLabs, Wayve, Builder.ai and the AI-platform teams inside UK fintech are searching for shipped work, evaluation literacy, and the engineering judgment to ship features fast under capability uncertainty. The questions below come up across two-thirds of UK AI Engineer interviews. I have written each from the panel's perspective so you understand what they're scoring and where most candidates lose the room.

By Alex · 12-year UK recruiter · 12 questions + recruiter answers

Question 1

Tell me about an AI feature you've shipped to production.

Gating question. Strong answers cover: the user problem (specifically), the model choice and why, the prompt engineering or RAG architecture you built, the evaluation methodology, at least one production incident and how you handled it, and the outcome with numbers. The kill-shot mistake is describing a demo or prototype. Panels can tell within thirty seconds whether you've operated an LLM in production. Have one strong story ready, plus a backup if asked for variety. The strongest candidates can name the feature, the model, the eval set size, the production metric, and the cost — all in the first ninety seconds.
Question 2

How would you architect a RAG system for a customer-support assistant?

Strong answers cover four layers: (1) ingestion — chunking strategy, embedding model, vector DB choice; (2) retrieval — query embedding, hybrid search, reranking; (3) generation — prompt construction, structured output, judge-model verification; (4) evaluation and feedback — eval set, online instrumentation, retraining triggers. The strongest candidates name specific trade-offs — chunk size 256 vs 1024 tokens, semantic vs hybrid retrieval, when reranking is worth the latency. Mention specific tools (Pinecone vs Qdrant vs pgvector, Cohere reranker, Anthropic structured output) and explain WHY for a given trade-off. Wrong answer: listing tools without explaining trade-offs. The strongest signal: mentioning that retrieval quality is the bottleneck of any RAG system, and most candidates underestimate it until production.
Question 3

How do you handle hallucination in production?

Four levers, in order of leverage: (1) structured output — constrain what the model can say; (2) RAG with high-quality retrieval — give the model the right context; (3) post-generation verification — rule-based or judge-model second pass on edge cases; (4) UI design — flag low confidence to users and let them verify. Strong candidates mention measuring hallucination rate explicitly via a labelled eval set rather than waiting for user reports. They also acknowledge the trade-off: high-stakes features need all four; creative-writing features need almost none. Wrong answer: 'we'd use a better model' — that's a non-answer. Hallucination is a product surface, not a model problem.
Question 4

Walk me through your prompt engineering process.

Panels are testing whether you treat prompts as production code, not afterthoughts. Strong answers cover: prompt versioning (each prompt has an ID, a version, an author, an eval-set score), prompt A/B testing (compare two prompts on the same eval set), prompt templating (separate the system prompt, the few-shot examples, the user input), and the discipline of NOT touching a prompt without re-running the eval set. Mention specific tooling — LangSmith, Helicone, custom prompt-management. Strong candidates acknowledge the trap: prompts feel like writing, but the moment you treat them like writing, you stop being able to ship safely. Wrong answer: 'I iterate until it works.' Right answer: 'I version every prompt, every change goes through the eval set, and I have a rollback plan for every prompt deployed.'
Question 5

Tell me about a time you reduced AI inference cost in production.

Strong answers describe the specific lever and the outcome. The cost levers in order of impact: model routing (small model on easy queries, frontier on hard), prompt caching, response caching, shorter outputs via prompt engineering, fine-tuning a smaller model once volume justifies it, batch inference. Strong candidates name numbers — 'reduced cost-per-conversation from $0.18 to $0.04 by routing 70% of traffic to Haiku.' Weak candidates list techniques without numbers. The strongest candidates also describe the eval-set discipline that let them confidently route — without an eval set, model routing is gambling, not engineering.
Question 6

How do you decide between calling an API and self-hosting an open-source model?

Cost-quality-latency-control trade-off. API wins when: latency-sensitive features, frontier capability needed, low-volume usage, or the team doesn't have ML platform headcount. Self-hosting wins when: high-volume usage (hundreds of thousands of inferences per day), data residency or compliance requirements (private hosted), specific fine-tuned model needs, or you're spending more than ~£15k/month on API inference. Strong candidates mention specific volumes where the math flips. They also acknowledge the hidden cost of self-hosting: GPU procurement, engineer time, observability stack, model updates. Wrong answer: 'self-hosting is always cheaper' — only true at very high volume.
Question 7

Walk me through how you'd evaluate a complex AI feature.

Strong answers separate three layers: (1) offline eval — a curated set of representative inputs scored against rubrics, run on every prompt or model change; (2) online eval — instrumentation on production traffic, tracking task-completion rate, user-flagged issues, safety triggers; (3) continuous eval — eval set re-runs on a schedule, with regression alerts. Mention specific failure modes — eval set staleness, judge-model contamination, Goodhart's law. The strongest candidates also mention HOW they keep the eval set fresh — sampling production failures, weekly review of low-confidence outputs, regular human re-rating. Wrong answer: 'we'd run accuracy benchmarks.' That's research framing. Production AI evaluation is a continuous engineering discipline.
Question 8

Tell me about a time you killed an AI feature that was technically working.

Highest-signal question in 2026 because most AI features are technically interesting and product-marginal. Strong answers describe a feature that passed eval, shipped to a small cohort, and either failed to drive the metric you cared about OR created a worse user experience for unexpected reasons. The detail that lands well: what you LEARNED — that the underlying user problem wasn't solved by AI, that latency made it feel worse than the manual alternative, that users didn't trust the AI output enough to act on it. Candidates who can't give an example get scored as having low product judgment — everyone in this field has shipped AI features that didn't work.
Question 9

How do you think about AI safety in production features?

Three layers of practical safety: (1) input filtering — content moderation, rate limiting, abuse detection; (2) output filtering — content policy, harmful-output detection, jailbreak defence; (3) downstream impact — what does this feature change in user behaviour, are there harms we're not modelling. Mention specific frameworks: red-teaming, structured policy taxonomies, incident response runbooks. The wrong answer is 'safety is engineering's problem' — that's a tell. Equally wrong is performative concern-trolling — 'we shouldn't ship until we've solved alignment' — that signals you can't ship. The middle ground: take it seriously while still shipping product.
Question 10

What's your view on agents in 2026?

Loaded question. Panels are testing whether you can navigate hype vs reality. Strong answers acknowledge the maturity gap: simple tool-use (single-step retrieval, function calls) works reliably; multi-step autonomous agents that operate over long tasks are still unreliable for most production use cases. Strong candidates name specific successful agent patterns (well-scoped tasks with verification gates, agents with strong human-in-the-loop oversight) and specific failure patterns (open-ended autonomous agents that drift, agents without verification that compound errors). Avoid blanket optimism or blanket dismissal. The strongest signal: a candidate who can name a specific agent pattern they've shipped vs avoided, and explain why.
Question 11

How would you handle scaling an AI feature from internal alpha to public launch?

Strong answers cover the full progression: (1) internal alpha — small group, full instrumentation, daily review; (2) limited beta — 5-10% of traffic, eval set re-running on production samples, on-call rotation for incidents; (3) public launch — full traffic, automated rollback if metrics regress, formal incident response. Mention specific go/no-go criteria — eval set scores must hold across launch, p99 latency must stay below threshold, hallucination rate must stay below ceiling. Strong candidates also describe the failure modes that derail launches: traffic patterns that don't match alpha, prompt injection at scale, edge cases that only appear at volume. Wrong answer: 'we'd ship if it works in alpha' — panels score that as someone who hasn't seen a public launch fail.
Question 12

What questions do you have for us?

Strong AI Engineer questions are technical and operational: what does your evaluation pipeline look like, how does the team handle prompt versioning, what's the most contentious technical decision you've made in the last quarter, how do you think about model dependency risk when a provider deprecates a model. Avoid generic questions about culture or growth. Avoid asking about safety in a way that signals you haven't done the homework. The strongest signal is a question that couldn't be answered by anyone outside the team — that shows research and serious intent.

How to use these answers

AI Engineer interviews reward concrete shipped work: feature names, eval methodology, model swaps, things that broke, things you killed. Build six to eight personal AI-engineering stories before the interview, each tagged for the competencies above (RAG architecture, prompt engineering discipline, hallucination handling, cost optimisation, agent judgment, safety thinking). Quantify outcomes — accuracy lifts, cost reductions, latency improvements. The candidates I place at OpenAI London, Anthropic London, the top fintech AI teams and the AI-native scale-ups all share one trait: they sound like operators, not enthusiasts. Sound like you've already been doing the job for two years and panels will believe you can keep doing it.

Related across UK Rights & Guides

Keep reading

Explore the full UK guides index (20 clusters) →

Pillars + free tools

Related job-search guides + calculators

Pillars

→ UK Career Change — into-tech pillar — sector-switch playbook
→ UK CV pillar — recruiter playbook — CV format + ATS-safe AI prompts
→ AI Interview Prep playbook — STAR + 4-stage UK process
→ AI-assisted UK Cover Letter pillar — five UK opening patterns

Free recruiter-built tools