
Machine Learning Engineer Interview Questions UK

ML Engineer interviews in 2026 are the most technically demanding I run candidates through, regardless of role family. Panels at OpenAI London, Anthropic London, DeepMind, Wayve, Cohere, Stability and the AI-platform teams inside fintech expect production-grade Python, distributed-training intuition, evaluation-engineering judgment, and the specific debugging instincts you only develop after taking real models from notebook to production. The questions below come up across two-thirds of UK ML Engineer interviews I've staffed in 2025-2026. I have written each from the panel's perspective so you understand what they're actually scoring, what a strong response looks like, and where most candidates lose the room.

By Alex · 12-year UK recruiter · 12 questions + recruiter answers
  1. Question 1

    Walk me through a model you've taken from prototype to production.

    This is the gating question. Strong answers cover six things: the original product problem, why ML was the right approach (vs heuristics or rules), the dataset (size, quality, gaps), the model choice and why, the evaluation methodology you used to validate it before launch, and at least one production failure mode and how you handled it. Panels are testing whether you've operated a model in production end-to-end, not whether you've trained one in a Colab notebook. The kill-shot mistake is describing a Kaggle competition or a research project. If you can't name the production metric and the operational headaches, panels close out early.

  2. Question 2

    How would you debug a model whose performance has degraded in production?

    Strong answers walk through a structured triage: (1) check for data drift — has input distribution changed; (2) check for label drift — has the ground-truth distribution changed; (3) check the serving stack — model version, feature pipeline, infrastructure issues; (4) check user-facing changes that might mask the issue. The candidate should mention specific tools (W&B, Arize, custom drift dashboards) and specific decisions about when to roll back vs investigate. The wrong answer is jumping straight to retraining — panels score that as someone who hasn't dealt with production. The strongest signal is a candidate who can describe a specific drift incident they handled in production with the resolution time.
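
    As a concrete illustration of step (1), here is a minimal input-drift check that compares production feature distributions against a training-time reference window. The KS threshold and the column handling are illustrative placeholders rather than a prescription.

    ```python
    # Minimal sketch of triage step (1): flag input drift per numeric feature by
    # comparing a production window against a training-time reference window.
    # The 0.1 KS threshold is an illustrative placeholder, not a standard.
    import numpy as np
    import pandas as pd
    from scipy.stats import ks_2samp

    def drift_report(reference: pd.DataFrame, production: pd.DataFrame,
                     threshold: float = 0.1) -> pd.DataFrame:
        rows = []
        for col in reference.select_dtypes(include=np.number).columns:
            stat, p_value = ks_2samp(reference[col].dropna(), production[col].dropna())
            rows.append({"feature": col, "ks_stat": stat,
                         "p_value": p_value, "drifted": stat > threshold})
        return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    ```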

  3. Question 3

    Explain how you'd design an evaluation set for an LLM-powered feature.

    Panels are testing whether you understand evaluation as a continuous engineering discipline, not a one-time benchmark. Strong answers cover: representative input sampling from real production traffic (not synthetic prompts), human-rated rubrics with 3-5 categories of evaluation, judge-model fallback for scaled evaluation with explicit understanding of judge-model contamination risk, and continuous evaluation that re-runs on every model swap or prompt change. Mention specific failure modes — eval-set staleness, Goodhart's law, label noise. The mistake I see most often is candidates describing accuracy metrics — that's a research framing, not a production framing. The strongest candidates talk about eval as something you ship, version, and maintain.
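
    To make "eval as something you ship" concrete, a minimal sketch of a versioned eval harness follows. The case schema, rubric shape, and the generate/grade callables are assumptions for illustration, not any particular team's pipeline.

    ```python
    # Minimal sketch of a shipped, versioned eval: cases sampled from production
    # traffic, a rubric per case, and a runner re-run on every model or prompt change.
    import json
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalCase:
        case_id: str
        prompt: str        # sampled from real production traffic, not synthetic
        rubric: dict       # e.g. {"faithfulness": ..., "tone": ..., "format": ...}

    def load_eval_set(path: str, version: str) -> list[EvalCase]:
        # eval sets are versioned artefacts; staleness is tracked per version
        with open(path) as f:
            return [EvalCase(**row) for row in json.load(f)[version]]

    def run_eval(cases: list[EvalCase],
                 generate: Callable[[str], str],
                 grade: Callable[[str, dict], dict]) -> dict:
        scores = [grade(generate(c.prompt), c.rubric) for c in cases]
        # aggregate per rubric category so regressions show up per dimension
        return {cat: sum(s[cat] for s in scores) / len(scores) for cat in scores[0]}
    ```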

  4. Question 4

    How do you choose between fine-tuning and prompting for a given task?

    Strong answers acknowledge the cost-quality-latency triangle. Fine-tuning makes sense when: the task is narrow and stable, you have at least 1,000-10,000 high-quality examples, the prompted model isn't passing your eval set, and the production volume justifies the engineering cost (tens of thousands of inferences per day minimum). Prompting wins when: the task evolves quickly, you have less than 1,000 examples, latency constraints permit a frontier model, or the eval set is small enough that the marginal cost of better prompting is lower than the marginal cost of building a fine-tuning pipeline. Specific examples — fine-tuning Llama 3.1 8B for a structured-extraction task that runs 50,000 times a day vs prompting Sonnet for a creative-writing feature that runs 200 times a day — score highest. Wrong answer: 'fine-tuning is always better.'
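
    The volume side of that triangle is easy to sanity-check with arithmetic. The sketch below uses entirely hypothetical prices, token counts and pipeline costs; the point is that daily volume drives the break-even, as in the 50,000-vs-200 calls-per-day examples above.

    ```python
    # Back-of-envelope break-even for fine-tuning vs prompting. Every number here
    # is a hypothetical placeholder; substitute your own prices and volumes.
    def monthly_cost_prompted(calls_per_day, tokens_per_call, price_per_1k_tokens):
        return calls_per_day * 30 * tokens_per_call / 1000 * price_per_1k_tokens

    def monthly_cost_finetuned(calls_per_day, tokens_per_call, price_per_1k_tokens,
                               pipeline_build_cost, amortise_months=12):
        serving = calls_per_day * 30 * tokens_per_call / 1000 * price_per_1k_tokens
        return serving + pipeline_build_cost / amortise_months

    # 50,000 calls/day at frontier prices vs a cheaper fine-tuned model:
    print(monthly_cost_prompted(50_000, 1_500, 0.01))            # 22500.0
    print(monthly_cost_finetuned(50_000, 1_500, 0.001, 30_000))  # 4750.0
    ```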

  5. Question 5

    Tell me about a time you partnered with a non-technical stakeholder on an ML project.

    Behavioural round, but ML-specific. Panels are testing whether you can translate ML judgment into product decisions without losing nuance. Strong answers describe a specific situation where you had to set realistic expectations on accuracy (the model won't be 100%), time-to-value (the data pipeline takes longer than the model), or product trade-offs (latency vs accuracy). The strongest candidates describe shaping the product requirement BEFORE building the model — talking the PM out of an unrealistic ask, or shaping the UI to handle low-confidence predictions gracefully. Weak candidates describe themselves as 'translating' or 'educating', which usually means they overrode the stakeholder. Both directions matter — you should be able to push back AND adjust.

  6. Question 6

    How would you reduce inference cost on a production LLM feature without hurting quality?

    Six levers, in rough order of impact: (1) model routing — run smaller cheaper models on easy queries and route hard queries to frontier models; (2) prompt caching — re-use system-prompt computation; (3) response caching for repeated queries; (4) shorter outputs via prompt engineering (max_tokens, structured output); (5) fine-tune a smaller open-source model for the specific task once volume justifies it; (6) batch inference where latency permits. Strong candidates name specific cost-reduction targets achieved — 'we reduced cost-per-conversation from $0.18 to $0.04 by moving 70% of traffic to Haiku via a router.' Weak answers list techniques without numbers.
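
    A minimal sketch of lever (1) follows. The difficulty heuristic and the two model callables are placeholders; in practice the router is often a small classifier scored against the same eval set as the models it routes between.

    ```python
    # Minimal sketch of model routing: cheap tier for easy queries, frontier tier
    # for hard ones. The heuristic and both model callables are placeholders.
    from typing import Callable

    def route(query: str,
              cheap_model: Callable[[str], str],
              frontier_model: Callable[[str], str]) -> str:
        hard_markers = ("compare", "explain why", "step by step", "multi-step")
        looks_hard = len(query) > 400 or any(m in query.lower() for m in hard_markers)
        return frontier_model(query) if looks_hard else cheap_model(query)
    ```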

  7. Question 7

    Walk me through how you'd evaluate fairness in a model used for hiring decisions.

    This is increasingly an interview standard — partly because of UK Equality Act exposure for ML systems making high-stakes decisions. Strong answers cover three layers: (1) data audit — what protected characteristics are present or proxied in training data; (2) outcome audit — measure model performance disparities across protected groups using specific metrics (equalised odds, demographic parity, calibration); (3) impact audit — what happens when the model is deployed, is there a human in the loop. Mention specific tooling — Aequitas, Fairlearn, Google's What-If Tool. The wrong answer is 'we'd remove protected features from training data' — panels score that as naïve because protected characteristics are usually proxied by other features.
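
    For the outcome-audit layer, a minimal Fairlearn sketch is below; the labels, predictions and group column are dummy stand-ins for real hiring outcomes and a real protected characteristic.

    ```python
    # Minimal sketch of layer (2), the outcome audit, using Fairlearn.
    # All arrays below are dummy stand-ins for real audit data.
    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score
    from fairlearn.metrics import (MetricFrame, demographic_parity_difference,
                                   equalized_odds_difference)

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # ground-truth hiring outcomes
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model decisions
    group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # protected group

    frame = MetricFrame(metrics={"accuracy": accuracy_score, "recall": recall_score},
                        y_true=y_true, y_pred=y_pred, sensitive_features=group)
    print(frame.by_group)                         # per-group performance
    print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
    print(equalized_odds_difference(y_true, y_pred, sensitive_features=group))
    ```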

  8. Question 8

    How do you manage technical debt in an ML codebase?

    Panels are testing whether you've worked in a real ML codebase long enough to see the debt accumulate. Strong answers acknowledge specific kinds of ML debt — data-pipeline debt (one-off ETL scripts that nobody owns), feature-store debt (multiple feature definitions for the same business concept), model-versioning debt (three production models nobody can reproduce), eval-set debt (eval sets that no longer match production data). The strongest candidates describe incremental clean-up patterns: dual-running new and old pipelines for a sprint, tagging code for deprecation, retiring models on published sunset dates. Wrong answer: 'we'd rewrite from scratch' — panels score that as someone who hasn't shipped at scale.

  9. Question 9

    How would you architect a RAG system at scale?

    Strong answers cover four components: (1) ingestion — chunking strategy, embedding model selection, embedding storage (which vector DB and why); (2) retrieval — query embedding, hybrid search (semantic + keyword), reranking, retrieval evaluation; (3) generation — prompt construction, structured output, evaluation; (4) feedback — instrumentation, retraining triggers, A/B testing of changes. Strong candidates mention specific trade-offs — chunk size vs retrieval precision, embedding-model latency vs quality, vector-DB choice (pgvector vs Pinecone vs Qdrant) and why. Wrong answer: listing tools without trade-offs. The strongest signal is mentioning the Achilles heel of RAG — retrieval quality bottlenecks the entire system, and most candidates don't realise it until production.
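
    As one illustration of the retrieval component, here is a minimal hybrid-search sketch using reciprocal rank fusion. The semantic search, keyword search and reranker are placeholder callables rather than any specific vector DB's API.

    ```python
    # Minimal sketch of hybrid retrieval: fuse semantic and keyword rankings with
    # reciprocal rank fusion, then pass a shortlist to a reranker.
    from typing import Callable

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    def retrieve(query: str,
                 semantic_search: Callable[[str, int], list[str]],
                 keyword_search: Callable[[str, int], list[str]],
                 rerank: Callable[[str, list[str]], list[str]],
                 top_k: int = 8) -> list[str]:
        fused = reciprocal_rank_fusion([semantic_search(query, 50),
                                        keyword_search(query, 50)])
        return rerank(query, fused[:top_k * 3])[:top_k]
    ```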

  10. Question 10

    Tell me about a model you deployed that had unexpected emergent behaviour.

    Panels want a specific story, not theory. Strong answers describe what was unexpected, when you noticed (production usage, monitoring, user reports), how you diagnosed root cause, and what you changed. The detail that lands well is what you LEARNED about the model class as a result — emergent behaviour from LLMs is qualitatively different from emergent behaviour from supervised classifiers. Candidates who can describe a specific incident (model started role-playing as another character mid-conversation, model leaked training data verbatim, model degraded after a routine fine-tuning) score highest. Candidates who give theoretical answers without an incident get scored as having not yet seen production at scale.

  11. Question 11

    How would you explain a complex model decision to a non-technical executive?

    Strong answers structure the explanation: (1) the business outcome the model affects; (2) the variables that drive the model's decision in this case; (3) the model's confidence; (4) what would change the decision. Mention specific tools — SHAP values, counterfactual explanations, attention visualisations for transformers — but only as supporting evidence, not as the whole answer. The strongest signal is a candidate who can explain WITHOUT the visualisation, then offer the visualisation as backup. Wrong answer: dumping technical detail without a business framing. Right answer: 'this loan was declined because the applicant's debt-to-income ratio sits above the 0.42 threshold the model has learned; 84% of approvals fall below that level.'
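
    As supporting evidence behind that plain-English framing, a minimal SHAP sketch follows. The model, features and threshold mirror the loan example above, but the data is entirely illustrative.

    ```python
    # Minimal sketch of backing the verbal explanation with SHAP contributions.
    # Model, features and values are illustrative dummy data.
    import pandas as pd
    import shap
    from sklearn.ensemble import GradientBoostingClassifier

    X = pd.DataFrame({"debt_to_income":       [0.31, 0.45, 0.28, 0.51, 0.39, 0.22],
                      "credit_history_years": [7, 2, 12, 4, 9, 15],
                      "income":               [42_000, 35_000, 61_000, 29_000, 48_000, 75_000]})
    y = [1, 0, 1, 0, 1, 1]                    # 1 = approved (dummy labels)

    model = GradientBoostingClassifier().fit(X, y)
    explainer = shap.Explainer(model, X)      # background data for the explainer
    explanation = explainer(X.iloc[[3]])      # explain one declined application
    # Map contributions back to feature names before translating into business terms
    print(dict(zip(X.columns, explanation.values[0])))
    ```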

  12. Question 12

    What questions do you have for us?

    Strong questions for ML Engineer interviews are technical and operational: what does your evaluation pipeline look like, how does your team partner with research scientists, what's the most contentious technical decision you've made in the last six months, what's the team's stance on model governance and documentation. Avoid generic questions about culture or growth. Avoid asking about safety in a way that signals you haven't done the homework. The strongest signal is a question that couldn't be answered by anyone outside the team — that shows research and serious intent. At AI-native companies in particular, panels remember which candidates asked the sharpest questions.

How to use these answers

ML Engineer interviews reward technical depth and operational specificity. Build six to eight personal stories before the interview — each tagged for the competencies above (production debugging, evaluation engineering, cost optimisation, stakeholder partnership, fairness reasoning). Quantify outcomes — accuracy lifts, latency reductions, cost savings, incident resolution times. The single mistake I see kill the most ML Engineer offers is theoretical answers without production specifics. "In general we'd..." loses to "At my last company, we reduced p99 latency from 480ms to 95ms by switching from a Python serving stack to vLLM with batch inference." Specifics get hired.
