JobLabs

Interview Q's · Tech · UK 2026

Data Engineer Interview Questions UK

Data Engineer interviews in 2026 sit between software engineering and analytics in a way that catches candidates off-guard. Panels at Monzo, Wise, Revolut, Stripe London, Snowflake UK, Databricks UK and the data teams at AI-native companies expect production-grade Python, deep SQL, hands-on dbt or Airflow, and an ability to talk credibly about the trade-offs across the data platform — not just one tool. The questions below come up across two-thirds of UK Data Engineer interviews I've staffed in 2025-2026. Each is written from the panel's perspective so you understand what they're scoring and where most candidates lose the room.

By Alex · 12-year UK recruiter · 12 questions + recruiter answers
  1. Walk me through a data pipeline you've built end-to-end.

    Gating question. Strong answers cover six things: the business problem, the data sources and quality issues you encountered, the schema design (dimensional? denormalised? medallion?), the orchestration choice and why, the monitoring you put in place, and at least one production issue you handled. The kill-shot mistake is describing a one-off ETL script. Panels can tell within sixty seconds whether you've operated a pipeline in production beyond the happy path. Strong candidates name the daily row volume, the SLA they hit, and the cost. Weak candidates describe a notebook script they wrote once.
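The six points above can be made concrete even in a toy sketch. The following is a minimal batch pipeline with a quality rule and a volumetric check standing in for real monitoring; the field names and thresholds are illustrative, and a production version would run under an orchestrator such as Airflow:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders_pipeline")

def extract(rows):
    """Pull raw records (here in-memory; in production, an API or file drop)."""
    return list(rows)

def transform(raw):
    """Apply quality rules before loading: quarantine rows that fail validation."""
    clean, rejected = [], []
    for r in raw:
        try:
            r["amount"] = float(r["amount"])
            if r["amount"] < 0:
                raise ValueError("negative amount")
            clean.append(r)
        except (KeyError, ValueError) as exc:
            rejected.append((r, str(exc)))
    return clean, rejected

def load(clean, expected_min_rows=1):
    """Load step guarded by a row-count check - a stand-in for monitoring."""
    if len(clean) < expected_min_rows:
        raise RuntimeError("row count below threshold - aborting load")
    log.info("loaded %d rows", len(clean))
    return len(clean)

raw = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "-5.00"},  # fails the quality rule
    {"order_id": "3", "amount": "42.00"},
]
clean, rejected = transform(extract(raw))
loaded = load(clean)
```

The point the panel is scoring is visible even here: validation before load, rejected rows kept for triage rather than dropped silently, and the load refusing to proceed on anomalous volume.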

  2. How would you design a data warehouse for a fast-growing startup?

    Strong answers acknowledge it depends on stage. Pre-product-market-fit: keep it simple, single warehouse, raw + staging + marts pattern in dbt, postpone any architectural complexity. Series A-B: introduce a feature store if ML-heavy, separate the analytics warehouse from the operational store, get the orchestration robust. Series C+: think about real-time pipelines, data contracts with upstream services, multi-region. Strong candidates name specific tools they'd reach for at each stage (Fivetran for ingestion early; Airbyte or custom Python at scale; Snowflake or BigQuery as warehouse; dbt for modelling; Airflow or Dagster for orchestration). Wrong answer: starting with the most complex architecture before justifying it.

  3. Explain the difference between OLTP and OLAP, and when you'd use each.

    Panels ask this to filter junior candidates fast. OLTP — transactional databases (PostgreSQL, MySQL) — optimised for many small reads/writes, normalised schemas, strong consistency. OLAP — analytical databases (Snowflake, BigQuery, Databricks) — optimised for large scans across denormalised tables, columnar storage, eventual consistency acceptable. Strong answers ALSO mention the modern blur — Postgres extensions for analytical workloads (pgvector, columnar Citus), Trino for federated queries across both, real-time CDC pipelines that mirror OLTP into OLAP for analytics. Wrong answer: defining the terms without naming when each fits.
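The storage difference behind the two workloads can be shown in a few lines of plain Python. This is a toy model, not how either engine is implemented, but it captures why each layout wins at its job:

```python
# Row store (OLTP-style): each record kept together - cheap point lookups.
rows = [
    {"id": 1, "customer": "a", "amount": 10.0},
    {"id": 2, "customer": "b", "amount": 25.0},
    {"id": 3, "customer": "a", "amount": 7.5},
]
by_id = {r["id"]: r for r in rows}
point_lookup = by_id[2]          # one key access: the typical OLTP pattern

# Column store (OLAP-style): each column kept together - an aggregate
# scans only the column it needs instead of touching whole rows.
columns = {
    "id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.0, 7.5],
}
total = sum(columns["amount"])   # full scan of a single column
```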

  4. Walk me through how you'd debug a slow Airflow DAG.

    Strong answers structure the triage: (1) check task-level timing in the UI to find which task is slow; (2) check the executor — is it the worker, the queue, or the task itself; (3) check the upstream data — has volume spiked, has a source query stopped using its index; (4) check the SQL — has a join degenerated, is partition pruning being skipped. Strong candidates mention specific tools — Airflow's Gantt view, query plans in the warehouse, dbt's --threads flag, Snowflake's query profile. Wrong answer: 'I'd add more workers' — that's a non-answer if the worker isn't the bottleneck.
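Step (1) can also be done outside the UI by ranking tasks by wall-clock duration. The sketch below works on dicts shaped roughly like Airflow's task-instance records; the field names and timestamps are illustrative, and in practice you would pull them from the metadata database or REST API:

```python
from datetime import datetime

# Illustrative task-instance records for one DAG run.
task_instances = [
    {"task_id": "extract",   "start": datetime(2026, 1, 5, 2, 0),  "end": datetime(2026, 1, 5, 2, 4)},
    {"task_id": "transform", "start": datetime(2026, 1, 5, 2, 4),  "end": datetime(2026, 1, 5, 3, 58)},
    {"task_id": "load",      "start": datetime(2026, 1, 5, 3, 58), "end": datetime(2026, 1, 5, 4, 5)},
]

def slowest_tasks(instances, top_n=3):
    """Rank tasks by duration to find where the triage should start."""
    timed = [
        (ti["task_id"], (ti["end"] - ti["start"]).total_seconds())
        for ti in instances
    ]
    return sorted(timed, key=lambda t: t[1], reverse=True)[:top_n]

ranking = slowest_tasks(task_instances)
```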

  5. How do you handle schema changes in upstream systems that break downstream tables?

    This is increasingly the most-asked operational question. Strong answers cover four layers: (1) data contracts — explicit schema agreements with upstream services, ideally enforced in CI; (2) backwards-compatible additions only — never rename, never drop, deprecate explicitly; (3) staging-environment validation — every schema change runs through staging before production; (4) downstream observability — dbt source freshness checks, alerting on row-count anomalies, broken-test budgets. Strong candidates mention specific tools (dbt source freshness, Great Expectations, Soda, Monte Carlo) and acknowledge the cultural challenge — getting engineering teams to respect data contracts is harder than the tooling. Wrong answer: 'we'd communicate better' — too soft, too generic.
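Layer (1) can be as simple as a CI check that diffs the producer's actual schema against an agreed contract. The table and column names below are made up, but the shape is the point: additions pass, renames and type changes fail the build:

```python
# Agreed contract for a hypothetical upstream `orders` table.
CONTRACT = {"order_id": "string", "amount": "float", "created_at": "timestamp"}

def check_contract(contract, actual):
    """Return violations: missing columns and type changes.
    Extra columns are allowed - additions are backwards-compatible."""
    violations = []
    for col, typ in contract.items():
        if col not in actual:
            violations.append(f"missing column: {col}")
        elif actual[col] != typ:
            violations.append(f"type change on {col}: {typ} -> {actual[col]}")
    return violations

# Upstream renamed `amount` to `order_amount` - a break CI should catch.
actual = {"order_id": "string", "order_amount": "float", "created_at": "timestamp"}
violations = check_contract(CONTRACT, actual)
```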

  6. Explain how you'd set up data quality monitoring.

    Strong answers cover three layers of checks: (1) volumetric — row counts, ingestion freshness, file arrival timing; (2) structural — schema conformance, NULL rates, distribution drift; (3) semantic — business-rule validation (revenue can't be negative, customer IDs must reference an existing customer, dates must be within plausible ranges). Mention specific tools (dbt tests, Great Expectations, Soda, Monte Carlo, custom Python checks), and acknowledge the trade-off — too many checks create alert fatigue and noise; too few means you find issues from stakeholder reports, which destroys trust. Strong candidates also mention error budgets and on-call rotations for data quality. Wrong answer: 'we'd run dbt tests' — that's one layer, not the system.
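A minimal sketch of the three layers as plain-Python predicates; in practice these would live in dbt tests or a Great Expectations suite, and the thresholds and column names here are illustrative:

```python
def volumetric_check(rows, min_rows=100):
    """Layer 1: did roughly the expected volume arrive?"""
    return len(rows) >= min_rows

def structural_check(rows, required_cols=("customer_id", "revenue")):
    """Layer 2: required columns present and non-NULL."""
    return all(
        all(col in r and r[col] is not None for col in required_cols)
        for r in rows
    )

def semantic_check(rows, known_customers):
    """Layer 3: business rules - revenue non-negative, customer must exist."""
    return all(
        r["revenue"] >= 0 and r["customer_id"] in known_customers
        for r in rows
    )

rows = [{"customer_id": "c1", "revenue": 120.0}] * 150
known = {"c1", "c2"}
results = {
    "volumetric": volumetric_check(rows),
    "structural": structural_check(rows),
    "semantic": semantic_check(rows, known),
}
```

A batch that passes layers 1 and 2 can still fail layer 3, which is exactly why 'we'd run dbt tests' alone scores poorly.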

  7. How do you decide between batch and streaming pipelines?

    Cost-complexity-latency trade-off. Batch wins when: latency requirements are >15 minutes, transformations are complex, source data arrives in batches anyway, the team doesn't have dedicated streaming expertise. Streaming wins when: latency requirements are <1 minute, source data is genuinely event-driven, the value of fresh data justifies the operational complexity. Strong candidates name specific examples — fraud detection (streaming), monthly financial reporting (batch), real-time recommendations (streaming), customer-segmentation refresh (batch). They also mention the hybrid pattern — Lambda or Kappa architectures, micro-batches as a middle ground. Wrong answer: 'streaming is always better.'
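The trade-off above can be written down as a crude rule of thumb. Real decisions also weigh cost and the team's on-call capacity, so treat this as a conversation aid, not a decision engine:

```python
def pipeline_mode(latency_seconds, event_driven_source, has_streaming_expertise):
    """Crude heuristic over the latency / source-shape / expertise axes."""
    if latency_seconds < 60 and event_driven_source and has_streaming_expertise:
        return "streaming"
    if latency_seconds > 15 * 60:
        return "batch"
    return "micro-batch"  # the middle ground named above

fraud = pipeline_mode(5, event_driven_source=True, has_streaming_expertise=True)
reporting = pipeline_mode(24 * 3600, event_driven_source=False, has_streaming_expertise=False)
```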

  8. Tell me about a time you handled a data incident.

    Behavioural with technical depth. Strong answers describe a specific incident — schema change broke a downstream dashboard, data drift caused a model to misbehave, an upstream service started sending duplicate events. Cover: how you noticed, the triage path, the fix, the post-mortem and the systemic change you made. Strong candidates mention runbooks, incident timelines, and what they learned about the data system as a whole. Weak candidates describe themselves as having 'fixed it' without naming the systemic improvement. The strongest signal: a candidate who can describe the SECOND or third related incident they prevented because of the post-mortem from the first.

  9. How do you partner with analysts and data scientists?

    Strong answers acknowledge the structural tension: Data Engineers are often deeper in infrastructure, while analysts and scientists are deeper in business context. Good Data Engineers translate between the two — they make data products that analysts can self-serve from, they write documentation analysts can use, they invest in semantic-layer tooling that hides infrastructure complexity. Strong candidates describe specific patterns — dbt model documentation as the contract, regular office hours, embedded analytics-engineer relationships within squads. Wrong answer: 'I tell them what they can have and they ask for it' — panels score that as low collaboration aptitude.

  10. How do you optimise a slow Snowflake / BigQuery query?

    Strong answers cover: (1) query plan inspection — Snowflake's QUERY_HISTORY and query profile, BigQuery's execution graph; (2) partition pruning — making sure the WHERE clause hits the partition key; (3) clustering keys — relevant for high-cardinality filters; (4) sort order in window functions; (5) materialisation strategy — view vs table vs incremental dbt model; (6) warehouse / slot sizing — sometimes the right answer is a bigger warehouse for the right query, but only after the SQL has been optimised. Strong candidates name specific cost reductions they achieved. Wrong answer: 'I'd add an index' — Snowflake doesn't do indexes the way Postgres does.
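Point (2), partition pruning, is worth being able to demonstrate on a whiteboard. The toy below models a date-partitioned table and shows that a filter on the partition key means only matching partitions are ever read; the layout and values are hypothetical:

```python
# A date-partitioned table: partition key -> rows in that partition.
partitions = {
    "2026-01-01": [{"amount": 10}, {"amount": 20}],
    "2026-01-02": [{"amount": 5}],
    "2026-01-03": [{"amount": 40}],
}

def scan_with_pruning(partitions, wanted_dates):
    """Only open partitions whose key matches the predicate - the engine's
    equivalent of a WHERE clause that hits the partition column."""
    scanned = 0
    total = 0
    for key, rows in partitions.items():
        if key not in wanted_dates:
            continue  # pruned: this partition is never read
        scanned += 1
        total += sum(r["amount"] for r in rows)
    return total, scanned

total, scanned = scan_with_pruning(partitions, {"2026-01-02", "2026-01-03"})
```

If the WHERE clause filters on a derived expression instead of the raw partition key, the engine often can't prune and every partition gets scanned, which is the usual cause of the cost blow-ups candidates are asked about.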

  11. How do you think about data governance in 2026?

    Panels ask this increasingly because UK GDPR enforcement and the AI Act have raised the stakes. Strong answers cover: (1) data discovery — catalogues that let people find what they need (Atlan, Collibra, OpenMetadata); (2) access control — role-based access at warehouse level, ideally tied to identity provider; (3) PII handling — masking or tokenisation of sensitive columns, with policies enforced in the data layer not application layer; (4) lineage — knowing where every column comes from and where it flows to. Strong candidates mention specific UK regulatory context (DPIA requirements, retention rules, right-to-erasure implementations). Wrong answer: 'we lock down access' — that's defensive, not governed.
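For layer (3), deterministic tokenisation is a common pattern: replace the raw value with a keyed hash so joins across tables still work but the PII itself never leaves the data layer. The key and column below are placeholders; a real key lives in a secrets manager:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-in-a-secrets-manager"  # placeholder, never hardcode

def tokenise(value: str) -> str:
    """Deterministic keyed hash: same input -> same token, so joins on the
    tokenised column survive, but the raw value can't be recovered
    without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

email = "jane@example.com"
t1 = tokenise(email)
t2 = tokenise(email)  # identical token: downstream joins keep working
```

Determinism is also what makes right-to-erasure tractable: delete the key-to-identity mapping (or rotate the key) and the tokens become unlinkable.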

  12. What questions do you have for us?

    Strong Data Engineer questions are technical and operational: what's the team's data quality bar, how do you handle schema breakage incidents, how does the team partner with analytics and ML, what's the most contentious technical decision the team has made in the last six months, what's your stance on data contracts and backwards compatibility. Avoid generic questions about culture or growth. The strongest signal is a question that couldn't be answered by anyone outside the team — that shows research and serious intent.

How to use these answers

Data Engineer interviews reward technical depth across the stack and operational specificity. Build six to eight personal stories before the interview — each tagged for the competencies above (pipeline ownership, debugging, schema-change handling, data quality, stakeholder partnership). Quantify outcomes — pipeline runtime reductions, cost savings, incident resolution times, schema-error rates. The mistake I see kill the most Data Engineer offers is theoretical answers without production specifics. "In general we'd..." loses to "At my last company, we cut Airflow DAG runtime from 4h to 22min by switching to incremental dbt models with proper clustering keys." Specifics get hired.
