Data Engineer

Updated for 2026: Data Engineer interview questions and answers covering core skills, tools, and best practices for roles in the US, Europe & Canada.

20 Questions
Medium · de-etl-vs-elt

ETL vs ELT: what’s the difference and when do you choose each?

ETL transforms data before loading it into the warehouse; ELT loads raw data first and transforms it inside the warehouse. ELT is common with modern warehouses (Snowflake/BigQuery) and tools like dbt. Choose based on data volume, governance needs, transformation complexity, and cost. Always design for incremental processing and observability.

Tags: ETL, Data Warehousing, Architecture
Easy · de-warehouse-vs-lake

Data warehouse vs data lake: what are the differences and use cases?

Warehouses store structured, curated data for analytics. Lakes store raw or semi-structured data cheaply. Warehouses optimize for SQL analytics and governance; lakes optimize for flexibility and storage cost. Many teams build lakehouse architectures to combine both. Choose based on query patterns, governance, and cost constraints.

Tags: Data Warehouse, Data Lake, Architecture
Medium · de-lakehouse

What is a lakehouse and what problems does it solve?

A lakehouse combines lake storage with warehouse-style table formats and governance. It enables:
- Reliable ACID tables on object storage
- Faster analytics on semi-structured data
- A single source for BI + ML
Common technologies include Delta Lake, Apache Iceberg, and Apache Hudi.
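For example, a minimal PySpark sketch of an ACID upsert into a Delta table (assuming the delta-spark package is installed; the local path, table, and column names are illustrative stand-ins for an object-storage location):

```python
# Minimal lakehouse sketch: ACID MERGE into a Delta table with PySpark.
# Assumes the delta-spark package is installed; the local path stands in
# for an object-storage location such as s3://bucket/orders.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/orders"  # illustrative; an object-store URI in production

# Initial load creates the Delta table (Parquet files plus a transaction log).
spark.createDataFrame(
    [(1, "new", 100.0), (2, "new", 50.0)], ["order_id", "status", "amount"]
).write.format("delta").mode("overwrite").save(path)

# Upsert new/changed rows with an ACID MERGE instead of rewriting the table.
updates = spark.createDataFrame(
    [(2, "shipped", 50.0), (3, "new", 75.0)], ["order_id", "status", "amount"]
)
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```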

Tags: Lakehouse, Architecture, Data Platforms
Hard · de-airflow-best-practices

What are Airflow DAG best practices for reliable data pipelines?

Reliable DAGs are idempotent, observable, and easy to backfill. Best practices:
- Use small tasks and clear dependencies
- Avoid heavy logic at the scheduler level (keep top-level DAG code light)
- Add retries/backoff and SLAs
- Use parameters for backfills
- Version DAG changes and test locally
Monitor failures, duration, and data quality to detect silent pipeline breaks.
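A minimal sketch of several of these practices using Airflow's TaskFlow API (assumes Airflow 2.4+ for the schedule argument; the DAG id, schedule, and task bodies are illustrative):

```python
# Minimal Airflow sketch: small tasks, retries with backoff, and logical-date
# parameters so reruns and backfills process a fixed window.
from datetime import datetime, timedelta

from airflow.decorators import dag, task

default_args = {
    "retries": 3,                           # retry transient failures
    "retry_delay": timedelta(minutes=5),    # backoff between attempts
    "retry_exponential_backoff": True,
}

@dag(
    dag_id="orders_daily",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,            # run backfills explicitly (airflow dags backfill)
    default_args=default_args,
)
def orders_daily():
    @task
    def extract(ds=None):
        # `ds` is the logical date, so each run or backfill covers a fixed window.
        return f"raw/orders/{ds}.json"

    @task
    def transform(path: str) -> str:
        return f"staged/{path}"

    @task
    def load(path: str) -> None:
        print(f"loading {path}")

    load(transform(extract()))

orders_daily()
```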

Tags: Airflow, Orchestration, Reliability
Hard · de-spark-partitioning

How do you optimize Spark jobs (partitioning, shuffles, caching)?

Spark performance depends on minimizing shuffles and choosing good partitioning. Key tactics:
- Partition by common filter/join keys
- Reduce wide transformations
- Use broadcast joins when appropriate
- Cache only reused datasets
Profile jobs with the Spark UI and tune based on skew, shuffle size, and executor utilization.
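A short PySpark sketch of these tactics (paths, partition count, and column names are illustrative assumptions):

```python
# PySpark sketch: partition by the join key, broadcast the small side,
# and cache only a DataFrame that is reused.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-tuning-demo").getOrCreate()

events = spark.read.parquet("/data/events")   # large fact-like dataset
users = spark.read.parquet("/data/users")     # small dimension

# Repartition the large side by the join key to control shuffle layout and skew.
events = events.repartition(200, "user_id")

# Broadcast the small table to avoid shuffling the large side for the join.
joined = events.join(F.broadcast(users), "user_id")

# Cache only because `joined` is reused by two downstream aggregations.
joined.cache()

daily = joined.groupBy("event_date").agg(F.count("*").alias("events"))
by_country = joined.groupBy("country").agg(F.countDistinct("user_id").alias("users"))

daily.write.mode("overwrite").parquet("/data/agg/daily")
by_country.write.mode("overwrite").parquet("/data/agg/by_country")
```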

Tags: Spark, Performance, Big Data
Medium · de-kafka-streaming-basics

Kafka basics: topics, partitions, offsets, and consumer groups explained.

Kafka stores events in topics, which are split into partitions.
- Partitions provide ordering (only within a single partition)
- Consumers track their position with offsets
- Consumer groups share work across partitions
Design decisions include partition keys, retention, and delivery semantics (typically at-least-once). Idempotent processing and dead-letter queues (DLQs) help handle retries safely.
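A minimal consumer sketch with the confluent-kafka client, showing consumer groups and manual offset commits for at-least-once processing (broker address, topic, and group id are illustrative):

```python
# Kafka consumer sketch: consumers sharing a group.id split the topic's
# partitions; offsets are committed only after processing succeeds,
# which gives at-least-once delivery.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-pipeline",     # members of this group share partitions
    "enable.auto.commit": False,       # commit manually after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Process the event; idempotent handling makes redelivery safe.
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```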

Tags: Kafka, Streaming, Distributed Systems
Hard · de-cdc

What is Change Data Capture (CDC) and when should you use it?

CDC captures database changes (inserts/updates/deletes) as a stream. Use CDC for:
- Near real-time analytics
- Replication into a warehouse/lake
- Event-driven integrations
Common approaches include Debezium, reading native database logs, and managed CDC connectors. Key challenges include schema evolution, ordering, and exactly-once-like processing semantics.
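As a sketch, applying Debezium-style change events to a target table (the op/before/after envelope follows Debezium's conventions; the upsert/delete helpers are hypothetical placeholders for warehouse writes):

```python
# Sketch: apply Debezium-style CDC events to a target table.
# upsert_row/delete_row are hypothetical placeholders for warehouse writes.
import json

def upsert_row(table: str, row: dict) -> None:
    print(f"UPSERT into {table}: {row}")       # placeholder for a keyed MERGE/upsert

def delete_row(table: str, key: dict) -> None:
    print(f"DELETE from {table}: {key}")       # placeholder for a keyed delete

def apply_change(raw_event: str, table: str = "orders") -> None:
    event = json.loads(raw_event)
    op = event["op"]   # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        upsert_row(table, event["after"])
    elif op == "d":
        delete_row(table, {"order_id": event["before"]["order_id"]})

# Example events as they might arrive from a CDC topic.
apply_change('{"op": "c", "before": null, "after": {"order_id": 1, "status": "new"}}')
apply_change('{"op": "u", "before": {"order_id": 1, "status": "new"}, "after": {"order_id": 1, "status": "shipped"}}')
apply_change('{"op": "d", "before": {"order_id": 1, "status": "shipped"}, "after": null}')
```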

Tags: CDC, Streaming, Architecture
Medium · de-star-schema

What is a star schema and why is it common in analytics warehouses?

A star schema uses a fact table (events/measures) linked to dimension tables (entities like user, product, time). It simplifies BI queries, improves performance, and supports consistent metrics. Design tips: choose the correct grain for facts, manage slowly changing dimensions, and document metric definitions to avoid inconsistent reporting.
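A tiny illustration using SQLite as a stand-in for a warehouse (table and column names are invented): one fact table at order-line grain joined to two dimensions:

```python
# Star schema sketch: fact_sales holds measures at a chosen grain,
# dim_customer and dim_product hold descriptive attributes.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    order_id INTEGER, customer_key INTEGER, product_key INTEGER,
    order_date TEXT, quantity INTEGER, amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'enterprise'), (2, 'Zed', 'smb');
INSERT INTO dim_product  VALUES (10, 'Widget', 'hardware'), (11, 'Gadget', 'hardware');
INSERT INTO fact_sales VALUES
    (100, 1, 10, '2025-01-01', 2, 40.0),
    (101, 2, 11, '2025-01-02', 1, 25.0);
""")

# Typical BI query: join the fact to dimensions and aggregate a measure.
rows = con.execute("""
    SELECT c.segment, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_product  p ON p.product_key  = f.product_key
    GROUP BY c.segment, p.category
""").fetchall()
print(rows)
```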

Tags: Data Modeling, Warehousing, Analytics
Medium · de-scd

Slowly Changing Dimensions (SCD): what are Type 1 and Type 2 and when do you use them?

SCDs handle dimension changes over time.
- Type 1: overwrite the old value (keep only the latest, no history)
- Type 2: preserve history with versioned rows (effective dates, current-row flag)
Type 2 is used when historical accuracy matters (e.g., a customer's segment at purchase time). Type 1 is simpler when history is not needed.
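A minimal Type 2 sketch, with SQLite as a stand-in and invented table/column names: close out the current row, then insert a new version:

```python
# SCD Type 2 sketch: when an attribute changes, expire the current row
# and insert a new versioned row with effective dates and a current flag.
from datetime import date
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer (
    customer_id INTEGER, segment TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER
)""")
con.execute("INSERT INTO dim_customer VALUES (1, 'smb', '2024-01-01', '9999-12-31', 1)")

def apply_scd2(customer_id: int, new_segment: str, as_of: str) -> None:
    current = con.execute(
        "SELECT segment FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current and current[0] == new_segment:
        return  # no change, nothing to version
    # Close out the current row, then add the new version.
    con.execute(
        "UPDATE dim_customer SET effective_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (as_of, customer_id),
    )
    con.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_segment, as_of),
    )
    con.commit()

apply_scd2(1, "enterprise", str(date(2025, 6, 1)))
print(con.execute("SELECT * FROM dim_customer ORDER BY effective_from").fetchall())
```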

Tags: Data Modeling, Warehousing, SCD
Medium · de-data-quality

How do you implement data quality checks in pipelines?

Data quality checks detect issues before they reach dashboards. Common checks:
- Schema validation
- Null/uniqueness constraints
- Freshness and volume anomalies
- Referential integrity
Automate alerts and quarantines, and track incidents to reduce repeated failures. Great Expectations and dbt tests are common approaches.
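A lightweight, hand-rolled sketch of such checks in pandas, as a stand-in for Great Expectations or dbt tests (column names and thresholds are assumptions):

```python
# Hand-rolled data-quality checks: schema, nulls, uniqueness, freshness, volume.
from datetime import datetime, timedelta, timezone
import pandas as pd

def run_checks(df: pd.DataFrame) -> list[str]:
    failures = []
    # Schema validation: required columns must exist.
    for col in ("order_id", "amount", "updated_at"):
        if col not in df.columns:
            failures.append(f"missing column: {col}")
            return failures
    # Null and uniqueness constraints on the key.
    if df["order_id"].isnull().any():
        failures.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    # Freshness: the newest row must be recent enough.
    max_ts = pd.to_datetime(df["updated_at"], utc=True).max()
    if max_ts < datetime.now(timezone.utc) - timedelta(hours=24):
        failures.append(f"stale data: latest updated_at is {max_ts}")
    # Volume anomaly: crude lower bound on row count.
    if len(df) < 1:
        failures.append("empty load")
    return failures

df = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, 20.0, None],
    "updated_at": ["2025-01-01T00:00:00Z"] * 3,
})
print(run_checks(df))   # non-empty failures would trigger alerts/quarantine
```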

Tags: Data Quality, Reliability, Pipelines
Hard · de-schema-evolution

How do you handle schema evolution in data pipelines without breaking consumers?

Schema evolution needs compatibility rules. Strategies:
- Prefer additive changes (new nullable fields)
- Version schemas when breaking changes are unavoidable
- Validate producers/consumers in CI
- Use contracts for events/tables
Document changes and coordinate rollouts so downstream jobs and BI dashboards don’t silently break.
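A sketch of a CI-style compatibility check that allows additive changes and flags removals or type changes (the schema dicts and column names are illustrative; real pipelines might pull them from a schema registry or contract file):

```python
# CI-style schema compatibility check: additive, nullable changes pass;
# removed columns, type changes, and new required columns are flagged.
def breaking_changes(old: dict, new: dict, required_new=frozenset()) -> list:
    problems = []
    for col, typ in old.items():
        if col not in new:
            problems.append(f"removed column: {col}")
        elif new[col] != typ:
            problems.append(f"type change on {col}: {typ} -> {new[col]}")
    for col in new.keys() - old.keys():
        if col in required_new:
            problems.append(f"new column {col} is required (must be nullable/defaulted)")
    return problems

old_schema = {"order_id": "bigint", "amount": "double", "status": "string"}
new_schema = {"order_id": "bigint", "amount": "float", "channel": "string"}

issues = breaking_changes(old_schema, new_schema, required_new={"channel"})
if issues:
    # In CI this would fail the build and force a versioned, coordinated rollout.
    raise SystemExit("breaking schema changes:\n" + "\n".join(issues))
```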

Tags: Schema, Governance, Reliability
Hard · de-incremental-loads

How do incremental loads work and how do you avoid duplicates?

Incremental loads process only new/changed data. Common patterns:
- Watermark columns (updated_at)
- CDC streams
- Partition-based loads
Avoid duplicates by using idempotent merges (upserts), stable keys, and exactly-once-like processing semantics where possible. Always support safe backfills.
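A minimal watermark-plus-upsert sketch using SQLite as a stand-in (table names, the updated_at watermark column, and sample rows are illustrative):

```python
# Incremental load sketch: read only rows newer than the last watermark,
# then upsert by key so reruns don't create duplicates.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src_orders (order_id INTEGER, status TEXT, updated_at TEXT);
CREATE TABLE tgt_orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
INSERT INTO src_orders VALUES
    (1, 'new', '2025-01-01'), (1, 'shipped', '2025-01-03'), (2, 'new', '2025-01-02');
""")

def incremental_load(watermark: str) -> str:
    rows = con.execute(
        "SELECT order_id, status, updated_at FROM src_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Idempotent upsert: a stable key plus ON CONFLICT makes reruns safe.
    con.executemany(
        """INSERT INTO tgt_orders (order_id, status, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(order_id) DO UPDATE SET
               status = excluded.status, updated_at = excluded.updated_at""",
        rows,
    )
    con.commit()
    # New watermark = max updated_at seen (persist it for the next run).
    return max((r[2] for r in rows), default=watermark)

wm = incremental_load("2025-01-01")   # picks up changes after the old watermark
wm = incremental_load(wm)             # rerun: no duplicates thanks to the upsert
print(con.execute("SELECT * FROM tgt_orders ORDER BY order_id").fetchall(), wm)
```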

Tags: ETL, Reliability, Data Engineering
Medium · de-pipeline-monitoring

What should you monitor in data pipelines (freshness, latency, failures)?

Monitoring prevents silent data outages. Track:
- Pipeline success/failure
- Runtime and lag
- Data freshness and volume
- Quality test failures
Set SLOs for critical datasets and build alerting that’s actionable. Include lineage so teams can see downstream impact when a dataset breaks.
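A small freshness-check sketch (dataset names, SLO thresholds, and the alert sink are assumptions; a real implementation would read timestamps from pipeline metadata or the warehouse):

```python
# Freshness monitoring sketch: compare each dataset's latest load time
# against its SLO and emit an actionable alert.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLOS = {                        # dataset -> maximum tolerated lag
    "analytics.orders": timedelta(hours=2),
    "analytics.sessions": timedelta(hours=24),
}

def check_freshness(latest_loaded_at: dict) -> list:
    alerts = []
    now = datetime.now(timezone.utc)
    for dataset, slo in FRESHNESS_SLOS.items():
        latest = latest_loaded_at.get(dataset)
        if latest is None or now - latest > slo:
            alerts.append(
                f"{dataset} is stale (latest={latest}, SLO={slo}); "
                f"check the upstream DAG and notify downstream owners"
            )
    return alerts

observed = {
    "analytics.orders": datetime.now(timezone.utc) - timedelta(hours=5),
    # analytics.sessions missing entirely, so it also alerts
}
for alert in check_freshness(observed):
    print("ALERT:", alert)   # placeholder for a paging or chat integration
```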

Tags: Monitoring, Reliability, Data Engineering
Medium · de-idempotent-jobs

What does it mean for a data pipeline job to be idempotent and why is it important?

Idempotent jobs can run multiple times without changing the final result. This matters because retries and reprocessing are normal. Techniques:
- Use deterministic outputs
- Write to staging, then swap
- Use upserts/merge with keys
Idempotency reduces data corruption risk and makes backfills safer and faster.
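A minimal write-to-staging-then-swap sketch with SQLite standing in for the warehouse (table names and the validation rule are illustrative):

```python
# Idempotency sketch: build results in a staging table, validate, then swap it
# in atomically. Rerunning the whole job yields the same final table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")

def build_daily_revenue() -> None:
    con.executescript("""
        DROP TABLE IF EXISTS daily_revenue_staging;
        CREATE TABLE daily_revenue_staging (day TEXT, revenue REAL);
        INSERT INTO daily_revenue_staging VALUES ('2025-01-01', 40.0), ('2025-01-02', 25.0);
    """)
    # Basic validation before promotion.
    (count,) = con.execute("SELECT COUNT(*) FROM daily_revenue_staging").fetchone()
    if count == 0:
        raise RuntimeError("staging table is empty; aborting swap")
    # Swap staging into place; readers only ever see a complete table.
    con.executescript("""
        BEGIN;
        DROP TABLE daily_revenue;
        ALTER TABLE daily_revenue_staging RENAME TO daily_revenue;
        COMMIT;
    """)

build_daily_revenue()
build_daily_revenue()   # safe to rerun: the final table is identical
print(con.execute("SELECT * FROM daily_revenue").fetchall())
```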

Tags: Reliability, Best Practices, Pipelines
Hard · de-backfills

How do you run safe backfills and reprocess historical data?

Backfills should be controlled and observable. Best practices:
- Parameterize time ranges
- Use staging tables/partitions
- Rate limit to protect warehouses
- Validate quality before promoting
Always communicate downstream impact (dashboards, ML features) and ensure you can roll back if results are wrong.
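A sketch of a parameterized backfill runner (the process/validate/promote helpers are hypothetical placeholders for real pipeline steps):

```python
# Backfill sketch: parameterized date range, one partition at a time, with a
# validation gate before promotion and crude rate limiting.
from datetime import date, timedelta
import time

def process_partition(day: date) -> str:
    return f"staging.orders_{day:%Y%m%d}"          # placeholder: rebuild into staging

def validate_partition(staging_table: str) -> bool:
    return True                                    # placeholder: row counts, quality tests

def promote_partition(staging_table: str, day: date) -> None:
    print(f"promoting {staging_table} for {day}")  # placeholder: swap/merge into prod

def backfill(start: date, end: date, pause_seconds: float = 1.0) -> None:
    day = start
    while day <= end:
        staging = process_partition(day)
        if not validate_partition(staging):
            raise RuntimeError(f"validation failed for {day}; stopping backfill")
        promote_partition(staging, day)
        time.sleep(pause_seconds)                  # rate limit between partitions
        day += timedelta(days=1)

backfill(date(2025, 1, 1), date(2025, 1, 3), pause_seconds=0.1)
```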

Tags: Backfill, Reliability, Data Engineering
Hard · de-security-pii

How do you protect PII in data pipelines and analytics systems?

PII protection requires governance and technical controls. Practices:
- Data classification and access controls
- Encryption at rest/in transit
- Masking/tokenization
- Auditing and least privilege
Also enforce retention policies and consent requirements. Build privacy by design to avoid accidental exposure in logs and dashboards.
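A small masking/tokenization sketch (field names and key handling are illustrative; the key would come from a secrets manager, not code):

```python
# PII sketch: deterministic tokenization with a keyed hash (HMAC) plus simple
# masking before rows reach analytics tables or logs.
import hashlib
import hmac

SECRET_KEY = b"load-from-secrets-manager"   # never hard-code in real pipelines

def tokenize(value: str) -> str:
    # Same input -> same token, so joins still work without exposing raw PII.
    return hmac.new(SECRET_KEY, value.lower().encode(), hashlib.sha256).hexdigest()

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

row = {"user_id": 42, "email": "jane.doe@example.com", "amount": 19.99}
safe_row = {
    "user_id": row["user_id"],
    "email_token": tokenize(row["email"]),      # analytics-safe join key
    "email_masked": mask_email(row["email"]),   # readable but non-identifying
    "amount": row["amount"],
}
print(safe_row)
```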

Tags: Security, Governance, Compliance
Medium · de-dbt-best-practices

What are dbt best practices for analytics engineering and maintainable models?

dbt best practices focus on modular, tested transformations.
- Use sources and staging layers
- Keep models small and reusable
- Add tests and documentation
- Use incremental models where needed
- Use version control and CI
A strong dbt project improves trust in metrics and reduces time spent debugging broken dashboards.

Tags: dbt, Transformations, Analytics Engineering
Hard · de-sql-performance

How do you tune SQL performance in warehouses (partitioning, clustering, pruning)?

Warehouse performance depends on scanning less data. Tactics:
- Partition by common filters
- Cluster/sort by join keys
- Write queries that enable pruning
- Avoid wide SELECT *
Always measure with query plans and cost metrics. Optimize the highest-cost queries that affect dashboards and SLAs.

Tags: SQL, Performance, Warehousing
Medium · de-orchestration-vs-choreography

Orchestration vs choreography in data pipelines: what’s the difference?

Orchestration uses a central scheduler (e.g., Airflow) to control task order. Choreography uses events, where each step triggers the next. Orchestration is easier to reason about and monitor. Choreography can scale better for event-driven systems but is harder to debug. Choose based on workflow complexity, team maturity, and observability capabilities.

Tags: Architecture, Orchestration, Pipelines
Hard · de-warehouse-cost-optimization

How do you control and optimize costs in cloud data warehouses?

Cost optimization focuses on scanning less data and using compute efficiently. Tactics:
- Partition/cluster to reduce scanned bytes
- Use incremental models instead of full refreshes
- Set workload budgets/quotas
- Schedule heavy jobs off-peak
- Monitor the most expensive queries and fix them
Treat cost as a first-class metric alongside freshness and latency.

Tags: Cost, Warehousing, Performance