Machine Learning Engineer
Updated for 2026: Machine Learning Engineer interview questions and answers covering core skills, tools, and best practices for roles in the US, Europe & Canada.
How do you design a model serving system for low latency and high reliability?
Model serving is production engineering, not just ML. Key decisions:
- Batch vs real-time inference
- Model format and runtime (ONNX/TensorRT)
- Caching and request shaping
- Autoscaling and timeouts
- Canary releases and rollback
Monitor latency (p95/p99), errors, and input drift. Keep model/version metadata so you can reproduce and debug predictions quickly.
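A minimal real-time serving sketch follows, assuming FastAPI as the web framework; the model loader, version tag, and feature payload are placeholders for illustration, not a specific production setup.

```python
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

MODEL_VERSION = "fraud-clf-1.4.2"            # hypothetical version tag


def load_model():
    # Stand-in for loading an exported model into a fast runtime (ONNX/TensorRT).
    return lambda feats: sum(feats) / max(len(feats), 1)


app = FastAPI()
model = load_model()                          # load once at startup, not per request


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    start = time.perf_counter()
    score = model(req.features)
    latency_ms = (time.perf_counter() - start) * 1000
    # Version + request id make every prediction traceable and reproducible.
    return {
        "request_id": str(uuid.uuid4()),
        "model_version": MODEL_VERSION,
        "score": score,
        "latency_ms": round(latency_ms, 3),
    }
```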
How do you build reliable training pipelines that are reproducible?
Reliable training pipelines are deterministic and versioned. Key practices:
- Version data snapshots and labels
- Version code + config
- Track features and preprocessing
- Log metrics and artifacts
Use pipeline orchestration (Airflow/Kubeflow) and ensure the same transformations run in training and inference to avoid training-serving skew.
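A sketch of pinning a run's inputs, assuming the data snapshot is a local Parquet file and the code lives in git; the paths and config values are illustrative.

```python
# Record what a training run depends on, so the run can be reproduced later.
import hashlib
import json
import subprocess
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a data snapshot so the exact inputs are pinned."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_manifest(data_path: Path, config: dict) -> dict:
    git_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "data_sha256": file_hash(data_path),
        "git_commit": git_sha,
        "config": config,
    }


if __name__ == "__main__":
    manifest = run_manifest(Path("data/train.parquet"), {"lr": 0.01, "epochs": 10})
    Path("artifacts").mkdir(exist_ok=True)
    Path("artifacts/run_manifest.json").write_text(json.dumps(manifest, indent=2))
```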
What is a feature store and when does it make sense to use one?
A feature store centralizes feature definitions and provides consistent features for offline training and online inference. It helps reduce duplication, prevent training-serving skew, and improve governance. Use it when many models share features, feature computation is complex, or you need strict consistency and monitoring.
How do you monitor model drift and data drift in production?
Drift monitoring tracks changes in inputs and outputs over time. Monitor:
- Input feature distributions
- Prediction distribution
- Performance proxy metrics
- Ground-truth metrics when labels arrive
Use alerts with thresholds and investigate root causes (product changes, seasonality, data bugs). Drift monitoring should trigger analysis, not automatic retraining by default.
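One common numeric check is the Population Stability Index (PSI) between a reference window and recent production data. The sketch below assumes NumPy and a single numeric feature; the 0.2 alert threshold is a rule of thumb, not a standard.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range values
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)         # avoid log(0) and division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


if __name__ == "__main__":
    ref = np.random.normal(0, 1, 10_000)
    cur = np.random.normal(0.3, 1.2, 10_000)       # shifted distribution
    score = psi(ref, cur)
    if score > 0.2:
        print(f"PSI={score:.3f}: investigate before considering retraining")
```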
How do you version and promote models safely (dev → staging → prod)?
Treat models like deployable artifacts. Use:
- Model registry with semantic versioning
- Automated eval gates (quality, bias, latency)
- Canary/shadow deployments
- Rollback to known-good versions
Always attach metadata: training data hash, code version, feature set, and evaluation results for auditability.
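A minimal promotion gate might look like the sketch below; the metric names, thresholds, and metadata fields are assumptions for illustration.

```python
# A candidate moves to the next stage only if it avoids a quality regression
# and meets the latency budget; values here are illustrative.
from dataclasses import dataclass


@dataclass
class EvalReport:
    auc: float
    p95_latency_ms: float
    data_hash: str
    code_version: str


def can_promote(candidate: EvalReport, production: EvalReport,
                max_p95_ms: float = 50.0) -> bool:
    quality_ok = candidate.auc >= production.auc          # no quality regression
    latency_ok = candidate.p95_latency_ms <= max_p95_ms   # meets the SLO
    return quality_ok and latency_ok


prod = EvalReport(auc=0.871, p95_latency_ms=42.0, data_hash="ab12", code_version="v31")
cand = EvalReport(auc=0.878, p95_latency_ms=47.5, data_hash="cd34", code_version="v32")
print("promote to staging:", can_promote(cand, prod))
```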
Online vs offline evaluation: how do you decide if a model is better?
Offline metrics (AUC, RMSE) measure historical performance, but online metrics measure real product impact. Best practice:
- Start with offline gating
- Run shadow/canary
- Validate with A/B tests when possible
Watch for distribution shift and metric mismatch: improving offline scores doesn’t guarantee better user outcomes.
Batch inference: how do you design scalable offline scoring pipelines?
Batch inference scores many entities on a schedule. Design points:
- Partition data and parallelize compute
- Ensure idempotent outputs (upserts)
- Track model version per score
- Monitor runtime, cost, and failures
Use Spark/Beam for large jobs and keep outputs consistent with online features when you have both paths.
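A sketch of an idempotent partition-scoring step, using pandas for brevity; the score() function, column names, and upsert key are illustrative.

```python
# Each partition is scored independently and keyed by
# (entity_id, model_version, run_date), so re-running a partition
# upserts the same rows instead of duplicating them.
import pandas as pd

MODEL_VERSION = "ranker-2.1.0"   # hypothetical version tag


def score(df: pd.DataFrame) -> pd.Series:
    return df["feature_a"] * 0.7 + df["feature_b"] * 0.3   # stand-in model


def score_partition(df: pd.DataFrame, run_date: str) -> pd.DataFrame:
    out = pd.DataFrame({
        "entity_id": df["entity_id"],
        "score": score(df),
        "model_version": MODEL_VERSION,
        "run_date": run_date,
    })
    # Natural key for the upsert; a real job would write this to a table
    # with a primary key on (entity_id, model_version, run_date).
    return out.drop_duplicates(subset=["entity_id", "model_version", "run_date"])


part = pd.DataFrame({"entity_id": [1, 2], "feature_a": [0.2, 0.9], "feature_b": [0.5, 0.1]})
print(score_partition(part, run_date="2026-01-15"))
```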
What should an ML CI/CD pipeline include beyond standard software CI?
ML CI/CD must validate both code and model behavior. Include:
- Data and schema checks
- Reproducible training
- Automated evaluation gates
- Bias/fairness checks
- Packaging + deploy
Also test inference latency, memory, and correctness on representative inputs before promotion.
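Two checks that often catch ML-specific regressions are sketched below as pytest-style tests; the expected schema, data path, and latency budget are assumptions.

```python
# CI checks beyond unit tests: schema validation of the training data and a
# latency gate on representative inputs.
import time

import pandas as pd

EXPECTED_COLUMNS = {"user_id": "int64", "amount": "float64", "label": "int64"}


def test_training_data_schema():
    df = pd.read_parquet("data/train.parquet")           # placeholder path
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"{col} has dtype {df[col].dtype}"


def test_inference_latency_budget():
    model = lambda x: sum(x)                              # stand-in for the packaged model
    sample = [0.1] * 20
    start = time.perf_counter()
    for _ in range(100):
        model(sample)
    avg_ms = (time.perf_counter() - start) / 100 * 1000
    assert avg_ms < 10, f"average latency {avg_ms:.2f} ms exceeds budget"
```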
What is training-serving skew and how do you prevent it?
Training-serving skew happens when features or preprocessing differ between training and production inference. Prevent it by:
- Sharing feature code (feature store or libraries)
- Versioning transforms
- Validating feature distributions
- Running end-to-end tests on sample requests
Skew is a top cause of “great offline, bad in prod” failures.
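The simplest guard is one shared transform imported by both the training job and the online service, as in this sketch; the feature logic shown is illustrative.

```python
import math


def transform(raw: dict) -> dict:
    """Single source of truth for feature computation."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }


# Training: applied row by row over the historical dataset.
train_rows = [{"amount": 120.0, "day_of_week": 6}, {"amount": 8.5, "day_of_week": 2}]
train_features = [transform(r) for r in train_rows]

# Serving: the exact same function is applied to the live request payload.
request_payload = {"amount": 42.0, "day_of_week": 5}
online_features = transform(request_payload)
print(train_features, online_features)
```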
How do you optimize models for latency and cost (quantization, distillation, batching)?
Optimize the full system, not just the model. Techniques:
- Quantization (INT8)
- Distillation to smaller models
- Efficient runtimes (ONNX/TensorRT)
- Request batching
Trade off accuracy vs latency vs cost. Validate with representative traffic and monitor p95/p99 latency.
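As one example, post-training dynamic quantization in PyTorch can convert Linear layers to INT8 in a few lines; the toy model below stands in for a real one, and any accuracy impact still has to be validated on representative data.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize Linear layers to INT8
)

x = torch.randn(1, 128)
with torch.no_grad():
    fp32_out = model(x)
    int8_out = quantized(x)

print("fp32:", fp32_out.item(), "int8:", int8_out.item())
```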
Why is experiment tracking important and what should you log?
Experiment tracking makes results reproducible and comparable. Log:
- Dataset version/hash
- Code + config
- Metrics and plots
- Model artifacts
- Runtime environment
Without tracking, teams can’t reliably reproduce “best” results or debug production regressions.
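A minimal tracked run, assuming MLflow as the tracking backend; the experiment name, dataset path, params, metric value, and artifact path are all illustrative.

```python
import hashlib
from pathlib import Path

import mlflow

data_path = Path("data/train.parquet")                      # placeholder path
data_hash = hashlib.sha256(data_path.read_bytes()).hexdigest()[:12]

mlflow.set_experiment("churn-model")
with mlflow.start_run():
    mlflow.log_params({"lr": 0.01, "max_depth": 6, "data_hash": data_hash})
    # ... train the model here ...
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("artifacts/model.pkl")              # trained model file (placeholder)
```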
How do you manage labeling workflows and reduce label noise?
Label quality often matters more than model choice. Practices:
- Clear labeling guidelines
- Inter-annotator agreement checks
- Active learning to focus effort
- Audits and gold sets
Track label drift over time and revisit definitions when product requirements change.
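A quick agreement check between two annotators with Cohen's kappa (scikit-learn); the labels are a toy example and the 0.7 threshold is a rule of thumb rather than a fixed standard.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa={kappa:.2f}")
if kappa < 0.7:
    print("Low agreement: revisit the labeling guidelines before training.")
```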
What ML security risks should engineers consider (poisoning, adversarial, prompt injection)?
ML systems have unique attack surfaces. Risks:
- Data poisoning (training data)
- Adversarial inputs
- Model extraction
- Prompt injection for LLM apps
Mitigations include input validation, rate limiting, monitoring, access controls, and separating untrusted inputs from privileged tools.
What changes when deploying ML models on edge devices (mobile/IoT)?
Edge deployment constraints include limited CPU/GPU, memory, and battery. You often need:
- Smaller models (distillation)
- Quantization
- On-device caching
- Privacy-safe logging
Measure latency on real devices and plan updates carefully (versioning, rollback, compatibility).
What is a shadow deployment for ML models and when is it useful?
A shadow deployment runs a new model in production alongside the current model, but its predictions don’t affect users. It’s useful to validate latency, stability, and distribution differences on real traffic before a canary or full rollout. Compare predictions, monitor drift, and catch integration bugs safely.
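A sketch of the routing pattern: the primary model serves the response while the shadow model's output is only logged for comparison; both model functions here are stand-ins.

```python
import logging

logger = logging.getLogger("shadow")
logging.basicConfig(level=logging.INFO)


def predict_primary(features: dict) -> float:
    return 0.82    # stand-in for the current production model


def predict_shadow(features: dict) -> float:
    return 0.79    # stand-in for the candidate model


def handle_request(features: dict) -> float:
    primary = predict_primary(features)
    try:
        shadow = predict_shadow(features)
        logger.info("primary=%.3f shadow=%.3f delta=%.3f",
                    primary, shadow, primary - shadow)
    except Exception:                         # shadow failures must never hurt users
        logger.exception("shadow model failed")
    return primary                            # only the primary result is served


print(handle_request({"amount": 42.0}))
```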
How do ML engineers validate data and schemas to prevent pipeline breakages?
Data validation prevents silent failures. Checks include:
- Schema/type validation
- Range and null constraints
- Cardinality and uniqueness
- Drift against reference distributions
Run checks at ingestion and before training/serving. Fail fast on critical breaks and alert with clear ownership.
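A minimal validation pass run at ingestion might look like this; the expected schema, bounds, and example data are assumptions for illustration.

```python
import pandas as pd

SCHEMA = {"user_id": "int64", "amount": "float64", "country": "object"}


def validate(df: pd.DataFrame) -> list[str]:
    errors = []
    for col, dtype in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "amount" in df.columns:
        if df["amount"].isna().any():
            errors.append("amount contains nulls")
        if (df["amount"] < 0).any():
            errors.append("amount contains negative values")
    return errors


df = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, -3.0], "country": ["US", "CA"]})
problems = validate(df)
if problems:
    # Fail fast on critical breaks so bad data never reaches training or serving.
    raise ValueError("data validation failed: " + "; ".join(problems))
```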
Why do you log features and predictions in production, and what should you avoid logging?
Logging features and predictions enables debugging, drift monitoring, and offline evaluation when labels arrive. Avoid logging sensitive PII and secrets. Use sampling, hashing, and access controls. Include model/version IDs and request correlation IDs so you can trace issues end-to-end.
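A sketch of one structured log record; the field names, version tag, and 10% sample rate are assumptions.

```python
import hashlib
import json
import random
import time

SAMPLE_RATE = 0.10


def log_prediction(request_id: str, user_id: str, features: dict, score: float) -> None:
    if random.random() > SAMPLE_RATE:                      # sample to control volume
        return
    record = {
        "ts": time.time(),
        "request_id": request_id,                          # correlation id for tracing
        "model_version": "churn-3.2.0",                    # hypothetical version tag
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],  # no raw PII
        "features": features,
        "score": score,
    }
    print(json.dumps(record))                              # stand-in for a log sink


log_prediction("req-123", "user-42", {"tenure_days": 180}, 0.31)
```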
When should you use GPU inference and what are the operational trade-offs?
GPU inference helps for large models and high throughput, but adds cost and scheduling complexity. Trade-offs:
- Cold starts and batching needs
- Capacity planning
- Multi-tenant isolation
Measure end-to-end latency and cost per request; sometimes CPU + quantization is cheaper and simpler.
How do you balance model quality, latency, and cost in production?
Treat it as a product trade-off. Approaches:
- Smaller models or distillation
- Quantization
- Caching and batching
- Multi-model routing (fast model first, fallback to strong model)
Define SLOs (p95 latency) and cost budgets, then tune architecture and model choice to meet them.
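A sketch of two-tier routing, where a cheap model handles most traffic and only uncertain requests escalate to the expensive model; both models and the uncertainty band are placeholders.

```python
def fast_model(features: dict) -> float:
    return 0.91          # stand-in for a small, cheap model


def strong_model(features: dict) -> float:
    return 0.88          # stand-in for a large, slower model


def predict(features: dict, uncertainty_band: float = 0.15) -> tuple[float, str]:
    score = fast_model(features)
    # Near the decision boundary the cheap model is least trustworthy,
    # so escalate only those requests to the expensive model.
    if abs(score - 0.5) < uncertainty_band:
        return strong_model(features), "strong"
    return score, "fast"


score, route = predict({"amount": 42.0})
print(f"score={score:.2f} served_by={route}")
```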
How do you measure and mitigate bias in deployed ML systems?
Bias mitigation starts with measurement. Steps:
- Evaluate metrics across groups
- Audit data collection and labels
- Add constraints or reweighting
- Monitor fairness over time
Document decisions and involve stakeholders. Fairness is an ongoing process, not a one-time test.
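A per-group evaluation sketch using scikit-learn; the data and the accuracy-gap threshold are illustrative, and real audits would use fairness metrics appropriate to the task.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 0, 1],
})

# Compute the same metric for each group and flag large gaps.
per_group = {
    group: accuracy_score(g["y_true"], g["y_pred"])
    for group, g in df.groupby("group")
}
print(per_group)

gap = max(per_group.values()) - min(per_group.values())
if gap > 0.05:
    print(f"Accuracy gap of {gap:.2f} across groups: audit data and labels.")
```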