Data Scientist
Updated for 2026: Data Scientist interview questions and answers covering core skills, tools, and best practices for roles in the US, Europe & Canada.
What is the bias–variance tradeoff and how does it affect model performance?
Bias is error from overly simple assumptions; variance is error from sensitivity to training data. High bias → underfitting. High variance → overfitting. You balance this using:
- Model choice and regularization
- More data
- Feature engineering
- Cross-validation
Interview tip: explain how you detect overfitting via training vs validation metrics.
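A minimal sketch (assuming scikit-learn is available) of the detection step: compare training and validation accuracy at different model complexities on synthetic data; a large gap signals high variance, low scores on both signal high bias.

```python
# Sketch: compare training vs validation scores to spot under/overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for depth in (2, 5, None):  # shallow = higher bias, unlimited depth = higher variance
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    gap = model.score(X_train, y_train) - model.score(X_val, y_val)
    print(f"max_depth={depth}: train-val gap = {gap:.3f}")
```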
How do you split data into train/validation/test sets and why does it matter?
The training set is used to fit the model, the validation set to tune choices (hyperparameters, features), and the test set to estimate final performance. Good splits prevent leakage and ensure realistic evaluation. For time-based data, split by time (not randomly). For grouped data (e.g., multiple rows per user), split by group so the same entity never appears in more than one set.
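A short sketch of both special cases, assuming pandas and scikit-learn and using a synthetic frame with illustrative column names (`user_id`, `ts`):

```python
# Sketch: group-aware and time-ordered splits to avoid leakage.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "user_id": np.random.randint(0, 100, 1000),
    "ts": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "target": np.random.randint(0, 2, 1000),
})

# Grouped split: each user lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["user_id"]))

# Time-based split: train on the past, evaluate on the future.
cutoff = df["ts"].quantile(0.8)
train_time, test_time = df[df["ts"] <= cutoff], df[df["ts"] > cutoff]
```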
What is cross-validation and when should you use it?
Cross-validation estimates model performance by repeatedly training on all but one fold and validating on the held-out fold, then averaging the results. Use it when:
- Data is limited
- You need a robust comparison across models
For time series, use time-series CV (rolling or expanding windows) instead of random K-fold to preserve temporal order.
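A minimal sketch of both splitters with scikit-learn (assumed available), on synthetic data just to show the API:

```python
# Sketch: K-fold CV for i.i.d. data, TimeSeriesSplit for temporally ordered data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))  # order-preserving folds

print(kfold_scores.mean(), ts_scores.mean())
```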
Which classification metrics should you use (accuracy, precision, recall, F1, AUC) and why?
Metric choice depends on the business cost of errors.
- Accuracy: misleading on imbalanced data
- Precision: minimize false positives
- Recall: minimize false negatives
- F1: balance precision and recall
- ROC-AUC / PR-AUC: ranking quality; PR-AUC is better for imbalance
Always tie metrics to business impact and choose thresholds explicitly.
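A sketch of computing all of these on an imbalanced synthetic problem with scikit-learn (assumed available), with the threshold made explicit:

```python
# Sketch: common classification metrics on an imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)  # ~5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)  # threshold chosen explicitly, not hidden

print("accuracy ", accuracy_score(y_te, pred))
print("precision", precision_score(y_te, pred))
print("recall   ", recall_score(y_te, pred))
print("f1       ", f1_score(y_te, pred))
print("roc-auc  ", roc_auc_score(y_te, proba))
print("pr-auc   ", average_precision_score(y_te, proba))
```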
MAE vs RMSE vs R²: how do you choose regression metrics?
MAE measures average absolute error (robust to outliers). RMSE penalizes large errors more. R² measures explained variance but can be misleading outside context. Choose based on:
- Whether large errors are especially costly
- Interpretability in business units
- Outlier sensitivity
Always compare against a baseline (mean/naive model).
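A short sketch comparing a model against a mean baseline on synthetic data, assuming scikit-learn and NumPy are available:

```python
# Sketch: MAE, RMSE and R² for a model vs a naive "predict the mean" baseline.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
baseline = np.full_like(y_te, y_tr.mean())  # always predict the training mean

for name, p in [("model", pred), ("baseline", baseline)]:
    rmse = mean_squared_error(y_te, p) ** 0.5
    print(name, mean_absolute_error(y_te, p), rmse, r2_score(y_te, p))
```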
What is feature engineering and what are common high-impact techniques?
Feature engineering transforms raw data into signals models can learn from. High-impact techniques:
- Aggregations (counts, rolling windows)
- Encoding categoricals (one-hot, target encoding)
- Text features (TF-IDF, embeddings)
- Interaction terms
Measure impact with validation and watch for leakage (features that contain future information).
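A pandas sketch of a few of these on a tiny synthetic transactions table (column names are illustrative); note the shift before the rolling window so only past rows feed each feature:

```python
# Sketch: common feature-engineering transforms with pandas.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                          "2024-01-01", "2024-01-02"]),
    "amount": [10.0, 12.0, 8.0, 50.0, 55.0],
    "channel": ["web", "app", "web", "app", "app"],
}).sort_values(["user_id", "ts"])

# Aggregation: per-user purchase count so far (only past rows -> no leakage).
df["prior_purchases"] = df.groupby("user_id").cumcount()
# Rolling window: mean of the previous two amounts, shifted to exclude the current row.
df["amount_roll2"] = (df.groupby("user_id")["amount"]
                        .transform(lambda s: s.shift(1).rolling(2, min_periods=1).mean()))
# Categorical encoding: one-hot for a low-cardinality column.
df = pd.get_dummies(df, columns=["channel"])
```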
How do you handle missing data and when is imputation risky?
First understand why data is missing (MCAR/MAR/MNAR). Options:
- Drop rows/columns (if small impact)
- Impute (mean/median, KNN, model-based)
- Add missingness indicators
Imputation is risky when missingness is informative (MNAR) or when it creates false confidence. Always validate downstream impact.
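A minimal scikit-learn sketch of median imputation combined with missingness indicators, kept inside a pipeline so the imputer is fit only on training data:

```python
# Sketch: median imputation plus missingness indicators inside a pipeline.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# add_indicator appends a binary "was missing" column for each feature with NaNs,
# which preserves the signal when missingness itself is informative.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)
model.fit(X, y)
```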
How do you detect and handle outliers in a dataset?
Outliers may be errors or real rare events. Detection:
- Visuals (box plots, scatter)
- Z-score/IQR rules
- Model-based (isolation forest)
Handling:
- Fix data issues
- Cap/winsorize
- Use robust models/metrics
Avoid blindly removing outliers; confirm whether they represent meaningful business cases.
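A sketch of the rule-based and model-based detection options on synthetic data, assuming NumPy and scikit-learn:

```python
# Sketch: flag outliers with an IQR rule and with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.concatenate([np.random.normal(100, 10, 500), [500, -300]])  # two injected extremes

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

iso = IsolationForest(contamination=0.01, random_state=0)
iso_outliers = iso.fit_predict(values.reshape(-1, 1)) == -1  # -1 marks anomalies

print(iqr_outliers.sum(), iso_outliers.sum())
```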
How do you explain model predictions to stakeholders (SHAP, LIME, feature importance)?
Use explanation methods appropriate to the audience.
- Global importance: which features matter overall
- Local explanations: why a single prediction happened
- SHAP: consistent additive attributions
- LIME: local surrogate approximations
Also validate explanations with domain experts and warn about correlation vs causation.
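SHAP and LIME require their own packages; as a dependency-light sketch of the global-importance idea, here is scikit-learn's permutation importance (a different technique from SHAP, shown only to illustrate ranking features by their effect on held-out performance):

```python
# Sketch: global feature importance via permutation importance on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```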
What is regularization (L1/L2) and how does it help prevent overfitting?
Regularization adds a penalty to model complexity.
- L2 (Ridge): shrinks weights, keeps all features
- L1 (Lasso): can push weights to zero (feature selection)
It reduces variance and improves generalization. Choose strength via validation and consider elastic net for mixed behavior.
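A quick scikit-learn sketch on synthetic data showing the key behavioral difference: Lasso zeroes out coefficients while Ridge only shrinks them.

```python
# Sketch: L2 (Ridge) vs L1 (Lasso) on the same data; Lasso drives some coefficients to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)

print("ridge zero coefficients:", np.sum(ridge[-1].coef_ == 0))
print("lasso zero coefficients:", np.sum(lasso[-1].coef_ == 0))  # typically > 0
```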
How do you handle imbalanced classification problems?
Imbalance needs the right metrics and training strategy. Approaches:
- Use PR-AUC, precision/recall, and cost-based evaluation
- Resample (SMOTE, undersampling)
- Class weights / focal loss
- Threshold tuning
Always validate on a realistic test set and confirm improvements match business costs.
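A sketch of two of these levers with scikit-learn (class weights plus threshold tuning on a validation set); SMOTE would need the separate imbalanced-learn package:

```python
# Sketch: class weights plus explicit threshold tuning on validation data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)  # ~3% positives
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# Pick the threshold that maximizes F1 on the validation set (never on the test set).
thresholds = np.linspace(0.1, 0.9, 81)
best = max(thresholds, key=lambda t: f1_score(y_val, proba >= t))
print("best threshold:", best)
```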
What is data leakage and how do you prevent it in ML projects?
Data leakage happens when training uses information not available at prediction time. Common causes:
- Using future data in features
- Leakage through target encoding
- Improper train/test splitting (same user in both)
Prevent with strict splitting rules, feature audits, and pipeline design that mirrors production inference.
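One concrete prevention pattern, sketched with scikit-learn: put preprocessing inside a Pipeline so each cross-validation fold fits its transformers on training data only, never on the held-out fold.

```python
# Sketch: preprocessing inside a Pipeline so CV folds never see held-out statistics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Anti-pattern: fitting the scaler on all data before CV leaks test-fold statistics.
# Correct: the pipeline refits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```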
How do you tune hyperparameters efficiently (grid search vs random vs Bayesian)?
Grid search is simple but expensive. Random search covers the space more efficiently. Bayesian optimization uses past trials to pick better candidates. Use:
- Random search for quick wins
- Bayesian optimization for expensive models
- Early stopping where possible
Always avoid tuning on the test set and track experiments for reproducibility.
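A sketch of random search with scikit-learn and SciPy distributions (parameter ranges here are illustrative, not recommendations):

```python
# Sketch: random search over sampled hyperparameters, scored with cross-validation.
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 3e-1),
        "max_depth": randint(2, 6),
        "n_estimators": randint(50, 300),
    },
    n_iter=20, cv=3, random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```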
How do you design an A/B test and avoid common statistical mistakes?
A/B tests require clear hypotheses and measurement. Steps:
- Define success metrics and guardrails
- Estimate sample size and duration
- Randomize properly and avoid bias
- Control for multiple testing and peeking
Common mistakes: stopping early, ignoring seasonality, and choosing metrics after seeing results.
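A minimal sketch of the sample-size step for a conversion-rate lift, assuming statsmodels is installed (the 10% → 12% lift is an illustrative minimum detectable effect):

```python
# Sketch: per-variant sample size for detecting a 10% -> 12% conversion lift
# at alpha = 0.05 and 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.10, 0.12
effect = proportion_effectsize(target, baseline)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n_per_arm:.0f} users per variant")
```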
What SQL skills are most important for data scientists?
Core SQL skills enable fast analysis. Important topics:
- Joins and window functions
- Aggregations and grouping
- CTEs and subqueries
- Data cleaning (CASE, COALESCE)
Also understand performance basics (indexes, partitions) so queries scale and don’t overload production systems.
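A self-contained way to practice CTEs and window functions is Python's built-in sqlite3 module (window functions require SQLite 3.25+); the table and columns below are made up for illustration:

```python
# Sketch: a CTE plus window functions, runnable against an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (user_id INT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES (1,'2024-01-01',10),(1,'2024-01-05',20),(2,'2024-01-02',15);
""")

query = """
WITH ranked AS (
    SELECT user_id, order_date, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn,
           SUM(amount)  OVER (PARTITION BY user_id)                     AS user_total
    FROM orders
)
SELECT * FROM ranked WHERE rn = 1;   -- first order per user, with their lifetime total
"""
for row in con.execute(query):
    print(row)
```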
What changes when you put an ML model into production?
Production ML needs reliability beyond accuracy. Consider:
- Feature pipelines and consistency at inference
- Monitoring drift and data quality
- Latency and scaling
- Retraining cadence
- Explainability and governance
A good handoff includes model versioning, reproducible training, and clear rollback strategies.
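As one illustration of drift monitoring, here is a rough sketch of the Population Stability Index between a training feature and its live distribution (thresholds like 0.2 are common rules of thumb, not universal standards); real systems usually rely on dedicated monitoring tooling:

```python
# Sketch: Population Stability Index (PSI) as a simple feature-drift check.
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI between two 1-D samples; values above ~0.2 are often treated as notable drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_feature = np.random.normal(0, 1, 10_000)
live_feature = np.random.normal(0.3, 1.2, 10_000)   # shifted live distribution
print(psi(train_feature, live_feature))
```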
What’s different about time series forecasting compared to standard ML?
Time series has temporal dependency. Differences:
- Use time-based splits
- Handle seasonality and trends
- Create lag/rolling features
- Evaluate with horizon-aware metrics
Forecasting failures often come from leakage and ignoring non-stationarity. Always baseline with naive forecasts.
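A pandas sketch of lag/rolling features and the naive "tomorrow equals today" baseline on a synthetic daily series:

```python
# Sketch: lag/rolling features and a naive baseline for a daily series.
import numpy as np
import pandas as pd

ts = pd.Series(np.random.rand(200).cumsum(),
               index=pd.date_range("2024-01-01", periods=200, freq="D"))

df = pd.DataFrame({"y": ts})
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)
df["roll_7"] = df["y"].shift(1).rolling(7).mean()   # shift first so the window is all past data

# Naive baseline: tomorrow equals today; any forecasting model should beat this.
naive_mae = (df["y"] - df["lag_1"]).abs().mean()
print("naive MAE:", naive_mae)
```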
What are the core steps in an NLP pipeline for text classification?
Typical NLP pipeline:
- Clean and normalize text
- Tokenize
- Represent text (TF-IDF, embeddings)
- Train model (linear, tree, transformer)
- Evaluate with appropriate metrics
Modern systems often use pretrained transformers but still need careful labeling, bias checks, and monitoring.
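A classic baseline version of this pipeline, sketched with scikit-learn on a toy labeled set (the texts and labels are made up for illustration):

```python
# Sketch: TF-IDF representation + linear classifier as a text-classification baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible support", "love it", "awful experience",
         "works fine", "broke after a day", "highly recommend", "would not buy again"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25, random_state=0)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```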
How do you think about fairness and ethics in machine learning?
Ethical ML considers harm, bias, and transparency. Practices:
- Evaluate performance across groups
- Audit training data for bias
- Use fairness constraints when needed
- Document assumptions and limitations
Also consider privacy, consent, and the consequences of errors. Communicate trade-offs clearly to stakeholders.
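The per-group evaluation step can be as simple as slicing a metric by a sensitive attribute; a rough sketch on synthetic predictions (column names are illustrative):

```python
# Sketch: compare recall across groups defined by a sensitive attribute.
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], 1000),
    "y_true": rng.integers(0, 2, 1000),
    "y_pred": rng.integers(0, 2, 1000),
})

# Large gaps between groups deserve investigation before deployment.
for group, g in df.groupby("group"):
    print(group, recall_score(g["y_true"], g["y_pred"]))
```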
How do you do feature selection and know which features actually help?
Feature selection aims to reduce noise and improve generalization. Approaches:
- Filter methods (correlation, mutual information)
- Wrapper methods (RFE)
- Embedded methods (L1 regularization, tree importance)
Always validate with cross-validation, watch for leakage, and prefer simpler models when performance is similar, since they are easier to explain and maintain.
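A scikit-learn sketch comparing a filter method (mutual information) and an embedded method (L1-based selection), each validated with cross-validation on synthetic data:

```python
# Sketch: filter (mutual information) vs embedded (L1) feature selection, compared via CV.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)

filter_pipe = make_pipeline(SelectKBest(mutual_info_classif, k=10),
                            LogisticRegression(max_iter=1000))
embedded_pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    LogisticRegression(max_iter=1000),
)

for name, pipe in [("filter", filter_pipe), ("embedded", embedded_pipe)]:
    print(name, cross_val_score(pipe, X, y, cv=5).mean())
```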