Machine learning, statistical modelling and SQL analysis projects demonstrating end-to-end pipeline work.
Heart Disease Prediction using Machine Learning
Random Forest: 96.7% accuracy, 97.9% recall, ROC AUC 0.996 — tuned via GridSearchCV.
Traditional risk scores struggle with non-linear clinical interactions. Cleaned a 968-record medical dataset, encoded and scaled features, then trained and compared five classifiers (Logistic Regression, Random Forest, SVM, Decision Tree, Gaussian Naive Bayes). The SVM's F1-score improved from 89% to 96% after tuning. Performance caveats around retained duplicate records are documented in the project notes.
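The tuning step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the data is synthetic (generated with `make_classification`), and the feature count, parameter grid and split settings are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the clinical data (assumption: real features differ).
X, y = make_classification(
    n_samples=968, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is scaled using only
# its own training portion, which avoids leaking validation statistics.
pipe = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["rbf"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Compare an untuned SVC with the best estimator found by the search.
baseline = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
baseline_f1 = f1_score(y_test, baseline.predict(X_test))
tuned_f1 = f1_score(y_test, search.predict(X_test))

print("best params:", search.best_params_)
print(f"baseline F1: {baseline_f1:.3f}, tuned F1: {tuned_f1:.3f}")
```

On synthetic data the gap between baseline and tuned scores will not match the figures reported above; the point is only the shape of the workflow (pipeline, grid, F1 scoring, held-out comparison).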
Python · scikit-learn · Classification · GridSearchCV · ROC AUC
Machine Learning for House Price Prediction: A Comparative Study
XGBoost outperformed all linear baselines — R² = 0.84 on the California housing dataset.
Compared statistical and ML models for predicting house prices. Applied IQR outlier detection, log transformation, one-hot encoding and GridSearchCV tuning across Linear Regression, Ridge, KNN, Random Forest and XGBoost. Evaluated using R², RMSE, residual plots and Q-Q plots.
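As a rough illustration of the preprocessing mentioned above, the sketch below applies IQR fences and a log transform to a handful of made-up prices. The values, the `iqr_mask` helper, the 1.5 multiplier and the use of `np.log1p` are all assumptions for the example; the project may have applied these steps to different columns or with different settings.

```python
import numpy as np


def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)


# Hypothetical prices with one obvious outlier at the end.
prices = np.array(
    [120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 210_000, 2_500_000],
    dtype=float,
)

mask = iqr_mask(prices)
kept = prices[mask]

# log1p compresses the right tail; expm1 inverts it when reporting predictions.
log_prices = np.log1p(kept)

print("dropped:", prices[~mask])
print("log-transformed:", np.round(log_prices, 3))
```

If the target itself is log-transformed before training, predictions need to be mapped back with `np.expm1` before computing RMSE in the original price units.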
Python · XGBoost · scikit-learn · Regression · Model comparison
Covid-19 Patient Survival Analysis (SQL)
Mortality rates varied significantly by age, comorbidity and ICU admission source — surfaced entirely through SQL.
Hospitals needed a structured view of which patient characteristics drove Covid-19 mortality. Wrote advanced SQL queries across a hospital patient dataset to calculate mortality rates stratified by age, gender, ethnicity and comorbid conditions (diabetes, cirrhosis). Investigated ICU admission types, BMI distributions and physiological metrics across survivors and non-survivors. Output provides a foundation for clinical risk stratification.
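A stratified mortality query of the kind described above might look like the following sketch. It runs against an in-memory SQLite database from Python purely so the example is self-contained; the project itself lists PostgreSQL. The `patients` table, its columns, the age bands and the rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: died is 1 for non-survivors, 0 for survivors.
conn.execute("CREATE TABLE patients (age INTEGER, diabetes INTEGER, died INTEGER)")
rows = [
    (34, 0, 0), (45, 0, 0), (47, 1, 0), (52, 1, 1),
    (63, 0, 0), (68, 1, 1), (71, 0, 1), (79, 1, 1),
]
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)

# Mortality rate per age band: deaths divided by patients in the band.
# Multiplying by 100.0 forces non-integer division.
query = """
SELECT
    CASE WHEN age < 50 THEN 'under 50'
         WHEN age < 70 THEN '50-69'
         ELSE '70+' END AS age_band,
    COUNT(*) AS patients,
    SUM(died) AS deaths,
    ROUND(100.0 * SUM(died) / COUNT(*), 1) AS mortality_pct
FROM patients
GROUP BY age_band
ORDER BY MIN(age)
"""
results = conn.execute(query).fetchall()

for age_band, patients, deaths, mortality_pct in results:
    print(f"{age_band:>8}: {deaths}/{patients} died ({mortality_pct}%)")
```

The same pattern extends to other strata by swapping the `CASE` expression for a column such as a comorbidity flag, or by adding it as a second `GROUP BY` key.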
SQL · Healthcare · Risk stratification · PostgreSQL