This project aimed to develop a machine learning–based classification model that accurately predicts cancer subtypes from microarray gene expression data. By analyzing curated datasets from public repositories, the study sought to uncover key gene expression signatures and optimize classifiers for biological interpretation and diagnostic application.
Dataset Collection & Preprocessing
• Datasets obtained from GEO (Gene Expression Omnibus), e.g., breast, lung, or leukemia subtype expression profiles.
• Used Affymetrix and Agilent microarray platforms.
• Applied preprocessing steps:
⚬ Background correction
⚬ RMA normalization
⚬ Log2 transformation
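To make the normalization and log2 steps concrete, here is a minimal sketch in plain Python. It is not the pipeline used in the project: RMA is normally run with dedicated microarray software, and this sketch only illustrates quantile normalization (the normalization step inside RMA) followed by a log2 transform. The function names, the offset of 1.0, and the toy intensity matrix are assumptions made for illustration.

```python
import math

def quantile_normalize(matrix):
    """Force every sample (column) to share the same intensity distribution.

    matrix: list of rows, one row per probe, one column per sample.
    Ties are broken by row order, which is a simplification.
    """
    n_probes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[s] for row in matrix] for s in range(n_samples)]
    # Mean intensity at each rank, averaged across all samples.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [
        sum(col[r] for col in sorted_columns) / n_samples
        for r in range(n_probes)
    ]
    normalized = [[0.0] * n_samples for _ in range(n_probes)]
    for s, col in enumerate(columns):
        # Probe indices ordered from lowest to highest intensity in this sample.
        order = sorted(range(n_probes), key=lambda p: col[p])
        for rank, probe in enumerate(order):
            normalized[probe][s] = rank_means[rank]
    return normalized

def log2_transform(matrix, offset=1.0):
    """Log2-transform intensities; the offset (an assumption) avoids log2(0)."""
    return [[math.log2(v + offset) for v in row] for row in matrix]

# Hypothetical raw intensities: 4 probes (rows) x 3 samples (columns).
raw = [
    [120.0, 200.0, 90.0],
    [15.0, 30.0, 10.0],
    [900.0, 1500.0, 700.0],
    [60.0, 80.0, 45.0],
]
normalized = quantile_normalize(raw)
expression = log2_transform(normalized)
```

After normalization every sample contains the same set of values, so remaining differences between samples reflect probe ranks rather than overall array brightness.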
Feature Selection
• Reduced dimensionality using:
⚬ Variance thresholding
⚬ Mutual Information Score
⚬ Recursive Feature Elimination (RFE) with cross-validation
• Selected the top 100–200 informative genes
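The first two filters can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's code: mutual information is estimated here by splitting each gene at its median, a deliberate simplification of what a library estimator would do, and RFE is left out because it needs a fitted model. The function names, thresholds, and toy matrix are hypothetical.

```python
import math
from collections import Counter
from statistics import median, pvariance

def variance_filter(X, threshold=0.0):
    """Return indices of genes (columns) whose variance exceeds the threshold."""
    return [
        g for g in range(len(X[0]))
        if pvariance([row[g] for row in X]) > threshold
    ]

def mutual_information(values, labels):
    """Mutual information (bits) between a median-split gene and the labels."""
    cut = median(values)
    binned = [v > cut for v in values]
    n = len(labels)
    joint = Counter(zip(binned, labels))
    p_bin = Counter(binned)
    p_lab = Counter(labels)
    mi = 0.0
    for (b, lab), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((p_bin[b] / n) * (p_lab[lab] / n)))
    return mi

def select_top_genes(X, y, var_threshold=0.0, k=100):
    """Drop low-variance genes, then keep the k genes with the highest MI."""
    kept = variance_filter(X, var_threshold)
    scored = [(mutual_information([row[g] for row in X], y), g) for g in kept]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [g for _, g in scored[:k]]

# Toy matrix: 6 samples (rows) x 3 genes (columns).
# Gene 0 is constant, gene 1 separates the subtypes, gene 2 is noise.
X = [
    [5.0, 1.0, 2.0],
    [5.0, 1.2, 3.0],
    [5.0, 0.9, 2.5],
    [5.0, 3.0, 2.1],
    [5.0, 3.1, 3.1],
    [5.0, 2.8, 2.4],
]
y = ["subtype_A"] * 3 + ["subtype_B"] * 3
top = select_top_genes(X, y, k=1)
```

On real data, `k` would be set in the 100–200 range described above, and the surviving genes would then be passed to RFE or directly to the classifiers.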
Model Training & Evaluation
• Applied multiple classification models:
⚬ Support Vector Machine (SVM)
⚬ Random Forest (RF)
⚬ K-Nearest Neighbors (KNN)
⚬ Logistic Regression
⚬ Gradient Boosting (XGBoost)
• Used 5-fold stratified cross-validation
• Performance metrics: Accuracy, Precision, Recall, F1-score
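The evaluation loop can be sketched without any external library. The example below uses a minimal K-Nearest Neighbors classifier as a stand-in for the models listed above, deals samples into 5 stratified folds, and reports accuracy with macro-averaged precision, recall, and F1. All function names, the synthetic two-gene data, and the choice to pool predictions across folds before scoring are assumptions for illustration, not the project's implementation.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def knn_predict(train_X, train_y, x, n_neighbors=3):
    """Majority vote among the n_neighbors closest training samples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true))
    precision, recall, f1 = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p_c)
        recall.append(r_c)
        f1.append(2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precision) / n,
        "recall": sum(recall) / n,
        "f1": sum(f1) / n,
    }

def cross_validate(X, y, k=5, n_neighbors=3, seed=0):
    """Predict every sample once while it is held out, then score the pool."""
    y_true, y_pred = [], []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        for i in test_idx:
            y_true.append(y[i])
            y_pred.append(knn_predict(train_X, train_y, X[i], n_neighbors))
    return macro_scores(y_true, y_pred)

# Synthetic, well-separated toy data: 20 samples per subtype, 2 genes each.
rng = random.Random(42)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)]
X += [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(20)]
y = ["subtype_A"] * 20 + ["subtype_B"] * 20
scores = cross_validate(X, y)
print(scores)
```

Stratification keeps the subtype proportions the same in every fold, which matters when some subtypes have few samples. Feature selection should be repeated inside each training fold on real data so that held-out samples do not influence which genes are kept.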
Classification Accuracy:
• XGBoost achieved the highest accuracy: 94.3%
• SVM and Random Forest followed with ~92% accuracy
• Feature selection significantly improved performance compared with using the raw data
• Identified key biomarkers associated with subtype-specific profiles (e.g., HER2, ESR1, PGR in breast cancer)
• Feature importance scores ranked genes by their contribution to classification
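One model-agnostic way to rank genes by their contribution is permutation importance: shuffle one gene's values across samples and measure how much accuracy drops. The sketch below illustrates that idea only. The source does not state which importance measure was used, so this is a swapped-in technique, and the `predict` rule, function names, and toy data are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of samples for which predict(row) matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each gene's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for g in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[g] for row in X]
            rng.shuffle(column)
            # Rebuild each sample with only gene g replaced by a shuffled value.
            shuffled = [row[:g] + [v] + row[g + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict(row):
    """Hypothetical stand-in for a trained classifier: thresholds gene 0 only."""
    return "subtype_B" if row[0] > 2.0 else "subtype_A"

# Toy data: gene 0 separates the subtypes, gene 1 is unrelated noise.
X = [
    [0.5, 3.1], [0.7, 1.2], [0.9, 2.8], [1.1, 0.4], [1.3, 2.2],
    [3.0, 1.9], [3.2, 0.6], [3.4, 2.9], [3.6, 1.1], [3.8, 2.5],
]
y = ["subtype_A"] * 5 + ["subtype_B"] * 5

importances = permutation_importance(predict, X, y)
ranking = sorted(range(len(importances)), key=lambda g: importances[g], reverse=True)
```

Genes the model never relies on score exactly zero, while genes it depends on produce a clear accuracy drop, so the ranking can feed directly into a list of candidate biomarkers.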
• Expression heatmap: shows expression patterns across all samples, grouped by subtype.
• Confusion matrix: displays the number of true vs. predicted classes.
• ROC curves: demonstrate high sensitivity and specificity across all cancer subtypes.
• Feature importance plot: highlights the most predictive genes used by the classifier.
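The counts of true vs. predicted classes and the per-subtype sensitivity and specificity mentioned above can be derived as in the following sketch. The labels and predictions are made-up toy values, not project results, and the function names are assumptions.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def sensitivity_specificity(matrix, classes):
    """One-vs-rest (sensitivity, specificity) for every class."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, c in enumerate(classes):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        result[c] = (sensitivity, specificity)
    return result

# Made-up labels and predictions for three hypothetical subtypes.
classes = ["subtype_A", "subtype_B", "subtype_C"]
y_true = ["subtype_A"] * 3 + ["subtype_B"] * 2 + ["subtype_C"] * 3
y_pred = [
    "subtype_A", "subtype_A", "subtype_B",
    "subtype_B", "subtype_B",
    "subtype_C", "subtype_C", "subtype_A",
]

matrix = confusion_matrix(y_true, y_pred, classes)
rates = sensitivity_specificity(matrix, classes)
for name, (sens, spec) in rates.items():
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both rates per subtype, rather than overall accuracy alone, shows whether a rare subtype is being missed even when the headline accuracy looks high.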
This project successfully established a robust and interpretable ML pipeline for classifying cancer subtypes using gene expression data. It demonstrates the potential of machine learning to support precision oncology and early diagnosis.
1. Delivered accurate and reproducible subtype predictions
2. Identified candidate biomarkers for downstream validation
3. Built a model that could be retrained for NGS-based RNA-seq data in future iterations