Cancer Biology Project Case Study 4

Project 4: Machine Learning Classification of Microarray Gene Expression Data for Cancer Subtype Identification

This project aimed to develop a machine learning–based classification model that accurately predicts cancer subtypes from microarray gene expression data. By analyzing curated datasets from public repositories, the study sought to uncover key gene expression signatures and optimize classifiers for biological interpretation and diagnostic application.

Advanced Methodology Pipeline:

Dataset Collection & Preprocessing

• Datasets obtained from GEO (Gene Expression Omnibus), e.g., breast, lung, or leukemia subtype expression profiles.

• Used Affymetrix and Agilent microarray platforms.

• Applied preprocessing steps:

⚬ Background correction

⚬ RMA normalization

⚬ Log2 transformation
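The log2 transformation step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function name `log2_transform`, the `offset` parameter, and the intensity values are all assumptions made for the example. RMA implementations typically return values that are already on the log2 scale, so a separate step like this mainly applies to data that is still on the raw intensity scale.

```python
import math


def log2_transform(matrix, offset=1.0):
    """Return log2(x + offset) for each value in a probes x samples matrix.

    The offset (an assumption of this sketch) keeps zero intensities finite.
    """
    transformed = []
    for row in matrix:
        if any(x + offset <= 0 for x in row):
            raise ValueError("every intensity must be greater than -offset")
        transformed.append([math.log2(x + offset) for x in row])
    return transformed


# Hypothetical background-corrected intensities: 3 probes x 4 samples.
raw = [
    [15.0, 31.0, 63.0, 127.0],
    [255.0, 255.0, 511.0, 1023.0],
    [0.0, 1.0, 3.0, 7.0],
]

logged = log2_transform(raw)
for row in logged:
    print(row)
```

Working on the log2 scale compresses the wide dynamic range of microarray intensities, so that a doubling in expression corresponds to a one-unit change regardless of the starting level.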

Feature Selection

• Reduced dimensionality using:

⚬ Variance thresholding

⚬ Mutual Information Score

⚬ Recursive Feature Elimination (RFE) with cross-validation

• Selected top 100–200 informative genes
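As an illustration of the first filter in this list, variance thresholding followed by a top-k cut might look like the sketch below. It uses only the standard library; the function `select_by_variance`, the gene names, the threshold, and the expression values are hypothetical and chosen for the example. Mutual information scoring and RFE are not shown here.

```python
from statistics import pvariance


def select_by_variance(expr, threshold=0.0, top_k=None):
    """Keep genes whose variance across samples exceeds `threshold`.

    expr maps a gene name to its list of log2 expression values.
    The result is ordered from most to least variable and cut to `top_k`.
    """
    variances = {gene: pvariance(values) for gene, values in expr.items()}
    kept = [gene for gene, var in variances.items() if var > threshold]
    kept.sort(key=lambda gene: variances[gene], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Hypothetical log2 expression values for four genes across four samples.
expr = {
    "GENE_A": [5.0, 5.0, 5.0, 5.0],  # constant, carries no information
    "GENE_B": [2.0, 8.0, 2.0, 8.0],  # highly variable
    "GENE_C": [4.0, 6.0, 4.0, 6.0],  # moderately variable
    "GENE_D": [5.0, 5.1, 5.0, 5.1],  # nearly constant
}

selected = select_by_variance(expr, threshold=0.01, top_k=2)
print(selected)
```

A variance filter ignores the class labels, which is why it is usually paired with supervised criteria such as the mutual information score or RFE. To avoid optimistic performance estimates, supervised selection is normally fitted on the training portion of each cross-validation fold only.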

Model Development & Tuning

• Applied multiple classification models:

⚬ Support Vector Machine (SVM)

⚬ Random Forest (RF)

⚬ K-Nearest Neighbors (KNN)

⚬ Logistic Regression

⚬ Gradient Boosting (XGBoost)

• Used 5-fold stratified cross-validation

• Performance metrics: Accuracy, Precision, Recall, F1-score
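To make the idea of stratified 5-fold cross-validation concrete, the sketch below assigns sample indices to folds so that each fold keeps roughly the same subtype proportions as the full dataset. It is a simplified, deterministic illustration in plain Python: the helper names and the subtype labels ("LumA", "Basal") are assumptions, and a real pipeline would normally shuffle samples with a fixed random seed and rely on a library implementation.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=5):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 samples of one subtype and 5 of another.
labels = ["LumA"] * 10 + ["Basal"] * 5
splits = list(train_test_splits(labels, k=5))

for train_idx, test_idx in splits:
    print(len(train_idx), [labels[i] for i in test_idx])
```

Stratification matters most when subtypes are imbalanced, because a plain random split can leave a rare subtype missing from some test folds.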

Project Outcomes:

Classification Accuracy:

• XGBoost achieved the highest accuracy: 94.3%

• SVM and Random Forest followed with ~92% accuracy

• Feature selection improved performance significantly vs. raw data

Top Discriminatory Genes:

• Identified key biomarkers associated with subtype-specific profiles (e.g., HER2, ESR1, PGR in breast cancer)

• Feature importance scores ranked genes by their contribution to classification

Visualization Results:

Figure 1: Heatmap of Top 50 Differentially Expressed Genes

Shows expression patterns across all samples, grouped by subtype.

Figure 2: Confusion Matrix of Best Classifier (XGBoost)

Displays the number of samples in each combination of true and predicted class.

Demonstrates high sensitivity and specificity across all cancer subtypes.
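Per-subtype sensitivity and specificity can be read off a multiclass confusion matrix in a one-vs-rest fashion, as in the sketch below. The matrix values are invented purely for illustration and are not the project's results; the function name `per_class_rates` is likewise an assumption.

```python
def per_class_rates(cm):
    """Return (sensitivity, specificity) per class, one-vs-rest.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rates = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        rates.append((sensitivity, specificity))
    return rates


# Made-up 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [
    [18, 2, 0],
    [1, 19, 0],
    [0, 1, 9],
]

rates = per_class_rates(cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

for c, (sens, spec) in enumerate(rates):
    print(f"class {c}: sensitivity={sens:.3f}, specificity={spec:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Reporting these rates per subtype, rather than overall accuracy alone, shows whether a classifier performs well on rare subtypes as well as common ones.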

Figure 3: ROC Curve for Multiclass Classification

Figure 4: Feature Importance Plot

Highlights the most predictive genes used by the classifier.

Conclusion:

This project successfully established a robust and interpretable ML pipeline for classifying cancer subtypes using gene expression data. It demonstrates the potential of machine learning to support precision oncology and early diagnosis.

Key Contributions:

1. Delivered accurate and reproducible subtype predictions

2. Identified candidate biomarkers for downstream validation

3. Built a model that could be retrained for NGS-based RNA-seq data in future iterations
