Cancer Biology Project Case Study 1

Cancer Biology Projects

Project1:Identification ofDisease-Causing

Mutations:A BioinformaticsDatabase Approach(Breast Cancer)

This project focused on predicting the pathogenicity of genetic mutations associated with breast cancer susceptibility genes (BRCA1, BRCA2, PALB2, and CHEK2). Leveraging bioinformatics databases and advanced machine learning algorithms, we developed a high-precision variant analysis pipeline integrating conservation scores, amino acid substitution matrices, and functional impact predictions.

AdvancedMethodology Pipeline:

Data Acquisition & Preprocessing

• Genomic variant data collected from gnomAD for BRCA1 (2630), BRCA2 (5077), PALB2 (1783), and CHEK2 (1012).

• Protein sequences from Ensembl, conservation scores from UCSC PhyloP.

• Filtered variants to retain only high-confidence coding mutations (e.g., excluding intronic and conflicting variants).

Variant

Annotation and Parsing

• HGVS-format mutation data parsed to extract:

⚬ Reference Amino Acid (Ref_aa)

⚬ Alternate Amino Acid (Alt_aa)

⚬ Mutation Position

• Frameshift and stop codon analysis performed for functional context.

BLOSUM Score Calculation

1: BLOSUM62 matrix showing amino acid substitution scores.

Used BLOSUM62 matrix to quantify evolutionary distance between wild-type and mutant amino acids.

Figure

Conservation Analysis with PhyloP

Integrated PhyloP conservation scores to determine evolutionary pressure at nucleotide positions.

5

Functional Impact Prediction

Used SIFT and PolyPhen2 (via Ensembl VEP API)

to evaluate whether the amino acid changes were likely benign or damaging.

Machine Learning Model Ensemble

• Models used:

⚬ Random Forest

⚬ Support Vector Machine (SVM)

⚬ K-Nearest Neighbors (KNN)

⚬ Logistic Regression

⚬ XGBoost

⚬ LightGBM

• Input: Feature set combining substitution scores, conservation scores, and functional impact.

• Output: Pathogenicity classification.

• Train-Test Split: 80:20

ProjectOutcomes:

High accuracy and precision in variant classification.

Precision: 97% across benign and pathogenic classes.

Strong F1-score for binary classes; intermediate variants were less separable due to data imbalance.

The ensemble learning approach outperformed individual algorithms in prediction robustness and stability.

GraphicalResults:

Per class classification metrics plot

Figure 2 - Per-class classification metrics—precision, recall, and F1-score—across four variant categories (benign, likely benign, likely pathogenic, and pathogenic) for six machine learning models (RF, SVM, KNN, LR, XGBoost, and LightGBM). The bar charts highlight each model’s performance in correctly classifying variants within each pathogenicity class.

Model Performance

Figure 3 - Overall accuracy of six machine learning models Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, XGBoost (XGB), and LightGBM (LGB)-in classifying variant pathogenicity. Logistic Regression achieved the highest accuracy (0.90), while all models exhibited performance above 0.85.

ROC Curve – Ensemble Classifier

Figure 4 - Receiver operating characteristic (ROC) curves for the machine learning classifiers used in this study to predict mutation pathogenicity.

Shown are the ROC curves for the KNN, Logistic Regression, Naive Bayes, SVM, Random Forest, XGBoost, and LightGBM models. These

curves illustrate the trade-off between sensitivity and specificity across various decision thresholds. Higher AUC values, as observed for the SVM, LR, and RF models, indicate superior discriminative performance in clearly distinguishing benign from pathogenic variants, while the relatively lower AUCs for KNN and NB highlight challenges in classifying intermediate cases.

Conclusion

This project successfully delivered a state-of-the-art integrative framework combining evolutionary, structural, and functional bioinformatics features with machine learning algorithms for robust mutation classification. The approach offers a powerful strategy for early risk assessment in breast cancer and can be adapted to other genetic diseases.

We demonstrated:

• An advanced, reproducible bioinformatics pipeline.

• Superior classification power using ensemble learning.

• Scientific validation through conservation, substitution, and functional metrics

Turn static files into dynamic content formats.

Create a flipbook