SMS SPAM CLASSIFIER FOR ROMAN URDU AND ENGLISH SMS
Rana Muhammad Ammar, Hamna Shahid, Mudassar Niaz
Rana Muhammad Ammar Khan is currently pursuing a Bachelor's degree in Software Engineering at COMSATS University, Pakistan, PH-+92-308-8741046. E-mail: ranaammar046@gmail.com
Hamna Shahid is currently pursuing a Bachelor's degree in Software Engineering at COMSATS University, Pakistan, PH-+92-304-0617983. E-mail: hamnashahidkhanewal@gmail.com
Mudassar Niaz is currently pursuing a Bachelor's degree in Software Engineering at COMSATS University, Pakistan, PH-+92-304-6788995. E-mail: niaz.mudasar1122@gmail.com
Keywords
SMS spam, spam classifier, Roman Urdu, English, machine learning, language variations, informal language, slang, limited text length, imbalanced datasets, preprocessing, feature selection, evaluation metrics, n-gram analysis, text normalization, phishing attacks, mobile network security.
ABSTRACT
The proliferation of SMS (Short Message Service) spam poses a significant challenge in maintaining a positive user experience and ensuring mobile network security. This research article focuses on developing an SMS spam classifier specifically designed for Roman Urdu and English SMS messages. The classifier employs machine learning techniques to accurately differentiate between spam and legitimate messages. The research explores the challenges associated with language variations, informal language and slang, limited text length, and imbalanced datasets. Preprocessing techniques for both Roman Urdu and English SMS are discussed, along with feature selection strategies. Evaluation metrics, including accuracy, precision, recall, and F1 score, are utilized to assess the classifier's performance. Techniques such as n-gram analysis, text normalization, and handling imbalanced datasets are examined to enhance the accuracy of the SMS spam classifier. The real-world applications of SMS spam classification, including protecting users from phishing attacks and enhancing mobile network security, are highlighted. The findings of this research contribute to a safer and more secure mobile communication environment by effectively classifying SMS spam in both Roman Urdu and English languages.
SMS (Short Message Service) spam refers to unwanted, unsolicited messages sent to mobile phone users. It is a prevalent problem that is irritating at best and harmful at worst. SMS spam classifiers address it by using machine learning techniques to identify and filter out spam, so that users receive only legitimate, wanted messages.
Importance of SMS Spam Classification
The proliferation of mobile devices and the widespread use of SMS as a communication channel have led to an increase in SMS spam. These spam messages can range from promotional offers and phishing attempts to scams and fraudulent activities. SMS spam classification is crucial for several reasons:
1. User Experience: SMS spam can disrupt the user experience, leading to annoyance and frustration. By filtering out spam messages, users can have a more enjoyable SMS communication experience.
2. Security: Many spam messages are designed to deceive and defraud users. SMS spam classifiers help protect users from phishing attacks, scams, and other fraudulent activities.
3. Network Efficiency: SMS spam consumes network resources, causing congestion and potential delays in delivering legitimate messages. By efficiently classifying and filtering spam, network efficiency can be improved.
Challenges in SMS Spam Classification
Developing effective SMS spam classifiers comes with several challenges:
1. Language Variations
SMS spam can appear in multiple languages, which makes it difficult to build a single, unified classification system. Roman Urdu (Urdu written in the Latin script) and English are two languages commonly used in SMS communication, and a single message may mix both. A classifier must handle these language variations and classify spam accurately in either language.
2. Informal Language and Slang
SMS messages often contain informal language, slang, and abbreviations, making it difficult to interpret them accurately. Understanding and deciphering the meaning behind these messages require classifiers that can handle informal language and slang effectively.
3. Limited Text Length
A single SMS is typically limited to 160 characters. This short length poses a challenge because the classifier must extract relevant features and make accurate predictions from only a small amount of text.
4. Imbalanced Datasets
Spam messages are relatively rare compared to legitimate messages, resulting in imbalanced datasets for training classifiers. This imbalance can lead to biased models that struggle to accurately classify spam messages. Handling the class imbalance is crucial to ensure balanced performance.
Understanding SMS Spam Classification
SMS spam classification involves several steps to develop an effective model:
1. Data Collection: A diverse and representative dataset of SMS messages is collected, containing both spam and legitimate messages.
2. Preprocessing: The collected SMS messages undergo preprocessing steps, including removing special characters, handling capitalization, and converting text to a consistent format. This step helps in normalizing the data and preparing it for feature extraction.
3. Feature Extraction: Relevant features are extracted from the preprocessed SMS messages. These features can include word frequencies, presence of specific keywords, character n-grams, and contextual information. Feature extraction plays a vital role in capturing the distinguishing characteristics of spam and non-spam messages.
4. Model Selection: Various machine learning algorithms, such as Naive Bayes, Support Vector Machines (SVM), Random Forest, and deep learning techniques like Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN), can be employed for SMS spam classification. The selection of the appropriate model depends on the characteristics of the dataset and the desired performance metrics.
5. Model Training and Evaluation: The classifier is trained using the labeled dataset, and its performance is evaluated using metrics such as accuracy, precision, recall, and F1 score. The dataset is typically split into training and testing sets to assess the generalization capability of the model.
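The steps above can be illustrated with a minimal sketch, assuming a multinomial Naive Bayes model with Laplace smoothing over bag-of-words features. This is only one of the model options listed and is not necessarily the configuration used in this study. The toy messages, the labels "spam" and "ham", and the function names `tokenize`, `train_naive_bayes`, and `predict` are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter

# Toy messages invented for illustration only; not the study's dataset.
TRAIN = [
    ("win a free prize now call this number", "spam"),
    ("aap ka inam nikla hai abhi call karein", "spam"),
    ("free offer claim your prize today", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("kal class kitne baje hai", "ham"),
    ("please call me when you reach home", "ham"),
]


def tokenize(text):
    """Preprocessing: lowercase and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(examples):
    """Feature extraction and training: count words per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the message."""
    word_counts, class_counts, vocab = model
    total_msgs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: fraction of training messages in this class.
        score = math.log(class_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
```

Calling `predict(model, "claim your free prize")` classifies a new message. In practice the labeled data would first be split into training and testing sets, and the held-out set would be used to compute the evaluation metrics.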
Roman Urdu SMS Spam Classification
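Roman Urdu is Urdu written in the Roman (Latin) script rather than the standard Urdu script, and it is widely used in SMS communication in Pakistan. Because it has no standardized spelling, the same word is often written in several different ways, so classifying Roman Urdu SMS spam calls for dedicated preprocessing and feature selection: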
1. Preprocessing Techniques: Preprocessing steps for Roman Urdu SMS include removing special characters, normalizing Roman Urdu text, and converting it to a standard format. These steps help ensure consistent representation and improve classification accuracy.
2. Feature Selection for Roman Urdu SMS: In addition to standard features used for SMS spam classification, custom features specific to Roman Urdu can be employed. These features may include the presence of Urdu words, Romanized Urdu words, or language-specific patterns.
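As a hedged illustration of the normalization step for Roman Urdu, the sketch below lowercases the text, strips special characters, collapses elongated letters, and maps spelling variants to one canonical form. The `VARIANTS` table and the function name `normalize_roman_urdu` are hypothetical. A real variant table would have to be built from the corpus, and the elongation rule is a heuristic that can over-shorten words with genuine double letters.

```python
import re

# Hypothetical variant table mapping alternative Roman Urdu spellings to a
# single canonical form. Entries here are illustrative, not from the study.
VARIANTS = {
    "ap": "aap",
    "hy": "hai",
    "hay": "hai",
    "mubarik": "mubarak",
    "inaam": "inam",
}


def normalize_roman_urdu(text):
    """Return a list of normalized tokens for a Roman Urdu message."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse a letter repeated three or more times ("hayyy" -> "hay").
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)
    # Map known spelling variants to their canonical form.
    return [VARIANTS.get(tok, tok) for tok in text.split()]
```

For example, `normalize_roman_urdu("Mubarik ho!!! Ap ka INAAM nikla hayyy")` maps "Mubarik", "Ap", "INAAM", and "hayyy" to the canonical spellings in the table, so that different spellings of the same word contribute to the same feature.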
English SMS Spam Classification
Preprocessing and feature selection for English SMS are similar to those used for Roman Urdu:
1. Preprocessing Techniques: Preprocessing steps include removing punctuation, converting text to lowercase, handling stop words, and normalizing abbreviations. These steps help standardize the text and improve feature extraction.
2. Feature Selection for English SMS: Features such as word frequencies, presence of specific keywords, and contextual information can be used to effectively identify spam messages in English SMS.
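The English preprocessing steps listed above might look like the following minimal sketch. The stop-word set, the abbreviation table, and the function name `preprocess_english` are assumptions made for illustration. A practical system would use much larger lists.

```python
import string

# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "is", "of", "and", "you", "your"}
ABBREVIATIONS = {
    "u": "you",
    "ur": "your",
    "txt": "text",
    "pls": "please",
    "msg": "message",
}

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_english(text):
    """Lowercase, strip punctuation, expand abbreviations, drop stop words."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # Expand SMS abbreviations before stop-word filtering, so that an
    # expanded form such as "you" is filtered like any other stop word.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

The order of operations is a design choice: expanding abbreviations first means "u" and "you" are treated identically by the stop-word filter.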
Evaluation Metrics for SMS Spam Classification
To measure the performance of SMS spam classifiers, several evaluation metrics are used:
1. Accuracy: Measures the overall correctness of the classifier's predictions by calculating the ratio of correctly classified messages to the total number of messages.
2. Precision: Indicates the proportion of correctly classified spam messages out of all messages classified as spam. It measures the classifier's ability to avoid false positives.
3. Recall: Measures the proportion of correctly classified spam messages out of all actual spam messages. It assesses the classifier's ability to avoid false negatives.
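4. F1 Score: The harmonic mean of precision and recall, which combines the two into a single metric and gives a balanced view of the classifier's performance.
Together, these metrics help assess how well the classifier separates spam from non-spam messages. Because spam is usually the minority class, accuracy alone can be misleading, so precision, recall, and F1 score should be reported alongside it.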
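The four metrics can be computed directly from the counts of true positives, true negatives, false positives, and false negatives. The sketch below treats spam as the positive class. The label encoding (1 for spam, 0 for legitimate) and the function name `spam_metrics` are assumptions for illustration.

```python
def spam_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 with spam (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

A classifier that labels every message as legitimate would score high accuracy on an imbalanced dataset but zero recall, which is why the metrics are best read together.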
Techniques to Improve SMS Spam Classification
Several techniques can be employed to enhance the accuracy and performance of SMS spam classifiers:
1. N-gram Analysis: N-grams capture local context by treating sequences of n consecutive words (or characters) as features. For example, the bigram "free prize" is a stronger spam signal than either word alone. Employing n-gram analysis can improve the classifier's ability to identify spam messages accurately.
2. Text Normalization: Techniques such as stemming and lemmatization reduce words to a common base form, which shrinks the vocabulary and improves feature extraction. Normalization ensures that inflected or variant forms of the same word are mapped to a single feature.
3. Handling Imbalanced Datasets: Imbalanced datasets, where the number of spam messages is significantly smaller than the number of legitimate messages, can lead to biased models. Techniques such as oversampling the minority class or undersampling the majority class can help balance the training data and improve classification performance.
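Two of these techniques are simple enough to sketch in a few lines. The code below shows word n-gram extraction and random oversampling of the minority class. The function names `word_ngrams` and `random_oversample` and the fixed seed are assumptions for illustration. Random duplication is only one of several oversampling strategies and is not claimed to be the one used in this study.

```python
import random


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def random_oversample(messages, labels, seed=0):
    """Duplicate randomly chosen minority-class messages until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = {}
    for msg, label in zip(messages, labels):
        by_class.setdefault(label, []).append(msg)

    target = max(len(msgs) for msgs in by_class.values())
    out_messages, out_labels = [], []
    for label, msgs in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(msgs) for _ in range(target - len(msgs))]
        for msg in msgs + extra:
            out_messages.append(msg)
            out_labels.append(label)
    return out_messages, out_labels
```

Oversampling should be applied to the training split only. Duplicating messages before the train/test split would leak copies of training examples into the test set and inflate the reported metrics.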
Real-World Applications of SMS Spam Classification
SMS spam classification has several practical applications beyond improving user experience and network efficiency:
1. Protecting Users from Phishing Attacks: Phishing attacks through SMS can trick users into revealing sensitive information or clicking on malicious links. SMS spam classifiers help identify and block phishing attempts, safeguarding users' personal data and privacy.
2. Enhancing Mobile Network Security: By filtering out spam messages, mobile network operators can improve the security and quality of their services. Effective spam classification contributes to a safer and more reliable mobile network environment.
Conclusion
SMS spam classification is an essential task to ensure a positive user experience, enhance security, and improve network efficiency. By leveraging machine learning techniques, appropriate feature selection, and preprocessing methods, SMS spam classifiers can accurately identify and filter out spam messages. The continuous development of SMS spam classifiers contributes to a safer and more secure mobile communication environment.
Acknowledgment
We would like to acknowledge the support and guidance of our research advisor and mentor, Ms. Madiha Fatima, who provided her expertise and valuable feedback throughout the research process. Her insightful comments and suggestions greatly contributed to the quality and rigor of this research.
Furthermore, we would like to thank COMSATS University Islamabad, Sahiwal Campus, which provided the necessary resources and infrastructure to carry out this study. Its support was invaluable in facilitating the data collection, analysis, and experimentation phases.
Without the collective efforts and contributions of all those involved, this research article would not have been possible. We are truly grateful for their support and collaboration.