Bengali News Headline Categorization: A Comprehensive Analysis of Machine Learning and Deep Learning Approach

Ovi Chowdhury, Mamun Ahmed*, Mst. Tasmim Ara, Saha Reno, and Ankhi Alam

Abstract: Text classification, a prominent application of natural language processing, is gaining popularity for Bengali, as it is for many other languages. A significant effort in this domain is the categorization of unlabeled news items spanning topics such as national, international, and IT news. Bengali news platforms are growing, supported by easy internet access, and online news consumption is now widespread. These platforms typically cover a broad range of news genres. This article presents an approach for classifying news headlines collected from websites and news portals using machine learning and deep learning algorithms. Before training, the collected data were preprocessed through tokenization, removal of numeric characters, exclamation marks, and symbols, and exclusion of stop words. Because effective stop word elimination is important for feature selection, we also compiled a list of stop phrases to further improve performance. Rather than examining full news articles from diverse online sources, our research concentrates exclusively on categorizing Bengali news headlines. We consider eight distinct news categories, and our model is trained to assign each input headline to one of them. Among the models evaluated, the GRU (gated recurrent unit) model performed best, reaching an accuracy of 84%.

Keywords: News portal, Bengali news headline categorizing, online publication, text classification, word elimination
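
The preprocessing steps listed in the abstract (tokenization, removal of numeric characters, exclamation marks, and symbols, and stop word exclusion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `preprocess`, the use of Unicode character categories to detect digits and symbols, whitespace tokenization, and the tiny `STOP_WORDS` set are all assumptions made for illustration. The stop word and stop phrase lists compiled for the study are not reproduced here.

```python
import unicodedata

# Hypothetical stop-word set for illustration only; the study's own
# stop word and stop phrase lists are not reproduced here.
STOP_WORDS = {"ও", "এবং", "যে", "এই", "করে"}


def preprocess(headline, stop_words=STOP_WORDS):
    """Clean one headline and return its remaining tokens.

    Assumed steps: drop numeric characters, replace punctuation and
    symbols with spaces, split on whitespace, drop stop words.
    """
    kept = []
    for ch in headline:
        category = unicodedata.category(ch)
        if category.startswith("N"):
            # Numeric characters, including Bengali digits such as ২ and ০.
            continue
        if category.startswith(("P", "S")):
            # Punctuation and symbols (e.g. "!", the danda "।", "৳").
            kept.append(" ")
            continue
        # Letters and combining marks (vowel signs, virama) are kept,
        # so Bengali words stay intact.
        kept.append(ch)
    tokens = "".join(kept).split()  # simple whitespace tokenization
    return [token for token in tokens if token not in stop_words]


print(preprocess("বাংলাদেশ ২০২১ সালে ক্রিকেট ও ফুটবল খেলবে!"))
# ['বাংলাদেশ', 'সালে', 'ক্রিকেট', 'ফুটবল', 'খেলবে']
```

Filtering by Unicode category rather than by an ASCII-only pattern is a design choice of this sketch: it removes Bengali digits and the danda while preserving the combining vowel signs that are part of Bengali words.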
