A Multilevel N-gram Model with Naïve Bayes Classification of Personal Web History Datasets

Authors

Kho Lee Chin Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Ngu Sze Song Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Annie Joseph Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Nordiana Rajaee Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Siti Kudnie Sahari Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Hau Ji Liang

DOI:

https://doi.org/10.33736/jese.10735.2026

Keywords:

Big Data Analytics, Naive Bayes Classification, N-gram modeling, personal datasets

Abstract

The rapid expansion of 4G and 5G networks has accelerated the proliferation of Internet-connected devices, leading to a massive increase in Internet of Things (IoT) traffic and the generation of diverse big data. Big data analytics has been widely adopted across healthcare, gaming, cybersecurity, business, and other sectors to extract actionable insights, uncover hidden patterns, and support informed decision-making from sector-specific datasets. However, big data analytics of personal history datasets remains a comparatively inactive and underexplored field. This paper addresses that gap by proposing a novel multilevel N-gram model combined with a Naïve Bayes classifier to classify personal website history datasets. In the first stage, URL strings are decomposed into multiple N-gram levels (unigrams, bigrams, trigrams) to capture both simple lexical features and contextual patterns. In the second stage, the extracted features are classified using the Naïve Bayes algorithm, which applies Bayes’ theorem under the assumption of conditional independence between features to compute the probability of each category. Empirical validation on a standardised dataset demonstrates that the proposed approach achieves an average F1-score of 88%, outperforming existing baseline methods documented in prior literature. These findings highlight the effectiveness of the proposed method for big data analysis of web usage, particularly for personal history datasets.
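The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how URLs are tokenised, whether the N-grams are word-level or character-level, or which Naïve Bayes variant is used. The sketch assumes word-level N-grams over alphanumeric URL tokens and a multinomial Naïve Bayes with add-one smoothing. The names `url_tokens`, `multilevel_ngrams`, and `NaiveBayes`, as well as the example URLs and category labels, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenisation)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def multilevel_ngrams(url, max_n=3):
    """Stage 1: emit unigrams, bigrams and trigrams, each tagged with its level."""
    tokens = url_tokens(url)
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(f"{n}:" + "_".join(tokens[i:i + n]))
    return features


class NaiveBayes:
    """Stage 2: multinomial Naive Bayes with add-one smoothing (assumed variant)."""

    def fit(self, urls, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for url, label in zip(urls, labels):
            feats = multilevel_ngrams(url)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, url):
        """Return log P(class) + sum of log P(feature | class) for each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)
            for feat in multilevel_ngrams(url):
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, url):
        scores = self.log_scores(url)
        return max(scores, key=scores.get)


# Toy, made-up training data; not the dataset used in the paper.
train_urls = [
    "https://news.example.com/world/politics",
    "https://news.example.com/world/economy",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/product/shoes",
]
train_labels = ["news", "news", "shopping", "shopping"]

model = NaiveBayes().fit(train_urls, train_labels)
print(model.predict("https://news.example.com/world/sports"))
```

Tagging each feature with its level keeps a unigram and a longer N-gram from colliding in the vocabulary. Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long URLs.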

Published

2026-04-30

How to Cite

Kho, L. C., Ngu, S. S., Joseph, A., Rajaee, N., Sahari, S. K., & Hau, J. L. (2026). A Multilevel N-gram Model with Naïve Bayes Classification of Personal Web History Datasets. Journal of Engineering Science and Energy Sustainability, 1(1), 26–36. https://doi.org/10.33736/jese.10735.2026

Issue

Vol. 1 No. 1 (2026): Journal of Engineering Science and Energy Sustainability

Section

Articles