From Metadata to Meaning: A Hybrid Clustering and Interpretable Rating Analysis of the Netflix Library

Authors

  • Siew Mooi Lim, Department of Computer Science and Data Science, Faculty of Computing and Information Technology, Tunku Abdul Rahman University of Management and Technology, 53300 Kuala Lumpur, Malaysia
  • Xue Kang Chok, Department of Computer Science and Data Science, Faculty of Computing and Information Technology, Tunku Abdul Rahman University of Management and Technology, 53300 Kuala Lumpur, Malaysia
  • Qi Xiang Choo, Department of Computer Science and Data Science, Faculty of Computing and Information Technology, Tunku Abdul Rahman University of Management and Technology, 53300 Kuala Lumpur, Malaysia

DOI:

https://doi.org/10.33736/jcsi.11660.2026

Keywords:

Unsupervised Learning, Hybrid Clustering, Semantic Analysis, Topic Modelling, Explainable AI, Streaming Content Analytics

Abstract

The rapid growth of streaming platforms has created vast, heterogeneous content libraries, which makes it difficult to organise content effectively and to understand audience preferences. This study uses a hybrid analytical framework to uncover multi-faceted patterns in the structure, categorisation, and audience reception of Netflix's content. Three clustering methodologies, namely Latent Dirichlet Allocation (LDA), spectral clustering, and K-Prototypes, were applied alongside IMDb rating analysis, genre inference with TF-IDF, and advanced semantic clustering interpreted through XGBoost and SHAP values. LDA topic modelling identified three distinct thematic areas in content descriptions. Spectral clustering, using nearest-neighbours affinity and unnormalised Laplacian regularisation, distinguished three clusters based on geographical origin and content maturity. K-Prototypes clustering identified five segments characterised by distinct format, duration, and regional patterns. In addition, IMDb rating analysis provided external validation of content quality, genre inference with TF-IDF identified the textual markers of each genre, and semantic clustering revealed high-value genres and low-value tropes. These findings indicate that Netflix maintains a balanced portfolio of content spanning different formats, regions, and themes, catering to diverse audience preferences. Because the clustering approaches complement one another, together they provide a multi-dimensional understanding of streaming content libraries, offering actionable insights for content acquisition strategies, recommendation systems, and data-driven content personalisation on streaming platforms.

Published

2026-03-26

How to Cite

Lim, S. M., Chok, X. K., & Choo, Q. X. (2026). From Metadata to Meaning: A Hybrid Clustering and Interpretable Rating Analysis of the Netflix Library. Journal of Computing and Social Informatics, 5(2), 1–13. https://doi.org/10.33736/jcsi.11660.2026