From Metadata to Meaning: A Hybrid Clustering and Interpretable Rating Analysis of the Netflix Library
DOI:
https://doi.org/10.33736/jcsi.11660.2026
Keywords:
Unsupervised Learning, Hybrid Clustering, Semantic Analysis, Topic Modelling, Explainable AI, Streaming Content Analytics
Abstract
The rapid growth of streaming platforms has created vast, heterogeneous content libraries, posing challenges for effective content organisation and understanding of audience preferences. This study aims to uncover multi-faceted patterns in Netflix's content structure, categorisation, and audience reception by employing a hybrid analytical framework. Three distinct clustering methodologies (Latent Dirichlet Allocation (LDA), spectral clustering, and K-Prototypes) were applied alongside IMDb rating analysis, genre inference with TF-IDF, and advanced semantic clustering enhanced by interpretable XGBoost and SHAP values. LDA topic modelling identified three distinct thematic areas in content descriptions. Spectral clustering, using nearest-neighbours affinity and unnormalised Laplacian regularisation, distinguished three clusters based on geographical origin and content maturity. K-Prototypes clustering identified five segments characterised by distinct format, duration, and regional patterns. Furthermore, IMDb rating analysis provided external validation of content quality, genre inference with TF-IDF elucidated textual markers for genres, and semantic clustering revealed high-value genres and low-value tropes. These findings demonstrate that Netflix maintains a carefully balanced portfolio of content spanning different formats, regions, and themes, catering to diverse audience preferences. The complementary nature of the clustering approaches provides a multi-dimensional understanding of streaming content libraries, offering actionable insights for content acquisition strategies, recommendation systems, and data-driven content personalisation on streaming platforms.
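As an illustration of the TF-IDF genre inference named in the abstract, the following is a minimal sketch of how TF-IDF can surface textual markers for a genre. It is not the authors' implementation: the toy descriptions, the stopword list, the function names (`tokenize`, `tfidf_by_genre`, `top_markers`), and the choice of the plain `tf * log(N / df)` weighting are all assumptions made for the example. Each genre's text is treated as a single document, so a word that appears in every genre receives a weight of zero.

```python
import math
from collections import Counter

# Toy descriptions grouped by genre (hypothetical; not drawn from the dataset).
GENRE_TEXT = {
    "horror": "a haunted house hides a deadly ghost and a family secret",
    "comedy": "a clumsy comedian hides a secret and a family wedding goes wrong",
    "documentary": "a film crew follows a family of wild elephants across the savanna",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "and", "of", "the", "across", "goes"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_by_genre(genre_text):
    """Return {genre: {word: tf-idf}} with tf = count / length, idf = log(N / df)."""
    tokens = {genre: tokenize(text) for genre, text in genre_text.items()}
    n_docs = len(tokens)

    # Document frequency: in how many genre documents each word occurs.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    scores = {}
    for genre, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        scores[genre] = {
            w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()
        }
    return scores


def top_markers(scores, k=3):
    """Top-k words per genre by tf-idf; ties are broken alphabetically."""
    return {
        genre: sorted(s, key=lambda w: (-s[w], w))[:k] for genre, s in scores.items()
    }


SCORES = tfidf_by_genre(GENRE_TEXT)
MARKERS = top_markers(SCORES)
```

On this toy input, "family" occurs in all three genre texts and therefore scores zero, while words found only in the horror text rank as its markers. A real analysis would work from the full catalogue descriptions and would likely rely on an established library implementation rather than this hand-rolled version.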
License
Copyright (c) 2026 Journal of Computing and Social Informatics

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.