A Comparative Analysis of Bag of Words, TF-IDF, and Transformer Methods in an Artificial Intelligence-Based Automated Essay Scoring System
Keywords:
automated essay scoring, bag of words, IndoBERT, TF-IDF, transformer
Abstract
Manual essay grading suffers from inconsistency, subjectivity, and time constraints, particularly in large-scale learning settings. This study compares three text representation approaches in an artificial intelligence-based automated essay scoring system: Bag of Words (BoW), Term Frequency–Inverse Document Frequency (TF-IDF), and a Transformer (IndoBERT). The dataset is the Kaggle Learning Agency Lab Automated Essay Scoring 2.0 dataset, which comprises 17,207 English-language essays that were translated into Indonesian using the Helsinki-NLP opus-mt-en-id model. Preprocessing consists of case folding, text cleaning, stopword removal, and stemming with the Sastrawi library. BoW and TF-IDF are each paired with Support Vector Regression, while the Transformer approach fine-tunes IndoBERT. Evaluation uses the Quadratic Weighted Kappa (QWK) metric. The experiments show that IndoBERT achieves the highest performance with a QWK of 0.7842, followed by TF-IDF at 0.6521 and BoW at 0.6103. Although the Transformer is superior in accuracy, the classical methods remain relevant for deployments with limited computational resources because of their shorter run time and lower complexity. These findings underscore the importance of choosing an automated scoring method suited to the needs and infrastructure of the educational context.
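Since every result above is reported as a QWK value, a minimal sketch of how that metric can be computed may help make the numbers concrete. It is not the authors' evaluation code. The function names, the rounding and clipping step for continuous regression outputs, and the 1 to 6 score range in the usage example are all assumptions made for illustration, as the abstract does not state them.

```python
import math


def to_ratings(predictions, min_rating, max_rating):
    """Round continuous predictions (e.g. regression outputs) to the nearest
    integer rating and clip them to the scale. Assumed post-processing step."""
    ratings = []
    for value in predictions:
        rounded = int(math.floor(value + 0.5))  # round half up
        ratings.append(max(min_rating, min(max_rating, rounded)))
    return ratings


def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic Weighted Kappa between two lists of integer ratings."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    if max_rating <= min_rating:
        raise ValueError("the rating scale needs at least two categories")

    n = max_rating - min_rating + 1
    total = len(y_true)

    # Observed agreement matrix: rows are true ratings, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used a single identical category; kappa is undefined,
        # and this sketch chooses to report perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    # Made-up scores on an assumed 1 to 6 scale, for illustration only.
    human = [1, 2, 3, 4, 5, 6]
    raw_model_output = [1.2, 2.6, 2.9, 4.4, 4.6, 6.3]
    predicted = to_ratings(raw_model_output, 1, 6)
    print(quadratic_weighted_kappa(human, predicted, 1, 6))
```

A value of 1 indicates perfect agreement with the human scores, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty grows with the square of the distance between ratings, a prediction that is off by two points costs four times as much as one that is off by one point.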




