The potential of ChatGPT in translation evaluation: A case study of Chinese-Portuguese machine translation
DOI:
https://doi.org/10.5007/2175-7968.2024.e98613

Keywords:
ChatGPT, machine translation (MT), automatic scoring, human assessment, evaluation metric

Abstract
The integration of artificial intelligence (AI) into translation assessment represents a significant evolution in the field, moving beyond traditional human-only scoring approaches. This study examines the role of ChatGPT, a multilingual, transformer-based large language model developed by OpenAI, in the automated evaluation of machine translations between Portuguese and Mandarin. Despite ChatGPT's growing reputation for advanced natural language processing (NLP) capabilities, its application to translation evaluation, particularly for this language pair, remains largely unexplored. To fill this gap, we used three prevalent machine translation tools to translate a set of twenty sentences from Chinese into Portuguese. Target-text versions produced by professional Chinese-Portuguese translators were also included to gauge whether the machine-translated texts reach a degree of human parity. We then assessed these translations with two GPT models (ChatGPT 3.5 and ChatGPT 4.0) and five human raters to provide a comprehensive scoring analysis. The findings reveal that ChatGPT, particularly ChatGPT 4.0, shows substantial promise in evaluating translations across varied text types, although this potential is tempered by notable inconsistencies and limitations in its performance. Combining quantitative analysis with qualitative insights, the study highlights the synergy between ChatGPT's automated scoring and traditional human assessment and points to three key contributions of this automated approach: (1) demonstrating the viability of automated translation evaluation, particularly for the Chinese-Portuguese language pair; (2) providing a critical supplement to human evaluation; and (3) revealing the remarkable capability of cutting-edge machine translation tools for this language pair. Our findings contribute to a more detailed understanding of ChatGPT's role in translation assessment and underscore the need for a balanced approach that leverages both human expertise and AI capabilities.
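The abstract does not detail how the GPT-based scoring was operationalized. Purely as an illustration of what such an automated scoring step could look like, the minimal Python sketch below asks an OpenAI chat model to rate a Chinese-Portuguese sentence pair; the prompt wording, the 0-100 scale, the model identifier, and the score_translation helper are assumptions made for this example, not the protocol reported in the study.

# Illustrative sketch only (not the study's actual protocol): the prompt wording,
# 0-100 scale, and model identifier are assumptions made for this example.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = (
    "You are an experienced Chinese-Portuguese translation evaluator. "
    "Rate the Portuguese translation of the Chinese source sentence from 0 to 100, "
    "considering accuracy and fluency. Reply with the number only.\n\n"
    "Source (zh): {src}\n"
    "Translation (pt): {tgt}"
)

def score_translation(src: str, tgt: str, model: str = "gpt-4") -> float:
    """Request a single numeric quality score from a GPT chat model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the score as reproducible as the API allows
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(src=src, tgt=tgt)}],
    )
    return float(response.choices[0].message.content.strip())

# Hypothetical usage:
# print(score_translation("今天天气很好。", "O tempo está muito bom hoje."))

In a setup of the kind the abstract describes, such a call would be repeated for each of the twenty sentences and for each machine-translated or human version, and the resulting scores compared with those assigned by the human raters.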
License
Copyright (c) 2024 Cadernos de Tradução
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution 4.0 International License (CC BY), which allows the work to be shared with acknowledgement of authorship and initial publication in this journal.
Authors may enter into additional, separate contracts for the non-exclusive distribution of the version of the work published in this journal (e.g., publishing it in an institutional repository or as a book chapter), with acknowledgement of authorship and initial publication in this journal.