Investigating the Linguistic Fingerprint of GPT-4o in Arabic-to-English Translation Using Stylometry
Keywords:
Stylometric analysis, Machine-generated text, Natural Language Processing, GPT-4o, Authorship attribution
Abstract
This study explores the linguistic and stylistic characteristics of machine-generated texts, focusing on the output of GPT-4o. Using natural language processing (NLP) techniques, including word frequency and stopword count analysis, readability and sentence structure metrics, lexical diversity measures, syntactic frequency analysis, and named entity recognition (NER), the research aims to uncover the stylometric fingerprints present in machine-generated content. The results reveal that GPT-4o-generated texts exhibit moderate lexical diversity and syntactic complexity, with certain chapters reflecting higher readability and more varied sentence structures, while others lean toward simpler linguistic patterns. The findings also highlight thematic variation across chapters, as observed in the distribution of named entities, which contributes to understanding the model’s handling of different contextual content.
The research suggests that while GPT-4o maintains a consistent style in its generated text, there are distinguishable characteristics that may serve as indicators of machine authorship. This provides valuable insights for stylometric analysis, authorship attribution, and the identification of machine-generated texts in various contexts. Future research could extend this work by exploring deeper stylometric features, conducting cross-model comparisons, and developing advanced authorship detection algorithms tailored for AI-generated content. Moreover, the ethical implications of stylometric analysis in the context of AI-generated texts warrant further investigation, particularly as machine-generated content becomes increasingly prevalent across different domains.
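As an illustration only, several of the surface-level measures named in the abstract (word frequency, stopword counts, lexical diversity, and sentence length) can be sketched with the Python standard library. This is not the study's pipeline: the function name `stylometric_features`, the tiny `STOPWORDS` set, and the regex-based tokenizer are all assumptions made for the example, and readability scores, syntactic frequency analysis, and NER are left out because they would need additional tooling.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list for illustration;
# the list used in the study is not specified in the abstract.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def stylometric_features(text):
    """Compute a few simple stylometric measures for an English text."""
    # Naive sentence split on terminal punctuation (an assumption;
    # a real analysis would likely use a trained sentence tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "token_count": n,
        # Type-token ratio: distinct words / total words (lexical diversity).
        "type_token_ratio": len(counts) / n if n else 0.0,
        # Share of tokens that appear in the stopword list.
        "stopword_ratio": sum(counts[w] for w in STOPWORDS) / n if n else 0.0,
        # Mean number of tokens per sentence.
        "avg_sentence_length": n / len(sentences) if sentences else 0.0,
        # Three most frequent words with their counts.
        "top_words": counts.most_common(3),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran to the cat."
    for name, value in stylometric_features(sample).items():
        print(name, value)
```

Computing these values per chapter and comparing them side by side would give the kind of chapter-level contrast the abstract describes. Note that the type-token ratio is sensitive to text length, so chapters of very different sizes are not directly comparable on that measure alone.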
Copyright (c) 2024 Maysaa Banat
This work is licensed under a Creative Commons Attribution 4.0 International License.