Authorship attribution and style identification: linguistic data analysis with the aid of R

DOI:

https://doi.org/10.5007/1984-8412.2022.e79086

Abstract

This article adds to the existing work on Natural Language Processing by demonstrating how programming languages such as R (R CORE TEAM, 2020) can be useful for authorship attribution and for identifying an author's style in literary works. Two authors were selected, with two works each: The Adventures of Tom Sawyer (1876) and Adventures of Huckleberry Finn (1884), by Mark Twain (1835-1910), and Typee: A Peep at Polynesian Life (1846) and Omoo: A Narrative of Adventures in the South Seas (1847), by Herman Melville (1819-1891). The data were then analyzed following the methodology of Eder et al. (2016), in order to test the effectiveness of the stylo package and to apply Principal Component Analysis, Cluster Analysis, and the Consensus Tree. The results showed that each of the methods tested was able to distinguish the two authors' works, which attests to the effectiveness of the package. In addition, a stylometric analysis is carried out using Craig's Zeta and Rolling Delta. For the latter, works by two German-language authors, Franz Kafka and Heinrich von Kleist, were used. The results pointed to a stylistic resemblance to von Kleist, above all in Kafka's first work. Finally, the Rolling Delta method was used to explore an analysis by Juola (2013a, 2013b) of a work that J. K. Rowling wrote under the pseudonym Robert Galbraith.

References

ARGAMON, S. Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations. Literary and Linguistic Computing, v. 23, n. 2, p. 131-147, 2008.

BARTOSZUK, M.; GAGOLEWSKI, M. SimilaR: R Code Clone and Plagiarism Detection. The R Journal, v. 12, n. 1, p. 367-385, 2020.

BOEHMKE, B.; GREENWELL, B. Hands-On Machine Learning with R. New York: CRC Press, 2019.

BURROWS, J. ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, v. 17, n. 3, p. 267-287, 2002.

BURROWS, J. Who wrote Shamela? Verifying the Authorship of a Parodic Text. Literary and Linguistic Computing, v. 20, n. 4, p. 437-450, 2005.

BURROWS, J. All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, v. 22, n. 1, p. 27-47, 2006.

CRAIG, H.; KINNEY, A. Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press, 2009.

EDER, M. Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, v. 6, p. 99-114, 2011.

EDER, M. Rolling stylometry. Digital Scholarship in the Humanities, v. 31, n. 3, p. 1-13, 2015.

EDER, M. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities, v. 32, n. 1, p. 50-64, 2017.

EDER, M.; RYBICKI, J.; KESTEMONT, M. Stylometry with R: A Package for Computational Text Analysis. The R Journal, v. 8, n. 1, p. 107-121, 2016.

ENGELSTEIN, S. The Open Wound of Beauty: Kafka Reading Kleist. The Germanic Review: Literature, Culture, Theory, v. 81, n. 4, p. 340-359, 2006.

FURST, L. Reading Kleist and Kafka. The Journal of English and Germanic Philology, v. 84, n. 3, p. 374-395, 1985.

GRANDIN, J. Kafka’s Prussian Advocate: A Study of the Influence of Heinrich von Kleist on Franz Kafka. Columbia: Camden, 1987.

HENNIG, C.; MEILA, M.; MURTAGH, F.; ROCCI, R. Handbook of Cluster Analysis. Boca Raton: CRC Press, 2016.

HOOVER, D. Corpus Stylistics, Stylometry, and the Styles of Henry James. Style, v. 41, n. 2, p. 174-203, 2007a.

HOOVER, D. Quantitative Analysis and Literary Studies. In: SCHREIBMAN, S.; SIEMENS, R. (ed.). A Companion to Digital Literary Studies. Oxford: Blackwell, 2007b. p. 517-533.

HOOVER, D. The Craig Zeta Spreadsheet. Digital Humanities 2010 [Book of Abstracts], London: King’s College London, 2010. Available at: http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-659.html. Accessed: 25 Aug. 2020.

JOLLIFFE, I. Principal Component Analysis. 2. ed. New York: Springer, 2002.

JOSHI, S. Sentiment Analysis on Whatsapp Group Chat Using R. In: SHUKLA, R.; AGRAWAL, J.; SHARMA, S.; TOMER, G. (eds.). Data, Engineering and Applications. Vol. 1. Gateway East: Springer, 2019. p. 47-56.

JUOLA, P. How a Computer Program Helped Show J. K. Rowling write A Cuckoo’s Calling. In: Scientific American, 2013a. Available at: https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling. Accessed: 22 Aug. 2020.

JUOLA, P. Rowling and ‘Galbraith’: an authorial analysis. In: Language Log, 2013b. Available at: https://languagelog.ldc.upenn.edu/nll/?p=5315. Accessed: 22 Aug. 2020.

MEHIGAN, T. The process of inferential contexts: Franz Kafka reading Heinrich von Kleist. In: MEHIGAN, T. (ed.). Heinrich von Kleist: Writing After Kant. Rochester: Boydell & Brewer, 2011. p. 196-226.

O’SULLIVAN, J.; BAZARNIK, K.; EDER, M.; RYBICKI, J. Measuring Joycean Influences on Flann O’Brien. Digital Studies, v. 8, n. 1, p. 1-25, 2018.

PENNEBAKER, J. The Secret Life of Pronouns: What our Words Say About Us. New York: Bloomsbury Press, 2011.

PETERS, F. Kafka and Kleist: A Literary Relationship. Oxford German Studies, v. 1, n. 1, p. 114-162, 1966.

QIAN, C.; HE, T.; ZHANG, R. Deep Learning based Authorship Identification. Stanford Reports, 2017. Available at: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2760185.pdf. Accessed: 25 Aug. 2020.

R CORE TEAM. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. Available at: www.r-project.org. Accessed: 8 Jan. 2021.

RYBICKI, J.; KESTEMONT, M.; HOOVER, D. Collaborative authorship: Conrad, Ford, and rolling delta. Literary and Linguistic Computing, v. 29, n. 3, p. 422-431, 2014.

SHAHAR, G. Fragments and Wounded Bodies: Kafka after Kleist. The German Quarterly, v. 80, n. 4, p. 449-467, 2007.

SILGE, J.; ROBINSON, D. Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly, 2017.

STACK OVERFLOW. 2020. Available at: https://stackoverflow.com/. Accessed: 20 Aug. 2020.

STAMATATOS, E. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, v. 60, n. 3, p. 538-556, 2009.

TABATA, T. Stylometry of collaborations: Dickens, Collins and their collaborative writings. In: Digital Humanities 2014: Conference Abstracts. Lausanne: EPFL-UNIL, 2014. p. 378-380.

THE COMPREHENSIVE R ARCHIVE NETWORK. 2020. Available at: https://cran.r-project.org/. Accessed: 22 Aug. 2020.

THE R PROJECT FOR STATISTICAL COMPUTING. 2020. Available at: https://www.r-project.org/. Accessed: 20 Aug. 2020.

TIDYVERSE. 2020. Available at: https://www.tidyverse.org/. Accessed: 20 Aug. 2020.

WIERZCHOŃ, S.; KŁOPOTEK, M. Modern Algorithms of Cluster Analysis. Cham: Springer, 2018.

ZHANG, C.; WU, X.; NIU, Z.; DING, W. Authorship Identification from Unstructured Texts. Knowledge-Based Systems, v. 66, p. 99-111, 2014.

Published

2022-11-23

Section

Article