Authorship attribution and style identification: analysis of literary works with R




This article adds to the available work on natural language processing by demonstrating how programming languages such as R (R CORE TEAM, 2020) can be used to attribute authorship and identify an author's style in literary works. Two authors and two works by each were selected: The Adventures of Tom Sawyer (1876) and Adventures of Huckleberry Finn (1884) by Mark Twain (1835-1910), and Typee: A Peep at Polynesian Life (1846) and Omoo: A Narrative of Adventures in the South Seas (1847) by Herman Melville (1819-1891). The data were then analyzed following the methodology of Eder et al. (2016), in order to test the effectiveness of the stylo package and to apply three methods: Principal Component Analysis, Cluster Analysis, and Consensus Tree. The results showed that each of the methods tested was able to distinguish the authors' works, which attests to the effectiveness of the package. In addition, a stylometric analysis based on Craig's Zeta and Rolling Delta is carried out. For this second analysis, works by two German-language authors, Franz Kafka and Heinrich von Kleist, were used. The results point to a stylistic similarity to von Kleist, above all in Kafka's first work. Finally, the Rolling Delta method was used to explore an analysis by Juola (2013a, 2013b) of a work by J. K. Rowling written under the pseudonym Robert Galbraith.
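As a rough illustration of the kind of computation behind these methods, the sketch below uses base R only. It builds a toy table of word frequencies, derives a Delta-style distance (mean absolute difference of z-scores, following the general idea in Burrows, 2002), and then runs a cluster analysis and a PCA on it. The text labels, the word list, and every number are invented for illustration. This is not the stylo package's implementation and not the article's data or results.

```r
# Toy relative frequencies (per 100 words) of four function words in four texts.
# ASSUMPTION: labels and numbers are invented; they do not come from the article.
freqs <- matrix(
  c(6.1, 3.2, 2.9, 1.1,
    5.9, 3.4, 3.0, 1.0,
    4.2, 4.5, 2.1, 1.9,
    4.0, 4.7, 2.0, 2.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A_1", "A_2", "B_1", "B_2"),   # hypothetical authors A and B
                  c("the", "and", "of", "but"))
)

# Standardise each word column to z-scores across the texts.
z <- scale(freqs)

# Delta-style distance: mean absolute difference of z-scores,
# i.e. Manhattan distance divided by the number of words.
delta <- as.matrix(dist(z, method = "manhattan")) / ncol(z)

# Cluster analysis on the Delta distances, cut into two groups.
hc <- hclust(as.dist(delta), method = "ward.D2")
groups <- cutree(hc, k = 2)

# PCA on the same frequencies (columns scaled to unit variance).
pca <- prcomp(freqs, scale. = TRUE)

print(round(delta, 2))
print(groups)
```

With a real corpus the frequency table would hold the most frequent words of each text rather than four hand-picked ones. Calling `plot(hc)` draws the dendrogram, and `pca$x[, 1:2]` gives the coordinates for a two-dimensional PCA plot.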


ARGAMON, S. Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations. Literary and Linguistic Computing, v. 23, n. 2, p. 131-147, 2008.

BARTOSZUK, M.; GAGOLEWSKI, M. SimilaR: R Code Clone and Plagiarism Detection. The R Journal, v. 12, n. 1, p. 367-385, 2020.

BOEHMKE, B.; GREENWELL, B. Hands-On Machine Learning with R. New York: CRC Press, 2019.

BURROWS, J. ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, v. 17, n. 3, p. 267-287, 2002.

BURROWS, J. Who wrote Shamela? Verifying the Authorship of a Parodic Text. Literary and Linguistic Computing, v. 20, n. 4, p. 437-450, 2005.

BURROWS, J. All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, v. 22, n. 1, p. 27-47, 2006.

CRAIG, H.; KINNEY, A. Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press, 2009.

EDER, M. Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, v. 6, p. 99-114, 2011.

EDER, M. Rolling stylometry. Digital Scholarship in the Humanities, v. 31, n. 3, p. 1-13, 2015.

EDER, M. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities, v. 32, n. 1, p. 50-64, 2017.

EDER, M.; RYBICKI, J.; KESTEMONT, M. Stylometry with R: A Package for Computational Text Analysis. The R Journal, v. 8, n. 1, p. 107-121, 2016.

ENGELSTEIN, S. The Open Wound of Beauty: Kafka Reading Kleist. The Germanic Review: Literature, Culture, Theory, v. 81, n. 4, p. 340-359, 2006.

FURST, L. Reading Kleist and Kafka. The Journal of English and Germanic Philology, v. 84, n. 3, p. 374-395, 1985.

GRANDIN, J. Kafka’s Prussian Advocate: A Study of the Influence of Heinrich von Kleist on Franz Kafka. Columbia: Camden, 1987.

HENNIG, C.; MEILA, M.; MURTAGH, F.; ROCCI, R. Handbook of Cluster Analysis. Boca Raton: CRC Press, 2016.

HOOVER, D. Corpus Stylistics, Stylometry, and the Styles of Henry James. Style, v. 41, n. 2, p. 174-203, 2007a.

HOOVER, D. Quantitative Analysis and Literary Studies. In: SCHREIBMAN, S.; SIEMENS, R. (ed.). A Companion to Digital Literary Studies. Oxford: Blackwell, 2007b. p. 517-533.

HOOVER, D. The Craig Zeta Spreadsheet. Digital Humanities 2010 [Book of Abstracts], London: King’s College London, 2010. Accessed: 25 Aug. 2020.

JOLLIFFE, I. Principal Components Analysis. 2nd ed. New York: Springer, 2002.

JOSHI, S. Sentiment Analysis on Whatsapp Group Chat Using R. In: SHUKLA, R.; AGRAWAL, J.; SHARMA, S.; TOMER, G. (ed.). Data, Engineering and Applications. Vol. 1. Gateway East: Springer, 2019. p. 47-56.

JUOLA, P. How a Computer Program Helped Show J. K. Rowling write A Cuckoo’s Calling. In: Scientific American, 2013a. Accessed: 22 Aug. 2020.

JUOLA, P. Rowling and ‘Galbraith’: an authorial analysis. In: Language Log, 2013b. Accessed: 22 Aug. 2020.

MEHIGAN, T. The process of inferential contexts: Franz Kafka reading Heinrich von Kleist. In: MEHIGAN, T. (ed.). Heinrich von Kleist: Writing After Kant. Rochester: Boydell & Brewer, 2011. p. 196-226.

O’SULLIVAN, J.; BAZARNIK, K.; EDER, M.; RYBICKI, J. Measuring Joycean Influences on Flann O’Brien. Digital Studies, v. 8, n. 1, p. 1-25, 2018.

PENNEBAKER, J. The Secret Life of Pronouns: What our Words Say About Us. New York: Bloomsbury Press, 2011.

PETERS, F. Kafka and Kleist: A Literary Relationship. Oxford German Studies, v. 1, n. 1, p. 114-162, 1966.

QIAN, C.; HE, T.; ZHANG, R. Deep Learning based Authorship Identification. Stanford Reports, 2017. Accessed: 25 Aug. 2020.

R CORE TEAM. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. Accessed: 08 Jan. 2021.

RYBICKI, J.; KESTEMONT, M.; HOOVER, D. Collaborative authorship: Conrad, Ford, and rolling delta. Literary and Linguistic Computing, v. 29, n. 3, p. 422-431, 2014.

SHAHAR, G. Fragments and Wounded Bodies: Kafka after Kleist. The German Quarterly, v. 80, n. 4, p. 449-467, 2007.

SILGE, J.; ROBINSON, D. Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly, 2017.

STACK OVERFLOW. 2020. Accessed: 20 Aug. 2020.

STAMATATOS, E. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, v. 60, n. 3, p. 538-556, 2009.

TABATA, T. Stylometry of collaborations: Dickens, Collins and their collaborative writings. In: Digital Humanities 2014: Conference Abstracts. Lausanne: EPFL-UNIL, 2014. p. 378-380.

THE COMPREHENSIVE R ARCHIVE NETWORK. 2020. Accessed: 22 Aug. 2020.

THE R PROJECT FOR STATISTICAL COMPUTING. 2020. Accessed: 20 Aug. 2020.

TIDYVERSE. 2020. Accessed: 20 Aug. 2020.

WIERZCHOŃ, S.; KŁOPOTEK, M. Modern Algorithms of Cluster Analysis. Cham: Springer, 2018.

ZHANG, C.; WU, X.; NIU, Z.; DING, W. Authorship Identification from Unstructured Texts. Knowledge-Based Systems, v. 66, p. 99-111, 2014.




