Individuation of Authorship and Style Identification: Analysis of Literary Works Carried Out with R

DOI:

https://doi.org/10.5007/1984-8412.2022.e79086

Abstract

This paper adds to the literature on Natural Language Processing by demonstrating how programming languages such as R (R CORE TEAM, 2020) can be used to detect authorship and identify an author's style in literary works. Two authors were selected, with two works each: The Adventures of Tom Sawyer (1876) and Adventures of Huckleberry Finn (1884) by Mark Twain (1835-1910), and Typee: A Peep at Polynesian Life (1846) and Omoo: A Narrative of Adventures in the South Seas (1847) by Herman Melville (1819-1891). The data were then analyzed following the methodology of Eder et al. (2016), in order to test the effectiveness of the stylo package by applying the Principal Component Analysis, Cluster Analysis and Consensus Tree methods. The results showed that each of the tested methods was able to distinguish the works of the two authors, thus evidencing the effectiveness of the package. In addition, a stylometric analysis was performed based on the Craig's Zeta and Rolling Delta methods. For the latter, works by two German-speaking authors, Franz Kafka and Heinrich von Kleist, were used. The results pointed to a stylistic similarity to von Kleist, especially in Kafka’s first work. Additionally, Rolling Delta was used to explore an analysis carried out by Juola (2013a, 2013b) regarding a work by J. K. Rowling written under the pseudonym Robert Galbraith.
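The core idea behind two of the methods named in the abstract, Principal Component Analysis and Cluster Analysis over most-frequent-word profiles, can be illustrated with a minimal base-R sketch. This is not the stylo package's own code and not the authors' script: the four short strings are hypothetical stand-ins for the novels, the cutoff of five most frequent words is arbitrary, and names such as `texts`, `n_mfw` and `freqs` are introduced here for illustration only. The distance used is the classic Delta of Burrows (2002), i.e. the mean absolute difference of z-scored word frequencies.

```r
# Toy corpus: hypothetical stand-ins for the four novels (NOT the real texts).
texts <- c(
  twain_a    = "the boy and the raft went down the river and the boy laughed",
  twain_b    = "the boy and the dog ran to the river and the raft was gone",
  melville_a = "of the island and of the sea we spoke as we lay in the valley",
  melville_b = "of the ship and of the sea we thought as we lay in the bay"
)

# Tokenize: lowercase, then split on anything that is not a letter.
tokens <- lapply(texts, function(x) {
  w <- strsplit(tolower(x), "[^a-z]+")[[1]]
  w[nzchar(w)]
})

# Select the most frequent words (MFW) across the whole corpus.
n_mfw <- 5  # arbitrary cutoff, chosen only to suit the toy data
mfw <- names(sort(table(unlist(tokens)), decreasing = TRUE))[seq_len(n_mfw)]

# Relative frequency of each MFW in each text (rows = texts, columns = words).
freqs <- t(sapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = mfw))) / length(w)
}))
colnames(freqs) <- mfw

# Principal Component Analysis on the standardized frequencies.
pca <- prcomp(freqs, scale. = TRUE)
scores <- pca$x[, 1:2]  # coordinates of each text on the first two components

# Classic Delta: mean absolute difference of z-scores, which equals the
# Manhattan distance between z-score profiles divided by the number of words.
z <- scale(freqs)
delta <- dist(z / n_mfw, method = "manhattan")

# Hierarchical cluster analysis on the Delta distances, cut into two groups.
hc <- hclust(delta, method = "ward.D2")
groups <- cutree(hc, k = 2)

print(round(scores, 2))
print(groups)
```

With real texts, the same steps would be run on full novels and on a much longer list of frequent words. In stylo itself these analyses are selected through the `analysis.type` argument of `stylo()`, while Craig's Zeta and Rolling Delta are provided by separate functions (`oppose()` and `rolling.delta()`), which this sketch does not attempt to reproduce.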

References

ARGAMON, S. Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations. Literary and Linguistic Computing, v. 23, n. 2, p. 131-147, 2008.

BARTOSZUK, M.; GAGOLEWSKI, M. SimilaR: R Code Clone and Plagiarism Detection. The R Journal, v. 12, n. 1, p. 367-385, 2020.

BOEHMKE, B.; GREENWELL, B. Hands-On Machine Learning with R. New York: CRC Press, 2019.

BURROWS, J. ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, v. 17, n. 3, p. 267-287, 2002.

BURROWS, J. Who wrote Shamela? Verifying the Authorship of a Parodic Text. Literary and Linguistic Computing, v. 20, n. 4, p. 437-450, 2005.

BURROWS, J. All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, v. 22, n. 1, p. 27-47, 2006.

CRAIG, H.; KINNEY, A. Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press, 2009.

EDER, M. Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics, v. 6, p. 99-114, 2011.

EDER, M. Rolling stylometry. Digital Scholarship in the Humanities, v. 31, n. 3, p. 1-13, 2015.

EDER, M. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities, v. 32, n. 1, p. 50-64, 2017.

EDER, M.; RYBICKI, J.; KESTEMONT, M. Stylometry with R: A Package for Computational Text Analysis. The R Journal, v. 8, n. 1, p. 107-121, 2016.

ENGELSTEIN, S. The Open Wound of Beauty: Kafka Reading Kleist. The Germanic Review: Literature, Culture, Theory, v. 81, n. 4, p. 340-359, 2006.

FURST, L. Reading Kleist and Kafka. The Journal of English and Germanic Philology, v. 84, n. 3, p. 374-395, 1985.

GRANDIN, J. Kafka’s Prussian Advocate: A Study of the Influence of Heinrich von Kleist on Franz Kafka. Columbia: Camden, 1987.

HENNIG, C.; MEILA, M.; MURTAGH, F.; ROCCI, R. Handbook of Cluster Analysis. Boca Raton: CRC Press, 2016.

HOOVER, D. Corpus Stylistics, Stylometry, and the Styles of Henry James. Style, v. 41, n. 2, p. 174-203, 2007a.

HOOVER, D. Quantitative Analysis and Literary Studies. In: SCHREIBMAN, S.; SIEMENS, R. (ed.). A Companion to Digital Literary Studies. Oxford: Blackwell, 2007b. p. 517-533.

HOOVER, D. The Craig Zeta Spreadsheet. Digital Humanities 2010 [Book of Abstracts], London: King’s College London, 2010. Available at: http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-659.html. Accessed: 25 Aug. 2020.

JOLLIFFE, I. Principal Component Analysis. 2 ed. New York: Springer, 2002.

JOSHI, S. Sentiment Analysis on Whatsapp Group Chat Using R. In: SHUKLA, R.; AGRAWAL, J.; SHARMA, S.; TOMER, G. (ed.). Data, Engineering and Applications. Vol. 1. Gateway East: Springer, 2019. p. 47-56.

JUOLA, P. How a Computer Program Helped Show J. K. Rowling write A Cuckoo’s Calling. In: Scientific American, 2013a. Available at: https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling. Accessed: 22 Aug. 2020.

JUOLA, P. Rowling and ‘Galbraith’: an authorial analysis. In: Language Log, 2013b. Available at: https://languagelog.ldc.upenn.edu/nll/?p=5315. Accessed: 22 Aug. 2020.

MEHIGAN, T. The process of inferential contexts: Franz Kafka reading Heinrich von Kleist. In: MEHIGAN, T. (ed.). Heinrich von Kleist: Writing After Kant. Rochester: Boydell & Brewer, 2011. p. 196-226.

O’SULLIVAN, J.; BAZARNIK, K.; EDER, M.; RYBICKI, J. Measuring Joycean Influences on Flann O’Brien. Digital Studies, v. 8, n. 1, p. 1-25, 2018.

PENNEBAKER, J. The Secret Life of Pronouns: What our Words Say About Us. New York: Bloomsbury Press, 2011.

PETERS, F. Kafka and Kleist: A Literary Relationship. Oxford German Studies, v. 1, n. 1, p. 114-162, 1966.

QIAN, C.; HE, T.; ZHANG, R. Deep Learning based Authorship Identification. Stanford Reports, 2017. Available at: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2760185.pdf. Accessed: 25 Aug. 2020.

R CORE TEAM. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2020. Available at: www.r-project.org. Accessed: 08 Jan. 2021.

RYBICKI, J.; KESTEMONT, M.; HOOVER, D. Collaborative authorship: Conrad, Ford, and rolling delta. Literary and Linguistic Computing, v. 29, n. 3, p. 422-431, 2014.

SHAHAR, G. Fragments and Wounded Bodies: Kafka after Kleist. The German Quarterly, v. 80, n. 4, p. 449-467, 2007.

SILGE, J.; ROBINSON, D. Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly, 2017.

STACK OVERFLOW. 2020. Available at: https://stackoverflow.com/. Accessed: 20 Aug. 2020.

STAMATATOS, E. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, v. 60, n. 3, p. 538-556, 2009.

TABATA, T. Stylometry of collaborations: Dickens, Collins and their collaborative writings. In: Digital Humanities 2014: Conference Abstracts. Lausanne: EPFL-UNIL, 2014. p. 378-380.

THE COMPREHENSIVE R ARCHIVE NETWORK. 2020. Available at: https://cran.r-project.org/. Accessed: 22 Aug. 2020.

THE R PROJECT FOR STATISTICAL COMPUTING. 2020. Available at: https://www.r-project.org/. Accessed: 20 Aug. 2020.

TIDYVERSE. 2020. Available at: https://www.tidyverse.org/. Accessed: 20 Aug. 2020.

WIERZCHOŃ, S.; KŁOPOTEK, M. Modern Algorithms of Cluster Analysis. Cham: Springer, 2018.

ZHANG, C.; WU, X.; NIU, Z.; DING, W. Authorship Identification from Unstructured Texts. Knowledge-Based Systems, v. 66, p. 99-111, 2014.

Published

2022-11-23

Section

Article