Natural Language Processing in Information Metric Studies: an analysis of the articles indexed by the Web of Science (2000-2019)

Authors

DOI:

https://doi.org/10.5007/1518-2924.2021.e76886

Keywords:

Natural Language Processing, Information Metric Studies, Social Network Analysis, Scientific Research, Mapping of Science

Abstract

Objective: To identify the international scientific structure of the research on the use of natural language processing in the information metric studies area.

Methods: The study combines qualitative and quantitative approaches from information metric studies and the knowledge organization domain. The data were retrieved on 2 February 2020 from the Web of Science Core Collection using the expression "natural language processing", limited to the document types articles and reviews, the category Information Science & Library Science, and the last 20 complete years (2000 to 2019). Social network analysis was used to visualize the scientific collaboration, co-citation, and keyword co-occurrence networks.
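As an illustration of how a keyword co-occurrence network of the kind mentioned in the Methods can be derived from bibliographic records, the following is a minimal sketch and not the authors' actual pipeline. The function name `keyword_cooccurrence` and the sample records are hypothetical. It assumes each record is simply a list of author keywords and that two keywords are linked whenever they appear in the same record.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(records):
    """Count how often each unordered keyword pair appears in the same record."""
    pair_counts = Counter()
    for keywords in records:
        # Normalize case and whitespace, and drop duplicates within a record.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        # Each unordered pair found in one record adds 1 to that edge's weight.
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Hypothetical records for illustration; not taken from the study's data set.
sample_records = [
    ["Natural Language Processing", "Citation Analysis", "Scientometrics"],
    ["natural language processing", "Patent Analysis"],
    ["Citation Analysis", "Natural Language Processing"],
]

edges = keyword_cooccurrence(sample_records)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -- {target}: {weight}")
```

The resulting weighted edge list is the kind of input that network visualization tools typically accept. Sorting each record's keywords keeps every pair in a single canonical order, so the same edge is never counted under two different keys.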

Results: Of the 552 documents retrieved, 31 papers were identified as belonging to the information metric studies area. Bibliometric indicators of production, relationship, and impact were considered in the study and showed an increase in publications in the last three years, with 2018 being the most productive year.

Conclusions: The international scientific literature on the application of NLP in information metric studies is emerging. Scientometrics was identified as the source with the greatest impact. The k-core of the co-citation analysis shows the existence of an important theoretical core that is often cited in the international academic community. The set of NLP techniques (e.g., bag of words, tokenization, word stemming, part-of-speech tagging, and SVM) allows the researcher to go beyond traditional citation analysis and focus on the content and context of the citations.
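To make the notion of a k-core concrete, the sketch below extracts it from a small undirected network using the standard definition: the largest subgraph in which every node keeps at least k neighbours. This is an illustrative implementation under stated assumptions, not the procedure or software used in the study. The function name `k_core` and the toy co-citation network are hypothetical, and the graph is assumed to be an undirected adjacency dictionary in which every node appears as a key.

```python
def k_core(adjacency, k):
    """Return the node set of the k-core of an undirected graph.

    The k-core is the maximal subgraph in which every node has at least
    k neighbours that also belong to the subgraph.
    """
    # Copy the neighbour sets so the caller's graph is not modified.
    graph = {node: set(neighbours) for node, neighbours in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(graph):
            if len(graph[node]) < k:
                # Remove the node and detach it from its remaining neighbours.
                for neighbour in graph.pop(node):
                    if neighbour in graph:
                        graph[neighbour].discard(node)
                changed = True
    return set(graph)


# Hypothetical co-citation network: works A to D are all co-cited with
# one another, while work E is co-cited only with A.
toy_network = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A"},
}

core_nodes = k_core(toy_network, 3)
print(sorted(core_nodes))
```

Repeatedly pruning low-degree nodes is enough because removing a node can only lower its neighbours' degrees, so the loop stops once every surviving node meets the threshold.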

Author Biographies

Mirelys Puerta-Díaz, Universidade Estadual Paulista (Unesp)

- PhD candidate in the Information Science Program

Universidade Estadual Paulista (UNESP)

- Assistant Professor at the Faculty of Communication, University of Havana, Cuba

Bianca Savegnago de Mira, Universidade Estadual Paulista (Unesp)

Master's student (PPGCI-UNESP), Department of Information Science, Marília

Daniel Martínez-Ávila, Carlos III University of Madrid

PhD in Information Science, Professor in the Departamento de Biblioteconomía y Documentación.

María-Antonia Ovalle-Perandones, Carlos III University of Madrid

Contracted Professor (PhD) in the Departamento de Biblioteconomía y Documentación.

Maria Cláudia Cabrini Grácio, Carlos III University of Madrid

Professor (PhD) in the Department of Information Science.

Published

2021-02-11

How to Cite

PUERTA-DÍAZ, Mirelys; DE MIRA, Bianca Savegnago; MARTÍNEZ-ÁVILA, Daniel; OVALLE-PERANDONES, María-Antonia; GRÁCIO, Maria Cláudia Cabrini. Natural Language Processing in Information Metric Studies: an analysis of the articles indexed by the Web of Science (2000-2019). Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação, [S. l.], v. 26, p. 01–24, 2021. DOI: 10.5007/1518-2924.2021.e76886. Available at: https://periodicos.ufsc.br/index.php/eb/article/view/76886. Accessed: 22 Jan. 2025.