ScraperCI: a web scraper for scientific data collection

Helton Luiz dos Santos Graciano; Rogério Aparecido Sá Ramalho

doi:10.5007/1518-2924.2023.e92471

Authors

Helton Luiz dos Santos Graciano Federal University of São Carlos https://orcid.org/0000-0001-5372-7631
Rogério Aparecido Sá Ramalho Federal University of São Carlos https://orcid.org/0000-0002-8491-3514

Keywords:

Information retrieval, Web scraping, Search engines, Data management

Abstract

Objective: The technological development of recent decades has driven the massive production of informational resources and significant changes in data collection and management processes in practically all fields. The scientific field is no exception: collecting and properly processing data remains a challenge for researchers. This research presents a prototype Web scraper, called ScraperCI, and analyzes the potential of using such computational tools to collect data from databases available on the Web.

Methods: The research is applied, exploratory, and descriptive in nature, with a qualitative approach aimed at identifying the potential of Web scrapers in the data collection process.

Results: It is concluded that the developed prototype enables considerable advances in automating the collection of scientific data, and that such tools favor greater productivity in extracting informational resources from the Web.

Conclusions: It is hoped that this research will encourage information professionals to develop new skills and see innovative possibilities in their areas of professional activity, taking a leading role in this interdisciplinary environment.

Author Biographies

Helton Luiz dos Santos Graciano, Federal University of São Carlos

Master's degree in Information Science; Control and Automation Engineer

Rogério Aparecido Sá Ramalho, Federal University of São Carlos

PhD in Information Science. Professor at the Federal University of São Carlos and Coordinator of the Núcleo de Informação, Tecnologia e Inovação (ITI UFSCar)
