Fundamentals in natural language processing: a proposal for extracting bigrams
DOI:
https://doi.org/10.5007/1518-2924.2014v19n40p1
Keywords:
Multiword expression extraction, Measures of association statistics, Heudet
Abstract
It is widely accepted that written text is an important means of recording information, and much of this content is now available in digital form. In general, however, computers treat a text as a string of characters with no meaning. The field of Natural Language Processing (NLP) has been concerned with extracting meaning from text. This paper reviews the topic and proposes an automated method based on a deterministic heuristic, called Heudet, that aims to extract bigrams from a text. The goal is to capture the meaning of the text by identifying a set of multiword expressions (MWE). The results were better than those obtained with the statistical association measures computed by the Ngram Statistics Package (NSP) software.
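To make the comparison in the abstract concrete, the following is a minimal sketch of one statistical association measure, pointwise mutual information (PMI), applied to adjacent word pairs. It is not the Heudet heuristic, whose rules are not given on this page, and it does not use NSP itself. The function name `bigram_pmi`, the regular-expression tokenizer, and the `min_count` threshold are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative baseline only: the tokenizer and the min_count
    threshold are assumptions, not part of the paper's method.
    """
    # Naive tokenization: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Skip rare pairs, for which PMI estimates are unreliable.
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI compares the observed pair probability with the
        # probability expected if the two words were independent.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs are the strongest MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A frequency threshold is used here because PMI tends to overrate pairs that occur only once; a rule-based heuristic would select candidates on different grounds.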
References
CALZOLARI, Nicoletta et al. Towards best practice for multiword expressions in computational lexicons. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pp. 1934–1940, Las Palmas, Canary Islands, 2002.
CHEN, Jisong, YEH, Chung-Hsing, CHAU, Rowena. A multi-word term extraction system. PRICAI 2006, LNAI 4099, pp. 1160 – 1165, 2006. Springer-Verlag Berlin Heidelberg, 2006.
CINTRA, Anna Maria Marques. Elementos de linguística para estudos de indexação. Ciências de Informação, v.12, n. 1, p. 5-22, 1983.
CIPRO NETO, Pasquale; INFANTE, Ulisses. Gramática da língua portuguesa. São Paulo. Ed. Scipione, 2009. 584p.
DIAS, Gael; LOPES, José Gabriel Pereira; GUILLORÉ, Sylvie. Mutual expectation: a measure for multiword lexical unit extraction. In Proceedings of Vextal, 1999.
FARACO, Carlos Emílio; MOURA, Francisco Marto. Gramática. 7. ed. São Paulo: Ática, 1990. 487p.
EVERT, Stefan; KRENN, Brigitte. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466, 2005.
KURAMOTO, Hélio. Uma abordagem alternativa para o tratamento e a recuperação da informação textual: os sintagmas nominais. Ciência da Informação, Brasília, v. 25, n. 2, May/Aug., p. 182-196, 1995.
LADEIRA, Ana Paula. Processamento de linguagem natural: caracterização da produção científica dos pesquisadores brasileiros. 2010. 262f. Tese (Doutorado em Ciência da Informação), Escola de Ciência da Informação da UFMG, Belo Horizonte, 2010.
MAIA, Luiz Cláudio Gomes; SOUZA, Renato Rocha. Uso de sintagmas nominais na classificação automática de documentos eletrônicos. Perspectivas em Ciência Informação, Belo Horizonte, v. 15, p. 154-172, 2010.
PEARCE, Darren. A comparative evaluation of collocation extraction techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain, May, 2002. European Language Resources Association.
PECINA, Pavel. Lexical association measures and collocation extraction. Language Resources and Evaluation (LREC 2010) 44(1-2): 137-158, 2010.
PEDERSEN, Ted et al. The Ngram Statistics Package. Available at: http://www.d.umn.edu/~tpederse/nsp.html. Accessed: Aug. 2011.
PORTELA, Ricardo; MAMEDE Nuno; BAPTISTA, Jorge. Mutiword Identificação. In Terceiro Simpósio de Informáctica (INFORUM 2011), Oct. 2011, pp.
RAMISCH, Carlos. Multiword terminology extraction for domain specific documents. Dissertation – Mathématiques Appliquées, École Nationale Supérieure d’Informatiques, Grenoble, 2009.
RANCHHOD, Elisabete Marques. O lugar das expressões ‘fixas’ na gramática do Português. in Castro, I. and I. Duarte (eds.), Razão e Emoção, vol. II, Lisbon: INCM, pp. 239-254, 2003.
RAYSON, Paul; PIAO, Scott; SHAROFF, Serge; EVERT, Stefan; MOIRÓN, Begoña Villada. Multiword expressions: hard going or plain sailing? Springer Science Business Media B. V, 2009.
ROUSSINOV, Dmitri. Towards Combined Aspect Verification Model. (in press).
SAG, I. A. et al. Multiword expressions: a pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pp. 1–15, London, UK. Springer-Verlag, 2002.
SARMENTO, Luís. Simpósio Doutoral Linguateca 2006. Available at: http://www.linguateca.pt/documentos/SimposioDoutoral2005.html. Accessed: Oct. 2011.
SILVA, Joaquim Ferreira; LOPES, Gabriel Pereira. A local maxima method and fair dispersion normalization for extracting multi-word units from corpora. Sixth meeting on Mathematics of Language, pp. 369-381, 1999.
SOUZA, Renato Rocha. Uma proposta de metodologia para a escolha automática de descritores utilizando sintagmas nominais. 2005. 215f. Tese (Doutorado em Ciência da Informação), Escola de Ciências da Informação, UFMG, Belo Horizonte, 2005.
VILLAVICENCIO, Aline et al. Identificação de expressões multipalavra em domínios específicos. Linguamática, v. 2, n. 1, p. 15-33, Apr. 2010.
WANG, Lijuan; LIU, Rong. A Rapid Method to Extract Multiword Expressions with Statistic Measures and Linguistic Rules. WISM 2011, Part II, LNCS 6988, pp. 234–241, 2011.
YAGONOVA, E. V.; PIVOVAROVA, L.M. The Nature of Collocations in the Russian Language. The Experience of Automatic Extraction and Classification of the Material of News Texts. Automatic Documentation and Mathematical Linguistics, 2010, Vol. 44, No. 3, pp. 164–175. Allerton Press, Inc., 2010.
ZHANG, Wen; et al. Improving effectiveness of mutual information for substantival multiword expression extraction. Expert Systems with Applications, Elsevier, v. 36, 2009.
License
Copyright (c) 2014 Edson Marchetti da Silva, Renato Rocha Souza

This work is licensed under a Creative Commons Attribution 4.0 International License.