LINGUARUDO - An Approach for Natural Language Information Retrieval for Portuguese

An Interinstitutional Center for Research and Development in Computational Linguistics

LINGUARUDO

Using stylistic features to present Web pages according to user search intention –

an instantiation for the Portuguese language

Starting Time: September 2001

Current Status: Concluded in August 2005

This project is part of a PhD program (2001-2005) that was carried out in part (from January 2002 to December 2003) at the Oslo node of Linguateca at SINTEF.

Goal

To define an approach to presenting the results of IR systems that takes into account not only the document topic but also the focus the user expects of the results. Here, the expected focus is selected from a taxonomy of seven general types of users' needs, from personalized users' needs, or from traditional genres or text types.

By presenting more accurate results, we expect to offer an alternative way of showing the results of IR systems that spares the user from looking through many results that are relevant to the query topic but not relevant to the user at the moment he/she poses the query. This should shorten the time spent finding answers and make the system's operation and the relations among the given results clearer.

Project's Features

The approach explores the use of stylistic features of pages in Portuguese to present the results of IR systems according to seven general users' needs or binary personalized users' needs.

Our taxonomy of users' needs is composed of seven categories, based on what the user wants:

1 - To find a definition of something or to learn how or why something happens. For example, “what are the northern lights?”

2 - To learn how to do something or how something is usually done. For example, the user wants “to find a recipe for his/her favourite cake”, “to learn how to make gift boxes”, or “to learn how to install Linux on his/her computer”.

3 - To find a comprehensive presentation of a given topic, such as “a panorama of 20th century American literature”.

4 - To read news about a specific subject. For example, “what is the current news about the situation in Israel?”

5 - To find information about a person, company or organization. For example, the user wants “to know more about his/her blind date” or “to find the contact information of someone he/she met at a conference”.

6 - To find a specific web page that he/she wants to visit but whose URL he/she does not remember.

7 - To find URLs that give access to a given online service. For example, the user wants “to buy new clothes” or “to download a new version of some software”.
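As a small illustration, the seven categories could be represented as an enumeration and used to group results for presentation. This is only a sketch: the member names and the group_by_need helper are labels invented here for illustration, not identifiers taken from the project.

```python
from enum import Enum


class UserNeed(Enum):
    """The seven general user needs; member names are illustrative assumptions."""
    DEFINITION = 1       # a definition, or how/why something happens
    HOW_TO = 2           # how to do something or how it is usually done
    OVERVIEW = 3         # a comprehensive presentation of a topic
    NEWS = 4             # news about a specific subject
    PERSON_OR_ORG = 5    # information about a person, company or organization
    SPECIFIC_PAGE = 6    # a known page whose URL the user has forgotten
    ONLINE_SERVICE = 7   # URLs giving access to an online service


def group_by_need(results):
    """Group (url, need) pairs by need, keeping the input order in each group."""
    groups = {need: [] for need in UserNeed}
    for url, need in results:
        groups[need].append(url)
    return groups
```

Grouping rather than discarding keeps every result reachable while letting the interface show first the category the user asked for.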

Obviously, these seven types of users' needs do not cover all user intentions, since users may perform all kinds of unpredictable searches. However, the same features used to generate rules and classify texts can be used to build customized schemes for other tasks. For example, a doctor can create a classification scheme to distinguish web pages containing technical articles about a disease from web pages that deal with the subject without scientific rigor. It is not possible, though, to use the features we have studied to distinguish among subjects, for example, to tell technical texts on cardiology apart from other technical medical texts.

We offer customized schemes to the user in a desktop web search prototype, where the user can select examples of the text types that often make his/her searches difficult. In the doctor's example, he/she would give the system samples of technical and non-technical material to be used as training data. The system would then automatically compute the features for the given text set, train a classifier, and present the user with an estimate of the system's efficacy for his/her personal scheme. The generated classification model would be saved as a new classification option. In summary, we offer predefined options (genres, text types and the seven users' needs), much as search engines offer shortcuts and tabs, but we also allow the user to create his/her own shortcut for the binary text-type tasks that he/she often performs and finds problematic.
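The workflow just described (compute features from the user's samples, train a binary classifier, report an efficacy estimate) can be sketched as follows. Everything in the sketch is an assumption made for illustration: the four style cues, the nearest-centroid classifier and the leave-one-out estimate are stand-ins chosen for brevity, not the features or algorithms used in the project or in the prototype.

```python
import re
from statistics import mean


def stylistic_features(text):
    """Return a few topic-independent style cues (illustrative choices only)."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return [
        mean(len(w) for w in words) if words else 0.0,  # average word length
        n_words / max(len(sentences), 1),               # words per sentence
        sum(w.isdigit() for w in words) / n_words,      # share of numeric tokens
        text.count("!") / n_words,                      # exclamation marks per word
    ]


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(column) for column in zip(*vectors)]


def train(positives, negatives):
    """Fit a nearest-centroid model from two lists of example texts."""
    return {
        True: centroid([stylistic_features(t) for t in positives]),
        False: centroid([stylistic_features(t) for t in negatives]),
    }


def predict(model, text):
    """Return True if the text is closer to the positive centroid."""
    x = stylistic_features(text)

    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return squared_distance(model[True]) <= squared_distance(model[False])


def leave_one_out_accuracy(positives, negatives):
    """Estimate accuracy by holding out each sample in turn.

    Needs at least two examples per class so that no class is left empty.
    """
    samples = [(t, True) for t in positives] + [(t, False) for t in negatives]
    hits = 0
    for i, (text, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        model = train([t for t, l in rest if l], [t for t, l in rest if not l])
        hits += predict(model, text) == label
    return hits / len(samples)
```

In the doctor's example, positives would hold the technical samples and negatives the non-technical ones; the value returned by leave_one_out_accuracy plays the role of the efficacy estimate shown to the user, and the dictionary returned by train is what would be saved as the new classification option.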

Because we apply the classification schemes at the end of the search process (to the presentation of results that have already been generated), other systems can use the results of this project without major modifications.
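Since classification happens only after retrieval, it can be added as a thin layer over the result list of any engine. A minimal sketch, assuming results arrive as (url, page_text) pairs and that matches_need is any function deciding whether a page fits the selected need; both the result shape and the function names are hypothetical:

```python
from typing import Callable, Iterable, List, Tuple

Result = Tuple[str, str]  # (url, page_text): an assumed result shape


def rerank_by_need(results: Iterable[Result],
                   matches_need: Callable[[str], bool]) -> List[str]:
    """Stable partition of an existing result list.

    Pages accepted by the classifier come first; the engine's own
    ranking is preserved inside each of the two groups.
    """
    matching, others = [], []
    for url, text in results:
        (matching if matches_need(text) else others).append(url)
    return matching + others
```

The underlying engine is never modified: it is queried as usual and only the order in which its results are shown changes.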

Results

1) A review of IR from the point of view of NLP.

2) A confirmation that the use of stylistic features to classify texts into genres and text types, as advocated and used for other languages, also works for Portuguese.

3) The use of stylistic features to automatically categorise texts in Portuguese on the Web in terms of user needs. As far as we know, this is the first work to attempt to automatically categorise texts on the Web in terms of user needs, for any language.

The same features can be used to create personalized classification options, putting in the users' hands the decision of how to classify and rank the texts. download PhD thesis pdf file

4) Yes, User!, a corpus of 1,703 texts extracted from the Web and classified according to the taxonomy of seven user needs. See "Yes, user!: compiling a corpus according to what the user wants" for a description of the corpus. download Yes, User! corpus download Yes, User! parsed version

5) Seven binary domain-specific corpora of around 200 pages each (100 positive and 100 negative), developed independently by different users to evaluate the personalized classification option. download corpora developed by users

6) A prototype of a desktop search tool for Portuguese (Leva-e-traz). download Leva-e-traz (source code included)

Team

Rachel Virgínia Xavier Aires (PhD Student)

Sandra Maria Aluísio (supervisor)

Diana Santos (supervisor)

Financial Support

Fundação para Computação Científica Nacional (FCCN) through Fundação para a Ciência e Tecnologia with the grant POSI/PLP/43931/2001 and co-financed by POSI.

Contact

Rachel Aires: raires@icmc.usp.br

Related Publications

Aires, R. V. X. (2005) Uso de marcadores estilísticos para a busca na Web em português. PhD Thesis. September, 2005. download pdf file

Aires, R.; Aluísio, S.; Santos, D. (2005) User-aware page classification in a search engine. In SIGIR Workshop on Textual Stylistics in Information Access. Salvador - Brazil, August 2005. download pdf file

Aires, R.; Santos, D.; Aluísio, S. (2005) "Yes, user!": compiling a corpus according to what the user wants. Corpus Linguistics 2005, Birmingham - UK, July 2005. download pdf file

Aires, R.; Aluísio, S. M. (2005) As avaliações atuais de sistemas de busca na Web e a importância do usuário. In Diana Santos (ed.), Avaliação conjunta: um novo paradigma no processamento computacional da língua portuguesa.

Santos, D.; Simões, A.; Frankenberg-Garcia, A. et al. (2004) Linguateca: Um centro de recursos distribuído para o processamento computacional da língua portuguesa. Proceedings of the international workshop "Taller de Herramientas y Recursos Lingüísticos para el Español y el Portugués", IX Iberoamerican Conference on Artificial Intelligence (IBERAMIA), November 2004, Puebla - México, p. 147-154. download pdf file

Aires, R.; Manfrin, A.; Aluísio, S.; Santos, D. (2004) Which classification algorithm works best with stylistic features of Portuguese in order to classify web texts according to users’ needs? Written report nº 241, October 2004, ICMC/USP. download pdf file

Aires, R.; Manfrin, A.; Aluísio, S.; Santos, D. (2004) What is my Style? Using Stylistic Features of Portuguese Web Texts to classify Web pages according to Users' Needs. To appear in Proceedings of LREC 2004. Lisbon - Portugal. download pdf file download corpus download trainingfile

Aires, R.V.X.; Aluísio, S. M.; Quaresma, P.; Santos, D.; Silva, M. (2003). An initial proposal for cooperative evaluation on information retrieval in Portuguese. In PROPOR 2003 – 6th Workshop on Computational Processing of the Portuguese Language, Faro - Portugal, June 2003, p. 227-234. (c) Springer-Verlag.

Aires, R. V. X. (2003). Linguarudo – Uma arquitetura lingüisticamente motivada para recuperação de informação de textos em português. Qualificação de Doutorado. ICMC-USP, March, 2003, 86p. download pdf file

Aires, R. V. X.; Aluísio, S. M. (2003). Como incrementar a qualidade dos resultados das máquinas de busca: da análise de logs à interação em português. Revista Ciência da Informação, vol 32, n. 1, p. 5-16, jan./abr. 2003. download pdf file

Aires, R. V. X.; Santos, D. (2002). Measuring the Web in Portuguese. In Euroweb 2002 conference, Oxford, UK, p. 198-199, December 2002. download poster download poster abstract

Aires, R. V. X.; Aluísio, S. M. (2002). Eu falo português. E daí? Poster in IHC 2002 – 5th Symposium on Human Factors in Computer Systems, Fortaleza - CE, October, 2002.

Project evolution since September 2001

Linguateca
SINTEF Information & Communication Technology
NILC - Núcleo Interinstitucional de Lingüística Computacional
ICMC - USP

Last Update: 11/08/2005