An Interinstitutional Center for Research and Development in Computational Linguistics
LINGUARUDO
Using stylistic features for Web pages presentation according to user search intention – an instantiation for Portuguese language
Starting Time: September 2001
Current Status: Concluded in August 2005
This project is part of a PhD program (2001-2005) that was carried
out partially (from January 2002 to December 2003) at the
Goal
To define an approach for presenting the results of IR systems that
takes into account not only the document topic but also the focus the user
expects from the results. Here, the expected focus is selected from a taxonomy
of seven general types of users' needs, from personalized users' needs, or from
traditional genres or text types.
By presenting more accurate results, we expect to offer an
alternative way of showing the results of IR systems, one that spares the user
from looking at many results that are relevant to the query topic but not
relevant to the user at the moment he/she poses the query.
To shorten the time spent finding answers, and to make the system
operation and the relations among the given results clearer.
Project's Features
The approach uses stylistic features of pages in Portuguese to
present the results of IR systems according to seven general users' needs or
to binary personalized users' needs.
Our taxonomy of users' needs is composed of seven categories, which
are based on what the user wants:
1 - A definition of something, or to learn how or why something
happens. For example, “what are the northern lights?”
2 - To learn how to do something or how something is usually done.
For example, “find a recipe for his/her favourite cake”, “learn how to make
gift boxes”, or “how to install Linux on his/her computer”.
3 - A comprehensive presentation of a given topic, such as “a
panorama of 20th century American literature”.
4 - To read news about a specific subject. For example, “what is
the current news about the situation in
5 - To find information about a person, company or
organization. For example, the user wants “to know more about his/her blind
date” or “to find the contact information of someone he/she met at a
conference”.
6 - To find a specific web page that he/she wants to visit but
whose URL he/she does not remember.
7 - To find URLs that give access to a given online
service. For example, he/she wants “to buy new clothes” or “to
download a new version of some software”.
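As an illustration only, the seven categories above could be held in a small enumeration. The member names below are hypothetical labels chosen for this sketch; the project itself does not name the categories this way.

```python
from enum import Enum


class UserNeed(Enum):
    """The seven general user needs; member names are illustrative labels."""
    DEFINITION = 1      # a definition, or how/why something happens
    HOW_TO = 2          # how to do something or how it is usually done
    OVERVIEW = 3        # a comprehensive presentation of a topic
    NEWS = 4            # news about a specific subject
    PERSON_OR_ORG = 5   # information about a person, company or organization
    SPECIFIC_PAGE = 6   # a specific page whose URL the user does not remember
    ONLINE_SERVICE = 7  # URLs that give access to an online service


# Example queries from the list above, tagged by hand for illustration.
EXAMPLES = {
    "what are the northern lights?": UserNeed.DEFINITION,
    "learn how to make gift boxes": UserNeed.HOW_TO,
    "a panorama of 20th century American literature": UserNeed.OVERVIEW,
}
```

Numbering the members 1 to 7 keeps them aligned with the list, so `UserNeed(2)` refers to the second category.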
Naturally, these seven types of users' needs do not cover every
user intention, since users may perform all kinds of unpredictable searches.
However, the same features used to generate rules and classify texts can be
used to build customized schemes for other tasks. For example, a doctor can
create a classification scheme to distinguish web pages with technical
articles about a disease from web pages that deal with the subject without
scientific rigor. The features we have studied cannot, on the other hand, be
used to distinguish among subjects, for example, to tell cardiology
technical texts apart from other medical technical texts.
We offer customized schemes to the user in a desktop web search
prototype, where the user can select examples of the text types that often
make his/her searches difficult. In the doctor's example, he/she would give
the system samples of technical and non-technical material to be used as
training material. The system would then automatically compute the features
for the given text set, train a classifier, and present the user with an
estimate of how effective the system is for his/her personal scheme. The
generated classification model would be saved as a new classification option.
In summary, we offer predefined options (genres, text types and the seven
users' needs), much like the shortcuts and tabs provided by search engines,
but we also allow the user to create his/her own shortcut for the binary
text-type tasks that he/she often performs and finds problematic.
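The workflow described above (compute features, train a binary classifier, report an estimate of its efficacy) can be sketched as follows. This is a minimal sketch under stated assumptions, not the prototype's implementation: the three features, the nearest-centroid classifier, the leave-one-out estimate and the sample texts are all chosen here for illustration, and the features are left unscaled for brevity.

```python
import math
import re
from statistics import mean


def stylistic_features(text):
    """Return a few topic-independent style measures (illustrative choice)."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return [
        n_words / max(len(sentences), 1),      # average sentence length in words
        sum(len(w) for w in words) / n_words,  # average word length in characters
        len(set(words)) / n_words,             # type-token ratio
    ]


def train(samples):
    """Build one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(stylistic_features(text))
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in by_label.items()}


def classify(model, text):
    """Return the label whose centroid is closest to the text's features."""
    vec = stylistic_features(text)
    return min(model, key=lambda label: math.dist(vec, model[label]))


def estimate_accuracy(samples):
    """Leave-one-out accuracy on the user's own examples."""
    hits = 0
    for i, (text, label) in enumerate(samples):
        model = train(samples[:i] + samples[i + 1:])
        hits += classify(model, text) == label
    return hits / len(samples)


# Invented samples standing in for the pages the doctor would provide.
samples = [
    ("The randomized controlled trial evaluated cardiovascular outcomes among "
     "hypertensive patients receiving angiotensin inhibitors over twenty four "
     "months.", "technical"),
    ("Multivariate regression analysis demonstrated statistically significant "
     "reductions in systolic pressure after adjustment for demographic "
     "covariates and baseline characteristics.", "technical"),
    ("Feeling dizzy? Drink water. Rest a bit. Call a doctor if it gets worse!",
     "non-technical"),
    ("High blood pressure is bad. Eat less salt. Walk every day. It helps a lot.",
     "non-technical"),
]

model = train(samples)                   # would be saved as a new option
efficacy = estimate_accuracy(samples)    # estimate shown to the user
```

Because none of the features depends on the vocabulary of a subject, a scheme like this can separate technical from non-technical writing but not cardiology from other medical topics, which matches the limitation stated earlier.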
Because we apply the classification schemes
at the end of the search process (in the presentation of results that have
already been generated), other systems can use the results of
this project without major modifications.
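To illustrate why classification at the presentation stage needs little from the underlying search system, here is a minimal sketch that reorders an already ranked result list. The result format (dictionaries with `url` and `snippet` keys), the function name `present`, the example URLs and the toy keyword classifier are assumptions made for this sketch, not part of the project.

```python
def present(results, classify, wanted):
    """Move results whose predicted category matches `wanted` to the front.

    `sorted` is stable, so the search engine's original order is kept
    within the matching group and within the remaining group.
    """
    return sorted(results, key=lambda r: classify(r["snippet"]) != wanted)


def toy_classify(text):
    """Stand-in for a trained stylistic classifier (hypothetical)."""
    return "how-to" if "how to" in text.lower() else "other"


# A ranked list as any search engine might return it (invented data).
results = [
    {"url": "a.example", "snippet": "History of Linux distributions"},
    {"url": "b.example", "snippet": "How to install Linux step by step"},
    {"url": "c.example", "snippet": "Linux news this week"},
    {"url": "d.example", "snippet": "How to partition a disk"},
]

reordered = present(results, toy_classify, wanted="how-to")
```

The search engine is used unchanged here: only its output list is consumed, so any classifier with the same call shape could be plugged in.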
Results
1) A review of IR from the point of view of NLP.
2) A confirmation that the use of stylistic features to classify
texts into genres and text types, as advocated and used for other languages,
also works for Portuguese.
3) The use of stylistic features to automatically categorise, in
terms of user needs, texts in Portuguese on the Web. This is, as far as
we know, the first work that tried to automatically categorise texts on the
Web in terms of user needs, for any language.
The same features can be used to create personalized classification
options, putting in the users' hands the decision of how to classify and rank
the texts. Download: PhD thesis (PDF file)
4) Yes, User!, a corpus with 1,703 texts extracted from the Web and
classified according to the taxonomy of seven user needs. See "Yes, user!:
compiling a corpus according to what the user wants" for the corpus
description. Downloads: Yes, User! corpus; Yes, User! parsed version
5) Seven binary domain-specific corpora of around 200 pages each
(100 positive and 100 negative), developed independently by different users to
evaluate the personalized classification option. Download: corpora developed by users
6) A prototype of a desktop search tool for Portuguese (Leva-e-traz). Download: Leva-e-traz (source code included)
Team
Rachel Virgínia Xavier Aires (PhD Student)
Sandra Maria Aluísio (supervisor)
Diana Santos (supervisor)
Financial Support
Fundação para Computação Científica Nacional (FCCN), through Fundação
para a Ciência e Tecnologia grant POSI/PLP/43931/2001, co-financed by POSI.
Contact
Rachel Aires: raires@icmc.usp.br
Related Publications
Aires, R. V. X. (2005) Uso de marcadores estilísticos para a busca na Web
Aires, R.; Aluísio, S;
Aires, R.,
Aires, R. & Aluísio, S. M. As avaliações atuais de sistemas de busca na Web e a
importância do usuário. In Diana Santos (ed.), Avaliação conjunta: um novo
paradigma no processamento computacional da língua portuguesa. 2005.
Santos, D.; Simões, A.; Frankenberg-Garcia, A. et al. (2004) Linguateca: Um centro
de recursos distribuído para o processamento computacional da língua
portuguesa. Proceedings of the international workshop "Taller de
Herramientas y Recursos Lingüísticos para el Español y el Portugués", IX
Iberoamerican Conference on Artificial Intelligence (IBERAMIA), November
2004, Puebla - México, p. 147-154.
Aires, R.; Manfrin, A.; Aluísio, S.; Santos, D. (2004) Which classification
algorithm works best with stylistic features of Portuguese in order to classify
web texts according to users’ needs? Written report nº 241, October 2004, ICMC/USP.
Aires, R.; Manfrin, A.; Aluísio, S.;
Aires, R. V. X.; Aluísio, S. M.; Quaresma, P.; Santos, D.; Silva, M. (2003).
An initial proposal for cooperative evaluation on information
retrieval in Portuguese. In PROPOR 2003 – 6th Workshop on Computational
Processing of the Portuguese Language, Faro - Portugal, June 2003,
p. 227-234. (c) Springer-Verlag.
Aires, R. V. X. (2003). Linguarudo – Uma arquitetura lingüisticamente motivada para recuperação de informação de textos
Aires, R. V. X.; Aluísio, S. M. (2003). Como incrementar a qualidade dos resultados das máquinas de busca: da análise de logs à interação
Aires, R. V. X.; Santos, D. (2002). Measuring the Web in Portuguese. In Euroweb 2002 conference,
Aires, R. V. X.; Aluísio, S. M. (2002). Eu falo português. E daí? Poster in IHC 2002 – 5th Symposium on Human Factors in Computer Systems,
Project evolution since September 2001
Linguateca
SINTEF Information & Communication Technology
NILC - Núcleo Interinstitucional de Lingüística Computacional
ICMC - USP