PorSimples: Simplification of Portuguese Texts for Digital Inclusion and Accessibility
(First FAPESP/MSRESEARCH Call for Proposals) November 2007 to April 2010
The main goal of PorSimples was to develop Natural Language Processing (NLP) technologies related to Text Adaptation (TA) to promote digital inclusion and accessibility for people with low levels of literacy. There are two general approaches to TA: Text Simplification and Text Elaboration. Text Simplification can be defined as any task that reduces the lexical or syntactic complexity of a text while trying to preserve its meaning and information; it can be subdivided into Lexical Simplification, Syntactic Simplification, Automatic Summarization, and other techniques. Text Elaboration aims at clarifying and explaining information and making connections explicit in a text, for example by providing definitions or synonyms for words known to only a few speakers of a language.
The technologies developed in PorSimples are available through three systems aimed at distinct users: (1) an authoring system, called SIMPLIFICA, which helps authors produce simplified texts targeting people with low literacy levels; (2) an assistive technology system, called FACILITA, which uses summarization and simplification to allow people with low literacy to read Web content; and (3) a Web content adaptation tool, named Educational FACILITA, which assists low-literacy readers in performing detailed reading. Educational FACILITA is designed to exhibit questions that clarify the semantic relations linking verbs to their arguments, highlight the associations among the main ideas of the texts and the named entities, and perform lexical elaboration. Currently, it explores only the NLP tasks of lexical elaboration and named entity labeling.
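As a rough illustration of lexical elaboration as described above (providing a synonym for a word that few readers are likely to know), the following minimal Python sketch annotates uncommon words with a simpler synonym in parentheses. It is not the method implemented in FACILITA or Educational FACILITA: the word list, the synonym table, and the function name `elaborate` are hypothetical stand-ins chosen only for this example.

```python
import re

# Hypothetical stand-in for a list of simple (common) words.
SIMPLE_WORDS = {"o", "a", "um", "para", "médico", "pediu", "exame", "doente"}

# Hypothetical synonym table mapping uncommon words to more familiar ones.
SYNONYMS = {"solicitou": "pediu", "enfermo": "doente"}


def elaborate(text, simple_words=SIMPLE_WORDS, synonyms=SYNONYMS):
    """Append a simpler synonym in parentheses after each uncommon word.

    A word is treated as uncommon when it is absent from `simple_words`;
    it is annotated only if `synonyms` offers a replacement for it.
    """
    def annotate(match):
        word = match.group(0)
        key = word.lower()
        if key not in simple_words and key in synonyms:
            return f"{word} ({synonyms[key]})"
        return word

    # \w+ is Unicode-aware for str patterns, so accented words stay whole.
    return re.sub(r"\w+", annotate, text)


print(elaborate("O médico solicitou um exame."))
# prints: O médico solicitou (pediu) um exame.
```

The original word is kept and the synonym is added beside it, which is what distinguishes elaboration from lexical simplification, where the uncommon word would be replaced.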
We proposed and evaluated 5 types of text adaptation methods: (i) Text Summarization, (ii) Lexical Simplification, (iii) Syntactic Simplification, (iv) Natural Simplification, and (v) Text Elaboration (for which we in fact proposed 3 methods), as well as a new model of Readability Assessment for the INAF literacy levels. They are available in the systems cited above and are reported in detail in the papers listed below.
We built parallel corpora of simplified texts in XCES format, a Dictionary of Simple Words for Portuguese (available on request), and a Manual of Syntactic Simplification for Portuguese texts.
We developed a Simplification Annotation Tool, used to create parallel corpora of simplified texts, and Coh-Metrix-Port. The latter is based on Coh-Metrix, a tool developed to compute features potentially relevant to the comprehension of English texts through a number of measures informed by linguistics, psychology, and cognitive studies. The main aspects covered by the measures are cohesion and coherence.
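To give a concrete sense of what computing text features for readability can look like, the sketch below calculates two shallow measures, average words per sentence and type-token ratio, in Python. This is only an assumption-laden illustration: it does not reproduce Coh-Metrix-Port or its cohesion and coherence measures, and the function name `surface_features` and the naive sentence splitting are choices made for this example.

```python
import re


def surface_features(text):
    """Compute a few shallow readability-related features of a text.

    Sentences are split naively on '.', '!' and '?'; words are maximal
    runs of word characters, lowercased before counting distinct types.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return {"sentences": 0, "words": 0,
                "words_per_sentence": 0.0, "type_token_ratio": 0.0}
    return {
        "sentences": len(sentences),
        "words": len(words),
        "words_per_sentence": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
    }


features = surface_features("O gato dorme. O gato come peixe.")
print(features["sentences"], features["words"], features["words_per_sentence"])
# prints: 2 7 3.5
```

Feature values of this kind could then be fed to a classifier that predicts a literacy level, although the features and model actually used in the project are described in the papers listed below rather than here.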
PorSimples IN NUMBERS
Sandra Maria Aluísio, São Carlos Institute of Mathematics and Computer Sciences / University of São Paulo (USP)
• Team: 6 researchers/students supported by MSR-FAPESP; 11 other students joined the project
• Publications: 28 papers (conferences and journals); 6 demos/posters (short papers); 12 technical reports
• Research Collaborations: 13 senior researchers from Psycholinguistics, Statistics, Natural Language Processing, and Human-Computer Interaction
• Products: 3 main systems; 6 types of text adaptation methods; 4 data resources; 3 supporting tools
List of Publications of the PorSimples Project
(2010)
1) Scarton, C. E. and Aluísio, S. M. (2010) Análise da Inteligibilidade de textos via ferramentas de Processamento de Língua Natural: adaptando as métricas do Coh-Metrix para o Português. Linguamática (online journal about natural language processing of Iberian languages, ISSN: 1647-0818), v. 2, n. 1, pp. 45-61.
2) WATANABE, W. M. ; CÂNDIDO, Arnaldo ; Amancio, M.A. ; OLIVEIRA, M. ; PARDO, T. A. S. ; FORTES, R. P. M. ; ALUÍSIO, S. M. . Adapting Web content for low-literacy readers by using lexical elaboration and named entities labeling. New Review of Hypermedia and Multimedia , v. 16, p. 303-327, 2010.
3) Gasperin, C., Maziero, E. and Aluísio, S.M. (2010) Challenging Choices for Text Simplification. In: Proceedings of PROPOR 2010, António Branco, Aldebaro Klautau, Renata Vieira, Vera Lúcia Strube de Lima (Eds.): Computational Processing of the Portuguese Language, 9th International Conference, PROPOR 2010, Porto Alegre, RS, Brazil, April 27-30, 2010. Proceedings. Springer 2010, v. 6001, p. 40-50. ISBN 978-3-642-12319-1
4) Specia, L. (2010) Translating from Complex to Simplified Sentences. In: Proceedings of PROPOR 2010, António Branco, Aldebaro Klautau, Renata Vieira, Vera Lúcia Strube de Lima (Eds.): Computational Processing of the Portuguese Language, 9th International Conference, PROPOR 2010, Porto Alegre, RS, Brazil, April 27-30, 2010. Proceedings. Springer 2010, v. 6001, p. 30-39. ISBN 978-3-642-12319-1
5) Duran, M. S.; Amancio, M. A; Aluísio, S.M. (2010a) Assigning Wh-Questions to Verbal Arguments in a Corpus of Simplified Texts. In the Proceedings of PROPOR 2010, 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, 1 CD-ROM v1. p. 1-6. ISSN: 21773580.
6) Duran, M. S.; Amancio, M. A; Aluísio, S.M. (2010b) Assigning Wh-Questions to Verbal Arguments: Annotation Tools Evaluation and Corpus Building. In the Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). (Eds) Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias. European Language Resources Association (ELRA), 1 CD-ROM v1. p. 1445-1451. ISBN 2-9517408-6-7
7) Watanabe, W. M.; Candido Jr. A.; Amancio, M. A.; Oliveira, M.; Pardo, T. A. S.; Fortes, R. P. M.; Aluísio, S. M. (2010) Adapting web content for low-literacy readers by using lexical elaboration and named entities labeling. Proceedings of the W4A-7th International Cross-Disciplinary Conference on Web Accessibility 2010, 2010, Raleigh - NC. Proc. of W4A CoLocated with the 19th International World Wide Web Conference. New York : ACM Press, 2010. v. 1. p. 1-9.
8) Aluísio, S.M. and Gasperin, C. (2010) Fostering Digital Inclusion and Accessibility: The PorSimples project for Simplification of Portuguese Texts. Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas. New York : ACL, 2010. v. 1. p. 46-53.
9) ALUÍSIO, S. M. ; Specia, L. ; GASPERIN, C. ; Scarton, C. E. . Readability Assessment for Text Simplification. In: NAACL 5th Workshop on Innovative Use of NLP for Building Educational Applications (BEA-2010), 2010, Los Angeles. Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications. New York : ACL, 2010. v. 1. p. 1-9.
10) Scarton, C. E. ; GASPERIN, C. ; ALUÍSIO, S. M. . Revisiting the Readability Assessment of Texts in Portuguese. In: IBERAMIA 2010, 2010, Bahia Blanca. Lecture Notes in Computer Science. Heidelberg : Springer, 2010. v. 6433. p. 306-315.
11) Amancio, M.A., Duran, M.S. and Aluisio, S.M. Automatic Question Categorization: a New Approach for Text Elaboration. Proceedings of the Workshop in Natural Language Processing and web-based Technologies 2010, in conjunction with IBERAMIA 2010, p. 21-30.
12) Amancio, M.A., Duran, M.S. and Aluisio, S.M. Automatic Question Categorization: a New Approach for Text Elaboration. Procesamiento del Lenguaje Natural, v. 46, p. 43-50, 2011.
13) Carolina Scarton, Matheus Oliveira, Arnaldo Candido Jr., Caroline Gasperin and Sandra Aluísio. (2010a) SIMPLIFICA: a tool for authoring simplified texts in Brazilian Portuguese guided by readability assessments. Proceedings of the NAACL HLT 2010: Demonstration Session, pages 41–44, Los Angeles, California, June 2010.
14) Carolina Scarton, Matheus de Oliveira, Arnaldo Candido Jr., Caroline Gasperin, and Sandra Maria Aluísio. (2010b) SIMPLIFICA: an authoring system targeting simplified texts in Brazilian Portuguese. In the Proceedings of PROPOR 2010, 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, 1 CD-ROM v1. ISSN: 21773580.
15) Carolina Scarton and Sandra Maria Aluísio. (2010) Coh-Metrix-Port: a readability assessment tool for texts in Brazilian Portuguese. In the Proceedings of PROPOR 2010, 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, 1 CD-ROM v1. ISSN: 21773580.
16) Fernando Muniz and Sandra Maria Aluísio. (2010) NorMan Extractor: Automatic term extraction from technical manuals. In the Proceedings of PROPOR 2010, 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities
17) Amancio, M.A., Watanabe, W. Candido Jr., A., Oliveira, M., Pardo, T.A.S., Fortes, R. P. M. and Aluísio, S.M. (2010) Educational FACILITA: helping users to understand textual content on the Web. In the Proceedings of PROPOR 2010, 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, 1 CD-ROM v1. ISSN: 21773580.
(2009)
1) Candido Jr. A., Maziero E., Gasperin, C., Pardo, T., Specia, L. and Aluisio, S. (2009). Supporting the Adaptation of Texts for Poor Literacy Readers: a Text Simplification Editor for Brazilian Portuguese. In: Proceedings of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications, pages 34–42, Boulder, Colorado, June 2009.
2) Caseli, H.M.; Pereira, T.F., Specia, L.; Pardo, T.A.S.; Gasperin, C.; Aluísio, S.M.; (2009). Building a Brazilian Portuguese parallel corpus of original and simplified texts. In Alexander Gelbukh (ed), Advances in Computational Linguistics, Research in Computer Science, vol 41, pp. 59-70. 10th Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2009), March 01–07, Mexico City.
3) Watanabe W.M., Candido Jr. A, Uzêda V., Fortes R. P. M., Pardo T. A. S., Aluisio S. M. Facilita: reading assistance for low-literacy readers. In the Proceedings of ACM SIGDOC 2009 - ACM International Conference on Design of Communication, 2009, Bloomington, IN. v. 1. p. 29 - 36.
4) WATANABE, W. M. ; FORTES, R. P. M. ; PARDO, T. A. S. ; ALUÍSIO, S. M. Facilita: auxílio à leitura de textos disponíveis na Web. In: WEBMEDIA 2009, 2009, Fortaleza. Proceedings of WEBMEDIA 2009, Fortaleza - CE. Artigos Curtos & Workshops. Porto Alegre : Sociedade Brasileira de Computação, 2009. v. 2. p. 27-30.
5) CÂNDIDO, Arnaldo ; Oliveira, M. ; ALUÍSIO, S. M. Simplifica: um Sistema Web de Autoria de Textos Simplificados. In: WEBMEDIA 2009, 2009, Fortaleza. Proceedings of WEBMEDIA 2009, Fortaleza - CE. Artigos Curtos & Workshops. Porto Alegre : Sociedade Brasileira de Computação, 2009. v. 2. p. 55-58.
6) GASPERIN, C. ; Specia, L. ; Pereira, T. F. ; ALUÍSIO, S. M. Learning When to Simplify Sentences for Natural Text Simplification. In: Encontro Nacional de Inteligência Artificial - ENIA 2009, 2009, Bento Gonçalves. XXX Congresso da Sociedade Brasileira de Computação. Porto Alegre : Sociedade Brasileira de Computação, 2009. v. 1. p. 809-818.
7) Scarton, C. E. ; Almeida, D. M. ; ALUÍSIO, S. M. Análise da Inteligibilidade de textos via ferramentas de Processamento de Língua Natural: adaptando as métricas do Coh-Metrix para o Português. In: The 7th Brazilian Symposium in Information and Human Language Technology, 2009, São Carlos. Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology, 2009. v. 1. p. 1-10.
8) Spolavori Santos, G. ; Silveira, M. S. ; ALUÍSIO, S. M. . Produção de Textos Paralelos em Língua Portuguesa e uma Interlíngua de LIBRAS. In: XXXVI Seminário Integrado de Software e Hardware, 2009, 2009, Bento Gonçalves. XXX Congresso da Sociedade Brasileira de Computação. Porto Alegre : Sociedade Brasileira de Computação, 2009. v. 1. p. 371-385.
9) GASPERIN, C. ; Maziero, E. ; Specia, L. ; PARDO, T. A. S. ; ALUÍSIO, S. M. . Natural language processing for social inclusion: a text simplification architecture for different literacy levels. In: XXXVI Seminário Integrado de Software e Hardware, 2009, 2009, Bento Gonçalves. XXX Congresso da Sociedade Brasileira de Computação. Porto Alegre : Sociedade Brasileira de Computação, 2009. v. 1. p. 387-401.
(2008)
1) ALUÍSIO, S. M. ; Specia, L. ; CASELI, H. ; PARDO, T. A. S. ; Maziero, E. ; FORTES, R. P. M. A Corpus Analysis of Simple Account Texts and the Proposal of Simplification Strategies: First Steps towards Text Simplification Systems. In: The 26th ACM International Conference on Design of Communication, 2008, Lisboa. Proceedings of The 26th ACM International Conference on Design of Communication. New York : ACM Press, 2008. v. 1. p. 15-22.
2) ALUÍSIO, S. M. ; Specia, L. ; PARDO, T. A. S. ; Maziero, E. ; FORTES, R. P. M. Towards Brazilian Portuguese Automatic Text Simplification Systems. In: The ACM Symposium on Document Engineering, 2008, São Paulo. Proceedings of the 2008 ACM symposium on Document engineering. New York : ACM Digital Library, 2008. v. 1. p. 240-248.
3) Margarido, P. ; PARDO, T. A. S. ; ANTONIO, G. ; Fuentes, V. ; AIRES, Rachel ; ALUÍSIO, S. M. ; FORTES, R. P. M. Automatic Summarization for Text Simplification: Improving Text Comprehension by Functional Illiteracy Readers. In: Workshop em Tecnologia da Informação e da Linguagem Humana, 2008, Vila Velha/ES. Anais do VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL 2008), 2008. v. 1. p. 310-315.
PRESENTATIONS:
1) Curi, M., Tanizawa, R., Aluisio, S. Avaliação da Simplificação Textual Através da Teoria de Resposta ao Item. Poster presentation at the I Congresso Brasileiro de Teoria de Resposta ao Item, December 9-11, 2009, Praia Mole Eco Village, Florianópolis. Sponsoring/funding institutions: CAED-UFJF, CESPE-UNB, INEP-MEC.
2) Sandra Aluisio & Caroline Gasperin. (2010) PorSimples: Simplification of Portuguese Texts – Fostering Digital Inclusion and Accessibility: Microsoft External Research Symposium 2010. Presented at Microsoft External Research Symposium April 6-7, 2010, Redmond, Washington. (http://research.microsoft.com/en-us/events/ersymposium2010/)
3) Willian M. Watanabe. (2010) The Resulting Systems of PorSimples: FACILITA and Educational FACILITA. Poster Presented at Microsoft External Research Symposium April 6-7, 2010, Redmond, Washington. (http://research.microsoft.com/en-us/events/ersymposium2010/).
4) Arnaldo Candido Jr. (2010) The Resulting Systems of PorSimples: SIMPLIFICA: an Authoring Tool to Simplify Brazilian Portuguese Texts. Poster Presented at Microsoft External Research Symposium April 6-7, 2010, Redmond, Washington. (http://research.microsoft.com/en-us/events/ersymposium2010/)
5) Carolina Scarton & Erick Maziero. (2010) PorSimples: Simplification of Portuguese Texts for Digital Inclusion and Accessibility. Presented at DemoFest session of the Microsoft Research Faculty Summit 2010, May 12-14, 2010, Guarujá, Brazil. (http://research.microsoft.com/en-us/events/latamfacsum2010/)
6) Sandra Aluisio. Textual Simplification and the PorSimples project. Talk presented at the First Workshop of the Interinstitutional Center of Computational Linguistics (http://www.nilc.icmc.usp.br/workshopNILC/), April 23 2009, ICMC-USP, São Carlos/SP, Brazil
7) Lucia Specia. Building a Brazilian Portuguese Parallel Corpus of Original and Simplified Texts. Poster presentation at CICLing-2009 (Conference on Intelligent Text Processing and Computational Linguistics – http://www.cicling.org/2009/), March 1-7 2009, Mexico City, Mexico
Technical Reports
1) Caseli, H.M., Pereira, T.F., Aluísio, S. M. "Editor de Anotação de Simplificação: Manual do Usuário". NILC-TR-08-10, 17 p., July 2008, São Carlos-SP.
2) Pereira, T.F.; Aluisio, S. M. "Editor de Anotação de Simplificação: Construção". Technical Report NILC-TR_08_12, 30 p., August 2008, São Carlos-SP.
3) MAZIERO, E.G., PARDO, T.A.S. (2008). Interface de acesso à base TeP 2.0. Série de Relatórios do Núcleo Interinstitucional de Lingüística Computacional (NILC-TR-08-07), 12 p., June 2008, São Carlos-SP.
4) MAZIERO, E.G., PARDO, T.A.S., ALUÍSIO, S.M. (2008). Ferramenta de análise automática de inteligibilidade de córpus (AIC). Série de Relatórios do Núcleo Interinstitucional de Lingüística Computacional (NILC-TR-08-08), 14 p., July 2008, São Carlos-SP.
5) SPECIA, L.; ALUÍSIO, S. M.; PARDO, T. A. S. (2008). Manual de Simplificação Sintática para o Português. Série de Relatórios do Núcleo Interinstitucional de Lingüística Computacional (NILC-TR-08-06), 27 p., June 2008, São Carlos-SP.
6) Watanabe W. M. and Fortes R. P. 2009. "Desenvolvimento do Facilita". Technical Report - ICMC USP library. 86 p. Available at: http://www.icmc.usp.br/~biblio/BIBLIOTECA/rel_tec/RT_343.pdf.
7) Amancio, M.A.; Aluísio, S.M. (2008). Explicitação de Entidades Mencionadas visando o aumento de Inteligibilidade de Textos em Português. Série de Relatórios do NILC. NILC-TR-08-11, August, 42 p.
8) Pereira, T.; Aluísio, S.M. (2008). Avaliação da Inteligibilidade de Textos para a Simplificação Textual. Série de Relatórios do NILC. NILC-TR-08-12, August, 31 p.
9) Tanizawa, R.; Aluisio, S.M. (2009). Avaliação em Larga Escala da Tarefa de Simplificação Textual no PorSimples. Relatório Científico Final de Iniciação Científica. 14p