Tagged Text Mining in Portuguese and English

by Nuno C. Marques, Paulo Quaresma, Vitor Rocio, Luís Cavique, Gaël Dias, Yiling Yang, Luís Moniz Pereira

Keywords: Data Mining, Information Extraction from Text, Part-of-speech tagging, Semantic Web and Relational Data Base Systems.

Abstract

(English version)

The goal of this project is the development of general techniques for automatic knowledge extraction from limited-domain sets of texts (namely Portuguese texts). Relational databases and logic programming tools (in a perspective compatible with the semantic web [4]) will be used to rigorously describe each domain. We will try to develop a general method for transforming domain data written in texts into knowledge in a relational database.

This project will try to combine the synergies of two normally distinct areas: we will apply concepts from the relational database area to help the computational processing of text. An entity-relation (ER) model for domain data will allow the construction of a database describing domain knowledge. Note that most of the previously known knowledge should be directly available in the domain database. Integrity constraints in the database will allow the extracted knowledge to be validated before it is inserted. The core process in knowledge extraction will be the automatic annotation of each word with a corresponding semantic tag. Domain knowledge will be used to define the appropriate semantic tag-set, which we will call the domain tag-set. The domain tag-set should be derived from domain knowledge in a systematic (and possibly automatic) way.

Techniques for automatic tag disambiguation [1] will be a necessary step before the several entities in the text can be identified. The tag-set definition will therefore have to follow closely the attributes and entities in the ER model. This methodology has already been applied successfully in a preliminary study on postal address extraction [2]. With this project, we intend to develop a general method that is directly applicable to any other domain. The generation of the disambiguation tag-set and the training of the learning module should then be as automatic as possible. A second processing step applies relevant expression extraction techniques [3], combined with regular expression generation and robust parsing techniques, to extract knowledge and insert it automatically in the domain database. Finally, the extracted knowledge will be integrated into SINO, a textual database search engine. A Web interface will be developed for Internet access to both the extracted information and the source texts. Several case studies will be available over the Internet. In this way we hope to provide general access to the project results and also to raise industry interest in this research, paving the way for future applied research projects.

We will research an example exploring the application of text and data mining techniques to touristic data information (for first results on data mining please see [5]).

(Portuguese version, translated into English)

The goal of this project is to develop general techniques for the automatic extraction of knowledge from sets of texts (namely in Portuguese) about a given domain. To describe each domain rigorously, traditional techniques for relational database modeling will be applied, using both metadata and logic programming tools (in a perspective compatible with the semantic web [4]). We will try to develop a general method for transforming the data about a single domain contained in a given set of texts into knowledge to be inserted in a relational database.
We therefore propose to combine the synergies of two normally distinct areas: we will apply well-studied concepts from relational databases and logic programming to ease the computational processing of text. We assume the prior existence of a relational database, which may already hold partial knowledge about the domain. The relations among the several entities also constrain the information that can be inserted in the database. Knowledge extraction will thus proceed by classifying the words in the text according to a tag-set specific to each domain. This tag-set should be defined, in a systematic way, according to the modeled information of the domain under study.
Automatic tag disambiguation techniques [1] will be used to identify the several entities in the text. To this end, the tag-set definition will have to follow the domain information. This methodology has already been applied successfully to the concrete case of postal address extraction [2]; the aim is to develop a general method applicable to any other domain. The generation of the tag disambiguation system should therefore be as automatic as possible. Likewise, automatic methods for extracting patterns [3] and regular expressions, combined with robust methods for syntactic analysis, will finally make it possible to recognize the entities and relations that characterize the knowledge to be inserted automatically in the database.
The results will be integrated into SINO, a search engine for textual databases. A Web interface, available over the Internet, will be developed to allow searching the mined knowledge either through the knowledge held in the relational database or through the textual databases of the SINO engine. Finally, making several case studies available over the Internet may not only be useful to the general public but also raise interest in this research among future partners in industry and services. Concretely, an example applying data and text mining techniques to tourist information will be explored (for a first study using data mining, see [5]).

Objectives

Project

The main goal of this project is the development of a tagging system capable of text disambiguation for specific domains. The project will focus on information extraction with domain knowledge integration. For full domain knowledge integration, the annotation tag-sets should be generated from domain information. Domain information should be described by a relational database system (by means of standard entity-relation diagrams) and by logic semantic rules ([4]).

The part-of-speech tagger of [1] will be extended to treat semantically oriented tag-sets (a specific case of domain oriented tag-sets).

Finally, a fully functional system integrating extracted information and information retrieval will be available over the Internet. The text search engine SINO (adapted to Portuguese in projects PGR and ABC) will be extended to achieve this goal. Several case studies will be made available to pave the way for cooperation with industry and services in more applied projects.
 

Principal Researcher (PI)

The PI's main goal in this project is the development and generalization of the neural network part-of-speech (POS) tagger [1]. This research is part of the larger goal of studying the application of neural networks to problems involving large volumes of data.

The PI did pioneering work in developing the first POS tagger for Portuguese texts [6]. Since then, the work has continued with the application of neural network models to part-of-speech and specific domain tagging problems [1,2].

Currently, the use of neural network models for tagging is considered by some authors (e.g. [7]) to be one of the best ways to learn from text examples.

[6] A. Vilavicencio, N. Marques, et al. Part-of-Speech Tagging for Portuguese Texts. In J. Wainer and A. Carvalho (eds.), Lecture Notes in AI 991. Springer Verlag. 1995.

[7] M. Collins. Discriminative Training Methods for HMMs: Theory and Experiments with Perceptron Algorithms. In Proc. ACL on Empirical Methods in NLP. 2002.

State of the Art

Information extraction research was mainly supported by the Message Understanding Conference (MUC) series ([8]). By the last MUC, named entities could be detected with a reliability in the 90% range, while events were still in the 60% range. The MUCs have now been replaced by NIST's ACE program ([9]). The objective of ACE is to develop automatic content extraction technology to support the automatic processing of language data, namely speech and text (e.g. phone calls, radio bulletins or news texts). The ACE processing model specifies the insertion of extracted knowledge into a database. However, the open-domain goal of ACE limits domain knowledge integration. Until now, information extraction research has focused mainly on the recognition of entities such as Location, Organization or Person, but it is still not possible to relate a given location to a precise geographic place (as is done in e.g. [2]); that is, domain knowledge integration is not achieved (as needed, e.g., for [4]).

Corpus-based methods and machine learning provide an excellent way of developing robust and efficient NLP systems (e.g. [10,11]). An example of these techniques is part-of-speech tagging (e.g. [1, 6, 7, 12]). However, the treatment of unrestricted text and corpora requires attention to engineering aspects. For example, text has to be split into simple tokens and sentences have to be segmented. Unix text processing tools and related scripting languages are currently taught by the PI in [13] and have provided a simple way to manipulate huge volumes of text. On their own, however, these tools offer only limited compatibility. The open-source GATE architecture [14], which comprises a Java API and the XML-enabled component collection CREOLE, includes tokenization, sentence splitting, gazetteers (containing lists such as cities, organizations or days of the week) and semantic tagging. Integrating previously developed work into this Java/XML environment could be a simple way of both extending functionality and sharing the results of the systems proposed here. For example, KIM [15] is an open-domain semantic annotation platform based on the GATE architecture; it includes the KIMO ontology of concepts. Multipurpose XML annotation schemas allowing the mapping of relations between concepts at several levels (e.g. [16]) will be studied in project task T1.

This proposal draws on the knowledge and experience of the project team in corpus-based natural language research (e.g. [20,21]), and particularly on its extensive knowledge of lexical database resources and part-of-speech problems (e.g. [1,2,3,17]). [17] presents a first work on the use of neural networks for part-of-speech tagging. [1] shows that the neural network POS tagger, when properly backed up by a lexical database, can learn from only a very limited number of training examples. Also, an independent author [7] recently argued that neural-network based methods give better results than state-of-the-art maximum likelihood methods for learning natural language tags. [18] shows that part-of-speech taggers constitute an excellent source for word-sense disambiguation. In our research, semantic tagging ([19]) will be seen as an immediate extension of POS tagging.

Extracted information must be linked to the source information in the text. SINO is a specialized text search engine that has been used in many projects. In the context of AdI projects [20], SINO has already been used as a base tool for implementing web information retrieval systems. This project will show the advantages of extending information retrieval systems into web information retrieval and extraction systems.

[8] http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
[9] http://www.nist.gov/speech/history/index.htm
[10] Church, K., Mercer, R. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics (CL) 19(1). MIT Press. ISSN 0891-2017.
[11] Ratnaparkhi, A. (1999). Learning to Parse Natural Language with Maximum Entropy Models. Machine Learning 34(1-3). Kluwer Academic Publishers. ISSN 0885-6125.
[12] Tufis, D., Dragomirescu, L. (2004). Tiered Tagging Revisited. In Proceedings of the IV LREC. Volume I. pp. 39-42. Lisboa, Portugal, May 2004.
[13] MSc courses at http://di.fct.unl.pt, namely http://www.di.fct.unl.pt/mei0304 .
[14] http://gate.ac.uk/
[15] http://www.ontotext.com/kim/index.html
[16] Buitelaar, P. et al. A Multi-Layered, XML-Based Approach to the Integration of Linguistic and Semantic Annotations. EACL 2003 3rd NLPXML.
[17] Marques, N. C., Lopes, J. (1996). Using Neural Nets for Portuguese Part-of-Speech Tagging. In Proceedings of the Fifth CSNLP.
[18] Stevenson, M., Wilks, Y. (2001). The interaction of knowledge sources in word sense disambiguation. CL 27(3).
[19] Ide, N., Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. CL 24(1).
[20] Projects PGR LO59-P31B-02/97 and ABC/AdI/2002
[21] Projects PRAXIS 2/2.1/CSH/778/95 or PBIC/C/TIT/1226/92
 

Results and Repercussions
 

The main channels for disseminating the project's scientific results will be a pilot system available on a Web page for general public use, presentations at international conferences, publication of scientific articles in peer-reviewed literature, and the organization of seminars and workshops. A Web page will be maintained to publicize the project objectives and the results of interest to the general public.

The resulting models and supporting code will be publicly released as soon as possible. Their use by researchers and technicians involved in the semantic web, NLP, text mining and data mining will be encouraged. Relevant project results will be taught in MSc classes at [13]. This will also allow interested, talented students to develop their MSc work in an area related to this project.

As soon as it becomes possible, a first Web interface to the pilot system will be available. Free documentation texts and extracted knowledge from selected case studies will be made available to the general public.

Repercussions (description)

The most natural form of storing information is text. So it is believed that text mining could have a much higher commercial impact than data mining.

We intend to select tourist and cultural information as the domain test bed (previous contacts exist with [22]). As an illustrative example, consider the domain of Pousadas de Portugal (studied in [5]), a state-owned network of hotel resorts located in buildings of historical interest (like Pousada Rainha Sta. Isabel, situated in a 13th-century castle in the middle of Portugal) or in particularly pleasant places across Portugal.

Besides the obviously highly structured information that could be inserted in a database (e.g. for all the Pousadas: location, main characteristics, nearby places of interest or main food menus), a huge amount of related text is also available, both in Portuguese and in English: specific tourist guides and bulletins are regularly issued and available on the Internet (on both Portuguese and English sites), and the main tourist events in sport or culture are available from the LUSA news agency (as can be seen in the LUSA corpus [21]).

Domain-specific text extraction could be used to update the more static information with up-to-date information, whether from the general news, from a new issue of a specific tourist guide or just from the daily text menu of a Pousada restaurant. It would then be possible for a registered user to receive an SMS or email alert saying that the user's favorite food is available in the Manteigas Pousada, that there is a classical music concert nearby (which can be computed based on [2]) and that, according to Jornal da Covilhã, the roads to the ski resort in Serra da Estrela are open. Indeed, the system could have detected automatically (using data mining techniques as in [5]) a strong association between stays in the Pousadas, the fact that the ski resort was always open during the user's previous stays in Manteigas, the favorite meals the user always eats in all the Pousadas and the fact that the user always eats in Pousada de Queluz during the concert season.

We think this example is illustrative of the main repercussions this project could have on improving state-of-the-art scientific tools for areas that could be decisive for Portugal. The integration of synergies with other CENTRIA ongoing research, such as the one performed by the FP6 NoE REWERSE (Reasoning on the Web with Rules and Semantics), headed in Portugal by one of the team members [4, 23], should also be possible. Portuguese research into new areas of knowledge can then be directly applied to the services and industry in Portugal.

[22] CESEM (http://www.fcsh.unl.pt/cesem/ ), namely regarding the system music query (e.g. see project “Investigação, Edição e Estudos Críticos de Música Portuguesa dos Séculos XVIII a XX”).
[23] L. M. Pereira, REWERSE - Reasoning on the Web with Rules and Semantics, Computational Logic Newsletter 5:8-9, December 2003.

 

Task List

T1 - Domain Tag Annotation Based on Knowledge Models

Task description

The goal of this task is to define a semi-automatic mapping capable of building a domain-specific tag-set from domain knowledge and standard natural language analysis tools. We will start by creating resources for selected case studies in tagged text mining. Then a general framework for representing and handling information from different domains should be created. Different annotation schemas should be studied in order to allow the integration of the several annotation levels involved in this project. Particular care must be taken to provide tools capable of mapping between the human-annotated information for each domain and the specific tags of the different automatic processing levels. Given the rich set of tools already available (e.g. the GATE project [14]), a careful review must be made of available systems capable of information extraction (namely from the MUC conferences [8] and recent results in the ACE project [9]). All annotation will use the XML standard so that it can easily be ported among different tools.

As input to the work described here, we assume a set of texts over well-defined domains. Each domain should be represented by a relational database with a corresponding ER model, taking into account both the structure of the knowledge we want to model (i.e. the textual information to mine) and useful pre-inserted domain background knowledge.
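As a rough illustration of how a domain database with integrity constraints could reject badly extracted values before insertion, here is a minimal sketch using Python's standard sqlite3 module. The table names, columns and the NNNN-NNN postal code pattern are hypothetical choices made for this example only; they are not taken from the project's actual ER models.

```python
import sqlite3


def make_domain_db():
    """Build a tiny in-memory domain database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce REFERENCES clauses
    conn.execute("CREATE TABLE city (name TEXT PRIMARY KEY)")
    conn.execute(
        """CREATE TABLE address (
               street TEXT NOT NULL,
               postal_code TEXT NOT NULL
                   CHECK (postal_code GLOB
                          '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9]'),
               city TEXT NOT NULL REFERENCES city(name))"""
    )
    # Pre-inserted background knowledge: the cities already known.
    conn.execute("INSERT INTO city VALUES ('Caparica')")
    return conn


def try_insert(conn, street, postal_code, city):
    """Insert an extracted address; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO address VALUES (?, ?, ?)",
                (street, postal_code, city),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_domain_db()
print(try_insert(conn, "Rua Nova", "2829-516", "Caparica"))  # accepted
print(try_insert(conn, "Rua Nova", "28295", "Caparica"))     # bad postal code
print(try_insert(conn, "Rua Nova", "2829-516", "Atlantis"))  # unknown city
```

In this sketch the CHECK constraint plays the role of an attribute-level restriction, while the foreign key ties extracted values to background knowledge already in the database.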

The standard levels of natural language processing should be considered as aids to human annotation. However, only domain-specific information should be manually annotated. The output of the other processing modules should therefore be integrated into the annotation in a way that is as transparent to the user as possible. Namely, standard morphological analysis (we can use standard Unix text tools or a GATE module [14]), POS tagging (we will adapt and develop the system presented in [1]) and rule extractors (Unix regular expressions will be used as the base level) must be integrated by means of XML annotations. A set of tools, as domain independent as possible, should be developed or adapted to help the domain annotator.

Based on previous work, we will select texts from three case studies as domain examples. The first case study is the postal address problem ([2]). The second will be email announcements of scientific conferences (i.e. "calls for papers"; this problem is currently being studied). The third and more difficult problem will be the tourist and cultural example described in sec. 8.4. We will model these domains and collect selected texts from them. We will then manually annotate these texts with domain-specific knowledge. The annotation scheme should allow a precise mapping between the annotations and the fields in the domain relational database. To that end, a set of domain-specific tags, based on the entities and attributes in the database combined with the automatic analysis of the text, should be proposed in a semi-automatic way. This domain-specific tag-set should be identified for each domain ER model.
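The idea of deriving a domain tag-set from the entities and attributes of an ER model can be sketched as follows. This is only an illustration under simple assumptions: the `Entity` class, the `ENTITY.ATTRIBUTE` naming convention and the extra "O" (outside) tag are hypothetical and are not specified in the proposal.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A minimal stand-in for one entity of an ER model."""
    name: str
    attributes: list


def derive_tag_set(entities, outside_tag="O"):
    """Return one tag per (entity, attribute) pair, plus an 'outside' tag
    for tokens that carry no domain information."""
    tags = [outside_tag]
    for entity in entities:
        for attribute in entity.attributes:
            tags.append(f"{entity.name.upper()}.{attribute.upper()}")
    return tags


address = Entity("Address", ["street", "number", "postal_code", "city"])
print(derive_tag_set([address]))
```

A real mapping would presumably also use the automatic analysis of the text mentioned above, for instance to merge or split tags that the tagger cannot reliably tell apart.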

Expected Results

A general DTD schema should be proposed and used for domain annotation in the selected domains: postal address detection in institutional web pages, conference data in emails with calls for scientific publications (i.e. call-for-papers announcements), and tourist and cultural information (please see sec. 8.4).

The resulting DTD schema will be used to provide domain-specific annotations in the text. These annotations will be complemented automatically by the use of standard natural language tools. Unlike standard resources, where all tokens are treated and identified, this work will focus only on the relevant main tokens. Over these relevant main tokens the system will automatically apply a tokenizer (to try to divide the main tokens), a standard part-of-speech tagger (to assign part-of-speech tags to all the relevant words) and regular expressions (to try to split the text input into meaningful fields for the database). All this annotation will be done automatically, and the several possible alternatives will be marked.
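The automatic pass described above (tokenize, then mark every candidate field a token could fill) might look roughly like the sketch below. The token pattern, the field names and the patterns themselves are illustrative assumptions, including the NNNN-NNN postal code shape; the part-of-speech tagging step is left out.

```python
import re

# Keep postal codes together, otherwise split into words and punctuation.
TOKEN_RE = re.compile(r"\d{4}-\d{3}|\w+|[^\w\s]")

# Hypothetical field patterns; a token may match several of them.
FIELD_PATTERNS = {
    "postal_code": re.compile(r"\d{4}-\d{3}"),
    "number": re.compile(r"\d+"),
    "year": re.compile(r"(?:19|20)\d{2}"),
    "word": re.compile(r"[^\W\d_]+"),
}


def tokenize(text):
    """Split raw text into simple tokens."""
    return TOKEN_RE.findall(text)


def candidate_fields(token):
    """List every field whose pattern matches the whole token."""
    return [name for name, pattern in FIELD_PATTERNS.items()
            if pattern.fullmatch(token)]


def annotate(text):
    """Pair each token with all of its candidate fields (possibly none)."""
    return [(token, candidate_fields(token)) for token in tokenize(text)]


for token, fields in annotate("Rua Nova 12, 2829-516 Caparica"):
    print(token, fields)
```

Tokens with more than one candidate, such as "1998" (both a number and a year here), are exactly the ambiguous cases that a later disambiguation step would have to resolve.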

Wherever possible, annotated texts and extracted knowledge will be made available on the project web page. Limited annotation functionality should also be available to ordinary users, since by annotating text the user inserts more information in the database. So, if the case studies are useful and attract enough users, we also expect a reasonable increase in resources from making annotation tools available to the general public. All annotation efforts on publicly available texts will be available for general use. In order to bootstrap this process, and to guarantee enough tagged text for the following tasks, a human annotator will be contracted for text annotation after a first hand-annotation effort by the researchers involved. This annotation will be done essentially in the tourist domain (this will be supported by budget item "Acquisition of services and maintenance").

Two papers will be submitted to peer reviewed conferences.

A Web-based text annotation tool and modules for open-source systems developed by others (e.g. the GATE architecture [14]) will be developed and made available through the project web page.
 

T2 - Domain Tag Disambiguation

Expected results

The part-of-speech tagger developed in [1] will be extended to handle specific domain tag-sets. The domain tagger will generalize the base POS tagger by integrating more context, using a hidden tag-set and using a richer set of features. The use of XML annotation will help the integration of this system with other systems.

A public release of the neural tagger will be made available on the project web page. This tool will be presented to students in [13] and will be described in submitted papers. This is part of the effort to disseminate project results in the NLP and text mining research communities. Since a GNU GPL license will be used, collaboration with other researchers in the field will be possible.

Two papers will be submitted to peer-reviewed conference proceedings: one on domain tagging and POS-Tagging, one on extending the POS-Tagger context and on using better features for domain-tag tagging.
One scientific article will be submitted to a peer-reviewed journal about POS tagging. Based on work done also in T1, a master thesis is possible.

A final system capable of domain tag annotation, with a generic web interface, will be available upon completion of this task. This tool will be a first step towards the development of applied projects in specific domain areas. The use of Portuguese texts will also contribute to increasing the knowledge and processing power available for the computational treatment of Portuguese.
 

Task description

The main goal of this task will be the development of a general-purpose domain tag disambiguator. This system will be based on the neural network part-of-speech tagger described in [1]. Indeed, previous results support our claim that the neural network tagger is an excellent candidate for a domain-specific tagger ([1,7]): the POS tagger's very small requirements in terms of learning data make it useful for specialized domains with little training data. Also, the capacity to use databases of previous knowledge will be a key issue for domain knowledge integration. Finally, because specialized domains can be easily formalized by their metadata (contained in the database description) and by specialized semantic rules, specialized domain tag-sets will be constructed.

This system has already been successfully adapted to tagging in a specific domain in [2]. Due to the excellent learning characteristics of neural-network taggers ([7]), particularly when extended with background domain knowledge ([1]), this tagger should be able to learn from very small amounts of training data. So this task will use the human-annotated texts from T1 to learn to assign disambiguating probabilities to ambiguous annotations in text. The specific set of features and tags to be used by the system should be learnt in a fully automatic way, by using the concept of surface and hidden tag-sets ([24]) and by automatically learning the best neural network feature vectors through evolutionary programming ([25]).

In a first step, the automatically annotated input resulting from T1 will be converted into feature vectors by using a set of indicator functions (as defined, e.g., in [7]). Indicator functions allow a generalization of the context and features to use. The previous experience of using large lexicons as backup lexical knowledge for a part-of-speech tagger should be extended to access all the knowledge available in the database when building indicator functions.
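To make the notion of indicator functions concrete, the sketch below turns one token position into a binary feature vector. The four indicators and the use of a plain Python set as the "lexicon" are assumptions chosen for illustration; in the project the lexical lookup would go against the domain database.

```python
def make_indicators(lexicon):
    """Build a list of indicator functions f(tokens, i) -> bool.
    `lexicon` stands in for knowledge already stored in the database."""

    def is_capitalized(tokens, i):
        return tokens[i][:1].isupper()

    def is_numeric(tokens, i):
        return tokens[i].isdigit()

    def in_lexicon(tokens, i):
        return tokens[i].lower() in lexicon

    def prev_is_capitalized(tokens, i):
        # Context feature: looks at the token to the left, if any.
        return i > 0 and tokens[i - 1][:1].isupper()

    return [is_capitalized, is_numeric, in_lexicon, prev_is_capitalized]


def feature_vector(tokens, i, indicators):
    """Evaluate every indicator at position i and return a 0/1 vector."""
    return [1 if indicator(tokens, i) else 0 for indicator in indicators]


indicators = make_indicators(lexicon={"manteigas"})
tokens = ["Pousada", "de", "Manteigas"]
print(feature_vector(tokens, 2, indicators))  # [1, 0, 1, 0]
```

Because each indicator only receives the token list and a position, new context or database-backed features can be added without changing `feature_vector`.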

Careful literature review and evaluation of good indicator functions should be made regarding this task. All the indicator functions relevant for the task of domain tagging in any domain should be used. These input vectors can be acquired for any text in any domain.

The second step will be the training of the neural network. This step must use the hand-annotated text of T1 to generalize the patterns contained in the input vectors. However, particular care must be taken in learning to classify with the right tag-set. The method presented in [24], in which a hidden tag-set is automatically generated, will be researched in this task. Finally, neural networks trained using evolutionary programming [25] will be used.
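As a deliberately simplified stand-in for this training step, the sketch below trains a single-layer multiclass perceptron on binary feature vectors, in the spirit of the perceptron methods discussed in [7]. It is not the neural tagger of [1], it uses no hidden tag-set and no evolutionary programming, and the tags and toy examples are invented for illustration.

```python
def predict(weights, x):
    """Return the tag whose weight vector scores x highest."""
    return max(weights,
               key=lambda tag: sum(w * v for w, v in zip(weights[tag], x)))


def train_perceptron(examples, tags, n_features, epochs=10):
    """Mistake-driven training: on each error, move the gold tag's weights
    towards x and the wrongly predicted tag's weights away from x."""
    weights = {tag: [0.0] * n_features for tag in tags}
    for _ in range(epochs):
        for x, gold in examples:
            guess = predict(weights, x)
            if guess != gold:
                for j, value in enumerate(x):
                    weights[gold][j] += value
                    weights[guess][j] -= value
    return weights


# Toy data: [is_capitalized, is_numeric, bias]
examples = [([1, 0, 1], "CITY"), ([0, 1, 1], "NUMBER")]
weights = train_perceptron(examples, ["CITY", "NUMBER"], n_features=3)
print(predict(weights, [1, 0, 1]), predict(weights, [0, 1, 1]))
```

The feature vectors here have the same shape as those produced by indicator functions, which is what would let hand-annotated text from T1 be used directly as training examples.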

[24] Tufis, D., Dragomirescu, L. (2004). Tiered Tagging Revisited. In Proc. 4th Int. Conf. LREC. Volume I. pp. 39-42. Lisboa, Portugal.
[25] Marco Castellanni. (2003). Annual Activity Report. Centria Technical Report.
 

T3 - Non-Contiguous Regular Expressions for Knowledge Mining and Recovery

Expected results

The following results are expected after conclusion of this task:

* A software tool capable of converting lists of multiword units, annotated either by human or machine with domain specific tags, into general rules. These general rules could then be used to find relevant patterns to be inserted into the database from automatically annotated input texts.


* A software tool capable of helping the human knowledge expert to annotate unexpected or strange relevant patterns in text. This tool will be based on the fact that algorithms for the automatic extraction of regular expressions (e.g. [27]) allow us to find unexpected patterns.

* 2 peer reviewed conference papers describing software tools in previous points.

Task description

The main goal of this task will be the development of a general model for parsing the domain tags and other annotations. Available lists of entities (either domain-dependent knowledge previously inserted into the database or human-annotated data) are, however, distinct from the patterns available in text.

That is why automatic regular expression extractors should be applied to text. Indeed, the extraction and identification of multiword units (MWUs) has shown successful results in applications that need a certain degree of semantics ([28]). For that purpose a system called HELAS (Hybrid Extraction of Lexical Associations) [3,29] was developed. HELAS has been designed and developed around the idea of total flexibility. Based exclusively on a new probabilistic measure (the combined Association Measure) and a new acquisition process (the GenLocalMaxs algorithm), HELAS detects multiword lexical units by processing only once a tagged corpus of any language, any domain or any type, without changing it in any form. MWUs include a large range of linguistic phenomena, such as compound nouns, phrasal verbs, adverbial locutions, compound determiners, prepositional locutions and institutionalized phrases. MWUs are frequently used in everyday language, usually to express precisely ideas and concepts that cannot be compressed into a single word. As a consequence, their identification is a crucial issue for applications that require some degree of semantic processing.

HELAS only outputs a list of unstructured sequences of words. So, in order to recognize them in new input texts, these lists should be converted into regular expressions. Models based on regular expressions, finite automata and grammars have been used, in particular, for natural language syntactic analysis [26]. However, these models are built manually from patterns that we expect to find in texts. Algorithms for the automatic extraction of regular expressions [27], by contrast, make it possible to find expected as well as unexpected patterns, thus contributing to a better knowledge of the analyzed data and, as a consequence, to its optional annotation by the knowledge expert.

Automatically extracted regular expressions over annotated data could be used directly to convert domain tagged raw text into fields that may directly be inserted into the database.
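One simple way to turn a list of multiword units into a recognizer for new texts is to compile the list into a single alternation, as in the sketch below. This only illustrates the list-to-regular-expression conversion using Python's standard `re` module; it is not the HELAS system or the algorithm of [27], it does not handle non-contiguous units, and the function name and sample MWUs are made up for the example.

```python
import re


def mwus_to_regex(mwus):
    """Compile a list of multiword units into one case-insensitive pattern.
    Longer units come first so that they win over their own prefixes, and
    any run of whitespace is accepted between words."""
    alternatives = []
    for mwu in sorted(mwus, key=len, reverse=True):
        words = [re.escape(word) for word in mwu.split()]
        alternatives.append(r"\s+".join(words))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b",
                      re.IGNORECASE)


pattern = mwus_to_regex(["Serra da Estrela", "Pousada de Queluz", "Serra"])
text = "A stay near the serra da  Estrela or at Pousada de Queluz"
print(pattern.findall(text))
```

Each match could then be mapped to the database field associated with the unit's domain tag; that mapping is left out of this sketch.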

[26] V. Rocio, E. de la Clergerie and J.G.P.Lopes. 2001. Tabulation for multi-purpose partial parsing. Grammars. 4(1): 41-65. Kluwer Academic Publishers.
[27] Savchenko. Regular Expression Mining and its Information Quality Applications. In: Craig Fisher and Bruce N. Davison, eds., Proceedings of the Seventh International Conference on Information Quality (IQ 2002), MIT, pp. 177-186.
[28] O. Vechtomova, M. Karamuftuoglu. (2004). Use of Noun Phrases in Interactive Search Refinement. MEMURA Workshop of the 4th LREC. Dias, G., Lopes, J.G.L. & Vintar, S. (eds). Lisbon, Portugal. ISBN: 2-9517408-1-6.
[29] http://helas.di.ubi.pt

T4 - Adding a general purpose interface for Tagged Text Mining to SINO

Expected results

As a result of this task, a new version of the text search engine SINO will be produced that can index additional information associated with each word. This new version will use the tags created in task T2 to reduce the complexity of the relevant-expression extraction task (reducing the input size and improving access to the documents).

SINO's connection with a relational database engine (e.g. PostgreSQL) will be reinforced by providing primitives for handling extracted information and its association with the base texts.

Since the final result of this task will be a web interface to all the research developed in this project, and to some previous research work already completed (namely [5] and [31]), a greater effort on software engineering is needed in this task (this will be supported by budget item “Acquisition of services and maintenance”).

A web interface illustrating the capabilities of the full system in selected case studies will be publicly available through Internet. Selected case studies will be used for illustrating the possible direct application of this research results in services and industry. Namely, a case study on tourist information will be made available to general public. We intend to select as a domain test bed the area of tourist information [5] and culture (previous contacts exist with [22]). After project completion, a pilot system, integrating access to database knowledge by web forms, email alarm systems, discussion forums and general news will be ready (please see section 8.4 for an illustrative example).

The success of this research can be checked directly by how popular the available information becomes. If success is achieved, this will be one more argument in favour of a possible future R&D project with an industry partner (e.g. through [30]). From the research point of view, tourist and cultural information also form a very coherent field: correlations exist between several different areas such as culture, sports (based on previous experience with the LUSA corpus [21]) and place history (information available at the Universidade Aberta partner).

Four papers will be submitted to peer-reviewed conferences as a result of this work. A final journal paper with the results of the full project will be submitted to a peer-reviewed journal after completion of this task. One master's thesis is expected after the completion of this task.

[30] – http://www.adi.pt
 

Task Description

This task aims to improve the existing text search engine SINO so that it takes into account database information, domain tags and other kinds of annotation associated with the words in sentences. A final project pilot Web system with a full range of data mining capabilities will be developed as a result of this integration.

SINO is a search engine used in several research projects, such as AustLII (Australasian Legal Information Institute). In Portugal it has been used in [20]. Although it has been specialized for the Portuguese language, SINO does not take into account tags with information (part-of-speech, semantic, etc.) associated with each word. In this task, SINO will be extended so that words can be indexed together with their associated information and this information can be queried. It will be possible to query the text base for sentences (or words) where specific tags appear. This new feature will be used in the relevant expression extraction task to reduce the space and time complexity of the extraction process.
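The kind of tag-aware indexing described above can be illustrated with a minimal sketch. It is not SINO's actual data structure or API: the class `TaggedIndex`, its method names and the tags used (`N`, `STREET`, `CITY`) are hypothetical and chosen only for illustration. The sketch keeps one posting set per word, per tag and per (word, tag) pair, so that sentences containing specific tags can be retrieved without scanning the texts.

```python
from collections import defaultdict


class TaggedIndex:
    """Toy inverted index over tagged sentences (hypothetical, not SINO's API)."""

    def __init__(self):
        # Each posting is a (doc_id, sent_id) pair identifying one sentence.
        self.by_word = defaultdict(set)   # word -> postings
        self.by_tag = defaultdict(set)    # tag -> postings
        self.by_pair = defaultdict(set)   # (word, tag) -> postings

    def add_sentence(self, doc_id, sent_id, tagged_tokens):
        """Index one sentence given as a list of (word, tag) pairs."""
        key = (doc_id, sent_id)
        for word, tag in tagged_tokens:
            w = word.lower()
            self.by_word[w].add(key)
            self.by_tag[tag].add(key)
            self.by_pair[(w, tag)].add(key)

    def sentences_with_tags(self, *tags):
        """Return the sentences in which every given tag occurs at least once."""
        if not tags:
            return set()
        postings = [self.by_tag.get(t, set()) for t in tags]
        return set.intersection(*postings)

    def sentences_with(self, word, tag):
        """Return the sentences where the word occurs with the given tag."""
        return set(self.by_pair.get((word.lower(), tag), set()))


index = TaggedIndex()
index.add_sentence("d1", 0, [("Hotel", "N"), ("Avenida", "STREET"), ("Lisboa", "CITY")])
index.add_sentence("d1", 1, [("Open", "ADJ"), ("daily", "ADV")])
index.add_sentence("d2", 0, [("Museu", "N"), ("Porto", "CITY")])

city_sentences = index.sentences_with_tags("CITY")
address_sentences = index.sentences_with_tags("CITY", "STREET")
```

In this toy example only the sentences carrying a `CITY` tag would be passed on to a later extraction step, which is the sense in which tag queries can reduce the input size of that step.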

Integration with specific domain knowledge will also be possible, by building general modules to search and manage specialized information in the database. Automatic code generation modules based on the domain knowledge will also be researched in this task. In particular, the work in [5] will be used to mine relevant association rules over all information types available in each domain database. The mined knowledge will then be available as another entity in the domain database. The intended pilot system should also adapt itself to the user, based on knowledge mined from access logs [31].
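As a reminder of what an association rule over database records looks like, the following minimal sketch computes support and confidence for rules between pairs of items. It is a generic textbook computation, not the method of [5] nor the clustering algorithm of [31]; the function name `pair_rules`, the thresholds and the example items are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return rules (x, y, support, confidence) meaning "x implies y".

    support    = fraction of transactions containing both x and y
    confidence = fraction of transactions containing x that also contain y
    """
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical records: topics consulted together in one user session.
sessions = [
    {"museum", "concert"},
    {"museum", "concert", "football"},
    {"museum"},
    {"football"},
]
rules = pair_rules(sessions)
```

Each resulting tuple could be stored as a row of a rules table, in line with the idea of making mined knowledge available as another entity in the domain database.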

SINO integration will be a key issue for using this system. The domain knowledge should be made available by combining two distinct functionalities: the information retrieval capabilities of SINO and a standard relational database engine (e.g. the open-source PostgreSQL database) containing the previously available information, the information extracted by the system and the data mining results. SINO will be used both for information retrieval and to associate extracted information with the texts from which it was extracted.
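The association between extracted information and its source texts can be sketched with a small relational schema. The example below uses Python's built-in sqlite3 module purely as a stand-in for a PostgreSQL server; the table names, column names and helper functions are hypothetical, and a real domain schema would be derived from the domain ER model. A foreign key ties every extracted fact to the sentence it came from, so a fact without a known source is rejected by the database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE source_text (
    doc_id  TEXT    NOT NULL,
    sent_id INTEGER NOT NULL,
    content TEXT    NOT NULL,
    PRIMARY KEY (doc_id, sent_id)
);
CREATE TABLE extracted_fact (
    fact_id   INTEGER PRIMARY KEY,
    entity    TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL CHECK (length(value) > 0),
    doc_id    TEXT NOT NULL,
    sent_id   INTEGER NOT NULL,
    FOREIGN KEY (doc_id, sent_id) REFERENCES source_text (doc_id, sent_id)
);
""")


def insert_fact(conn, entity, attribute, value, doc_id, sent_id):
    """Insert a fact; return False if an integrity constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO extracted_fact (entity, attribute, value, doc_id, sent_id) "
                "VALUES (?, ?, ?, ?, ?)",
                (entity, attribute, value, doc_id, sent_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def provenance(conn, entity):
    """Return (attribute, value, source sentence) rows for one entity."""
    return conn.execute(
        "SELECT f.attribute, f.value, s.content "
        "FROM extracted_fact AS f "
        "JOIN source_text AS s ON s.doc_id = f.doc_id AND s.sent_id = f.sent_id "
        "WHERE f.entity = ? ORDER BY f.fact_id",
        (entity,),
    ).fetchall()


with conn:
    conn.execute("INSERT INTO source_text VALUES (?, ?, ?)", ("d1", 0, "Hotel Avenida, Lisboa."))

accepted = insert_fact(conn, "hotel", "city", "Lisboa", "d1", 0)
rejected = insert_fact(conn, "hotel", "city", "Porto", "d9", 3)  # no such source sentence
```

In a full system the `(doc_id, sent_id)` pair would correspond to whatever document and sentence identifiers the text search engine exposes, which is an assumption of this sketch.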

[31] Yang, Y., Guan, X., You, J. (2002). CLOPE: A fast and effective clustering algorithm for transactional data. ACM SIGKDD '02, July 23-26, Edmonton, Alberta, Canada. (student scholarship award paper).

 

 

Project Main References

Year Publication
2001 [1] Marques, N.C. and Lopes, J.G. (2001) Tagging With Small Training Corpora. In F. Hoffmann, D. Hand, N. Adams, D. Fisher and G. Guimarães, Editors, Advances in Intelligent Data Analysis (LNCS 2189), 4th International Conference, IDA, pp. 63-72. Springer Verlag.
2004 [2] Marques, N.C. and Gonçalves, S. (2004). Applying a Part-of-Speech Tagger to Postal Address Detection on the Web. In Proceedings of the IV International Conference on Language Resources and Evaluation. LREC 2004. Volume I. pp. 287-290. Lisboa, Portugal.
2003 [3] Dias, G. (2003). Multiword Unit Hybrid Extraction. Workshop on Multiword Expressions of the 41st ACL meeting. 7-12 July. Sapporo. Japan.
2003 [4] Alferes J. J., Damásio C. V., Pereira, L. M. (2003). Semantic Web Logic Programming Tools, invited paper in: F. Bry, N. Henze, J. Maluszynski (eds.), Procs. Workshop on Principles and Practice of Semantic Web Reasoning (PPSWR´03), pp. 16-32, Springer, LNCS 2901. At 19th Int. Conf. on Logic Programming (ICLP ´03), Mumbai, India, December, 2003.
2003 [5] L. Cavique (2003), "Micro-Segmentação de Clientes com Base em Dados de Consumo: Modelo RM-Similis", Revista Portuguesa e Brasileira de Gestão, pp. 72-77, volume 2, nº 3.