by Nuno C. Marques, Paulo Quaresma, Vitor Rocio, Luís Cavique, Gäel Dias, Yiling Yang, Luís Moniz Pereira
Keywords: Data Mining, Information Extraction from Text, Part-of-speech tagging, Semantic Web and Relational Data Base Systems.
The goal of this project is to develop general techniques for automatic knowledge extraction from limited-domain sets of texts (namely Portuguese texts). Relational databases and logic programming tools (in a perspective compatible with the semantic web [4]) will be used to rigorously describe each domain. We will try to develop a general method to transform domain data written in texts into knowledge in a relational database.
This project will try to combine two normally distinct areas: we will apply concepts from the relational databases area to help the computational processing of text. An entity-relation model for domain data will allow the construction of a database describing domain knowledge. Note that most of the previously known knowledge should be directly available in the domain database. Integrity constraints in the database will allow the extracted knowledge to be validated before its insertion in the database. The core process in knowledge extraction will be the automatic annotation of each word with a corresponding semantic tag. Domain knowledge will be used to define the appropriate semantic tag-set, which we will call the domain tag-set. The domain tag-set should be determined from domain knowledge in a systematic (and possibly automatic) way.
Techniques for automatic tag disambiguation [1] will be a necessary step before the several entities in the text can be identified. The tag-set definition will therefore have to follow closely the attributes and entities in the ER model. This methodology was already successfully applied in a preliminary study on postal address extraction [2]. With this project, we intend to develop a general method that is directly applicable to any other domain. The generation of the disambiguation tag-set and the training of the learning module should then be as automatic as possible. A second processing step applies relevant expression extraction techniques [3], combined with regular expression generation and robust parsing techniques, to extract knowledge and automatically insert it in the domain database. Finally, the extracted knowledge will be integrated into SINO, a textual database search engine. A Web interface will be developed for Internet access to both extracted information and source texts. Several case studies will be available for Internet access. This way we hope to provide general access to the project results and also to raise industry interest in this research, paving the way for future applied research projects.
We will research an example exploring the application of text and data mining techniques to touristic information (for first results on data mining, please see [5]).
The goal of this project is to develop general techniques for the automatic
extraction of knowledge from sets of texts (namely in Portuguese) about a given
domain. To describe each domain rigorously, traditional relational database
modeling techniques will be applied, using both meta-data and logic programming
tools (in a perspective compatible with the semantic web [4]). We will try to
develop a general method for transforming the data about a single domain
contained in a given set of texts into knowledge to be inserted into a
relational database.
We therefore propose to combine two normally distinct areas: we will apply
well-studied concepts from relational databases and logic programming to
facilitate the computational processing of text. We assume that a relational
database already exists, which may already hold partial knowledge about the
domain. The relations among the various entities also constrain the information
that can be inserted into the database. Knowledge extraction will therefore
involve classifying the words in the text according to a set of tags specific
to each domain. This tag-set should be defined systematically, according to the
modeled information of the domain under study.
Automatic tag disambiguation techniques [1] will be used to identify the
various entities in the text. To that end, the definition of the tag-set will
have to follow the domain information. This methodology has already been
applied successfully to the concrete case of postal address extraction [2]; we
intend to develop a general method applicable to any other domain. The
generation of the tag disambiguation system should thus be as automatic as
possible. Likewise, automatic methods for extracting patterns [3] and regular
expressions, combined with robust parsing methods, will finally make it
possible to recognize the entities and relations that characterize the
knowledge to be inserted automatically into the database.
The results will be integrated into SINO, a search engine for textual
databases. A Web interface, available over the Internet, will be developed to
allow searching the mined knowledge both through the knowledge held in the
relational database and through the textual databases of the SINO engine.
Finally, making several case studies available over the Internet may not only
be useful to the general public but also raise interest in this research among
future partners in industry and services. Specifically, we will explore an
example applying data and text mining techniques to touristic information (for
a first study using data mining, see [5]).
The main goal of this project is the development of a tagging
system capable of text disambiguation for specific domains. This
project will focus on information extraction with domain knowledge
integration. For full domain knowledge integration, the annotation
tag-sets should be generated from domain information.
Domain information should be described by a relational database
system (by means of standard entity-relation diagrams) and by
logic semantic rules ([4]).
The part-of-speech tagger of [1] will be extended to treat
semantically oriented tag-sets (a specific case of domain oriented
tag-sets).
Finally, a fully functional system integrating extracted information
and information retrieval will be available through the Internet. The text
search engine SINO (adapted to Portuguese in projects PGR and ABC)
will be extended to achieve this goal. Several case studies will be
made available to pave the way to cooperation with industry and
services in more applied projects.
The PI's main goal in this project is the development and generalization of
the neural network part-of-speech (POS) tagger [1]. This research is
part of the larger goal of studying the application of neural
networks to problems involving the treatment of large volumes of
data.
The PI did pioneering work in developing the first POS tagger
for Portuguese texts [6]. Since then, the work has been continued
with the application of neural network models to part-of-speech and
specific domain tagging problems [1,2].
Currently, the use of neural network models for tagging is
considered by some authors (e.g. [7]) one of the best ways to
learn from text examples.
[6] A. Vilavicencio, N. Marques, et al. Part-of-Speech Tagging
for Portuguese Texts. In J. Wainer and A. Carvalho (eds.), Lecture
Notes in AI 991. Springer Verlag. 1995.
[7] M. Collins. Discriminative Training Methods for HMMs: Theory and
Experiments with Perceptron Algorithms. In Proc. ACL on Empirical
Methods in NLP. 2002.
Information extraction research was mainly driven by the
Message Understanding Conference (MUC) series ([8]). By the last MUC,
named entities could be detected with a reliability
of around 90% (while event detection was still around 60%). MUCs
have now been replaced by NIST's ACE program ([9]). The objective of ACE is
to develop automatic content extraction technology to support
automatic processing of language data, namely speech and text (e.g.
phone calls, radio bulletins or news texts). The ACE
processing model specifies the insertion of extracted knowledge into
a database. However, the open-domain goal of ACE limits domain
knowledge integration. Until now, information extraction research
has focused mainly on the recognition of entities like Location,
Organization or Person, but it is still not possible to relate a
given location to a precise geographic place (as is done in
e.g. [2]); i.e. domain knowledge integration is not achieved (as
needed e.g. for [4]).
Corpus-based methods and machine learning provide an excellent way
of developing robust and efficient NLP systems (e.g. [10,11]). An
example of these techniques is part-of-speech tagging (e.g. [1, 6,
7, 12]). However, the treatment of unrestricted text and corpora
requires attention to engineering aspects. For example, text
has to be split into simple tokens and sentences have to be segmented.
Unix text processing tools and related scripting
languages are currently taught by the PI in [13] and have provided a
simple way to manipulate huge volumes of text. However, on their own, these
tools offer limited compatibility. The open-source GATE architecture
[14], composed of an XML-enabled Java API and the CREOLE component
collection, includes tokenization, sentence splitting, a gazetteer
(containing lists such
as cities, organizations or days of the week) and semantic tagging. The
integration of previously developed work into this Java/XML
environment could be a simple way of both extending functionality
and sharing the results of the systems proposed here. For example, KIM [15] is an
open-domain semantic annotation platform based on the GATE architecture;
it includes the KIMO ontology of concepts. Multipurpose XML annotation
schemas allowing the mapping of relations between concepts at several
levels (e.g. [16]) will be studied in project task T1.
This proposal draws on the knowledge and experience of the project team
in natural language corpus-based research (e.g. [20,21]), and
particularly on its extensive knowledge of database lexical
resources and part-of-speech problems (e.g. [1,2,3,17]). [17]
presents a first work on the use of neural networks for
part-of-speech tagging. [1] shows that the neural network POS-tagger,
when properly backed up by a lexical database, can learn from
only a very limited number of training examples. Also, recently, an
independent author [7] argued that neural-network based methods give
better results than state-of-the-art maximum likelihood methods for
learning natural language tags. [18] shows that part-of-speech
taggers constitute an excellent source for word-sense
disambiguation. In our research, semantic tagging ([19]) will be seen
as an immediate extension of POS-tagging.
Extracted information must be linked to its source in the
text. SINO is a specialized text search engine that has been used in
many projects. In the context of AdI projects [20], SINO was already
used as a base tool for the implementation of web information
retrieval systems. This project will show the advantages of
extending information retrieval systems to web information retrieval
and extraction systems.
[8] http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
[9] http://www.nist.gov/speech/history/index.htm
[10] Church, K. Mercer, R. (1993). Introduction to the special issue
on computational linguistics using large corpora. Computational
Linguistics (CL). 19(1). MIT Press. ISSN:0891-2017.
[11] Ratnaparkhi, A. (1999). Learning to Parse Natural Language with
Maximum Entropy Models. Machine Learning. 34(1-3). Kluwer Academic
Publishers. ISSN 0885-6125.
[12] Tufis, D., Dragomirescu, L. (2004). Tiered Tagging Revisited.
In Proceedings of the IV LREC. Volume I. pp. 39-42. Lisboa,
Portugal, May 2004.
[13] MSc courses at http://di.fct.unl.pt, namely http://www.di.fct.unl.pt/mei0304
[14] http://gate.ac.uk/
[15] http://www.ontotext.com/kim/index.html
[16] Buitelaar, P. et al. A Multi-Layered, XML-Based Approach to the
Integration of Linguistic and Semantic Annotations. EACL 2003 3rd
NLPXML.
[17] Marques, N. C., Lopes, J. (1996) Using Neural Nets for
Portuguese Part-of-Speech Tagging. In Proceedings of the Fifth CSNLP.
[18] Stevenson, M. Wilks, Y. (2001). The interaction of knowledge
sources in word sense disambiguation. CL 27(3).
[19] Ide, N. Véronis, J. (1998). Introduction to the special issue
on word sense disambiguation: the state of the art. CL 24(1).
[20] Projects PGR LO59-P31B-02/97 and ABC/AdI/2002
[21] Projects PRAXIS 2/2.1/CSH/778/95 or PBIC/C/TIT/1226/92
A pilot system available on a Web page for general public use,
presentations at international conferences, publication of
scientific articles in peer-reviewed literature, and organization of
seminars and workshops will be the main means of disseminating the
project's scientific results. A Web page will be maintained to disseminate
the project objectives and results of interest to the general
public.
The resulting models and supporting code
will be publicly released as soon as possible. Their use by researchers and
technicians involved in semantic web, NLP, text mining and data
mining will be encouraged. Relevant project results will be taught
in MSc classes at [13]. This will also allow interested, talented
students to develop their MSc in an area related to this project.
As soon as possible, a first Web interface to the pilot
system will be available. Free documentation texts and extracted
knowledge in selected case studies will be made available to the general
public.
The most natural form of storing information is text. So it is
believed that text mining could have a much higher commercial impact
than data mining.
We intend to select as a domain test bed the area of touristic
information and culture (e.g. previous contacts exist with [22]).
As an illustrative example, consider the domain of Pousadas de
Portugal (studied in [5]), a state-owned network of
hotel resorts in buildings of historical interest (like Pousada
Rainha Sta. Isabel, situated in a 13th-century castle in the middle
of Portugal) or in particularly pleasant places across Portugal.
Besides the obviously highly structured information that could be
inserted in a database (e.g. for all the Pousadas: location, main
characteristics, nearby places of interest or main food menus), a
huge amount of related text information is also available (both in
Portuguese and English): specific tourist guides and bulletins are
regularly issued and available on the Internet (on both Portuguese and
English sites), and the main touristic events in sport or culture are
available from the LUSA news agency (as can be seen in the LUSA corpus
[21]).
Specific domain text extraction could be used to update the more
static information with up-to-date information from the
general news, from a new issue of a specific touristic guide or just
from the daily text menu of a Pousada restaurant. This way
it would be possible (for a registered user) to receive an SMS or
email alert saying that the user's favorite food is available in the
Manteigas Pousada, with a classical music concert nearby (which can be
computed based on [2]) and that, according to Jornal da Covilhã, the
roads to the skiing resort in Serra da Estrela are open. Indeed, the
system could have automatically detected (using data mining
techniques such as [5]) a strong association between stays in the
Pousadas, the fact that the ski resort was always open during
previous user stays in Manteigas, the favorite meals the user always
eats in all the Pousadas and the fact that the user always eats in
Pousada de Queluz during the concert season.
We think this example illustrates the main impact this
project could have on improving state-of-the-art scientific tools
for areas that could be decisive for Portugal. Synergies
with other ongoing CENTRIA research, such as that
performed within the FP6 NoE REWERSE (Reasoning on the Web with Rules
and Semantics), headed in Portugal by one of the team members [4,
23], should also be possible. Portuguese research into new areas of
knowledge can then be directly applied to services and industry
in Portugal.
[22] CESEM (http://www.fcsh.unl.pt/cesem/ ), namely regarding the
system music query (e.g. see project “Investigação, Edição e Estudos
Críticos de Música Portuguesa dos Séculos XVIII a XX”).
[23] L. M. Pereira, REWERSE - Reasoning on the Web with Rules and
Semantics,
Computational Logic Newsletter 5:8-9, December 2003.
The goal of this task is the definition of a semi-automatic mapping capable of
building a domain-specific tag-set based on domain knowledge and standard
natural language analysis tools. We will start by creating resources for
selected case studies in tagged text mining. Then, a general framework for
representing and handling different domain information should be created.
Different annotation schemas should be studied in order to allow the integration
of the several annotation levels involved in this project. Particular care must
be taken to provide tools capable of mapping between human-annotated information
for each domain and the specific tags of different automatic processing levels.
Due to the rich set of tools already available (e.g. the GATE project [14]), a
careful review must be made of available systems capable of information
extraction (namely those from the MUC conferences [8] and recent results in the
ACE project [9]). All annotation will use the XML standard so that it can be
easily ported among different tools.
As input to the work described here we assume a set of texts
over well-defined domains. Each domain should be represented by a relational
database with a corresponding ER model, taking into account both the structure
of the knowledge we want to model (i.e. textual information to mine) and useful
pre-inserted domain background knowledge.
The standard levels in natural language processing should be
considered for helping human annotation. However, only domain-specific
information should be manually annotated. The output of other
processing modules should therefore be integrated into the annotation in a way as
transparent to the user as possible. Namely, standard morphological analysis (we
can use standard Unix text tools or a GATE module [14]), POS tagging (we will
adapt and develop the system presented in [1]) and rule extractors (Unix regular
expressions will be used as the base level) must be integrated by means of XML
annotations. A set of tools, as domain independent as possible, should be
developed or adapted to help the domain annotator.
Based on previous work, we will select as domain examples texts from
three case studies. The first case study is the postal address problem ([2]).
The second case study will be email announcements of scientific conferences
(i.e. calls for papers; this problem is currently being studied). The third and
more difficult problem
will be the touristic and cultural example described in sec. 8.4. We will model
and collect selected texts in these domains. Then we will manually annotate
these texts with domain-specific knowledge. The annotation scheme should allow a
precise mapping between the annotations and fields in the domain relational
database. In order to do so, a set of domain-specific tags, based on the
entities and attributes in the database combined with the automatic analysis
of text, should be proposed in a semi-automatic way. This domain-specific tag-set
should be identified for each domain ER model.
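As a minimal sketch of how a domain tag-set could follow the entities and attributes of an ER model, the fragment below builds one tag per (entity, attribute) pair and maps each tag back to a database field. The entity names, attribute names, tag format and the "O" (outside) tag are illustrative assumptions, not the schema or tag-set actually used in [2].

```python
from typing import Dict, List, Tuple

# Hypothetical ER description: entity name -> list of attribute names.
# These names are illustrative only, not the project's actual schema.
ER_MODEL: Dict[str, List[str]] = {
    "Address": ["street", "number", "postal_code", "locality"],
    "Institution": ["name"],
}

OUTSIDE_TAG = "O"  # assumed tag for tokens that map to no database field


def domain_tag_set(er_model: Dict[str, List[str]]) -> List[str]:
    """Build one tag per (entity, attribute) pair, plus the outside tag."""
    tags = [OUTSIDE_TAG]
    for entity in sorted(er_model):
        for attribute in er_model[entity]:
            tags.append(f"{entity.upper()}.{attribute.upper()}")
    return tags


def tag_to_field(tag: str) -> Tuple[str, str]:
    """Map a tag back to the (entity, attribute) pair it was derived from."""
    if tag == OUTSIDE_TAG:
        raise ValueError("the outside tag maps to no database field")
    entity, attribute = tag.split(".")
    return entity.capitalize(), attribute.lower()


if __name__ == "__main__":
    for tag in domain_tag_set(ER_MODEL):
        print(tag)
```

Because every tag is derived mechanically from the schema, an annotation carrying such a tag identifies the table and column where the annotated text would be stored.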
A general DTD schema should be proposed and used for domain annotation in
selected domains: postal address detection in institutional web pages,
conference data in emails with calls for scientific publications (i.e.
call-for-papers announcements) and touristic and cultural information (please
see sec. 8.4).
The developed DTD schema will be used for providing domain specific
annotations in the text. These annotations will be automatically complemented by
the use of standard natural language tools. Unlike standard resources where all
the tokens are treated and identified, this work will only focus on relevant
main tokens. Over these relevant main tokens the system will automatically apply
a tokenizer (to try to divide the main tokens), a standard part-of-speech
tagger (to assign part-of-speech tags to all the relevant words) and regular
expressions (to try to split the text input into meaningful fields for
the database). All this annotation will be done automatically. The several
possible alternatives will be marked.
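The regular expression step can be illustrated with a small sketch that splits one relevant main token into database fields. The pattern assumes a simplified layout of street, number, a postal code of four digits, a hyphen and three digits, then locality, all on one line; the field names and the layout are assumptions made for illustration and are not the patterns used in [2].

```python
import re
from typing import Dict, Optional

# Assumed layout: street, number, NNNN-NNN postal code, locality.
# Real addresses vary much more; this is only a base-level pattern.
ADDRESS_PATTERN = re.compile(
    r"(?P<street>[^,]+),\s*"
    r"(?P<number>\d+)\s*,\s*"
    r"(?P<postal_code>\d{4}-\d{3})\s+"
    r"(?P<locality>.+)"
)


def split_fields(main_token: str) -> Optional[Dict[str, str]]:
    """Split a main token into named fields, or return None if it does not fit."""
    match = ADDRESS_PATTERN.fullmatch(main_token.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(split_fields("Rua Exemplo, 12, 1234-567 Lisboa"))
    print(split_fields("no address here"))
```

Returning None instead of a partial result keeps unmatched tokens available for the other annotation levels, where alternative analyses can still be marked.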
Wherever possible, annotated texts and extracted knowledge will be
made available on the project web page. Limited annotating functionality should
also be available to normal users, since by annotating the text the user is
inserting more information in the database. So, if the case studies are useful
and attract enough users, we also expect a reasonable increase in resources by
making annotation tools available to the general public. All the annotation
efforts on publicly available texts will be available for general use. In order
to “boot-strap” this process, and to guarantee enough tagged text for the
following tasks, after a first effort on hand annotation made by the involved
researchers, a human annotator will be contracted for text annotation. This
annotation will be done mainly in the touristic domain (this will be
supported by budget item “Acquisition of services and maintenance”).
Two papers will be submitted to peer reviewed conferences.
A Web-based text annotation tool and modules for open-source systems
developed by others (e.g. the GATE architecture [14]) will be developed and made
available through the project web page.
The part-of-speech tagger developed in
[1] will be extended to handle specific domain tag-sets. The
domain-tagger will be generalized by extending the base POS-tagger
with the integration of more context, the use of a hidden tag-set
and a richer set of features. The use of XML annotation will help
the integration of this system with other systems.
A public domain release of the neural tagger will be made available
on the project web page. This tool will be presented to students in [13]
and will be described in submitted papers. This is part of the
effort to disseminate project results to the NLP and text mining
research communities. Since a GNU-GPL license will be used,
collaboration with other researchers in the field will be possible.
Two papers will be submitted to peer-reviewed conference
proceedings: one on domain tagging and POS-Tagging, one on extending
the POS-Tagger context and on using better features for domain-tag
tagging.
One scientific article will be submitted to a peer-reviewed journal
about POS tagging. Based on work done also in T1, a master thesis is
possible.
A final system capable of domain tag annotation with a generic web
interface will be available upon completion of this task. This tool
will be a first step towards the development of applied
projects in specific domain areas. The use of Portuguese texts will
also contribute to increasing the knowledge and processing power
available for the computational treatment of Portuguese.
The main goal of this task will be the
development of a general purpose domain tag disambiguator. This
system will be based on the neural network part-of-speech tagger
described in [1]. Indeed, previous results support our claim that
the neural network tagger is an excellent candidate for a
domain-specific tagger ([1,7]): the very small requirements in terms of
learning data of the POS-tagger make it useful for specialized
domains with little training data. Also, the capacity for using
databases of previous knowledge will be a key issue for domain
knowledge integration. Finally, because specialized domains can be
easily formalized by their meta-data (contained in the database
description) and by specialized semantic rules, specialized domain
tag-sets will be constructed.
This system has already been successfully adapted for
specific domain tagging in [2]. Due to the excellent learning characteristics
of neural-network taggers ([7]), particularly when extended with
background domain knowledge ([1]), this tagger should be able to
learn with very small amounts of training data. So this task will
use human annotated texts from T1 to learn to assign disambiguating
probabilities to ambiguous annotations in text. The specific set of
features and tags to be used by the system should be learnt in a
fully automatic way by using the concept of surface and hidden
tag-sets ([24]) and by automatically learning the best neural
network feature vectors through evolutionary programming ([25]).
In a first step, the automatically annotated input resulting from T1
will be converted into feature vectors by using a set of indicator
functions (as defined, e.g., in [7]). Indicator functions allow a
generalization of the context and features to use. The previous
experience of using large lexicons as backup lexical knowledge for a
part-of-speech tagger should be extended to access all the
available knowledge in the database for building indicator
functions.
A careful literature review and evaluation of good indicator functions
should be made for this task. All the indicator functions
relevant for the task of domain tagging in any domain should be
used. These input vectors can be acquired for any text in any
domain.
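The conversion into feature vectors can be sketched as follows: each indicator function inspects a token in its context and returns true or false, and the vector is the list of those answers. The particular indicator functions and the small locality lexicon (standing in for knowledge read from the domain database) are hypothetical choices for illustration, not the feature set of [1] or [7].

```python
from typing import Callable, List, Sequence, Set

# Hypothetical lexicon standing in for knowledge stored in the domain database.
KNOWN_LOCALITIES: Set[str] = {"lisboa", "porto", "évora"}

# An indicator function looks at position i of a token sequence.
Indicator = Callable[[Sequence[str], int], bool]


def is_capitalized(tokens: Sequence[str], i: int) -> bool:
    return tokens[i][:1].isupper()


def has_digit(tokens: Sequence[str], i: int) -> bool:
    return any(ch.isdigit() for ch in tokens[i])


def in_locality_lexicon(tokens: Sequence[str], i: int) -> bool:
    return tokens[i].lower() in KNOWN_LOCALITIES


def previous_is_comma(tokens: Sequence[str], i: int) -> bool:
    return i > 0 and tokens[i - 1] == ","


INDICATORS: List[Indicator] = [
    is_capitalized,
    has_digit,
    in_locality_lexicon,
    previous_is_comma,
]


def feature_vector(tokens: Sequence[str], i: int,
                   indicators: Sequence[Indicator] = INDICATORS) -> List[int]:
    """Evaluate every indicator function on token i and return a 0/1 vector."""
    return [1 if indicator(tokens, i) else 0 for indicator in indicators]


if __name__ == "__main__":
    sentence = ["Rua", "Exemplo", ",", "1234-567", "Lisboa"]
    for position, token in enumerate(sentence):
        print(token, feature_vector(sentence, position))
```

Adding domain knowledge then amounts to adding one more function to the list, which is why the same vector construction can be reused for any text in any domain.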
The second step will be the training of the neural network. This
step must use the hand-annotated text of T1 for generalizing the
patterns contained in the input vectors. However, particular care
must be taken in learning to classify with the right tag-set. The method
presented in [24], where a hidden tag-set is automatically generated,
will be researched in this task. Finally, neural networks trained using
evolutionary programming [25] will be used.
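To make the training step concrete, the sketch below trains a plain multiclass perceptron on binary feature vectors. This is a deliberately simpler stand-in: it is neither the neural tagger of [1], nor the hidden tag-set method of [24], nor evolutionary training as in [25]. The tags and the three training examples are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[int], str]  # (binary feature vector, gold tag)


def predict(weights: Dict[str, List[float]], features: Sequence[int]) -> str:
    """Return the highest-scoring tag; ties go to the tag listed first."""
    def score(tag: str) -> float:
        return sum(w * x for w, x in zip(weights[tag], features))
    return max(weights, key=score)


def train_perceptron(examples: Sequence[Example], tags: Sequence[str],
                     epochs: int = 10) -> Dict[str, List[float]]:
    """Mistake-driven training: reward the gold tag, penalize the wrong guess."""
    n_features = len(examples[0][0])
    weights = {tag: [0.0] * n_features for tag in tags}
    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(weights, features)
            if guess != gold:
                for j, x in enumerate(features):
                    weights[gold][j] += x
                    weights[guess][j] -= x
    return weights


# Invented toy data: a capitalized word found in a lexicon is a locality.
TRAINING: List[Example] = [
    ([1, 0, 1, 0], "LOCALITY"),
    ([0, 1, 0, 1], "O"),
    ([1, 0, 0, 0], "O"),
]

if __name__ == "__main__":
    model = train_perceptron(TRAINING, tags=["O", "LOCALITY"])
    for features, gold in TRAINING:
        print(features, gold, predict(model, features))
```

The update only touches weights when a prediction is wrong, which is one reason such learners can work with small amounts of annotated data.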
[24] Tufis, D., Dragomirescu, L. (2004). Tiered Tagging Revisited. In
Proc. 4th Int. Conf. LREC. Volume I. pp. 39-42. Lisboa, Portugal.
[25] Marco Castellanni. (2003). Annual Activity Report. CENTRIA
Technical Report.
The following results are expected
after conclusion of this task:
* A software tool capable of converting lists of multiword units,
annotated either by human or machine with domain specific tags, into
general rules. These general rules could then be used to find
relevant patterns to be inserted into the database from
automatically annotated input texts.
* A software tool capable of helping the human knowledge expert to
annotate unexpected or strange relevant patterns in text. This tool
will be based on the fact that algorithms for the automatic
extraction of regular expressions (e.g. [27]) allow us to
find unexpected patterns.
* Two peer-reviewed conference papers describing the software tools in
the previous points.
The main goal of this task will be the
development of a general model for parsing the domain tags and other
annotations. Available lists of entities (either domain-dependent
knowledge previously inserted into the database or human-annotated
data) are, however, distinct from the patterns available in text.
That is why automatic regular expression extractors should be
applied to text. Indeed, the extraction and identification of
multiword units (MWUs) has shown successful results in applications
that need a certain degree of semantics ([28]). For that purpose a
system called HELAS (Hybrid Extraction of Lexical Associations)
[3,29] was developed. HELAS has been designed and developed around
the idea of total flexibility. Based exclusively on a new
probabilistic measure (the combined Association Measure) and a new
acquisition process (the GenLocalMaxs algorithm), HELAS detects
multiword lexical units by processing a tagged corpus only once, for
any language, any domain or any type, without changing it in any
form. MWUs include a large range of linguistic phenomena, such as
compound nouns, phrasal verbs, adverbial locutions, compound
determinants, prepositional locutions and institutionalized phrases.
MWUs are frequently used in everyday language, usually to precisely
express ideas and concepts that cannot be compressed into a single
word. As a consequence, their identification is a crucial issue for
applications that require some degree of semantic processing.
HELAS only outputs a list of unstructured sequences of words. So, in
order to recognize them in new input texts, these lists should be
converted into regular expressions. Models based on regular
expressions, finite automata and grammars have been used, in
particular, for natural language syntactic analysis [26]. However,
these models are manually built from patterns that we expect to find
in texts. Algorithms for the automatic extraction of regular
expressions [27], in contrast, make it possible to find expected as
well as unexpected
patterns, thus contributing to better knowledge of the analyzed data
and, as a consequence, to its optional annotation by the knowledge
expert.
Automatically extracted regular expressions over annotated data
could be used directly to convert domain tagged raw text into fields
that may directly be inserted into the database.
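The simplest form of this conversion can be sketched as turning a list of multiword units into a single alternation pattern. This is a literal conversion that tolerates variable whitespace and letter case; it is not the regular expression mining of [27] and does not generalize beyond the listed units. The input list is an invented example rather than actual HELAS output.

```python
import re
from typing import Iterable, List


def mwus_to_regex(mwus: Iterable[str]) -> re.Pattern:
    """Compile a list of multiword units into one alternation pattern."""
    alternatives = []
    # Longest units first, so a longer unit wins over one of its prefixes.
    for unit in sorted(set(mwus), key=lambda u: (-len(u), u)):
        words = unit.split()
        if words:
            alternatives.append(r"\s+".join(re.escape(word) for word in words))
    if not alternatives:
        raise ValueError("no multiword units given")
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_mwus(pattern: re.Pattern, text: str) -> List[str]:
    """Return every occurrence of a known unit, in text order."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    units = ["Serra", "Serra da Estrela", "Pousadas de Portugal"]
    pattern = mwus_to_regex(units)
    print(find_mwus(pattern, "As Pousadas de Portugal ficam perto da Serra da Estrela."))
```

Each match could then be paired with the domain tag of its unit to produce a candidate field value for the database.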
[26] V. Rocio, E. de la Clergerie and J.G.P.Lopes. 2001. Tabulation
for multi-purpose partial parsing. Grammars. 4(1): 41-65. Kluwer
Academic Publishers.
[27] Savchenko. Regular Expression Mining and its Information
Quality Applications. In: Craig Fisher and Bruce N. Davison, eds.,
Proceedings of the Seventh International Conference on Information
Quality (IQ 2002), MIT, pp. 177-186.
[28] O. Vechtomova, M. Karamuftuoglu. (2004). Use of Noun Phrases in
Interactive Search Refinement. MEMURA Workshop of the 4th LREC. Dias,
G., Lopes, J.G.L. & Vintar, S. (eds). Lisbon, Portugal. ISBN:
2-9517408-1-6.
[29] http://helas.di.ubi.pt
As a result of this task, a new
version of the text search engine SINO will be produced that can
index additional information associated with each word. This new
version will use the tags created in task T2 to reduce the
complexity of the relevant expression extraction task (reducing
the input size and improving access to the documents).
SINO's connection with a relational database engine (e.g. PostgreSQL)
will be strengthened by providing primitives for handling extracted
information and its association with base texts.
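A minimal sketch of such a primitive is given here, using Python's built-in sqlite3 module in place of PostgreSQL so that the example is self-contained. The table names, columns and the insert_if_valid helper are hypothetical. The point illustrated is that integrity constraints reject invalid extracted facts and that each accepted fact keeps a reference to its source text.

```python
import sqlite3

# Hypothetical schema: every extracted menu item must point to an existing
# pousada and to the source text (with offset) it was extracted from.
SCHEMA = """
CREATE TABLE source_text (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL
);
CREATE TABLE pousada (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE menu_item (
    id INTEGER PRIMARY KEY,
    pousada_id INTEGER NOT NULL REFERENCES pousada(id),
    dish TEXT NOT NULL CHECK (length(dish) > 0),
    source_id INTEGER NOT NULL REFERENCES source_text(id),
    start_offset INTEGER NOT NULL CHECK (start_offset >= 0)
);
"""


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def insert_if_valid(conn: sqlite3.Connection, pousada_id: int, dish: str,
                    source_id: int, start_offset: int) -> bool:
    """Insert an extracted fact; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO menu_item (pousada_id, dish, source_id, start_offset) "
                "VALUES (?, ?, ?, ?)",
                (pousada_id, dish, source_id, start_offset),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = open_database()
    db.execute("INSERT INTO source_text (id, body) VALUES (1, 'Menu do dia: bacalhau')")
    db.execute("INSERT INTO pousada (id, name) VALUES (1, 'Pousada de Manteigas')")
    db.commit()
    print(insert_if_valid(db, 1, "bacalhau", 1, 13))   # known pousada
    print(insert_if_valid(db, 99, "bacalhau", 1, 13))  # unknown pousada
```

Storing the source identifier and offset with each fact is what lets a search interface show the original passage next to the extracted value.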
Since the final result of this task will be a web interface to all
the research developed in this project, and to some previous
research work already completed (namely [5] and [31]), a greater
effort on software engineering is needed in this task (this will be
supported by budget item “Acquisition of services and maintenance”).
A web interface illustrating the capabilities of the full system in
selected case studies will be publicly available through the Internet.
Selected case studies will be used for illustrating the possible
direct application of the results of this research in services and
industry. Namely, a case study on tourist information will be made
available to the general public. We intend to select as a domain test
bed the area of tourist information [5] and culture (previous
contacts exist with [22]). After project completion, a pilot system
integrating access to database knowledge by web forms, email alarm
systems, discussion forums and general news will be ready (please
see section 8.4 for an illustrative example).
The success of this research could be easily checked by how popular
the available information becomes. If success is achieved, this will
be one more argument in favor of a possible future R&D project
with an industry partner (e.g. through [30]). Also, from the research
point of view, touristic and cultural information form a very
coherent field. Correlation exists between several different areas
such as culture, sports (based on previous experience with the LUSA
corpus [21]) or place history (information available from the partner
Universidade Aberta).
Four papers will be submitted to peer-reviewed conferences as a
result of this work. A final journal paper with the results of the
full project will be submitted to a peer-reviewed journal after
completion of this task. One master's thesis is expected after the
completion of this task.
[30] http://www.adi.pt
This task aims to improve the existing
text search engine SINO so that it takes into account database
information, domain tags and other kinds of annotation associated
with words in sentences. A final project pilot Web system with a
full range of data mining capabilities will be developed as a result
of this integration.
SINO is a search engine used in several research projects, such as
that of AustLII, the Australasian Legal Information Institute. In
Portugal it has been used in [20]. Despite being specialized for
the Portuguese language, SINO does not take into account tags
carrying information (part-of-speech, semantic, etc.) associated
with each word. In this task, SINO will be extended so that words
can be indexed together with their associated information and that
information can be queried. It will be possible to query the text
base for sentences (or words) where specific tags appear. This new
feature will be used in task T4 to reduce the space and time
complexity of the expression extraction process.
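As a rough illustration of what querying for sentences (or words) where specific tags appear could look like, the following Python sketch builds a toy tag-aware inverted index. It is not SINO's actual data structure or API: the class `TaggedIndex`, its method names and the tag labels are assumptions made only for this example.

```python
from collections import defaultdict


class TaggedIndex:
    """Toy inverted index over (word, tag) pairs (hypothetical, not SINO)."""

    def __init__(self):
        self.by_word = defaultdict(set)  # lower-cased word -> sentence ids
        self.by_tag = defaultdict(set)   # tag -> sentence ids
        self.sentences = {}              # sentence id -> [(word, tag), ...]

    def add_sentence(self, sent_id, tagged_words):
        """Index one sentence given as an iterable of (word, tag) pairs."""
        pairs = list(tagged_words)
        self.sentences[sent_id] = pairs
        for word, tag in pairs:
            self.by_word[word.lower()].add(sent_id)
            self.by_tag[tag].add(sent_id)

    def sentences_with_tags(self, *tags):
        """Return sorted ids of sentences in which every given tag appears."""
        if not tags:
            return []
        postings = [self.by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*postings))

    def words_with_tag(self, tag):
        """Return the sorted, distinct words that carry the given tag."""
        found = set()
        for sent_id in self.by_tag.get(tag, ()):
            for word, word_tag in self.sentences[sent_id]:
                if word_tag == tag:
                    found.add(word)
        return sorted(found)
```

Restricting later processing to the sentences returned by `sentences_with_tags` is one way the input size of the expression extraction step could shrink, since sentences without any relevant tag are never read.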
Integration with specific domain knowledge will also be possible
by building general modules to search and manage specialized
information in the database. Automatic code generation modules based
on the domain knowledge will also be researched in this task. In
particular, the work in [5] will be used to mine relevant association
rules on all information types available in each domain database.
Mined knowledge will then be available as another entity in the
domain database. The intended pilot system should also adapt itself
to the user, based on knowledge mined from access logs [31].
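To make the notion of an association rule concrete, the sketch below enumerates rules between pairs of items using plain support and confidence thresholds. This is a generic textbook formulation, not the method of [5]; the function name `pair_rules`, the thresholds and the sample items are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    Only rules between two single items are considered (hypothetical
    simplification; real miners also handle larger itemsets).
    """
    n = len(transactions)
    if n == 0:
        return []
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates within a record
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n               # fraction of records with both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)
```

Each returned tuple could be stored as a row of its own, which matches the idea of making mined knowledge available as another entity in the domain database.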
SINO integration will be a key issue for using this system. The
domain knowledge should be available by combining two distinct
functionalities: the information retrieval capabilities of SINO and
a standard relational database engine (e.g. the open-source PostgreSQL database) containing previously available information,
the information extracted by the system and data mining results.
SINO will be used both for information retrieval and to associate
extracted information with the texts from which it was
extracted.
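One possible way to associate extracted information with the texts it came from is sketched below. It uses Python's built-in `sqlite3` module as a stand-in for a relational engine such as PostgreSQL; the table names, column names and helper functions are all hypothetical and do not describe the project's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE source_text (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE extracted_fact (
            id         INTEGER PRIMARY KEY,
            entity     TEXT NOT NULL,
            attribute  TEXT NOT NULL,
            value      TEXT NOT NULL,
            text_id    INTEGER NOT NULL REFERENCES source_text(id),
            start_char INTEGER NOT NULL,
            end_char   INTEGER NOT NULL CHECK (end_char > start_char)
        );
    """)
    return conn


def add_fact(conn, text_id, entity, attribute, value):
    """Store a fact together with the character span where it occurs."""
    row = conn.execute(
        "SELECT body FROM source_text WHERE id = ?", (text_id,)
    ).fetchone()
    if row is None:
        raise ValueError("unknown source text")
    start = row[0].find(value)
    if start < 0:
        raise ValueError("value does not occur in the source text")
    conn.execute(
        "INSERT INTO extracted_fact "
        "(entity, attribute, value, text_id, start_char, end_char) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (entity, attribute, value, text_id, start, start + len(value)),
    )


def facts_with_context(conn, entity):
    """Return (attribute, value, source span) rows for one entity."""
    return conn.execute(
        "SELECT f.attribute, f.value, "
        "       substr(t.body, f.start_char + 1, f.end_char - f.start_char) "
        "FROM extracted_fact AS f "
        "JOIN source_text AS t ON t.id = f.text_id "
        "WHERE f.entity = ? ORDER BY f.id",
        (entity,),
    ).fetchall()
```

The foreign key and the `CHECK` constraint play the role of simple integrity restrictions: a fact that points to a missing text or to an empty span is rejected before it reaches the database.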
[31] Yang, Y., Guan, X. and You, J. (2002). CLOPE: A fast and effective
clustering algorithm for transactional data. ACM SIGKDD '02, July
23-26, Edmonton, Alberta, Canada. (Student scholarship award paper.)
| Year | Publication |
| --- | --- |
| 2001 | [1] Marques, N.C. and Lopes, J.G. (2001) Tagging With Small Training Corpora. In F. Hoffmann, D. Hand, N. Adams, D. Fisher and G. Guimarães, Editors, Advances in Intelligent Data Analysis (LNCS 2189), 4th International Conference, IDA, pp. 63-72. Springer Verlag. |
| 2004 | [2] Marques, N.C. and Gonçalves, S. (2004). Applying a Part-of-Speech Tagger to Postal Address Detection on the Web. In Proceedings of the IV International Conference on Language Resources and Evaluation. LREC 2004. Volume I. pp. 287-290. Lisboa, Portugal. |
| 2003 | [3] Dias, G. (2003). Multiword Unit Hybrid Extraction. Workshop on Multiword Expressions of the 41st ACL meeting. 7-12 July. Sapporo. Japan. |
| 2003 | [4] Alferes, J. J., Damásio, C. V. and Pereira, L. M. (2003). Semantic Web Logic Programming Tools, invited paper in: F. Bry, N. Henze, J. Maluszynski (eds.), Procs. Workshop on Principles and Practice of Semantic Web Reasoning (PPSWR'03), pp. 16-32, Springer, LNCS 2901. At 19th Int. Conf. on Logic Programming (ICLP'03), Mumbai, India, December, 2003. |
| 2003 | [5] L. Cavique (2003), "Micro-Segmentação de Clientes com Base em Dados de Consumo: Modelo RM-Similis", Revista Portuguesa e Brasileira de Gestão, pp. 72-77, volume 2, nº 3. |