Databases
Using Databases to Represent Linguistic Data
Database systems can be used to represent linguistic data.
This appendix gives pointers regarding three kinds of databases:
Special-purpose linguistic databases
Some linguists have solved the problem of representing linguistic data
by building special-purpose linguistic database programs. The following are
three programs developed by the Summer Institute of Linguistics that specialize
in interlinear text analysis and lexical data management:
-
IT, the Interlinear Text processor, for DOS (1987; version 1.2, 1992) and Macintosh
(1988; version 1.01r7, 1992)
-
The linguist's Shoebox: an integrated data management and analysis tool, for DOS
(1990; version 2.0, 1993), Windows (version 3.0, 1996), and Macintosh (version 3.0, 1997)
-
LinguaLinks, an electronic productivity support system for field language workers
(see Linguistics Workshop for data management tools), for Windows (1996)
Relational databases
Relational databases are the most popular kind of database in use today,
but they fall short when it comes to handling linguistic data. In
terms of the requirements proposed in this chapter
of the book, relational databases excel at handling the
multidimensional and integrated nature of linguistic data (requirements
4 and 5), but handle poorly the sequential and hierarchical nature of
linguistic data (requirements 2 and 3).
Here are some research projects which have extended the relational
model to deal with these requirements for handling text:
-
Stonebraker, Michael, Heidi Stettner, Nadene Lynn, Joseph Kalash, and Antonin
Guttman. (1983) Document processing in a relational database system,
ACM Transactions on Office Information Systems, 1(2):143-188.
-
Text/Relational Database Management System Project, Centre for the New Oxford
English Dictionary and Text Research, University of Waterloo. Provides PostScript
versions of many publications.
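To make the difficulty with requirements 2 and 3 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `unit`, its columns, and the helper functions are all hypothetical and chosen only for illustration. Because rows in a relational table have no inherent order or nesting, sequence has to be encoded in an explicit `position` column and hierarchy in an explicit `parent_id` column, and the application has to walk the structure itself.

```python
import sqlite3

# Hypothetical schema: one table holds sentences, words, and morphemes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES unit(id),  -- hierarchy must be encoded by hand
        position  INTEGER NOT NULL,             -- sequence must be encoded by hand
        kind      TEXT NOT NULL,
        form      TEXT NOT NULL
    );
""")

rows = [
    (1, None, 0, "sentence", "dogs bark"),
    (2, 1, 0, "word", "dogs"),
    (3, 1, 1, "word", "bark"),
    (4, 2, 0, "morpheme", "dog"),
    (5, 2, 1, "morpheme", "-s"),
    (6, 3, 0, "morpheme", "bark"),
]
conn.executemany("INSERT INTO unit VALUES (?, ?, ?, ?, ?)", rows)


def children(parent_id):
    """Return (id, form) pairs for the children of a unit, in sequence order."""
    cur = conn.execute(
        "SELECT id, form FROM unit WHERE parent_id = ? ORDER BY position",
        (parent_id,),
    )
    return cur.fetchall()


def morphemes_of_sentence(sentence_id):
    """Walk sentence > word > morpheme, one query per level of the hierarchy."""
    result = []
    for word_id, _ in children(sentence_id):
        result.extend(form for _, form in children(word_id))
    return result


print(morphemes_of_sentence(1))  # ['dog', '-s', 'bark']
```

Every step down the hierarchy costs another query (or another self-join), and the order of the results depends entirely on remembering to sort by `position`.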
Some leading relational database systems:
An important notion from relational database
theory is normalization. This is the process of organizing
a database in such a way that no piece of information occurs more than once
in the database.
-
Database
Normalization Basics, article #Q100139 from Microsoft Technical Support
Knowledge Base
-
Database Normalization
from Reid Software Development
-
Stages
of Normalization, by Oliver Burmeister, Swinburne University of Technology
-
Smith, Henry C. (1985) Database design: composing fully normalized
tables from a rigorous dependency diagram. Communications of the
ACM, 28(8):826-838 (online review). Describes an easy-to-use methodology.
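As a rough illustration of normalization as defined above, the following Python sketch starts from records in which the gloss of a word form is repeated at every occurrence, then splits them so that each gloss is stored exactly once. The record layout, the sample data, and the function names `normalize` and `gloss_of` are assumptions made for this example only; real normalization proceeds through formally defined normal forms, as the references above describe.

```python
# Unnormalized data: the gloss for "dog" is stored once per occurrence.
occurrences_flat = [
    {"text": "T1", "position": 0, "form": "dog", "gloss": "canine"},
    {"text": "T1", "position": 5, "form": "dog", "gloss": "canine"},
    {"text": "T2", "position": 2, "form": "cat", "gloss": "feline"},
]


def normalize(flat):
    """Split flat records into a lexicon (one gloss per form) and occurrences."""
    lexicon = {}       # form -> gloss, each gloss stored exactly once
    occurrences = []   # (text, position, form); the form acts as the key
    for row in flat:
        lexicon.setdefault(row["form"], row["gloss"])
        occurrences.append((row["text"], row["position"], row["form"]))
    return lexicon, occurrences


def gloss_of(occurrence, lexicon):
    """Look up the gloss of an occurrence through the shared lexicon."""
    return lexicon[occurrence[2]]


lexicon, occurrences = normalize(occurrences_flat)

# A correction is now made in one place and is seen by every occurrence.
lexicon["dog"] = "domestic canine"
print([gloss_of(o, lexicon) for o in occurrences])
```

The payoff is consistency: in the flat version, correcting a gloss means finding and updating every copy, and a missed copy leaves the database contradicting itself.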
Some journals:
Object-oriented databases
Object-oriented databases are a relatively recent development.
They have the advantage of inherently supporting all of
requirements 2 through 5. This section
first offers definitions of some of the
concepts of object-oriented databases, with
pointers to resources where you can learn
more. It then introduces CELLAR, a general-purpose object-oriented
database system that has been specifically built
to support requirements 1 and 6 as well.
Concepts
These are some of the key concepts of object-oriented databases:
-
object-oriented database
-
A database system which models entities in the real world as objects and
follows the object-oriented paradigm of programming.
-
object-oriented
-
A modern paradigm of programming which models information in terms of
objects. Computation occurs when one object receives
a message from another asking it to perform one of its built-in operations.
The object-oriented approach, in which the data and the program behavior
are encapsulated in the objects, contrasts with the conventional approach
to programming, in which a program operates on data which is completely separate.
-
object
-
The fundamental unit of information modeling in the object-oriented paradigm.
There is a one-to-one correspondence between objects in the data model and
the entities in the real world which are being modeled. (This is not true
of data modeling in a relational database system; all of the information
about a single entity in the real world may be scattered throughout
many tables of a normalized database.) An object
stores state information (variously called properties, attributes, or instance
variables; these are like the fields of a database record). It
also stores behavioral information (typically called methods) about
what computations can be performed on an instance of the object. The information
stored in an object is encapsulated in that it is not visible directly; it
can only be seen by sending a message to the object which asks it to perform
one of its methods.
-
object-oriented analysis
-
The process of analyzing a problem domain in order to build a formal model
that can serve as the basis for an object-oriented implementation of it.
The main outcome is a description of the classes of objects in
the problem domain, along with the properties, behaviors, and relationships
of each.
-
Booch, Grady. (1994) Object-oriented analysis and design with applications,
2nd ed. Benjamin/Cummings Publishing Co. (An online overview.)
-
Object Modeling Technique (OMT), described in Rumbaugh, James and others
(1991) Object-Oriented Modeling and Design, Prentice Hall. (An online
overview. Summary of notation.)
-
UML (the Unified Modeling
Language) fuses the concepts of Booch and OMT.
-
Object-Oriented
Analysis, an online tutorial
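The definitions of object, state, method, and message given above can be sketched in a few lines of Python. The class `Word`, its methods, and the sample lexicon are hypothetical and are not taken from CELLAR or any other system named here. Note also that Python signals encapsulation only by convention (a leading underscore), whereas some object-oriented systems enforce it.

```python
class Word:
    """One object per real-world entity: everything about a word lives here."""

    def __init__(self, form, morphemes):
        # State (properties / attributes / instance variables).
        # The leading underscore marks these as private by convention.
        self._form = form
        self._morphemes = list(morphemes)

    # Behavior (methods): the only intended way to get at the state.
    def form(self):
        return self._form

    def morpheme_count(self):
        return len(self._morphemes)

    def gloss(self, lexicon):
        """Build a hyphen-joined gloss, using '?' for unknown morphemes."""
        return "-".join(lexicon.get(m, "?") for m in self._morphemes)


# "Sending a message" to the object corresponds to calling one of its methods.
word = Word("dogs", ["dog", "s"])
print(word.form())                              # dogs
print(word.morpheme_count())                    # 2
print(word.gloss({"dog": "dog", "s": "PL"}))    # dog-PL
```

In contrast with a normalized relational design, the form and the morpheme sequence of a single word are held together in one object, and callers never need to know how they are stored.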
CELLAR: A multilingual object-oriented database system
CELLAR (Computing Environment for
Linguistic, Literary, and Anthropological Research) is a multilingual
object-oriented database system that has been developed by the
Summer Institute of Linguistics to specifically
meet the six requirements for a linguistic computing
environment.
CELLAR lies at the heart of SIL's
LinguaLinks product, an electronic
performance support system (EPSS) for field linguists. It provides both the
object database for storing user data and the programming language for
implementing the applications to manage and otherwise manipulate the data.
CELLAR is not currently packaged as a product in its own right; rather,
the full data modeling system and programming language are included as part
of the LinguaLinks product.
These are some articles that have been published about CELLAR:
-
Rettig, Marc, Gary F. Simons, and John V. Thomson. (1993) Extended
objects. Communications of the ACM, 36(8):19-24.
-
Simons, Gary F. (1997) Conceptual modeling versus visual modeling:
a technological key to building consensus. Computers and the
Humanities, 30:303-319. (The original working paper is available electronically.)
-
Simons, Gary F. (In press) Multilingual data processing in the CELLAR
environment. To appear in John Nerbonne (ed.), Linguistic
Databases. Stanford, CA: Center for the Study of Language and Information.
(The original working paper is available electronically.)