Using Computers in Linguistics: A Practical Guide

The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Online Appendix:
Using Databases


Using Databases to Represent Linguistic Data

Database systems can be used to represent linguistic data. This appendix gives pointers to resources on three kinds of databases: special-purpose linguistic databases, relational databases, and object-oriented databases.

Special-purpose linguistic databases

Some linguists have solved the problem of representing linguistic data by building special-purpose linguistic database programs. The following are three programs developed by the Summer Institute of Linguistics that specialize in interlinear text analysis and lexical data management:

  • IT, the Interlinear Text processor, for DOS (1987; version 1.2, 1992) and Macintosh (1988; version 1.01r7, 1992)
  • The linguist's Shoebox: an integrated data management and analysis tool, for DOS (1990; version 2.0, 1993), Windows (version 3.0, 1996), and Macintosh (version 3.0, 1997)
  • LinguaLinks, an electronic productivity support system for field language workers (see Linguistics Workshop for data management tools), for Windows (1996)

Relational databases

Relational databases are the most popular kind of database in use today, but they fall short when handling linguistic data. In terms of the requirements proposed in this chapter of the book, relational databases excel at handling the multidimensional and integrated nature of linguistic data (requirements 4 and 5), but handle poorly the sequential and hierarchical nature of linguistic data (requirements 2 and 3).
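
To make the point about sequence concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (word, sentence_id, position, form, gloss) are invented for illustration and do not come from the book. Because the rows of a relational table have no inherent order, the order of words in a sentence has to be stored as an explicit position column and restored with ORDER BY on every query. The parallel form and gloss columns, by contrast, fit the relational model naturally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per word token of an interlinear text.
conn.execute("""
    CREATE TABLE word (
        sentence_id INTEGER NOT NULL,
        position    INTEGER NOT NULL,  -- explicit order: rows are an unordered set
        form        TEXT NOT NULL,
        gloss       TEXT NOT NULL,
        PRIMARY KEY (sentence_id, position)
    )
""")

# Rows are deliberately inserted out of order.
rows = [
    (1, 3, "barks", "bark-3SG"),
    (1, 1, "the", "DEF"),
    (1, 2, "dog", "dog"),
]
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", rows)


def sentence_words(connection, sentence_id):
    """Return the word forms of a sentence in their stored sequence."""
    cursor = connection.execute(
        "SELECT form FROM word WHERE sentence_id = ? ORDER BY position",
        (sentence_id,),
    )
    return [row[0] for row in cursor]


sentence = sentence_words(conn, 1)
print(sentence)
```

Inserting a word in the middle of a sentence would mean renumbering every later position, which is one reason sequential data is awkward in this model.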

Here are some research projects which have extended the relational model to deal with these requirements for handling text:

  • Stonebraker, Michael, Heidi Stettner, Nadene Lynn, Joseph Kalash, and Antonin Guttman. (1983) ‘Document processing in a relational database system,’ ACM Transactions on Office Information Systems, 1(2):143-188.
  • Text/Relational Database Management System Project, Centre for the New Oxford English Dictionary and Text Research, University of Waterloo. Provides PostScript versions of many publications.

An important notion from relational database theory is normalization.  This is the process of organizing a database in such a way that no piece of information occurs more than once in the database.

  • Database Normalization Basics, article #Q100139 from Microsoft Technical Support Knowledge Base
  • Database Normalization from Reid Software Development
  • Stages of Normalization, by Oliver Burmeister, Swinburne University of Technology
  • Smith, Henry C. (1985) ‘Database design: composing fully normalized tables from a rigorous dependency diagram.’ Communications of the ACM, 28(8):826-838 (online review) describes an easy-to-use methodology.
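
As a rough illustration of normalization as defined above, the following sketch uses Python's built-in sqlite3 module. The schema (part_of_speech and lexeme tables) and the sample data are hypothetical. A flat list in which the full name of each part of speech is repeated for every lexeme is split into two tables, so that each name is stored exactly once and can be corrected in one place.

```python
import sqlite3

# Flat, unnormalized data: the full name "noun" is repeated for every noun.
flat_lexicon = [
    ("dog", "n", "noun"),
    ("cat", "n", "noun"),
    ("run", "v", "verb"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part_of_speech (
        pos_id       INTEGER PRIMARY KEY,
        abbreviation TEXT NOT NULL UNIQUE,
        full_name    TEXT NOT NULL
    );
    CREATE TABLE lexeme (
        lexeme_id INTEGER PRIMARY KEY,
        form      TEXT NOT NULL,
        pos_id    INTEGER NOT NULL REFERENCES part_of_speech(pos_id)
    );
""")

for form, abbreviation, full_name in flat_lexicon:
    # Store each part of speech only once; later duplicates are ignored.
    conn.execute(
        "INSERT OR IGNORE INTO part_of_speech (abbreviation, full_name) VALUES (?, ?)",
        (abbreviation, full_name),
    )
    pos_id = conn.execute(
        "SELECT pos_id FROM part_of_speech WHERE abbreviation = ?",
        (abbreviation,),
    ).fetchone()[0]
    conn.execute("INSERT INTO lexeme (form, pos_id) VALUES (?, ?)", (form, pos_id))

pos_count = conn.execute("SELECT COUNT(*) FROM part_of_speech").fetchone()[0]

# A single update now changes the name for every lexeme that refers to it.
conn.execute("UPDATE part_of_speech SET full_name = 'common noun' WHERE abbreviation = 'n'")

joined = conn.execute("""
    SELECT lexeme.form, part_of_speech.full_name
    FROM lexeme JOIN part_of_speech USING (pos_id)
    ORDER BY lexeme.form
""").fetchall()
print(pos_count, joined)
```

The cost of this design is that reading a complete entry requires a join, which is the scattering of information across tables that the discussion of objects below contrasts with.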

Object-oriented databases

Object-oriented databases are a relatively recent development. They have the advantage of inherently supporting all of requirements 2 through 5. This section first offers definitions of some of the concepts of object-oriented databases, with pointers to resources where you can learn more. It then introduces CELLAR, a general-purpose object-oriented database system that has been specifically built to support requirements 1 and 6 as well.

Concepts

These are some of the key concepts of object-oriented databases:

object-oriented database
A database system which models entities in the real world as objects and follows the object-oriented paradigm of programming.
object-oriented
A modern paradigm of programming which models information in terms of objects. Computation occurs when one object receives a message from another asking it to perform one of its built-in operations. The object-oriented approach, in which the data and the program behavior are encapsulated in the objects, contrasts with the conventional approach to programming, in which a program operates on data which is completely separate.
object
The fundamental unit of information modeling in the object-oriented paradigm. There is a one-to-one correspondence between objects in the data model and the entities in the real world which are being modeled. (This is not true of data modeling in a relational database system; all of the information about a single entity in the real world may be scattered throughout many tables of a normalized database.) An object stores state information (variously called properties, attributes, or instance variables; these are like the fields of a database record). It also stores behavioral information (typically called methods) about what computations can be performed on an instance of the object. The information stored in an object is encapsulated in that it is not visible directly; it can only be seen by sending a message to the object which asks it to perform one of its methods.
object-oriented analysis
The process of analyzing a problem domain in order to build a formal model that can serve as the basis for an object-oriented implementation of it.  The main outcome is a description of the classes of objects in the problem domain, along with the properties, behaviors, and relationships of each.
  • Booch, Grady. (1994) Object-oriented analysis and design with applications, 2nd ed. Benjamin/Cummings Publishing Co. (An online overview.)
  • Object Modeling Technique (OMT), described in Rumbaugh, James  and others (1991) Object-Oriented Modeling and Design, Prentice Hall.  (An online overview. Summary of notation.)
  • UML (the Unified Modeling Language) fuses the concepts of Booch and OMT.
  • Object-Oriented Analysis, an online tutorial
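
The concepts defined above (objects, properties, methods, messages, and encapsulation) can be sketched in a few lines of Python. This is only an illustration: the class names LexicalEntry and Sense and their methods are invented for the example and are not taken from CELLAR or any other system. Note also that Python marks attributes as private only by the leading-underscore convention; it does not enforce encapsulation the way some object-oriented systems do.

```python
class Sense:
    """One meaning of a lexical entry."""

    def __init__(self, gloss, part_of_speech):
        # State (properties) is kept private by convention.
        self._gloss = gloss
        self._part_of_speech = part_of_speech

    def describe(self):
        """Behavior (a method): answer a request to describe this sense."""
        return f"{self._gloss} ({self._part_of_speech})"


class LexicalEntry:
    """A single object that corresponds to one real-world dictionary entry."""

    def __init__(self, form):
        self._form = form
        self._senses = []  # ordered list of Sense objects owned by this entry

    def add_sense(self, gloss, part_of_speech):
        self._senses.append(Sense(gloss, part_of_speech))

    def sense_count(self):
        return len(self._senses)

    def summary(self):
        # The entry sends the "describe" message to each of its senses
        # instead of reading their internal state directly.
        senses = "; ".join(sense.describe() for sense in self._senses)
        return f"{self._form}: {senses}"


entry = LexicalEntry("bank")
entry.add_sense("financial institution", "n")
entry.add_sense("edge of a river", "n")
summary = entry.summary()
print(summary)
```

Everything known about the entry lives in one object, with its senses held in order inside it, rather than being spread over several tables as in a normalized relational design.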

CELLAR: A multilingual object-oriented database system

CELLAR (Computing Environment for Linguistic, Literary, and Anthropological Research) is a multilingual object-oriented database system that has been developed by the Summer Institute of Linguistics to specifically meet the six requirements for a linguistic computing environment.

CELLAR lies at the heart of SIL's LinguaLinks product, an electronic performance support system (EPSS) for field linguists. It provides both the object database for storing user data and the programming language for implementing the applications to manage and otherwise manipulate the data.  CELLAR is not currently packaged as a product in its own right; rather, the full data modeling system and programming language are included as part of the LinguaLinks product. 

These are some articles that have been published about CELLAR:

  • Rettig, Marc, Gary F. Simons, and John V. Thomson. (1993) ‘Extended objects.’ Communications of the ACM, 36(8):19-24.
  • Simons, Gary F. (1997) ‘Conceptual modeling versus visual modeling: a technological key to building consensus.’ Computers and the Humanities, 30:303-319. (The original working paper is available electronically.)
  • Simons, Gary F. (In press) ‘Multilingual data processing in the CELLAR environment.’ To appear in John Nerbonne (ed.), Linguistic Databases. Stanford, CA: Center for the Study of Language and Information. (The original working paper is available electronically.)


This page is part of the preliminary online appendices for the book
Using Computers in Linguistics: A Practical Guide, 1998 (Routledge).

Last modified: May 12, 1997