This is a review of some of my old research made in order to emphasize the idea of achieving proper balance between bottom-up and top-down approaches to analysis of text, both written and spoken.
The concepts of bottom-up and top-down analysis are taken from compiler technology, where both are well established techniques for parsing of programming languages.
In bottom-up parsing one first recognizes objects represented by smaller portions of text, identifies their syntactic classes, and then one combines results obtained for smaller objects to identify bigger objects composed of those smaller objects; so one proceeds until the whole text (a compilation unit) is recognized as some kind of language entity.
In top-down approach one has an idea about what the object is going to be like before one proceeds to parse it, e.g., knows that the whole text is a program. Then, after some analysis one can figure out what are the constituents, and then proceed to the constituents, etc., until smallest units are reached.
In programming languages both types of parsers are based on complete formal definitions of the language, and the difference between the two approaches is not very significant anyway. This is obviously not the case for natural languages, and due to incompleteness of our descriptions the two approaches might be needed to complement each other.
Bottom-up approach seems to be most popular in analysis of written text. Normally this analysis starts with surface-level entities like words and morphemes, which, at the same time, are smallest in size. Manipulation of entities of deeper levels, like syntactic constructs (which simultaneously represent bigger portions of the text) becomes possible after the lower level information is known.
However this approach is unlikely to succeed with speech analysis. Too little can be done with acoustic signals to identify them as lingustic entities on their own. On the other hand, if one has some preliminary idea of what this could be like a more or less accurate match can be found. That's why many speech recognition projects are restricted to small vocabularies.
Some of my previous research was connected with identification of specific words in samples of connected speech. A modification of the technique known as dynamic programming was used. The standard use of dynamic programming was to match a segment against another segment as a whole. The modification I used was to find a match for a smaller segment (a word) within a bigger segment, representing an utterance; the usual minimization procedure was modified to determine the most likely location of the matching subsegment within an utterance [1]. Later this approach was extended to recognition of sentences in a very restricted domain, where syntax could be represented by a finite-state grammar. The minimization procedure had to determine a path in the state diagram which matched best the whole sentence. (With a small vocabulary the results were satisfactory for a single speaker.) This work was clearly based on a top-down approach.
In analysis of written text bottom-up approach is probably fine for smaller units like words. In syntactic parsing various local "parsings" for two or three words proliferate, and many of them have nothing to do with the structure the writer had in mind. Special mathematical techniques are used to cope with the possibility of combinatorial explosion. However the interesting thing is what linguistic methods can be used for the same purpose. The most important of them is defining "syntax frames" for words that can predict syntactically linked words; the prediction cannot be complete, but usually some morphological features, semantic class, or position are known. This is obviously a manifestation of top-down approach.
The experiments in natural language understanding (for a restricted domain) carried out by my group [2,3] used a technique that gave much room to top-down approach. Indeed, the only parts which were strictly bottom-up were dictionary and morphology, and when they failed to produce a unique result, several variants were submitted for further processing. The work was based on a representation of data which we called "association nets" [4], essentially the same structure as semantic nets known from artificial intelligence research, but with nodes representing specific objects rather than generic concepts; this all was augmented by a kind of object oriented approach with procedures attached to nodes and some procedural form of property inheritance (so this technique as a whole turns out to be as much related to PROLOG as to SMALLTALK).
The important thing was that this allowed simultaneous representation of linguistic units of various levels. Elements of morphological, syntactic and semantic/pragmatic representations could coexist and there was no need to destroy a lower level representation once a higher level structure had been obtained. Thus semantic evidence could direct syntactic processing, or vice versa.
The representation method looks as follows. Consider a simple sentence, e.g., "John loves Mary". It consists of three words, and each of them will be represented as a node in an association net. Each node will carry grammatical attributes of the word, and there will be links representing the dependency structure, labeled "subject", "direct object", "master" (for dependent words). A parallel structure will be built of "semantic" nodes, representing, in this case, real objects, like John, Mary, and the instance of love, with their corresponding links, labeled "lover" and "beloved". Each word has a link to the corresponding real object. But the two structures don't need to be strictly parallel. E.g., if the following sentence mentions John again, the second occurrence will be represented by a distinct word node which will refer to the same semantic node. (In Russian, due to peculiarities of agreement in gender, one more level might be useful for some cases, but we didn't go that far.)
Now, if we are going to build such a structure out of words found in a sentence, we have to start from prefabricated nets for single words. To get an idea of what such "dictionary nets" could be like, consider the same sentence again, but this time let us think that the only word we know is the verb, "loves". We don't know that the subject is John, but still we can represent the node with the attributes and links that can be predicted from the verb alone. In this simple case the dictionary net for the verb will have at least as many nodes as the net representing the complete sentence. (In fact, a dictionary can produce a number of alternative nets, or alternative parts of them, and a special technique is used to keep track of that.)
Given the nets for each word in the sentence, one has to combine them into a single net by merging nodes from different words; e.g., the predicted node for the subject of "loves" will be merged with the "main" node for "John", and this will propagate to the corresponding semantic nodes. If our aim is understanding the text (e.g., by producing a response) the only significant part of the final structure would be the semantic part (not exactly so if we build a verbal response); the function of the other level of the language is to control the process of merging separate nets into a single structure. (By the way, this purely syntactic process may be used to elucidate some grammatical "meanings", albeit in unusual terms.)
The dictionary nets contain some auxiliary nodes providing for the activity needed in this process; they are called hypotheses (or predictions). Each hypothesis has a link to its "goal node" (the one for which it tries to find a match in other words), the "base node" (the word to which the dictionary net belongs), and to some procedures. The hypotheses are put on a queue and called in turn (there is some control of the order both from inside and outside the hypotheses). There are some generic procedures to search the matching word (e.g., by syntax class, sequentially to the right of the base) and to verify the matching candidate (e.g., by a projectivity test), but a word can have a "custom" procedure if it deviates from common patterns. If, at some stage, the attempted match fails verification, the whole attempt is backtracked.
Projectivity test can be used as an illustration of more complex relationships between the hypotheses. If a remote syntactic link between two words is attempted, one has to verify that all words in between are subordinated syntactically, directly or not, to one of the two words. At the moment when the remote link is attempted some links of the intermediate words can be missing; there can be a good reason to build the link early, e.g., when one of the two words is the only candidate with the proper semantic class. At verification stage the hypothesis will call to the whole queue of hypotheses with a message indicating the range of words to be tested for links (words out of the range are not supposed to respond), and each of the hypotheses will work under the assumption that the link under verification already exists and no other link may cross it. (This projectivity requirement is not absolutely compulsory, and some hypotheses may have procedures disregarding the requirement.)
This top-down organization is completed by the "top word" hypothesis imposed from outside; this hypothesis is responsible for finding the root of the dependency tree, usually the predicate.
To conclude, one can observe that neither pure bottom-up approach nor pure top-down approach can work well, except in simple cases. Usually we cannot start with a prediction for the whole structure, but once some parts have been recognized in a bottom-up fashion, they can generate predictions to control parsing of other elements, etc.
1. В.И.Галунов, М.И.Орлова, Г.С.Цейтин, Н.Н.Ягунова. Организация процесса понимания речи через распознавание отдельных слов. XIII Всесоюзн. школа-семинар "Автоматическое распознавание слуховых образов", тезисы докладов и сообщений, ч. II, Новосибирск, 1984, 124-125.
2. М.М.Железняков, Т.Н.Невлева, И.М.Новицкая, Л.Н.Смирнова, Г.С.Цейтин. Опыт построения модели типа "текст -> действительность" с использованием ассоциативных сетей. В сб.: Машинный фонд русского языка: предпроектные исследования, Институт русского языка АН СССР, М., 1988, 140-167.
3. G.S.Tseytin. On some mechanisms of representation of meaning in natural languages. To appear in "The Prague Bulletin of Mathematical Linguistics" (originally presented to the Conference on Logic Programming and Automatic Reasoning, St.Petersburg, 1991).
4. Г.С.Цейтин. Программирование на ассоциативных сетях. В сб.: ЭВМ в проектировании и производстве, вып.2, под ред. Г.В.Орловского, "Машиностроение", Ленингр. отд., Л., 1985, 16-48.