A Companion to Chomsky. Группа авторов

A Companion to Chomsky

StartSet normal upper A comma normal upper V EndSet

chased

chased dogs empty-set

dogs dogs sleep chased big dogs chased

Like forward sets, the inside set of a string (relative to a particular CFG) can be computed recursively from the inside sets of the string's parts. Recall that the forward set of a string x 1 x 2 x 3 x 4 , for example, can be computed by considering the states in the forward set of , and asking which of these states have an outgoing transition emitting x 4 . The situation for inside sets is similar. Considering only rules of the form “ upper A right-arrow upper B upper C ” for ease of exposition: to compute the inside set of x 1 x 2 x 3 x 4 , we must consider the inside set of x 1 x 2 and ask whether any of its member nonterminals can be combined with any nonterminal in the inside set of x 3 x 4 ; but also consider combinations drawing one nonterminal from the inside set of x 1 and one from that of , plus combinations drawing one from the inside set of x 1 x 2 x 3 and one from that of x 4 . (The inside set of a length‐one string , we can assume, is just the set of nonterminals upper A for which the grammar contains the rule upper A right-arrow x .)¹⁵ The upshot is that to work out the inside set of x 1 x 2 x 3 x 4 , it suffices to know the inside sets of all its substrings. This is more complex than the situation for forward sets because there are multiple “splits” to consider (as opposed to only splitting into and x 4 ); but it is analogous in the important sense that inside sets tell us everything there is to know about how the grammar treats the relevant subexpression.

This freedom to split strings into pieces in arbitrary ways underlies a CFG's ability to create classes of intersubstitutable infixes. For example, consider the first tree in (5): the fact that we can split the last word off dogs dogs chase sleep but then split the first word off dogs dogs chase leads to the end result that the overall string is split into an infix sans-serif-italic d o g s c h a s e and its surroundings sans-serif-italic d o g s bar bar sans-serif-italic s l e e p . The nonterminal symbol RC serves as a hinge or pivot that links together these two pieces – dogs chase, which can appear “inside” RC, and , which can appear “outside” it – just as the state B in Figure 5.2, for example, links together someone ran and and ran really quickly.

In an important sense, what distinguishes CFGs from FSGs is not directly an issue of hierarchical tree structure or constituency – the tree in Figure 5.3 is hierarchical, expressing part‐whole relationships among constituents, in at least one sense – but rather, the distinction between being able to make infix‐circumfix splits and being restricted to prefix‐suffix splits. Specifically, the stringsets that can be generated by a CFG but not by any FSG are those for which every CFG is “self‐embedding” (Chomsky, 1959, p. 167), i.e. those where some string can be substituted for a proper infix of itself, such as the way aabb can be substituted for ab in the sans-serif-italic a Superscript n Baseline sans-serif-italic b Superscript n stringset. The relevant equivalence classes are still classes of strings, and are based on the way substrings fit into larger strings.

What sorts of stringsets cannot be generated by the infix‐based mechanisms of a CFG? The comparison between the stringsets upper L Subscript pal and upper L Subscript repeat defined in (6) is instructive (Chomsky and Miller, 1963, pp. 285–287). Some notation: w Superscript upper R is the reverse of a string , and StartSet sans-serif-italic a comma sans-serif-italic b EndSet Superscript plus is the set of all non‐empty strings using the symbols a and b. So upper L Subscript pal is the set of even‐length palindromes, and upper L Subscript repeat is the set of strings whose second half simply repeats the first half.

1 (6)

There is a CFG that generates upper L Subscript pal , shown in (7).

1 (7)S aSaS bSbS aaS bb

It may seem at first surprising, then, that no CFG can generate upper L Subscript repeat , since it is simply upper L Subscript pal without the

Скачать книгу