Formal languages summary by Joshua Flynn

Formal languages

This text is an outline of formal language theory. The text is split into blocks. There are 4 types of blocks: Definitions, Examples, Theorems and Discussions. Conventions in this text: N = {1, 2, 3, ...} and N0 = {0, 1, 2, 3, ...}.

Alphabets and words

Definition: Alphabet: Let Σ be a non-empty finite set. In the context of formal language theory Σ is referred to as an alphabet. Synonyms: Vocabulary

Example 1: The set {a, b, c} is an alphabet.

Example 2: The set {begin, end, while, do, if, then} is an alphabet similar to ones used in programming languages.

Example 3: The set ∅ fails to be an alphabet because an alphabet must be non-empty.

Definition: Subalphabet: Let Σ be an alphabet and let V ⊆ Σ. Then V is a subalphabet of Σ.

Definition: Letter: Let Σ be an alphabet. Then the elements of Σ are referred to as the letters of Σ. Synonyms: Symbol, Token

Example 1: Given the alphabet Σ = {1, 2, x, y}, x is a letter of Σ though xy is not.

Definition: Word: Let Σ be an alphabet. A finite sequence in Σ is a word over Σ. Synonyms: String

Example 1: Given the alphabet Σ = {5, a, 7, j, t}, ⟨a, 7, a, t⟩ is a word over Σ. If there is no confusion this is simply written 'a7at'.

Example 2: Given the alphabet Σ = {l, ll, lll}, ⟨ll, lll, ll⟩ is a word over Σ. The shorter form 'lllllll' would be ambiguous, so it is not used for this word.

Definition: i-th letter: Let Σ be an alphabet and x be a word over Σ. The i-th letter of x is the i-th term of x when viewed as a finite sequence over Σ. It is denoted x_i.

Definition: Word length: Let Σ be an alphabet and x be a word over Σ. The length of x is the cardinality of the domain of x when viewed as a sequence. It is denoted len(x) or lg(x) or |x|.

Example 1: Given the alphabet Σ = {j, p, r, k}, let x = 'rrkrjp'. Then len(x) = 6.

Definition: Word equality: Let Σ be an alphabet and let x and y be words over Σ. x and y are defined to be equal (denoted x = y) iff len(x) = len(y) and ∀i ∈ [1.. len(x)] : x_i = y_i.

Definition: Empty word: Let Σ be an alphabet and x be a word over Σ. x is an empty word iff len(x) = 0. It is denoted λ or ε. Synonyms: Empty string, Null word, Null string

Definition: Concatenation: Let Σ be an alphabet and let x and y be words over Σ. The concatenation of x with y is denoted xy in the literature (but is denoted x ◦ y in this text). We define x ◦ y = x if y = λ and x ◦ y = y if x = λ. Otherwise it is defined:

\[
(x \circ y)_i =
\begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y)
\end{cases}
\]

Definition: String power: Let Σ be an alphabet, x be a word over Σ and ◦ denote concatenation. The n-th string power of x, where n ∈ N0, is denoted x^n and defined inductively:

\[
x^n =
\begin{cases}
\lambda & \text{if } n = 0 \\
x^{n-1} \circ x & \text{if } n > 0
\end{cases}
\]

Synonyms: Product

Definition: Subalphabet product: Let Σ be an alphabet and let V, W ⊆ Σ. The subalphabet product (under concatenation) of V with W is denoted V W and defined: V W = {v ◦ w : v ∈ V ∧ w ∈ W}

Definition: Subalphabet power: Let Σ be an alphabet and let V ⊆ Σ. The n-th power of V, where n ∈ N0, is denoted V^n and defined inductively:

\[
V^n =
\begin{cases}
\{\lambda\} & \text{if } n = 0 \\
V^{n-1} V & \text{if } n > 0
\end{cases}
\]
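The definitions of concatenation, string power, subalphabet product and subalphabet power can be tried out with a minimal Python sketch. It assumes a word is modelled as a tuple of letters, so that multi-character letters stay unambiguous; the names `EMPTY`, `concat`, `power`, `set_product` and `set_power` are illustrative and not part of the text.

```python
# Assumption: a word is a tuple of letters, so the empty word λ is the empty tuple.
EMPTY = ()

def concat(x, y):
    """x ◦ y: the letters of x followed by the letters of y."""
    return x + y

def power(x, n):
    """x^0 = λ and x^n = x^(n-1) ◦ x."""
    return EMPTY if n == 0 else concat(power(x, n - 1), x)

def set_product(V, W):
    """V W = {v ◦ w : v ∈ V and w ∈ W}."""
    return {concat(v, w) for v in V for w in W}

def set_power(V, n):
    """V^0 = {λ} and V^n = V^(n-1) V."""
    return {EMPTY} if n == 0 else set_product(set_power(V, n - 1), V)

x = ("r", "r", "k")
y = ("r", "j", "p")
V = {("a",), ("b",)}  # subalphabet {a, b}, each letter viewed as a word of length 1

print(concat(x, y))            # the word 'rrkrjp' as a tuple of six letters
print(power(("l", "ll"), 2))   # letters may be multi-character, as in Σ = {l, ll, lll}
print(sorted(set_power(V, 2))) # the four words of length 2 over {a, b}
```

Tuples are used rather than Python strings so that a word such as ⟨ll, lll, ll⟩ keeps its letter boundaries.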

Notation: Some sources denote V^n as V_n.

Definition: Kleene star: Let Σ be an alphabet and let V ⊆ Σ. The Kleene star of V is denoted V∗ and defined:

\[
V^{*} = \bigcup_{i \in \mathbb{N}_0} V^{i}
\]

Synonyms: Kleene closure, Monoid closure

Definition: P-star: Let Σ be an alphabet. The P-star of Σ, denoted P(Σ∗), is the set of all languages over Σ. Hence it is the power set of the Kleene star of Σ.

Definition: Kleene plus: Let Σ be an alphabet and let V ⊆ Σ. The Kleene plus of V is denoted V+ and defined:

\[
V^{+} = \bigcup_{i \in \mathbb{N}} V^{i}
\]
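Since Σ∗ and Σ+ are infinite, code can only enumerate a finite slice of them. The following Python sketch lists all words up to a chosen length; the helper names and the cut-off parameter `max_len` are assumptions made for illustration.

```python
from itertools import product

def words_of_length(sigma, n):
    """Σ^n: all words of length n over sigma, as tuples of letters."""
    return {tuple(w) for w in product(sorted(sigma), repeat=n)}

def kleene_star_upto(sigma, max_len):
    """Finite slice of Σ*: the union of Σ^i for 0 <= i <= max_len."""
    return set().union(*(words_of_length(sigma, i) for i in range(max_len + 1)))

def kleene_plus_upto(sigma, max_len):
    """Finite slice of Σ+: the union of Σ^i for 1 <= i <= max_len."""
    return set().union(*(words_of_length(sigma, i) for i in range(1, max_len + 1)))

star = kleene_star_upto({"a", "b"}, 2)
plus = kleene_plus_upto({"a", "b"}, 2)
print(len(star), len(plus))  # 1 + 2 + 4 words versus 2 + 4 words
print(star - plus)           # only the empty word separates the two slices
```

The only difference between the two slices is the empty word, which matches the index sets N0 and N in the two unions.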

Definition: P-plus: Let Σ be an alphabet. The P-plus of Σ, denoted P(Σ+), is the power set of the Kleene plus of Σ.

Definition: Formal language: Let Σ be an alphabet and let V ⊆ Σ∗. Then V is a formal language over Σ.

Definition: Empty language: The empty language is the language containing no words. It is denoted ∅ (as it is the empty set, but in the context of formal language theory).

Definition: Trivial language: The trivial language is the language containing the empty word only. It is denoted {λ}.

Definition: Language product: Let Σ be an alphabet and let V and W be formal languages over Σ. Then the language product of V with W, denoted V W, is defined: V W = {x ◦ y : x ∈ V ∧ y ∈ W}

Note: The notation is the same as that of the subalphabet product, as the language product is an extension of the definition of subalphabet product and could only be defined after formal language was defined.

Definition: Language power: Let Σ be an alphabet and let V be a formal language over Σ. The n-th power of V, where n ∈ N0, is denoted V^n and defined inductively:

\[
V^n =
\begin{cases}
\{\lambda\} & \text{if } n = 0 \\
V^{n-1} V & \text{if } n > 0
\end{cases}
\]

Note: The notation is the same as that of the subalphabet power, as the language power is an extension of the definition of subalphabet power and could only be defined after formal language was defined.

Definition: Linguistic structure: Let Σ be an alphabet, V be a formal language over Σ and ◦ denote concatenation. (V, ◦) is a linguistic structure iff ∀(x, y) ∈ V × V : x ◦ y ∈ V. That is, V is closed under ◦.

Note: This definition was invented for this text. Every linguistic structure is an algebraic structure, hence the name.

Theorem: [The empty word is unique]: Let Σ be an alphabet and let λ and λ′ be words over Σ of length 0. Then λ = λ′. Hence we can speak of the empty word.

Proof: From the definition of word equality, we have len(λ) = len(λ′) = 0 and ∀i : λ_i = λ′_i holds vacuously. Hence the result. ∎

Theorem: [Length of concatenation]: Let Σ be an alphabet, x and y be words over Σ and ◦ denote concatenation. Then len(x ◦ y) = len(x) + len(y).

Proof: If x = λ or y = λ then the result follows immediately from the definition of concatenation with the empty word and the definition of word length. Otherwise, from the definition of concatenation, x ◦ y is a mapping from [1.. len(x) + len(y)] to Σ. From the definition of the length of a word, len(x ◦ y) is the cardinality of the domain of x ◦ y when viewed as a sequence. Hence len(x ◦ y) = len(x) + len(y). ∎
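The language product, the language power and the closure condition of a linguistic structure can be explored with a short Python sketch. It assumes single-character letters, so that a word can be a Python `str` and λ is `""`; the names `lang_product`, `lang_power` and `closed_on_sample` are hypothetical helpers. A check on a finite sample can refute closure but cannot prove it.

```python
def lang_product(V, W):
    """V W = {x ◦ y : x ∈ V and y ∈ W}."""
    return {x + y for x in V for y in W}

def lang_power(V, n):
    """V^0 = {λ} and V^n = V^(n-1) V, computed iteratively."""
    result = {""}
    for _ in range(n):
        result = lang_product(result, V)
    return result

def closed_on_sample(member, sample):
    """Check x ◦ y ∈ V for all sampled x, y ∈ V, where V is given by a predicate."""
    in_v = [w for w in sample if member(w)]
    return all(member(x + y) for x in in_v for y in in_v)

V = {"a", "ab"}
W = {"", "b"}
print(sorted(lang_product(V, W)))  # 'ab' arises in two ways but is kept once
print(sorted(lang_power(V, 2)))

# [Length of concatenation] on every pair drawn from V and W
print(all(len(x + y) == len(x) + len(y) for x in V for y in W))

sample = ["", "a", "b", "ab", "ba", "aab", "abab"]
print(closed_on_sample(lambda w: len(w) % 2 == 0, sample))  # even-length words
print(closed_on_sample(lambda w: len(w) == 1, sample))      # words of length 1
```

The even-length words pass the sampled closure check, while the words of length 1 fail it because the concatenation of two of them has length 2.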

Theorem: [The empty word is a two sided identity]: Let Σ be an alphabet, x be a word over Σ, λ be the empty word over Σ and ◦ denote concatenation. Then x ◦ λ = x and λ ◦ x = x. That is, λ is a two-sided identity element of concatenation.

Proof: Follows immediately from the definition of concatenation with the empty word. ∎

Theorem: [Concatenation is associative]: Let Σ be an alphabet. Let x, y, z be words over Σ and let ◦ denote concatenation. Then (x ◦ y) ◦ z = x ◦ (y ◦ z). That is, concatenation is associative.

Proof: If any of x, y or z is λ then the result follows immediately from [The empty word is a two sided identity]. Otherwise, from the definition of word equality it must first be shown that the lengths of the words are equal. From [Length of concatenation]:

len((x ◦ y) ◦ z) = len(x ◦ y) + len(z) = len(x) + len(y) + len(z)

len(x ◦ (y ◦ z)) = len(x) + len(y ◦ z) = len(x) + len(y) + len(z)

Then it must be shown that ∀i : ((x ◦ y) ◦ z)_i = (x ◦ (y ◦ z))_i, which will be demonstrated by repeatedly using the definition of concatenation and [Length of concatenation].

\[
\begin{aligned}
((x \circ y) \circ z)_i
&= \begin{cases}
(x \circ y)_i & \text{if } 1 \le i \le \operatorname{len}(x \circ y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x \circ y) + \operatorname{len}(z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x \circ y) + \operatorname{len}(z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x \circ y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x) + \operatorname{len}(y \circ z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
(y \circ z)_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y \circ z)
\end{cases} \\
&= (x \circ (y \circ z))_i
\end{aligned}
\]

Hence the result. ∎

Theorem: [Language product is associative]: Let Σ be an alphabet. Let X, Y, Z be formal languages over Σ. Then (XY)Z = X(YZ). That is, the language product is associative.

Proof: (XY)Z = {(x ◦ y) ◦ z : x ∈ X ∧ y ∈ Y ∧ z ∈ Z}. From [Concatenation is associative]: = {x ◦ (y ◦ z) : x ∈ X ∧ y ∈ Y ∧ z ∈ Z} = X(YZ). Hence the result. ∎

Theorem: [Linguistic structure underlying set]: Let (V, ◦) be a linguistic structure with V non-empty. Then V is either an infinite set or {λ}, where λ denotes the empty word.

Proof: Suppose (V, ◦) is a linguistic structure and V is a finite set such that there exists an x ∈ V where x ≠ λ. Then V contains a word y of greatest length m. As V is closed under ◦ we have y ◦ x ∈ V, but len(y ◦ x) > len(y) = m, which contradicts the assumption that m is the greatest length.

If V = {λ} then from [The empty word is a two sided identity] we have that V is closed under ◦ and so ({λ}, ◦) is a linguistic structure. ∎

Theorem: [Kleene plus is a linguistic structure]: Let Σ be an alphabet, Σ+ be the Kleene plus of Σ and ◦ denote concatenation. Then (Σ+, ◦) is a linguistic structure.

Proof: As Σ+ ⊆ Σ∗ we have that Σ+ is a formal language over Σ. From the definition of Σ+ it follows: x ∈ Σ+ ⇔ len(x) > 0 and ∀i : x_i ∈ Σ. Let x, y ∈ Σ+. As len(x) > 0 and len(y) > 0, [Length of concatenation] gives len(x ◦ y) > 0. From the definition of concatenation, ∀i : x_i ∈ Σ and ∀i : y_i ∈ Σ so ∀i : (x ◦ y)_i ∈ Σ. Hence Σ+ is closed under ◦ and (Σ+, ◦) is a linguistic structure. ∎

Theorem: [Kleene plus is a semigroup]: Let Σ be an alphabet, Σ+ be the Kleene plus of Σ and ◦ denote concatenation. Then (Σ+, ◦) is a semigroup.

Proof: This follows immediately from [Kleene plus is a linguistic structure] and [Concatenation is associative]. ∎

Theorem: [Kleene star is a linguistic structure]: Let Σ be an alphabet, Σ∗ be the Kleene star of Σ and ◦ denote concatenation. Then (Σ∗, ◦) is a linguistic structure.

Proof: As Σ∗ ⊆ Σ∗ we have that Σ∗ is a formal language over Σ. From [Kleene plus is a linguistic structure] we have that Σ∗ − {λ} is closed under ◦. Including λ in the underlying set, we have from [The empty word is a two sided identity] that Σ∗ is closed under ◦ and hence (Σ∗, ◦) is a linguistic structure. ∎

Theorem: [Kleene star is a monoid]: Let Σ be an alphabet, Σ∗ be the Kleene star of Σ and ◦ denote concatenation. Then (Σ∗, ◦) is a monoid.

Proof: This follows immediately from [Kleene star is a linguistic structure], [Concatenation is associative] and [The empty word is a two sided identity]. ∎
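The monoid and semigroup laws above can be spot-checked on a finite slice of Σ∗. The Python sketch below assumes Σ = {a, b} with words as `str`; the helper `words_upto` and the variable names are illustrative. It also shows a finite language containing a non-empty word failing closure, in line with [Linguistic structure underlying set]. Such checks illustrate the theorems but do not replace the proofs.

```python
from itertools import product

def words_upto(sigma, max_len):
    """All words over sigma of length 0..max_len, as strings."""
    return ["".join(w) for n in range(max_len + 1) for w in product(sigma, repeat=n)]

words = words_upto("ab", 2)          # finite slice of Σ*
nonempty = [w for w in words if w]   # finite slice of Σ+

# Associativity and the two-sided identity λ = "" on the slice
assoc = all((x + y) + z == x + (y + z) for x in words for y in words for z in words)
identity = all(x + "" == x and "" + x == x for x in words)

# Concatenating two non-empty words never gives the empty word
plus_closed = all(len(x + y) > 0 for x in nonempty for y in nonempty)

# A finite language with a non-empty word is not closed under concatenation
finite = {"a", "aa"}
finite_closed = all(x + y in finite for x in finite for y in finite)

print(assoc, identity, plus_closed, finite_closed)
```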

Theorem: [Length is an epimorphism]: Let Σ be an alphabet, Σ∗ be the Kleene star of Σ and ◦ denote concatenation. Then the length function is an epimorphism from (Σ∗, ◦) to (N0, +).

Proof: The morphism property follows immediately from [Length of concatenation] as len(x ◦ y) = len(x) + len(y). Hence it remains to be shown that length is a surjective function. As a special case 0 has a pre-image as len(λ) = 0. Now let n ∈ N and let x ∈ Σ^n, which is non-empty because Σ is non-empty. As len(x) = n we have that ∀n ∈ N0 : ∃x ∈ Σ∗ : len(x) = n. Hence the result. ∎

Theorem: [Length is an isomorphism iff alphabet is singleton]: Let Σ be an alphabet, Σ∗ be the Kleene star of Σ and ◦ denote concatenation. Then the length function is an isomorphism from (Σ∗, ◦) to (N0, +) iff |Σ| = 1.

Proof: Sufficient condition: Suppose the length function is an isomorphism from (Σ∗, ◦) to (N0, +) and suppose |Σ| > 1. Let a and b be distinct letters of Σ and let n ∈ N. Then len(a^{n−1} ◦ a) = len(a^{n−1} ◦ b) = n and a^{n−1} ◦ a ≠ a^{n−1} ◦ b, so the length function is not injective, which contradicts the assumption that it was an isomorphism. Hence |Σ| = 1. Note that by the definition of an alphabet, |Σ| ≠ 0 (and there is no special case to consider for λ as it is a word, not a letter).

Necessary condition: From [Length is an epimorphism], it remains to be shown that |Σ| = 1 implies the length function is injective. That is, ∀x, y ∈ Σ∗ : len(x) = len(y) =⇒ x = y. Let x, y ∈ Σ∗ and let len(x) = len(y). As Σ contains only one letter, we have that ∀i : x_i = y_i. So from the definition of word equality x = y. Hence the length function is injective. ∎

Theorem: [Concatenation is a cancellable operation]: Let Σ be an alphabet, Σ∗ be the Kleene star of Σ and ◦ denote concatenation. Then concatenation is a cancellable operation. That is, ∀x, y, z ∈ Σ∗ : x ◦ z = y ◦ z =⇒ x = y and z ◦ x = z ◦ y =⇒ x = y.

Proof: The special case where x = λ or y = λ follows immediately from [The empty word is a two sided identity]. Let x, y, z ∈ Σ∗ with x ◦ z = y ◦ z. From the

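definition of concatenation:

\[
(x \circ z)_i =
\begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
z_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(z)
\end{cases}
\]

And

\[
(y \circ z)_i =
\begin{cases}
y_i & \text{if } 1 \le i \le \operatorname{len}(y) \\
z_{i - \operatorname{len}(y)} & \text{if } \operatorname{len}(y) < i \le \operatorname{len}(y) + \operatorname{len}(z)
\end{cases}
\]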

So, as x ◦ z = y ◦ z, [Length of concatenation] gives len(x) + len(z) = len(y) + len(z), hence len(x) = len(y), and comparing the letters at positions 1 ≤ i ≤ len(x) gives ∀i : x_i = y_i. Hence by the definition of word equality x = y and concatenation is right cancellable. The proof that concatenation is left cancellable follows similarly. ∎

Theorem: [Intersection of linguistic structures]: Let (V, ◦) and (W, ◦) be linguistic structures. Then (V ∩ W, ◦) is a linguistic structure.

Proof: Let x, y ∈ V ∩ W. As x, y ∈ V we have x ◦ y ∈ V. Also x, y ∈ W so x ◦ y ∈ W. By the definition of intersection: x ◦ y ∈ V ∧ x ◦ y ∈ W =⇒ x ◦ y ∈ V ∩ W. So: ∀x, y ∈ V ∩ W : x ◦ y ∈ V ∩ W. Hence (V ∩ W, ◦) is a linguistic structure. ∎

Theorem: [Language product is distributive over union]: Let Σ be an alphabet. Let V, W and Y be formal languages over Σ. Then V(W ∪ Y) = (V W) ∪ (V Y).

Proof: V(W ∪ Y) = {x ◦ y : x ∈ V ∧ y ∈ W ∪ Y}. By the definition of set union: = {x ◦ y : x ∈ V ∧ (y ∈ W ∨ y ∈ Y)}. Conjunction is distributive over disjunction: = {x ◦ y : (x ∈ V ∧ y ∈ W) ∨ (x ∈ V ∧ y ∈ Y)}. By the definition of language product and set union: = (V W) ∪ (V Y). Hence the result. ∎
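A small Python sketch can illustrate the last two theorems. It assumes words are `str` values and, for the closure part, languages over the one-letter alphabet {a} described by membership predicates; `lang_product`, `closed_on` and the example languages are illustrative choices. The sample also shows that the union of two closed languages need not be closed, which is why the closure result is stated for intersections.

```python
def lang_product(V, W):
    """V W = {x ◦ y : x ∈ V and y ∈ W}."""
    return {x + y for x in V for y in W}

# Distributivity of the language product over union on finite languages
V, W, Y = {"a", "ab"}, {"b"}, {"", "ba"}
lhs = lang_product(V, W | Y)
rhs = lang_product(V, W) | lang_product(V, Y)
print(lhs == rhs)

def closed_on(member, max_len):
    """Sampled closure check over words a^n with n <= max_len."""
    sample = ["a" * n for n in range(max_len + 1) if member("a" * n)]
    return all(member(x + y) for x in sample for y in sample)

even = lambda w: len(w) % 2 == 0       # {a^n : n even}, closed under ◦
triple = lambda w: len(w) % 3 == 0     # {a^n : n divisible by 3}, closed under ◦
both = lambda w: even(w) and triple(w)   # the intersection
either = lambda w: even(w) or triple(w)  # the union

print(closed_on(both, 12))    # the intersection passes the sampled check
print(closed_on(either, 12))  # the union fails: 'aa' ◦ 'aaa' has length 5
```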

Theorem: [P-star is commutative monoid under union]: Let Σ be an alphabet and P(Σ∗) be the P-star of Σ. Then (P(Σ∗), ∪) is a commutative monoid.

Proof: Closure: As V ⊆ Σ∗ ∧ W ⊆ Σ∗ =⇒ (V ∪ W) ⊆ Σ∗ for V, W ∈ P(Σ∗), we have that P(Σ∗) is closed under ∪. Associativity: Set union is associative. Identity: The empty language is a language over any alphabet and therefore is an element of P(Σ∗). The empty language is the empty set and for any set X: X ∪ ∅ = X. So the empty language is the identity element. Commutativity: Set union is commutative. Hence (P(Σ∗), ∪) satisfies all the defining properties of a commutative monoid. ∎

Theorem: [P-star forms additive rig with unity]: Let Σ be an alphabet, P(Σ∗) be the P-star of Σ and ◦L denote the language product operation. Then (P(Σ∗), ∪, ◦L) is an additive rig with unity. That is to say it satisfies all three of these conditions:
(1) (P(Σ∗), ∪) is a commutative monoid.
(2) ◦L is distributive over ∪.
(3) (P(Σ∗), ◦L) is a monoid.

Proof: (1) Follows directly from [P-star is commutative monoid under union]. (2) Follows directly from [Language product is distributive over union]. (3) P(Σ∗) is closed under ◦L as the concatenation of two words over Σ is a word over Σ, associativity follows from [Language product is associative], and the trivial language {λ} is the identity element by [The empty word is a two sided identity]. ∎

Definition: Infix: Let Σ be an alphabet and x and y be words over Σ. Then y is an infix of x, denoted y inf x, iff y = λ or all of the following conditions hold:
(1) len(y) ≤ len(x)
(2) ∃i ∈ [1.. len(x) − len(y) + 1] : y_1 = x_i
(3) ∀j ∈ [1.. len(y)] : y_j = x_{i+j−1}, for the same i as in (2)
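The infix conditions translate directly into code. The Python sketch below assumes words are `str` values with single-character letters and reads condition (3) as y_j = x_{i+j−1} for the starting position i of condition (2); the function names `is_infix` and `infix_set` are illustrative. Positions in the definition are 1-based, while Python indices are 0-based, hence the index shifts.

```python
def is_infix(y, x):
    """y inf x, following conditions (1)-(3) with 1-based positions i and j."""
    if len(y) == 0:              # the empty word is an infix of every word
        return True
    if len(y) > len(x):          # condition (1)
        return False
    for i in range(1, len(x) - len(y) + 2):   # i in [1 .. len(x) - len(y) + 1]
        # condition (3): y_j = x_{i+j-1}, which for j = 1 is condition (2)
        if all(y[j - 1] == x[i + j - 2] for j in range(1, len(y) + 1)):
            return True
    return False

def infix_set(x):
    """All infixes of x, obtained as contiguous slices."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

print(is_infix("7a", "a7at"))    # '7a' occurs starting at position 2
print(is_infix("at7", "a7at"))   # same letters, but not contiguous in this order
print(sorted(infix_set("aba")))
print(all(is_infix(y, "aba") for y in infix_set("aba")))
```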

Note: y is a non-trivial infix iff y ≠ x. y is a non-empty infix iff y ≠ λ. y is a proper infix iff it is both a non-trivial and non-empty infix. Synonyms: Subword

Definition: Infix set: Let Σ be an alphabet and x be a word over Σ. Then the infix set of x, denoted INFIX(x), is the set of all infixes of x.

Note: The proper infix set of x, denoted INFIXpr(x), is the set of all proper infixes of x.

Definition: Base alphabet: Let Σ be an alphabet and V be a formal language over Σ. Then the base alphabet of V is denoted B(V) and defined: B(V) = {x : len(x) = 1 ∧ ∃y ∈ V : x inf y}

Theorem: [Condition for concatenation commutativity]: Let (V, ◦) be a linguistic structure. Then ◦ is commutative on V iff: ∀x ∈ V : ∄y ∈ V : y ≠ x ∧ len(y) = len(x). That is, for every n ∈ N0, V contains at most one word of length n.

Proof: Sufficient condition: Suppose ◦ is commutative on V and for some n ∈ N0 there exist two non-equal words x and y in V of length n. From [The empty word is unique], these words cannot be the empty word. So from the definition of word equality: ∃i : x_i ≠ y_i. Hence from the definition of concatenation: (x ◦ y)_i ≠ (y ◦ x)_i, which contradicts our assumption that ◦ was commutative.

Necessary condition: Suppose V contains at most one word of each length and let x, y ∈ V. As V is closed under ◦, both x ◦ y and y ◦ x are in V, and by [Length of concatenation] both have length len(x) + len(y). Hence x ◦ y = y ◦ x. ∎

Theorem: [Levi's lemma case 1]: Let Σ be an alphabet, Σ∗ the Kleene star of Σ and ◦ denote concatenation. Then: ∀v, w, x, y ∈ Σ∗ : v ◦ w = x ◦ y and len(v) ≥ len(x) =⇒ ∃!z ∈ Σ∗ : v = x ◦ z and y = z ◦ w

Proof: From the definition of word equality, ∀i : (v ◦ w)_i = (x ◦ y)_i. By the definition of concatenation, up to i = len(x) we have x_i = v_i. If len(v) = len(x) then z = λ. That this is unique follows from [The empty word is unique]. For the other cases we will show, by induction, that the infix t of (x ◦ y) defined as t_i = (x ◦ y)_{len(x)+i} for i ∈ [1.. len(v) − len(x)] is the unique word satisfying the conditions on z. For the base case where len(v) = len(x) + 1, t = (x ◦ y)_{len(x)+1}. This is unique as it is a single letter of Σ. For the inductive step, we suppose there exists a k ∈ [1.. len(v) − len(x) − 1] such that len(v) = len(x) + k and that there exists a unique infix t satisfying the conditions on z. Then for len(v) = len(x) + k + 1 we define t′ as t ◦ (x ◦ y)_{len(x)+len(t)+1}. t′ uniquely satisfies the conditions on z up to t′_{len(t)} by assumption. The remaining letter satisfies them and is unique by being a single letter. Hence the result. ∎

Definition: Prefix: Let Σ be an alphabet and x and y be words over Σ. Then y is a prefix of x, denoted y pre x, iff y = λ or y is an infix of x for which the position i in the definition of infix can be taken to be 1, so that ∀j ∈ [1.. len(y)] : y_j = x_j. In particular y_1 = x_1.

Note: y is a non-trivial prefix iff y ≠ x. y is a non-empty prefix iff y ≠ λ. y is a proper prefix iff it is both a non-trivial and non-empty prefix.

Definition: Prefix set: Let Σ be an alphabet and x be a word over Σ. Then the prefix set of x, denoted PREF(x), is the set of all prefixes of x.

Note: The proper prefix set of x, denoted PREFpr(x), is the set of all proper prefixes of x.

Definition: Suffix: Let Σ be an alphabet and x and y be words over Σ. Then y is a suffix of x, denoted y suff x, iff y = λ or y is an infix of x for which the position i in the definition of infix can be taken to be len(x) − len(y) + 1. In particular y_{len(y)} = x_{len(x)}.

Note: y is a non-trivial suffix iff y ≠ x. y is a non-empty suffix iff y ≠ λ. y is a proper suffix iff it is both a non-trivial and non-empty suffix.

Definition: Suffix set: Let Σ be an alphabet and x be a word over Σ. Then the suffix set of x, denoted SUFF(x), is the set of all suffixes of x.

Note: The proper suffix set of x, denoted SUFFpr(x), is the set of all proper suffixes of x.
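Prefix sets, suffix sets and the word z of Levi's lemma can be computed with slices. The Python sketch below assumes words are `str` values and reads a prefix as an initial segment and a suffix as a final segment of a word; `prefixes`, `suffixes` and `levi_middle` are hypothetical helper names.

```python
def prefixes(x):
    """PREF(x): all initial segments of x, including λ and x itself."""
    return {x[:k] for k in range(len(x) + 1)}

def suffixes(x):
    """SUFF(x): all final segments of x, including λ and x itself."""
    return {x[k:] for k in range(len(x) + 1)}

def levi_middle(v, w, x, y):
    """Given v ◦ w = x ◦ y with len(v) >= len(x), return z with v = x ◦ z and y = z ◦ w."""
    if v + w != x + y or len(v) < len(x):
        raise ValueError("the hypotheses of the lemma are not met")
    z = v[len(x):]   # the letters of v beyond the common part x
    assert v == x + z and y == z + w
    return z

print(sorted(prefixes("abc")))
print(sorted(suffixes("abc")))

z = levi_middle("abc", "de", "ab", "cde")
print(z)                                        # the overlap between v and y
print(z in suffixes("abc"), z in prefixes("cde"))
```

In this reading, z is at once a suffix of v and a prefix of y, which is how the lemma is typically used.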

Theorem: [Infix relation is partial ordering]: Let Σ be an alphabet and Σ∗ the Kleene star of Σ. Then the infix relation is a partial ordering on Σ∗.

Proof: A relation is a partial ordering iff it is reflexive, antisymmetric and transitive. We will show that the infix relation has all three of these properties. Let x, y, z ∈ Σ∗.

Reflexivity: From the definition of infix, every word is an infix of itself.

Antisymmetry: Suppose x inf y and y inf x. Then len(x) ≤ len(y) and len(y) ≤ len(x) by condition (1) of the definition of infix, so len(x) = len(y), and ∀i : x_i = y_i by conditions (2) and (3) of the definition of infix. Hence by the definition of word equality: y = x.

Transitivity: Suppose x inf y and y inf z. Then len(x) ≤ len(z) by condition (1) of the definition of infix, and conditions (2) and (3) are satisfied as well, so x inf z. ∎
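The three properties can be spot-checked on all words up to a small length. The Python sketch below assumes Σ = {a, b} with words as `str` values and uses Python's substring test `y in x` as the infix relation; the variable names are illustrative. It also shows that the ordering is partial rather than total.

```python
from itertools import product

def is_infix(y, x):
    """y inf x via Python's substring test; "" is an infix of every word."""
    return y in x

# All words over {a, b} of length 0..3
words = ["".join(w) for n in range(4) for w in product("ab", repeat=n)]

reflexive = all(is_infix(x, x) for x in words)
antisymmetric = all(
    x == y
    for x in words for y in words
    if is_infix(x, y) and is_infix(y, x)
)
transitive = all(
    is_infix(x, z)
    for x in words for y in words for z in words
    if is_infix(x, y) and is_infix(y, z)
)
# 'a' and 'b' are incomparable, so the ordering is not total
total = all(is_infix(x, y) or is_infix(y, x) for x in words for y in words)

print(reflexive, antisymmetric, transitive, total)
```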

Grammars

Definition: Chomsky grammar: A Chomsky grammar is a 4-tuple: G = (N, T, P, S), where: V is a non-empty finite set called the total vocabulary. Σ is a non-empty finite set such that Σ ⊆ V called the terminal alphabet. N = V − Σ is called the set of.