Formal languages

This text is an outline of formal language theory. The text is split into blocks. There are 4 types of blocks: Definitions, Examples, Theorems and Discussions. Conventions in this text: $\mathbb{N} = \{1, 2, 3, \ldots\}$ and $\mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}$.

1

Alphabets and words

Definition: Alphabet: Let Σ be a non-empty finite set. In the context of formal language theory Σ is referred to as an alphabet.
Synonyms: Vocabulary
Example 1: The set {a, b, c} is an alphabet.
Example 2: The set {begin, end, while, do, if, then} is an alphabet similar to ones used in programming languages.
Example 3: The set ∅ fails to be an alphabet as it needs to be non-empty.

Definition: Subalphabet: Let Σ be an alphabet and let V ⊆ Σ. Then V is a subalphabet of Σ.

Definition: Letter: Let Σ be an alphabet. Then the elements of Σ are referred to as the letters of Σ.
Synonyms: Symbol, Token
Example 1: Given the alphabet Σ = {1, 2, x, y}, x is a letter of Σ though xy is not.

Definition: Word: Let Σ be an alphabet. A finite sequence in Σ is a word over Σ.
Synonyms: String
Example 1: Given the alphabet Σ = {5, a, 7, j, t}, ⟨a, 7, a, t⟩ is a word over Σ. If there is no confusion this would simply be written as 'a7at'.
Example 2: Given the alphabet Σ = {l, ll, lll}, ⟨ll, lll, ll⟩ is a word over Σ. Writing it in the shorter form would be ambiguous, so the shorter form is not used here.

Definition: i-th letter: Let Σ be an alphabet and x be a word over Σ. The i-th letter of x is the i-th term of x when viewed as a finite sequence over Σ. It is denoted $x_i$.

Definition: Word length: Let Σ be an alphabet and x be a word over Σ. The length of x is the cardinality of the domain of x when viewed as a sequence. It is denoted len(x) or lg(x) or |x|.
Example 1: Given the alphabet Σ = {j, p, r, k}, let x = 'rrkrjp'. Then len(x) = 6.

Definition: Word equality: Let Σ be an alphabet and let x and y be words over Σ. x and y are defined to be equal (denoted x = y) iff len(x) = len(y) and $\forall i : x_i = y_i$.

Definition: Empty word: Let Σ be an alphabet and x be a word over Σ. x is an empty word iff len(x) = 0. It is denoted λ or ε.
Synonyms: Empty string, Null word, Null string

Definition: Concatenation: Let Σ be an alphabet and let x and y be words over Σ. The concatenation of x with y is denoted xy in the literature (but is denoted x ◦ y in this text). We define x ◦ y = x if y = λ and x ◦ y = y if x = λ. Otherwise it is defined:
\[
(x \circ y)_i =
\begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y)
\end{cases}
\]

Definition: String power: Let Σ be an alphabet, x be a word over Σ and ◦ denote concatenation. The n-th string power of x, where $n \in \mathbb{N}_0$, is denoted $x^n$ and defined inductively:
\[
x^n =
\begin{cases}
\lambda & \text{if } n = 0 \\
x^{n-1} \circ x & \text{if } n > 0
\end{cases}
\]
Synonyms: Product

Definition: Subalphabet product: Let Σ be an alphabet and let V, W ⊆ Σ. The subalphabet product (under concatenation) of V with W is denoted V W and defined:
\[
V W = \{ v \circ w : v \in V \wedge w \in W \}
\]

Definition: Subalphabet power: Let Σ be an alphabet and let V ⊆ Σ. The n-th power of V, where $n \in \mathbb{N}_0$, is denoted $V^n$ and defined inductively:
\[
V^n =
\begin{cases}
\{\lambda\} & \text{if } n = 0 \\
V^{n-1} V & \text{if } n > 0
\end{cases}
\]
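The definitions of concatenation, string power and subalphabet power can be tried out on small inputs. The following is only a sketch under one assumption that is not part of the text: a word is modelled as a Python tuple of letters, so that multi-character letters such as 'll' stay distinguishable. The names `concat`, `word_power` and `subalphabet_power` are illustrative.

```python
from itertools import product

# Assumption: a word is a tuple of letters; the empty tuple plays the role of λ.
EMPTY = ()


def concat(x, y):
    """Concatenation x ◦ y: the letters of x followed by the letters of y."""
    return x + y


def word_power(x, n):
    """n-th string power: x^0 = λ and x^n = x^(n-1) ◦ x."""
    result = EMPTY
    for _ in range(n):
        result = concat(result, x)
    return result


def subalphabet_power(V, n):
    """V^n: all words made of exactly n letters taken from the subalphabet V."""
    return {tuple(letters) for letters in product(sorted(V), repeat=n)}


x = ("r", "r", "k")
y = ("j", "p")
print(concat(x, y))                # ('r', 'r', 'k', 'j', 'p')
print(len(concat(x, y)))           # 5
print(word_power(("a", "b"), 2))   # ('a', 'b', 'a', 'b')
print(sorted(subalphabet_power({"a", "b"}, 2)))
```

With this representation the word ⟨ll, lll, ll⟩ is simply `("ll", "lll", "ll")`, which has length 3 and avoids the ambiguity of the shorter notation.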

Notation: Some sources denote $V^n$ as $V_n$.

Definition: Kleene star: Let Σ be an alphabet and let V ⊆ Σ. The Kleene star of V is denoted $V^*$ and defined:
\[
V^* = \bigcup_{i \in \mathbb{N}_0} V^i
\]
Synonyms: Kleene closure, Monoid closure

Definition: P-star: Let Σ be an alphabet. The P-star of Σ, denoted $\mathcal{P}(\Sigma^*)$, is the set of all languages over Σ. Hence it is the power set of the Kleene star of Σ.

Definition: Kleene plus: Let Σ be an alphabet and let V ⊆ Σ. The Kleene plus of V is denoted $V^+$ and defined:
\[
V^+ = \bigcup_{i \in \mathbb{N}} V^i
\]
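Since the Kleene star and Kleene plus of a non-empty subalphabet are infinite sets, they can only be enumerated lazily. The sketch below assumes words are tuples of letters and that V is non-empty. The generator names `kleene_star` and `kleene_plus` are illustrative and not taken from the text.

```python
from itertools import count, islice, product


def kleene_star(V):
    """Lazily yield the words of V* in order of increasing length."""
    letters = sorted(V)
    for n in count(0):                      # n ranges over N_0
        for word in product(letters, repeat=n):
            yield word


def kleene_plus(V):
    """Lazily yield the words of V+, that is V* without the empty word."""
    letters = sorted(V)
    for n in count(1):                      # n ranges over N
        for word in product(letters, repeat=n):
            yield word


first_star = list(islice(kleene_star({"a", "b"}), 7))
first_plus = list(islice(kleene_plus({"a", "b"}), 6))
print(first_star)   # starts with the empty word (), then the words of length 1 and 2
print(first_plus)   # the same words without ()
```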

Definition: P-plus: Let Σ be an alphabet. The P-plus of Σ, denoted $\mathcal{P}(\Sigma^+)$, is the power set of the Kleene plus of Σ.

Definition: Formal language: Let Σ be an alphabet and let $V \subseteq \Sigma^*$. Then V is a formal language over Σ.

Definition: Empty language: The empty language is the language containing no words. It is denoted ∅ (it is the empty set, viewed in the context of formal language theory).

Definition: Trivial language: The trivial language is the language containing the empty word only. It is denoted {λ}.

Definition: Language product: Let Σ be an alphabet and let V and W be formal languages over Σ. Then the language product of V with W, denoted V W, is defined:
\[
V W = \{ x \circ y : x \in V \wedge y \in W \}
\]

Note: The notation is the same as that of the subalphabet product, as the language product is an extension of the definition of subalphabet product and could only be defined after formal language was defined.

Definition: Language power: Let Σ be an alphabet and let V be a formal language over Σ. The n-th power of V, where $n \in \mathbb{N}_0$, is denoted $V^n$ and defined inductively:
\[
V^n =
\begin{cases}
\{\lambda\} & \text{if } n = 0 \\
V^{n-1} V & \text{if } n > 0
\end{cases}
\]
Note: The notation is the same as that of the subalphabet power, as the language power is an extension of the definition of subalphabet power and could only be defined after formal language was defined.

Definition: Linguistic structure: Let Σ be an alphabet, V be a formal language over Σ and ◦ denote concatenation. (V, ◦) is a linguistic structure iff $\forall (x, y) \in V \times V : x \circ y \in V$. That is, V is closed under ◦.
Note: This definition was invented for this text. Every linguistic structure is an algebraic structure, hence the name.

Theorem: [The empty word is unique]: Let Σ be an alphabet and let λ and λ′ be words over Σ of length 0. Then λ = λ′. Hence we can speak of the empty word.
Proof: From the definition of word equality, we have len(λ) = len(λ′) and $\forall i : \lambda_i = \lambda'_i$ holds vacuously. Hence the result. ∎

Theorem: [Length of concatenation]: Let Σ be an alphabet, x and y be words over Σ and ◦ denote concatenation. Then len(x ◦ y) = len(x) + len(y).
Proof: If x = λ or y = λ then the result follows immediately from the definition of concatenation with the empty word and the definition of word length. Otherwise, from the definition of concatenation, x ◦ y is a mapping from [1..len(x) + len(y)] to Σ. From the definition of the length of a word, len(x ◦ y) is the cardinality of the domain of x ◦ y when viewed as a sequence. Hence len(x ◦ y) = len(x) + len(y). ∎
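For finite languages, the language product and language power can be computed directly from their definitions. The following sketch assumes words are tuples of letters and languages are finite Python sets. The function names `language_product` and `language_power` are illustrative.

```python
def language_product(V, W):
    """V W = {x ◦ y : x ∈ V and y ∈ W} for finite languages of tuple-words."""
    return {x + y for x in V for y in W}


def language_power(V, n):
    """V^0 = {λ} and V^n = V^(n-1) V."""
    result = {()}                       # the trivial language {λ}
    for _ in range(n):
        result = language_product(result, V)
    return result


V = {("a",), ("a", "b")}
W = {(), ("b",)}
VW = language_product(V, W)
print(sorted(VW))   # [('a',), ('a', 'b'), ('a', 'b', 'b')]

# [Length of concatenation], checked on every pair from V × W.
print(all(len(x + y) == len(x) + len(y) for x in V for y in W))   # True

print(language_product(V, set()))   # set(): product with the empty language
print(language_power(V, 0))         # {()}: the trivial language
```

Note that ⟨a, b⟩ arises twice in V W (as ⟨a⟩ ◦ ⟨b⟩ and as ⟨a, b⟩ ◦ λ), so the product of a 2-word language with a 2-word language here has only 3 words.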

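Theorem: [The empty word is a two sided identity]: Let Σ be an alphabet, x be a word over Σ, λ be the empty word over Σ and ◦ denote concatenation. Then x ◦ λ = x and λ ◦ x = x. That is, λ is a two-sided identity element of concatenation.
Proof: Follows immediately from the definition of concatenation with the empty word. ∎

Theorem: [Concatenation is associative]: Let Σ be an alphabet. Let x, y, z be words over Σ and let ◦ denote concatenation. Then (x ◦ y) ◦ z = x ◦ (y ◦ z). That is, concatenation is associative.
Proof: If any of x, y or z is λ then the result follows immediately from [The empty word is a two sided identity]. Otherwise, from the definition of word equality it must first be shown that the lengths of the words are equal. From [Length of concatenation]:
\[
\operatorname{len}((x \circ y) \circ z) = \operatorname{len}(x \circ y) + \operatorname{len}(z) = \operatorname{len}(x) + \operatorname{len}(y) + \operatorname{len}(z)
\]
\[
\operatorname{len}(x \circ (y \circ z)) = \operatorname{len}(x) + \operatorname{len}(y \circ z) = \operatorname{len}(x) + \operatorname{len}(y) + \operatorname{len}(z)
\]
Then it must be shown that $\forall i : ((x \circ y) \circ z)_i = (x \circ (y \circ z))_i$, which will be demonstrated by repeatedly using the definition of concatenation and [Length of concatenation]:
\begin{align*}
((x \circ y) \circ z)_i
&= \begin{cases}
(x \circ y)_i & \text{if } 1 \le i \le \operatorname{len}(x \circ y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x \circ y) + \operatorname{len}(z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x \circ y) + \operatorname{len}(z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
y_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x \circ y) \\
z_{i - \operatorname{len}(x \circ y)} & \text{if } \operatorname{len}(x \circ y) < i \le \operatorname{len}(x) + \operatorname{len}(y \circ z)
\end{cases} \\
&= \begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
(y \circ z)_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(y \circ z)
\end{cases} \\
&= (x \circ (y \circ z))_i
\end{align*}
Hence the result. ∎

Theorem: [Language product is associative]: Let Σ be an alphabet. Let X, Y, Z be formal languages over Σ. Then (XY)Z = X(YZ). That is, the language product is associative.
Proof:
\[
(XY)Z = \{ (x \circ y) \circ z : x \in X \wedge y \in Y \wedge z \in Z \}
\]
From [Concatenation is associative]:
\[
= \{ x \circ (y \circ z) : x \in X \wedge y \in Y \wedge z \in Z \} = X(YZ)
\]
Hence the result. ∎

Theorem: [Linguistic structure underlying set]: Let (V, ◦) be a linguistic structure with V non-empty. Then V is either an infinite set or {λ}, where λ denotes the empty word.
Proof: Suppose (V, ◦) is a linguistic structure and V is a finite set such that there exists an x ∈ V where x ≠ λ. Then V contains a word y of greatest length m. As V is closed under ◦ we have y ◦ x ∈ V, but from [Length of concatenation] len(y ◦ x) = len(y) + len(x) > m, which contradicts the assumption that m is the greatest length.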

If V = {λ} then from [The empty word is a two sided identity] we have that V is closed under ◦ and so ({λ}, ◦) is a linguistic structure. ∎

Theorem: [Kleene plus is a linguistic structure]: Let Σ be an alphabet, $\Sigma^+$ be the Kleene plus of Σ and ◦ denote concatenation. Then $(\Sigma^+, \circ)$ is a linguistic structure.
Proof: As $\Sigma^+ \subseteq \Sigma^*$ we have that $\Sigma^+$ is a formal language over Σ. From the definition of $\Sigma^+$ it follows: $x \in \Sigma^+ \Leftrightarrow \operatorname{len}(x) > 0$ and $\forall i : x_i \in \Sigma$. Let $x, y \in \Sigma^+$. From [Length of concatenation], len(x) > 0 and len(y) > 0 so len(x ◦ y) > 0. From the definition of concatenation, $\forall i : x_i \in \Sigma$ and $\forall i : y_i \in \Sigma$ so $\forall i : (x \circ y)_i \in \Sigma$. Hence $\Sigma^+$ is closed under ◦ and $(\Sigma^+, \circ)$ is a linguistic structure. ∎

Theorem: [Kleene plus is a semigroup]: Let Σ be an alphabet, $\Sigma^+$ be the Kleene plus of Σ and ◦ denote concatenation. Then $(\Sigma^+, \circ)$ is a semigroup.
Proof: This follows immediately from [Kleene plus is a linguistic structure] and [Concatenation is associative]. ∎

Theorem: [Kleene star is a linguistic structure]: Let Σ be an alphabet, $\Sigma^*$ be the Kleene star of Σ and ◦ denote concatenation. Then $(\Sigma^*, \circ)$ is a linguistic structure.
Proof: As $\Sigma^* \subseteq \Sigma^*$ we have that $\Sigma^*$ is a formal language over Σ. From [Kleene plus is a linguistic structure] we have that $\Sigma^* - \{\lambda\} = \Sigma^+$ is closed under ◦. Including λ in the underlying set, we have from [The empty word is a two sided identity] that $\Sigma^*$ is closed under ◦ and hence $(\Sigma^*, \circ)$ is a linguistic structure. ∎

Theorem: [Kleene star is a monoid]: Let Σ be an alphabet, $\Sigma^*$ be the Kleene star of Σ and ◦ denote concatenation. Then $(\Sigma^*, \circ)$ is a monoid.
Proof: This follows immediately from [Kleene star is a linguistic structure], [Concatenation is associative] and [The empty word is a two sided identity]. ∎
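The closure and monoid statements above can be spot-checked on finite samples. This is a sketch under stated assumptions, not a proof: words are modelled as tuples, only words of length at most 2 over {a, b} are tested, and the helper names `words_up_to` and `is_closed` are illustrative.

```python
from itertools import product


def words_up_to(sigma, max_len):
    """All words over sigma with length at most max_len, as tuples."""
    return [w for n in range(max_len + 1)
            for w in product(sorted(sigma), repeat=n)]


def is_closed(V):
    """True iff the finite language V is closed under concatenation."""
    return all(x + y in V for x in V for y in V)


EMPTY = ()
sample = words_up_to({"a", "b"}, 2)

# Monoid axioms of (Σ*, ◦), checked only on a finite sample of words.
associative = all((x + y) + z == x + (y + z)
                  for x in sample for y in sample for z in sample)
identity = all(x + EMPTY == x and EMPTY + x == x for x in sample)
print(associative, identity)        # True True

# [Linguistic structure underlying set]: {λ} is closed, but a finite
# language containing a non-empty word is not.
print(is_closed({EMPTY}))           # True
print(is_closed({EMPTY, ("a",)}))   # False, since ('a', 'a') is missing
```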

Theorem: [Length is an epimorphism]: Let Σ be an alphabet, $\Sigma^*$ be the Kleene star of Σ and ◦ denote concatenation. Then the length function is an epimorphism from $(\Sigma^*, \circ)$ to $(\mathbb{N}_0, +)$.
Proof: The morphism property follows immediately from [Length of concatenation] as len(x ◦ y) = len(x) + len(y). Hence it remains to be shown that length is a surjective function. As a special case, 0 has a pre-image as len(λ) = 0. Now let $n \in \mathbb{N}$ and let $x \in \Sigma^n$, which exists as Σ is non-empty. As len(x) = n we have that $\forall n \in \mathbb{N}_0 : \exists x \in \Sigma^* : \operatorname{len}(x) = n$. Hence the result. ∎

Theorem: [Length is an isomorphism iff alphabet is singleton]: Let Σ be an alphabet, $\Sigma^*$ be the Kleene star of Σ and ◦ denote concatenation. Then the length function is an isomorphism from $(\Sigma^*, \circ)$ to $(\mathbb{N}_0, +)$ iff |Σ| = 1.
Proof: (⇒) Suppose the length function is an isomorphism from $(\Sigma^*, \circ)$ to $(\mathbb{N}_0, +)$ and suppose |Σ| > 1. Let a and b be distinct letters of Σ and let $n \in \mathbb{N}$. Then $\operatorname{len}(a^{n-1} \circ a) = \operatorname{len}(a^{n-1} \circ b) = n$ and $a^{n-1} \circ a \ne a^{n-1} \circ b$, so the length function is not injective, which contradicts the assumption that it is an isomorphism. Hence |Σ| = 1. Note that by the definition of an alphabet, |Σ| ≠ 0 (and there is no special case to consider for λ as it is a word, not a letter).
(⇐) From [Length is an epimorphism], it remains to be shown that |Σ| = 1 implies the length function is injective. That is, $\forall x, y \in \Sigma^* : \operatorname{len}(x) = \operatorname{len}(y) \implies x = y$. Let $x, y \in \Sigma^*$ and let len(x) = len(y). As Σ contains a single letter, every letter of x and of y is that letter, so $\forall i : x_i = y_i$. So from the definition of word equality x = y. Hence the length function is injective. ∎

Theorem: [Concatenation is a cancellable operation]: Let Σ be an alphabet, $\Sigma^*$ be the Kleene star of Σ and ◦ denote concatenation. Then concatenation is a cancellable operation. That is, $\forall x, y, z \in \Sigma^* : x \circ z = y \circ z \implies x = y$ and $z \circ x = z \circ y \implies x = y$.
Proof: The special case where x = λ or y = λ follows from [The empty word is a two sided identity] together with [Length of concatenation]. Let $x, y, z \in \Sigma^*$ with x ◦ z = y ◦ z. From the definition of concatenation:
\[
(x \circ z)_i =
\begin{cases}
x_i & \text{if } 1 \le i \le \operatorname{len}(x) \\
z_{i - \operatorname{len}(x)} & \text{if } \operatorname{len}(x) < i \le \operatorname{len}(x) + \operatorname{len}(z)
\end{cases}
\]
And
\[
(y \circ z)_i =
\begin{cases}
y_i & \text{if } 1 \le i \le \operatorname{len}(y) \\
z_{i - \operatorname{len}(y)} & \text{if } \operatorname{len}(y) < i \le \operatorname{len}(y) + \operatorname{len}(z)
\end{cases}
\]
As x ◦ z = y ◦ z, the definition of word equality and [Length of concatenation] give len(x) + len(z) = len(y) + len(z), so len(x) = len(y), and for every i ∈ [1..len(x)] we have $x_i = (x \circ z)_i = (y \circ z)_i = y_i$. Hence by the definition of word equality x = y and concatenation is right cancellable. The proof that concatenation is left cancellable follows similarly. ∎

Theorem: [Intersection of linguistic structures]: Let (V, ◦) and (W, ◦) be linguistic structures. Then (V ∩ W, ◦) is a linguistic structure.
Proof: Let x, y ∈ V ∩ W. As x, y ∈ V we have x ◦ y ∈ V. Also x, y ∈ W so x ◦ y ∈ W. By the definition of intersection:
\[
x \circ y \in V \wedge x \circ y \in W \implies x \circ y \in V \cap W
\]
So:
\[
\forall x, y \in V \cap W : x \circ y \in V \cap W
\]
Hence (V ∩ W, ◦) is a linguistic structure. ∎

Theorem: [Language product is distributive over union]: Let Σ be an alphabet. Let V, W and Y be formal languages over Σ. Then V(W ∪ Y) = (VW) ∪ (VY).
Proof:
\[
V(W \cup Y) = \{ x \circ y : x \in V \wedge y \in W \cup Y \}
\]
By the definition of set union:
\[
= \{ x \circ y : x \in V \wedge (y \in W \vee y \in Y) \}
\]
Conjunction is distributive over disjunction:
\[
= \{ x \circ y : (x \in V \wedge y \in W) \vee (x \in V \wedge y \in Y) \}
\]
By the definition of language product and set union:
\[
= (VW) \cup (VY)
\]
Hence the result. ∎
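Distributivity of the language product over union, and the cancellation property of concatenation, can both be checked by brute force on small inputs. The sketch below assumes tuple-words and finite languages. The three sample languages are arbitrary choices, and a finite check is of course no substitute for the proofs.

```python
from itertools import product


def language_product(V, W):
    """V W = {x ◦ y : x ∈ V and y ∈ W} for finite languages of tuple-words."""
    return {x + y for x in V for y in W}


V = {("a",), ("a", "b")}
W = {(), ("b",)}
Y = {("b",), ("c",)}

# V(W ∪ Y) = (VW) ∪ (VY)
left = language_product(V, W | Y)
right = language_product(V, W) | language_product(V, Y)
print(left == right)    # True

# Cancellation, checked on all words of length at most 2 over {a, b}.
words = [w for n in range(3) for w in product("ab", repeat=n)]
right_cancellable = all(x == y for x in words for y in words for z in words
                        if x + z == y + z)
left_cancellable = all(x == y for x in words for y in words for z in words
                       if z + x == z + y)
print(right_cancellable, left_cancellable)   # True True
```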

Theorem: [P-star is commutative monoid under union]: Let Σ be an alphabet and $\mathcal{P}(\Sigma^*)$ be the P-star of Σ. Then $(\mathcal{P}(\Sigma^*), \cup)$ is a commutative monoid.
Proof:
Closure: As $V \subseteq \Sigma^* \wedge W \subseteq \Sigma^* \implies (V \cup W) \subseteq \Sigma^*$ for all $V, W \in \mathcal{P}(\Sigma^*)$, we have that $\mathcal{P}(\Sigma^*)$ is closed under ∪.
Associativity: Set union is associative.
Identity: The empty language is a language over any alphabet and therefore is an element of $\mathcal{P}(\Sigma^*)$. The empty language is the empty set and for any set X: X ∪ ∅ = X. So the empty language is the identity element.
Commutativity: Set union is commutative.
Hence $(\mathcal{P}(\Sigma^*), \cup)$ satisfies all the defining properties of a commutative monoid. ∎

Theorem: [P-star forms additive rig with unity]: Let Σ be an alphabet, $\mathcal{P}(\Sigma^*)$ be the P-star of Σ and $\circ_L$ denote the language product operation. Then $(\mathcal{P}(\Sigma^*), \cup, \circ_L)$ is an additive rig with unity. That is to say it satisfies all three of these conditions:
(1) $(\mathcal{P}(\Sigma^*), \cup)$ is a commutative monoid.
(2) $\circ_L$ is distributive over ∪.
(3) $(\mathcal{P}(\Sigma^*), \circ_L)$ is a monoid.
Proof:
(1) Follows directly from [P-star is commutative monoid under union].
(2) Follows directly from [Language product is distributive over union].
(3) The product of two languages over Σ is again a language over Σ, associativity follows from [Language product is associative], and the trivial language {λ} is the identity element by [The empty word is a two sided identity]. ∎

Definition: Infix: Let Σ be an alphabet and x and y be words over Σ. Then y is an infix of x, denoted y inf x, iff y = λ or all of the following conditions hold:
(1) len(y) ≤ len(x)
(2) $\exists i \in [1..\operatorname{len}(x) - \operatorname{len}(y) + 1] : y_1 = x_i$
(3) for this same i, $\forall j \in [1..\operatorname{len}(y)] : y_j = x_{i+j-1}$

Note: y is a non-trivial infix iff y ≠ x. y is a non-empty infix iff y ≠ λ. y is a proper infix iff it is both a non-trivial and non-empty infix.
Synonyms: Subword

Definition: Infix set: Let Σ be an alphabet and x be a word over Σ. Then the infix set of x, denoted INFIX(x), is the set of all infixes of x.
Note: The proper infix set of x, denoted INFIXpr(x), is the set of all proper infixes of x.

Definition: Base alphabet: Let Σ be an alphabet and V be a formal language over Σ. Then the base alphabet of V is denoted B(V) and defined:
\[
B(V) = \{ x : \operatorname{len}(x) = 1 \wedge \exists y \in V : x \text{ inf } y \}
\]

Theorem: [Condition for concatenation commutativity]: Let (V, ◦) be a linguistic structure. Then ◦ is commutative on V iff:
\[
\forall x \in V : \nexists y \in V : y \ne x \wedge \operatorname{len}(y) = \operatorname{len}(x)
\]
That is, for each $n \in \mathbb{N}_0$, V contains at most one word of length n.
Proof: (⇒) Suppose ◦ is commutative on V and for some $n \in \mathbb{N}_0$ there exist two non-equal words x and y of length n in V. From [The empty word is unique], these words cannot be the empty word. So from the definition of word equality: $\exists i : x_i \ne y_i$. Hence from the definition of concatenation: $(x \circ y)_i \ne (y \circ x)_i$, which contradicts our assumption that ◦ is commutative.
(⇐) Suppose V contains at most one word of each length and let x, y ∈ V. As V is closed under ◦, both x ◦ y and y ◦ x are elements of V, and from [Length of concatenation] both have length len(x) + len(y). Hence x ◦ y = y ◦ x. ∎

Theorem: [Levi's lemma case 1]: Let Σ be an alphabet, $\Sigma^*$ the Kleene star of Σ and ◦ denote concatenation. Then:
\[
\forall v, w, x, y \in \Sigma^* : v \circ w = x \circ y \text{ and } \operatorname{len}(v) \ge \operatorname{len}(x) \implies \exists! z \in \Sigma^* : v = x \circ z \text{ and } y = z \circ w
\]
Proof: From the definition of word equality, $\forall i : (v \circ w)_i = (x \circ y)_i$. By the definition of concatenation, up to i = len(x) we have $x_i = v_i$. If len(v) = len(x) then z = λ. That this is unique follows from [The empty word is unique]. For the other cases we will show, by induction, that the infix t of x ◦ y defined as $t_i = (x \circ y)_{\operatorname{len}(x) + i}$ for i ∈ [1..len(v) − len(x)] is the unique infix satisfying the conditions on z. For the base case where len(v) = len(x) + 1, $t = (x \circ y)_{\operatorname{len}(x) + 1}$. This is unique as it is a single letter of Σ. For the inductive step, we suppose there exists a k ∈ [1..len(v) − len(x) − 1] such that len(v) = len(x) + k and that there exists a unique infix t satisfying the conditions on z. Then for len(v) = len(x) + k + 1 we define t′ as $t \circ (x \circ y)_{\operatorname{len}(x) + \operatorname{len}(t) + 1}$. t′ uniquely satisfies the conditions on z up to $t'_{\operatorname{len}(t)}$ by assumption. The remaining letter satisfies the conditions on z and is unique by being a single letter. Hence the result. ∎

Definition: Prefix: Let Σ be an alphabet and x and y be words over Σ. Then y is a prefix of x, denoted y pre x, iff y = λ or y is an infix of x starting at position 1, that is, len(y) ≤ len(x) and $\forall j \in [1..\operatorname{len}(y)] : y_j = x_j$.
Note: y is a non-trivial prefix iff y ≠ x. y is a non-empty prefix iff y ≠ λ. y is a proper prefix iff it is both a non-trivial and non-empty prefix.

Definition: Prefix set: Let Σ be an alphabet and x be a word over Σ. Then the prefix set of x, denoted PREF(x), is the set of all prefixes of x.
Note: The proper prefix set of x, denoted PREFpr(x), is the set of all proper prefixes of x.

Definition: Suffix: Let Σ be an alphabet and x and y be words over Σ. Then y is a suffix of x, denoted y suff x, iff y = λ or y is an infix of x ending at position len(x), that is, len(y) ≤ len(x) and $\forall j \in [1..\operatorname{len}(y)] : y_j = x_{\operatorname{len}(x) - \operatorname{len}(y) + j}$.
Note: y is a non-trivial suffix iff y ≠ x. y is a non-empty suffix iff y ≠ λ. y is a proper suffix iff it is both a non-trivial and non-empty suffix.

Definition: Suffix set: Let Σ be an alphabet and x be a word over Σ. Then the suffix set of x, denoted SUFF(x), is the set of all suffixes of x.
Note: The proper suffix set of x, denoted SUFFpr(x), is the set of all proper suffixes of x.
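The infix, prefix and suffix sets of a word, and the word z from Levi's lemma, are easy to compute once words are modelled as tuples. The following is a sketch under that assumption. Python slices are 0-based, whereas the text indexes letters from 1, and all function names are illustrative.

```python
def infixes(x):
    """INFIX(x): every contiguous block of x, including λ and x itself."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def prefixes(x):
    """PREF(x): the blocks of x that start at its first letter, plus λ."""
    return {x[:k] for k in range(len(x) + 1)}


def suffixes(x):
    """SUFF(x): the blocks of x that end at its last letter, plus λ."""
    return {x[k:] for k in range(len(x) + 1)}


def proper(parts, x):
    """Keep only the proper parts: drop the empty word and x itself."""
    return parts - {(), x}


def levi_middle(v, w, x, y):
    """Given v ◦ w = x ◦ y with len(v) >= len(x), return the word z
    with v = x ◦ z and y = z ◦ w."""
    assert v + w == x + y and len(v) >= len(x)
    return v[len(x):]


x = ("a", "b", "a")
print(len(infixes(x)), len(prefixes(x)), len(suffixes(x)))   # 6 4 4
print(sorted(proper(prefixes(x), x)))   # [('a',), ('a', 'b')]

z = levi_middle(("a", "b", "c"), ("d",), ("a",), ("b", "c", "d"))
print(z)                                # ('b', 'c')
```

Every prefix and every suffix is also an infix, so `prefixes(x) | suffixes(x)` is always a subset of `infixes(x)`.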

Theorem: [Infix relation is partial ordering]: Let Σ be an alphabet and $\Sigma^*$ the Kleene star of Σ. Then the infix relation is a partial ordering on $\Sigma^*$.
Proof: A relation is a partial ordering iff it is reflexive, antisymmetric and transitive. We will show that the infix relation has all three of these properties. Let $x, y, z \in \Sigma^*$:
Reflexivity: From the definition of infix, every word is an infix of itself.
Antisymmetry: Suppose x inf y and y inf x. Then len(x) ≤ len(y) and len(y) ≤ len(x) by condition (1) of the definition of infix, so len(x) = len(y), and $\forall i : x_i = y_i$ by conditions (2) and (3) of the definition of infix. Hence by the definition of word equality: y = x.
Transitivity: Suppose x inf y and y inf z. Then len(x) ≤ len(z) by condition (1) of the definition of infix, and conditions (2) and (3) are satisfied as well. Hence x inf z. ∎
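The three properties of a partial ordering can be checked exhaustively for the infix relation on a small sample. This sketch assumes tuple-words and only tests words of length at most 3 over {a, b}. The function name `is_infix` is illustrative.

```python
from itertools import product


def is_infix(y, x):
    """True iff y occurs as a contiguous block of x (λ is an infix of every word)."""
    return any(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))


words = [w for n in range(4) for w in product("ab", repeat=n)]

reflexive = all(is_infix(x, x) for x in words)
antisymmetric = all(x == y for x in words for y in words
                    if is_infix(x, y) and is_infix(y, x))
transitive = all(is_infix(x, z) for x in words for y in words for z in words
                 if is_infix(x, y) and is_infix(y, z))
print(reflexive, antisymmetric, transitive)   # True True True

# The ordering is only partial: these two words are incomparable.
print(is_infix(("a",), ("b",)), is_infix(("b",), ("a",)))   # False False
```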

2

Grammars

Definition: Chomsky grammar: A Chomsky grammar is a 4-tuple: G = (N, T, P, S), where: V is a non-empty finite set called the total vocabulary. Σ is a non-empty finite set such that Σ ⊆ V called the terminal alphabet. N = V − Σ is called the set of.

Formal languages summary

Work in progress.
