Language and Syntax:-

Every language displays a structure called its grammar or syntax. For example, a correct sentence always consists of a subject followed by a predicate, correct means well formed. This fact can be described by the following formula:

sentence = subject predicate.

If we add to this formula the two further formulas

subject = "John" | "Mary".

predicate = "eats" | "talks".

then we define herewith exactly four possible sentences, namely

John eats Mary eats

John talks Mary talks

where the symbol | is to be pronounced as or. We call these formulas syntax rules, productions, or simply syntactic equations. Subject and predicate are syntactic classes. A shorter notation for the above omits meaningful identifiers:

S = AB. L = {ac, ad, bc, bd}
A = "a" | "b".
B = "c" | "d".

We will use this shorthand notation in the subsequent, short examples. The set L of sentences which can be generated in this way, that is, by repeated substitution of the left-hand sides by the right-hand sides of the equations, is called the language.

The example above evidently defines a language consisting of only four sentences. Typically, however, a language contains infinitely many sentences. The following example shows that an infinite set may very well be defined with a finite number of equations. The symbol ∅ stands for the empty sequence.

S = A. L = {∅, a, aa, aaa, aaaa, ... }
A = "a" A | ∅.

It means to do so is recursion which allows a substitution (here of A by "a"A) be repeated arbitrarily often.

This example is again based on the use of recursion. But it generates not only sentences consisting of an arbitrary sequence of the same symbol, but also nested sentences:

S = A. L = {b, abc, aabcc, aaabccc ... }
A = "a" A "c" | "b".

It is clear that arbitrarily deep nestings can be expressed, a property particularly important in the definition of structured languages.

The last example exhibits the structure of expressions. The symbols E, T, F, and V stand for expression, term, factor, and variable.

E = T | A "+" T.
T = F | T "*" F.
F = V | "(" E ")".
V = "a" | "b" | "c" | "d".

From this example it is evident that a syntax does not only define the set of sentences of a language, but also provides them with a structure.

A language is defined by the following:

1. The set of terminal symbols. These are the symbols that occur in its sentences. They are said to be terminal, because they cannot be substituted by any other symbols. The substitution process stops with terminal symbols. In our first example this set consists of the elements a, b, c and d. The set is also called vocabulary.Terminals are represented by small letters.

2. The set of nonterminal symbols. They denote syntactic classes and can be substituted. In our first example this set consists of the elements S, A and B. Nonterminals are represented by capital letters.

3. The set of syntactic equations (also called productions). These define the possible substitutions of nonterminal symbols. An equation is specified for each nonterminal symbol.

4. The start symbol. It is a nonterminal symbol, in the examples above denoted by S.

A language is, therefore, the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.

Let nonterminal symbols be identifiers as we know them from programming languages, that is, as sequences of letters (and possibly digits), for example, expression, term. Let terminal symbols be character sequences enclosed in quotes (strings), for example, "=", "|". For the definition of the structure of these equations it is convenient to use the tool just being defined itself:

syntax = production syntax | ∅.

production = identifier "=" expression "." .

expression = term | expression "|" term.

term = factor | term factor.

factor = identifier | string.

identifier = letter | identifier letter | identifier digit.

string = stringhead """.

stringhead = """ | stringhead character.

letter = "A" | ... | "Z".

digit = "0" | ... | "9".

This notation was introduced in 1960 by J. Backus and P. Naur in almost identical form for the formal description of the syntax of the language Algol 60. It is therefore called Backus Naur Form (BNF). As our example shows, using recursion to express simple repetitions is rather detrimental to readability. Therefore, we extend this notation by two constructs to express repetition and optionality. Furthermore, we allow expressions to be enclosed within parentheses. Thereby an extension of BNF called EBNF introduced by Wirth, in 1977 is postulated, which again we immediately use for its own, precise definition:

syntax = {production}.

production = identifier "=" expression "." .

expression = term {"|" term}.

term = factor {factor}.

factor = identifier | string | "(" expression ")" | "[" expression "]" | "{" expression "}".

identifier = letter {letter | digit}.

string = """ {character} """.

letter = "A" | ... | "Z".

digit = "0" | ... | "9".

A factor of the form {x} is equivalent to an arbitrarily long sequence of x, including the empty sequence.
A production of the form
A = AB | ∅.
is now formulated more briefly as A = {B}. A factor of the form [x] is equivalent to "x or nothing", that is, it expresses optionality. Hence, the need for the special symbol ∅ for the empty sequence vanishes. The idea of defining languages and their grammar with mathematical precision goes back to N.

Owing to the syntactic structure, the individual parts of the sentence and their meaning can be recognized independently, and together they yield the meaning of the whole.

Two things are conclded:

1. Interpretation of sentences always rests on the recognition of their syntactic structure.

2. Every sentence must have a single structure in order to be unambiguous.

If the second requirement is not satisfied, ambiguous sentences arise. We call a syntactic class ambiguous if it can be attributed several structures. A language is ambiguous if it contains at least one ambiguous syntactic class