CS421 Theory - Yoshii - Week 1 - Chap 1
=======================================

##Inter questions will be part of HW1
To find them using unix, do
  grep "##" week1.doc

TOPICS:
------
/Strings
/Languages
/Defining Sets

/INTRO
======
This class is basically about (programming) languages.
We will discuss:
  - defining languages
  - generating all strings of a language
  - answering YES if a string is in a language

/DEFINING LANGUAGES
===================
In order to design a new language or describe an existing language,
you have to be able to define exactly what it is.

What are parts of a language?
Take English as an example:
  It is made up of words but the words are made up of letters.
  Words are combined to create a sentence.
  You have to follow grammar rules to create a sentence.
    i.e. the sentence must be SYNTACTICALLY correct.
      e.g.good.  A boy kicked a ball.
      e.g.bad.   A kicked boy ball a.
  Sentences must also have valid meanings.
    i.e. the sentences must be SEMANTICALLY correct.
      e.g.bad.   A ball kicked a boy.

So, to define a language, you must define:
  - the letters making up words (alphabet),
  - the words and punctuation marks of the language (tokens),
  - the syntactic rules of the language,
  - the semantic rules of the language.

##Inter1* Give one sentence which is syntactically incorrect.
##
## Give another sentence which is syntactically correct but is semantically incorrect.

/WHAT SHOULD THE DEFINITION ALLOW US TO DO?
===========================================
Given the definition of a language, we should be able to
systematically generate (i.e. list) all sentences belonging
to the language.

Given the definition of a language, we should be able to say
"YES this belongs to the language" or
"NO this does not belong to the language."

/FOR NATURAL LANGUAGES
======================
Why is it difficult to do the above for natural languages such as English?

Reason1: Natural languages change constantly.
Reason2: Semantic rules of natural languages are complex, requiring
         knowledge of the world, which is huge and also changes
         constantly.
Reason3: Natural languages are for communication between human beings
         and are influenced by CONTEXT and PRAGMATICS.
           e.g.context.     pronoun references (it, they, that)
           e.g.pragmatics.  I sure wish I could see the file....
                            (as a request).

Thus, natural languages are full of ambiguities.
This difficulty presents serious problems when we want to build
a natural language understanding or translation system.

/FOR PROGRAMMING LANGUAGES
==========================
To make it easy to define a programming language, you want to make
sure its syntax and semantics are clean, simple and unambiguous.

For a programming language, we should be able to build a compiler.

  Source Language ---> Compiler ----> Target Language
  (e.g. C++)              |           (assembly/machine code)
                   error messages

A compiler has two fundamental tasks:
  - ANALYSIS:   accept only valid statements of the source language.
  - GENERATION: generate an equivalent in the target language.

Note: for other text processors, the concept remains similar.
  The source may be a document marked with directives (e.g. LaTeX)
  while the target may be the processed document.
  The source may be a database query
  while the target may be the actions caused on the database.

/COMPILER PARTS and CS421
==========================
  source -> [Scanner] - tokens ->
            [Parser] - syntactic structure ->
            [Semantic routines] - IR ->
            [Optimizer] - IR ->
            [Generator] -> target

  All phases do error checking

In the following, the Scanner and the Parser are closely
related to the goals of this course.
Things in " " will be covered in CS421.

1. Scanner (does Lexical Analysis)
   WHAT:
   - reads character by character; left-to-right
   - groups them into the tokens of the language
     (identifiers, integers, reserved words, delimiters, etc.)
   - eliminates unneeded info (e.g. comments)
   - processes compiler control directives (e.g.
     #define, #ifndef)
   HOW:
   - "regular expressions" define the tokens
   - "finite automata" recognize tokens
   - "finite automata" can be created automatically from
     regular expressions
   TYPES:
   - hand-coded scanner
   - table-driven scanner

2. Parser (does Syntactic Analysis)
   WHAT:
   - groups tokens into higher level units
     (expressions, statements, etc.)
   - verifies correct syntax
   - recovers from (and sometimes repairs) syntax errors
     to continue parsing
   - builds a syntax tree (or parse tree), or
     calls semantic routines directly during parsing.
   HOW:
   - units are defined by the "production rules of a grammar"
     (Context Free) which contain "recursive rules" for nested
     structures.
   TYPES:
   - recursive descent parser
   - table-driven parser

3. Semantic routines (does Semantic Analysis)
   WHAT:
   - checks static (compile-time) semantics of each construct
     (e.g. variables have been declared? operand types are correct?)
     and possible operand coercions (e.g. changing integer into real)
   - generates the internal representation (IR)
     suitable for the target machine
   HOW:
   - when a production rule is applied, associated
     semantic routines are activated
   TYPES:
   - usually hand coded
   - bulk of the effort in writing a compiler!

---------------- Week1 Thursday ------------------------------------

/PARTS OF A LANGUAGE
====================
In order to talk about a language, we must talk about
the alphabet, strings made up of symbols from the alphabet, and
rules governing how strings are formed for the language.

We will review the basic terms in discussing strings, languages,
sets of strings and relations.

/Strings
-------
[ In real life, strings may be words (identifiers) of a language or
  sentences (statements) of a language, depending on the context
  in which the term is used.]

alphabet = a finite set of symbols
         = E
         = an unordered set of symbols enclosed in curly brackets
           e.g. E = {a, b, c}

string = a finite sequence of 0 or more symbols from some alphabet
         e.g.
         When E is {a,b}
           aabb is a string on E
           babab is a string on E

In this class, we will use letters such as u,v,w,x,y,z to name strings.

|x| = length of string x
    = the number of symbols in x
      e.g. | aabb | is 4

/\ = an empty string where |/\| = 0

prefix of x = any number of leading symbols of x
              (includes /\ and x itself)
suffix of x = any number of trailing symbols of x
              (includes /\ and x itself)

##Inter2* write all the prefixes of abc
##        write all the suffixes of abc

concatenation of x and y = xy, where /\x and x/\ are x

##Inter3* write xy where x = aaa and y = bbb

palindromes = strings which read the same forward and backward

##Inter4* give an example of a palindrome using the alphabet {a,b,c}

/Sets of Strings
----------------
(Remember: E is alphabet; /\ is an empty string)

[ In real life, a set of strings may be a set of all
  identifiers/words in a language or a set of all
  sentences/statements in a language ]

Sets are enclosed in curly brackets { }
  e.g. {aaa, bbb, ccc} is a set of 3 strings.

E^* = the set of all strings over E
E^k = the set of all length k strings over E

##Inter5* E = {a, b}. What is E^* ? Describe.
##Inter6* E = {a, b}. What is E^2 ? Give the set.

E^0 = the set of all length 0 strings.
    = { /\ }   (/\ is the only string with length 0)
E^1 = E        (because each symbol is a string of length 1)

And E^+ = E^* - { /\ }   (i.e. all but the empty string)

##Inter7* Are E^* and E^+ infinite even when E is finite? Why?

Cardinality of a set is the number of members of the set.
  e.g. A = {a,b}    |A| = 2
  e.g. A = {a,b,c}  |A| = 3

/Languages
---------
language = a set of strings of symbols from some alphabet
           (i.e. a finite/infinite subset of E^*)
           following the rules of the language

  e.g. { } is a language with no strings
  e.g. { /\ } is a language with just the empty string
  e.g. a set of palindromes over E = {0, 1} is a language
       which is an infinite set.
       i.e. { /\, 0, 1, 00, 11, 010, 101, ...}
  e.g.
       a set of length 3 palindromes over E = {0, 1} is a language
       which is a finite set.
       i.e. {010, 101, 000, 111}
  e.g. English is a language which is a set of strings (sentences)
       from {a,..,z, numbers, punctuation marks}

##Inter8* Is English a finite set? Explain why or why not.

/Defining Sets
-------------
To define a set of strings, we don't have to list all the strings
in curly brackets (impossible to do this if the set is infinite);
we can specify a set by a

set former = { objects | restrictions }
  i.e. {x | P(x) }        a set of x's such that P(x) is true
       {x in A | P(x) }   a set of x's from A such that P(x) is true

  e.g. {i | i is an integer and there exists an integer j
            such that i = 2j}
       defines the set of all even integers.
  e.g. {u | u is a palindrome over {0, 1} }

##Inter9* define the set of all odd integers using a set former.

/Relating Sets
-------------
In the following A and B are sets.

A = B    A and B are equal
         i.e. they have exactly the same members

A ( B    A is a subset of B (every member of A is in B)
  -

A ( B    A is a subset of B but A does not equal B
         (proper subset)

A U B  = the union of A and B
         {x | x is in A or x is in B}
         e.g. all the words in English plus all the words in French

A ^ B  = the intersection of A and B
         e.g. all words common to English and French
##Inter10* complete {x | x is in A ?? } for A ^ B

A - B  = the set of A members minus the B members
         e.g. all words in English minus those from French
##Inter11* complete {x | x is in A ??} for A - B

A and B are disjoint if they do not share members.

_
A = Universe - A  (i.e. everything except what is in A)

Remember De Morgan's Law?
                    _____
If x is a member of A U B then x is not a member of A U B,
so x is not a member of A and x is not a member of B
                     _                      _
so, x is a member of A and x is a member of B
                       _   _
thus, x is a member of A ^ B

##Inter12*                           _   _                       _____
## Now show that if x is a member of A ^ B then x is a member of A U B

2^A = power set of A
    = set of all subsets of A including the empty set and A itself
      i.e.
      a set of sets.
      e.g. A = {a,b}
           2^A = { {}, {a}, {b}, {a,b}} with 4 members

Cardinality of 2^A = |2^A|
If A is finite then |2^A| = 2^|A|

End.
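
A small Python sketch of the power set definition above may help when checking answers by hand. It is only an illustration: the function name power_set is made up here and is not part of the course material. It lists every subset of A and checks |2^A| = 2^|A| for the example A = {a,b}.

```python
from itertools import combinations

def power_set(a):
    """Return 2^A as a list of frozensets, smallest subsets first.

    power_set is a hypothetical helper name used only for this sketch.
    """
    items = sorted(a)
    # For each size k from 0 to |A|, take every k-member subset of A.
    return [frozenset(c)
            for k in range(len(items) + 1)
            for c in combinations(items, k)]

A = {"a", "b"}
subsets = power_set(A)
print([sorted(s) for s in subsets])   # [[], ['a'], ['b'], ['a', 'b']]
print(len(subsets) == 2 ** len(A))    # True
```

Subsets are stored as frozensets because Python sets cannot contain ordinary (mutable) sets as members.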