Include Stdio, String, Stdlib, And Void Return


Implement a lexical analyzer that reads a C program from a file, removes comments (delimited by /* ... */), and generates four symbol tables: the KEYWORD table, the IDENTIFIER table, the NUMBER table, and the TOKEN table. The KEYWORD table should contain all keywords defined in Louden, including special symbols, each with an associated index. The IDENTIFIER table should include all user-defined identifiers, each with a unique index. The NUMBER table should include all integers and floats used in the program, with each number assigned an attribute indicating whether it is an integer or a float. The TOKEN table should record every token generated, with its class index and associated value. The program should then output the original program with comments stripped, along with all four symbol tables. Read input from a file and write output to a file, following the lexical conventions of C- as given in Louden.

Paper for the Above Instruction

Lexical analysis is the first phase of a compiler: it reads the source text and converts it into a sequence of tokens for the parser. For C-, the subset of C defined by Louden, this means recognizing keywords, identifiers, numeric constants, and special symbols (operators and delimiters), and discarding comments along the way. This paper describes a lexical analyzer that reads its input from a file, removes comments, and writes the comment-free program to an output file together with four symbol tables: keywords, identifiers, numbers, and tokens.

The lexical analyzer operates in stages. First, it reads the source code from an input file. As it processes the code, it strips comments delimited by /* ... */. A comment may span several lines, and under Louden's C- conventions comments do not nest, so a comment ends at the first */ that follows its opening /*. Once comments are removed, the source is tokenized by scanning it character by character. The tokenizer classifies each character sequence according to the definitions given by Louden: keywords such as 'if', 'else', and 'int', identifiers (user-defined names), numeric constants, and operators and delimiters.
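The comment-stripping stage can be sketched as a two-state scan over the source text. This is a minimal illustration, not the assignment's reference solution: the function name strip_comments, the choice to replace each comment with a single space, and the choice to keep line breaks inside comments (so that line numbers still match the original file) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst with every block comment removed.
 * dst must hold at least strlen(src) + 1 bytes (output is never longer).
 * Returns 0 on success, -1 if a comment is still open at end of input.
 * Comments do not nest: the first close delimiter ends the comment. */
int strip_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    int in_comment = 0;
    size_t i = 0, j = 0;

    while (src[i] != '\0') {
        if (!in_comment && src[i] == '/' && src[i + 1] == '*') {
            in_comment = 1;
            dst[j++] = ' ';          /* assumption: a comment acts as white space */
            i += 2;
        } else if (in_comment && src[i] == '*' && src[i + 1] == '/') {
            in_comment = 0;
            i += 2;
        } else {
            if (!in_comment)
                dst[j++] = src[i];   /* ordinary program text */
            else if (src[i] == '\n')
                dst[j++] = '\n';     /* assumption: keep line structure */
            i++;
        }
    }
    dst[j] = '\0';
    return in_comment ? -1 : 0;
}
```

Because an opening delimiter seen inside a comment is treated as ordinary comment text, the input "/* a /* b */c" yields " c" rather than an error, which matches the no-nesting rule.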

The program maintains four primary symbol tables:

  • KEYWORD table: Stores all language keywords, each associated with a predefined index. Keywords include 'if', 'else', 'int', 'void', etc., and special symbols as per Louden.
  • IDENTIFIER table: Stores all unique user-defined identifiers encountered during tokenization. Each identifier receives a unique index.
  • NUMBER table: Registers all numeric constants. Numbers are classified as integers or floats, with corresponding attributes.
  • TOKEN table: Records each token as it is generated, with class index (referring to one of the above tables) and its token-specific value (e.g., the actual identifier name or number).
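One possible layout for the four tables is sketched below. The type names, the fixed capacities, and the helper functions keyword_index and intern_identifier are hypothetical. The keyword and special-symbol list reflects a reading of Louden's C- conventions and should be checked against the text; the comment delimiters are left out because comments are removed before tokenization.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 256  /* hypothetical table capacity */
#define MAX_LEXEME   32  /* hypothetical longest lexeme, including '\0' */

enum TokenClass { CLASS_KEYWORD, CLASS_IDENTIFIER, CLASS_NUMBER };
enum NumberKind { NUM_INT, NUM_FLOAT };

/* KEYWORD table: fixed in advance; an entry's position is its index. */
static const char *const KEYWORDS[] = {
    "else", "if", "int", "return", "void", "while",
    "+", "-", "*", "/", "<", "<=", ">", ">=", "==", "!=", "=",
    ";", ",", "(", ")", "[", "]", "{", "}"
};
#define KEYWORD_COUNT ((int)(sizeof KEYWORDS / sizeof KEYWORDS[0]))

struct IdTable  { char names[MAX_ENTRIES][MAX_LEXEME]; int count; };
struct NumEntry { char text[MAX_LEXEME]; enum NumberKind kind; };
struct NumTable { struct NumEntry items[MAX_ENTRIES]; int count; };
struct Token    { enum TokenClass cls; int index; }; /* index into the table named by cls */

/* Return the KEYWORD table index of lexeme, or -1 if it is not listed. */
int keyword_index(const char *lexeme)
{
    assert(lexeme != NULL);
    for (int i = 0; i < KEYWORD_COUNT; i++)
        if (strcmp(KEYWORDS[i], lexeme) == 0)
            return i;
    return -1;
}

/* Return the index of name, adding it on first sight.
 * Returns -1 if the table is full or the name is too long to store. */
int intern_identifier(struct IdTable *t, const char *name)
{
    assert(t != NULL && name != NULL);
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return i;
    if (t->count == MAX_ENTRIES || strlen(name) >= MAX_LEXEME)
        return -1;
    strcpy(t->names[t->count], name);
    return t->count++;
}
```

Storing only a class and an index in each token keeps the TOKEN table small; the lexeme itself can always be recovered from the table that the class names. A linear search is used for brevity and would be replaced by a hash table for large inputs.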

The scanner reads the source one character at a time and ends a token at whitespace or at a special symbol. Each alphabetic lexeme is looked up in the KEYWORD table: if it is found, the token records the keyword's index; otherwise the lexeme is entered in the IDENTIFIER table, or located there if it has been seen before. A lexeme that begins with a digit is a number, classified as a float if it contains a decimal point and as an integer otherwise. Every token is appended to the TOKEN table in source order so that later phases see the tokens in sequence.
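The classification step described above might look like the following sketch. The names next_lexeme and LexKind are hypothetical. The sketch assumes that identifiers consist of letters only and that a float is a run of digits, a decimal point, and at least one more digit; both assumptions should be checked against the conventions the assignment actually requires.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum LexKind { LEX_EOF, LEX_WORD, LEX_INT, LEX_FLOAT, LEX_SYMBOL };

/* Append c to out if room remains; longer lexemes are silently truncated. */
static void put(char *out, size_t cap, size_t *n, char c)
{
    if (*n + 1 < cap)
        out[(*n)++] = c;
}

/* Read one lexeme starting at *cursor into out and advance *cursor past it.
 * LEX_WORD still has to be looked up to decide keyword vs. identifier. */
enum LexKind next_lexeme(const char **cursor, char *out, size_t cap)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && cap >= 3);
    const char *p = *cursor;
    size_t n = 0;
    enum LexKind kind;

    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        kind = LEX_EOF;
    } else if (isalpha((unsigned char)*p)) {
        kind = LEX_WORD;
        while (isalpha((unsigned char)*p))
            put(out, cap, &n, *p++);
    } else if (isdigit((unsigned char)*p)) {
        kind = LEX_INT;
        while (isdigit((unsigned char)*p))
            put(out, cap, &n, *p++);
        if (*p == '.' && isdigit((unsigned char)p[1])) {
            kind = LEX_FLOAT;            /* a decimal point marks a float */
            put(out, cap, &n, *p++);
            while (isdigit((unsigned char)*p))
                put(out, cap, &n, *p++);
        }
    } else {
        kind = LEX_SYMBOL;
        put(out, cap, &n, *p++);
        /* two-character symbols: <= >= == != */
        if (*p == '=' && strchr("<>=!", out[0]) != NULL)
            put(out, cap, &n, *p++);
    }
    out[n] = '\0';
    *cursor = p;
    return kind;
}
```

Calling next_lexeme in a loop until it returns LEX_EOF yields the lexemes in source order. A symbol that is not in the KEYWORD table, such as a lone '!', would be reported as a lexical error by the caller.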

After tokenization, the program writes the source code with all comments removed to the specified output file, followed by the four symbol tables with their indices and entries. This output serves later phases such as syntax checking and semantic analysis. Taken together, the design covers the three requirements of the task: accurate comment removal, complete token classification, and organized symbol table management, following the lexical conventions given by Louden.
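As a small illustration of the output step, the sketch below formats one TOKEN table row into a caller-supplied buffer. The function name format_token_row and the column widths are assumptions made for the example; the assignment does not prescribe a layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format one TOKEN table row as "seq class index value\n" in aligned columns.
 * Returns the number of characters written (excluding '\0'),
 * or -1 if the row does not fit in buf. */
int format_token_row(char *buf, size_t cap, int seq,
                     const char *cls, int index, const char *value)
{
    assert(buf != NULL && cls != NULL && value != NULL);
    int n = snprintf(buf, cap, "%-4d %-10s %-5d %s\n", seq, cls, index, value);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Each formatted row can then be written with fputs to the output file opened with fopen, after the comment-free program text has been written. Checking the return value avoids emitting a silently truncated row.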

References

  • Louden, K. (2014). Compiler Construction: Principles and Practice. Cengage Learning.