Unit parsing

Description

This unit defines several classes to handle:

lexer definition: AParsedLanguage and its descendants
source scanning: AToken, AScanner and their descendants
parsing and statement handling: AParser, ASourceParser and their descendants
symbols and symbol tables: ASymbol, ASymbolTable and their descendants

All of this functionality routinely serves as the basis for my projects, and so I have now condensed what were several units into one for increased portability between projects.

Uses

classwork

Overview

Classes, Interfaces, Objects and Records

Name	Description
Class `AnOpcodeDictionaryEntry`	This class represents an opcode dictionary entry, which matches a single token string to its internal representation.
Class `AnOpcodeDictionary`	This class represents a dictionary of opcodes.
Class `ASyntaxRule`	This class represents a single syntax rule, which in its simplest form is a set of opcodes.
Class `ASyntaxRuleset`	This class represents a collection of syntax rules which, taken together, help to control how a parser processes source.
Class `AParsedLanguage`	This class serves as the basis for defining a language that will be parsed.
Record `TSymbolReference`	This type defines the location of a symbol: both its scope (the symbol table to which it belongs) and its index within that table.
Class `ASymbol`	This class represents a basic symbol, which may be an identifier (such as a variable or function name), a numeric constant, or a string literal.
Class `ASymbolFromSource`	This class represents a symbol that is parsed from a source file and which will be handled at once or written to an intermediate code file.
Class `ASymbolRecalled`	This class represents a symbol which was previously parsed and written to an intermediate code file.
Class `ASymbolTable`	This class represents a symbol table, which matches literal token values to an instance of ASymbol or one of its descendants.
Class `ASymbolVector`	This class represents a symbol vector, which matches symbols to an index.
Class `ASymbolTableVector`	This class represents a symbol table vector, which organizes instances of ASymbolTable into a linear array.
Class `ASymbolVectorVector`	This class represents a vector of symbol vectors; in other words, a collection of instances of ASymbolVector.
Class `AToken`	This class represents a basic token which is parsed from a source stream or retrieved from an intermediate code stream.
Class `ASymbolicToken`	This class represents a symbol that is entered into a symbol table.
Class `ALineEndingToken`	This class represents the end of a source line.
Class `AStreamEndingToken`	This class represents the end of a source stream.
Class `ATokenList`	This class represents a list of tokens that can be used as a sequential list or a stack.
Class `ATokenFromSource`	This class represents a token that is parsed from a source stream.
Class `AnIdentifierToken`	This token represents an identifier read from the source.
Class `ANumericConstantToken`	This class represents a numeric constant parsed from the source.
Class `AStringLiteralToken`	This class represents a string literal parsed from the source.
Class `ASpecialToken`	This class represents a special token, which is usually a delimiter or symbolic operator recognized by a parsed language.
Class `ASpaceToken`	This class represents whitespace encountered in the source.
Class `ALineEndingTokenFromSource`	This class represents the end of a source line.
Class `AnErrorToken`	This class represents an erroneous or unrecognized token.
Class `AStreamEndingTokenFromSource`	This class represents the end of a source stream.
Class `AScanner`	This class represents a scanner that is used to return tokens from a stream.
Class `ASourceScanner`	This class represents a scanner that is used to return tokens from a source code stream.
Class `ASourceInputStream`	This class is defined for convenience and need not be used, strictly speaking, since instances of ASourceScanner will happily accept any valid instance of ATextInputStream or its descendants.
Class `AParserNote`	This class represents a note that is logged by a parser to a given instance of ALog.
Class `AParserHint`	This class represents a hint that is logged by a parser to a given instance of ALog.
Class `AParserWarning`	This class represents a warning that is logged by a parser to a given instance of ALog.
Class `AParserSyntaxError`	This class represents a syntax error that is logged by a parser to a given instance of ALog.
Class `AParserFatalError`	This class represents a fatal error that is logged by a parser to a given instance of ALog.
Class `AParser`	This class represents a generic parser.
Class `AParsedLanguageParser`	This class represents a parser that is used to process source code of some kind using a parsed language definition.
Class `ASymbolParser`	This class represents a parser that processes a source file and enters any symbols found (variable names, function names, custom types, etc.) into its symbol tables.
Class `ASourceParser`	This class represents a parser that uses a language definition to parse an arbitrary source stream into intermediate code.

Types

ACharacterCategory = (...);

AParsedLanguageClass = class of AParsedLanguage;

TOpcode = longword;

TOpcodeList = array of TOpcode;

TScannerTokenBehavior = (...);

TScannerTokenBehaviors = set of TScannerTokenBehavior;

TSymbolScope = integer;

Constants

letkStringRepresentation = 'end of line (%d)';

LINE_ENDING_APPLE: string = #10;

LINE_ENDING_UNIX: string = #13;

LINE_ENDING_WINDOWS: string = #13#10;

parsFatalUnexpectedEOS = 'unexpected end of source';

parsSyntaxUnexpectedToken = 'unexpected %s in source';

plcsTypicalDigit = '0123456789';

plcsTypicalEndOfLine = #10#13;

plcsTypicalEndOfStream = #26;

plcsTypicalLetter = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
    'abcdefghijklmnopqrstuvwxyz_';

plcsTypicalNumeric = 'xXeE';

plcsTypicalWhitespace = #0#1#2#3#4#5#6#7#8#9#11#12#14#15#16#17#18#19#20 +
    #21#22#23#24#25#27#28#29#30#31#32#127;

plcsTypicalWord = '0123456789';

pmsgStringRepresentation = '%s: %d: %s';

RULE_BEGIN_STATEMENT = TOKCAT_RULE + 1;

RULE_END_STATEMENT = RULE_BEGIN_STATEMENT + 1;

RULE_NONE = TOKCAT_RULE + $FFFF;

scnrDefaultTokenBehaviors: TScannerTokenBehaviors = [
    SCAN_NO_WHITESPACE, SCAN_CONSOLIDATE_WHITESPACE
  ];

setkStringRepresentation = 'end of stream';

smtkStringRepresentation = '%s in scope %X: %d';

SYMCAT_EXTERNAL = $80000000;

SYMCAT_LITERAL = $00000002;

SYMCAT_PARAMETER = $00000006;

SYMCAT_STRUCTURE_MEMBER = $00000005;

SYMCAT_SUBROUTINE = $00000004;

SYMCAT_TYPE = $00000001;

SYMCAT_UNDEFINED = $00000000;

SYMCAT_USER = $00000100;

SYMCAT_VARIABLE = $00000003;

sympErrorDuplicateIdentifier = '%s duplicates symbol that was ' +
    'previously declared in "%s" on line %d';

SYMSCOPE_GLOBAL = 0;

SYMSCOPE_NONE = -1;

TOKCAT_CATMASK = $FFFF0000;

TOKCAT_DUMMY = 0;

TOKCAT_EOL = $000A0000;

TOKCAT_EOS = $001A0000;

TOKCAT_ERROR = $00FF0000;

TOKCAT_IDENTIFIER = $00010000;

TOKCAT_KEYWORD = $00020000;

TOKCAT_NUMBER = $00040000;

TOKCAT_RULE = $002A0000;

TOKCAT_SPACE = $00060000;

TOKCAT_SPECIAL = $00030000;

TOKCAT_STRING = $00050000;

toknStringRepresentation = '%s: %X';

Description

Types

ACharacterCategory = (...);

This type defines the character types that are recognized by a scanner for a parsed language.

Values

CHARCAT_DUMMY = 0: An unrecognized or erroneous character
CHARCAT_SPECIAL: A character that is not alphanumeric, but which is recognized by the language
CHARCAT_LETTER: A character that may begin an identifier or reserved word
CHARCAT_DIGIT: A character that may begin a numeric constant
CHARCAT_WORD: A character that may be part of an identifier or reserved word, but which may not begin it
CHARCAT_NUMERIC: A character that may be part of a numeric constant, but which may not begin it
CHARCAT_SPACE: A character that is recognized as white space, but which is not an indicator that the line or the stream has ended
CHARCAT_EOL: A character that may delimit the end of a source line
CHARCAT_EOS: A character that may delimit the end of a source stream
CHARCAT_ERROR: A character that is unrecognized

AParsedLanguageClass = class of AParsedLanguage;

This type refers to the class definition for all instances of AParsedLanguage and its descendants. It allows generic parsers to be defined which can accept any valid instance of AParsedLanguage. It also allows parsers for a given language to construct their own language definition instances, simplifying the steps required to parse a given language.

TOpcode = longword;

This type represents an opcode, which is a way of constructing a numeric representation of a token string that has special meaning to a parser. The internal representation of an opcode is faster for a computer to manage than the associated token string; as a result, almost all parsers use some kind of logic to match a token string to an internal representation.

This type is defined here to make parser code more flexible.

TOpcodeList = array of TOpcode;

This type represents a dynamic array of opcodes. It is primarily used by ASyntaxRule.

TScannerTokenBehavior = (...);

This type defines the ways in which instances of AScanner handle instances of certain tokens:

SCAN_NO_WHITESPACE: Discard whitespace tokens when they are encountered. When this behavior is enabled, AScanner.CurrentToken will never refer to an instance of ASpaceToken, and AScanner.next will read tokens from the source until one is encountered that is not determined to be whitespace.
SCAN_CONSOLIDATE_WHITESPACE: Consolidate consecutive instances of the same whitespace character into a single instance of ASpaceToken. When this behavior is enabled, consecutive instances of the same whitespace character in the source will be collected into one instance of ASpaceToken; otherwise, multiple instances of ASpaceToken will be returned. This flag has no effect if SCAN_NO_WHITESPACE is enabled.
SCAN_CONSOLIDATE_LINE_ENDINGS: Consolidate consecutive instances of the same line ending character into a single instance of ALineEndingToken. When this behavior is enabled, consecutive instances of the same line ending character in the source will be collected into one instance of ALineEndingToken; otherwise, multiple instances of ALineEndingToken will be returned.

Values

SCAN_NO_WHITESPACE: Discard whitespace tokens
SCAN_CONSOLIDATE_WHITESPACE: Consolidate whitespace into a single token
SCAN_CONSOLIDATE_LINE_ENDINGS: Consolidate line endings into a single token

TScannerTokenBehaviors = set of TScannerTokenBehavior;

This type defines a set of one or more token behaviors. For more information on the behaviors available, see TScannerTokenBehavior.
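As a rough illustration of how such a set might be built up, the sketch below re-declares the enumeration and the set type locally so that it compiles on its own. The order of the enumeration values is an assumption, and the program only inspects the set; in real code the result would presumably be handed to the scanner, whose method signatures are not shown in this section.

```pascal
program ScannerBehaviorsSketch;

{$mode objfpc}{$H+}

type
  // Local re-declarations so the sketch compiles without the parsing unit;
  // the order of the enumeration values is an assumption.
  TScannerTokenBehavior = (
    SCAN_NO_WHITESPACE,
    SCAN_CONSOLIDATE_WHITESPACE,
    SCAN_CONSOLIDATE_LINE_ENDINGS
  );
  TScannerTokenBehaviors = set of TScannerTokenBehavior;

const
  // Same members as the documented default
  scnrDefaultTokenBehaviors: TScannerTokenBehaviors = [
    SCAN_NO_WHITESPACE, SCAN_CONSOLIDATE_WHITESPACE
  ];

var
  behaviors: TScannerTokenBehaviors;

begin
  // Start from the defaults, keep whitespace tokens, merge line endings
  behaviors := scnrDefaultTokenBehaviors;
  Exclude(behaviors, SCAN_NO_WHITESPACE);
  Include(behaviors, SCAN_CONSOLIDATE_LINE_ENDINGS);

  writeln('whitespace discarded:      ', SCAN_NO_WHITESPACE in behaviors);
  writeln('whitespace consolidated:   ', SCAN_CONSOLIDATE_WHITESPACE in behaviors);
  writeln('line endings consolidated: ', SCAN_CONSOLIDATE_LINE_ENDINGS in behaviors);
end.
```

A configuration like this one would suit a language in which whitespace is significant, since whitespace tokens are kept but runs of the same character arrive as a single token.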

TSymbolScope = integer;

This type represents the scope of a symbol, which is a way of indicating the symbol table to which it belongs. Generally speaking, each program block has its own scope.

This type is defined to make parser code more flexible.

Constants

letkStringRepresentation = 'end of line (%d)';

This constant defines how a string representation of ALineEndingToken is constructed when ALineEndingToken.toString is called.

The integer placeholder is filled with the number of line endings represented by the token, as retrieved by a call to ALineEndingToken.lineCount.

LINE_ENDING_APPLE: string = #10;

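This constant defines the character which makes up a typical Apple-style line ending. As declared, it is the line feed (#10), which matches modern macOS; classic Mac OS used the carriage return (#13). It can be used by language definitions to specify those characters which are considered line endings.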

LINE_ENDING_UNIX: string = #13;

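This constant defines the character which makes up a typical Unix-style line ending. Note that the value declared here, #13, is the carriage return; Unix-like systems conventionally end lines with the line feed (#10). It can be used by language definitions to specify those characters which are considered line endings.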

LINE_ENDING_WINDOWS: string = #13#10;

This constant defines the sequence of characters which makes up typical Windows-style line endings. It is used in the tokenizer source and can also be used by language definitions to specify those characters which are considered line endings.

parsFatalUnexpectedEOS = 'unexpected end of source';

This string defines the error message and format used when the end of a stream is encountered unexpectedly.

As defined here, the error message requires no additional parameters.

parsSyntaxUnexpectedToken = 'unexpected %s in source';

This constant defines the format of the error message logged by instances of AParser when AParser.resyncToToken and AParser.resyncTo encounter an unexpected token in the source.

The string placeholder is filled with a string representation of the unexpected token, as returned by a call to AToken.toString.

plcsTypicalDigit = '0123456789';

This string defines the characters which are typically allowed to begin numeric constants in various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalEndOfLine = #10#13;

This string defines the characters which are typically counted as line ending characters by various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalEndOfStream = #26;

This string defines the characters which are typically used to mark the end of a stream by various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalLetter = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
    'abcdefghijklmnopqrstuvwxyz_';

This string defines the characters which are typically allowed to begin identifiers in various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalNumeric = 'xXeE';

This string defines the characters which are typically allowed to be part of a numeric constant, though they may not begin it, by various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalWhitespace = #0#1#2#3#4#5#6#7#8#9#11#12#14#15#16#17#18#19#20 +
    #21#22#23#24#25#27#28#29#30#31#32#127;

This string defines the characters which are typically counted as whitespace by various languages. It is defined here for convenience, so that other language specifications may reference it.

plcsTypicalWord = '0123456789';

This string defines the characters which are typically allowed to be part of an identifier, though they may not begin it, in various languages. It is defined here for convenience, so that other language specifications may reference it.
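The character-set strings above lend themselves to a simple lookup when deciding how a character should be categorized. The following is only a sketch of that idea, not the unit's scanner: `ASketchCategory` and `categoryOf` are hypothetical names, the constants are re-declared as typed strings so the program stands alone, and the "word" and "numeric" categories are left out because they depend on what the scanner has already read (for example, `e` is a letter at the start of a token but may continue a number).

```pascal
program CharacterCategorySketch;

{$mode objfpc}{$H+}

type
  // Reduced, hypothetical stand-in for ACharacterCategory
  ASketchCategory = (SKETCH_ERROR, SKETCH_LETTER, SKETCH_DIGIT,
    SKETCH_SPACE, SKETCH_EOL, SKETCH_EOS);

const
  // Values copied from the documented constants; typed as strings here
  // so that every call to Pos resolves the same way
  plcsTypicalDigit: string = '0123456789';
  plcsTypicalEndOfLine: string = #10#13;
  plcsTypicalEndOfStream: string = #26;
  plcsTypicalLetter: string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
    'abcdefghijklmnopqrstuvwxyz_';
  plcsTypicalWhitespace: string =
    #0#1#2#3#4#5#6#7#8#9#11#12#14#15#16#17#18#19#20 +
    #21#22#23#24#25#27#28#29#30#31#32#127;

  categoryNames: array[ASketchCategory] of string = (
    'error', 'letter', 'digit', 'space', 'end of line', 'end of stream'
  );

// Return the first category whose character set contains ch
function categoryOf(const ch: char): ASketchCategory;
begin
  if Pos(ch, plcsTypicalLetter) > 0 then
    Result := SKETCH_LETTER
  else if Pos(ch, plcsTypicalDigit) > 0 then
    Result := SKETCH_DIGIT
  else if Pos(ch, plcsTypicalWhitespace) > 0 then
    Result := SKETCH_SPACE
  else if Pos(ch, plcsTypicalEndOfLine) > 0 then
    Result := SKETCH_EOL
  else if Pos(ch, plcsTypicalEndOfStream) > 0 then
    Result := SKETCH_EOS
  else
    Result := SKETCH_ERROR;
end;

const
  sample: string = 'x1 ' + #10 + '@';

var
  i: integer;

begin
  // Print the character code and category of each sample character
  for i := 1 to Length(sample) do
    writeln(Ord(sample[i]), ': ', categoryNames[categoryOf(sample[i])]);
end.
```

A language definition with different rules would substitute its own strings for the typical ones.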

pmsgStringRepresentation = '%s: %d: %s';

This constant determines the format of the string returned by calls to AParserNote.toString, AParserHint.toString, AParserWarning.toString, AParserSyntaxError.toString, and AParserFatalError.toString when there is a valid (named) source.

The first string placeholder is the name of the source stream being parsed, as determined by a call to AStream.name. If the source stream represents a file stream, then this will contain the name of the file being parsed. The integer placeholder will be filled with the current line number in the source stream, as determined by a call to AScanner.lineNumber. The second string placeholder is filled with the error or status message, which is constructed by calling the inherited toString method for each of the above classes: ALoggedNote.toString, ALoggedHint.toString, ALoggedWarning.toString, ALoggedError.toString, or ALoggedFatalError.toString.
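To make the placeholder order concrete, here is a small sketch that fills the two format strings with `SysUtils.Format`. The source name, line number and token text are invented values standing in for what AStream.name, AScanner.lineNumber and AToken.toString would supply; how the parser classes actually assemble the message is not shown in this section.

```pascal
program ParserMessageSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Copied from the documented constants
  pmsgStringRepresentation = '%s: %d: %s';
  parsSyntaxUnexpectedToken = 'unexpected %s in source';

var
  detail: string;

begin
  // Hypothetical token text, as AToken.toString might return it
  detail := Format(parsSyntaxUnexpectedToken, ['";"']);

  // Hypothetical source name and line number
  writeln(Format(pmsgStringRepresentation, ['example.src', 12, detail]));
  // prints: example.src: 12: unexpected ";" in source
end.
```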

RULE_BEGIN_STATEMENT = TOKCAT_RULE + 1;

Rules in ASyntaxRuleset can be retrieved by an arbitrary name or value; however, it is common to define a symbolic constant for each rule so that the rule name or value does not have to be hard-coded. In addition, there are some basic rules that will probably be common across all types of parsers: rules defining basic token types and rules indicating which tokens are allowed to begin and end a statement.

This constant is used to define the tokens that may begin a statement. It is not automatically used by the base instance of AParsedLanguage, but is provided here as a way to standardize the way in which one of the more fundamental rules of a parsed language is defined.

RULE_END_STATEMENT = RULE_BEGIN_STATEMENT + 1;

This constant is used to define the tokens that may end a statement. It is not automatically used by the base instance of AParsedLanguage, but is provided here as a way to standardize the way in which one of the more fundamental rules of a parsed language is defined.

RULE_NONE = TOKCAT_RULE + $FFFF;

Rules in ASyntaxRuleset can be retrieved by an arbitrary name or value; however, it is common to define a symbolic constant for each rule so that the rule name or value does not have to be hard-coded. In addition, there are some basic rules that will probably be common across all types of parsers: rules defining basic token types and indicating which tokens are allowed to begin and end a statement.

This constant is used to define an invalid rule (no rule).
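One way a language definition might continue the numbering is sketched below. `RULE_EXPRESSION`, `RULE_ASSIGNMENT` and `isRule` are hypothetical names introduced for illustration; only the first five constants come from this unit. The sketch simply shows that every value between TOKCAT_RULE and RULE_NONE stays in the rule category.

```pascal
program RuleOpcodeSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TOpcode = longword;

const
  // Copied from the documented constants
  TOKCAT_CATMASK = $FFFF0000;
  TOKCAT_RULE = $002A0000;
  RULE_BEGIN_STATEMENT = TOKCAT_RULE + 1;
  RULE_END_STATEMENT = RULE_BEGIN_STATEMENT + 1;
  RULE_NONE = TOKCAT_RULE + $FFFF;

  // Hypothetical language-specific rules, numbered after the built-in ones
  RULE_EXPRESSION = RULE_END_STATEMENT + 1;
  RULE_ASSIGNMENT = RULE_EXPRESSION + 1;

// True if the opcode refers to a syntax rule rather than a token
function isRule(const opcode: TOpcode): boolean;
begin
  Result := (opcode and TOKCAT_CATMASK) = TOKCAT_RULE;
end;

begin
  writeln('RULE_BEGIN_STATEMENT = $', IntToHex(RULE_BEGIN_STATEMENT, 8));
  writeln('RULE_ASSIGNMENT      = $', IntToHex(RULE_ASSIGNMENT, 8));
  writeln('RULE_NONE            = $', IntToHex(RULE_NONE, 8));
  writeln('RULE_NONE is a rule opcode: ', isRule(RULE_NONE));
end.
```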

scnrDefaultTokenBehaviors: TScannerTokenBehaviors = [
    SCAN_NO_WHITESPACE, SCAN_CONSOLIDATE_WHITESPACE
  ];

This constant defines the default handling flags used by instances of AScanner when whitespace and line ending tokens are encountered.

The default behavior is to ignore whitespace and to consolidate it into a single token if it is not ignored. To set the behavior of a scanner, call AScanner.setTokenBehaviors.

setkStringRepresentation = 'end of stream';

This constant defines how a string representation of AStreamEndingToken is constructed when AStreamEndingToken.toString is called.

As defined here, the format requires no additional parameters.

smtkStringRepresentation = '%s in scope %X: %d';

This constant defines how a string representation of ASymbolicToken is constructed when ASymbolicToken.toString is called.

The string placeholder will be filled with the display name of the class, as returned by a call to APrintingObject.displayName. The first and second integer placeholders are filled with the scope and index of the symbol represented by the token, as returned by a call to ASymbolicToken.symbol.

SYMCAT_EXTERNAL = $80000000;

This flag indicates that a symbol has been declared as external to the source. It is designed to be combined with another symbol category code, such as SYMCAT_SUBROUTINE (to indicate an external function) or SYMCAT_VARIABLE (to indicate an external variable).
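The combination described above can be sketched with ordinary bitwise operators. Note that the base category codes are consecutive numbers rather than independent bits (SYMCAT_VARIABLE is 3), so the sketch clears the external flag and then compares for equality. `isExternal` and `baseCategory` are hypothetical helpers, and storing the category in a longword is an assumption.

```pascal
program SymbolCategorySketch;

{$mode objfpc}{$H+}

const
  // Copied from the documented constants
  SYMCAT_EXTERNAL = $80000000;
  SYMCAT_SUBROUTINE = $00000004;
  SYMCAT_VARIABLE = $00000003;

// True if the external flag is set in the category code
function isExternal(const category: longword): boolean;
begin
  Result := (category and SYMCAT_EXTERNAL) <> 0;
end;

// Strip the external flag, leaving the base category code
function baseCategory(const category: longword): longword;
begin
  Result := category and not longword(SYMCAT_EXTERNAL);
end;

var
  category: longword;

begin
  // An external function: base category plus the external flag
  category := SYMCAT_SUBROUTINE or SYMCAT_EXTERNAL;

  writeln('external:      ', isExternal(category));
  writeln('is subroutine: ', baseCategory(category) = SYMCAT_SUBROUTINE);
  writeln('is variable:   ', baseCategory(category) = SYMCAT_VARIABLE);
end.
```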

SYMCAT_LITERAL = $00000002;

This constant enumerates one of the ways in which a symbol may be defined in a source stream. The symbol category constants are not directly used by the base implementations of ASymbol, ASymbolFromSource, or ASymbolRecalled, except when they read and write themselves using a binary stream, but they are provided to help standardize the ways in which symbols are represented across various parser implementations.

This constant represents a symbol that is declared as a constant value: either a numeric constant or a string literal.

SYMCAT_PARAMETER = $00000006;

This constant represents a symbol that is declared as a parameter accepted by a subroutine: one of the variables that may be passed to a subroutine.

SYMCAT_STRUCTURE_MEMBER = $00000005;

This constant represents a symbol that is declared as a member of a structured type: a record, class, or other memory structure.

SYMCAT_SUBROUTINE = $00000004;

This constant represents a symbol that is declared as a subroutine: a function, procedure, or other labeled block of code.

SYMCAT_TYPE = $00000001;

This constant represents a symbol that is declared as a data type.

SYMCAT_UNDEFINED = $00000000;

This constant represents a symbol that was encountered in the source stream, but which has no definition (an undeclared identifier of some kind). This is the default value for ASymbol.category that is set by ASymbol.init, although descendant classes may modify this value.

SYMCAT_USER = $00000100;

This constant is a placeholder for user-defined symbol categories. User-defined symbol categories should begin with this value. Values lower than this constant are reserved in case there is a need for future expansion of the base parsing library.

SYMCAT_VARIABLE = $00000003;

This constant represents a symbol that is declared as a variable: an instance of a known type.

sympErrorDuplicateIdentifier = '%s duplicates symbol that was ' +
    'previously declared in "%s" on line %d';

This string controls the format of the error message output when there is an attempt to define a symbol that has the same name as one which has already been defined in the current scope. This message is output as a syntax error by ASymbolParser.EnterSymbolInto.

The first string placeholder is filled with the name of the symbol, as returned by a call to ASymbol.toString. The second string placeholder is filled with the name of the source stream in which the symbol was declared, as returned by a call to ASymbolFromSource.sourceName. The integer placeholder is filled with the line number at which the symbol was declared, as returned by a call to ASymbolFromSource.sourceLine.

SYMSCOPE_GLOBAL = 0;

This constant enumerates one of the more common symbol table scopes.

This constant represents a symbol that belongs to the global symbol table (i.e., which is visible from everywhere in the source program).

SYMSCOPE_NONE = -1;

This constant enumerates one of the more common symbol table scopes.

This constant represents a symbol that has no scope (belongs to no symbol table). This is the default value for ASymbol.scope that is set by ASymbol.init, although descendant classes may modify this value.

TOKCAT_CATMASK = $FFFF0000;

Each of these constants defines a category of token that is recognized by a scanner for a parsed language. The category code resides in the high word of a TOpcode, while the specific internal representation of that token occupies the low word. It is possible to mask out the internal representation using TOKCAT_CATMASK, in order to quickly determine the type of token represented by a specific opcode.

This constant is used to mask out all but the bits which identify the category into which the opcode falls. One simply ANDs the TOpcode with this value; the resulting value will be one of the category constants.
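A minimal sketch of the masking step follows. The opcodes `KW_BEGIN`, `SP_SEMICOLON` and `ID_FIRST` are made up for illustration, as is the `describe` helper; the category and mask values are the ones documented in this section.

```pascal
program OpcodeCategorySketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TOpcode = longword;

const
  // Copied from the documented constants
  TOKCAT_CATMASK = $FFFF0000;
  TOKCAT_IDENTIFIER = $00010000;
  TOKCAT_KEYWORD = $00020000;
  TOKCAT_SPECIAL = $00030000;

  // Hypothetical opcodes for a made-up language
  KW_BEGIN = TOKCAT_KEYWORD + 1;
  SP_SEMICOLON = TOKCAT_SPECIAL + 1;
  ID_FIRST = TOKCAT_IDENTIFIER + 7;

// Name the category in the high word and show the value in the low word
function describe(const opcode: TOpcode): string;
begin
  case TOpcode(opcode and TOKCAT_CATMASK) of
    TOKCAT_IDENTIFIER: Result := 'identifier';
    TOKCAT_KEYWORD: Result := 'keyword';
    TOKCAT_SPECIAL: Result := 'special';
  else
    Result := 'other';
  end;
  Result := Result + ' #' + IntToStr(opcode and $0000FFFF);
end;

begin
  writeln(describe(KW_BEGIN));
  writeln(describe(SP_SEMICOLON));
  writeln(describe(ID_FIRST));
end.
```

Because the category occupies the high word, a parser can branch on the token type with a single AND and comparison before looking at the specific token.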

TOKCAT_DUMMY = 0;

This constant is used to define an invalid, "null" token – one that has not been properly initialized. This is the initial state of all base instances of AToken.

TOKCAT_EOL = $000A0000;

This constant is used to define a sequence of at least one character that represents the end of a line in the source being parsed. If the token has been processed by the base instance of ALineEndingTokenFromSource, then the token text may contain one or two characters, depending upon whether the source file has Windows, Apple, or Unix/Linux-based line endings.

TOKCAT_EOS = $001A0000;

This constant is used to define a token that represents the end of the source stream. Depending on the language being parsed, there may be an actual sequence of characters that denotes the end of the source; it is more likely, however, that a scanner will return this type of token when the stream being scanned indicates that the end has been reached.

TOKCAT_ERROR = $00FF0000;

This constant is used to define an unrecognized or erroneous token. The base instances of ANumericConstantToken will place themselves in this category if the number failed to meet basic checks of validity; this allows the parser to complain that something in the source looks like a number or string constant, but is not actually one.

TOKCAT_IDENTIFIER = $00010000;

This constant is used to define an identifier, which is a series of alphanumeric characters that does not have a pre-defined meaning in the syntax of the language being parsed. Identifiers usually represent variables or routines within the source being parsed.

TOKCAT_KEYWORD = $00020000;

This constant is used to define a keyword, which is a series of alphanumeric characters that has a pre-defined meaning in the syntax of the language being parsed. Keywords can represent language constructs, such as a conditional statement, or reserved method names.

TOKCAT_NUMBER = $00040000;

This constant is used to define a valid numeric constant. If the token has been parsed by the base instance of ANumericConstantToken, and the token opcode contains this category, then it means that the sequence of characters contained by the token evaluates to a valid number.

TOKCAT_RULE = $002A0000;

Each of these constants defines a category of token that is recognized by a scanner for a parsed language. The category code resides in the high word of a TOpcode, while the specific internal representation of that token occupies the low word. It is possible to mask out the internal representation using TOKCAT_CATMASK, in order to quickly determine the type of token represented by a specific opcode.

This constant is used to define a rule, thus allowing instances of ASyntaxRule to contain specific tokens as well as references to other rules.

TOKCAT_SPACE = $00060000;

This constant is used to define a token that is recognized as whitespace. Note that end-of-line and end-of-stream markers are not included in this category; these have their own token categories: TOKCAT_EOL and TOKCAT_EOS, respectively. It is up to the scanner or parser to determine whether or not a whitespace token will be processed or ignored; some languages, like Python, find whitespace significant.

TOKCAT_SPECIAL = $00030000;

This constant is used to define a special token, which may consist of one or more characters with pre-defined meaning in the syntax of the language. Special tokens often represent operators that are represented with a symbol, or delimiters that separate identifiers or which indicate the continuation (or end) of a line.

TOKCAT_STRING = $00050000;

This constant is used to define a valid string literal; however, strings are NOT parsed by the default scanner and tokenizer methods defined within parsing. This is because there are a variety of ways to define a string (such as in Python, where there are strings and docstrings), so it is left to the parser to process the string. However, the token code remains valid and can be output into an intermediate code stream.

toknStringRepresentation = '%s: %X';

This constant defines how a string representation of AToken is constructed when AToken.toString is called.

The string placeholder will be filled with the display name of the class, as returned by a call to APrintingObject.displayName. The integer placeholder will be filled with the value of the token opcode, as returned by a call to AToken.opcode.

Generated by PasDoc 0.13.0 on 2015-01-10 17:13:18

causerie
