Atari BASIC - Description - The Tokenizer

The Tokenizer

Like most BASIC interpreters, Atari BASIC uses a token structure to handle lexical processing for better performance and reduced memory size. The tokenizer converts lines using a small buffer in memory, and the program is stored as a parse tree. The token output buffer (addressed by a pointer at LOMEM – 80, 8116) is 1 page (256 bytes) long, and any tokenized statement that is larger than the buffer will generate a BASIC error (14 – line too long). Indeed, the syntax checking described in the "Program editing" section is a side effect of converting each line into a tokenized form before it is stored. Sinclair BASIC uses a similar approach, though it varies between models.

The output from the tokenizer is then moved into more permanent storage in various locations in memory. A set of pointers (addresses) indicates these locations: variables are stored in the variable name table (pointed to at VNTP – 82, 8316) and the values are stored in the variable value table (pointed to at VVTP – 86, 8716). Strings have their own area (pointed to at STARP – 8C, 8D16) as does the runtime stack (pointed to at RUNSTK – 8E, 8F16) used to store the line numbers of looping statements (FOR...NEXT) and subroutines (GOSUB...RETURN). Finally, the end of BASIC memory usage is indicated by an address stored at MEMTOP – 90, 9116) pointer.

By indirecting the variable names in this way, a reference to a variable needs only two bytes to address its entry into the appropriate table; the whole name does not need to be stored each time. This also makes variable renaming relatively trivial if the program is in storage, as it is simply a case of changing the single instance of its name in the table and the only difficulty is if the name changes length (and even then, only if it gets longer): indeed, obfuscated code can be produced for a finished program by renaming variables in the name tables – possibly all to the same name. This doesn't confuse the interpreter since internally it is using the index values not the names. Of course, new code will be difficult to add because the tokenizer has to search the name table to find a variable's index, and can get confused if names are not unique (though it is OK to have names in both the "string" and "variable" namespaces, e.g. HELLO = 10 and HELLO$ = "WORLD", because they have separate tables, that is to say, separate namespaces.)

Atari BASIC uses a unique way to recognize abbreviated reserved words. In Microsoft BASIC there are a few predefined short forms, (like ? for PRINT and ' for REM). Atari BASIC allows any keyword to be abbreviated using a period, at any point in writing it. So L. will be expanded to LIST, as will LI. and (redundantly) LIS.. To expand an abbreviation the tokenizer will search through its list of reserved words and find the first that matches the portion supplied. To improve the chance of a programmer's correctly guessing an abbreviation, to save typing, and to improve the speed of the lookup, the list of reserved words is sorted to place the more-commonly used commands first. REM is at the very top, and can be typed in just as .. This also speeds lexical analysis generally, since although the time to search is in theory proportional to the length of the list, in practice it will find common keywords very quickly, to the extent that good programmers know when a line is syntactically incorrect even before the parser says so, because the time taken to search the list to find it wanting gives an indication that something is wrong.

So whereas Microsoft BASIC uses separate tokens for its few short forms, ATARI BASIC has only one token for each reserved word – when the program is later LISTed it will always write out the full words (since only one token represents all possible forms, it can do no other). There are two exceptions to this: PRINT has a synonym, ?, and LET has a synonym which is the empty string (so 10 LET A = 10 and 10 A = 10 mean the same thing). These are separate tokens, and so will remain as such in the program listing.

Some other contemporary BASICs have variants of keywords that include spaces (for example GO TO). Atari BASIC does not. The main exception here is keywords for communicating with peripherals (see the "Input/Output" section, below) such as OPEN # and PRINT #; it rarely occurs to many programmers that the " #" is actually part of the tokenized keyword and not a separate symbol; and that for example "PRINT" and "PRINT #0" are the very same thing, just presented differently. It may be that the BASIC programmers kept the form # to conform with other BASICs (The syntax derives from Fortran), though it is entirely unnecessary, and probably a hindrance, for tokenizing, and is not used in other languages for the Atari 8-bit family.

Expanding tokens in the listing can cause problems when editing. The Atari line input buffer is three lines (120 characters); up to three "physical lines" make one "logical line". After that a new "logical line" is automatically created. This doesn't matter much for output but it does for input, because the operating system will not return characters to the tokenizer after the third line, treating them as the start of a new "logical line". (The operating system keeps track of the mapping between physical and logical lines as they are inserted and deleted; in particular it marks each physical line with a flag for being a continuation line or a new logical line.) But using abbreviations when typing in a line can result, once they have been expanded on output, to a line that is longer than three lines and, a more minor concern, some whitespace characters can be omitted on input, so for example PRINT"HELLO" will be listed as PRINT "HELLO", one character longer. If one then wants to edit the line, now split across two logical lines, one must replace the expanded commands back with their abbreviations to be submit them back to the tokenizer. The moral of this is, generally, don't try to squeeze more out of a line than is reasonable, or, in the alternate, obfuscate variables by makinggggggthemverrrryverrrrylongindeed.

Literal line numbers in statements such as GOTO are calculated at run time using the same floating-point mathematical routines as other BASIC functions. This calculation allows subroutines to be referred to by variables: for instance GOTO EXITOUT is as good as GOTO 2000, if one sets EXITOUT to 2000. This is much more useful than it might sound; literals are stored in the 6-byte floating-point variable format, but variables are stored as a two-byte pointer to their place in the variable value table, at VVTP, so by using a variable the GOTO target is just two bytes instead of six. Of course the actual number takes another six, but this is only stored once, so if the same line number is used more than twice, one saves some memory, though the line number takes a little longer to look up. (The design choice to store what can only legally be integers as floating-point numbers is discussed below.)

It's quite common in BASIC to use GOTO or GOSUB to a line number, so real savings can be made to replace these numbers with variables. It also means that if the programmer is careful always to use the variable and not the literal, subroutines can be easily renumbered (moved around in the program), because only the variable value needs to be changed. This makes it common to see, at the start of an Atari BASIC program, a sequence of LET statements assigning line numbers to variables.

Read more about this topic:  Atari BASIC, Description