Saturday, 7 May 2022

Go FORTH and build your own language!

FORTH is a wonderfully simple and compact programming language. Take a look at Rosetta Code Language Comparison.  Nearly every column on their table for FORTH is either... N/A, None or No!  I laughed at the column about the language being standardised (ANSIISO/IEC 15145:1997). Nothing could be further from the truth. At the beginning, a vanilla FORTH doesn't even have variables! You make them yourself by coding the word var (or whatever you think a variable should be called) using

 : var create 1 allot ; which is a compile time word 

every time the word var is invoked at run-time, create looks at the next word in the word list, let's call it counter for this example, assigns it a dictionary key, saves the current heap slot along with the instruction to push the heap slot to the data stack as a dictionary value and finally, advances to a new slot on the heap. So next time you call the word counter (not var. It's done what it needed to do), FORTH looks up the dictionary, and pushes it's heap slot to the data stack. Now you can retrieve the contents of the heap slot, using the built-in word @ or write a new value to the heap slot using the built-in word !

You're probably lost already, but this isn't a lesson in how to use FORTH. I cut my teeth on FORTH written in Python and highly recommend it. Instead I want to posit my pearl of FORTH. Somebody said somewhere that the pearl of Forth is it's create does> word pair, which in brief, let's you pair a compile time action (create) with a run time action (does>). This word pairing let's a user create their own language on top of FORTH. In the past people have created Object Oriented FORTHS and even BASIC. However, this is just syntactic sugar. I've discovered quite early in my FORTH journey that it's better to stay as close to the machine (Python virtual machine in my case) as you can.

No... for me the pearl of FORTH is that anybody can write their own FORTH and consequently there are hundreds of them out there. Few of them follow the FORTH standard. 

The reason you can write your own FORTH is due to it's very simple compile / interpret loop. FORTH doesn't have a compiler in the standard sense. Instead it has compiling words like var above. In FORTH there is no syntax, just a stream of white space separated tokens called words which are consumed in a single pass. The compile / interpret loop works as follows:

1.    We begin in interpret mode. If the word presented is a : we enter compile mode.  Otherwise consume the word and execute it immediately.

2.    If we are in compile mode, consume each word, compile it, but don't execute it.  Instead once ; is reached, store the compiled pcode in the dictionary for future execution, when the word we have just made (var above) is called again.

3.    Go back to 1.

A number of implications flow from this simple pcode compiler:

  • because the the words themselves act as switches to turn the compiler on and off, compiled words can be redefined further down the word list. Word (function) redefinition is the way a FORTH program deals with a function needing to do slightly different things based on it's context (function overloading in C++).
  • anybody can change how a FORTH compiler behaves by creating new built-in words that interact directly with the compile / interpret loop in a similar way as : and ; do.
  • a compiled word is a list of built-in words and perhaps other compiled words. At run-time those words call each other in sequence. My FORTH, like a lot of them, uses the concept of indirect threaded code where the compiled words are a list of function calls in a Python list stored in a dictionary. To execute each underlying Python function the FORTH word (key) calls the dictionary, which then calls the Python built-in function's memory address (value) which it looked up and stored during the compile process. In direct threaded code each function would call the next, similar to a linked list. The advantage of indirect threading, is that the built-in function doesn't need to know anything about calling or being called and so:
  • Compiled words can be defined that contain words that haven't been defined yet, as long as all words have been defined by execution time. 
  • Recursion is possible. 
  • All of FORTH's built-in tokens including the compiler control tokens : and ; can be renamed by storing them as constants. 
  • the words list is consumed as it is compiled / interpreted. This makes it easy to add new words to the words list at run-time, just create a word that reads in a new file and adds it's contents to the top of the words list. This lessens the need for namespaces as words are only introduced when needed rather than being present when the program loads.
  • you can increase the performance of your code by treating the word list as your string storage, something you would never do in a compiled language. This works particularly well when you want to print out large swaths of HTML which, just like FORTH, has no syntax. I put it all in my FORTH file (which is actually a collection of database records) and use the FORTH interpreter to print it directly to standard output (std out), bypassing the stack and heap. Want to store the string on the data stack instead (SQL queries come to mind for that)? Same approach as for std out except the word is different that puts it on the stack.
  • being able to poke around in the compiler gives you a very good understanding of how your FORTH works and the result is that you work to it's strengths and avoid it's weaknesses. The strengths I have noticed so far:
    • the code base is very malleable (the refactoring everybody talks about) and it is easy to change things so that you can re-use code e.g. printing a table is commonplace so I have standard table printing code where I redefine what can happen before, in and after a table cell. Now it's easy to print out HTML, tab delimited format, CSV or whatever.
    •  no types. If you want a type you have to explicitly cast it. Using HTML its all strings, so most of the time I don't want to be fooling around with types anyway. The only ones I've implemented so far are str and int. The other day I wanted a currency type. I found the appropriate Javascript and dropped my number between it. A quick and dirty solution that I wouldn't want to use for a table with hundreds of values, but formatting variables by surrounding them with FORTH string constants is standard practice for me.
    • avoid local variables because they make it difficult to refactor. A web app is a collection of independent pages, so the chances of global variables standing on each other is reduced. I have a number of predefined global variables which I use over and over. Then I came across the concept of shadowing. A FORTH variable is really an array of one (1 allot).
              var counter
    1 allot

    You now have a variable with two slots.  To refer to the first slot you simply call it by its name:

    counter

    counter 1 +

    gets you the second slot.

    This shadowing concept means you can store a backup of your variable in the same variable name. 

    The Weaknesses 

    • being able to have local variables would be nice occasionally 😀. Stack twiddling is not useful work. My solution has been to turn the stack into a deque and to use the bottom of the stack to store one or two local variables. Many FORTHS use the top of the control stack to store a local variable. I haven't tried that, preferring the deque concept instead because you don't have to monitor it so closely. If I'm iterating through a table there's going to be a lot of values on the stack, so there's no chance the bottom value will be accessed from above. I've added a bottom peek instruction which works well, but would be a disaster at the top, since FORTH words almost without exception pop from and push to the stack.
    • it's very easy to write inscrutable code. To address this, I indent my code following the typical conventions e.g. nested if statements. Plastering the code with comments is standard practice in other languages. Instead I try to use words that make sense when being read from left to right. Done right you end up with statements that intuitively make sense e.g. empty counter ! means put a '' into the counter variable and store it on the heap. Traditional FORTHS have a lot of one and two character words like @ ! . , : ; etc. I've tried to avoid those except for the most common ones that everyone knows. If I'm reimplementing a Python function I use the same name that Python has. I don't eschew comments completely. A longer explanation of what the code does is placed in the first record, although I've found the need for commenting is not that critical because the unit name and database record name combination gives you plenty of clues as to what the code is doing. Despite these things, it is not easy to read FORTH code. You have to be in the "zone" which requires time. C code is much easier to understand when you first see it.
    A FORTH written in C would remain a toy without many many man hours, but that's not what I'm doing here. I'm using FORTH as a way to simplify Python down to doing one thing... emitting scripts to std out or to file. Creating those scripts is faster than doing them directly in Python, because FORTH doesn't do any syntax checking (Python's tab/space syntax checking drives me mad!) and FORTH never crashes. It just emits the Python error and remains in it's REPL (Read Evaluate Print Loop). The scripts are typically SQL, HTML, CSS and Javascript, but I could  target any tool that is scriptable. LaTEX and Python are likely to be targets in future.