Saturday 7 May 2022

Go FORTH and build your own language!

FORTH is a wonderfully simple and compact programming language. Take a look at the Rosetta Code Language Comparison. Nearly every column on their table for FORTH is either... N/A, None or No! I laughed at the column about the language being standardised (ANSI, ISO/IEC 15145:1997). Nothing could be further from the truth. At the beginning, a vanilla FORTH doesn't even have variables! You make them yourself by coding the word var (or whatever you think a variable should be called) using

 : var create 1 allot ; which is a compile time word 

Every time the word var is invoked at run-time, create looks at the next word in the word list, let's call it counter for this example, assigns it a dictionary key, saves the current heap slot along with the instruction to push the heap slot to the data stack as a dictionary value and finally, advances to a new slot on the heap. So next time you call the word counter (not var, which has done what it needed to do), FORTH looks up the dictionary and pushes its heap slot to the data stack. Now you can retrieve the contents of the heap slot using the built-in word @, or write a new value to the heap slot using the built-in word !
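To make that concrete, here is a minimal Python sketch of the idea. It is not my actual implementation: the names heap, stack, words, create, fetch and store are all invented for this illustration, and the heap is just a Python list.

```python
# Illustrative sketch only: a heap, a data stack and a dictionary,
# plus a `create`-like helper. All names are assumptions for this example.
heap = []    # storage slots
stack = []   # data stack
words = {}   # dictionary: word name -> Python callable

def create(name, slots=1):
    """Register `name` so that calling it pushes its heap slot to the stack."""
    address = len(heap)            # the current heap slot
    heap.extend([0] * slots)       # "allot": advance past the new slot(s)
    words[name] = lambda: stack.append(address)

def fetch():
    """FORTH @ : ( addr -- value )"""
    stack.append(heap[stack.pop()])

def store():
    """FORTH ! : ( value addr -- )"""
    address = stack.pop()
    heap[address] = stack.pop()

words["@"] = fetch
words["!"] = store

create("counter")        # roughly what `var counter` does
stack.append(42)         # 42 counter !
words["counter"]()
words["!"]()
words["counter"]()       # counter @
words["@"]()
print(stack)             # [42]
```

The point to notice is that counter itself never holds a value. It only knows where its slot lives, and @ and ! do the reading and writing.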

You're probably lost already, but this isn't a lesson in how to use FORTH. I cut my teeth on FORTH written in Python and highly recommend it. Instead I want to posit my pearl of FORTH. Somebody said somewhere that the pearl of FORTH is its create does> word pair which, in brief, lets you pair a compile time action (create) with a run time action (does>). This word pairing lets a user create their own language on top of FORTH. In the past people have created Object Oriented FORTHS and even BASIC. However, this is just syntactic sugar. I discovered quite early in my FORTH journey that it's better to stay as close to the machine (the Python virtual machine in my case) as you can.

No... for me the pearl of FORTH is that anybody can write their own FORTH and consequently there are hundreds of them out there. Few of them follow the FORTH standard. 

The reason you can write your own FORTH is its very simple compile / interpret loop. FORTH doesn't have a compiler in the standard sense. Instead it has compiling words like var above. In FORTH there is no syntax, just a stream of white space separated tokens called words which are consumed in a single pass. The compile / interpret loop works as follows:

1.    We begin in interpret mode. If the word presented is a : we enter compile mode.  Otherwise consume the word and execute it immediately.

2.    If we are in compile mode, consume each word, compile it, but don't execute it.  Instead once ; is reached, store the compiled pcode in the dictionary for future execution, when the word we have just made (var above) is called again.

3.    Go back to 1.
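The three steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not my FORTH: the function names run and execute and the three built-ins are invented for the example, and a compiled word is stored as a plain list of word names rather than as a list of function references.

```python
from collections import deque

def execute(token, words, stack):
    """Run one word: a built-in, a compiled word, or an integer literal."""
    entry = words.get(token)
    if callable(entry):                  # built-in Python function
        entry(stack)
    elif isinstance(entry, list):        # compiled word: run its body in order
        for inner in entry:
            execute(inner, words, stack)
    else:                                # not in the dictionary: treat as an int
        stack.append(int(token))

def run(source, words, stack):
    """Single-pass compile / interpret loop over whitespace-separated words."""
    tokens = deque(source.split())
    while tokens:
        token = tokens.popleft()
        if token == ":":                 # step 1: switch to compile mode
            name = tokens.popleft()
            body = []
            while (token := tokens.popleft()) != ";":
                body.append(token)       # step 2: compile, don't execute
            words[name] = body           # store for future execution
        else:
            execute(token, words, stack) # interpret mode: execute immediately

# Hypothetical built-ins for the demonstration.
words = {
    "dup": lambda s: s.append(s[-1]),
    "*": lambda s: s.append(s.pop() * s.pop()),
    "+": lambda s: s.append(s.pop() + s.pop()),
}
stack = []
run(": square dup * ; 7 square", words, stack)
print(stack)   # [49]
```

Because the body holds names that are only looked up when the word runs, a word in this sketch can mention another word that hasn't been defined yet, as long as it exists by execution time.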

A number of implications flow from this simple pcode compiler:

  • because the words themselves act as switches to turn the compiler on and off, compiled words can be redefined further down the word list. Word (function) redefinition is the way a FORTH program deals with a function needing to do slightly different things based on its context (function overloading in C++).
  • anybody can change how a FORTH compiler behaves by creating new built-in words that interact directly with the compile / interpret loop in a similar way as : and ; do.
  • a compiled word is a list of built-in words and perhaps other compiled words. At run-time those words call each other in sequence. My FORTH, like a lot of them, uses the concept of indirect threaded code, where the compiled words are a list of function calls in a Python list stored in a dictionary. To execute each underlying Python function, the FORTH word (key) calls the dictionary, which then calls the Python built-in function's memory address (value) that it looked up and stored during the compile process. In direct threaded code each function would call the next, similar to a linked list. The advantage of indirect threading is that the built-in function doesn't need to know anything about calling or being called and so:
  • Compiled words can be defined that contain words that haven't been defined yet, as long as all words have been defined by execution time. 
  • Recursion is possible. 
  • All of FORTH's built-in tokens including the compiler control tokens : and ; can be renamed by storing them as constants. 
  • the words list is consumed as it is compiled / interpreted. This makes it easy to add new words to the words list at run-time: just create a word that reads in a new file and adds its contents to the top of the words list. This lessens the need for namespaces, as words are only introduced when needed rather than being present when the program loads.
  • you can increase the performance of your code by treating the word list as your string storage, something you would never do in a compiled language. This works particularly well when you want to print out large swaths of HTML which, just like FORTH, has no syntax. I put it all in my FORTH file (which is actually a collection of database records) and use the FORTH interpreter to print it directly to standard output (std out), bypassing the stack and heap. Want to store the string on the data stack instead (SQL queries come to mind for that)? Same approach as for std out except the word is different that puts it on the stack.
  • being able to poke around in the compiler gives you a very good understanding of how your FORTH works and the result is that you work to its strengths and avoid its weaknesses. The strengths I have noticed so far:
    • the code base is very malleable (the refactoring everybody talks about) and it is easy to change things so that you can re-use code e.g. printing a table is commonplace so I have standard table printing code where I redefine what can happen before, in and after a table cell. Now it's easy to print out HTML, tab delimited format, CSV or whatever.
    • no types. If you want a type you have to explicitly cast it. Using HTML it's all strings, so most of the time I don't want to be fooling around with types anyway. The only ones I've implemented so far are str and int. The other day I wanted a currency type. I found the appropriate JavaScript and dropped my number between it. A quick and dirty solution that I wouldn't want to use for a table with hundreds of values, but formatting variables by surrounding them with FORTH string constants is standard practice for me.
    • avoid local variables because they make it difficult to refactor. A web app is a collection of independent pages, so the chances of global variables standing on each other is reduced. I have a number of predefined global variables which I use over and over. Then I came across the concept of shadowing. A FORTH variable is really an array of one (1 allot).
        var counter
        1 allot

    You now have a variable with two slots.  To refer to the first slot you simply call it by its name:

    counter

    counter 1 +

    gets you the second slot.

    This shadowing concept means you can store a backup of your variable in the same variable name. 
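Here is the shadowing idea modelled in Python, as a rough sketch. The heap is a plain list and counter is simply the address of the first slot. The FORTH phrases in the comments are my assumption of how each step would be spelled, not code lifted from my system.

```python
# Illustrative model: `var counter` followed by `1 allot` gives two adjacent slots.
heap = [0, 0]
counter = 0                          # the address the word `counter` pushes

heap[counter] = 10                   # 10 counter !
heap[counter + 1] = heap[counter]    # counter @  counter 1 +  !   (take a backup)
heap[counter] = 99                   # 99 counter !                (clobber it)
heap[counter] = heap[counter + 1]    # counter 1 + @  counter !    (restore)

print(heap)                          # [10, 10]
```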

    The Weaknesses 

    • being able to have local variables would be nice occasionally 😀. Stack twiddling is not useful work. My solution has been to turn the stack into a deque and to use the bottom of the stack to store one or two local variables. Many FORTHS use the top of the control stack to store a local variable. I haven't tried that, preferring the deque concept instead because you don't have to monitor it so closely. If I'm iterating through a table there's going to be a lot of values on the stack, so there's no chance the bottom value will be accessed from above. I've added a bottom peek instruction which works well, but would be a disaster at the top, since FORTH words almost without exception pop from and push to the stack.
    • it's very easy to write inscrutable code. To address this, I indent my code following the typical conventions e.g. nested if statements. Plastering the code with comments is standard practice in other languages. Instead I try to use words that make sense when being read from left to right. Done right you end up with statements that intuitively make sense e.g. empty counter ! means put a '' into the counter variable and store it on the heap. Traditional FORTHS have a lot of one and two character words like @ ! . , : ; etc. I've tried to avoid those except for the most common ones that everyone knows. If I'm reimplementing a Python function I use the same name that Python has. I don't eschew comments completely. A longer explanation of what the code does is placed in the first record, although I've found the need for commenting is not that critical because the unit name and database record name combination gives you plenty of clues as to what the code is doing. Despite these things, it is not easy to read FORTH code. You have to be in the "zone" which requires time. C code is much easier to understand when you first see it.
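The deque idea from the first weakness can be sketched with Python's collections.deque. This is an illustration only: the class name DequeStack and its method names are made up for the example and are not the words my FORTH uses.

```python
from collections import deque

class DequeStack:
    """A data stack that can also park a value at the bottom."""

    def __init__(self):
        self._items = deque()

    def push(self, value):           # normal FORTH-style push to the top
        self._items.append(value)

    def pop(self):                   # normal FORTH-style pop from the top
        return self._items.pop()

    def push_bottom(self, value):    # park a "local variable" at the bottom
        self._items.appendleft(value)

    def peek_bottom(self):           # read it without removing it
        return self._items[0]

    def pop_bottom(self):            # discard it when finished
        return self._items.popleft()

stack = DequeStack()
stack.push_bottom("table-name")      # the local lives at the bottom
for cell in (1, 2, 3):               # ordinary work happens at the top
    stack.push(cell)
top = stack.pop()
print(top, stack.peek_bottom())      # 3 table-name
```

Ordinary words only ever touch the top, so the parked value stays out of the way as long as the stack doesn't run empty above it.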
    A FORTH written in C would remain a toy without many, many man hours, but that's not what I'm doing here. I'm using FORTH as a way to simplify Python down to doing one thing... emitting scripts to std out or to file. Creating those scripts is faster than doing them directly in Python, because FORTH doesn't do any syntax checking (Python's tab/space syntax checking drives me mad!) and FORTH never crashes. It just emits the Python error and remains in its REPL (Read Evaluate Print Loop). The scripts are typically SQL, HTML, CSS and JavaScript, but I could target any tool that is scriptable. LaTeX and Python are likely to be targets in future.

     

    Sunday 17 April 2022

    Storing Your Code in a Database 2

    Now that I've been working on my Web application for a while I've realised a few things. 

    • I've essentially created an Access database, but using a Web front-end. In Access the code is divided into collections e.g. forms, queries, reports etc. When you create an application you create the individual components and then stitch them all together. I'm doing the same thing, but have yet to determine the most effective collections to have.
    • I'm using my own traditional PC app (Classmaker) as my IDE. This is much faster to use than a Web IDE. It also means I can keep the same Web page up and simply refresh it to see my changes. I could code directly against a Cloud based code database using SSH tunneling, as Classmaker uses Isectd to communicate with its database.
    • Because I'm using a code database via Classmaker, it's relatively easy to move code components around and I'm doing that a lot at the moment. I think once the app is mature, I will have a number of collections each containing scores of components that I call from a few Web forms. The Web forms themselves probably won't have much code in them.
    • Rolling back changes to code is easy to do in a database. I just have a flag attached to the record. To make changes in a live environment I would duplicate the record, but with the flag set. If the flag is set that record is not selected for parsing.
    • I'm using FORTH as my application language. It's my own implementation written on top of Python and it's optimised for working with disconnected recordsets and strings. FORTH is a tricky language which requires you to be in the "zone" to be productive. Because I'm architecting my development environment as I go (eating my own dog food!), not a lot of progress towards the app is happening yet. I'm hopeful the gains on my investment will occur later.
    • FORTH does have several advantages though:
      • It has no syntax. This is nice. I can just write the code and format it any way I want to.
      • My FORTH doesn't have local variables (a discussion about the need for local variables deserves its own blog post 😏), just the stack and global variables. This means it is a great language, maybe the best there is... for code refactoring. Combine this with a database and it's a simple task to bundle some repetitive code into a word and shift it to its own record.
      • FORTH is fast. You wouldn't think so, being an interpreted language written on top of an interpreted language, but...
        •  The language core (both Python and FORTH) is compiled into pcode (assembly language for the Python virtual machine) and stored in memory, when the web server is started, so when you call a core compiled FORTH word you're calling a memory resident subroutine. The best analogy for what I'm doing is Apache's modPy, except I'm using Isectd to do it, not Apache and it's modFORTH rather than modPy. Apache thinks it's calling a CGI program, so I could use any CGI web server. The result would be the same.
        • In FORTH, code is compiled/executed in a single pass. The word list is examined and if a subroutine is to be compiled, that happens and the compiled code branch memory address is inserted into the dictionary. Further down the word list, if that word appears, the dictionary is looked up and the compiled word is executed. Whenever I branch to a database record its word list is inserted into the trunk word list at that point and execution continues, so the database code is parsed only as it is called. JIT compilation/execution? Whatever you call it, no code is prematurely compiled apart from the language core.
    • When using C++ or even Python it can be very difficult to work out just where an error has come from and what it is. In a database this can be narrowed down to a specific record. The smaller the record, the less code to go through. With FORTH it's easier, as you can locate in the word list the exact word the code failed on. This means that syntax errors can be immediately identified. Logic errors are more tricky, since they may originate earlier in the code base. I've found that most of the Python errors that are raised from a FORTH logic error are meaningless. It gets worse! If you've created a word that comprises many other words and that word should fail, you know it failed, but you don't know which subword caused the problem, because the subwords were compiled much earlier in the program. The lesson I'm learning is... don't prematurely refactor. At least you can dump the contents of the stack (essentially all your local variables) at any point in your code to see what's going on. I tend to work back from the bug, inserting dump statements at random points in the code.
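The way a database record's word list gets inserted into the trunk word list at the point it is called can be sketched like this. Everything here is a stand-in: records is a plain dict playing the part of the code database, load is a hypothetical word name, and every other word is simply emitted.

```python
from collections import deque

# Hypothetical stand-in for the code database: record name -> source text.
records = {
    "header": "<h1> Hello </h1>",
    "page": "<html> load header </html>",
}

def run(source, emit):
    """Consume words in a single pass, splicing records in only when called."""
    tokens = deque(source.split())
    while tokens:
        token = tokens.popleft()
        if token == "load":
            name = tokens.popleft()
            # extendleft() reverses its input, so reverse it first to keep
            # the record's words in their original order at the front.
            tokens.extendleft(reversed(records[name].split()))
        else:
            emit(token)

out = []
run(records["page"], out.append)
print(" ".join(out))   # <html> <h1> Hello </h1> </html>
```

The header record is never looked at until the word list reaches load header, which is the sense in which nothing is parsed before it is needed.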

    Monday 11 April 2022

    Storing Your Code in a Database 1

    I've always been concerned about how to handle different versions of the same code for different customers. I've seen several different approaches over the years. 

    The one that seemed to work best involved code stored in a Pick database. At my first job, we used Advanced Revelation, a DOS version of Pick. The Pick system mixed code files written in Pick Basic and multi-value data files together in a hashed database. We used to take a global source code file and append an asterisk and the customer's name to the end of it when a customisation was required e.g. accounts*customerA, accounts*customerB etc. Code management modules were written that compiled accounts*customerA into the global accounts executable and then bundled all the executable records together into an archive that the customer extracted onto their system. Another department at that same workplace was developing a C++ app, storing the code in standard text files. They got into all sorts of trouble handling variations, mainly because, I think, when developing using text files, to avoid clutter you tend to let the text files grow too large and then it becomes very difficult to refactor them.

    I've seen a FoxPro app where most of the app is standard, but you were able to request minor modifications e.g. customised reports. How this was handled at the vendor end, I'm not sure, but I did notice fields in the database that my customer didn't use.

    Another approach is to have a single code base, but include an ini file that lets different customers switch on and off the functionality they need. Ini files used like this can grow to be very large e.g. Apache Web Server. I think the ini file approach is going to become unworkable as the code base becomes increasingly convoluted trying to account for every possibility.

    I have no experience doing this, but you could build a conventional app using text files and then store these inside a versioning database e.g. Fossil.

    Today we have hosted web apps. I don't think much has changed except that you no longer have the upgrade pain that came with distributed apps. I'm guessing the standard approach is to build an app in a CMS and to associate code files and database records with registered users. So you end up with one huge CMS app and one large relational database. The database is a concern, because over time it is going to end up with lots of redundant fields from customisations. You must be passing some kind of unique token between the server and the client to ensure that the client can't trespass onto someone else's data and if that token gets hacked, the hacker might be able to gain access to your entire customer base! To avoid that, database engineers resort to using GUIDs to identify records, so it is virtually impossible to pull up a database record using a random key.

    I'm working on a different approach.

    • Each customer has their own database for data. Consequently you can keep it simple with integer IDs and integer foreign keys instead of a mess of GUIDs which, as well as being confusing, massively impact performance when used as an index key. The database server manages multiple small databases.
    • Their code, though, resides in a SINGLE database for code. The code database is organised like a library with shelves, books and chapters (pages exist too but these are called indirectly). A shelf is an entire application which comprises many books, but the customer directly accesses just four books, a private book (administration), a public book (customer web pages that can be used by anyone), a protected book (customer web pages that can only be accessed by their registered users) and a global book. The customer books contain only the chapters which are customised for them. Every time a client requests a web page, the customer books are browsed first. If the book's chapter is missing then the chapter is sourced from the global book instead.
    • It's impossible for a customer to request another customer's book because to read their book you have to be standing right in front of it. This is pushing the analogy a bit far! I have just ONE very simple cgi file (all it does is collect cgi parameters and cookies, send them to the code database and return the reply to the web browser) that calls the code database, but there are MANY COPIES of that file in a shallow directory tree. Each copy has its own ini file which specifies the name of the book it can access and which database. None of that information passes over the Internet. Access to the cgi files is password protected where necessary (using SSL encrypted basic authentication).
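The "customer book first, global book second" lookup can be sketched in Python as follows. This is only an illustration of the fallback rule: the library dict, the book names and the find_chapter function are invented for the example, and the real code database is not a Python dict.

```python
# Hypothetical in-memory stand-in for the code database: book -> chapter -> code.
library = {
    "customerA.public": {"home": "customer A home page"},
    "global": {
        "home": "standard home page",
        "contact": "standard contact page",
    },
}

def find_chapter(customer_book, chapter, books=library):
    """Return the customised chapter if there is one, else the global chapter."""
    for book in (customer_book, "global"):     # customer book is browsed first
        chapters = books.get(book, {})
        if chapter in chapters:
            return chapters[chapter]
    raise KeyError(chapter)                    # missing from both books

print(find_chapter("customerA.public", "home"))     # customer A home page
print(find_chapter("customerA.public", "contact"))  # standard contact page
```

In this sketch the book name is passed in by the caller, which mirrors the book name coming from the ini file beside each cgi copy rather than from anything the browser sends.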