Saturday, 7 May 2022

Go FORTH and build your own language!

FORTH is a wonderfully simple and compact programming language. Take a look at the Rosetta Code language comparison table. Nearly every column in the FORTH row is N/A, None or No! I laughed at the column claiming the language is standardised (ANSI, ISO/IEC 15145:1997); in practice nothing could be further from the truth. Out of the box, a vanilla FORTH doesn't even have variables! You make them yourself by coding the word var (or whatever you think a variable should be called) using

 : var create 1 allot ;  which is a compile-time (defining) word

Every time the word var is invoked at run-time, create looks at the next word in the word list (let's call it counter for this example), assigns it a dictionary key, saves the current heap slot, along with the instruction to push that heap slot onto the data stack, as the dictionary value, and finally advances to a new slot on the heap. So the next time you call the word counter (not var; it's done what it needed to do), FORTH looks it up in the dictionary and pushes its heap slot onto the data stack. Now you can retrieve the contents of the heap slot using the built-in word @, or write a new value to the heap slot using the built-in word !
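
To make that concrete, here is a minimal sketch in Python of the mechanics just described. It is only an illustration (the function names and the use of plain Python lists and dicts are my own assumptions), not the implementation this post describes.

    # Sketch of the create / allot / @ / ! mechanics (illustrative only).
    heap = []            # the heap: one Python list slot per cell
    stack = []           # the data stack
    dictionary = {}      # word name -> run-time action (a Python callable)

    def create(name):
        """Compile-time: bind 'name' to a word that pushes its heap slot."""
        slot = len(heap)                          # the current heap slot
        dictionary[name] = lambda: stack.append(slot)

    def allot(cells):
        """Advance the heap by reserving 'cells' new slots."""
        heap.extend([0] * cells)

    def fetch():                                  # the built-in word @
        stack.append(heap[stack.pop()])

    def store():                                  # the built-in word !
        slot, value = stack.pop(), stack.pop()
        heap[slot] = value

    # : var create 1 allot ;   then   var counter
    create("counter"); allot(1)

    stack.append(42); dictionary["counter"]()     # 42 counter
    store()                                       # !   (heap slot now holds 42)
    dictionary["counter"](); fetch()              # counter @
    print(stack.pop())                            # 42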

You're probably lost already, but this isn't a lesson in how to use FORTH. I cut my teeth on a FORTH written in Python and highly recommend it. Instead I want to posit my pearl of FORTH. Somebody said somewhere that the pearl of FORTH is its create does> word pair, which, in brief, lets you pair a compile-time action (create) with a run-time action (does>). This word pairing lets a user create their own language on top of FORTH. In the past people have created Object-Oriented FORTHs and even BASIC. However, this is just syntactic sugar. I discovered quite early in my FORTH journey that it's better to stay as close to the machine (the Python virtual machine in my case) as you can.
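
To see why people call it the pearl, here is the idea reduced to a Python closure. Again this is just a sketch of the concept, not any particular FORTH: the create part captures some data when the new word is defined, and the does> part says what happens every time that word is later executed.

    # Sketch of create/does> as a closure (illustrative only).
    stack = []
    dictionary = {}

    def constant(name, value):
        def runtime():                  # the does> part: runs when the word runs
            stack.append(value)
        dictionary[name] = runtime      # the create part: install the new word

    constant("limit", 100)              # in FORTH spelling: 100 constant limit
    dictionary["limit"]()               # invoking the word pushes its value
    print(stack)                        # [100]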

No... for me the pearl of FORTH is that anybody can write their own FORTH and consequently there are hundreds of them out there. Few of them follow the FORTH standard. 

The reason you can write your own FORTH is its very simple compile / interpret loop. FORTH doesn't have a compiler in the standard sense. Instead it has compiling words like var above. In FORTH there is no syntax, just a stream of white space separated tokens called words, which are consumed in a single pass. The compile / interpret loop works as follows (a minimal sketch follows the list):

1.    We begin in interpret mode. If the word presented is a : we enter compile mode.  Otherwise consume the word and execute it immediately.

2.    If we are in compile mode, consume each word and compile it, but don't execute it. Once ; is reached, store the compiled pcode in the dictionary for future execution, when the word we have just defined (var above) is called.

3.    Go back to 1.
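
Here is that loop reduced to a deliberately simplified Python sketch. The names and layout are my own assumptions, not anything from a real FORTH; compiled words end up as lists of callables stored in the dictionary, in the spirit of the threading discussed further down.

    # Simplified compile / interpret loop (illustrative sketch only).
    stack = []
    dictionary = {
        "+":   lambda: stack.append(stack.pop() + stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
        ".":   lambda: print(stack.pop()),
    }

    def execute(word):
        body = dictionary[word]
        if callable(body):                   # a built-in word
            body()
        else:                                # a compiled word: run each entry
            for entry in body:
                entry()

    def interpret(source):
        words = iter(source.split())         # no syntax: just whitespace-separated words
        for word in words:
            if word == ":":                  # enter compile mode
                name, body = next(words), []
                for w in words:
                    if w == ";":             # end of definition: store the pcode
                        break
                    body.append(dictionary[w])    # compile, don't execute
                dictionary[name] = body
            elif word.isdigit():
                stack.append(int(word))      # numbers are pushed
            else:
                execute(word)                # interpret mode: execute immediately

    interpret(": double dup + ;  21 double .")   # prints 42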

A number of implications flow from this simple pcode compiler:

  • because the words themselves act as switches to turn the compiler on and off, compiled words can be redefined further down the word list. Word (function) redefinition is how a FORTH program deals with a function needing to do slightly different things depending on its context (like function overloading in C++).
  • anybody can change how a FORTH compiler behaves by creating new built-in words that interact directly with the compile / interpret loop, in much the same way as : and ; do.
  • a compiled word is a list of built-in words and perhaps other compiled words. At run-time those words call each other in sequence. My FORTH, like a lot of them, uses indirect threaded code: a compiled word is a Python list of function calls stored in a dictionary. To execute each underlying Python function, the FORTH word (the key) goes through the dictionary, which calls the Python built-in function's memory address (the value) that was looked up and stored during compilation. In direct threaded code each function would call the next, similar to a linked list. The advantage of indirect threading is that a built-in function doesn't need to know anything about calling or being called, and so:
  • Compiled words can be defined that contain words that haven't been defined yet, as long as all words have been defined by execution time. 
  • Recursion is possible. 
  • All of FORTH's built-in tokens including the compiler control tokens : and ; can be renamed by storing them as constants. 
  • the word list is consumed as it is compiled / interpreted. This makes it easy to add new words to the word list at run-time: just create a word that reads in a new file and adds its contents to the top of the word list. This lessens the need for namespaces, as words are only introduced when needed rather than being present when the program loads.
  • you can increase the performance of your code by treating the word list as your string storage, something you would never do in a compiled language. This works particularly well when you want to print out large swathes of HTML which, just like FORTH, has no syntax. I put it all in my FORTH file (which is actually a collection of database records) and use the FORTH interpreter to print it directly to standard output (stdout), bypassing the stack and heap. Want to store the string on the data stack instead (SQL queries come to mind)? Same approach as for stdout, except a different word puts it on the stack.
  • being able to poke around in the compiler gives you a very good understanding of how your FORTH works, and the result is that you work to its strengths and avoid its weaknesses. The strengths I have noticed so far:
    • the code base is very malleable (the refactoring everybody talks about) and it is easy to change things so that you can re-use code. For example, printing a table is commonplace, so I have standard table-printing code where I redefine what can happen before, in and after a table cell. Now it's easy to print out HTML, tab-delimited format, CSV or whatever.
    •  no types. If you want a type you have to explicitly cast it. With HTML it's all strings, so most of the time I don't want to be fooling around with types anyway. The only ones I've implemented so far are str and int. The other day I wanted a currency type. I found the appropriate Javascript and dropped my number into it. A quick and dirty solution that I wouldn't want to use for a table with hundreds of values, but formatting variables by surrounding them with FORTH string constants is standard practice for me.
    • avoid local variables because they make it difficult to refactor. A web app is a collection of independent pages, so the chances of global variables stepping on each other are reduced. I have a number of predefined global variables which I use over and over. Then I came across the concept of shadowing. A FORTH variable is really an array of one (1 allot).
    var counter
    1 allot

    You now have a variable with two slots.  To refer to the first slot you simply call it by its name:

    counter

    counter 1 +

    gets you the second slot.

    This shadowing concept means you can store a backup of your variable in the same variable name. 

    The Weaknesses 

    • being able to have local variables would be nice occasionally 😀. Stack twiddling is not useful work. My solution has been to turn the stack into a deque and to use the bottom of the stack to store one or two local variables (a sketch follows this list). Many FORTHs use the top of the control stack to store a local variable. I haven't tried that, preferring the deque concept instead, because you don't have to monitor it so closely. If I'm iterating through a table there are going to be a lot of values on the stack, so there's no chance the bottom value will be accessed from above. I've added a bottom peek instruction which works well, but would be a disaster at the top, since FORTH words almost without exception pop from and push to the stack.
    • it's very easy to write inscrutable code. To address this, I indent my code following the typical conventions, e.g. nested if statements. Plastering the code with comments is standard practice in other languages. Instead I try to use words that make sense when read from left to right. Done right, you end up with statements that intuitively make sense, e.g. empty counter ! means put a '' into the counter variable and store it on the heap. Traditional FORTHs have a lot of one and two character words like @ ! . , : ; etc. I've tried to avoid those except for the most common ones that everyone knows. If I'm reimplementing a Python function I use the same name that Python has. I don't eschew comments completely. A longer explanation of what the code does is placed in the first record, although I've found the need for commenting is not that critical, because the unit name and database record name combination gives you plenty of clues as to what the code is doing. Despite these things, it is not easy to read FORTH code. You have to be in the "zone", which requires time. C code is much easier to understand when you first see it.
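
    Here is a rough sketch of the deque idea from the first weakness above. The word names and layout are my own inventions, purely to illustrate keeping one or two "locals" at the bottom of the data stack with a bottom peek to read them.

        # Sketch of a data stack on a deque with a bottom peek (illustrative only).
        from collections import deque

        stack = deque()

        def push(value):              # normal pushes go on the right (the top)
            stack.append(value)

        def pop():                    # normal pops come off the right (the top)
            return stack.pop()

        def push_bottom(value):       # stash a "local" underneath everything else
            stack.appendleft(value)

        def bottom_peek():            # read the "local" without touching the top
            return stack[0]

        push_bottom("customerA")            # a local for the whole word
        for n in (1, 2, 3):                 # ordinary stack traffic piles on top
            push(n)
        print(bottom_peek(), pop())         # customerA 3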
    A FORTH written in C would remain a toy without many, many man-hours, but that's not what I'm doing here. I'm using FORTH as a way to simplify Python down to doing one thing: emitting scripts to stdout or to file. Creating those scripts is faster than doing them directly in Python, because FORTH doesn't do any syntax checking (Python's tab/space syntax checking drives me mad!) and FORTH never crashes. It just emits the Python error and remains in its REPL (Read Evaluate Print Loop). The scripts are typically SQL, HTML, CSS and Javascript, but I could target any tool that is scriptable. LaTeX and Python are likely targets in future.

     

    Sunday, 17 April 2022

    Storing Your Code in a Database 2

    Now that I've been working on my Web application for a while I've realised a few things. 

    • I've essentially created an Access database, but using a Web front-end. In Access the code is divided into collections e.g. forms, queries, reports etc. When you create an application you create the individual components and then stitch them all together. I'm doing the same thing, but have yet to determine the most effective collections to have.
    • I'm using my own traditional PC app (Classmaker) as my IDE. This is much faster to use than a Web IDE. It also means I can keep the same Web page up and simply refresh it to see my changes. I could code directly against a Cloud based code database using SSH tunneling, as Classmaker uses Isectd to communicate with its database.
    • Because I'm using a code database via Classmaker, it's relatively easy to move code components around and I'm doing that a lot at the moment. I think once the app is mature, I will have a number of collections, each containing scores of components that I call from a few Web forms. The Web forms themselves probably won't have much code in them.
    • Rolling back changes to code is easy to do in a database. I just have a flag attached to the record. To make changes in a live environment I would duplicate the record, but with the flag set. If the flag is set, that record is not selected for parsing.
    • I'm using FORTH as my application language. It's my own implementation written on top of Python and it's optimised for working with disconnected recordsets and strings. FORTH is a tricky language which requires you to be in the "zone" to be productive. Because I'm architecting my development environment as I go (eating my own dog food!), not a lot of progress towards the app is happening yet. I'm hopeful the gains on my investment will occur later.
    • FORTH does have several advantages though:
      • It has no syntax. This is nice. I can just write the code and format it any way I want to.
      • My FORTH doesn't have local variables (a discussion about the need for local variables deserves its own blog post 😏), just the stack and global variables. This means it is a great language, maybe the best there is... for code refactoring. Combine this with a database and it's a simple task to bundle some repetitive code into a word and shift it to its own record.
      • FORTH is fast. You wouldn't think so, being an interpreted language written on top of an interpreted language, but...
        •  The language core (both Python and FORTH) is compiled into pcode (assembly language for the Python virtual machine) and stored in memory when the web server is started, so when you call a core compiled FORTH word you're calling a memory-resident subroutine. The best analogy for what I'm doing is Apache's mod_python, except I'm using Isectd rather than Apache, and it's modFORTH rather than mod_python. Apache thinks it's calling a CGI program, so I could use any CGI web server. The result would be the same.
        • In FORTH, code is compiled/executed in a single pass. The word list is examined and, if a subroutine is to be compiled, that happens and the compiled code's branch memory address is inserted into the dictionary. Further down the word list, if that word appears, the dictionary is looked up and the compiled word is executed. Whenever I branch to a database record, its word list is inserted into the trunk word list at that point and execution continues, so the database code is parsed only as it is called (a sketch of this splicing follows the list). JIT compilation/execution? Whatever you call it, no code is prematurely compiled apart from the language core.
    • When using C++ or even Python it can be very difficult to work out just where an error has come from and what it is. In a database this can be narrowed down to a specific record; the smaller the record, the less code to go through. With FORTH it's easier, as you can locate in the word list the exact word the code failed on. This means that syntax errors can be immediately identified. Logic errors are trickier, since they may originate earlier in the code base. I've found that most of the Python errors raised from a FORTH logic error are meaningless. It gets worse! If you've created a word that comprises many other words and that word fails, you know it failed, but you don't know which subword caused the problem, because the subwords were compiled much earlier in the program. The lesson I'm learning is... don't prematurely refactor. At least you can dump the contents of the stack (essentially all your local variables) at any point in your code to see what's going on. I tend to work back from the bug, inserting dump statements at random points in the code.
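
    As a rough illustration of the splicing mentioned in the list above: branching to a database record just means inserting that record's words into the stream at the current position and carrying on. This is my simplification; fetch_record below is a made-up stand-in for the real database lookup.

        # Sketch of splicing a record's word list into the trunk word list
        # at the point it is called (fetch_record is a made-up stand-in).
        def fetch_record(name):
            records = {"greeting": "print_hello print_world"}
            return records[name]

        def run(source):
            words = source.split()
            i = 0
            while i < len(words):
                word = words[i]
                if word == "call":                 # branch to a database record
                    spliced = fetch_record(words[i + 1]).split()
                    words[i:i + 2] = spliced       # insert its words right here...
                    continue                       # ...and keep consuming from this point
                print("executing:", word)          # stand-in for real execution
                i += 1

        run("begin call greeting end")
        # executing: begin
        # executing: print_hello
        # executing: print_world
        # executing: end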

    Monday, 11 April 2022

    Storing Your Code in a Database 1

    I've always been concerned about how to handle different versions of the same code for different customers. I've seen several different approaches over the years. 

    The one that seemed to work best involved code stored in a Pick database. At my first job, we used Advanced Revelation, a DOS version of Pick. The Pick system mixed code files written in Pick Basic and multi-value data files together in a hashed database. We used to take a global source code file and append an asterisk and the customer's name to the end of it when a customisation was required, e.g. accounts*customerA, accounts*customerB etc. Code management modules were written that compiled accounts*customerA into the global accounts executable and then bundled all the executable records together into an archive that the customer extracted onto their system. Another department at that same workplace was developing a C++ app, storing the code in standard text files. They got into all sorts of trouble handling variations, mainly because, I think, when developing with text files you tend to let them grow too large to avoid clutter, and then it becomes very difficult to refactor them.

    I've seen a FoxPro app where most of the app is standard, but you were able to request minor modifications e.g. customised reports. How this was handled at the vendor end, I'm not sure, but I did notice fields in the database that my customer didn't use.

    Another approach is to have a single code base, but include an ini file that lets different customers switch on and off the functionality they need. Ini files used like this can grow to be very large, e.g. Apache Web Server. I think the ini file approach becomes unworkable as the code base grows increasingly convoluted trying to account for every possibility.

    I have no experience doing this, but you could build a conventional app using text files and then store these inside a versioning database, e.g. Fossil.

    Today we have hosted web apps. I don't think much has changed, except that you no longer have the upgrade pain that came with distributed apps. I'm guessing the standard approach is to build an app in a CMS and to associate code files and database records with registered users. So you end up with one huge CMS app and one large relational database. The database is a concern, because over time it is going to end up with lots of redundant fields from customisations. You must be passing some kind of unique token between the server and the client to ensure that the client can't trespass onto someone else's data, and if that token gets hacked, the hacker might gain access to your entire customer base! To avoid that, database engineers resort to using GUIDs to identify records, so it is virtually impossible to pull up a database record using a random key.

    I'm working on a different approach.

    • Each customer has their own database for data. Consequently you can keep it simple, with integer IDs and integer foreign keys instead of a mess of GUIDs, which, as well as being confusing, massively impact performance as an index key. The database server manages multiple small databases.
    • Their code, though, resides in a SINGLE database for code. The code database is organised like a library with shelves, books and chapters (pages exist too, but these are called indirectly). A shelf is an entire application comprising many books, but the customer directly accesses just four: a private book (administration), a public book (customer web pages that can be used by anyone), a protected book (customer web pages that can only be accessed by their registered users) and a global book. The customer books contain only the chapters which are customised for them. Every time a client requests a web page, the customer books are browsed first. If the book's chapter is missing, the chapter is sourced from the global book instead (a sketch of this lookup follows the list).
    • It's impossible for a customer to request another customer's book, because to read their book you have to be standing right in front of it. This is pushing the analogy a bit far! I have just ONE very simple cgi file (all it does is collect cgi parameters and cookies, send them to the code database and return the reply to the web browser) that calls the code database, but there are MANY COPIES of that file in a shallow directory tree. Each copy has its own ini file which specifies the name of the book it can access and which database. None of that information passes over the Internet. Access to the cgi files is password protected where necessary (using SSL encrypted basic authentication).
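
    A sketch of that customer-book-then-global-book lookup is below. The function names and the dictionary standing in for the code database are my own assumptions, purely for illustration.

        # Sketch of the chapter lookup: try the customer's book first, then
        # fall back to the global book (names invented for illustration).
        def load_chapter(book, chapter, books):
            """Return the chapter from 'book', or None if it isn't there."""
            return books.get(book, {}).get(chapter)

        def find_chapter(customer_book, chapter, books, global_book="global"):
            code = load_chapter(customer_book, chapter, books)
            if code is None:                     # not customised for this customer
                code = load_chapter(global_book, chapter, books)
            return code

        books = {
            "global":    {"invoice": "standard invoice page", "home": "standard home page"},
            "customerA": {"invoice": "customerA's customised invoice page"},
        }

        print(find_chapter("customerA", "invoice", books))   # the customised chapter
        print(find_chapter("customerA", "home", books))      # falls back to the global book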

    Sunday, 22 August 2021

    My 64bit Windows 7 Software Ecosystem

    The tools listed below are where I spend most of my work time, ordered from most used to least:

    Daily for time sheets, checking bank statements and general emails
    Accounting - GNUCash
    Web browsing - Mozilla Firefox
    Email - Mozilla Thunderbird

    Twice weekly for invoicing
    Planning - Classmaker
    Mail Merge reports - Microsoft Word 2003 (the data source is an HTML file emitted by Classmaker)
    PDF Creation - Print to CutePDF

    Once a week
    Backup - Classmaker
    Off-site backup - Google Drive

    Once a fortnight
    Scanning - Windows Paint
    Creating multipage PDFs from scanned images - IrfanView
    Spreadsheet - LibreOffice Calc
    Word Processor - LibreOffice Writer

    Once in a blue moon!
    Chart creation - LibreOffice Draw
    Desktop publishing - Scribus

    When things go horribly wrong!
    Puppy Linux Tahr 6.0 on a USB stick

    •  Puppy has saved me several times over the years when Windows 7 hasn't booted or file corruptions have occurred.
      • The most recent incident involved OLE embedding a LibreOffice Draw file inside a LibreOffice Writer document.  Don't do this!  That technology is unstable and you will find yourself locked out of the Writer document as somehow ALL the file permissions get removed.  The only way to retrieve the file is to boot into Linux, move the file into a Linux partition and then copy it back.
      • Instead of OLE embedding, export your Draw file to disk as a PNG image and then insert that into your Writer document.

    Friday, 18 June 2021

    Safe Python Eval

    Many of us want to use a safe, simple scripting language to extend our Internet applications, and I have seen many discussions on the Internet about how to implement Python so that its eval function can't be exploited. The problem is, nobody can be sure that they've covered every scenario, because Python has become complex over time, with multiple ways to achieve the same objective.

    My scenario is a typical one. I have a web based database application hosted on a remote server. Now I want its users to access its database to produce ad hoc reports, without compromising the database or anything else! I could do the typical thing of providing a screen where users can submit Select queries, but SQL by itself isn't enough to produce a useful report except in the simplest scenarios. To begin with, you can only work with one query result at a time. Reports frequently require multiple query results to be combined. Tables can of course be joined using SQL, but the result usually has lots of redundant data fields that must be omitted. A declarative language like SQL just can't do it alone. It needs to be combined with a procedural language. After all, that's how my web application works. The procedural capability of Python is combined with SQL to produce web pages that make sense to users.

    After surveying the scripting languages out there, Python came in at number one, BUT it has an eval function that can be abused to access the file system or carry out other security exploits. Vanilla Python isn't a sensible choice as an extension scripting language.

    I have looked at FORTH in the past; ATLAST and Ficl were front runners, but they are traditional FORTHs with weak string manipulating capability. Strong string capability is what I need, because my database queries return as nested Python string lists. I came up with that format years ago, before JSON became popular, using Python eval to parse the database query result, returning a table that Python can iterate over using its index functions.

    Then I came across a FORTH written in Python, a toy project for educational purposes. I realised it wasn't going to be fast, because FORTH is an interpreted language written in an interpreted language! What it did have going for it was very strong string capability and the ability to cherry-pick the Python functions required, leaving everything else behind. Suddenly, eval wasn't a problem, as I could overlay it with some checks to make sure eval could only evaluate a list. FORTH is a simple, typeless, stack based language with the entire environment comprising two stacks and a heap (Python lists), a dictionary (a Python dictionary) and a heap pointer (a Python integer). FORTH has no syntax! The language is just a stream of white space delimited tokens called words. Traditional FORTH has no error checking and crashes are commonplace. But having Python underneath FORTH makes it uncrashable. Sure, your FORTH code will frequently bomb, but the FORTH REPL (Read Evaluate Print Loop) just displays the Python error and awaits your next command.
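
    One way of fencing eval in is sketched below, using Python's ast.literal_eval, which only accepts literals such as lists, strings and numbers. This is an illustration of the kind of check being described, not the actual code.

        # Sketch of a guarded eval: accept only literal data and insist on a list.
        import ast

        def safe_eval_list(text):
            try:
                value = ast.literal_eval(text)   # refuses names, calls, attributes...
            except (ValueError, SyntaxError) as exc:
                raise ValueError("not a literal: %s" % exc)
            if not isinstance(value, list):
                raise ValueError("expected a list")
            return value

        print(safe_eval_list("[[1, 'Moved'], [2, 'Urgent']]"))   # fine: a nested list
        try:
            safe_eval_list("__import__('os').system('ls')")      # rejected
        except ValueError as err:
            print("rejected:", err)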

    I made some alterations to the Python code to get a FORTH that isn't quite FORTH anymore (as the saying goes, 'when you've seen one FORTH - you've seen one FORTH')! Everything became a string, with both the stack and the heap able to handle strings or lists of any size in addition to numbers. Actually Python is cleverer than that. Lists can store variables of any type, so my database lists stay as lists (under the hood, they're probably C arrays) which are passed by reference rather than by value. Numbers stay as numbers and dates as dates unless they are cast explicitly, something I've brought over from Python. But apart from that the toy project has stayed largely untouched. All it can do is step, branch, loop, create variables, read variables and print to the console. But at just over 300 lines of Python, it's Turing complete and this is all you need to create a report. Already I have written a complex report utilising two tables that cannot be joined; the report output is styled HTML.

    Perhaps the most interesting thing I've done is to create a persistent library of FORTH definitions. The way I've done that is to store the FORTH environment after loading forth.py (the REPL) and forth.fr (the persistent library), which are passed as Python command line arguments. The working code FORTH file is then called manually in the REPL. Each time it is called, the FORTH environment is restored to its state at the end of forth.fr. This means you can call the working code file over and over without having to restart the REPL.
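
    In rough terms, the persistence amounts to snapshotting the interpreter state once the library has loaded, then restoring that snapshot before each run of the working file. The sketch below uses invented names for the pieces of the environment; it is not the actual forth.py code.

        # Sketch of saving/restoring the FORTH environment (names invented).
        import copy

        environment = {"stack": [], "rstack": [], "heap": [], "dictionary": {}, "hp": 0}

        # ... load forth.fr here, filling the dictionary and heap ...
        library_state = copy.deepcopy(environment)     # frozen snapshot after the library loads

        # each time the working code file is called, start again from that snapshot
        environment = copy.deepcopy(library_state)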

    But there's still a problem! FORTH over Python might be safe, but a user could write a function that never returns. Endless loops are easy to create inadvertently in procedural languages. Code needs to be added to Isectd that checks how long a worker has been busy. If it has been busy for longer than 20 seconds and has no clients attached, then Isectd sends a kill command to the process, which will automatically restart after it has been killed.
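
    The sort of check meant here might look like the sketch below. It is entirely hypothetical: the field names and the kill mechanism are assumptions, not Isectd's actual bookkeeping.

        # Hypothetical watchdog: kill a worker that has been busy for more than
        # 20 seconds with no clients attached, so it can be restarted.
        import os, signal, time

        BUSY_LIMIT = 20  # seconds

        def check_worker(worker):
            busy_for = time.time() - worker["busy_since"]
            if busy_for > BUSY_LIMIT and worker["clients"] == 0:
                os.kill(worker["pid"], signal.SIGTERM)   # the supervisor restarts it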

    The working FORTH code is included in the PocketClassmaker install file in the downloads area of my website.

    Monday, 16 December 2019

    Fox-Toolkit

    I have been using Fox-Toolkit version 1.3.15 as my development environment for 15 years now. Prior to that I used Visual Basic 5, which I loved. Ever since moving to Fox-Toolkit I've been searching for a modern equivalent to VB5. I've never found it. Over and over, following these programming language journeys, I've concluded that Fox-Toolkit really is a brilliant piece of software engineering, all of it done by one man, Dutch-born Jeroen van der Zijp, beginning in 1997.

    Here's why I think it's brilliant:

    1. It's just a library written in C++, but Jeroen dislikes the STL and doesn't use it. The resulting language resembles C with inheritance rather than C++. Most of the time I don't even use it like you would C or C++, as I have no need to concern myself with advanced concepts like inheritance, pointers and all the rest. Instead, I use it the way you used VB... without an IDE. In VB, you created forms and modules. Most of the code was written as event code associated with a widget, and shared code was placed in modules. In Fox it's the same, except that the form is just a standard *.cpp text file with Fox's window creation boilerplate and a module is a standard *.cpp text file without it.

    2. Because it's written in C++:
    • it runs extremely fast
    • provided a project is set up correctly, compile times are quick, because most of the code has already been compiled and only needs to be linked
    • any external library written in C or C++ is easily integrated
    • calls to the Windows API are a breeze
    • I have multiple free compilers at my disposal
    • the C syntax is terse and easy to read
    • there are many code indenters available to tidy your code
    • if I need to, I can program at a systems level, rather than at application level without having to move to another environment.
    3. It doesn't use native widgets. Instead it emulates the look and feel of Windows 95 by using operating system primitives for drawing lines, filling spaces and obtaining mouse and keyboard input. This is the best part, because Fox cannot be broken by a third party, since those primitive API calls are extremely unlikely to change. Also... I like Windows 95! Microsoft keeps tinkering with its look and feel for marketing reasons, but in truth, nothing substantial has appeared since Windows 95 in terms of GUI usability.

    4. It has a very powerful string class called FXString. I use this constantly. Most of my programming revolves around relational databases, which I query using my own disconnected recordset object, written in Fox, which parses a JSON-like string into linked lists of FXString. Very simple, very fast and rock solid performance.

    5. The window geometry manager avoids fixed window sizes, so Fox applications resize themselves sanely. VB didn't come close in this regard.

    6. Fox comes with its own code editor called Adie. Of course, any code editor is viable. Notepad++ is popular. But I like Adie because it's so simple.

    7. The widgets collection is huge and, with a bit of digging in the code, you discover things are available that you don't expect. For example, recently I found that the FXTable widget lets you copy its contents to the clipboard in tab-delimited format. Spreadsheets parse this into columns and rows directly, without the usual dialog palaver that comes with a CSV file. Other time savers include INI file creation, which comes baked into Fox, allowing window positions to be saved on exiting your application.

    8.  Being open source with a liberal licence, you have access to all the source code which is reasonably easy to follow, so you can see exactly how things work. The situation has yet to arise, but if a widget doesn't have a feature I want, I can inherit a new widget from it that does exactly what I want.

    9. Cross-platform development is the goal of many tools, Fox included. But, I have come to see that real world application programming is about being able to call into an operating system's API to achieve outcomes that aren't clunky workarounds. For instance, I needed to convert rtf to pdf sanely and programmatically. The best way of doing this was to call LibreOffice in headless mode, but that resulted in console windows appearing. I found the appropriate Windows API call to suppress the console windows and substituted it for the normal system() call. While I could have done something similar in other languages it would have been much more difficult to implement. With Fox it was just a bit of cutting and pasting.
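
    For comparison, the same trick expressed in Python terms might look like the sketch below. This is an illustration, not the Fox/C++ code; CREATE_NO_WINDOW is the Windows-only flag (Python 3.7+) that stops the console from flashing up.

        # Sketch: convert RTF to PDF with headless LibreOffice, no console window.
        import subprocess, sys

        def rtf_to_pdf(rtf_path, out_dir="."):
            flags = subprocess.CREATE_NO_WINDOW if sys.platform == "win32" else 0
            subprocess.run(
                ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, rtf_path],
                check=True,
                creationflags=flags,
            )

        rtf_to_pdf("report.rtf")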

    Perhaps the biggest benefit of all from using Fox is that you cannot code yourself into a corner. So many times you start to use a tool only to find that there is some aspect of it that constrains your development. In 15 years of developing with Fox that is yet to happen. I've produced desktop applications and web sites using just Fox-Toolkit. Check it out.

    Friday, 29 June 2018

    Disconnected Recordsets

    If you Google the term 'disconnected recordset' the results returned refer to old style Microsoft ADO recordsets which are able to use database cursors, but Microsoft has stopped using that terminology, because their newer ADO.NET recordsets are always disconnected. Database cursors have dropped out of favour because they maintain state on a database, which isn't scalable, particularly on web servers.

    Reading some of Microsoft's documentation today surprised me, because ADO.NET's approach to database access is nearly identical to my own hand-crafted disconnected Recordset object, which I came up with nearly 20 years ago, roughly the same time ADO.NET 1.0 was released.

    What does my Recordset object do?
    1. It parses a tokenised string from the database driver or from a file into a doubly linked list, which is then referred to as a recordset.
    2. It contains methods to iterate forwards and backwards through a recordset.
    3. It contains methods to retrieve the contents of a field by name or position from the current row in a recordset.
    4. It contains date and time methods.
    5. It can sort a recordset.
    6. It can emit a recordset as a tokenised string.
    7. It can write a recordset back to the database in one round trip.
    8. All data access is carried out under transaction management.
    9. Additional methods can be added to the object as needed, without impacting earlier code.
    Why did Microsoft move to disconnected recordsets from cursors?  I have found the following benefits:

    1. Multiple recordset objects share the same database connection which is opened automatically when the first query is run.
    2. A recordset object can act as a record datatype without using a database.
    3. The recordset object hides a lot of the repetitive insert, update, delete queries that come with SQL. SQL queries are still used a lot though. Without in depth knowledge of SQL, you can't use relational databases effectively.
    4. As the recordset is a tokenised string it can be written to disk and retrieved again.
    5. Transaction management can span multiple recordset objects.
    OK, what's this continual banging on about tokenised strings!  It turns out that these are the key feature that makes the whole disconnected recordset thing work.  Here's a tokenised string:

    [[{'name':'ID','type':'i','null':0,'width':4},{'name':'DESCRIPTION','type':'s','null':1,'width':30},{'name':'REORDER','type':'i','null':0,'width':4}],
    [[          5,'Moved',          5],
    [         10,'Critical',         10],
    [         20,'Urgent',         20],
    [         30,'Normal',         30],
    [         40,'Wish',         40]]]

    Through a historical accident, it so happens that this tokenised string is also a Python list. Perhaps if I were to reimplement it today I would use JSON, which is more widely used. Nevertheless, this format works well. It can be evaluated directly using Python, and I have written a parsing routine in C++ that converts it to a doubly linked list in a single pass.
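
    For instance, the first tokenised string above can be pulled apart with a few lines of Python. This is only a sketch (ast.literal_eval is a safer stand-in for plain eval here; the helper names are invented):

        # Sketch: evaluate a tokenised string and read fields by column name.
        import ast

        tokenised = ("[[{'name':'ID','type':'i','null':0,'width':4},"
                     "{'name':'DESCRIPTION','type':'s','null':1,'width':30},"
                     "{'name':'REORDER','type':'i','null':0,'width':4}],"
                     "[[5,'Moved',5],[10,'Critical',10],[20,'Urgent',20],"
                     "[30,'Normal',30],[40,'Wish',40]]]")

        header, rows = ast.literal_eval(tokenised)    # [column descriptors], [rows]
        columns = [col['name'] for col in header]     # ['ID', 'DESCRIPTION', 'REORDER']

        def field(row, name):
            """Retrieve a field from a row by column name."""
            return row[columns.index(name)]

        for row in rows:
            print(field(row, 'ID'), field(row, 'DESCRIPTION'))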

    JSON can exceed two dimensions and use schemas, but the column and row format is what makes my tokenised string so efficient to parse and to manipulate once parsed. Columns and rows are easy to understand and column numbers rather than column names can be used for faster data retrieval. Not that this makes much difference, since a linked list is very fast to iterate over, so I usually use column names instead of column numbers to maintain code legibility.

    Looking at the string you can observe that it is restricted to columns and rows, but it isn't a flat file. Instead, this format lies somewhere between a flat file format like dBase, which has fixed column widths and is very fast to query using file offsets, and CSV files, which don't have quite enough information to be useful as a recordset.

    Here's another tokenised string:

    [[{'name':'ID','type':'i','null':0,'width':4},{'name':'MSGID','type':'i','null':0,'width':4},{'name':'FKISSUES','type':'i','null':0,'width':4},{'name':'FKPEOPLE','type':'i','null':0,'width':4},{'name':'BODY','type':'s','null':1,'width':32765},{'name':'STATUS','type':'i','null':0,'width':4},{'name':'LAST_ACTIVITY','type':'d','null':0,'width':8}],
    [[          4,          4,          3,          6,'What would we talk about?',         30,'2018-06-02 15:10:07.0000'],
    [          6,          6,          3,          6,'I need to keep going with this software.',         30,'2018-06-07 21:01:40.0000'],
    [          7,          7,          3,          6,'Another note...',         30,'2018-06-19 11:16:40.0000']]]

    Notice that the field descriptors contain the field type. This is vital for sensible parsing of the recordset. Type inference, which is what you have with CSV files, can only get you so far. Knowing the data type of a field beforehand allows for clever data massaging, e.g. my recordset object has an optional flag for elapsed time which represents dates as a duration from now rather than the raw format you see here.

    There are a few things that aren't so great with disconnected recordsets:

    1. They can't store binary data. I get around this by converting to Base64 which is slow. Over time, I've come to realise that binary data is much easier to use if it is kept out of a database anyway.
    2. How do you deal with deeply nested data? Relational databases don't store data hierarchically.  Instead they use relations between tables. It is possible to store data hierarchically using table relations, although a little tricky (a whole blog post could be devoted to that subject). Recently, I have found the need to store data hierarchically, but I do that using the database tables themselves, so it has no effect on my recordsets.
    3. What about shallow nested data? I use Base64 encoding to store recordsets inside recordsets, since none of the Base64 characters match the recordset's delimiters (a sketch follows this list), but I've only encountered one instance where shallow nesting was needed.
    4. The parsing routines must escape delimiting characters. I use octal escapes to avoid the escaped representation containing letters, which hexadecimal would require.
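
    A sketch of the Base64 nesting from point 3 (the field layout here is invented, just to show the principle):

        # Sketch: nest one recordset inside another via Base64, whose alphabet
        # (letters, digits, '+', '/', '=') never clashes with the delimiters.
        import base64

        inner = "[[{'name':'ID','type':'i','null':0,'width':4}],[[1],[2]]]"
        encoded = base64.b64encode(inner.encode()).decode()

        outer_row = [1, encoded]        # e.g. a row whose second field holds the nested recordset

        recovered = base64.b64decode(outer_row[1]).decode()
        assert recovered == inner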

    Technical Details 

    • disconnected Recordset client (data view)
      • C++ using Fox-Toolkit which has an excellent string class that I use heavily in my projects.
      • eval in Python is all you need.
    • disconnected Recordset driver (data adapter)
      • native driver written in C for the FirebirdSQL relational database.
    • disconnected Middleware (connection)
      • both the client and server depend on IsectMP to work.  IsectMP is a tiny RPC multiplexer written in C that I maintain.  It contains the FirebirdSQL driver, but not the C++ client.