Friday, 29 June 2018

Disconnected Recordsets

If you Google the term 'disconnected recordset' the results returned refer to old style Microsoft ADO recordsets which are able to use database cursors, but Microsoft has stopped using that terminology, because their newer ADO.NET recordsets are always disconnected. Database cursors have dropped out of favour because they maintain state on a database, which isn't scalable, particularly on web servers.

Reading some of Microsoft's documentation today surprised me, because ADO.NET's approach to database access is nearly identical to my own hand-crafted disconnected Recordset object, which I came up with nearly 20 years ago, roughly the same time ADO.NET 1.0 was released.

What does my Recordset object do?
  1. It parses a tokenised string from the database driver or from a file into a double linked list which is then referred to as a recordset.
  2. It contains methods to iterate forwards and backwards through a recordset.
  3. It contains methods to retrieve the contents of a field by name or position from the current row in a recordset.
  4. It contains date and time methods.
  5. It can sort a recordset.
  6. It can emit a recordset as a tokenised string.
  7. It can write a recordset back to the database in one round trip.
  8. All data access is carried out under transaction management.
  9. Additional methods can be added to the object as needed, without impacting earlier code.
Why did Microsoft move to disconnected recordsets from cursors?  I have found the following benefits:

  1. Multiple recordset objects share the same database connection which is opened automatically when the first query is run.
  2. A recordset object can act as a record datatype without using a database.
  3. The recordset object hides a lot of the repetitive insert, update, delete queries that come with SQL. SQL queries are still used a lot though. Without in depth knowledge of SQL, you can't use relational databases effectively.
  4. As the recordset is a tokenised string it can be written to disk and retrieved again.
  5. Transaction management can span multiple recordset objects.
OK, what's this continual banging on about tokenised strings!  It turns out that these are the key feature that makes the whole disconnected recordset thing work.  Here's a tokenised string:

[[{'name':'ID','type':'i','null':0,'width':4},{'name':'DESCRIPTION','type':'s','null':1,'width':30},{'name':'REORDER','type':'i','null':0,'width':4}],
[[          5,'Moved',          5],
[         10,'Critical',         10],
[         20,'Urgent',         20],
[         30,'Normal',         30],
[         40,'Wish',         40]]]

Through a historical accident, it so happens that this tokenised string is also a Python list. Perhaps if I was to reimplement it today I would use JSON which is more widely used. Nevertheless, this format works well. It can be evaluated directly using Python and I have written a parsing routine in C++ that converts it to a double linked list in a single parse.

JSON can exceed two dimensions and use schemas, but the column and row format is what makes my tokenised string so efficient to parse and to manipulate once parsed. Columns and rows are easy to understand and column numbers rather than column names can be used for faster data retrieval. Not that this makes much difference, since a linked list is very fast to iterate over, so I usually use column names instead of column numbers to maintain code legibility.

Looking at the string you can observe that it is restricted to columns and rows, but it isn't a flat file. Instead, this format lies somewhere between a flat file format like dBase which has fixed column widths and is very fast to query using file offsets and CSV files which don't have quite enough information to be useful as a recordset.

Here's another tokenized string:

[[{'name':'ID','type':'i','null':0,'width':4},{'name':'MSGID','type':'i','null':0,'width':4},{'name':'FKISSUES','type':'i','null':0,'width':4},{'name':'FKPEOPLE','type':'i','null':0,'width':4},{'name':'BODY','type':'s','null':1,'width':32765},{'name':'STATUS','type':'i','null':0,'width':4},{'name':'LAST_ACTIVITY','type':'d','null':0,'width':8}],
[[          4,          4,          3,          6,'What would we talk about?',         30,'2018-06-02 15:10:07.0000'],
[          6,          6,          3,          6,'I need to keep going with this software.',         30,'2018-06-07 21:01:40.0000'],
[          7,          7,          3,          6,'Another note...',         30,'2018-06-19 11:16:40.0000']]]

Notice that the field descriptors contain the field type. This is vital for sensible parsing of the recordset. Type inference which is what you have with CSV files can only get you so far.  Knowing the data type of the field beforehand allows for clever data massaging e.g. my recordset object has an optional flag for elapsed time which represents dates as duration from now rather than the raw format you see here.

There's a few things that aren't so great with disconnected recordsets:

  1. They can't store binary data. I get around this by converting to Base64 which is slow. Over time, I've come to realise that binary data is much easier to use if it is kept out of a database anyway.
  2. How do you deal with deeply nested data? Relational databases don't store data hierarchically.  Instead they use relations between tables. It is possible to store data hierarchically using table relations, although a little tricky (a whole blog post could be devoted to that subject). Recently, I have found the need to store data hierarchically, but I do that using the database tables themselves, so it has no effect on my recordsets.
  3. What about shallow nested data? I use Base64 encoding to store recordsets inside recordsets, since none of the Base64 characters match the recordset's delimiters, but I've only encountered one instance where shallow nesting was needed.
  4. The parsing routines must escape delimiting characters. I use octals to avoid the escaped representation containing letters which hexadecimal requires.

Technical Details 

  • disconnected Recordset client (data view)
    • C++ using Fox-Toolkit which has an excellent string class that I use heavily in my projects.
    • eval in Python is all you need.
  • disconnected Recordset driver (data adapter)
    • native driver written in C for the FirebirdSQL relational database.
  • disconnected Middleware (connection)
    • both the client and server depend on IsectMP to work.  IsectMP is a tiny RPC multi-plexer written in C that I maintain.  It contains the FirebirdSQL driver, but not the C++ client.