Thursday, February 28

j: deserialize from wordpress

In http://r-nd-m.blogspot.com/2018/02/jq-serialize-for-wordpress.html I was using jq to import json content into wordpress. There was significant follow-on work, as I was learning wordpress and mostly working at the database level, but the results were mostly satisfactory. (Though, in retrospect, maybe I should have been using javascript (node) rather than jq.)

Anyways, now I'm needing to do some more low-level work (need to migrate images to s3, because of wpengine storage constraints), but before I can adequately plan that out, I need to verify some assumptions.

And, for that, I want to build a list of all image urls for all crops (among other things, to see if any crops are shared across images, as a consequence of some earlier repair work).

So... time to build a wordpress deserializer in J.

Here's a draft:

fromwp=:3 :0
  'off val'=. 0 wp_unserialize y
  assert. off -: #y
  val
)

wp_unserialize=:4 :0
  NB. almost all numbers are character counts (numeric results are the exception)
  select. x { y
    case.'N' do.
      assert. 'N;' -: (x+0 1) { y
      (x+2);<a:
    case.'b' do.
      b=. (x+i.4) { y
      assert. (b -: 'b:0;') +. b-:'b:1;'
      (x+4);b-:'b:1;'
    case.'i' do.
      assert. 'i:' -: y{~x+0 1
      s=. (x+2) ';'findCh y
      (x+3+s);_ ". y{~x+2+i.s
    case.'d' do.
      assert. 'd:' -: y{~x+0 1
      s=. (x+2) ';'findCh y
      (x+3+s);_ ". y{~x+2+i.s
    case.'s' do.
      l=. _ ". y{~x+2+i. (x+2) ':'findCh y
      o=. #":l
      r=. y{~x+o+4+i.l
      assert. ('s:',(":l),':"',r,'";') -: y{~x+i.6+o+l
      (x+o+l+6);r
    case.'a' do.
      assert. 'a:' -: y{~x+0 1
      l=. _ ". y{~x+2+i. (x+2) ':'findCh y
      o=. #":l
      assert. ('a:',(":l),':{') -: y{~x+i.4+o
      'x r'=. (x+4+o) l wp_unserialize_array y
         NB. iterate through array - use a helper for this for clarity
      assert. '}'-:x{y
      (x+1);<r
    case. do.
      'unsupported serialization type' throw.
  end.
)

wp_unserialize_array=:1 :0
:
  kv=.2 0$''
  for.i.m do.
    'x k'=.x wp_unserialize y
    'x v'=.x wp_unserialize y
    kv=.kv,.k,&<v
  end.
  x;<kv
)
       
findCh=:1 :0
:
    peek=. y{~x+i.x-~(x+100)<.#y
    if. m e. peek do.
      peek i.m
    else.
      (x}.y)i.m
    end.
)
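
For cross-checking, here is a minimal Python sketch of the same offset-passing idea. It is not the J code above, just an assumed equivalent: every parse step returns a (next offset, value) pair, arrays come back as a (keys, values) pair to mirror the two-row layout, and objects are unsupported, as in the draft. The names `wp_unserialize` and `fromwp` are reused only for readability.

```python
def wp_unserialize(data: bytes, off: int = 0):
    """Parse one serialized value starting at off; return (next_offset, value)."""
    tag = data[off:off + 1]
    if tag == b"N":
        assert data[off:off + 2] == b"N;"
        return off + 2, None
    if tag == b"b":
        chunk = data[off:off + 4]
        assert chunk in (b"b:0;", b"b:1;")
        return off + 4, chunk == b"b:1;"
    if tag in (b"i", b"d"):
        assert data[off + 1:off + 2] == b":"
        end = data.index(b";", off + 2)
        text = data[off + 2:end].decode("ascii")
        return end + 1, (int(text) if tag == b"i" else float(text))
    if tag == b"s":
        colon = data.index(b":", off + 2)
        length = int(data[off + 2:colon].decode("ascii"))  # byte count, quotes excluded
        start = colon + 2                                  # skip ':' and the opening quote
        assert data[colon:start] == b':"'
        assert data[start + length:start + length + 2] == b'";'
        return start + length + 2, data[start:start + length]
    if tag == b"a":
        colon = data.index(b":", off + 2)
        count = int(data[off + 2:colon].decode("ascii"))   # number of key/value pairs
        assert data[colon:colon + 2] == b":{"
        off = colon + 2
        keys, vals = [], []
        for _ in range(count):
            off, key = wp_unserialize(data, off)
            off, val = wp_unserialize(data, off)
            keys.append(key)
            vals.append(val)
        assert data[off:off + 1] == b"}"
        return off + 1, (keys, vals)
    raise ValueError("unsupported serialization type at offset %d" % off)


def fromwp(data: bytes):
    """Parse a whole serialized value and insist that all input was consumed."""
    off, val = wp_unserialize(data)
    assert off == len(data)
    return val
```

Working on bytes rather than text matters here, because the length prefixes count bytes.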

Some notes:

  • J arrays are a bit different from php arrays. So I elected to represent php arrays as two row arrays in J (first row is keys, second is values).
  • Technically wordpress arrays could include serialized objects. But those aren't relevant here, so I don't support them here.
  • I had been working with a result exported from mysql (using the -s command line option on mysqlclient) and read into J using readdsv. This mangled double quote characters. I now use (<;._2@,&TAB);._2 instead of readdsv.
  • As an aside, mysql exports four characters with backslash escapes (backslash, newline, tab and ascii nul) - I handle this issue before running fromwp, using: 
    • fromwp rplc&('\\';'\';'\0';({.a.);'\t';TAB;'\n';LF) exportedstring
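
The export handling in the last two notes can be sketched in Python as follows. This is an illustration under assumptions, not the code used here: `split_rows` and `mysql_unescape` are hypothetical names, and the input is assumed to be the raw bytes of a tab-separated export in which only the four escapes listed above occur.

```python
import re

# The four backslash escapes described above, mapped to the bytes they stand for.
_MYSQL_ESCAPES = {
    b"\\\\": b"\\",
    b"\\0": b"\x00",
    b"\\t": b"\t",
    b"\\n": b"\n",
}


def split_rows(export: bytes):
    """Split an export into rows (on LF) and each row into fields (on TAB)."""
    lines = export.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the export ends with LF, so drop the empty tail
    return [line.split(b"\t") for line in lines]


def mysql_unescape(field: bytes) -> bytes:
    """Undo the escapes in one left-to-right pass over the field."""
    return re.sub(rb"\\[\\0tn]", lambda m: _MYSQL_ESCAPES[m.group(0)], field)
```

Doing this in a single pass matters: replacing `\\` first and `\n` afterwards in separate passes would turn an escaped backslash followed by the letter n into a newline.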

Here's what I use to index from these two-row arrays which represent php arrays:

idx=:4 :0
  (a:,~{:y) {::~"1 0 ({. y) i. <^:(0=L.) x
)

Left arg is key, right arg is array, those arrays are always boxed arrays. This routine unboxes its result. Invalid keys give empty results.

Generally, when debugging this code, I enable suspensions (there's a 13!: foreign I could use, but there's also a cmd-K (or control-K) that turns on debugging under jqt). Usually just looking at the variables and character subsequences is enough to make problems obvious.

(I certainly don't memorize every little bit of this code. Yes, it's a bit ugly - parsers are like that. Either ugly, or so abstracted that you can't find the ugly stuff (and, thus, still not memorizable).)

Anyways, here's an example (or part of it):

   fromwp (<-1 1){::rawmeta
┌─────┬──────┬─────────────────────────────────────────────────┬────────...
│width│height│file                                             │sizes   ...
├─────┼──────┼─────────────────────────────────────────────────┼────────...
│4000 │2650  │outdoor-lighting-options-good-fences-sun-1016.jpg│┌───────...
│     │      │                                                 ││thumbna...
│     │      │                                                 │├───────...
│     │      │                                                 ││┌──────...
│     │      │                                                 │││file  ...
│     │      │                                                 ││├──────...
│     │      │                                                 │││outdoo...
│     │      │                                                 ││└──────...
│     │      │                                                 │└───────...
└─────┴──────┴─────────────────────────────────────────────────┴────────...

If it weren't clipped, that representation of the array would be 1241 characters wide.
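
To check whether any crops are shared across images, something along these lines would do. It is a Python sketch under assumptions: each image's metadata is taken to be already converted to a plain dict with the `file` and `sizes` keys visible above, and the image ids, the sample file names and the helper names `crop_files` and `shared_crops` are all hypothetical.

```python
from collections import defaultdict


def crop_files(meta: dict) -> list:
    """File names of every crop listed under 'sizes' (empty if there are none)."""
    sizes = meta.get("sizes") or {}
    return [crop["file"] for crop in sizes.values() if "file" in crop]


def shared_crops(metas: dict) -> dict:
    """Map crop file name -> sorted image ids, for crops claimed by 2+ images."""
    owners = defaultdict(set)
    for image_id, meta in metas.items():
        for name in crop_files(meta):
            owners[name].add(image_id)
    return {name: sorted(ids) for name, ids in owners.items() if len(ids) > 1}


# Hypothetical sample data in the shape shown above.
metas = {
    1: {"file": "a.jpg", "sizes": {"thumbnail": {"file": "a-150x150.jpg"},
                                   "medium": {"file": "a-300x200.jpg"}}},
    2: {"file": "b.jpg", "sizes": {"thumbnail": {"file": "a-150x150.jpg"}}},
    3: {"file": "c.jpg"},
}
print(shared_crops(metas))
```

This compares bare file names only. Turning them into urls needs the uploads base url, and if the main files live in different directories the comparison should include the directory as well.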

Friday, February 9

jq: serialize for wordpress

This kind of just rolls off the tongue, right?

def towp: 
  if "null"==type then "N;"
  elif "boolean"==type then if . then "b:1;" else "b:0;" end
  elif "number"==type then if .==(.|floor) and .<9e15 then "i:"+(.|tostring)+";" else "d:"+(.|tostring)+";" end
  elif "string"==type then "s:"+(.|length|tostring)+":\""+.+"\";"
  else "a:"+(.|length|tostring)+":{"+(.|[to_entries|map(to_entries)[]|map(.value|towp)|add]|add)+"}"
  end;

For some reason, even though wordpress uses php, its serialized objects use a slightly different format from what's documented at http://php.net/manual/en/function.serialize.php -- see https://wpengine.com/support/wordpress-serialized-data/ for a decent description and https://codex.wordpress.org/Function_Reference/maybe_serialize for a more official description. (Edit: revisiting those pages shows that wordpress is now using php's serialize.)

Note, in particular, that in this wordpress format, strings get enclosing quotes but the reported length does not include those quotes. I do not yet know what this looks like for quotes and backslashes within a string - there are at least four possibilities (two hideously broken) for how that might be handled.
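
For reference, php's own serialize() writes the string's byte count followed by the raw bytes, with no escaping of embedded quotes or backslashes. A small Python sketch of that rule, assuming the text is stored as UTF-8 (`php_string` is a hypothetical helper name):

```python
def php_string(text: str) -> bytes:
    """Serialize one string the way php's serialize() does: byte length, raw bytes."""
    raw = text.encode("utf-8")  # assumption: the stored encoding is UTF-8
    return b's:%d:"%s";' % (len(raw), raw)


print(php_string('a"b\\c'))  # quote and backslash pass through unescaped
print(php_string("é"))       # one character, but a length of 2 bytes
```

One consequence: jq's `length` counts codepoints rather than bytes, so a jq serializer built on it would report lengths that disagree with php for non-ASCII strings.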

One issue, I imagine, is that when you delete an element from a php array, indices do not change for items after that point. The php serialize format does not preserve indices, but the wordpress serialize format does.

Anyways... it's a defining characteristic of software that "what works" generally takes precedence over "what's formally correct" (at least, if you want it to work).

Thus, although I keep harping on the distinction between "boolean" as used in software and its history:
... I still have to use the word "boolean" in my code (and in my web searches when I want to be reminded of related syntactic issues or whatever else -- on the plus side, long words that are relatively meaningless do have some advantages when searching, as long as you know you need to search for them).

But I guess that's related to one of the nice things about standards... there's so many to choose from.

And I guess that's also related to how you don't really want to use this on stuff that's already a string (unless sometimes it has to be something else, and you want to preserve that...).

And, also, related to how blogger's layout is so ornate that it pretty much has to be fixed width (which means it will be wrong for some desktops; they pretty much have to disable the theme support for phones - it's really that insane). [There is an option to revert the blog to "classic themes", but the preview on that is not really a preview at all, and I need to be doing other things now...]

And, I guess another loosely related issue is how incredibly difficult it can be to report bugs. (There's just so many people in the world - billions - and most of them do not have the frame of mind to understand what a meaningful bug report is. So as more and more people come online, things get more and more messed up and things that used to work start failing as a consequence. ... Eventually, the failures may become visible enough to get fixed anyways, but all too often huge issues can be neglected for decades, or longer. And, just figuring out where (or if) they should be fixed can be daunting.)

Anyways... here's the reverse transform:

def _fwp($P):
  $P.off as $j|
  .[$j:$j+1] as $typ|
  if "end"==$P.op then
    if ";"==$typ then
      $P|.off|=.+1
    else
      error("expected ';' at position "+($j|tostring)+" got '"+$typ+"' ")
    end
  elif "endarray"==$P.op then
    if "}"==$typ then
      $P|(.off|=.+1)|(.len=$P.stack[-1])|(.stack|=.[:-1])
    else
      error("expected '}' at position "+($j|tostring)+" got '"+$typ+"'")
    end
  elif "start"==$P.op then
    if "N"==$typ then
      _fwp($P|.op="end"|.r=null|.off|=.+1)
    elif ":"!=.[$j+1:$j+2] then
      error("expected : at position "+($j|tostring)+" got '"+.[$j+1:$j+2]+"'")
    elif "b" == $typ then
      .[$j+2:$j+3] as $t|
      if "0"==$t then
        _fwp($P|.op="end"|.r=false|.off|=.+3)
      elif "1"==$t then
        _fwp($P|.op="end"|.r=true|.off|=.+3)
      else
        error("expected 0 or 1 at position "+($j+2|tostring)+" got '"+$t+"'")
      end
    elif ("i"==$typ) or ("d"==$typ) then
      (.[$j+2:]|match("[^;]*")) as $match|
      _fwp($P|.op="end"|.r=($match.string|tonumber)|.off|=.+2+$match.length)
    elif "s"==$typ then
      (.[$j+2:]|match("[^:]*")) as $match|
      ($match.string|tonumber) as $strlen|
      ($j+4+$match.length) as $J|
      if ":\""==.[$J-2:$J] then
        .[$J:$J+$strlen] as $str|
        _fwp($P|.op="end"|.r=$str|.off=$J+$strlen+1)
      else
        error("expected ':\"' at "+($J-2|tostring)+" got '"+.[$J-2:$J]+"'")
      end
    elif "a"==$typ then
      (.[$j+2:]|match("[^:]*")) as $match|
      if 0==$match.length then
        error("invalid array length at position "+($j|tostring)+" got nothing")
      else
        ($match.string|tonumber) as $alen|
        ($j+3+$match.length) as $j|
        if ":"==.[$j-1:$j] then
          _fwp($P|.op="startarray"|(.stack|=(.+[$P.len]))|.len=$alen|.off=$j)
        else
          error("expected : at start of array at position "+($j|tostring)+" got '"+.[$j-1:$j]+"'")
        end
      end
    else
      error("unrecognized type "+$typ+" at position "+($j|tostring))
    end
  elif "startarray"==$P.op then
    if "{"==.[$j:$j+1] then
      if "i"==.[$j+1:$j+2] then
        _fwp($P|.op="array"|.r=[]|.off=$j+1)
      elif "s"==.[$j+1:$j+2] then
        _fwp($P|.op="object"|.r={}|.off=$j+1)
      elif "}"==.[$j+1:$j+2] then
        _fwp($P|.op="endarray"|.r={}|.off=$j+1)
      else
        error("invalid index type '"+.[$j+1:$j+2]+"' at position "+($j+1|tostring))
      end
    else
      error("expected { at start of array at position "+($j|tostring)+" got '"+.[$j:$j+1]+"'")
    end
  elif "array"==$P.op then
    if 0==$P.len then
      _fwp($P|.op="endarray")
    else
      $P.r as $r|
      _fwp($P|.op="start") as $P|
      $P.r as $key|
      if "number"==($key|type) then
        _fwp($P|.op="start") as $P|
        $P.r as $val|
        _fwp($P|.op="array"|.len|=(.-1)|.r|=$r+[$val])
      else
        error("invalid array index "+($key|tostring)+" at position "+($j|tostring))
      end
    end
  elif "object"==$P.op then
    if 0==$P.len then
      _fwp($P|.op="endarray")
    else
      $P.r as $r|
      _fwp($P|.op="start") as $P|
      $P.r as $key|
      if "string"==($key|type) then
        _fwp($P|.op="start") as $P|
        $P.r as $val|
        _fwp($P|.op="object"|.len|=(.-1)|.r=$r+{($key): $val})
      else
        error("invalid object index "+($key|tostring)+" at position "+($j|tostring))
      end
    end
  else
    error("program bug (this should never happen)")
  end;

def fromwp:
  _fwp({"op":"start","off":0,"len":"none","stack":[]}).r;

You wind up needing to write a parser in jq for the reverse transform, and it mostly has to go into a single recursive function because functions can't refer to other functions which have yet to be defined, and I couldn't think of any way of breaking out significant chunks that made sense, with that limitation.

It doesn't help that (at least in version 1.5) jq's error handling is kind of useless (does not tell you where in the code the error occurred). So I threw in some forced errors to help track down problems (I could do better about reporting where invalid numbers occurred, but try/catch in jq 1.5 mixed with error statements in code like this can lose track of context - in some cases which I found difficult to isolate I was seeing draft versions of this code trying to parse an error statement instead of the text it was supposed to be parsing.)

Another quirk here is that I threw in the support for double quotes around strings (why bother using a numeric string length that does not include those quotes? Scary design process there...) at the last minute, and I'm not bothering to check for a closing quote - I'm just skipping over that position without checking (which, ok, works just fine... but is sloppy and allows for future specification entropy in a bad way).

(Another issue with parsing numbers is that I might be asking jq to inspect the entire rest of the unparsed string to see where the number ends - I tried limiting it, using .[$j+2:30] instead of .[$j+2:], but jq would decide in some cases that that was an error (near the end of the string). I decided I did not want the complexity of trying to micromanage that issue - things are messy enough already and I didn't want the hard-to-debug number parsing to be even more obscure, so just went with the simple .[$j+2:] approach.)

This is painfully slow on large (50k) objects, but it seems to work...

I should build a report_parse_error({offset, expected}) routine based on J's multi-line reporting style and use that here. That should make for easier to read error messages.
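
As a rough picture of what such a routine might produce, here is a Python sketch. The three-line layout (message, snippet, caret) is an assumption about what "multi-line" could look like, not a reproduction of J's display, and the `context` parameter is hypothetical.

```python
def report_parse_error(text: str, offset: int, expected: str, context: int = 30) -> str:
    """Build a three-line message: what went wrong, nearby text, caret under the spot."""
    start = max(0, offset - context)
    end = min(len(text), offset + context)
    # Replace newlines one-for-one so the caret still lines up with the snippet.
    snippet = text[start:end].replace("\n", " ")
    got = repr(text[offset]) if offset < len(text) else "end of input"
    return "\n".join([
        "parse error at offset %d: expected %s, got %s" % (offset, expected, got),
        snippet,
        " " * (offset - start) + "^",
    ])


print(report_parse_error('a:1:{i:0;s:3:"abc"}', 18, "';'"))
```

Showing a bounded window around the offset keeps the message readable even when the input is a 50k object.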

I also should build an extract_next_number($offset) filter and use it here - that would let me be smart about how big of a string I'm searching for the end of the number in (can do minimum of +30 or remainder of string), which should be a big performance win.
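
The bounded search itself is simple to state. Here it is as a Python sketch rather than a jq filter, with the 30-character window taken from the paragraph above and everything else (the name's signature, the return shape, the error text) assumed for illustration:

```python
def extract_next_number(text: str, offset: int, window: int = 30):
    """Find the ';' that ends a number, looking at most `window` characters ahead.

    Returns (number_text, offset_of_semicolon).
    """
    # str.find clamps the end bound, so running past the end of text is harmless.
    end = text.find(";", offset, offset + window)
    if end < 0:
        raise ValueError("no ';' within %d characters of offset %d" % (window, offset))
    return text[offset:end], end


print(extract_next_number('i:4000;s:1:"x";', 2))
```

The cost per number is then bounded by the window size instead of by the length of the remaining input.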