Thursday, February 28

j: deserialize from wordpress

In http://r-nd-m.blogspot.com/2018/02/jq-serialize-for-wordpress.html I was using jq to import json content into wordpress. There was significant follow-on work, as I was learning wordpress and mostly working at the database level, but the results were mostly satisfactory. (Though, in retrospect, maybe I should have been using javascript (node) rather than jq.)

Anyways, now I need to do some more low-level work (migrating images to s3, because of wpengine storage constraints), but before I can adequately plan that out, I need to verify some assumptions.

And, for that, I want to build a list of all image urls for all crops (among other things, to see if any crops are shared across images, as a consequence of some earlier repair work).

So... time to build a wordpress deserializer in J.

Here's a draft:

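fromwp=:3 :0
  NB. y: serialized string; result: the deserialized value
  'off val'=. 0 wp_unserialize y
  assert. off -: #y  NB. the parse must consume the whole string
  val
)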

wp_unserialize=:4 :0
  NB. x: offset into y; y: serialized string; result: (offset just past this value);value
  NB. almost all numbers are character counts (numeric results are the exception)
  select. x { y
    case.'N' do.
      assert. 'N;' -: (x+0 1) { y
      (x+2);<a:
    case.'b' do.
      b=. (x+i.4) { y
      assert. (b -: 'b:0;') +. b-:'b:1;'
      (x+4);b-:'b:1;'
    case.'i' do.
      assert. 'i:' -: y{~x+0 1
      s=. (x+2) ';'findCh y
      (x+3+s);_ ". y{~x+2+i.s
    case.'d' do.
      assert. 'd:' -: y{~x+0 1
      s=. (x+2) ';'findCh y
      (x+3+s);_ ". y{~x+2+i.s
    case.'s' do.
      l=. _ ". y{~x+2+i. (x+2) ':'findCh y
      o=. #":l
      r=. y{~x+o+4+i.l
      assert. ('s:',(":l),':"',r,'";') -: y{~x+i.6+o+l
      (x+o+l+6);r
    case.'a' do.
      assert. 'a:' -: y{~x+0 1
      l=. _ ". y{~x+2+i. (x+2) ':'findCh y
      o=. #":l
      assert. ('a:',(":l),':{') -: y{~x+i.4+o
      NB. iterate through array - use a helper for this for clarity
      'x r'=. (x+4+o) l wp_unserialize_array y
      assert. '}'-:x{y
      (x+1);<r
    case. do.
      'unsupported serialization type' throw.
  end.
)

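wp_unserialize_array=:1 :0
:
  NB. m: number of key/value pairs; x: offset of the first key; y: serialized string
  NB. result: (offset just past the last value);<two-row table (keys over values)
  kv=.2 0$''
  for. i.m do.
    'x k'=.x wp_unserialize y
    'x v'=.x wp_unserialize y
    kv=.kv,.k,&<v
  end.
  x;<kv
)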
       
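findCh=:1 :0
:
    NB. m: character to find; x: start offset; y: string
    NB. result: distance from x to the first m at or after x
    peek=. y{~x+i.x-~(x+100)<.#y  NB. look in the next 100 characters first
    if. m e. peek do.
      peek i.m
    else.
      (x}.y)i.m  NB. fall back to scanning the rest of y
    end.
)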
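
To make the offset arithmetic in the 's' case concrete, here is the same calculation done by hand at the top level on a made-up fragment. The names str, off, len, dig, txt and nxt exist only for this illustration (they stand in for y, x, l, o and r above); treat it as a sketch of the bookkeeping, not a replacement for the parser.

```j
NB. made-up sample: a string element followed by an integer element
str=: 's:5:"width";i:4000;'
off=: 0                                          NB. offset where the element starts
len=: _ ". str {~ off+2+i. (str }.~ off+2) i. ':'  NB. declared byte count
dig=: # ": len                                   NB. digits used by that count
txt=: str {~ off+dig+4+i. len                    NB. payload follows  s: count :"
nxt=: off+dig+len+6                              NB. also skips the closing ";
```

Here nxt lands on the 'i' that starts the next element, which is the offset the parser hands back to its caller.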

Some notes:

  • J arrays are a bit different from php arrays. So I elected to represent php arrays as two-row arrays in J (first row is keys, second is values).
  • Technically wordpress arrays could include serialized objects. But those aren't relevant here, so I don't support them.
  • I had been working with a result exported from mysql (using the -s command-line option of the mysql client) and read into J using readdsv. This mangled double quote characters. I now use (<;._2@,&TAB);._2 instead of readdsv.
  • As an aside, mysql exports four characters with backslash escapes (backslash, newline, tab and ascii nul) - I handle this issue before running fromwp, using: 
    • fromwp rplc&('\\';'\';'\0';({.a.);'\t';TAB;'\n';LF) exportedstring
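
As a small illustration of the last two notes, here is the tab/newline cut and the unescaping step applied to a hand-made sample. The raw text and its field values are invented for the example (this is not real export data), and the names raw, tbl, val and txt are only for illustration. TAB, LF and rplc come from the standard J profile.

```j
NB. invented sample: tab-separated fields, one LF-terminated row per line
raw=: 'post_id',TAB,'meta_value',LF,'17',TAB,'a\tb',LF
tbl=: (<;._2@,&TAB);._2 raw     NB. rows cut at LF, fields cut at TAB, each field boxed
val=: (<1 1) {:: tbl            NB. second row, second field; still escaped
txt=: val rplc '\\';'\';'\0';({.a.);'\t';TAB;'\n';LF
```

After this, tbl is a 2 by 2 table of boxed strings and txt holds a real tab character where val had the two characters \t.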

Here's what I use to index from these two-row arrays which represent php arrays:

idx=:4 :0
  (a:,~{:y) {::~"1 0 ({. y) i. <^:(0=L.) x
)

The left argument is a key (or a boxed list of keys) and the right argument is one of these two-row boxed arrays. This routine unboxes its result. Invalid keys give empty results.
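
A quick sketch of how idx behaves, using hand-built tables rather than parser output. The tables thumb, sizes and meta and all of their values are made up for the example; idx is repeated only so the snippet runs on its own.

```j
idx=:4 :0
  (a:,~{:y) {::~"1 0 ({. y) i. <^:(0=L.) x
)

NB. made-up two-row tables: keys on top, values below
thumb=: 2 2 $ 'file';'width';'pic-150x150.jpg';150
sizes=: 2 1 $ 'thumbnail';<thumb
meta=: 2 3 $ 'width';'height';'sizes';4000;2650;<sizes

w=: 'width' idx meta                             NB. single key
wh=: ('width';'height') idx meta                 NB. boxed list of keys
f=: 'file' idx 'thumbnail' idx 'sizes' idx meta  NB. nested lookup, right to left
e=: 'nope' idx meta                              NB. missing key gives an empty result
```

Because J evaluates right to left, chained lookups read from the outermost array on the right to the innermost key on the left.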

Generally, when debugging this code, I enable suspensions (there's a foreign I could use for that, 13!:0, but there's also a cmd-K (or control-K) that turns on debugging under jqt). Usually just looking at the variables and character subsequences is enough to make problems obvious.

(I certainly don't memorize every little bit of this code. Yes, it's a bit ugly - parsers are like that. Either ugly, or so abstracted that you can't find the ugly stuff (and, thus, still not memorizable).)

Anyways, here's an example (or part of it):

   fromwp (<-1 1){::rawmeta
┌─────┬──────┬─────────────────────────────────────────────────┬────────...
│width│height│file                                             │sizes   ...
├─────┼──────┼─────────────────────────────────────────────────┼────────...
│4000 │2650  │outdoor-lighting-options-good-fences-sun-1016.jpg│┌───────...
│     │      │                                                 ││thumbna...
│     │      │                                                 │├───────...
│     │      │                                                 ││┌──────...
│     │      │                                                 │││file  ...
│     │      │                                                 ││├──────...
│     │      │                                                 │││outdoo...
│     │      │                                                 ││└──────...
│     │      │                                                 │└───────...
└─────┴──────┴─────────────────────────────────────────────────┴────────...

If it weren't clipped, that representation of the array would be 1241 characters wide.