Thursday, July 2

An Implementation of J

The Iverson Software folks have put up a free copy of their original book documenting J source code:

It's all pictures, so not searchable as text without OCR (but text search traditionally has been horrible in its treatment of J so there is some poetic justice here...).


(I had a few people tell me they wanted to see more of that kind of thing. I have not mustered enough round tuits to go there, but this is a step in that direction, and this book has been an important part of my internal understanding of the language implementation.)

Thursday, April 9

word searching

So... I was reading and I ran across this bit:

Ways of Index Partitioning
  • By doc: each shard has index for subset of docs
    • pro: each shard can process queries independently 
    • pro: easy to keep additional per-doc information 
    • pro: network traffic (requests/responses) small 
    • con: query has to be processed by each shard 
    • con: O(K*N) disk seeks for K word query on N shards  
  • By word: shard has subset of words for all docs 
    • pro: K word query => handled by at most K shards 
    • pro: O(K) disk seeks for K word query 
    • con: much higher network bandwidth needed 
      • data about each word for each matching doc must be collected in one place 
    • con: harder to have per-doc information
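
To make the shard-count tradeoff in that list concrete, here is a minimal sketch in Python. The function names and the CRC32 hashing are my own assumptions for illustration, not anything from the quoted material: it just counts how many shards a K-word query touches under each partitioning scheme.

```python
import zlib

def shard_of(key, num_shards):
    """Deterministically assign a string key to a shard (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def shards_hit_by_doc(query_words, num_shards):
    # Doc-partitioned: any shard may hold a matching doc, so every shard is asked.
    return set(range(num_shards))

def shards_hit_by_word(query_words, num_shards):
    # Word-partitioned: only the shards that own the query words are asked.
    return {shard_of(word, num_shards) for word in query_words}

query = ["office", "of", "the", "president"]
print(len(shards_hit_by_doc(query, 16)), "shards touched when partitioned by doc")
print(len(shards_hit_by_word(query, 16)), "shards touched when partitioned by word")
```

This says nothing about the bandwidth side of the tradeoff; it only shows the "N shards versus at most K shards" part.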

Hopefully they have moved beyond that. I mean, it's not like they don't have the time or the talent. Also, some of the search algorithms I have heard described strongly suggest word-based indexing.

That said, here are some thoughts:

  1. Hybrid approaches are valid. You can speed up doc searches if you only search docs which contain the word. This isn't all that useful when the word is "the" but can be quite useful when the word is "drachma".
  2. You can have word indexes which select documents, and you can have word indexes which select elements within documents (paragraphs, sentences, table cells, titles, ...).
  3. You can have word indices which also track adjacent words.
  4. Words themselves are an exact list of characters but you can build out lists of "this word likely means this list of words given this context" for people where that makes sense. Identifying contexts takes work, of course - and feedback.

If you are looking for a single word, having a canned search result for that word means the search is trivial - precached. That said, you could also think of this from an Oxford English Dictionary approach - where words tend to have many definitions and examples, all of which are slight variations on each other. Here, you try to give the user a way of selecting the context they are currently interested in. (For example: are they researching spam, are they dealing with biochemistry, are they searching for cute cat pics, are they building a bridge, etc...)

But when you have multiple words, it gets more interesting. Here, you can start with the rarest word, and use that as the basis for your search result, constraining it by the more frequent words.

Phrase search should maybe be the default, falling back first to words in the same document element, then (if that fails) to words in the same document, and possibly (if that fails too) to words linking to the document. These fallbacks need some description to the user of what they're getting and how to fall back to the slower approaches, and you can impose a small search delay on the fallbacks to hint at the underlying mechanical costs. And if you are doing a phrase search and you have adjacent words stored, you might be able to conduct the search entirely without even visiting indices for the most common words in the phrase.
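
The "start with the rarest word" idea can be sketched roughly like this in Python. This is a toy in-memory index, and `build_index` and `search_all_words` are names I made up for the example. A real index would not fit in a dict, and tokenizing by whitespace is a simplifying assumption.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all_words(index, query):
    """Return doc ids containing every query word, starting from the rarest."""
    words = query.lower().split()
    if not words:
        return set()
    # Sort by posting-list size so the smallest candidate set comes first.
    words.sort(key=lambda w: len(index.get(w, ())))
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        if not result:
            break  # nothing left to constrain
        result &= index.get(word, set())
    return result

docs = {
    1: "the price of the drachma fell",
    2: "the office of the president",
    3: "the drachma and the president",
}
index = build_index(docs)
print(search_all_words(index, "the drachma president"))
```

The point of sorting first is that the candidate set never grows past the size of the rarest word's list, so "the" only ever gets used as a filter.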

For example, "office of the president" has a specific meaning which is different from other meanings involving the words "office" and "president", even though "of" and "the" are typically "stop words". But if every entry in the "office" word index recorded its word position in the document and the words immediately preceding and following it, and if all word indices were structured the same way, you would just need to query the indices for "office" and "president" to complete this search.
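
Here is a rough Python sketch of that four-word case. The entry layout (doc id, word position, previous word, next word) and the function names are assumptions of mine for illustration. It handles only four-word phrases, where the two inner words are covered by the neighbors stored with the outer words.

```python
from collections import defaultdict

def build_neighbor_index(docs):
    """word -> list of (doc_id, position, previous_word, next_word)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos, word in enumerate(words):
            prev_word = words[pos - 1] if pos > 0 else None
            next_word = words[pos + 1] if pos + 1 < len(words) else None
            index[word].append((doc_id, pos, prev_word, next_word))
    return index

def find_four_word_phrase(index, phrase):
    """Find 'w0 w1 w2 w3' by reading only the posting lists for w0 and w3."""
    w0, w1, w2, w3 = phrase.lower().split()
    # Places where w0 is immediately followed by w1.
    starts = {(d, p) for d, p, _prev, nxt in index.get(w0, []) if nxt == w1}
    hits = []
    # w3 must be preceded by w2 and sit three positions after a valid start.
    for d, p, prev, _nxt in index.get(w3, []):
        if prev == w2 and (d, p - 3) in starts:
            hits.append((d, p - 3))
    return sorted(hits)

docs = {
    1: "the office of the president issued a statement",
    2: "the president left the office of the mayor",
}
index = build_neighbor_index(docs)
print(find_four_word_phrase(index, "office of the president"))
```

The posting lists for "of" and "the" are never read; the word positions are what tie the two halves of the phrase together.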

Bandwidth? Seriously, how was that measured?

That said, there are and will be costs to this - having to do with normalization rules and so on. In addition to logical "word position" you also almost certainly need physical "byte offset" or maybe "character position" for some of the UI related tasks. And, these word indices never exclude the need for document indices. So you wind up having both (and several versions of both), with a certain amount of aging...

So, yeah, hopefully Google is already long past the stage of doing things this way. 

If I were Google I'd be having several different competing ways of representing word searches and I'd be watching each of them for signs that they are returning substandard results. (And, yeah, that would have to be a mix of metrics (counting) and samples (random inspection), and I'd be wanting to contrast current presentation with those from years ago - both for signs of stability and signs of current areas of progress.)

[I'd also want to be hooking up with people whose areas of focus are outside the computer realm - farmers, plumbers, etc. People doing tangible work which directly benefits other people. Not for their impulse quips, but just to make sure I had my head on straight.]

Sunday, March 22


I've been thinking a lot about Libertarianism.

In one sense, it's advocating personal responsibility as a solution to society's needs. There's a lot of truth to that, but I'm not seeing it happen on the scale it needs to happen.

I have heard Libertarianism expressed as "Let the market solve it" - the idea being that markets are efficient at solving problems. The problem, though, is that - mathematically speaking - markets probably are not efficient:

I have heard Libertarianism expressed as deregulation. The problem with that, though, is that often the wrong things get deregulated.

Specifically: dollars are a construct of the government. Without the system of government regulations which demand their use for some things and place various restrictions on their use, dollars are just numbers. And if that is what you want, it's not clear why government should need to make any changes - we already have the ability to use numbers.

But we could also think of Libertarianism as a part of a reasonable system of checks and balances. When we think of questions like "who watches the watchers", Libertarianism apparently says: as few people as possible, let's just have police and military and get rid of courts, libraries, food quality regulations, NASA, college, school, etc. etc.


So, ok, there's some validity here. But what this demands is that (a) everyone work incredibly hard at making things right, and (b) that we put up with a lot of nastiness. Of course, to some degree we do that already.

I guess the problem is: taken to its logical extreme, libertarianism would be one of the most repressive forms of government you've ever heard of.

We get to keep:

* Police (not much)
* Military (not much)
* Crime (um wait...)

So what's the distinction between that and what we've got now? The argument goes that our government is criminal. And, looking at incarceration rates, and various other problems such as civil forfeiture, we do indeed have big problems.

But those sorts of problems don't just go away. Those problems are people problems.

But if Libertarianism is the answer you don't get to the solution by opposing government action. Getting rid of libraries and regulations prohibiting the sale of rotten food is not going to make anything better. You get to the solution by helping people out as much as you can.

Put differently: "big government" is not just votes, but it's also dollars. If you get rid of government you get rid of dollars. But if that's going to work you have to be able to make things work - as much as possible - without dollars. And, to be frank, we need a lot of that kind of activity.

We have some sizable percent of our population apparently "idle" - not working, not employed. Or at least, not in the sense of exchanging their time for dollars. Libertarianism is valid to the degree that it can get work done without using dollars. And there's some significant examples of that, of course.

You could say that working for no pay is the ultimate expression of Capitalism - it's the limit condition of competitiveness. And setting up shelters for people, feeding people, making sure they have productive uses for their time? That's all good.

But what this really means is that Libertarianism is advocating the idea that we ignore the economy - let the 1% have all the money, it's irrelevant. Riches have nothing to do with dollars, and people should have the right to solve problems without them.

Just don't expect it to be all pleasantness and beauty. It's going to be hard work.

Of course, this kind of thinking has relevance not only in the context of libertarianism but in the context of economics and of government and politics also. None of them are particularly valid or meaningful when taken in pure form. These lines of thought need context and connections before they can be thought of as valid.

It's also going to involve letting lots of people in the rest of the world kill each other. It's asking us to stand by and watch while people kill, maim and torture each other. It means not being afraid while nuclear weapons are developed and deployed. Or maybe it means the opposite. Maybe it means intervening using contractors and ... wait, isn't that what we're currently doing?

(And, I know - read this book... but what do I do about the mistakes in that book?)

Anyways, it's a head scratcher for me.

I agree with a lot of the principles advocated by Libertarians. But at the same time, I'm not seeing the sorts of action taking place which those principles would suggest need to happen.

Yes, we need solutions to problems which government is not addressing, and which the dollar-based economy is not addressing. But you're not going to get those solutions using government activity nor using the economy, except in some sort of minimal sense. You're going to get those solutions by (a) getting people to cooperate and solve their problems, and (b) showing that you've done so and showing other people how to do it.

And a few people are doing that. But - personally - where I live I don't even know how to find people who need the sort of problems solved where I feel I can usefully contribute.

How about you?