Building ranges #16

jtremback · 2015-03-05T01:59:43Z

Hi, I'm replacing making indexes by concatenating values with ÿ and using an array encoded with bytewise instead. In my current code, when I retrieve a range, I make it like this:

  var range = {
    // ÿiÿ identifies an index doc
    // esc(query.k.join(',')) makes an identifier for the index
    // gte/lte.join('ÿ') joins the ranges with the delimiter
    gte: 'ÿiÿ' + query.k.join(',') + 'ÿ' + gte.join('ÿ') + 'ÿ',
    lte: 'ÿiÿ' + query.k.join(',') + 'ÿ' + lte.join('ÿ') + 'ÿÿ'
  }

At the end of the lte, you can see 'ÿÿ', which ensures that the range is inclusive. How should I handle this with bytewise? Concat an extra '\xf0' on the end of the lte array?

The text was updated successfully, but these errors were encountered:

deanlandolt · 2015-03-05T02:48:50Z

I'd reserved '\xff' ('ÿ') specifically to act as a high key sentinel for this use case, and I've been planning to add an API for encoding queries to make this easy. In the meantime you should be able to get the desired behavior by appending undefined to your array (which, as you seem to have already noticed, will serialized to '\xf0' -- the highest type tag accessible in the standard encoding scheme (for now, the only encoding scheme, until the above-mentioned query encoding API is added).

Just to be clear, you'd concat undefined to the array, not '\xf0', which would be encoded just like any other string, prefixed with type tag '\x70'. This would come before complex objects like arrays and objects in the sort order, so it's probably not what you're looking for. The same goes for Buffer([ 255 ]), which comes even before strings in the sort order.

One caveat to this undefined last-element approach: it falls down if array values in your index are allowed to contain undefined elements (i.e. sparse arrays). In this case, you could always manually munge the final undefined byte that was added in the encoded buffer value. Or you could nag me for that query API -- having a practical use case would be good motivation!

dominictarr · 2015-03-05T06:32:35Z

@jtremback don't use \xff, that was a mistake, and if you have unicode characters, it may sort lower than them, creating unexpected results. Instead use \uffff, or even better, encode your keys as arrays with bytewise and use null/undefined to sort them as @deanlandolt suggests.

jtremback · 2015-03-05T07:40:27Z

Thanks! I had been attempting to concat null on the end. Switched it to undefined and my tests pass.

  var range = {
    gte: bytewise.encode(['i', query.k.join(',')].concat(gte)),
    lte: bytewise.encode(['i', query.k.join(',')].concat(lte).concat([undefined]))
  }

jtremback · 2015-03-05T08:18:26Z

Ah, one last issue was solved by using stable-stringify on everything except for null and undefined.

dominictarr · 2015-03-05T09:05:27Z

@jtremback instead of doing query.k.join(',') you could just leave the query as a nested array - that way you completely avoid any problems of that being a weird order - and , sorts after most symbols !@#$^&() etc

jtremback · 2015-03-05T18:40:44Z

I'm actually not trying to sort on query.k, it is more of a namespacing thing. You can think of it as a sublevel for a particular index.

// Generate a range that retreives the documents requested by the query
function makeRange (query, level_opts) {
  // Avoid having to write queries with redundant array notation
  if (!Array.isArray(query.k)) { query.k = [ query.k ] }
  if (!Array.isArray(query.v)) { query.v = [ query.v ] }

  // Gathers values in query value field, generating gte - lte
  var acc = query.v.reduce(function (acc, item) {
    // Avoid having to write queries with redundant array notation
    if (!Array.isArray(item)) { item = [ item ] }
    // Push bottom of range (first array element) into gte
    acc.gte.push(esc(item[0]))
    // If it is not a range, use same value for lte, if it is use top of range
    acc.lte.push(esc(item.length > 1 ? item[1] : item[0]))

    return acc
  }, { gte: [], lte: [] })

  // Eliminate null values
  var lte = compact(acc.lte)
  var gte = compact(acc.gte)

  var range = {
    gte: bytewise.encode(['i', query.k.join(',')].concat(gte).concat([null])),
    lte: bytewise.encode(['i', query.k.join(',')].concat(lte).concat([undefined]))
  }

  if (query.reverse) { range.reverse = true }

  range = merge(level_opts || {}, range)

  return range
}

loveencounterflow · 2015-04-19T15:26:36Z

@dominictarr using \uffff is no better than using \xff—\uffff is still miles away from the topmost Unicode code point, which is at U+10ffff.

dominictarr · 2015-04-20T05:02:11Z

@loveencounterflow okay but how do you write that in JS?

> new Buffer('\u10ffff', 'utf8')
<Buffer e1 83 bf 66 66>
> new Buffer('\uffff', 'utf8')
<Buffer ef bf bf>
> new Buffer('\u1ffff', 'utf8')

\uffff is the highest I am able to create <Buffer ef bf bf>

deanlandolt · 2015-04-20T13:32:29Z

AFAIK \uffff is an invalid unicode code point, and you can't construct a valid code point that sorts lexicographically higher, so it's a perfectly reasonable high-key sentinel.

But the fact that you can still construct a legal js string that sorts higher (regardless of whether it's properly encoded) means you could still miss some string keys, so it'd be a lot cooler if the high (and low) key sentinels existed separate from the data types you're encoding -- and I've been working on in my recent refactoring work. You'll be able to reference boundary.min and boundary.max values on the different types (called sorts) that can be encoded, i.e.: https://github.com/deanlandolt/bytewise-core/blob/master/base.js#L108-L110

Still a WIP, but should be tightened up soon -- I need this to work smoothly for the query generation functionality in bytewise-uri. Oh, and since this wouldn't get utf8-encoded we could probably just use \xff appended to any string as a high-key sentinel, as it would sort higher than any encoded value (even \uffff).

loveencounterflow · 2015-04-21T14:15:52Z

@dominictarr "how do you write that in JS?"—your example is a bit on the wrong side since what you really intend to do in the first example is new Buffer( String.fromCodePoint( 0x10ffff ), 'utf-8' ), but what your code actually does is new Buffer( String.fromCodePoint( 0x10ff ) + 'ff', 'utf-8'). Gotta be mindful here with those code points, code units, hexadecimal numbers, unicode escapes, UTF-8 byte sequences, whatnot. \uffff is the highest reachable codepoint for those \uXXXX escapes; in older environments, it's also the highest reachable codepoint (1) without resorting to manually constructing surrogate pairs that (2) still encodes to valid UTF-8—NodeJS before some 0.8 or so version used to output a so-called CESU-8 sequence instead (that represents all single codepoints above 0xffff as two consecutive code units (i.e. surrogate pairs)).

'\uffff' is not good enough for me as an upper bound sentinel since i'm operating a lot on characters above '\uffff'. Since UTF-8 does preserve code point ordering, i'd end up with sequences like '中' ... '\uffff' ... '𠂝', which is wrong, I need '中' ... '𠂝' ... HI.

dominictarr · 2015-04-21T23:30:34Z

@loveencounterflow hmm... so my node version doesn't have String.fromCodePoint() it has from char code...
but that doesn't work... actually:

String.fromCharCode(0x10ffff) == '\uffff'
//=> true (!!!)

deanlandolt · 2015-04-21T23:31:18Z

@dominictarr iojs FTW

dominictarr · 2015-04-21T23:59:02Z

so is there any way to do this in joyent/node? or browsers for that matter...

looks like this needs to exist:
https://www.npmjs.com/package/from-code-point

dominictarr · 2015-04-22T00:03:01Z

oh no, looks like this one does it: https://www.npmjs.com/package/codepoint

deanlandolt · 2015-04-22T02:12:52Z

Nice find!

I'll get the next and previous methods in some time tomorrow.

loveencounterflow · 2015-04-22T13:47:50Z

fromCodePoint and codePointAt is a pair of relatively new methods introduced in ES6; they're accessed as String.fromCodePoint and ''.codePointAt, respectively; they are available in NodeJS 0.12 using the --harmony flag and in iojs even without flags. The MDN articles linked above have shims for those methods, so it should be easy to make them available anywhere. I'd say that unless there's a specific reason to use the older String.fromCharCode and ''.charCodeAt, these should always be avoided in favor for their more modern counterparts.

The complication that we're seeing here arises from the fact that when JavaScript was conceived in 1995, Brendan Eich was smart enough to embrace Unicode. Ironically, however, Unicode was at that point mere months away from going from a 16bit-codespace to a 32bit-codespace, and because of they way it was standardized, it was (seemingly permanently) doomed to deal with codepoints above 0xffff using the surrogate mechanism (which was initially intended as a makeshift mechanism to allow a smooth transition to The Real Thing. Before you swear: Python for one is caught in the same time loop except it's worse b/c there are so-called 'narrow' and 'wide' builds).

As an aggravating factor, some methods (and, hence, people's concepts) got confused about what a 'character' should be. Long story short: String.fromCharCode and ''.charCodeAt should really be named String.fromCodeUnit and ''.codeUnitAt, and, correspondingly, what '...'.length gives you is a count of code units in that string, not the count of code points (e.g. '中' (U+4e2d) is 1 character expressed as 1 code unit, hence '中' === 1; but '𠂝' (U+2009d) is 1 character expressed as 2 code units, hence '𠂝'.length === 2).

On the one hand, things have gotten much simpler ever since NodeJS switched from CESU-8 to proper UTF-8, but on the other hand, other platforms may not be as smart; I sadly have no details on that because i do very little JS-in-the-browser these days. That said, CESU-8 only differs in how it deals with code points above 0xffff (it encodes the two surrogate code points separately, resulting in 6 bytes where UTF-8 has 4; the good news is that even in CESU-8, code point ordering is still preserved).

@dominictarr "String.fromCharCode(0x10ffff) == '\uffff'"—I believe that is caused by fromCharaCode ony looking at the lower 16 bits of the number.

@deanlandolt "AFAIK \uffff is an invalid unicode code point"—precise terminology is needed. There are (1) numbers that are outside of what Unicode considers candidates for code point assignment; those candidates are restricted to integer numbers 0x00 <= cid <= 0x10ffff, so all numbers bigger than that, all negative and all non-integer numbers are not even legal in requests e.g. to render a given CID as UTF-8. (2) there are numbers that Unicode has permanently not assigned, (3) numbers that Unicode may assign in the future, and (4) numbers that Unicode has already assigned. \uffff belongs to category 2 (more precisely, to a special subgroup in that category).

deanlandolt · 2015-05-12T16:42:41Z

This is almost done -- api and rationale explained in this ticket: #17 (comment)

@jtremback let me know if the API sketched in that ticket addresses your use case.

jtremback · 2015-05-12T18:13:49Z

Hey- I'll be able to check this thoroughly tonight. FFWIW I made it work in my application with the old bytewise, but these changes look exciting!

EDIT: Thanks! I haven't run into any issues with null and undefined so far, but I'll switch my code over ASAP.

jtremback closed this as completed Mar 5, 2015

deanlandolt mentioned this issue Apr 20, 2015

How to formulate prefix-based queries / LevelDB lte-boundaries #17

Closed

deanlandolt reopened this Apr 21, 2015

deanlandolt closed this as completed May 12, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building ranges #16

Building ranges #16

jtremback commented Mar 5, 2015

deanlandolt commented Mar 5, 2015

dominictarr commented Mar 5, 2015

jtremback commented Mar 5, 2015

jtremback commented Mar 5, 2015

dominictarr commented Mar 5, 2015

jtremback commented Mar 5, 2015

loveencounterflow commented Apr 19, 2015

dominictarr commented Apr 20, 2015

deanlandolt commented Apr 20, 2015

loveencounterflow commented Apr 21, 2015

dominictarr commented Apr 21, 2015

deanlandolt commented Apr 21, 2015

dominictarr commented Apr 21, 2015

dominictarr commented Apr 22, 2015

deanlandolt commented Apr 22, 2015

loveencounterflow commented Apr 22, 2015

deanlandolt commented May 12, 2015

jtremback commented May 12, 2015

Building ranges #16

Building ranges #16

Comments

jtremback commented Mar 5, 2015

deanlandolt commented Mar 5, 2015

dominictarr commented Mar 5, 2015

jtremback commented Mar 5, 2015

jtremback commented Mar 5, 2015

dominictarr commented Mar 5, 2015

jtremback commented Mar 5, 2015

loveencounterflow commented Apr 19, 2015

dominictarr commented Apr 20, 2015

deanlandolt commented Apr 20, 2015

loveencounterflow commented Apr 21, 2015

dominictarr commented Apr 21, 2015

deanlandolt commented Apr 21, 2015

dominictarr commented Apr 21, 2015

dominictarr commented Apr 22, 2015

deanlandolt commented Apr 22, 2015

loveencounterflow commented Apr 22, 2015

deanlandolt commented May 12, 2015

jtremback commented May 12, 2015