Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building ranges #16

Closed
jtremback opened this issue Mar 5, 2015 · 18 comments
Closed

Building ranges #16

jtremback opened this issue Mar 5, 2015 · 18 comments

Comments

@jtremback
Copy link

Hi, I'm replacing making indexes by concatenating values with ÿ and using an array encoded with bytewise instead. In my current code, when I retrieve a range, I make it like this:

  var range = {
    // ÿiÿ identifies an index doc
    // esc(query.k.join(',')) makes an identifier for the index
    // gte/lte.join('ÿ') joins the ranges with the delimiter
    gte: 'ÿiÿ' + query.k.join(',') + 'ÿ' + gte.join('ÿ') + 'ÿ',
    lte: 'ÿiÿ' + query.k.join(',') + 'ÿ' + lte.join('ÿ') + 'ÿÿ'
  }

At the end of the lte, you can see 'ÿÿ', which ensures that the range is inclusive. How should I handle this with bytewise? Concat an extra '\xf0' on the end of the lte array?

@deanlandolt
Copy link
Owner

I'd reserved '\xff' ('ÿ') specifically to act as a high key sentinel for this use case, and I've been planning to add an API for encoding queries to make this easy. In the meantime you should be able to get the desired behavior by appending undefined to your array (which, as you seem to have already noticed, will serialized to '\xf0' -- the highest type tag accessible in the standard encoding scheme (for now, the only encoding scheme, until the above-mentioned query encoding API is added).

Just to be clear, you'd concat undefined to the array, not '\xf0', which would be encoded just like any other string, prefixed with type tag '\x70'. This would come before complex objects like arrays and objects in the sort order, so it's probably not what you're looking for. The same goes for Buffer([ 255 ]), which comes even before strings in the sort order.

One caveat to this undefined last-element approach: it falls down if array values in your index are allowed to contain undefined elements (i.e. sparse arrays). In this case, you could always manually munge the final undefined byte that was added in the encoded buffer value. Or you could nag me for that query API -- having a practical use case would be good motivation!

@dominictarr
Copy link
Collaborator

@jtremback don't use \xff, that was a mistake, and if you have unicode characters, it may sort lower than them, creating unexpected results. Instead use \uffff, or even better, encode your keys as arrays with bytewise and use null/undefined to sort them as @deanlandolt suggests.

@jtremback
Copy link
Author

Thanks! I had been attempting to concat null on the end. Switched it to undefined and my tests pass.

  var range = {
    gte: bytewise.encode(['i', query.k.join(',')].concat(gte)),
    lte: bytewise.encode(['i', query.k.join(',')].concat(lte).concat([undefined]))
  }

@jtremback
Copy link
Author

Ah, one last issue was solved by using stable-stringify on everything except for null and undefined.

@dominictarr
Copy link
Collaborator

@jtremback instead of doing query.k.join(',') you could just leave the query as a nested array - that way you completely avoid any problems of that being a weird order - and , sorts after most symbols !@#$^&() etc

@jtremback
Copy link
Author

I'm actually not trying to sort on query.k, it is more of a namespacing thing. You can think of it as a sublevel for a particular index.

// Generate a range that retreives the documents requested by the query
function makeRange (query, level_opts) {
  // Avoid having to write queries with redundant array notation
  if (!Array.isArray(query.k)) { query.k = [ query.k ] }
  if (!Array.isArray(query.v)) { query.v = [ query.v ] }

  // Gathers values in query value field, generating gte - lte
  var acc = query.v.reduce(function (acc, item) {
    // Avoid having to write queries with redundant array notation
    if (!Array.isArray(item)) { item = [ item ] }
    // Push bottom of range (first array element) into gte
    acc.gte.push(esc(item[0]))
    // If it is not a range, use same value for lte, if it is use top of range
    acc.lte.push(esc(item.length > 1 ? item[1] : item[0]))

    return acc
  }, { gte: [], lte: [] })

  // Eliminate null values
  var lte = compact(acc.lte)
  var gte = compact(acc.gte)

  var range = {
    gte: bytewise.encode(['i', query.k.join(',')].concat(gte).concat([null])),
    lte: bytewise.encode(['i', query.k.join(',')].concat(lte).concat([undefined]))
  }

  if (query.reverse) { range.reverse = true }

  range = merge(level_opts || {}, range)

  return range
}

@loveencounterflow
Copy link

@dominictarr using \uffff is no better than using \xff\uffff is still miles away from the topmost Unicode code point, which is at U+10ffff.

@dominictarr
Copy link
Collaborator

@loveencounterflow okay but how do you write that in JS?

> new Buffer('\u10ffff', 'utf8')
<Buffer e1 83 bf 66 66>
> new Buffer('\uffff', 'utf8')
<Buffer ef bf bf>
> new Buffer('\u1ffff', 'utf8')

\uffff is the highest I am able to create <Buffer ef bf bf>

@deanlandolt
Copy link
Owner

AFAIK \uffff is an invalid unicode code point, and you can't construct a valid code point that sorts lexicographically higher, so it's a perfectly reasonable high-key sentinel.

But the fact that you can still construct a legal js string that sorts higher (regardless of whether it's properly encoded) means you could still miss some string keys, so it'd be a lot cooler if the high (and low) key sentinels existed separate from the data types you're encoding -- and I've been working on in my recent refactoring work. You'll be able to reference boundary.min and boundary.max values on the different types (called sorts) that can be encoded, i.e.: https://github.com/deanlandolt/bytewise-core/blob/master/base.js#L108-L110

Still a WIP, but should be tightened up soon -- I need this to work smoothly for the query generation functionality in bytewise-uri. Oh, and since this wouldn't get utf8-encoded we could probably just use \xff appended to any string as a high-key sentinel, as it would sort higher than any encoded value (even \uffff).

@loveencounterflow
Copy link

@dominictarr "how do you write that in JS?"—your example is a bit on the wrong side since what you really intend to do in the first example is new Buffer( String.fromCodePoint( 0x10ffff ), 'utf-8' ), but what your code actually does is new Buffer( String.fromCodePoint( 0x10ff ) + 'ff', 'utf-8'). Gotta be mindful here with those code points, code units, hexadecimal numbers, unicode escapes, UTF-8 byte sequences, whatnot. \uffff is the highest reachable codepoint for those \uXXXX escapes; in older environments, it's also the highest reachable codepoint (1) without resorting to manually constructing surrogate pairs that (2) still encodes to valid UTF-8—NodeJS before some 0.8 or so version used to output a so-called CESU-8 sequence instead (that represents all single codepoints above 0xffff as two consecutive code units (i.e. surrogate pairs)).

'\uffff' is not good enough for me as an upper bound sentinel since i'm operating a lot on characters above '\uffff'. Since UTF-8 does preserve code point ordering, i'd end up with sequences like '中' ... '\uffff' ... '𠂝', which is wrong, I need '中' ... '𠂝' ... HI.

@dominictarr
Copy link
Collaborator

@loveencounterflow hmm... so my node version doesn't have String.fromCodePoint() it has from char code...
but that doesn't work... actually:

String.fromCharCode(0x10ffff) == '\uffff'
//=> true (!!!)

@deanlandolt
Copy link
Owner

@dominictarr iojs FTW

@deanlandolt deanlandolt reopened this Apr 21, 2015
@dominictarr
Copy link
Collaborator

so is there any way to do this in joyent/node? or browsers for that matter...

looks like this needs to exist:
https://www.npmjs.com/package/from-code-point

@dominictarr
Copy link
Collaborator

oh no, looks like this one does it: https://www.npmjs.com/package/codepoint

@deanlandolt
Copy link
Owner

Nice find!

I'll get the next and previous methods in some time tomorrow.

@loveencounterflow
Copy link

fromCodePoint and codePointAt is a pair of relatively new methods introduced in ES6; they're accessed as String.fromCodePoint and ''.codePointAt, respectively; they are available in NodeJS 0.12 using the --harmony flag and in iojs even without flags. The MDN articles linked above have shims for those methods, so it should be easy to make them available anywhere. I'd say that unless there's a specific reason to use the older String.fromCharCode and ''.charCodeAt, these should always be avoided in favor for their more modern counterparts.

The complication that we're seeing here arises from the fact that when JavaScript was conceived in 1995, Brendan Eich was smart enough to embrace Unicode. Ironically, however, Unicode was at that point mere months away from going from a 16bit-codespace to a 32bit-codespace, and because of they way it was standardized, it was (seemingly permanently) doomed to deal with codepoints above 0xffff using the surrogate mechanism (which was initially intended as a makeshift mechanism to allow a smooth transition to The Real Thing. Before you swear: Python for one is caught in the same time loop except it's worse b/c there are so-called 'narrow' and 'wide' builds).

As an aggravating factor, some methods (and, hence, people's concepts) got confused about what a 'character' should be. Long story short: String.fromCharCode and ''.charCodeAt should really be named String.fromCodeUnit and ''.codeUnitAt, and, correspondingly, what '...'.length gives you is a count of code units in that string, not the count of code points (e.g. '中' (U+4e2d) is 1 character expressed as 1 code unit, hence '中' === 1; but '𠂝' (U+2009d) is 1 character expressed as 2 code units, hence '𠂝'.length === 2).

On the one hand, things have gotten much simpler ever since NodeJS switched from CESU-8 to proper UTF-8, but on the other hand, other platforms may not be as smart; I sadly have no details on that because i do very little JS-in-the-browser these days. That said, CESU-8 only differs in how it deals with code points above 0xffff (it encodes the two surrogate code points separately, resulting in 6 bytes where UTF-8 has 4; the good news is that even in CESU-8, code point ordering is still preserved).

@dominictarr "String.fromCharCode(0x10ffff) == '\uffff'"—I believe that is caused by fromCharaCode ony looking at the lower 16 bits of the number.

@deanlandolt "AFAIK \uffff is an invalid unicode code point"—precise terminology is needed. There are (1) numbers that are outside of what Unicode considers candidates for code point assignment; those candidates are restricted to integer numbers 0x00 <= cid <= 0x10ffff, so all numbers bigger than that, all negative and all non-integer numbers are not even legal in requests e.g. to render a given CID as UTF-8. (2) there are numbers that Unicode has permanently not assigned, (3) numbers that Unicode may assign in the future, and (4) numbers that Unicode has already assigned. \uffff belongs to category 2 (more precisely, to a special subgroup in that category).

@deanlandolt
Copy link
Owner

This is almost done -- api and rationale explained in this ticket: #17 (comment)

@jtremback let me know if the API sketched in that ticket addresses your use case.

@jtremback
Copy link
Author

Hey- I'll be able to check this thoroughly tonight. FFWIW I made it work in my application with the old bytewise, but these changes look exciting!

EDIT: Thanks! I haven't run into any issues with null and undefined so far, but I'll switch my code over ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants