wip,src: add utf8 consumer/validator by chrisdickinson · Pull Request #1319 · nodejs/node

chrisdickinson · 2015-04-01T20:46:22Z

WIP. Opening this now to double check that this is headed in the right direction.

This is based on utf-8-validate. The primary differences are the use of clz (vs. a lookup table) to compute the number of extra bytes and the introduction of the error/glyph callbacks.
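As an illustration of the idea (not the code in this PR), the number of extra bytes can be read off the count of leading 1 bits in the lead byte, which is what a clz on the inverted byte yields. The sketch below uses a portable loop instead of an intrinsic; the names `CountLeadingOnes` and `ExtraBytesForLead` are hypothetical.

```cpp
#include <cstdint>

// Count the leading 1 bits of a byte with a portable loop.
// A clz intrinsic applied to the inverted byte gives the same number.
inline unsigned CountLeadingOnes(uint8_t byte) {
  unsigned n = 0;
  while (n < 8 && (byte & (0x80u >> n)) != 0) ++n;
  return n;
}

// Number of continuation bytes announced by a lead byte, or -1 if the byte
// cannot start a sequence. This only inspects the bit pattern; it does not
// reject overlong leads such as 0xC0/0xC1 or leads above 0xF4.
inline int ExtraBytesForLead(uint8_t lead) {
  switch (CountLeadingOnes(lead)) {
    case 0: return 0;    // 0xxxxxxx: ASCII
    case 2: return 1;    // 110xxxxx
    case 3: return 2;    // 1110xxxx
    case 4: return 3;    // 11110xxx
    default: return -1;  // 10xxxxxx (continuation) or 5+ leading ones
  }
}
```

A lookup table indexed by the lead byte would return the same values; the trade-off is table memory versus a few bit operations.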

The Utf8Consume function iterates over valid glyphs, calling a provided OnGlyph callback for each one and an OnError callback when it hits invalid input. The provided error strategies are "Halt" and "Skip": Halt stops the consumer at the first error, and Skip skips past errors and continues.

The strategies were added to accommodate the desire to build a utf8-to-utf16 translator as part of this work.

bnoordhuis · 2015-04-01T20:56:40Z

src/util.cc

Style: no C-style casts (here and elsewhere.)

Fixed. Sorry about that, missed while porting.

bnoordhuis · 2015-04-01T21:12:58Z

src/util.cc

0xC2? Shouldn't that be 0xC0?

Hm. I'll look into that – the original version has it as 0xC2.
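For context on why a validator might use 0xC2 here: the lead bytes 0xC0 and 0xC1 can only reach code points below 0x80, which already fit in one byte, so any sequence starting with them is an overlong encoding. The small sketch below shows the arithmetic; the helper names are hypothetical and the code under review is not shown in this thread.

```cpp
#include <cstdint>

// Decode a two-byte sequence without any validation, to show which code
// points each lead byte can reach.
inline uint32_t DecodeTwoBytesUnchecked(uint8_t lead, uint8_t cont) {
  return (static_cast<uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}

// A two-byte sequence is overlong when the decoded value would have fit in
// a single byte.
inline bool IsOverlongTwoByte(uint8_t lead, uint8_t cont) {
  return DecodeTwoBytesUnchecked(lead, cont) < 0x80;
}
```

The largest value reachable from 0xC1 is 0x7F (with continuation byte 0xBF), while 0xC2 0x80 decodes to 0x80, the first code point that actually needs two bytes.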

piscisaureus · 2015-04-01T21:30:43Z

The msvc "equivalent" is _BitScanReverse. https://github.com/libuv/libuv/blob/3346082132dc2ff809dfd5d25d451c0b33905f53/src/win/tty.c#L1442

Edit: you already knew :) sorry

bnoordhuis · 2015-04-01T21:32:24Z

src/util.cc

I think what you want here can be implemented portably and reasonably efficiently using the following:

    inline uint32_t log2(uint8_t v) {
      const uint32_t r = (v > 15) << 2;
      v >>= r;
      const uint32_t s = (v > 3) << 1;
      v >>= s;
      v >>= 1;
      return r | s | v;
    }

    inline uint32_t clz(uint8_t v) {
      // clz(0) == 7.  Add a zero check if that's an issue.
      return 7 - log2(v);
    }

Forgot to mention, the behavior of static_cast<int>(...) << 24 is implementation-defined on platforms where ints are 32 bits (i.e. all of them.) You're not allowed to shift values into the sign bit.
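One common way to sidestep the issue is to do the shift on an unsigned type. The sketch below assumes the goal was to place a byte in the top 8 bits of a 32-bit word (for example, before handing it to a 32-bit clz); the helper name is hypothetical.

```cpp
#include <cstdint>

// Move a byte into the top 8 bits of a 32-bit word. Casting to uint32_t
// first keeps the shift in unsigned arithmetic, where the result is well
// defined even when the top bit of the byte is set.
inline uint32_t ByteToHighBits(uint8_t byte) {
  return static_cast<uint32_t>(byte) << 24;
}
```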

bnoordhuis · 2015-04-01T21:48:44Z

src/util.cc

Shouldn't this mask off the high bits? Also, when is extrabytes == 5 for valid UTF-8?

bnoordhuis · 2015-04-01T21:57:10Z

src/util.cc

Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

> Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Compilers are pretty good at replacing idiomatic range comparisons with bit hacks. Have you checked whether it's really necessary to have these ... ?

Har, har.

(But it's true that both clang and gcc manage to pull it off.)
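For readers following along, here is a sketch of the two styles being compared, using the UTF-16 surrogate range U+D800..U+DFFF as the example. The exact constants in the comment above depend on how `glyph` is held in the code under review, which is not shown here, so this sketch uses a full-width mask on a 32-bit code point instead. Both function names are hypothetical.

```cpp
#include <cstdint>

// Idiomatic range comparison.
inline bool IsSurrogateRange(uint32_t cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// Equivalent bit test: every surrogate has the bit pattern
// 1101 1xxx xxxx xxxx with all higher bits clear.
inline bool IsSurrogateMask(uint32_t cp) {
  return (cp & 0xFFFFF800u) == 0xD800u;
}
```

The two functions agree for every code point, so the choice comes down to readability unless a benchmark says otherwise.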

jbergstroem · 2015-04-02T00:26:52Z

Now that #1199 landed -- could we perhaps add unit tests for this?

chrisdickinson · 2015-04-02T03:04:54Z

Nixed clz support, since a quick benchmark showed that @bnoordhuis' implementation was faster.

bnoordhuis · 2015-04-02T22:41:30Z

src/util.cc

Style issue: the first argument goes on the same line as the function name and the other arguments should line up below it. The only time that's deviated from is when the 80 column limit is exceeded.

bnoordhuis · 2015-04-02T23:13:20Z

I second @jbergstroem's sentiment on unit tests. :-)

petkaantonov · 2015-04-03T18:36:30Z

Also, it's always interesting to run the UTF-8 decoder stress test against a UTF-8 decoder.

wip,src: add utf8 consumer/validator

58c62a2

bnoordhuis reviewed Apr 1, 2015

nix C-style casts

e7918fe

bnoordhuis reviewed Apr 1, 2015

chrisdickinson added 2 commits April 1, 2015 14:42

style fixes, switch from char* to uint8_t*

09cd52f

switch remaining functions from static to inline

5cb6d3d

bnoordhuis reviewed Apr 1, 2015

mscdex added the c++ label Apr 1, 2015

bnoordhuis reviewed Apr 1, 2015

chrisdickinson added 2 commits April 1, 2015 19:11

review fixes

7cd5465

nix clz – turns out @bnoordhuis' implementation is faster 🏁

e4bc82f

bnoordhuis reviewed Apr 2, 2015

10 participants