# simdjson talk

"Data Engineering at the Speed of Your Disk" - Daniel Lemire - 2020

• Modern disks are pretty fast
• PCIe 4 disks: 5 GB/s sequential read
• Modern datacenter VPC network real fucking fast
• 50 Gb/s (6.25 GiB/s) and better
• If we can't eat data at many GB/s, we'll literally be CPU-bound reading from disk or network!
• How fast can we allocate memory?
// almost instant (just virtual memory)
let mut buf = Vec::with_capacity(size);

// actually instantiate physical pages
for i in (0..buf.len()).step_by(page_size) {
buf[i] = 0;
}


3.5 GB/s (Linux, Skylake 3.4 GHz, 4 KiB pages)

• How fast can we, e.g., remove spaces from a string?
• Assume 1% space density (minified json)
for i in 0..buf.len() {
if buf[i] > 32 /* space as decimal = 32 */ {
buf[i + 1] = buf[i];
}
}


1.6 GB/s

• Just working byte-by-byte will already have limit of ~3.5 GB/s b/c our processor is 3.5 GHz.
• We're already slower than disk if we work byte-by-byte.
• so how are we going to go faster? SIMD : )

AVX-512 instructions can operate on an entire cache-line at once : 0

• Working with SIMD

• (1) use SIMD intrinsics. produces very disgusting code though.

• (2) pray / massage code so compiler auto-vectorizes : )

• despacer with SIMD

// filled with spaces
__m128i spaces = _mm_set1_epi8(' ');

for (i = 0; i + 15 < size; i += 16) {
__m128i x = _mm_loadu_si128(bytes + i);

// set a 1 byte anywhere we have an x byte <= 32
__m128i anywhite = _mm_cmpeq_epi8(spaces, _mm_max_epu8(spaces, x));

// basically a pdep, but deposit the non-space bytes at the front

// store in-place
_mm_storeu_si128(bytes + pos, x);
// increment our in-place output pointer only by the # non-spaces.
}

• base64 encode / decode lemire/fastbase64

• can encode w/ SIMD in only 3 instructions : 0 (mod loads/stores)
• decoding closer to 7 instructions
• once your data is not just in L1/L2 cache, this is just as fast as a memcpy
• utf-8 encode / decode

• byte-by-byte approach only 300 MB/s D :

if b1 < 0x80 {
return true; // ascii
}
if b1 < 0xe0 {
if b1 < 0xc2 || b2 > 0xbf {
return false;
}
} else if b1 < 0xf0 {
// 3 byte form
if b2 > 0xbf || (b1 == 0xe0 && b2 < 0xa0) || (b1 == 0xed && b2 >= 0xa0) {
// ..
} else {
// ..
}
} else {
// 4 byte form
// ..
}

• using SIMD: 8 GB/s

• using 32-byte chunks, ~20 instructions. branch-less

• https://github.com/lemire/fastvalidate-utf-8

• JSON for Modern C++ library (nlohmann-json)

• 0.1 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

• rapidjson

• 0.3 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

• both of these libs nowhere near disk/network speed

• what is our "roofline" benchmark?

let mut all_line_lens = 0;
all_line_lens += line.len();
}

• 1.4 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

• but you can actually go faster!

• simdjson

• 2.5 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)

• some of the tricks used in simdjson

• Find the span of the string

• Given a mask containing quotes, return a mask set to 1 in the string span regions.
• __1________1________1___1 quotes
• ___11111111__________111_ string regions
mask = quote xor (quote << 1);
// ..

• string span can actually be done in one instruction : 0

• Number parsing is expensive

• strtod
• 90 MB/s
• 38 cycles / byte
• 10 branch misses / float
• yikes!
• SIMD approach

fn contains_8_digits(buf: &[u8; 8]) -> bool {
let val = u64::from_ne_bytes(buf);
(   (val & 0xf0f0_f0f0_f0f0_f0f0) |
(((val + 0x0606_0606_0606_0606) & 0xf0f0_f0f0_f0f0_f0f0) >> 4)
) == 0x3333_3333_3333_3333
}

fn parse_8_digits_unrolled(buf: &[u8; 8]) -> u32 {
let mut val = u64::from_ne_bytes(buf);
val = ((val & 0x0f0f_0f0f_0f0f_0f0f) * 2561) >> 8;
val = ((val & 0x00ff_00ff_00ff_00ff) * 6553601) >> 16;
((val & 0x0000_ffff_0000_ffff) * 42949672960001) >> 32
}


https://github.com/lemire/fast-float-rust