simdjson talk
2022-03-26 ยท 4 min read
"Data Engineering at the Speed of Your Disk" - Daniel Lemire - 2020
- Modern disks are pretty fast
- PCIe 4 disks: 5 GB/s sequential read
- Modern datacenter VPC network real fucking fast
- 50 Gb/s (6.25 GiB/s) and better
- If we can't eat data at many GB/s, we'll literally be CPU-bound reading from disk or network!
- How fast can we allocate memory?
// almost instant (just virtual memory)
let mut buf = Vec::with_capacity(size);
// actually instantiate physical pages
for i in (0..buf.len()).step_by(page_size) {
buf[i] = 0;
}
3.5 GB/s (Linux, Skylake 3.4 GHz, 4 KiB pages)
- How fast can we, e.g., remove spaces from a string?
- Assume 1% space density (minified json)
for i in 0..buf.len() {
if buf[i] > 32 /* space as decimal = 32 */ {
buf[i + 1] = buf[i];
}
}
1.6 GB/s
- Just working byte-by-byte will already have limit of ~3.5 GB/s b/c our processor is 3.5 GHz.
- We're already slower than disk if we work byte-by-byte.
- so how are we going to go faster? SIMD : )
AVX-512 instructions can operate on an entire cache-line at once : 0
Working with SIMD
(1) use SIMD intrinsics. produces very disgusting code though.
(2) pray / massage code so compiler auto-vectorizes : )
despacer with SIMD
// filled with spaces
__m128i spaces = _mm_set1_epi8(' ');
for (i = 0; i + 15 < size; i += 16) {
// load next chunk
__m128i x = _mm_loadu_si128(bytes + i);
// set a 1 byte anywhere we have an x byte <= 32
__m128i anywhite = _mm_cmpeq_epi8(spaces, _mm_max_epu8(spaces, x));
// condense anywhite mask into a packed u64 mask
uint64_t mask16 = _mm_movemask_epi8(anywhite);
// basically a pdep, but deposit the non-space bytes at the front
x = _mm_shuffle_epi8(x, despace_mask16[mask16 & 0x7fff]);
// store in-place
_mm_storeu_si128(bytes + pos, x);
// increment our in-place output pointer only by the # non-spaces.
pos += 16 - mask16.count_ones();
}
base64 encode / decode lemire/fastbase64
- can encode w/ SIMD in only 3 instructions : 0 (mod loads/stores)
- decoding closer to 7 instructions
- once your data is not just in L1/L2 cache, this is just as fast as a memcpy
utf-8 encode / decode
byte-by-byte approach only 300 MB/s D :
if b1 < 0x80 {
return true; // ascii
}
if b1 < 0xe0 {
if b1 < 0xc2 || b2 > 0xbf {
return false;
}
} else if b1 < 0xf0 {
// 3 byte form
if b2 > 0xbf || (b1 == 0xe0 && b2 < 0xa0) || (b1 == 0xed && b2 >= 0xa0) {
// ..
} else {
// ..
}
} else {
// 4 byte form
// ..
}
using SIMD: 8 GB/s
using 32-byte chunks, ~20 instructions. branch-less
JSON for Modern C++ library (nlohmann-json)
0.1 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
rapidjson
0.3 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
both of these libs nowhere near disk/network speed
what is our "roofline" benchmark?
let mut all_line_lens = 0;
for line in reader.lines() {
all_line_lens += line.len();
}
1.4 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
but you can actually go faster!
simdjson
2.5 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
some of the tricks used in simdjson
Find the span of the string
- Given a mask containing quotes, return a mask set to 1 in the string span regions.
__1________1________1___1
quotes___11111111__________111_
string regions
mask = quote xor (quote << 1);
mask = mask xor (mask << 2);
mask = mask xor (mask << 4);
mask = mask xor (mask << 8);
// ..
string span can actually be done in one instruction : 0
Number parsing is expensive
strtod
- 90 MB/s
- 38 cycles / byte
- 10 branch misses / float
- yikes!
SIMD approach
fn contains_8_digits(buf: &[u8; 8]) -> bool {
let val = u64::from_ne_bytes(buf);
( (val & 0xf0f0_f0f0_f0f0_f0f0) |
(((val + 0x0606_0606_0606_0606) & 0xf0f0_f0f0_f0f0_f0f0) >> 4)
) == 0x3333_3333_3333_3333
}
fn parse_8_digits_unrolled(buf: &[u8; 8]) -> u32 {
let mut val = u64::from_ne_bytes(buf);
val = ((val & 0x0f0f_0f0f_0f0f_0f0f) * 2561) >> 8;
val = ((val & 0x00ff_00ff_00ff_00ff) * 6553601) >> 16;
((val & 0x0000_ffff_0000_ffff) * 42949672960001) >> 32
}