Benchmarks

Overview

libvroom includes a comprehensive benchmark suite for measuring parsing performance across different dimensions: file sizes, column counts, thread scaling, and data types.

Comparative Performance: libvroom vs zsv

The following benchmarks compare libvroom against zsv, a high-performance CSV parser. Tests were run on Apple Silicon (M3 Max, 14 cores) using hyperfine for accurate measurements.

Single-Threaded Performance

libvroom’s SIMD-optimized parsing provides significant speedups over zsv in single-threaded mode:

File Size   libvroom Time   libvroom Throughput   zsv Time   zsv Throughput   Speedup
10 MB       3.2 ms          3.1 GB/s              12.7 ms    787 MB/s         4.0x
50 MB       12.6 ms         4.0 GB/s              54.3 ms    921 MB/s         4.3x
100 MB      22.9 ms         4.4 GB/s              105.2 ms   951 MB/s         4.6x
200 MB      42.4 ms         4.7 GB/s              206.1 ms   970 MB/s         4.9x

Key observations:

  • libvroom achieves 3-5 GB/s single-threaded throughput on Apple Silicon
  • Performance advantage increases with file size (4.0x at 10MB to 4.9x at 200MB)
  • Larger files benefit more from SIMD vectorization as memory bandwidth becomes the bottleneck
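The throughput and speedup columns follow directly from the measured times (throughput = size / time, speedup = t_zsv / t_libvroom); a quick sketch recomputing the 100 MB row:

```shell
# Recompute the 100 MB row: 22.9 ms (libvroom) vs 105.2 ms (zsv).
awk 'BEGIN {
    size_mb = 100; t_vroom_ms = 22.9; t_zsv_ms = 105.2
    printf "libvroom: %.1f GB/s\n", (size_mb / 1000) / (t_vroom_ms / 1000)
    printf "speedup:  %.1fx\n", t_zsv_ms / t_vroom_ms
}'
# libvroom: 4.4 GB/s
# speedup:  4.6x
```

Note that the table uses decimal units (1 GB = 1000 MB), which is why 100 MB in 22.9 ms rounds to 4.4 GB/s.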

Multi-Threaded Performance

Both parsers support multi-threaded parsing. Here’s how they compare at 100MB:

Parser     Configuration       Time       Throughput
libvroom   1 thread            22.9 ms    4.4 GB/s
libvroom   4 threads           18.2 ms    5.5 GB/s
libvroom   8 threads           18.0 ms    5.6 GB/s
zsv        1 thread            105.2 ms   951 MB/s
zsv        4 threads           31.3 ms    3.2 GB/s
zsv        --parallel (auto)   16.1 ms    6.2 GB/s

For multi-threaded workloads:

  • libvroom’s single-threaded performance often matches or exceeds multi-threaded zsv
  • zsv’s --parallel mode with auto-detection is highly optimized for its architecture
  • At 4 threads, libvroom is 1.7x faster than zsv (18.2 ms vs 31.3 ms)
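One convenient way to reproduce such a thread sweep is hyperfine's parameter scan. A sketch, where the file path is a placeholder and the guard makes it a no-op when the tools are missing:

```shell
# Sweep libvroom's thread count 1..8 over one file with hyperfine.
# FILE is a placeholder path; adjust to a real test file.
FILE="/tmp/libvroom_benchmark/test_100mb.csv"
if command -v hyperfine >/dev/null 2>&1 && [ -x ./build/vroom ] && [ -f "$FILE" ]; then
    hyperfine --warmup 2 \
        --parameter-scan threads 1 8 \
        "./build/vroom count -t {threads} $FILE"
else
    echo "skipping: needs hyperfine, ./build/vroom, and $FILE"
fi
```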

Performance Summary

Based on benchmarks run on Apple Silicon (M3 Max, 14 cores):

Benchmark                 Throughput   Notes
Simple CSV (1 thread)     ~734 MB/s    Small-file baseline
Many Rows (1 thread)      ~1.37 GB/s   Row-heavy workload
Wide Columns (1 thread)   ~1.69 GB/s   Column-heavy workload
100KB file                ~1.84 GB/s   Fits in L2/L3 cache
1MB file                  ~19.2 GB/s   Multi-threaded

File Size Scaling

Performance varies with file size due to cache hierarchy effects:

File Size   Throughput   Cache Level
1 KB        ~19 MB/s     L1 cache
10 KB       ~192 MB/s    L1/L2 cache
100 KB      ~1.84 GB/s   L2/L3 cache
1 MB        ~19.2 GB/s   L3/memory + threading

Larger files benefit significantly from multi-threaded parsing as data exceeds cache sizes.
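A size sweep like this can be reproduced by slicing one larger synthetic CSV down to each target size. A minimal sketch (the generator and paths are illustrative, not part of the benchmark suite; GNU `head` size suffixes assumed):

```shell
# Build one multi-MB synthetic CSV, then cut byte-limited slices per size tier.
TEMP_DIR=$(mktemp -d)
awk 'BEGIN { print "id,value,flag"; for (i = 0; i < 200000; i++) printf "%d,%d,%d\n", i, i * 3, i % 2 }' \
    > "$TEMP_DIR/base.csv"
for size in 1K 10K 100K 1M; do
    # The last line of each slice may be truncated mid-row, which is
    # acceptable for size-scaling experiments.
    head -c "$size" "$TEMP_DIR/base.csv" > "$TEMP_DIR/test_$size.csv"
    wc -c < "$TEMP_DIR/test_$size.csv"
done
```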

Thread Scaling at Large File Sizes

For large files (100MB+), multi-threading provides meaningful speedups:

100MB File

Threads   libvroom Time   Throughput   Scaling
1         22.9 ms         4.4 GB/s     1.00x
2         19.9 ms         5.0 GB/s     1.15x
4         18.2 ms         5.5 GB/s     1.26x
8         18.0 ms         5.6 GB/s     1.27x

200MB File

Threads   libvroom Time   Throughput   Scaling
1         42.4 ms         4.7 GB/s     1.00x
4         33.5 ms         6.0 GB/s     1.27x
8         32.3 ms         6.2 GB/s     1.31x
14        32.4 ms         6.2 GB/s     1.31x

Thread scaling observations:

  • Diminishing returns after 4 threads on this workload
  • Single-threaded performance is already very high due to SIMD optimization
  • Memory bandwidth becomes the limiting factor for large files
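The diminishing returns can be quantified as parallel efficiency, T1 / (N × TN); computed from the 200 MB table above (42.4 ms at 1 thread, 32.3 ms at 8 threads):

```shell
# Parallel speedup and efficiency for the 200 MB file at 8 threads.
awk 'BEGIN {
    t1 = 42.4; t8 = 32.3; n = 8
    printf "speedup:    %.2fx\n", t1 / t8
    printf "efficiency: %.0f%%\n", 100 * t1 / (n * t8)
}'
# speedup:    1.31x
# efficiency: 16%
```

An efficiency this low at 8 threads is consistent with a memory-bandwidth-bound workload rather than a compute-bound one.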

Running Benchmarks

Build and Run

# Build with Release optimizations
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run all benchmarks
./build/libvroom_benchmark

# Run specific benchmark categories
./build/libvroom_benchmark --benchmark_filter="BM_Parse.*"
./build/libvroom_benchmark --benchmark_filter="BM_FileSizes.*"

Output Formats

# Console output (default)
./build/libvroom_benchmark

# JSON output for analysis
./build/libvroom_benchmark --benchmark_format=json --benchmark_out=results.json

# CSV output
./build/libvroom_benchmark --benchmark_format=csv --benchmark_out=results.csv
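The `--benchmark_*` flags above follow Google Benchmark conventions, so the JSON output should contain a top-level `benchmarks` array with per-run fields such as `name`, `real_time`, and `time_unit` (an assumption worth verifying against your build). A sketch that tabulates times, run here against an inline sample instead of a real results.json:

```shell
# Parse benchmark JSON with python3; the sample stands in for results.json.
cat > /tmp/sample_results.json <<'EOF'
{"benchmarks": [
  {"name": "BM_Parse/threads:1", "real_time": 22.9, "time_unit": "ms"},
  {"name": "BM_Parse/threads:4", "real_time": 18.2, "time_unit": "ms"}
]}
EOF
python3 - <<'EOF'
import json

with open("/tmp/sample_results.json") as f:
    data = json.load(f)
for b in data["benchmarks"]:
    print(f'{b["name"]}: {b["real_time"]} {b["time_unit"]}')
EOF
```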

Automated Benchmarking

Use the benchmark script for comprehensive testing:

./benchmark/run_benchmarks.sh ./build/libvroom_benchmark

This runs all benchmark categories and generates reports in benchmark_results/.

Benchmark Categories

Basic Benchmarks

Tests fundamental parsing operations:

  • File parsing with different thread counts
  • Quoted field handling
  • Different separator types (comma, tab, semicolon, pipe)

Dimension Benchmarks

Tests scaling across dimensions:

  • File Sizes: 1KB to 100MB
  • Column Counts: 2 to 500 columns
  • Row Counts: 100 to 1M rows
  • Data Types: integers, floats, strings, mixed

Real-World Benchmarks

Tests with realistic data patterns:

  • NYC Taxi data (19 columns)
  • Financial time-series data
  • Server log files
  • Wide tables (100+ columns)
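When the real datasets are unavailable, workloads of a similar shape can be synthesized. A sketch that emits a 19-column numeric CSV loosely mimicking the taxi data's width (column names, row count, and values are made up):

```shell
# Generate a 19-column, 10,000-row numeric CSV as a stand-in workload.
awk 'BEGIN {
    ncols = 19; nrows = 10000
    for (c = 1; c <= ncols; c++) printf "col%d%s", c, (c < ncols ? "," : "\n")
    for (r = 1; r <= nrows; r++)
        for (c = 1; c <= ncols; c++)
            printf "%.2f%s", r * 0.5 + c, (c < ncols ? "," : "\n")
}' > /tmp/taxi_like.csv
wc -l /tmp/taxi_like.csv
```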

Real-World Dataset Results (Apple Silicon)

For realistic file I/O performance, see the comparative benchmarks above, which show 3-5 GB/s single-threaded throughput on actual files.

The internal benchmark suite measures in-memory parsing performance:

Dataset          Rows   Size     Parsing Time
Financial data   1K     68 KB    0.08 ms
Financial data   10K    684 KB   0.11 ms
Financial data   100K   6.8 MB   0.33 ms
Log data         1K     89 KB    0.06 ms
Log data         10K    900 KB   0.10 ms
Log data         100K   9.1 MB   0.38 ms

These in-memory benchmarks demonstrate parsing efficiency but don’t include file I/O overhead. For end-to-end performance with file I/O, use the CLI comparison benchmarks in benchmark/cli_comparison.sh.

SIMD Benchmarks

Tests SIMD effectiveness:

  • SIMD vs scalar comparison
  • Quote detection with varying densities
  • Memory access patterns

Performance Targets

Based on the production roadmap:

Metric                      Target                Status
Peak throughput (AVX2)      >5 GB/s               In progress
Peak throughput (AVX-512)   >8 GB/s               Planned
Thread scaling efficiency   >80% at 16 threads    Testing
Memory overhead             <10% over file size   Achieved

Comparison with Other Parsers

The benchmark suite includes comparison benchmarks against:

  • Naive (non-SIMD) parser implementation
  • Raw memory bandwidth baseline

To run comparison benchmarks:

./build/libvroom_benchmark --benchmark_filter="BM_Compare.*"
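A rough external stand-in for the raw memory-bandwidth baseline is re-reading a freshly written file with dd: since the re-read is served from the page cache, it approximates memory rather than disk bandwidth. A sketch (on macOS, dd expects `bs=1m`):

```shell
# Write a 64 MB scratch file, then time a sequential re-read.
# The re-read comes from the page cache, so this approximates memory
# bandwidth, an upper bound for any parser's throughput.
FILE="/tmp/libvroom_benchmark/raw_baseline.bin"
mkdir -p "$(dirname "$FILE")"
dd if=/dev/zero of="$FILE" bs=1M count=64 2>/dev/null
dd if="$FILE" of=/dev/null bs=1M 2>&1 | tail -1   # last line reports elapsed time and rate
rm -f "$FILE"
```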

CI Integration

Benchmarks can be integrated into CI pipelines:

# Run with regression detection
./benchmark/run_benchmarks.sh ./build/libvroom_benchmark results/ baseline.json 10.0

# Exit code indicates regression if threshold exceeded
if [ $? -ne 0 ]; then
    echo "Performance regression detected!"
    exit 1
fi

Platform Notes

x86-64 (Linux)

  • Full AVX2/AVX-512 support
  • RAPL energy measurements available
  • Best for comprehensive benchmarking
  • Expected throughput: 5+ GB/s (AVX2), 8+ GB/s (AVX-512)

ARM64 (macOS Apple Silicon)

  • NEON vectorization (128-bit vectors)
  • Excellent single-core performance: 3-5 GB/s typical
  • M-series chips benefit from high memory bandwidth and efficient caches
  • Energy measurements via power estimates
  • Tested on M3 Max (14 cores): 4.7 GB/s single-threaded, 6.2 GB/s multi-threaded

ARM64 (Linux)

  • NEON vectorization
  • Performance counters available
  • Expected throughput comparable to Apple Silicon with similar memory bandwidth

Reproducing These Benchmarks

To reproduce the comparative benchmarks on your system:

# Install dependencies (macOS)
brew install zsv hyperfine

# Build libvroom
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Generate test files (one simple option shown here: a synthetic numeric
# CSV of roughly 100 MB; adjust the row count for other target sizes)
TEMP_DIR="/tmp/libvroom_benchmark"
mkdir -p "$TEMP_DIR"
awk 'BEGIN { print "a,b,c,d"; for (i = 0; i < 2800000; i++) printf "%d,%d,%d,%d\n", i, i * 2, i * 3, i * 7 }' \
  > "$TEMP_DIR/test_100mb.csv"

# Run comparative benchmark (example: 100MB)
hyperfine --warmup 2 \
  "./build/vroom count -t 1 $TEMP_DIR/test_100mb.csv" \
  "zsv count $TEMP_DIR/test_100mb.csv"

The benchmark/cli_comparison.sh script automates comprehensive comparisons.