Benchmarks
Overview
libvroom includes a comprehensive benchmark suite for measuring parsing performance across different dimensions: file sizes, column counts, thread scaling, and data types.
Comparative Performance: libvroom vs zsv
The following benchmarks compare libvroom against zsv, a high-performance CSV parser. Tests were run on Apple Silicon (M3 Max, 14 cores) using hyperfine for accurate measurements.
Single-Threaded Performance
libvroom’s SIMD-optimized parsing is 4.0x to 4.9x faster than zsv in single-threaded mode:
| File Size | libvroom Time | libvroom Throughput | zsv Time | zsv Throughput | Speedup |
|---|---|---|---|---|---|
| 10 MB | 3.2 ms | 3.1 GB/s | 12.7 ms | 787 MB/s | 4.0x |
| 50 MB | 12.6 ms | 4.0 GB/s | 54.3 ms | 921 MB/s | 4.3x |
| 100 MB | 22.9 ms | 4.4 GB/s | 105.2 ms | 951 MB/s | 4.6x |
| 200 MB | 42.4 ms | 4.7 GB/s | 206.1 ms | 970 MB/s | 4.9x |
Key observations:
- libvroom achieves 3-5 GB/s single-threaded throughput on Apple Silicon
- Performance advantage increases with file size (4.0x at 10MB to 4.9x at 200MB)
- Larger files benefit more from SIMD vectorization as memory bandwidth becomes the bottleneck
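The throughput and speedup columns follow directly from the file sizes and times. As a quick sanity check, the sketch below recomputes two of the table values with awk. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which is what the published numbers are consistent with. The helper names throughput_gbps and speedup are illustrative and not part of libvroom.

```shell
#!/bin/sh
# Illustrative helpers (not part of libvroom).

# throughput_gbps SIZE_MB TIME_MS -> throughput in GB/s,
# assuming 1 MB = 10^6 bytes and 1 GB = 10^9 bytes.
throughput_gbps() {
  LC_ALL=C awk -v mb="$1" -v ms="$2" \
    'BEGIN { printf "%.1f\n", (mb * 1e6) / (ms / 1e3) / 1e9 }'
}

# speedup BASELINE_MS CANDIDATE_MS -> how many times faster the candidate is.
speedup() {
  LC_ALL=C awk -v base="$1" -v cand="$2" \
    'BEGIN { printf "%.1f\n", base / cand }'
}

throughput_gbps 100 22.9   # libvroom, 100 MB row: prints 4.4
speedup 105.2 22.9         # zsv time / libvroom time: prints 4.6
```

The same two functions can be applied to timings measured on your own hardware to produce comparable rows.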
Multi-Threaded Performance
Both parsers support multi-threaded parsing. Here’s how they compare at 100MB:
| Parser | Configuration | Time | Throughput |
|---|---|---|---|
| libvroom | 1 thread | 22.9 ms | 4.4 GB/s |
| libvroom | 4 threads | 18.2 ms | 5.5 GB/s |
| libvroom | 8 threads | 18.0 ms | 5.6 GB/s |
| zsv | 1 thread | 105.2 ms | 951 MB/s |
| zsv | 4 threads | 31.3 ms | 3.2 GB/s |
| zsv | --parallel (auto) | 16.1 ms | 6.2 GB/s |
For multi-threaded workloads:
- libvroom’s single-threaded time (22.9 ms) beats zsv at 4 threads (31.3 ms), although zsv’s --parallel auto mode posts the fastest time in this table (16.1 ms)
- zsv’s --parallel mode with auto-detection is highly optimized for its architecture
- At 4 threads, libvroom is 1.7x faster than zsv (18.2 ms vs 31.3 ms)
Performance Summary
Based on benchmarks run on Apple Silicon (M3 Max, 14 cores):
| Benchmark | Throughput | Notes |
|---|---|---|
| Simple CSV (1 thread) | ~734 MB/s | Small file baseline |
| Many Rows (1 thread) | ~1.37 GB/s | Row-heavy workload |
| Wide Columns (1 thread) | ~1.69 GB/s | Column-heavy workload |
| 100KB file | ~1.84 GB/s | Fits in L2/L3 cache |
| 1MB file | ~19.2 GB/s | Multi-threaded |
File Size Scaling
Performance varies with file size due to cache hierarchy effects:
| File Size | Throughput | Cache Level |
|---|---|---|
| 1 KB | ~19 MB/s | L1 cache |
| 10 KB | ~192 MB/s | L1/L2 cache |
| 100 KB | ~1.84 GB/s | L2/L3 cache |
| 1 MB | ~19.2 GB/s | L3/memory + threading |
Larger files benefit significantly from multi-threaded parsing as data exceeds cache sizes.
Thread Scaling at Large File Sizes
For large files (100MB+), multi-threading adds a modest speedup of up to about 1.3x over the single-threaded time:
100MB File
| Threads | libvroom Time | Throughput | Scaling |
|---|---|---|---|
| 1 | 22.9 ms | 4.4 GB/s | 1.00x |
| 2 | 19.9 ms | 5.0 GB/s | 1.15x |
| 4 | 18.2 ms | 5.5 GB/s | 1.26x |
| 8 | 18.0 ms | 5.6 GB/s | 1.27x |
200MB File
| Threads | libvroom Time | Throughput | Scaling |
|---|---|---|---|
| 1 | 42.4 ms | 4.7 GB/s | 1.00x |
| 4 | 33.5 ms | 6.0 GB/s | 1.27x |
| 8 | 32.3 ms | 6.2 GB/s | 1.31x |
| 14 | 32.4 ms | 6.2 GB/s | 1.31x |
Thread scaling observations:
- Diminishing returns after 4 threads on this workload
- Single-threaded performance is already very high due to SIMD optimization
- Memory bandwidth becomes the limiting factor for large files
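The Scaling column is the single-threaded time divided by the N-thread time. Dividing that ratio by the thread count gives parallel efficiency, which is the quantity the thread-scaling target under Performance Targets is expressed in. A minimal sketch of that arithmetic, using a hypothetical helper named scaling:

```shell
#!/bin/sh
# Illustrative helper (not part of libvroom).

# scaling T1_MS TN_MS THREADS -> "<speedup>x <efficiency>%"
#   speedup    = T1 / TN
#   efficiency = speedup / THREADS, as a percentage
scaling() {
  LC_ALL=C awk -v t1="$1" -v tn="$2" -v n="$3" \
    'BEGIN { s = t1 / tn; printf "%.2fx %.0f%%\n", s, 100 * s / n }'
}

scaling 22.9 18.2 4   # 100MB file, 4 threads: prints 1.26x 31%
scaling 42.4 32.3 8   # 200MB file, 8 threads: prints 1.31x 16%
```

On these measurements efficiency falls quickly as threads are added, which matches the diminishing returns noted above.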
Running Benchmarks
Build and Run
# Build with Release optimizations
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# Run all benchmarks
./build/libvroom_benchmark
# Run specific benchmark categories
./build/libvroom_benchmark --benchmark_filter="BM_Parse.*"
./build/libvroom_benchmark --benchmark_filter="BM_FileSizes.*"
Output Formats
# Console output (default)
./build/libvroom_benchmark
# JSON output for analysis
./build/libvroom_benchmark --benchmark_format=json --benchmark_out=results.json
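# CSV output (assuming a Google Benchmark binary: --benchmark_out_format sets the
# file format, while --benchmark_format only changes the console output)
./build/libvroom_benchmark --benchmark_out=results.csv --benchmark_out_format=csv
Automated Benchmarking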
Use the benchmark script for comprehensive testing:
./benchmark/run_benchmarks.sh ./build/libvroom_benchmark
This runs all benchmark categories and generates reports in benchmark_results/.
Benchmark Categories
Basic Benchmarks
Tests fundamental parsing operations:
- File parsing with different thread counts
- Quoted field handling
- Different separator types (comma, tab, semicolon, pipe)
Dimension Benchmarks
Tests scaling across dimensions:
- File Sizes: 1KB to 100MB
- Column Counts: 2 to 500 columns
- Row Counts: 100 to 1M rows
- Data Types: integers, floats, strings, mixed
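To explore similar dimensions outside the built-in suite, for example with the CLI, a synthetic file with a chosen column count and approximate size can be generated. The sketch below is one way to do that. The gen_csv helper, its column names, and its cell values are hypothetical. It is not the generator the suite uses, so results on its output are not directly comparable to the published numbers.

```shell
#!/bin/sh
# Hypothetical generator (not part of libvroom).

# gen_csv TARGET_BYTES COLUMNS OUT_FILE
# Writes a header plus rows of mixed integer, float, and string cells
# until at least TARGET_BYTES bytes have been written.
gen_csv() {
  LC_ALL=C awk -v target="$1" -v cols="$2" 'BEGIN {
    line = "c1"
    for (c = 2; c <= cols; c++) line = line ",c" c
    print line
    bytes = length(line) + 1              # +1 for the newline
    for (r = 1; bytes < target; r++) {
      line = r                            # column 1: row number
      for (c = 2; c <= cols; c++) {
        if (c % 3 == 0)      line = line "," sprintf("%.2f", r * c / 7)
        else if (c % 3 == 1) line = line ",item_" r
        else                 line = line "," (r * c)
      }
      print line
      bytes += length(line) + 1
    }
  }' > "$3"
}
```

For example, gen_csv 100000000 10 /tmp/libvroom_benchmark/test_100mb.csv would write a 10-column file of roughly 100 MB. The fields contain no quotes or embedded separators, so this exercises only the simplest parsing path.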
Real-World Benchmarks
Tests with realistic data patterns:
- NYC Taxi data (19 columns)
- Financial time-series data
- Server log files
- Wide tables (100+ columns)
Real-World Dataset Results (Apple Silicon)
For realistic file I/O performance, see the comparative benchmarks above, which show 3-5 GB/s single-threaded throughput on actual files.
The internal benchmark suite measures in-memory parsing performance:
| Dataset | Rows | Size | Parsing Time |
|---|---|---|---|
| Financial data | 1K | 68 KB | 0.08 ms |
| Financial data | 10K | 684 KB | 0.11 ms |
| Financial data | 100K | 6.8 MB | 0.33 ms |
| Log data | 1K | 89 KB | 0.06 ms |
| Log data | 10K | 900 KB | 0.10 ms |
| Log data | 100K | 9.1 MB | 0.38 ms |
These in-memory benchmarks demonstrate parsing efficiency but don’t include file I/O overhead. For end-to-end performance with file I/O, use the CLI comparison benchmarks in benchmark/cli_comparison.sh.
SIMD Benchmarks
Tests SIMD effectiveness:
- SIMD vs scalar comparison
- Quote detection with varying densities
- Memory access patterns
Performance Targets
Based on the production roadmap:
| Metric | Target | Status |
|---|---|---|
| Peak throughput (AVX2) | >5 GB/s | In progress |
| Peak throughput (AVX-512) | >8 GB/s | Planned |
| Thread scaling efficiency | >80% at 16 threads | Testing |
| Memory overhead | <10% over file size | Achieved |
Comparison with Other Parsers
The benchmark suite includes comparison benchmarks against:
- Naive (non-SIMD) parser implementation
- Raw memory bandwidth baseline
To run comparison benchmarks:
./build/libvroom_benchmark --benchmark_filter="BM_Compare.*"
CI Integration
Benchmarks can be integrated into CI pipelines:
# Run with regression detection
./benchmark/run_benchmarks.sh ./build/libvroom_benchmark results/ baseline.json 10.0
# Exit code indicates regression if threshold exceeded
if [ $? -ne 0 ]; then
  echo "Performance regression detected!"
  exit 1
fi
Platform Notes
x86-64 (Linux)
- Full AVX2/AVX-512 support
- RAPL energy measurements available
- Best for comprehensive benchmarking
- Expected throughput: 5+ GB/s (AVX2), 8+ GB/s (AVX-512)
ARM64 (macOS Apple Silicon)
- NEON vectorization (128-bit vectors)
- Excellent single-core performance: 3-5 GB/s typical
- M-series chips benefit from high memory bandwidth and efficient caches
- Energy measurements via power estimates
- Tested on M3 Max (14 cores): 4.7 GB/s single-threaded, 6.2 GB/s multi-threaded
ARM64 (Linux)
- NEON vectorization
- Performance counters available
- Expected throughput comparable to Apple Silicon with similar memory bandwidth
Reproducing These Benchmarks
To reproduce the comparative benchmarks on your system:
# Install dependencies (macOS)
brew install zsv hyperfine
# Build libvroom
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# Create a directory for the test files (test_100mb.csv must be placed here before running hyperfine)
TEMP_DIR="/tmp/libvroom_benchmark"
mkdir -p "$TEMP_DIR"
# Run comparative benchmark (example: 100MB)
hyperfine --warmup 2 \
  "./build/vroom count -t 1 $TEMP_DIR/test_100mb.csv" \
  "zsv count $TEMP_DIR/test_100mb.csv"
The benchmark/cli_comparison.sh script automates comprehensive comparisons.