Index Caching

Index caching allows libvroom to store parsed CSV field indexes on disk, enabling near-instant access on subsequent reads of the same file. This feature is particularly valuable for large CSV files that are read repeatedly.

Overview

When you parse a CSV file, libvroom builds an in-memory index of all field boundaries. This index enables fast random access to any row or column. With index caching enabled:

  1. First read: libvroom parses the file normally and writes a .vidx cache file
  2. Subsequent reads: libvroom memory-maps the cache file for instant index access

Performance Benefits

  • 2-3x faster parsing on cache hits for typical files
  • 94% reduction in memory usage for large files (via memory mapping)
  • Zero parsing overhead on cache hits (the index is loaded directly)

Cache Invalidation

The cache automatically invalidates when the source file changes. libvroom stores the source file’s modification time and size in the cache header, and validates these on every read. If either has changed, the cache is regenerated.
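
The validation step can be sketched as follows. This is a minimal illustration rather than libvroom's actual implementation: `CachedStamp`, `read_stamp`, and `cache_is_fresh` are hypothetical names, and the sketch assumes a POSIX system where `stat()` reports the modification time in seconds since the epoch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <sys/stat.h>  // POSIX stat()

// Hypothetical stand-in for the two validation fields stored in the cache header.
struct CachedStamp {
    std::uint64_t source_mtime;  // seconds since epoch
    std::uint64_t source_size;   // bytes
};

// Read the current mtime and size of `path`. Returns false if stat() fails.
bool read_stamp(const std::string& path, CachedStamp& out) {
    struct stat st{};
    if (::stat(path.c_str(), &st) != 0) return false;
    out.source_mtime = static_cast<std::uint64_t>(st.st_mtime);
    out.source_size = static_cast<std::uint64_t>(st.st_size);
    return true;
}

// A cache is treated as usable only if both fields still match the source file.
bool cache_is_fresh(const CachedStamp& cached, const std::string& source_path) {
    assert(!source_path.empty());  // caching requires a file path
    CachedStamp now{};
    if (!read_stamp(source_path, now)) return false;
    return cached.source_mtime == now.source_mtime &&
           cached.source_size == now.source_size;
}
```

A check of this kind costs one `stat()` call per read, which is why it can run on every open without a measurable penalty.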

CLI Usage

Enable Caching

Use the --cache flag to enable index caching:

# Enable caching (stores .vidx file next to source)
vroom head --cache data.csv

# First run: parses file, writes data.csv.vidx
# Subsequent runs: loads from cache (much faster)

Custom Cache Directory

Use --cache-dir to store cache files in a specific directory:

# Store cache files in /tmp/csv-cache
vroom head --cache-dir /tmp/csv-cache data.csv

# Cache file created at: /tmp/csv-cache/<hash>.vidx

This is useful when:

  • The source directory is read-only
  • You want to centralize cache files
  • You’re using a fast temporary filesystem

Force Cache Refresh

Use --no-cache to disable caching and force a fresh parse:

# Force re-parse even if cache exists
vroom head --no-cache data.csv

CLI Options Summary

Option              Description
--cache             Enable index caching (stores .vidx next to source)
--cache-dir <path>  Store cache files in specified directory
--no-cache          Disable caching (default behavior)
Note: Stdin Not Supported

Index caching only works with file inputs. When reading from stdin, caching is automatically disabled since there’s no file path to associate with the cache.

C++ API Usage

Basic Caching

#include <iostream>
#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();  // Enable caching

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");

    auto result = reader.read_all();

    // Check if cache was used
    if (result.value.used_cache) {
        std::cout << "Loaded from cache: " << result.value.cache_path << "\n";
    } else {
        std::cout << "Parsed fresh, cache written to: " << result.value.cache_path << "\n";
    }

    return 0;
}

Custom Cache Directory

#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::custom("/tmp/csv-cache");

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();

    return 0;
}

Manual Cache Configuration

For full control over caching behavior:

#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();
    opts.force_cache_refresh = false;  // Set true to ignore existing cache

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();

    return 0;
}

API Reference

CacheConfig

Controls where cache files are stored:

struct CacheConfig {
    enum Location {
        SAME_DIR,    // Adjacent to source file (default)
        XDG_CACHE,   // ~/.cache/libvroom/
        CUSTOM       // User-specified directory
    };

    Location location = SAME_DIR;
    std::string custom_path;       // Only used when location == CUSTOM
    bool resolve_symlinks = true;  // Resolve symlinks before computing cache path
    uint16_t sample_interval = 32; // Every Kth row sampled

    // Factory methods
    static CacheConfig defaults();              // SAME_DIR mode
    static CacheConfig xdg_cache();             // XDG_CACHE mode
    static CacheConfig custom(const std::string& path);  // CUSTOM mode
};

The resolve_symlinks option (default: true) ensures that files accessed through different symlink paths share a single cache file. When enabled, symlinks are resolved to their canonical paths before computing cache locations. This provides:

  • Single cache per unique file instead of duplicates for each symlink
  • Consistent behavior regardless of how the file is accessed
  • More efficient storage utilization

Set to false if you need separate caches for different symlink paths pointing to the same file.
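
The effect of symlink resolution on the cache key can be illustrated with the standard library. This is a sketch under assumptions: `cache_key_path` is a hypothetical helper, not part of the libvroom API, and libvroom may canonicalize paths differently.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: returns the path string that a cache location
// would be derived from.
std::string cache_key_path(const fs::path& source, bool resolve_symlinks) {
    assert(!source.empty());
    if (resolve_symlinks) {
        // weakly_canonical resolves symlinks in the existing part of the
        // path and normalizes "." and ".." components.
        return fs::weakly_canonical(source).string();
    }
    // Leave symlinks in place; only make the path absolute and normalized.
    return fs::absolute(source).lexically_normal().string();
}
```

With resolution enabled, two symlinked routes to the same file produce the same key, so they map to one cache file. With it disabled, each route keeps its own key.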

Cache File Format

Cache files use the .vidx extension (version 1) and contain a compact representation of the parsing index using sampled row offsets with Elias-Fano encoding.

Header (48 bytes, fixed)

Offset  Size  Field              Description
0       4     magic              "VIDX" (0x56494458)
4       1     version            Format version (currently 1)
5       1     flags              Reserved (0)
6       2     sample_interval    Every Kth row sampled (default 32)
8       8     source_mtime       Source file mtime (seconds since epoch)
16      8     source_size        Source file size in bytes
24      8     header_end_offset  Byte offset where data rows start
32      4     num_columns        Number of CSV columns
36      4     num_chunks         Number of parallel parsing chunks
40      8     total_rows         Total row count
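
The header layout above can be read field by field as in the sketch below. `VidxHeader`, `read_le`, and `parse_header` are hypothetical names for illustration. The byte order is not stated in this document, so the sketch assumes little-endian integers and assumes the magic appears on disk as the bytes V, I, D, X in that order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

constexpr std::size_t kHeaderSize = 48;

// Hypothetical in-memory mirror of the header table (magic is validated, not stored).
struct VidxHeader {
    std::uint8_t version;
    std::uint8_t flags;
    std::uint16_t sample_interval;
    std::uint64_t source_mtime;
    std::uint64_t source_size;
    std::uint64_t header_end_offset;
    std::uint32_t num_columns;
    std::uint32_t num_chunks;
    std::uint64_t total_rows;
};

// Read an unsigned integer of type T at byte offset `off` (assumed little-endian).
template <typename T>
T read_le(const std::uint8_t* p, std::size_t off) {
    assert(off + sizeof(T) <= kHeaderSize);
    T v = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i)
        v = static_cast<T>(v | (static_cast<T>(p[off + i]) << (8 * i)));
    return v;
}

// Returns std::nullopt for a short buffer, a bad magic, or an unknown version.
std::optional<VidxHeader> parse_header(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < kHeaderSize) return std::nullopt;
    const std::uint8_t* p = buf.data();
    if (std::memcmp(p, "VIDX", 4) != 0) return std::nullopt;
    VidxHeader h{};
    h.version = p[4];
    h.flags = p[5];
    h.sample_interval = read_le<std::uint16_t>(p, 6);
    h.source_mtime = read_le<std::uint64_t>(p, 8);
    h.source_size = read_le<std::uint64_t>(p, 16);
    h.header_end_offset = read_le<std::uint64_t>(p, 24);
    h.num_columns = read_le<std::uint32_t>(p, 32);
    h.num_chunks = read_le<std::uint32_t>(p, 36);
    h.total_rows = read_le<std::uint64_t>(p, 40);
    if (h.version != 1) return std::nullopt;
    return h;
}
```

Reading each field at its documented offset avoids relying on compiler struct padding, which a direct `memcpy` into the struct would depend on.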

Section 1: Chunk Boundaries (16 × num_chunks bytes)

For each chunk: {start_offset: uint64, end_offset: uint64}.

Section 2: Chunk Analysis (5 × num_chunks bytes)

For each chunk: {row_count: uint32, ends_inside_starting_outside: uint8}. This is the persisted Phase 1 output, enabling skip of the SIMD analysis pass on cache hit.
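
Because Sections 1 and 2 have fixed-size entries, the start of each section follows from num_chunks alone. The sketch below works through that arithmetic. It assumes the sections are stored back to back immediately after the 48-byte header, in the order listed, which this document implies but does not state. `SectionOffsets` and `section_offsets` are hypothetical names.

```cpp
#include <cstdint>

constexpr std::uint64_t kHeaderBytes = 48;
constexpr std::uint64_t kBoundaryEntryBytes = 16;  // two uint64 values
constexpr std::uint64_t kAnalysisEntryBytes = 5;   // uint32 + uint8

struct SectionOffsets {
    std::uint64_t chunk_boundaries;  // Section 1
    std::uint64_t chunk_analysis;    // Section 2
    std::uint64_t sampled_offsets;   // Section 3
};

// Compute where each section would start for a given chunk count.
constexpr SectionOffsets section_offsets(std::uint32_t num_chunks) {
    return SectionOffsets{
        kHeaderBytes,
        kHeaderBytes + kBoundaryEntryBytes * num_chunks,
        kHeaderBytes + (kBoundaryEntryBytes + kAnalysisEntryBytes) * num_chunks};
}

// Worked example: with 8 chunks, Section 2 starts at 48 + 128 = 176
// and Section 3 at 176 + 40 = 216.
static_assert(section_offsets(8).chunk_analysis == 176, "Section 2 offset");
static_assert(section_offsets(8).sampled_offsets == 216, "Section 3 offset");
```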

Section 3: Sampled Row Offsets (Elias-Fano encoded)

Every Kth row’s byte offset, encoded with Elias-Fano for compact storage:

Offset  Size      Field              Description
0       4         num_samples        Number of sampled offsets
4       4         universe           Upper bound for encoding
8       4         low_bits           Bits per low part
12      4         high_bitvec_bytes  Size of high bitvector
16      variable  high_bitvec        Unary-coded upper bits
-       variable  low_array          Packed lower bits
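
For readers unfamiliar with Elias-Fano encoding, the sketch below shows the general technique on a small sorted sequence. It is a teaching sketch, not libvroom's encoder: the `EliasFano` struct, `ef_encode`, and `ef_get` are hypothetical, the bit arrays use `std::vector<bool>` rather than the on-disk byte layout, and lookup uses a linear scan where a production implementation would use a select structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EliasFano {
    std::uint32_t num_samples = 0;
    std::uint32_t low_bits = 0;
    std::vector<bool> high_bitvec;  // unary-coded upper bits
    std::vector<bool> low_array;    // low_bits bits per value, back to back
};

// Encode a non-decreasing sequence of values, each smaller than `universe`.
EliasFano ef_encode(const std::vector<std::uint64_t>& values, std::uint64_t universe) {
    EliasFano ef;
    ef.num_samples = static_cast<std::uint32_t>(values.size());
    if (values.empty()) return ef;
    // low_bits = floor(log2(universe / n)), or 0 when universe <= n.
    std::uint64_t ratio = universe / values.size();
    while (ratio > 1) { ratio >>= 1; ++ef.low_bits; }
    const std::uint64_t max_high = universe >> ef.low_bits;
    ef.high_bitvec.assign(values.size() + max_high + 1, false);
    std::uint64_t prev = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] >= prev && values[i] < universe);
        prev = values[i];
        // Upper bits: set bit (high + i); gaps between ones encode deltas in unary.
        ef.high_bitvec[(values[i] >> ef.low_bits) + i] = true;
        // Lower bits: append the low_bits least significant bits, LSB first.
        for (std::uint32_t b = 0; b < ef.low_bits; ++b)
            ef.low_array.push_back(((values[i] >> b) & 1u) != 0);
    }
    return ef;
}

// Recover the i-th value: find the i-th set bit in high_bitvec, then add the low bits.
std::uint64_t ef_get(const EliasFano& ef, std::size_t i) {
    assert(i < ef.num_samples);
    std::size_t seen = 0, pos = 0;
    for (;; ++pos) {
        if (ef.high_bitvec[pos] && seen++ == i) break;
    }
    std::uint64_t value = static_cast<std::uint64_t>(pos - i) << ef.low_bits;
    for (std::uint32_t b = 0; b < ef.low_bits; ++b)
        if (ef.low_array[i * ef.low_bits + b]) value |= (std::uint64_t{1} << b);
    return value;
}
```

Each value costs low_bits bits in the low array plus roughly two bits in the high bitvector, which is what keeps the sampled offsets small relative to storing a full 64-bit offset per sample.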

Section 4: Sample Quote States

Packed bit array: 1 bit per sample (ceil(num_samples/8) bytes). Records the quote state at each sample point.
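
A packed bit array of this shape can be read and written as sketched below. The helper names are hypothetical. Two details are assumptions, since the document does not specify them: bits are taken LSB-first within each byte, and a set bit is taken to mean the sample point lies inside a quoted field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for one bit per sample: ceil(num_samples / 8).
constexpr std::size_t quote_state_bytes(std::size_t num_samples) {
    return (num_samples + 7) / 8;
}

// Read the bit for sample i (assumed LSB-first within each byte).
bool sample_in_quotes(const std::vector<std::uint8_t>& bits, std::size_t i) {
    assert(i / 8 < bits.size());
    return ((bits[i / 8] >> (i % 8)) & 1u) != 0;
}

// Set or clear the bit for sample i.
void set_sample_in_quotes(std::vector<std::uint8_t>& bits, std::size_t i, bool v) {
    assert(i / 8 < bits.size());
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << (i % 8));
    if (v) {
        bits[i / 8] = static_cast<std::uint8_t>(bits[i / 8] | mask);
    } else {
        bits[i / 8] = static_cast<std::uint8_t>(bits[i / 8] & ~mask);
    }
}
```

For a 1M-row file at the default interval of 32 there are 31,250 samples, so this section would occupy 3,907 bytes.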

Section 5: Schema

For each column: {type: uint8, name_len: uint16, name: char[name_len]}.
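
A reader for this variable-length layout might look like the sketch below. `ColumnEntry` and `parse_schema` are hypothetical names, name_len is assumed to be little-endian, and the meaning of the type byte is left uninterpreted because this document does not list the type codes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct ColumnEntry {
    std::uint8_t type;  // raw type code, not interpreted here
    std::string name;
};

// Parse num_columns entries of {type: uint8, name_len: uint16, name: char[name_len]}.
// Returns std::nullopt if the buffer ends before all entries are read.
std::optional<std::vector<ColumnEntry>> parse_schema(
        const std::vector<std::uint8_t>& buf, std::uint32_t num_columns) {
    std::vector<ColumnEntry> cols;
    std::size_t pos = 0;
    for (std::uint32_t c = 0; c < num_columns; ++c) {
        assert(pos <= buf.size());
        if (buf.size() - pos < 3) return std::nullopt;  // type + name_len
        ColumnEntry e;
        e.type = buf[pos];
        const std::size_t name_len =
            buf[pos + 1] | (static_cast<std::size_t>(buf[pos + 2]) << 8);
        pos += 3;
        if (buf.size() - pos < name_len) return std::nullopt;  // truncated name
        e.name.assign(reinterpret_cast<const char*>(buf.data() + pos), name_len);
        pos += name_len;
        cols.push_back(std::move(e));
    }
    return cols;
}
```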

Cache Size Estimates (default K=32)

File       Rows  File Size  Approximate Cache Size
1M rows    1M    100 MB     ~40 KB
10M rows   10M   1 GB       ~380 KB
100M rows  100M  10 GB      ~3.8 MB

Cache Locations

Depending on configuration, cache files are stored in different locations:

Mode                Location            Example
SAME_DIR (default)  Adjacent to source  data.csv.vidx
XDG_CACHE           ~/.cache/libvroom/  ~/.cache/libvroom/a1b2c3d4.vidx
CUSTOM              User-specified      /tmp/cache/a1b2c3d4.vidx

For XDG_CACHE and CUSTOM modes, filenames are generated using a hash of the absolute source file path to avoid collisions.
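
The idea of deriving a filename from the source path can be sketched as follows. The hash function libvroom uses is not specified in this document, so FNV-1a (64-bit) is swapped in purely for illustration, and the resulting names will not match the ones libvroom generates. `fnv1a64` and `hashed_cache_path` are hypothetical helpers, and the absolute-path check assumes POSIX-style paths.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a 64-bit hash (illustrative choice, not necessarily libvroom's).
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;  // FNV prime
    }
    return h;
}

// Build "<cache_dir>/<16 hex digits>.vidx" from an absolute source path.
std::string hashed_cache_path(const std::string& cache_dir,
                              const std::string& abs_source_path) {
    assert(!abs_source_path.empty() && abs_source_path[0] == '/');
    char name[17];
    std::snprintf(name, sizeof(name), "%016llx",
                  static_cast<unsigned long long>(fnv1a64(abs_source_path)));
    return cache_dir + "/" + name + ".vidx";
}
```

Hashing the full absolute path means two files named data.csv in different directories get different cache files, even though they share one cache directory.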

Automatic Fallback

When using SAME_DIR mode, if the source directory is not writable (e.g., read-only filesystem), libvroom automatically falls back to XDG_CACHE mode. This ensures caching works even for files in protected directories.

Performance Considerations

When to Use Caching

Caching is most beneficial when:

  • Reading the same file multiple times in different sessions
  • Working with large files (>100MB) where parsing takes noticeable time
  • Interactive applications that need fast startup times
  • Data analysis workflows that repeatedly access the same datasets

When Not to Use Caching

Caching may not be beneficial when:

  • Processing files once (streaming pipelines)
  • Working with frequently changing files (cache invalidation overhead)
  • Limited disk space (cache files are small but grow roughly in proportion to file size; see Cache File Size)
  • Reading from stdin (not supported)

Cache File Size

Cache files are very compact thanks to Elias-Fano encoding of sampled row offsets. With the default sample interval of 32, cache files are typically 0.004-0.04% of the source file size (e.g., ~380 KB for a 1 GB file). The cache stores only chunk metadata and sampled positions, not the actual field boundaries.
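
One way sampled positions can stand in for a full row index is sketched below: jump to the nearest earlier sample, then scan forward over at most K - 1 rows. This illustrates the general idea and is not a description of libvroom's lookup path. The function names are hypothetical, and the sketch assumes no newlines inside quoted fields, which is the case the stored quote states would otherwise help handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Record the byte offset of every k-th row (rows 0, k, 2k, ...).
// Simplification: assumes no newlines inside quoted fields.
std::vector<std::size_t> sample_row_offsets(const std::string& data, std::size_t k) {
    assert(k > 0);
    std::vector<std::size_t> samples;
    std::size_t row = 0, start = 0;
    while (start < data.size()) {
        if (row % k == 0) samples.push_back(start);
        const std::size_t nl = data.find('\n', start);
        if (nl == std::string::npos) break;
        start = nl + 1;
        ++row;
    }
    return samples;
}

// Locate the start of `row` by jumping to the nearest earlier sample and
// scanning forward over at most k - 1 rows.
std::size_t row_offset(const std::string& data, const std::vector<std::size_t>& samples,
                       std::size_t k, std::size_t row) {
    assert(k > 0 && row / k < samples.size());
    std::size_t pos = samples[row / k];
    for (std::size_t skip = row % k; skip > 0; --skip) {
        const std::size_t nl = data.find('\n', pos);
        assert(nl != std::string::npos);
        pos = nl + 1;
    }
    return pos;
}
```

A larger sample interval shrinks the cache but lengthens the forward scan, so the interval trades cache size against lookup cost.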

Troubleshooting

Cache Not Being Used

If caching appears to not be working:

  1. Check source_path is set: Caching requires a file path
  2. Verify file accessibility: The cache directory must be writable
  3. Check file modification: Cache invalidates when source changes
  4. Confirm not using stdin: Caching doesn’t work with pipe input

Cache Corruption

If you suspect a corrupted cache file:

# Remove the cache file
rm data.csv.vidx

# Or force refresh
vroom head --no-cache data.csv
vroom head --cache data.csv  # Creates fresh cache

Permission Errors

If you can’t write cache files next to the source:

# Use a custom writable directory
vroom head --cache-dir /tmp/csv-cache data.csv

Or in C++:

opts.cache = libvroom::CacheConfig::xdg_cache();  // Uses ~/.cache/libvroom/