Index Caching

Index caching allows libvroom to store parsed CSV field indexes on disk, enabling near-instant access on subsequent reads of the same file. This feature is particularly valuable for large CSV files that are read repeatedly.

Overview

When you parse a CSV file, libvroom builds an in-memory index of all field boundaries. This index enables fast random access to any row or column. With index caching enabled:

First read: libvroom parses the file normally and writes a .vidx cache file
Subsequent reads: libvroom memory-maps the cache file for instant index access

Performance Benefits

2-3x faster reads on cache hits for typical files
94% reduction in memory usage for large files (via memory mapping)
No analysis pass on cache hits (chunk metadata and sampled row offsets are loaded directly from the cache file)

Cache Invalidation

The cache automatically invalidates when the source file changes. libvroom stores the source file’s modification time and size in the cache header, and validates these on every read. If either has changed, the cache is regenerated.
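The validation rule above can be sketched with the standard library alone. This is an illustration, not libvroom's implementation: `SourceStamp`, `stamp_of`, and `cache_is_valid` are hypothetical names, and the modification time is expressed in seconds of the filesystem clock, whose epoch may differ from the Unix epoch.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical stand-in for the two header fields used for validation.
struct SourceStamp {
    std::int64_t mtime_seconds;  // seconds on the filesystem clock (assumption)
    std::uint64_t size_bytes;
};

// Read the current modification time and size of the source file.
SourceStamp stamp_of(const fs::path& source) {
    auto since_epoch = fs::last_write_time(source).time_since_epoch();
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(since_epoch);
    return {static_cast<std::int64_t>(secs.count()),
            static_cast<std::uint64_t>(fs::file_size(source))};
}

// The cache is reusable only if BOTH values still match the stored ones.
bool cache_is_valid(const SourceStamp& stored, const SourceStamp& current) {
    return stored.mtime_seconds == current.mtime_seconds &&
           stored.size_bytes == current.size_bytes;
}
```

A caller would compare the stamp stored in the cache header against `stamp_of("data.csv")` and re-parse when `cache_is_valid` returns false.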

CLI Usage

Enable Caching

Use the --cache flag to enable index caching:

# Enable caching (stores .vidx file next to source)
vroom head --cache data.csv

# First run: parses file, writes data.csv.vidx
# Subsequent runs: loads from cache (much faster)

Custom Cache Directory

Use --cache-dir to store cache files in a specific directory:

# Store cache files in /tmp/csv-cache
vroom head --cache-dir /tmp/csv-cache data.csv

# Cache file created at: /tmp/csv-cache/<hash>.vidx

This is useful when:

The source directory is read-only
You want to centralize cache files
You’re using a fast temporary filesystem

Force Cache Refresh

Use --no-cache to disable caching and force a fresh parse:

# Force re-parse even if cache exists
vroom head --no-cache data.csv

CLI Options Summary

Option	Description
`--cache`	Enable index caching (stores `.vidx` next to source)
`--cache-dir <path>`	Store cache files in specified directory
`--no-cache`	Disable caching (default behavior)

Stdin Not Supported

Index caching only works with file inputs. When reading from stdin, caching is automatically disabled since there’s no file path to associate with the cache.

C++ API Usage

Basic Caching

#include <iostream>
#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();  // Enable caching

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");

    auto result = reader.read_all();

    // Check if cache was used
    if (result.value.used_cache) {
        std::cout << "Loaded from cache: " << result.value.cache_path << "\n";
    } else {
        std::cout << "Parsed fresh, cache written to: " << result.value.cache_path << "\n";
    }

    return 0;
}

Custom Cache Directory

#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::custom("/tmp/csv-cache");

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();

    return 0;
}

Manual Cache Configuration

For full control over caching behavior:

#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();
    opts.force_cache_refresh = false;  // Set true to ignore existing cache

    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();

    return 0;
}

API Reference

CacheConfig

Controls where cache files are stored:

struct CacheConfig {
    enum Location {
        SAME_DIR,    // Adjacent to source file (default)
        XDG_CACHE,   // ~/.cache/libvroom/
        CUSTOM       // User-specified directory
    };

    Location location = SAME_DIR;
    std::string custom_path;       // Only used when location == CUSTOM
    bool resolve_symlinks = true;  // Resolve symlinks before computing cache path
    uint16_t sample_interval = 32; // Every Kth row sampled

    // Factory methods
    static CacheConfig defaults();              // SAME_DIR mode
    static CacheConfig xdg_cache();             // XDG_CACHE mode
    static CacheConfig custom(const std::string& path);  // CUSTOM mode
};

The resolve_symlinks option (default: true) ensures that files accessed through different symlink paths share a single cache file. When enabled, symlinks are resolved to their canonical paths before computing cache locations. This provides:

Single cache per unique file instead of duplicates for each symlink
Consistent behavior regardless of how the file is accessed
More efficient storage utilization

Set to false if you need separate caches for different symlink paths pointing to the same file.

CsvOptions (Cache-related fields)

struct CsvOptions {
    // ... other fields ...

    std::optional<CacheConfig> cache;      // nullopt = disabled (default)
    bool force_cache_refresh = false;      // Force re-parse and rewrite cache
};

ParsedChunks (Cache-related fields)

struct ParsedChunks {
    // ... other fields ...

    bool used_cache{false};    // True if index was loaded from cache
    std::string cache_path;    // Path to cache file (empty if disabled)
};

Cache File Format

Cache files use the .vidx extension. The current format (version 1) stores a compact representation of the parsing index: sampled row offsets with Elias-Fano encoding.

Header (48 bytes, fixed)

Offset	Size	Field	Description
0	4	magic	`"VIDX"` (0x56494458)
4	1	version	Format version (currently 1)
5	1	flags	Reserved (0)
6	2	sample_interval	Every Kth row sampled (default 32)
8	8	source_mtime	Source file mtime (seconds since epoch)
16	8	source_size	Source file size in bytes
24	8	header_end_offset	Byte offset where data rows start
32	4	num_columns	Number of CSV columns
36	4	num_chunks	Number of parallel parsing chunks
40	8	total_rows	Total row count

Section 1: Chunk Boundaries (16 × num_chunks bytes)

For each chunk: {start_offset: uint64, end_offset: uint64}.

Section 2: Chunk Analysis (5 × num_chunks bytes)

For each chunk: {row_count: uint32, ends_inside_starting_outside: uint8}. This is the persisted Phase 1 output, which lets a cache hit skip the SIMD analysis pass.
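The section sizes above imply simple offset arithmetic. The sketch below assumes that Sections 1 to 3 follow the 48-byte header back-to-back, in the listed order and without padding; the format description does not state this explicitly, and `SectionLayout` and `layout_for` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Byte offsets (from the start of the file) of the first three sections.
struct SectionLayout {
    std::uint64_t chunk_boundaries;  // Section 1
    std::uint64_t chunk_analysis;    // Section 2
    std::uint64_t sampled_offsets;   // Section 3
};

constexpr SectionLayout layout_for(std::uint32_t num_chunks) {
    constexpr std::uint64_t kHeaderBytes = 48;
    constexpr std::uint64_t kBoundaryBytes = 16;  // two uint64 per chunk
    constexpr std::uint64_t kAnalysisBytes = 5;   // uint32 + uint8 per chunk
    return {kHeaderBytes,
            kHeaderBytes + kBoundaryBytes * num_chunks,
            kHeaderBytes + (kBoundaryBytes + kAnalysisBytes) * num_chunks};
}
```

For example, with 8 chunks Section 2 would start at byte 176 and Section 3 at byte 216 under these assumptions.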

Section 3: Sampled Row Offsets (Elias-Fano encoded)

Every Kth row’s byte offset, encoded with Elias-Fano for compact storage:

Offset	Size	Field	Description
0	4	num_samples	Number of sampled offsets
4	4	universe	Upper bound for encoding
8	4	low_bits	Bits per low part
12	4	high_bitvec_bytes	Size of high bitvector
16	variable	high_bitvec	Unary-coded upper bits
…	variable	low_array	Packed lower bits
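To make the roles of `low_bits`, `high_bitvec`, and `low_array` concrete, here is a minimal Elias-Fano sketch. It is not libvroom's encoder: the names are hypothetical, the low parts and high bits are kept unpacked for readability (the on-disk format packs them), and lookup uses a linear scan where a real implementation would use a select structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EliasFano {
    std::uint32_t num_samples = 0;
    std::uint32_t low_bits = 0;
    std::vector<bool> high;          // unary-coded upper parts
    std::vector<std::uint64_t> low;  // lower parts, one entry per value
};

// Encode a non-decreasing sequence whose values are all <= universe.
EliasFano ef_encode(const std::vector<std::uint64_t>& sorted,
                    std::uint64_t universe) {
    EliasFano ef;
    ef.num_samples = static_cast<std::uint32_t>(sorted.size());
    if (sorted.empty()) return ef;

    // low_bits = floor(log2(universe / n)), or 0 when universe <= n.
    std::uint64_t ratio = universe / sorted.size();
    while (ratio > 1) { ratio >>= 1; ++ef.low_bits; }

    const std::uint64_t mask = (std::uint64_t{1} << ef.low_bits) - 1;
    ef.high.assign(sorted.size() + (universe >> ef.low_bits) + 1, false);
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        assert(sorted[i] <= universe);
        assert(i == 0 || sorted[i - 1] <= sorted[i]);
        ef.low.push_back(sorted[i] & mask);
        // The i-th set bit sits at position (upper part of value i) + i.
        ef.high[(sorted[i] >> ef.low_bits) + i] = true;
    }
    return ef;
}

// Recover the value at `index` by locating the (index+1)-th set bit.
std::uint64_t ef_get(const EliasFano& ef, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t pos = 0; pos < ef.high.size(); ++pos) {
        if (!ef.high[pos]) continue;
        if (seen == index)
            return ((pos - index) << ef.low_bits) | ef.low[index];
        ++seen;
    }
    assert(false && "index out of range");
    return 0;
}
```

Because each value costs roughly `low_bits + 2` bits instead of a full 64-bit offset, storing only every Kth row offset this way keeps the section small.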

Section 4: Sample Quote States

Packed bit array: 1 bit per sample (ceil(num_samples/8) bytes). Records the quote state at each sample point.
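A packed bit array of this shape can be sketched as follows. Two details are assumptions rather than documented facts: bits are stored least-significant-bit first within each byte, and a set bit means the sample point lies inside a quoted field. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one bit per sample into ceil(n / 8) bytes (LSB-first, assumed).
std::vector<std::uint8_t> pack_quote_states(const std::vector<bool>& inside_quote) {
    std::vector<std::uint8_t> packed((inside_quote.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < inside_quote.size(); ++i) {
        if (inside_quote[i]) {
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return packed;
}

// Read back the quote state recorded for a given sample index.
bool quote_state_at(const std::vector<std::uint8_t>& packed, std::size_t sample) {
    assert(sample / 8 < packed.size());
    return ((packed[sample / 8] >> (sample % 8)) & 1u) != 0;
}
```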

Section 5: Schema

For each column: {type: uint8, name_len: uint16, name: char[name_len]}.

Cache Size Estimates (default K=32)

File	Rows	File Size	Approximate Cache Size
1M rows	1M	100 MB	~40 KB
10M rows	10M	1 GB	~380 KB
100M rows	100M	10 GB	~3.8 MB

Cache Locations

Depending on configuration, cache files are stored in different locations:

Mode	Location	Example
SAME_DIR (default)	Adjacent to source	`data.csv.vidx`
XDG_CACHE	`~/.cache/libvroom/`	`~/.cache/libvroom/a1b2c3d4.vidx`
CUSTOM	User-specified	`/tmp/cache/a1b2c3d4.vidx`

For XDG_CACHE and CUSTOM modes, filenames are generated using a hash of the absolute source file path to avoid collisions.
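One way to derive such a filename is sketched below. The hash function libvroom actually uses is not specified here; this example swaps in 32-bit FNV-1a purely for illustration, and the eight-hex-digit name is inferred from the examples in the table. `fnv1a32` and `hashed_cache_path` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// 32-bit FNV-1a, used here only as a stand-in hash.
std::uint32_t fnv1a32(const std::string& s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}

// Build "<cache_dir>/<hash of absolute source path>.vidx".
fs::path hashed_cache_path(const fs::path& cache_dir, const fs::path& source) {
    const std::string abs = fs::absolute(source).lexically_normal().string();
    char name[16];
    std::snprintf(name, sizeof name, "%08x.vidx",
                  static_cast<unsigned>(fnv1a32(abs)));
    return cache_dir / name;
}
```

Hashing the full path rather than the bare filename is what keeps `/a/data.csv` and `/b/data.csv` from sharing one cache entry.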

Symlink Handling

When resolve_symlinks is enabled (default), symlinks are resolved to their canonical paths before computing cache locations:

XDG_CACHE mode: Multiple symlinks to the same file share a single cache entry
CUSTOM mode: Uses the resolved filename for cache file naming
SAME_DIR mode: Cache is placed adjacent to the accessed path (not the resolved target)

This ensures efficient cache utilization when the same file is accessed via different symlinks, while SAME_DIR mode keeps cache files where you expect them.
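The effect of resolve_symlinks on the path that feeds the cache key can be illustrated with `std::filesystem`. This is a sketch, not the library's code: `cache_key_path` is a hypothetical helper, and using `weakly_canonical` for resolution is an assumption.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Path from which a cache key would be derived.
fs::path cache_key_path(const fs::path& accessed, bool resolve_symlinks) {
    if (resolve_symlinks) {
        // Follows symlinks in the part of the path that exists and
        // normalizes the remainder.
        return fs::weakly_canonical(accessed);
    }
    // Keep the path as the caller spelled it, only made absolute and tidy.
    return fs::absolute(accessed).lexically_normal();
}
```

With resolution enabled, `real/data.csv` and `link/data.csv` (where `link` is a symlink to `real`) map to the same key path; with it disabled they stay distinct and would get separate cache files.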

Automatic Fallback

When using SAME_DIR mode, if the source directory is not writable (e.g., read-only filesystem), libvroom automatically falls back to XDG_CACHE mode. This ensures caching works even for files in protected directories.
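The fallback decision can be sketched as a writability probe on the source directory. How libvroom detects a read-only directory is not documented here, so the probe-file approach, the probe filename, and the names `Location`, `dir_is_writable`, and `choose_location` are all assumptions.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

enum class Location { SameDir, XdgCache };

// Try to create (and then remove) a small probe file in `dir`.
bool dir_is_writable(const fs::path& dir) {
    std::error_code ec;
    if (!fs::is_directory(dir, ec)) return false;
    const fs::path probe = dir / ".vidx-write-probe";  // hypothetical name
    std::ofstream out(probe);
    if (!out) return false;
    out.close();
    fs::remove(probe, ec);  // best-effort cleanup
    return true;
}

// Prefer a cache next to the source; otherwise fall back to the XDG cache.
Location choose_location(const fs::path& source) {
    const fs::path dir = fs::absolute(source).parent_path();
    return dir_is_writable(dir) ? Location::SameDir : Location::XdgCache;
}
```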

Performance Considerations

When to Use Caching

Caching is most beneficial when:

Reading the same file multiple times in different sessions
Working with large files (>100MB) where parsing takes noticeable time
Interactive applications that need fast startup times
Data analysis workflows that repeatedly access the same datasets

When Not to Use Caching

Caching may not be beneficial when:

Processing files once (streaming pipelines)
Working with frequently changing files (every change invalidates the cache and forces a re-parse and rewrite)
Working with very limited disk space (cache files are small, but they grow roughly in proportion to the source file)
Reading from stdin (not supported)

Cache File Size

Cache files are very compact thanks to Elias-Fano encoding of sampled row offsets. With the default sample interval of 32, cache files are typically 0.004-0.04% of the source file size (e.g., ~380 KB for a 1 GB file). The cache stores only chunk metadata and sampled positions, not the actual field boundaries.

Troubleshooting

Cache Not Being Used

If caching appears to not be working:

Check source_path is set: Caching requires a file path
Verify cache directory permissions: The cache directory must be writable
Check file modification: Cache invalidates when source changes
Confirm not using stdin: Caching doesn’t work with pipe input

Cache Corruption

If you suspect a corrupted cache file:

# Remove the cache file
rm data.csv.vidx

# Or force refresh
vroom head --no-cache data.csv
vroom head --cache data.csv  # Creates fresh cache

Permission Errors

If you can’t write cache files next to the source:

# Use a custom writable directory
vroom head --cache-dir /tmp/csv-cache data.csv

Or in C++:

opts.cache = libvroom::CacheConfig::xdg_cache();  // Uses ~/.cache/libvroom/