Index Caching
Index caching allows libvroom to store parsed CSV field indexes on disk, enabling near-instant access on subsequent reads of the same file. This feature is particularly valuable for large CSV files that are read repeatedly.
Overview
When you parse a CSV file, libvroom builds an in-memory index of all field boundaries. This index enables fast random access to any row or column. With index caching enabled:
- First read: libvroom parses the file normally and writes a .vidx cache file
- Subsequent reads: libvroom memory-maps the cache file for instant index access
Performance Benefits
- 2-3x faster parsing on cache hits for typical files
- 94% reduction in memory usage for large files (via memory mapping)
- Zero parsing overhead on cache hits (the index is loaded directly)
Cache Invalidation
The cache automatically invalidates when the source file changes. libvroom stores the source file’s modification time and size in the cache header, and validates these on every read. If either has changed, the cache is regenerated.
CLI Usage
Enable Caching
Use the --cache flag to enable index caching:
# Enable caching (stores .vidx file next to source)
vroom head --cache data.csv
# First run: parses file, writes data.csv.vidx
# Subsequent runs: loads from cache (much faster)
Custom Cache Directory
Use --cache-dir to store cache files in a specific directory:
# Store cache files in /tmp/csv-cache
vroom head --cache-dir /tmp/csv-cache data.csv
# Cache file created at: /tmp/csv-cache/<hash>.vidx
This is useful when:
- The source directory is read-only
- You want to centralize cache files
- You’re using a fast temporary filesystem
Force Cache Refresh
Use --no-cache to disable caching and force a fresh parse:
# Force re-parse even if cache exists
vroom head --no-cache data.csv
CLI Options Summary
| Option | Description |
|---|---|
| --cache | Enable index caching (stores .vidx next to source) |
| --cache-dir <path> | Store cache files in specified directory |
| --no-cache | Disable caching (default behavior) |
Index caching only works with file inputs. When reading from stdin, caching is automatically disabled since there’s no file path to associate with the cache.
C++ API Usage
Basic Caching
#include <libvroom.h>
#include <iostream>  // std::cout

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();  // Enable caching
    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();
    // Check if cache was used
    if (result.value.used_cache) {
        std::cout << "Loaded from cache: " << result.value.cache_path << "\n";
    } else {
        std::cout << "Parsed fresh, cache written to: " << result.value.cache_path << "\n";
    }
    return 0;
}
Custom Cache Directory
#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::custom("/tmp/csv-cache");
    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();
    return 0;
}
Manual Cache Configuration
For full control over caching behavior:
#include <libvroom.h>

int main() {
    libvroom::CsvOptions opts;
    opts.cache = libvroom::CacheConfig::defaults();
    opts.force_cache_refresh = false;  // Set true to ignore existing cache
    libvroom::CsvReader reader(opts);
    reader.open("data.csv");
    auto result = reader.read_all();
    return 0;
}
API Reference
CacheConfig
Controls where cache files are stored:
struct CacheConfig {
    enum Location {
        SAME_DIR,   // Adjacent to source file (default)
        XDG_CACHE,  // ~/.cache/libvroom/
        CUSTOM      // User-specified directory
    };
    Location location = SAME_DIR;
    std::string custom_path;        // Only used when location == CUSTOM
    bool resolve_symlinks = true;   // Resolve symlinks before computing cache path
    uint16_t sample_interval = 32;  // Every Kth row sampled
    // Factory methods
    static CacheConfig defaults();                       // SAME_DIR mode
    static CacheConfig xdg_cache();                      // XDG_CACHE mode
    static CacheConfig custom(const std::string& path);  // CUSTOM mode
};
The resolve_symlinks option (default: true) ensures that files accessed through different symlink paths share a single cache file. When enabled, symlinks are resolved to their canonical paths before computing cache locations. This provides:
- Single cache per unique file instead of duplicates for each symlink
- Consistent behavior regardless of how the file is accessed
- More efficient storage utilization
Set to false if you need separate caches for different symlink paths pointing to the same file.
Cache File Format
Cache files use the .vidx extension (version 1) and contain a compact representation of the parsing index using sampled row offsets with Elias-Fano encoding.
Header (48 bytes, fixed)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | magic | "VIDX" (0x56494458) |
| 4 | 1 | version | Format version (currently 1) |
| 5 | 1 | flags | Reserved (0) |
| 6 | 2 | sample_interval | Every Kth row sampled (default 32) |
| 8 | 8 | source_mtime | Source file mtime (seconds since epoch) |
| 16 | 8 | source_size | Source file size in bytes |
| 24 | 8 | header_end_offset | Byte offset where data rows start |
| 32 | 4 | num_columns | Number of CSV columns |
| 36 | 4 | num_chunks | Number of parallel parsing chunks |
| 40 | 8 | total_rows | Total row count |
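The header table maps onto a plain struct whose fields happen to be naturally aligned, so no packing pragma is needed. The following is an illustrative mirror of the table, not libvroom's actual type: VidxHeader and parse_header are hypothetical names, and the sketch assumes the magic is stored as the ASCII bytes "VIDX" in that order and that multi-byte fields use the host's byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative mirror of the 48-byte header table (hypothetical type).
struct VidxHeader {
    char          magic[4];           // offset 0:  "VIDX"
    std::uint8_t  version;            // offset 4:  format version (1)
    std::uint8_t  flags;              // offset 5:  reserved (0)
    std::uint16_t sample_interval;    // offset 6:  every Kth row sampled
    std::uint64_t source_mtime;       // offset 8:  seconds since epoch
    std::uint64_t source_size;        // offset 16: bytes
    std::uint64_t header_end_offset;  // offset 24: where data rows start
    std::uint32_t num_columns;        // offset 32
    std::uint32_t num_chunks;         // offset 36
    std::uint64_t total_rows;         // offset 40
};

// Compile-time checks that the struct matches the documented layout.
static_assert(sizeof(VidxHeader) == 48, "header must be 48 bytes");
static_assert(offsetof(VidxHeader, sample_interval) == 6, "offset mismatch");
static_assert(offsetof(VidxHeader, source_mtime) == 8, "offset mismatch");
static_assert(offsetof(VidxHeader, num_columns) == 32, "offset mismatch");
static_assert(offsetof(VidxHeader, total_rows) == 40, "offset mismatch");

// Copies a header out of a raw buffer and checks the magic and version.
// Assumes host byte order for the integer fields.
bool parse_header(const unsigned char* data, std::size_t len, VidxHeader& out) {
    if (len < sizeof(VidxHeader)) return false;
    std::memcpy(&out, data, sizeof(VidxHeader));
    return std::memcmp(out.magic, "VIDX", 4) == 0 && out.version == 1;
}
```

Using memcpy instead of casting the buffer pointer avoids alignment and aliasing problems when the buffer comes from a memory-mapped file.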
Section 1: Chunk Boundaries (16 × num_chunks bytes)
For each chunk: {start_offset: uint64, end_offset: uint64}.
Section 2: Chunk Analysis (5 × num_chunks bytes)
For each chunk: {row_count: uint32, ends_inside_starting_outside: uint8}. This is the persisted Phase 1 output, which lets libvroom skip the SIMD analysis pass on a cache hit.
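Because each chunk analysis record is 5 bytes, it cannot be read by overlaying an ordinary struct, which the compiler would pad to 8 bytes. A minimal sketch of a field-by-field decode follows. It assumes the records are stored back to back in host byte order; ChunkAnalysis and read_chunk_analysis are hypothetical names, not libvroom API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// In-memory form of one Section 2 record (hypothetical type).
struct ChunkAnalysis {
    std::uint32_t row_count;
    std::uint8_t  ends_inside_starting_outside;
};

// On-disk size of one record: 4 bytes + 1 byte, with no padding.
constexpr std::size_t kChunkAnalysisBytes = 5;

// Decodes num_chunks records from a raw buffer. Returns an empty vector
// if the buffer is too short to hold them all.
std::vector<ChunkAnalysis> read_chunk_analysis(const unsigned char* data,
                                               std::size_t len,
                                               std::uint32_t num_chunks) {
    std::vector<ChunkAnalysis> out;
    if (len < kChunkAnalysisBytes * num_chunks) return out;
    out.reserve(num_chunks);
    for (std::uint32_t i = 0; i < num_chunks; ++i) {
        const unsigned char* rec = data + i * kChunkAnalysisBytes;
        ChunkAnalysis c{};
        std::memcpy(&c.row_count, rec, 4);  // unaligned-safe read
        c.ends_inside_starting_outside = rec[4];
        out.push_back(c);
    }
    return out;
}
```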
Section 3: Sampled Row Offsets (Elias-Fano encoded)
Every Kth row’s byte offset, encoded with Elias-Fano for compact storage:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | num_samples | Number of sampled offsets |
| 4 | 4 | universe | Upper bound for encoding |
| 8 | 4 | low_bits | Bits per low part |
| 12 | 4 | high_bitvec_bytes | Size of high bitvector |
| 16 | variable | high_bitvec | Unary-coded upper bits |
| … | variable | low_array | Packed lower bits |
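To show how the high_bitvec and low_array fields relate to the original offsets, here is a small textbook Elias-Fano encoder and decoder. It is a sketch under stated assumptions rather than libvroom's implementation: the choice low_bits = floor(log2(universe / n)), the LSB-first bit order, and the names EliasFano, ef_encode, and ef_get are all illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative container for an Elias-Fano sequence (hypothetical type).
struct EliasFano {
    std::uint32_t num_samples = 0;
    std::uint32_t low_bits = 0;
    std::vector<std::uint8_t> high_bitvec;  // unary-coded upper bits
    std::vector<std::uint8_t> low_array;    // packed lower bits
};

static void set_bit(std::vector<std::uint8_t>& v, std::uint64_t i) {
    v[i >> 3] |= static_cast<std::uint8_t>(1u << (i & 7));
}

static bool get_bit(const std::vector<std::uint8_t>& v, std::uint64_t i) {
    return (v[i >> 3] >> (i & 7)) & 1u;
}

// Encodes a non-decreasing sequence whose values are all <= universe.
EliasFano ef_encode(const std::vector<std::uint64_t>& sorted,
                    std::uint64_t universe) {
    EliasFano ef;
    const std::uint64_t n = sorted.size();
    ef.num_samples = static_cast<std::uint32_t>(n);
    if (n == 0) return ef;
    // low_bits = floor(log2(universe / n)), or 0 when universe <= n.
    for (std::uint64_t r = universe / n; r > 1; r >>= 1) ++ef.low_bits;
    // The i-th value sets bit (high part + i), so the largest position is
    // at most (universe >> low_bits) + n - 1.
    const std::uint64_t high_len = n + (universe >> ef.low_bits) + 1;
    ef.high_bitvec.assign((high_len + 7) / 8, 0);
    ef.low_array.assign((n * ef.low_bits + 7) / 8, 0);
    for (std::uint64_t i = 0; i < n; ++i) {
        assert(sorted[i] <= universe);
        assert(i == 0 || sorted[i - 1] <= sorted[i]);
        set_bit(ef.high_bitvec, (sorted[i] >> ef.low_bits) + i);
        for (std::uint32_t b = 0; b < ef.low_bits; ++b)
            if ((sorted[i] >> b) & 1u) set_bit(ef.low_array, i * ef.low_bits + b);
    }
    return ef;
}

// Recovers the i-th value. Uses a linear scan for clarity; a production
// decoder would use a select index over the high bitvector instead.
std::uint64_t ef_get(const EliasFano& ef, std::uint32_t i) {
    assert(i < ef.num_samples);
    std::uint64_t pos = 0;
    std::uint32_t seen = 0;
    while (true) {  // find the position of the (i+1)-th set bit
        if (get_bit(ef.high_bitvec, pos)) {
            if (seen == i) break;
            ++seen;
        }
        ++pos;
    }
    const std::uint64_t high = pos - i;
    std::uint64_t low = 0;
    for (std::uint32_t b = 0; b < ef.low_bits; ++b)
        if (get_bit(ef.low_array, std::uint64_t{i} * ef.low_bits + b))
            low |= std::uint64_t{1} << b;
    return (high << ef.low_bits) | low;
}
```

The encoding spends low_bits bits per value on the low parts, plus roughly two bits per value on the unary-coded high parts, which is why sorted byte offsets compress well in this form.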
Section 4: Sample Quote States
Packed bit array: 1 bit per sample (ceil(num_samples/8) bytes). Records the quote state at each sample point.
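The byte count and the packing itself can be sketched as below. The sketch assumes least-significant-bit-first ordering within each byte and that a set bit means the sample point falls inside a quoted field; neither detail is stated in this document, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ceil(num_samples / 8): bytes needed for one bit per sample.
std::size_t quote_state_bytes(std::size_t num_samples) {
    return (num_samples + 7) / 8;
}

// Packs one boolean per sample into a byte array, LSB first (assumed order).
std::vector<std::uint8_t> pack_quote_states(const std::vector<bool>& inside_quote) {
    std::vector<std::uint8_t> packed(quote_state_bytes(inside_quote.size()), 0);
    for (std::size_t i = 0; i < inside_quote.size(); ++i)
        if (inside_quote[i])
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    return packed;
}

// Reads back the quote state of sample i.
bool quote_state_at(const std::vector<std::uint8_t>& packed, std::size_t i) {
    return (packed[i / 8] >> (i % 8)) & 1u;
}
```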
Section 5: Schema
For each column: {type: uint8, name_len: uint16, name: char[name_len]}.
Cache Size Estimates (default K=32)
| File | Rows | File Size | Approximate Cache Size |
|---|---|---|---|
| 1M rows | 1M | 100 MB | ~40 KB |
| 10M rows | 10M | 1 GB | ~380 KB |
| 100M rows | 100M | 10 GB | ~3.8 MB |
Cache Locations
Depending on configuration, cache files are stored in different locations:
| Mode | Location | Example |
|---|---|---|
| SAME_DIR (default) | Adjacent to source | data.csv.vidx |
| XDG_CACHE | ~/.cache/libvroom/ | ~/.cache/libvroom/a1b2c3d4.vidx |
| CUSTOM | User-specified | /tmp/cache/a1b2c3d4.vidx |
For XDG_CACHE and CUSTOM modes, filenames are generated using a hash of the absolute source file path to avoid collisions.
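The naming scheme in the table can be sketched as follows. The hash function libvroom uses is not stated in this document, so the sketch substitutes 64-bit FNV-1a as a stand-in, which also means the generated names are 16 hex digits rather than the 8 shown in the examples. Location, fnv1a64, and cache_path_for are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

enum class Location { SameDir, XdgCache, Custom };

// 64-bit FNV-1a: a stand-in hash, not necessarily what libvroom uses.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ull;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ull;  // FNV prime
    }
    return h;
}

// Computes where the cache file for `source` would live. For XdgCache and
// Custom, `dir` is the cache directory chosen by the caller.
fs::path cache_path_for(const fs::path& source, Location loc,
                        const fs::path& dir = {}) {
    if (loc == Location::SameDir) {
        fs::path p = source;
        p += ".vidx";  // data.csv -> data.csv.vidx
        return p;
    }
    // Hash the absolute path so files with the same basename in different
    // directories get different cache names.
    char name[32];
    std::snprintf(name, sizeof name, "%016llx.vidx",
                  static_cast<unsigned long long>(
                      fnv1a64(fs::absolute(source).string())));
    return dir / name;
}
```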
Symlink Handling
When resolve_symlinks is enabled (default), symlinks are resolved to their canonical paths before computing cache locations:
- XDG_CACHE mode: Multiple symlinks to the same file share a single cache entry
- CUSTOM mode: Uses the resolved filename for cache file naming
- SAME_DIR mode: Cache is placed adjacent to the accessed path (not the resolved target)
This ensures efficient cache utilization when the same file is accessed via different symlinks, while SAME_DIR mode keeps cache files where you expect them.
Automatic Fallback
When using SAME_DIR mode, if the source directory is not writable (e.g., read-only filesystem), libvroom automatically falls back to XDG_CACHE mode. This ensures caching works even for files in protected directories.
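A fallback of this kind might look like the sketch below. How libvroom actually tests writability is not described here, so the probe-file approach, the handling of the XDG_CACHE_HOME and HOME environment variables, and all function names are assumptions.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Tests writability by creating and removing a small probe file
// (an assumed strategy, chosen because it is portable).
bool dir_is_writable(const fs::path& dir) {
    std::error_code ec;
    if (!fs::is_directory(dir, ec)) return false;
    const fs::path probe = dir / ".vidx_write_probe";
    std::ofstream f(probe);
    if (!f) return false;
    f.close();
    fs::remove(probe, ec);
    return true;
}

// Resolves the XDG-style cache directory: $XDG_CACHE_HOME/libvroom if set,
// otherwise $HOME/.cache/libvroom (honoring XDG_CACHE_HOME is an assumption).
fs::path xdg_cache_dir() {
    if (const char* x = std::getenv("XDG_CACHE_HOME"); x && *x)
        return fs::path(x) / "libvroom";
    if (const char* h = std::getenv("HOME"); h && *h)
        return fs::path(h) / ".cache" / "libvroom";
    return fs::temp_directory_path() / "libvroom";
}

// SAME_DIR with fallback: prefer the source's directory, and use the XDG
// cache directory when the source's directory cannot be written to.
fs::path choose_cache_dir(const fs::path& source) {
    const fs::path parent = fs::absolute(source).parent_path();
    return dir_is_writable(parent) ? parent : xdg_cache_dir();
}
```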
Performance Considerations
When to Use Caching
Caching is most beneficial when:
- Reading the same file multiple times in different sessions
- Working with large files (>100MB) where parsing takes noticeable time
- Interactive applications that need fast startup times
- Data analysis workflows that repeatedly access the same datasets
When Not to Use Caching
Caching may not be beneficial when:
- Processing files once (streaming pipelines)
- Working with frequently changing files (cache invalidation overhead)
- Limited disk space (cache files are small, but they grow roughly in proportion to file size)
- Reading from stdin (not supported)
Cache File Size
Cache files are very compact thanks to Elias-Fano encoding of sampled row offsets. With the default sample interval of 32, cache files are typically 0.004-0.04% of the source file size (e.g., ~380 KB for a 1 GB file). The cache stores only chunk metadata and sampled positions, not the actual field boundaries.
Troubleshooting
Cache Not Being Used
If caching appears to not be working:
- Check source_path is set: Caching requires a file path
- Verify file accessibility: The cache directory must be writable
- Check file modification: Cache invalidates when source changes
- Confirm not using stdin: Caching doesn’t work with pipe input
Cache Corruption
If you suspect a corrupted cache file:
# Remove the cache file
rm data.csv.vidx
# Or force refresh
vroom head --no-cache data.csv
vroom head --cache data.csv # Creates fresh cache
Permission Errors
If you can’t write cache files next to the source:
# Use a custom writable directory
vroom head --cache-dir /tmp/csv-cache data.csv
Or in C++:
options.cache = libvroom::CacheConfig::xdg_cache(); // Uses ~/.cache/libvroom/