


You have multiple processes reading the same parts of the same file.Once the chunks are loaded they can be dropped from your program’s memory. G cluster_program Your program's memory drive Hard drive cache Buffer cache drive->cache read temporary_chunks Chunks cache->temporary_chunks read array Your array temporary_chunks->array decompress If you want to read in both X and Y axes you can store chunks in the following layout (in practice you’d probably want more than 9 chunks): Let’s say you’re storing a 30000x30000 array. Chunks can be compressed this is also true of HDF5.Chunk size and shape are up to you, so you can structure them to allow efficient reads across multiple axes.You can store the chunks on disk, on AWS S3, or really in storage system that provides key/value lookup.Zarr addresses the limits of mmap() we discussed earlier: Notice that until you actually slice the object, you don’t get a numpy.ndarray: the is just some metadata, you only load from disk the subset of data you slice. Here’s how you might load an array with Zarr: Zarr lets you store chunks of data and load them into memory as arrays, and write back to these chunks as arrays. My feeling is that it’s a better choice in most situations unless you need HDF5’s multi-language support, for example Zarr has much better threading support.įor the rest of the article I’ll just talk about Zarr, but if you’re interested there’s a in-depth talk by Joe Jevnik on the differences between Zarr and HDF5.


Zarr is much more modern and flexible, but has less extensive supports outside Python Z5 is an implementation for C++.HDF5, usable in Python via pytables or h5py, is an older and more restrictive format, but has the benefit that you can use it from multiple programming languages.To overcome these limits, you can use Zarr or HDF5, which are fairly similar: You may need 1000× more reads to get the same amount of relevant data! Zarr and HDF5 If you read in the dimension that matches the layout on disk, each disk read will read 1024 integers.īut if you read in the dimension that doesn’t match the layout on disk, each disk read will give you only 1 relevant integer. To expand on the last item: imagine you have a large 2D array of 32-bit (4 byte) integers, and reads from disk are 4096 bytes. The rest will require large numbers of reads from disk. If you have an N-dimensional array you want to slice along different axes, only the slice that lines up with the default structure will be fast.Just because disk reads and writes are transparent doesn’t make them any faster. remember, disks are much slower than RAM. If you’re loading enough data, reading or writing from disk can become a bottleneck.You can’t load data from a blob store like AWS S3. While mmap() can work quite well in some circumstances, it also has limitations: Run that code, and you’ll have an array that will transparently either return memory from the buffer cache or read from disk. memmap ( "mydata/myarray.arr", mode = "r", dtype = np. The operating system keeps this buffer cache around in case you read the same data from the same file again. When you read a file from disk for the first time the operating system doesn’t just copy the data into your process.įirst, it copies it into the operating system’s memory, storing a copy in the “buffer cache”. What happens when you read or write to disk? In particular, I’ll be focusing on data formats optimized for running your calculations, not necessarily for sharing with others. Zarr and HDF5, a pair of similar storage formats that let you load and store compressed chunks of an array on demand.Įach approach has different strengths and weaknesses, so in this article I’ll explain how each storage system works and when you might want to use each.mmap(), which lets you treat a file on disk transparently as if it were all in memory.If your NumPy array is too big to fit in memory all at once, you can process it in chunks: either transparently, or explicitly loading only one chunk at a time from disk.Įither way, you need to store the array on disk somehow.įor this particular situation, there are two common approaches you can take:
