Spark Persistence (Caching)

Cached RDDs should be modest in size so that they fit entirely in memory. Estimating that size up front is challenging, so a caching strategy that uses both memory and disk is generally preferred: blocks evicted from memory are written to disk, and reading them back from disk is usually much faster than re-evaluating the RDD from its lineage.

Internals: Caching is performed at the block level. Each RDD consists of multiple blocks, and each block is cached independently of the others, on the node that generated that particular block. Every Spark executor has an associated Block Manager that is used to cache RDD blocks. The memory available to the Block Manager for caching is controlled by the storage memory fraction, which specifies what portion of Spark's unified memory pool is reserved for storage. The Block Manager manages the storage, retrieval, and eviction of these cached blocks.
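As a minimal sketch of the strategy described above, the snippet below persists an RDD with the MEMORY_AND_DISK storage level so that blocks which do not fit in memory spill to disk rather than being recomputed. The app name, local master, and example data are placeholders; the spark.memory.fraction and spark.memory.storageFraction settings are shown only to illustrate where the storage memory fraction is configured, and the values are the defaults, not recommendations.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-example")                 // placeholder app name
      .master("local[*]")                         // placeholder: local run
      .config("spark.memory.fraction", "0.6")      // unified memory pool (default)
      .config("spark.memory.storageFraction", "0.5") // storage share of that pool (default)
      .getOrCreate()

    val sc = spark.sparkContext

    // Example RDD; the data here is arbitrary.
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // MEMORY_AND_DISK: blocks evicted from memory are spilled to disk
    // instead of being recomputed from the lineage.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materialises and caches the blocks ...
    println(rdd.count())
    // ... subsequent actions read the cached blocks from memory (or disk).
    println(rdd.sum())

    // Release the cached blocks once they are no longer needed.
    rdd.unpersist()

    spark.stop()
  }
}
```

Note that persist() only marks the RDD for caching; the blocks are actually stored by each executor's Block Manager when the first action runs.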