Spark Persistence (Caching)


    1. Cached RDDs should be modest in size so that they fit entirely in memory; estimating that size ahead of time, however, is challenging.
    2. A caching strategy that keeps blocks in both memory and disk is generally preferred: cached blocks that are evicted from memory are written to disk, and reading them back from disk is considerably faster than re-evaluating the RDD.
    3. Internals:
      1. Internally, caching is performed at the block level: each RDD consists of multiple blocks, and each block is cached independently of the others.
      2. Caching is performed on the node that generated the particular RDD block.
      3. Each executor in Spark has an associated Block Manager that is used to cache RDD blocks.
      4. The memory available to the Block Manager for caching is given by the storage memory fraction (spark.memory.storageFraction), the share of Spark's unified memory pool that is set aside for storage.
      5. The Block Manager manages the cached partitions as well as intermediate shuffle data (the sketch after this list shows how to inspect its cache state from the driver).
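
    A minimal Scala sketch of inspecting this per-block state, assuming an existing SparkContext named sc (getRDDStorageInfo is a developer API):

        import org.apache.spark.storage.StorageLevel

        // Each of the 8 partitions becomes one block, cached by the Block Manager
        // of whichever executor computed it.
        val rdd = sc.parallelize(1 to 1000000, 8).persist(StorageLevel.MEMORY_AND_DISK)
        rdd.count()  // an action materializes, and therefore caches, the blocks

        // Report how many partitions are cached and how much memory/disk they use.
        sc.getRDDStorageInfo.foreach { info =>
          println(s"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
            s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
        }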
    4. persist() and cache() are the APIs used to cache an RDD.
    5. The default storage level of cache() is memory only; in other words, cache() is an alias for persist(StorageLevel.MEMORY_ONLY).
    6. persist() accepts the storage level as an argument.
    7. Replication is requested by appending _2 to the storage level name. Replication helps when a node goes down; in other words, it provides fault tolerance.
    8. Serialization increases processing cost but reduces the memory footprint of large datasets.
    9. Storing data in serialized form also reduces GC pressure, since fewer Java objects are created. The sketch below illustrates points 4 through 9.
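
    A minimal Scala sketch of points 4 through 9, assuming an existing SparkContext named sc (the input path is illustrative):

        import org.apache.spark.storage.StorageLevel

        // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
        // deserialized objects, in memory only.
        val words = sc.textFile("hdfs:///data/words.txt").cache()

        // persist() takes an explicit storage level; the _SER variant keeps the
        // blocks serialized, trading CPU for a smaller footprint and less GC.
        val upper = words.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY_SER)

        // Appending _2 replicates each cached block to a second node,
        // which provides fault tolerance if a node goes down.
        val replicated = words.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK_2)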
    10. During its life cycle, an RDD's partitions may exist in memory and/or on disk across the cluster, depending on the memory available.
    11. Caching strategies / storage levels / persistence levels:
      1. RDD blocks can be stored in multiple stores: MEMORY, DISK, and OFF_HEAP, in serialized and deserialized formats.
      2. Once a storage level has been assigned to an RDD, it cannot be changed (see the sketch after this list).
      3. MEMORY_ONLY: data is cached in memory only, in deserialized format.
      4. MEMORY_AND_DISK: data is cached in memory; when memory runs short, evicted blocks are serialized to disk. This strategy helps when re-evaluation is expensive and memory is scarce.
      5. DISK_ONLY: data is cached on disk, in serialized format.
      6. OFF_HEAP: blocks are cached off-heap (e.g. in Alluxio [2]), in serialized format.
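
    A small sketch of point 2 above (the level is fixed once assigned), assuming an existing SparkContext named sc:

        import org.apache.spark.storage.StorageLevel

        val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
        println(rdd.getStorageLevel)  // prints the level assigned above

        // Re-persisting an already-persisted RDD with a *different* level throws
        // java.lang.UnsupportedOperationException:
        // rdd.persist(StorageLevel.DISK_ONLY)

        // To change the level, unpersist first, then persist again.
        rdd.unpersist()
        rdd.persist(StorageLevel.DISK_ONLY)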
    12. Use cases:
      1. Reuse in an iterative loop (e.g. ML algorithms)
      2. Reuse within a single application, notebook, or job
      3. When regeneration is a very expensive operation; caching then also speeds up recovery from node failure
      4. In general, cache only those RDDs that are expensive to re-evaluate (see the sketch after this list)
    13. The default eviction strategy is LRU (least recently used).
      1. LRU eviction happens independently on each worker and depends on the memory available there.
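
    A sketch of the iterative-reuse case, assuming an existing SparkContext named sc (the input path is illustrative): the parsed dataset is cached once and then scanned repeatedly.

        import org.apache.spark.storage.StorageLevel

        // Read and parse once, then cache; without the persist, every job below
        // would re-read and re-parse the file from scratch.
        val points = sc.textFile("hdfs:///data/points.csv")
          .map(_.split(",").map(_.toDouble))
          .persist(StorageLevel.MEMORY_AND_DISK)

        for (threshold <- Seq(0.1, 0.5, 0.9)) {
          // Each count() is a separate job, served from the cached blocks.
          val n = points.filter(p => p(0) > threshold).count()
          println(s"points above $threshold: $n")
        }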

    14. Storage memory is freed with unpersist().
    15. Recommended level/best practice: MEMORY_AND_DISK, which spills RDD partitions to the worker's local disk when they are evicted from memory. Rebuilding such a partition then only requires reading it back from the worker's local disk, which is relatively fast (an end-to-end sketch follows the table below).
    16. Where to see it in the UI:
      1. The Storage tab shows where each partition resides across the cluster (i.e. memory or disk) at any given point in time.
    17. Persist vs Cache

                          cache()                                persist()
        Storage level     MEMORY_ONLY                            Can be passed as an argument
        When to use       When the default level is sufficient   When a specific storage level is needed
        Freeing memory    unpersist()                            unpersist()
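
    Putting items 14 through 17 together, a minimal end-to-end sketch, assuming an existing SparkContext named sc (the log path is illustrative):

        import org.apache.spark.storage.StorageLevel

        val errors = sc.textFile("hdfs:///logs/app.log")
          .filter(_.contains("ERROR"))
          .persist(StorageLevel.MEMORY_AND_DISK)  // the recommended level

        val total  = errors.count()  // the first action materializes the cache
        val sample = errors.take(5)  // subsequent actions are served from it

        // Free the storage memory once the RDD is no longer needed;
        // blocking = true waits until all blocks have been removed.
        errors.unpersist(blocking = true)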






    References:

