Checklist/Questionnaire for performance improvements in Spark Application

Performance tuning is broadly categorized into the following areas:
  1. Cluster/Hardware provisioning
  2. Environment
  3. Memory management
  4. Code level
@ Cluster level:

1. Number of resources allocated for the Spark application
     i. Number of cores - Are the cores sufficient to process the application? Would increasing the number of cores help?
    ii. RAM - Is it sufficient? Would increasing the RAM size help?
2. Priority of the job submitted - Are we getting a fair chance to run the job in the shared cluster?
3. Speculative execution - Is it enabled? It helps when there are straggler tasks.
4. Network speed - Is the network fast enough, or is it the bottleneck?
5. Data locality - Is the data within the cluster, or is it being fetched from another cluster?
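Several of the cluster-level knobs above can be set at submission time. A minimal sketch, assuming a YARN cluster; the resource numbers are placeholders to be sized for the workload, not recommendations:

```shell
# Hypothetical spark-submit invocation: --num-executors, --executor-cores
# and --executor-memory size the executor JVMs, while
# spark.speculation=true re-launches straggler tasks speculatively.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.speculation=true \
  my-app.jar
```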

@ Environment level:
  1. Level of parallelism/Number of partitions - Is the block size (and hence the default partition count) appropriate for the application?
  2. Serialization format - Are we depending on the default Java serialization, or is a better serializer (e.g. Kryo) configured?
  3. Network I/O - How much data is moving over the network?
  4. Compression - Are we using compression techniques? If not, would they help? If so, is the codec optimal?
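These environment-level settings typically go into spark-defaults.conf (or are passed via --conf). A sketch, assuming Kryo and the lz4 codec are acceptable for the workload:

```properties
# Use Kryo instead of default Java serialization (usually smaller and faster).
spark.serializer            org.apache.spark.serializer.KryoSerializer
# Compress shuffle output and serialized RDD partitions.
spark.shuffle.compress      true
spark.rdd.compress          true
# lz4 is the default codec in recent Spark versions; snappy is another common choice.
spark.io.compression.codec  lz4
```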
@ Memory level:
  1. RDD storage/caching - Is the memory fraction reserved for storage limiting how much can be cached?
  2. Shuffle and aggregation buffers - When performing shuffle operations, Spark creates intermediate buffers for storing shuffle output. These buffers also hold intermediate results of aggregation, in addition to buffering data that will be directly output as part of the shuffle. Is the application using excessive buffer memory for shuffling?
  3. Garbage collection - What is the frequency of GC, and is it taking too long to reclaim memory? If GC runs several times before a single task completes, that indicates the task does not have enough memory to execute.
Note: By default (in Spark's legacy memory management, prior to 1.6), Spark sets the memory fractions to 60% for RDD storage, 20% for shuffle memory, and the remaining 20% for the user program. Optimizing these based on usage helps; for example, if the user code needs little memory, we can reallocate part of the user space to RDD storage or shuffling.
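Under the legacy memory manager these fractions map to configuration properties; a sketch (the values shown are the defaults, and Spark 1.6+ replaces this scheme with unified memory management via spark.memory.fraction):

```properties
# Legacy memory fractions (pre-Spark 1.6); values shown are the defaults.
spark.storage.memoryFraction   0.6
spark.shuffle.memoryFraction   0.2
# The remaining ~0.2 is left for user code.
```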

@ Code level:
  1. Is caching used? - For each new action, the entire RDD lineage must be recomputed from scratch; to avoid this inefficiency, we can persist intermediate results.
  2. Disk spills - Are there disk spills because of insufficient memory? Can we fine-tune to avoid them?
  3. Shuffling of the data - How many stages are executed for the application? Can we reduce them?
  4. Number of partitions - Is the application using the optimal number of partitions?
    1. Too few partitions - inefficient usage of the resources in the cluster
    2. Too many partitions - excessive overhead in managing many small tasks, i.e. more time is spent on scheduling than on computation.
    3. Optimal partitions - efficient usage of resources without the overhead of managing many small tasks.
  5. Serialization - Is serialization the bottleneck? If so, can we use a better serializer? Serialization can reduce memory usage and improve network performance.
  6. Broadcasting of the resources - Are we using broadcast variables, and would they help the application? For example, if a task uses a large object from the driver program, turn it into a broadcast variable; as a rule of thumb, tasks larger than about 20 KB are worth optimizing.
  7. Cache size - What is the cache size? If we increase it, will it help?
  8. Operation optimizations:
    1. Reducing the shuffle in joins - for example, are we considering the table sizes (a small table can be broadcast instead of shuffled)?
    2. Can we reduce the data to be processed by applying filters in the initial stages, so that less data flows through later stages, ultimately improving performance?
    3. Are we using the proper order of operations? Can we improve it?
    4. Data structures/collections used in the code - are we using the proper collections?
      1. e.g. instead of using strings for keys, use numeric IDs
      2. If the RAM is less than 32 GB, set the JVM flag -XX:+UseCompressedOops to make pointers 4 bytes instead of 8.
  9. Usage of proper APIs - Are we using the proper APIs in the code? Would using DataFrame, Dataset, or Spark SQL improve performance compared to raw RDDs?
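Several of the code-level points above (persisting a reused RDD, filtering early, broadcasting a large driver-side object, and setting an explicit partition count) can be sketched together. A hypothetical example; the input path, output path, and lookup map are placeholders for illustration, and the partition count 64 is not a recommendation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast a large driver-side lookup table once,
    // instead of shipping it with every task.
    val countryNames = Map(1 -> "IN", 2 -> "US") // placeholder lookup data
    val lookup = sc.broadcast(countryNames)

    val lines = sc.textFile("hdfs:///data/events.txt") // hypothetical path

    // Filter early so downstream stages process less data.
    val parsed = lines
      .filter(_.nonEmpty)
      .map(_.split(","))
      .filter(_.length >= 2)

    // Persist an intermediate RDD that is reused by more than one action,
    // so it is not recomputed from the source for each action.
    val byCountry = parsed
      .map(fields => (fields(0).toInt, 1))
      .reduceByKey(_ + _, numPartitions = 64) // explicit partition count
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Two actions over the same persisted RDD.
    println(byCountry.count())
    byCountry
      .map { case (id, n) => (lookup.value.getOrElse(id, "??"), n) }
      .saveAsTextFile("hdfs:///out/counts") // hypothetical output path

    spark.stop()
  }
}
```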

Streaming Performance fine tuning:

Batch and window sizes
Level of parallelism
- Increasing the number of receivers
- Explicitly repartitioning the received data
- Increasing parallelism in aggregation
Garbage collection and memory usage
- Concurrent mark-sweep (CMS) GC - enable it via extraJavaOptions in spark-submit

The CMS collector consumes more resources overall but introduces fewer GC pauses.
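The CMS setting above can be passed at submission time; a sketch showing only the relevant flag, with the application details as placeholders:

```shell
# Enable the concurrent mark-sweep collector on the executors.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  my-streaming-app.jar
```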

Performance tuning specific to Spark SQL



In summary, performance tuning helps in:
1. Ensuring that all resources are used effectively
2. Improving the response time of the system
3. Enabling concurrent execution of jobs
4. Reducing long-running jobs

Configuration properties that tune the number of partitions at runtime are:

  1. spark.default.parallelism - cluster-dependent default (for distributed shuffle operations it is typically the total number of cores across all executors)
  2. spark.sql.shuffle.partitions - 200 (default)
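Both properties can be set per application at session creation; a sketch in Scala, where the value 64 is a placeholder to be tuned against data volume and core count:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("partition-tuning")
  // Number of partitions produced by shuffles in DataFrame/Spark SQL operations.
  .config("spark.sql.shuffle.partitions", "64")
  // Default partition count for RDD shuffle operations.
  .config("spark.default.parallelism", "64")
  .getOrCreate()
```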
